Second only perhaps to e-mail, search engines are probably the most useful thing on the Internet-where would you go without them? How many URLs do you know about that you just happened to stumble upon, following links? Most of your bookmarks are probably the result of Googling, or of having used some other search engine.
But is Google God? Not yet, at least. And no other search engine is, either.
How many times have you wished that Google's 'I'm feeling lucky' button would take you to the exact page you were looking for? And how many times has that actually happened? Even if you think it did, there was probably a page out there that was better than the one Google showed you when you pressed 'I'm feeling lucky'.
The idea is that a search engine needs to fetch the few specific pages from the billions available that interest you. It's a daunting task.
The problem gets broken into three parts. The first is the keywords you enter: how accurate are they? The second is in the way the search engines map the Net. And finally, there's the problem of people accurately tagging their Web pages so the search engine knows what a particular page is about.
We now have desktop search, video and image search, local search, and more, in addition to just the directory-based search of a few years ago. So where is search today, and where do we wish it to take us? Will that 'I'm feeling lucky' button become the magic button you've always wanted?
In what follows, we focus on the problem of searching for Web pages. At the outset, we must define what kind of pages we're looking for when we conduct a search. The most important criterion, of course, is relevance. The page that comes up should contain something very similar to (if not exactly) what you are looking for. But relevance is not the only thing. If you're doing a search on crocodiles, it's very likely that you're also interested in alligators; therefore, ideally, a search for 'crocodile' should show up results that have only the word 'alligator' in them, in addition to those that contain 'crocodile'.
There's more. How authoritative is a page? You sure can't trust everything you see on the Internet. The page you're going to refer to should be, in some way, 'better' than other similar pages.
A related criterion is popularity. You'd rather get your information from a popular source-assuming it is authoritative as well-than from an obscure source. Actually, these two are related, so we could just say we want pages that are authoritative and popular. Popular pages are more likely to be authoritative than unpopular ones for obvious reasons.
In sum, then, what is desired of the ideal search engine is that it displays not millions of pages that may or may not interest you, but a few authoritative, popular pages that address your need-in addition to other pages that may interest you, which you didn't think of when you typed in your search.
This is possible if three things hold: first, your search query is good; second, the search engine determines your intent well; third, the pages out there properly and honestly indicate to the search engine what they are about.
Now that's a lot of variables for a single search, and only if they come together every time you search will the results be of any use.
Good Search Queries
Technology is supposed to make life simpler. That includes making it unnecessary for us to rack our brains when we're conducting a search. There should be no need to be an 'expert Googler', if there were such a term. However, most of today's search engines actually do require that-that you frame your query well. If you're searching for a local place that serves up pizza, and you'd like to order online, you should be able to just type in 'pizza' and get your restaurant. Now that would entail some level of localisation capabilities built into the search engine, and also the fact that the engine should be able to determine what your intent is-namely, that you want to buy pizza.
Similarly, if you type in 'When was Britney born?', the search engine should be able to take that as a question and give you the answer. That also entails that the engine should be able to determine that you mean Britney Spears, not any other Britney out there. And as we mentioned earlier, returning popular results is important, and here, a popular page is much more likely to be about Britney Spears than about any other Britney.
At the time of going to print, a Google search on 'When was Britney born?' provides the answer, just as though what you typed in were a question. (Note that this happens when you use Google.com, not Google.co.in.) None of the other major search engines does, probably being thrown off by news of Britney Spears becoming a born-again Christian, and by the song 'Born to Make You Happy'.
So is Google indeed God? It's premature to draw such a conclusion, but the fact is that Google does, in some measure, seem to understand what you're looking for better than the other search engines do. In an ideal world, every search engine should be able to provide the answer to that question, just as if it were a question-and without the need for a specialised sub-engine that understands queries for dates!
How? The answer could just lie in AI (Artificial Intelligence), and the semantic mapping of the Web.
The Semantic Web?
If much of the problem with finding the right page lies in the information that the pages give out to search engine crawlers, the answer might just lie in the vision of Tim Berners-Lee called the Semantic Web, in the context of the WWW.
From Semanticweb.org, "the Semantic Web is a vision: the idea of having data on the web defined and linked in a way that it can be used by machines." It is, essentially, a project that aims to create a more intelligent Web by annotating pages on the Web with their semantics (meaning), in a manner understandable by computers (or search engine spiders). Thus, if a spider knew what a page was about, it would return more relevant results-or so the idea goes.
Let's take the Britney example again: there is no way for a spider or any other automated agent to know that Britney Spears is a person; that something like "2nd December 1981" is a date; or even that any given person has a special date called a birth date.
Using XML and other technologies, this information can be made explicit in the page that contains these elements. And what a great help that would be! If your search engine could understand things such as birthdates and people, and if pages could declare themselves as being descriptive of a particular person and a particular date, it would be much easier to find an answer to "When was Britney born".
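To make the idea concrete, here is a minimal sketch of what such an explicit declaration could look like, and how a program might read it. The XML element and attribute names (`person`, `name`, `birthDate`) are invented for illustration; they are not taken from any actual Semantic Web vocabulary.

```python
import xml.etree.ElementTree as ET

# Hypothetical annotation a page might carry. The element names are
# assumptions made for this sketch, not a real vocabulary.
PAGE_ANNOTATION = """
<page>
  <person name="Britney Spears">
    <birthDate>1981-12-02</birthDate>
  </person>
</page>
"""

def birth_date_of(xml_text, person_name):
    """Return the birth date declared for person_name, or None."""
    root = ET.fromstring(xml_text)
    for person in root.iter("person"):
        if person.get("name") == person_name:
            return person.findtext("birthDate")
    return None

print(birth_date_of(PAGE_ANNOTATION, "Britney Spears"))
```

With the person and the date declared like this, the spider no longer has to guess that a string of digits is a date or that it belongs to a particular person.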
As of now, the Semantic Web is only a vision, with its proponents and detractors. There are various technical reasons for this, which we cannot go into within the scope of this article. We'll just have to wait and watch to see if it takes off.
Emergent semantics is described as "a self-organising alternative to the Semantic Web that does not require any recoding of the data currently available online. Based on successful experiments with communities of robots, emergent-semantic technology is built on the principles of human learning." This is being worked on by Sony Computer Science Laboratory. Could this 'emergent semantics' be a viable alternative to the Semantic Web?
In November 2004, an article by Junko Yoshida and R Colin Johnson described emergent semantics as extracting the meaning of Web documents from the manner in which people use them. The scheme would harness the human communication and social interaction among peer-to-peer file sharers, database searchers and content creators to append the semantic dimension to the Web automatically, instead of depending on the owner of each piece of data to tag it.
Sony argues that this latter method-of the owner of each Web page, for example, tagging the page with its meaning-is similar to attempting AI by writing 'if-then' statements about everything in the world.
So how would emergent semantics help automatically tag documents? A basic explanation is that the meaning of a document is taken from the browsing paths of all the people that browse that document. A browsing path, of course, is the path that you take while following document links-from Classical.com to, say, Beethoven.com, to Mp3.com. Since you visited Classical.com before and Mp3.com after visiting Beethoven.com, there is an indication that Beethoven.com has something to do with 'classical' and with MP3s. Now take the behaviour of all the users who've visited Beethoven.com and there you have it-the page is about Beethoven, Beethoven is a person, the page has classical music on it, and so on!
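The browsing-path idea can be sketched in a few lines. This is only our own toy reading of the description above, not Sony's method: it counts which sites people visit immediately before and after a target site, and treats the most frequent neighbours as hints about the target's subject. All paths except the first are invented.

```python
from collections import Counter

def infer_tags(paths, target):
    """Count the sites visited immediately before or after target."""
    neighbours = Counter()
    for path in paths:
        for i, site in enumerate(path):
            if site != target:
                continue
            if i > 0:
                neighbours[path[i - 1]] += 1
            if i + 1 < len(path):
                neighbours[path[i + 1]] += 1
    return neighbours

# Invented browsing paths; only the first mirrors the example in the text.
paths = [
    ["Classical.com", "Beethoven.com", "Mp3.com"],
    ["Classical.com", "Beethoven.com"],
    ["News.example", "Beethoven.com", "Mp3.com"],
]
tags = infer_tags(paths, "Beethoven.com")
print(tags.most_common())
```

The more users whose paths are pooled, the more the accidental neighbours fade and the consistent ones stand out.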
Search Engine Optimisation
Search engine optimisation (SEO) is the process of modifying a Web site so that it shows up high on a search engine's results page. Basically, an SEO'd site is more spider-friendly than a non-SEO'd one.
What happens when a spider visits a Web page? It looks at the title tag, the meta tags, the 'alt' attribute of images on the site, and so on. All these need to be filled in, and filled in well. 'Well' here means that they should be accurate reflections of what the site is about, and that they should be dense enough for the spider to get sufficient information from them.
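As a rough sketch of what a spider 'sees', the snippet below pulls out exactly those fields using Python's standard `html.parser` module. The class name `SpiderView` and the sample page are our own inventions, and real crawlers look at far more than this.

```python
from html.parser import HTMLParser

class SpiderView(HTMLParser):
    """Collect the title, named meta tags and image alt text of a page."""

    def __init__(self):
        super().__init__()
        self.title = ""
        self.meta = {}
        self.alts = []
        self._in_title = False

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "title":
            self._in_title = True
        elif tag == "meta" and attrs.get("name"):
            self.meta[attrs["name"].lower()] = attrs.get("content") or ""
        elif tag == "img" and attrs.get("alt"):
            self.alts.append(attrs["alt"])

    def handle_endtag(self, tag):
        if tag == "title":
            self._in_title = False

    def handle_data(self, data):
        if self._in_title:
            self.title += data

# Invented sample page.
SAMPLE = """<html><head><title>Crocodile Facts</title>
<meta name="description" content="Habitat and diet of crocodiles">
</head><body><img src="croc.jpg" alt="Nile crocodile basking"></body></html>"""

view = SpiderView()
view.feed(SAMPLE)
print(view.title, view.meta, view.alts)
```

A page that leaves these fields empty gives this kind of spider nothing to go on, which is why filling them in accurately matters.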
There are, of course, several unscrupulous ways of boosting search engine rankings. For example, a site may put in a lot of hidden text that will drive users to the page-a pornographic site may include hidden text about, say, "Windows", so that the page shows up along with 'Windows' results when you're looking for info on Windows. Similarly, one may repeat keywords hundreds of times on a page, so that the site ranks high for that keyword.
Take cloaking, for example. A popular definition says cloaking is "the process by which your site can display different pages under different circumstances. It is primarily used to show an optimised page to the search engines and a different page to humans." Most search engines penalise a site if and when they discover it is using cloaking.
SEO is big business, especially because of the exponential growth of the number of pages on the Web.
Latent Semantic Indexing
Talking about 'semanticising' the Web, another important technique is Latent Semantic Indexing (LSI). With a regular keyword search, a document, looked at from the search engine's point of view, either contains a keyword or doesn't. There's no middle ground. And each document stands alone-there's no interdependence between documents. The Web is not viewed as the collection of documents that it is: it is viewed as a lot of individual documents taken separately.
In LSI, the regular recording of which words a document contains is done first. The important addition is that LSI then examines the document collection as a whole to see which other documents contain some of the same words. What does this do? Essentially, if two documents have a lot of words in common, they are 'semantically close'. ('Semantics' means 'meaning'.) And if two documents don't have many words in common, they are semantically distant.
Now when you perform a search on a database that has been indexed by the LSI method, the search engine looks for documents that semantically match the keywords. For example, in the semantic system we're talking about, 'crocodile' and 'alligator' are pretty close, so a search on 'crocodile' would also bring up pages that contain only 'alligator' with no mention of 'crocodile'.
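The 'words in common' intuition can be tried out with a toy example. Note that this is a deliberate simplification and not LSI proper: real LSI applies a matrix factorisation (singular value decomposition) to the term-document matrix, whereas the sketch below merely measures word overlap with cosine similarity. The three documents and the 0.5 threshold are assumptions chosen for illustration.

```python
import math
from collections import Counter

# Invented miniature document collection.
DOCS = {
    "d1": "crocodile reptile river teeth swamp",
    "d2": "alligator reptile river teeth swamp",
    "d3": "pizza cheese oven delivery",
}

def vector(text):
    """Bag-of-words vector: word -> count."""
    return Counter(text.split())

def cosine(a, b):
    """Cosine similarity between two bag-of-words vectors."""
    dot = sum(a[w] * b[w] for w in a)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

def search(keyword, threshold=0.5):
    """Return exact matches, plus documents close to any exact match."""
    vecs = {name: vector(text) for name, text in DOCS.items()}
    exact = [n for n, v in vecs.items() if keyword in v]
    related = [
        n for n, v in vecs.items()
        if n not in exact
        and any(cosine(v, vecs[e]) >= threshold for e in exact)
    ]
    return exact, related

print(search("crocodile"))
```

Here the alligator page never mentions 'crocodile', yet it is returned as related because it shares most of its other words with the crocodile page; the pizza page shares none and is left out.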
If search engines were to use LSI, they would be more powerful. Think about the fact that a search engine looks for pages that contain all your keywords. If you were to enter twenty keywords into a regular search engine, you'd get very few results-but LSI shows the reverse behaviour. If you enter more search terms into a search engine that has LSI'd the Web, it's likely to find more, not fewer, documents of relevance-for the simple reason that it would bring up closely related documents for each keyword.
You could then filter the results according to relevance, which would provide feedback to the engine about what you think best matches your query. This, combined with personalisation (see the 'Personalisation' section in this article for more), would lead to your results getting much better over time.
LSI could also help archivists-those who categorise documents into classes based on their contents. If there were an LSI system in place, a document would already have attributes assigned to it, in the sense of "this document is about such-and-such a topic." The archivist would only have to add to or subtract from the list of attributes, instead of having to make up a list from scratch.
In the real world, it was reported on 5th February 2005 on Seobook.com, "Many people have been noticing a wide shuffle in search relevancy scores recently. Some of those well in the know attribute this to latent semantic indexing, which Google has been using for a while, but recently increased its weighting."
AI And NLP
Coming back to how AI could help, the answer is that AI could help determine one's intent when one feeds in keywords, and it could help in understanding the contents of a page as well. When both these happen, what you have is a smarter search engine. This could happen through implementations of NLP (Natural Language Processing).
We've talked about NLP earlier in Digit (June 2003): it's basically a method by which a computer processes something said in a 'natural language' such as English, as opposed to a computer language, and comes up with something intelligent. That is, when you apply NLP to something like "the world is round", the machine would have an internal representation of that fact. That sentence would not remain four words that the machine doesn't understand, but would mean something to it. It would mean that something called 'the world' has a property, and that that property is called 'round'. The system could also, if the knowledge has been fed in, know what round means; it might be able to deduce that the world is therefore in some way similar to a ball, and so on.
In the future, then, if search engines come up with NLP implementations, we'd have a situation in which they would understand what you're saying. When you ask "When was Britney born", they would know that you're talking about a person called Britney; that you need a date as an answer; and that date is connected with that person called Britney; and finally, that the date in question is a birth date. Note that Google seems to be doing something like this already!
On the other side, what about the pages that have the answer to the question? Those pages, when they're crawled or 'spidered' have to have NLP applied to them too, at least in a rudimentary way. If a Web page says "Britney Spears' birthday is the 1st of January", the NLP-enabled spider would be able to deduce, as in the example above, that there is a person called Britney Spears, that that person has a property called a birth date, and that the value of that property is "1st January". The search engine would then serve up the page.
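The two sides described above, understanding the question and understanding the page, can be mimicked crudely with pattern matching. The sketch below is nowhere near real NLP: it recognises exactly one question form and one statement form, both hard-coded by us as assumptions, and the page sentence is simply the example used in the text.

```python
import re

# Hard-coded patterns for one question form and one statement form.
QUESTION = re.compile(r"^when was (?P<person>.+?) born\??$", re.IGNORECASE)
STATEMENT = re.compile(
    r"^(?P<person>.+?)'s? birthday is (?P<date>.+?)\.?$", re.IGNORECASE
)

def parse_question(text):
    """Turn 'When was X born?' into entity, property and answer type."""
    m = QUESTION.match(text.strip())
    if not m:
        return None
    return {"entity": m.group("person"), "property": "birth date",
            "answer_type": "date"}

def parse_statement(text):
    """Turn \"X's birthday is D\" into entity, property and value."""
    m = STATEMENT.match(text.strip())
    if not m:
        return None
    return {"entity": m.group("person"), "property": "birth date",
            "value": m.group("date")}

def answer(question, page_sentences):
    """Return the first stated birth date whose entity fits the question."""
    q = parse_question(question)
    if q is None:
        return None
    for sentence in page_sentences:
        fact = parse_statement(sentence)
        # Prefix match, since the query names only "Britney".
        if fact and fact["entity"].lower().startswith(q["entity"].lower()):
            return fact["value"]
    return None

page = ["Britney Spears' birthday is the 1st of January"]
print(answer("When was Britney born?", page))
```

The hard part, of course, is doing this for every way a question or a fact can be phrased, which is why hand-written patterns like these don't scale.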
Of course, it's not as simple as it sounds. Conceptually, it is-but NLP is one of the hardest AI problems. However, Tom Mitchell, former president of the AAAI (American Association for Artificial Intelligence), said in November 2003 that in three to five years, we could have something like this.
Mitchell said that we can already develop computer software that can examine a Web page, and find names of people, dates and locations.
"It can't read text and understand it in the level of detail people can, but already, it can read text and can say, 'Oh, this is the name of a person' with about 95 per cent accuracy and, 'Oh, this is a location; this is a date'," he said.
Researchers have written programs that can find names and job titles of people mentioned on a Web site. For example, such programs can find "Jane Smith, Vice President of Marketing," or "Joe Jones, CEO," according to Mitchell. He even goes so far as to say that you can go to a search engine and type in, "Show me a list of universities that offer meteorology as a major and order them by student-to-faculty ratio"!
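For a flavour of what finding names and job titles involves, here is a naive sketch using a single regular expression. The pattern is our own assumption ("First Last, Title", with capitalised title words); the programs Mitchell describes rely on trained models rather than one hand-written rule, and this rule would miss most real-world phrasing.

```python
import re

# Assumed pattern: "First Last, Title", where the title is made of
# capitalised words (or all-caps abbreviations) optionally joined by "of".
PERSON_TITLE = re.compile(
    r"\b([A-Z][a-z]+ [A-Z][a-z]+), "
    r"([A-Z][A-Za-z]*(?: (?:of|[A-Z][A-Za-z]*))*)"
)

def find_people(text):
    """Return (name, title) pairs found in text."""
    return PERSON_TITLE.findall(text)

TEXT = "Contact Jane Smith, Vice President of Marketing, or Joe Jones, CEO, today."
for name, title in find_people(TEXT):
    print(name, "->", title)
```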
Is that an exaggeration? Coming from a former president of the AAAI, we're tempted to think not. However, AI claims are traditionally exaggerated, and there have been brilliant people in the past who predicted that a computer would become as intelligent as an average human being by 1980. AI claims, therefore, sometimes need to be taken with a pinch of salt.
An example of a current search engine that has AI claims to fame is Accoona. From the "Artificial Intelligence" link on Accoona: "Accoona Artificial Intelligence is a Search Technology that understands the meaning of search queries beyond the conventional method of matching keywords. This user-friendly technology, merging online and offline information, delivers more relevant results and enhances the user experience…
"Accoona's AI uses the meaning of words to get you better searches. For example, when you type five keywords in a traditional search engine, you're going to get every page that has all five keywords, no more, no less. With Accoona's AI Software, which understands the meaning of the query, the user will get many additional results. Accoona's AI also super-targets your search. For example, within a query of five keywords, Accoona AI allows the user to highlight one keyword, and will rank the search results starting by every page where the meaning of that one keyword is more important than the meaning of the other four keywords."
So does Accoona live up to the hype? We typed "When was Britney born?" into Accoona as well. The first things to appear were sponsored results about Born shoes, and the top results all brought up the song 'Born to Make You Happy'. No AI at all here. Accoona doesn't parse what we type into it, and it doesn't recognise our query as a question, let alone supply the answer.
"We can already develop computer software that can examine a Web page, and find names of people, dates and locations" (Tom Mitchell, Director, Center for Automated Learning and Discovery, School of Computer Science, Carnegie Mellon University)
What's throwing so many engines astray is the song 'Born to Make You Happy', which almost always shows up before links containing the birthdate. Another reason is that common words are ignored, so that means 'when' and 'was' are ignored-precluding the possibility of parsing the query. This illustrates what plagues search engines today: they search mostly on keywords, with little attention to what your query means or what a page is trying to say.
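The effect of dropping common words is easy to reproduce. In this sketch the stop-word list and the two page titles are made up by us; the point is only that once 'when' and 'was' are gone, the query is reduced to two bare keywords, and the song title matches both of them.

```python
# Assumed stop-word list; real engines use their own, longer lists.
STOP_WORDS = {"when", "was", "the", "a", "is", "to", "you"}

def keywords(query):
    """Lower-case the query, drop end punctuation and stop words."""
    words = query.lower().strip("?!. ").split()
    return [w for w in words if w not in STOP_WORDS]

def keyword_score(query, title):
    """Count how many query keywords occur in the title."""
    title_words = set(title.lower().split())
    return sum(1 for w in keywords(query) if w in title_words)

query = "When was Britney born?"
song = "Britney Spears Born to Make You Happy lyrics"
bio = "Britney Spears biography and birth date"

print(keywords(query))
print(keyword_score(query, song), keyword_score(query, bio))
```

The song page scores higher than the biography page, even though only the biography answers the question.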
This is changing; Google already uses LSI, and is giving it more and more importance (see section 'Latent Semantic Indexing' for more).
Reviews of Accoona are mixed, but the general consensus is that the database is not large enough yet for it to replace any of the biggies. And in our experience, the artificial intelligence doesn't show up. The time isn't ripe yet-and perhaps we'll see better AI implementations in the years to come.
Personalisation
Some argue that the future of search engines lies in personalisation. And that means giving up some of our personal data, such as browsing habits, to the search engines. People paranoid about sharing such information might not like the idea; they're the same people who wouldn't want to use GMail because GMail 'reads' your mails. But personalisation could just be the way to go.
Google recently launched the personalised version of its search, at http://labs.google.com/personalized. You create a profile of your interests, and Google returns your results based on those, and more importantly, Google remembers you over sessions if you log in with your Google account. It is thus able to offer you better and better results over time, because, simply speaking, each time you conduct a search and click on a link, you're giving it a better idea of who you are and what it is that you're likely to want displayed as search results. This is rudimentary AI at work. For example, if you've often conducted searches on home improvement and on glass, a personalised search engine is more likely to interpret your "windows" keyword as meaning glass windows, rather than the operating system.
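The 'windows' example can be sketched as follows. This is purely illustrative and says nothing about how Google's personalised search actually works: the two senses, their cue words and the search history are all assumptions of ours.

```python
from collections import Counter

# Hypothetical senses of an ambiguous keyword, each with cue words.
SENSES = {
    "windows": {
        "glass windows": {"home", "improvement", "glass", "glazing"},
        "operating system": {"software", "pc", "driver", "microsoft"},
    }
}

def interpret(keyword, history):
    """Pick the sense whose cue words appear most often in past searches."""
    seen = Counter(w for query in history for w in query.lower().split())
    senses = SENSES[keyword]
    return max(senses, key=lambda s: sum(seen[w] for w in senses[s]))

history = ["home improvement ideas", "tempered glass prices", "glass cutting"]
print(interpret("windows", history))
```

A user whose history is full of PC and driver searches would get the other interpretation from the very same keyword.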
Craig Silverstein, a data mining researcher and now a director at Google, has a quote about personalised search: "It's clear that a list of links, though very useful, doesn't match the way people give information to each other... How can the computer become more like your friend when answering your questions? That means giving direct answers to questions, extracting data from online sources rather than giving links to Web pages. It also means doing a better job of divining what the searcher is looking for, tailoring results more closely to what, based on past experience, appear to be the user's particular interests."
There are some in the know who argue that clustering, not personalisation, will be the future of search. A search engine called Clusty currently clusters search results admirably-a search on "Britney Spears" brought forth the following clusters: "Britney Spears pictures (53 sites), Nude (28 sites), Artists, Art (24 sites), Fan (17 sites), MP3 (10 sites)", and so on. And these clusters actually had relevant documents in them! The 'Pictures' cluster linked to pictures sites, the 'Fan' cluster led to fan sites, and so on.
Of course, you might ask, "Why not supply all the keywords-such as 'Britney Spears Fan Sites'-in the first place?" The answer is that the clusters bring out related things of interest that you might miss out with your keywords.
In other words, you might not know exactly what you're looking for. In this example, the 'Federline' cluster contained 10 documents-Kevin Federline is Britney's husband, which we didn't know until we'd conducted this search on Clusty. And naturally, someone interested in Britney is likely to be interested in the related topic of her husband as well.
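A very naive way to produce such clusters is to group results under the words they share beyond the query itself. We don't know how Clusty really does it; the function below and the result titles are invented simply to show the principle.

```python
from collections import defaultdict

def cluster(results, query):
    """Group result titles under each non-query word they contain."""
    query_words = set(query.lower().split())
    clusters = defaultdict(list)
    for title in results:
        for word in set(title.lower().split()) - query_words:
            clusters[word].append(title)
    # Keep only labels shared by at least two results.
    return {label: titles for label, titles in clusters.items()
            if len(titles) >= 2}

# Invented result titles.
RESULTS = [
    "Britney Spears fan club",
    "Britney Spears fan forum",
    "Britney Spears pictures gallery",
    "Britney Spears tour pictures",
]
groups = cluster(RESULTS, "Britney Spears")
for label in sorted(groups):
    print(label, groups[label])
```

Notice that the labels come from the results, not from the query, which is how a cluster such as 'Federline' can surface without your having asked for it.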
[Screenshot caption: Clusty's clusters-on the left-are often useful, but when we asked it when Britney was born, 'Born to Make You Happy' dominated the results]
Seo-scoop.com has this to say about clustering: "Clustering related search terms into groups has already been implemented by several smaller search engines such as Vivisimo's Clusty, but the other major search engines (including Google) will likely soon follow."
And here's a quote from Tony Philipp, executive VP of Vivisimo: "The challenge that you have right now is not information overload. It's information overlook."
You might be familiar with a popular alternative search engine out there called Teoma. Although 'Authority' isn't a word one would normally associate with search engines, Teoma's selling point is its Authority rankings. As they claim: "The Teoma difference is authority. A lot of the players in the ever-evolving search space talk about relevance. But what do they really do to achieve this Holy Grail? And what do they offer to prove their claims? The truth is, not much. Teoma has invented a whole new approach to search, and this allows us to achieve our mission of providing the best search results on the Web… Teoma adds a new dimension and level of authority to search results through its breakthrough approach, known as Subject-Specific Popularity.
[Screenshot caption: We asked Teoma about DDR2. The first non-sponsored result is from a well-known publication. The 'Refine' options seem intelligent, too]
"Instead of ranking results based upon the sites with the most links leading to them, Teoma analyses the Web as it is organically organised-in naturally-occurring communities that are about the same subject... To determine the authority-and thus the overall quality and relevance-of a site's content, Teoma uses Subject-Specific Popularity, which ranks a site based on the number of same-subject pages that reference it, not just general popularity."
So does it work? Are Teoma results really more relevant and useful than Google's? We threw "When was Britney born?" at it (again), and surprisingly, there was no link to the star's birth date in any of the top ten results. However, Teoma also has a 'Refine' feature that lets you zero in on what you want, and in the 'Refine' section, we found the following: 'Britney Spears Biography'; 'What Is Britney Spears' Brother's Name'; 'How Old Is Britney Spears'; 'Britney Spears Life'; and 'Britney Childhood'. Unfortunately, clicking 'How Old Is Britney Spears' and 'Britney Spears Life' didn't provide the answer either.
So what is Teoma good at? It doesn't seem to be too good at answering questions, as we've seen, but when you're looking for general information on a subject, Teoma brings up obscure results that are way down in the Google list-taking Google as the benchmark. For example, a Google search on 'DDR2' brought up product pages first; a Teoma search brought up an article from theinquirer.net, a popular site, as the first result. Close to the top was also a page from lostcircuits.com, a site we found was a good source for hardware news. Google or Teoma-it's your call!
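The difference between general popularity and the Subject-Specific Popularity that Teoma describes can be shown with a toy link graph. The sites, subjects and links below are all invented, and this is only our reading of Teoma's published description, not its actual algorithm.

```python
# Invented link graph: (source page, source subject, target page).
LINKS = [
    ("hw-review.example", "hardware", "ddr2-guide.example"),
    ("pc-forum.example", "hardware", "ddr2-guide.example"),
    ("hw-review.example", "hardware", "ddr2-shop.example"),
    ("recipes.example", "cooking", "ddr2-shop.example"),
    ("travel.example", "travel", "ddr2-shop.example"),
]

def general_popularity(page):
    """Count every link pointing at page."""
    return sum(1 for _, _, target in LINKS if target == page)

def subject_popularity(page, subject):
    """Count only links from pages about the given subject."""
    return sum(1 for _, src_subject, target in LINKS
               if target == page and src_subject == subject)

for page in ("ddr2-guide.example", "ddr2-shop.example"):
    print(page, general_popularity(page), subject_popularity(page, "hardware"))
```

The shop has more links overall, but the guide has more links from hardware pages, so a subject-specific ranking would put the guide first.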
SEO And The Ultimate Search Engine
Everything we've discussed so far-emergent semantics, LSI, AI, NLP, clustering, personalisation, authority ranking-is something that search engines have already experimented with, or will soon experiment with. Who does what first is what matters when it comes to competition in the search space. The ideal search engine is just something that brings as many of these techniques together as possible. It will feature personalisation; it will cluster your results…
The important thing is that search engine optimisation (SEO) (See box 'Search Engine Optimisation'), as it exists today, should become redundant. SEO is all the rage-just search for 'search engine optimisation' and you'll get an idea of how big the business really is. There are all those sites out there struggling to get noticed, to get to the top.
On the other hand, what do you want?
You want the sites that are most relevant, authoritative, and popular, along with related sites that may bring related topics to your attention. People's habits aren't going to change for a long while-we'll most likely continue with our limited keyword habits. So are your interests in conflict with those of the search engine optimisers?
It may seem so. Pages that are not search-engine friendly, but of more interest to you, will, as of today, appear below those that have been optimised. And that's not what you want. You-the information-seeker-want SEO to be redundant. That is also the ultimate goal of Google, or of any other search engine.
It's when SEO vanishes that we'll know that the ultimate search engine has arrived. Until then-happy Googling!