Image Ready

If only you could find that video where…well, multimedia search is only a question of when, not how

We shall not begin this with “a picture is worth a thousand words.” Sorry, we just did. Well, whatever we start with, the relevance of that adage here is threefold: first, we mostly search for text; second, 99 per cent of the actual data on the Web is audio-visual (AV); and third, we’re going to be searching for non-textual information more and more. Witness the exploding popularity of online video, for example.

The Big Picture
The problem is simple enough: search engines already do the text spidering, and all that remains in that domain is how to produce more relevant results. Even if we still mostly search for text, the fundamental problem there (or its solution) won’t change. Now when a search engine like Google or Yahoo! gives you image results, it’s based on metadata and tags: someone has to have spelt out in some way or the other that a certain image depicts Britney Spears (with her hair in place) and not her grandmother. What happens when such information isn’t available? What if you pose in front of the Taj Mahal holding a cigarette, and the image is tagged “Taj Mahal”: would you find it by searching for yourself, or for “cigarette”? What if Bill Gates meets up with our IT minister, and you want a picture of the minister? There is, quite literally, an infinity of such questions. The fact is, there’s tons of AV content out there, and only things like tags are used to search it, at least in the case of the big search engines.

Looking At The Words
On February 8, 2007, it was announced that the European Commission had granted funding to a research project led by a search technology firm called Fast Search and Transfer (FAST). Without going into the specifics, FAST will now be working on search solutions across all types of content, including multimedia.

According to a FAST white paper, standard search is built, roughly, on four pillars: “treating a load of information as discrete documents; using a distributed architecture for optimal performance; taking free text and metadata to derive each document’s ‘meaning’; and using a relevancy model to best match each query to the information in the index.

“Multimedia search faces basic challenges to this paradigm: ‘documents’ are not clearly defined; particularly large data volumes are involved; what each object talks about is not simple to extract; and relevancy models need to take more variables into account.”
Let’s take a look at each of these challenges.

Looking At The Picture
First, “documents” are not clearly defined. A single news programme can be divided into many parts, with each reportage clip as a separate part. Second, there’s a lot of AV information out there in comparison to text: by some estimates, only 1 per cent of information, measured as data, is text. Third and most important, knowledge extraction from text is relatively simple. Yes, it’s not simple at all, but relatively, it is. After all, a machine can be trained to classify stuff with words like “stocks” and “shares” as financial, but how difficult it is to classify an untagged video as one featuring Britney Spears! And how much more difficult it gets when she goes bald!

Finally, think about relevance: how easy it is to classify a document as relevant when it contains a lot of your keywords, and how easy (and wrong) it is to say that a picture of a bald Britney is relevant when you’re searching for wallpapers! (We’re assuming you don’t want bald women on your Desktop.)
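To make the text side of that contrast concrete, here is a minimal Python sketch of keyword-count relevance. Everything in it (the function names, the two sample documents, the scoring rule) is our own illustrative assumption and not something taken from the FAST white paper; real engines use far richer relevancy models.

```python
import re
from collections import Counter

# Hypothetical mini-index: two tiny "documents" invented for illustration.
documents = {
    "finance": "Stocks rallied today as shares of tech firms rose; stocks closed higher.",
    "travel": "The Taj Mahal draws millions of visitors every year.",
}

def tokenize(text):
    """Lower-case the text and split it into simple alphanumeric words."""
    return re.findall(r"[a-z0-9]+", text.lower())

def keyword_score(query, document):
    """Count how many times the query's distinct words occur in the document."""
    counts = Counter(tokenize(document))
    return sum(counts[term] for term in set(tokenize(query)))

def rank(query):
    """Return document names, highest keyword count first (assumed ranking rule)."""
    return sorted(
        documents,
        key=lambda name: keyword_score(query, documents[name]),
        reverse=True,
    )
```

Calling rank("stocks shares") puts the finance snippet first simply because the words are there to be counted. There is no equivalent of "count the keywords" for the pixels of an untagged photo, which is the whole difficulty.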

It is primarily these hurdles that stand in the way of your being able to ask your PVR software, “Show me the episode of Seinfeld where Kramer takes the beer cans to Michigan.” Solutions have been mulled over and are being worked out, and you’d do well, very well, to take a look at Demystifying Multimedia Search, at


“Visual Search Has Arrived!” “Try It Now!” screams Riya, a site that claims to look inside images instead of just at the tags and metadata. (We added the exclamation marks.) It’s just starting off, and you can see that the database is very small: search for “beautiful” under “People”, and you get some random pictures. Search for “ugly” under “People” again, and you get just one picture, of a skeleton hugging a girl. Mouse over the skeleton, and it says “ugly face.” How long until we get beautiful and ugly people when we search on those terms respectively?

Song By Tune
A vision for AV search different from that of FAST is a P2P model called SAPIR (Search In Audio Visual Content Using Peer-to-peer Information Retrieval). It is an IBM project. In SAPIR, end-users are visualised as peers that can produce AV content from mobile devices. This content will be indexed by super-peers across a scalable P2P network to enable content searches in real time. The idea is that if you supply a ringtone of a song (which is similar to the song itself), that ringtone will be added to the database, and a similarity search will allow other users to find the song based on its tune. To be more precise, when you, as a leaf node on the P2P network, supply a hint, those who provide similar hints will get the full song as the result of their search.
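What might "finding a song by its tune" look like in code? Here is a rough Python sketch, and we stress that it is our own toy and not SAPIR's method: we assume a tune is just a list of MIDI note numbers, we compare the steps between notes so that the key doesn't matter, and we borrow the standard library's difflib for the matching. The two-song "database" is hypothetical.

```python
from difflib import SequenceMatcher

# Hypothetical song database: opening notes as MIDI pitch numbers.
songs = {
    "Twinkle Twinkle": [60, 60, 67, 67, 69, 69, 67],
    "Ode to Joy": [64, 64, 65, 67, 67, 65, 64],
}

def intervals(notes):
    """Turn a note list into the steps between neighbouring notes,
    so the same tune in a different key gives the same sequence."""
    return [b - a for a, b in zip(notes, notes[1:])]

def tune_similarity(hint, song):
    """Similarity between 0.0 and 1.0, based on matching runs of steps."""
    return SequenceMatcher(None, intervals(hint), intervals(song)).ratio()

def find_song(hint):
    """Return the title whose tune is most similar to the hint."""
    return max(songs, key=lambda title: tune_similarity(hint, songs[title]))
```

Hum the first song two semitones higher, as [62, 62, 69, 69, 71, 71, 69], and find_song still returns it, because only the steps are compared. A real system would also have to cope with tempo, wrong notes, and raw audio rather than tidy note numbers.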

Riya’s sole search result for “ugly” under “people”. Yes, it is ugly, but there’s only one result!

As the SAPIR page puts it, “An ability to assess similarity lies close to the core of cognition. The sense of sameness is the very backbone of our thinking. An understanding of problem solving, categorisation, memory retrieval, inductive reasoning, and other cognitive processes require that we understand how humans assess similarity.” Ah, Artificial Intelligence! It should have been obvious, really: to see how one thing is similar to another does require intelligence of some sort, and is not something machines can easily do. And that’s at the heart of why AV search is so difficult. Again, you’d do well to visit the SAPIR page for a proper understanding of SAPIR’s vision and problems.

Mimicking The Eye
Yet another one in this space is LTU Technologies. LTU’s technology can distinguish between “duplicate,” “cloned,” and “similar” images. Duplicates are identical copies; cloned images are those modified somewhat, such as by being stretched; and similar images are, well, similar. Like a Boeing and an F-16: they’re both airplanes.
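The three-way split is easier to picture with a toy. The Python sketch below is our own guess at the flavour of the idea and certainly not LTU's technology: we assume tiny greyscale images of the same size stored as lists of rows, reduce each to a crude "average hash" (one bit per pixel), and use a made-up threshold for "similar". It handles a brightened copy, not a stretched one.

```python
def average_hash(pixels):
    """One bit per pixel: 1 if brighter than the image's mean, else 0."""
    flat = [p for row in pixels for p in row]
    mean = sum(flat) / len(flat)
    return [1 if p > mean else 0 for p in flat]

def compare(a, b, similar_max_bits=4):
    """Label two same-sized images; similar_max_bits is an assumed threshold."""
    if a == b:
        return "duplicate"      # identical pixel data
    differing = sum(x != y for x, y in zip(average_hash(a), average_hash(b)))
    if differing == 0:
        return "cloned"         # pixels changed, light/dark pattern intact
    if differing <= similar_max_bits:
        return "similar"        # pattern mostly the same
    return "different"
```

Brighten every pixel of an image by the same amount and compare calls it "cloned": the pixels differ, but the pattern of light and dark does not. Telling a Boeing from an F-16 while still knowing both are airplanes is a much taller order.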

LTU says they copied the human visual system. See LTU’s own description for the details. We’ll quote a little: “The analyzer learns object profiles, refines its sense of what an object ‘looks like’ and, therefore, continuously enriches its internal knowledge base.” Artificial Intelligence is, clearly, fundamental to the scheme. The concept of similarity comes in here as well, like in SAPIR.

A Hard Problem
It all seems to have happened over the past year. On 16 November 2006, it was announced that NEC had made a “technological breakthrough” in finding TV programmes on your PVR or chapters on your DVD. They call it “topic division technology.” Here, patterns from the AV data of a programme are analysed, and all the data related to a specific search term can be brought up. The example given goes like this: “When searching for Major League Baseball player Suzuki Ichiro, the technology uses keywords like ‘baseball,’ ‘United States,’ and ‘outfielder’ to locate and play video footage of Ichiro.” Naturally, a huge database of terms is required, which is what will link “Ichiro” to “outfielder” and “baseball”. In addition, the system would have to have been trained on tons of videos to recognise what a baseball game is, and where the outfielder is.
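The term-database half of that is simple enough to sketch in Python. To be clear, this is our illustration and not NEC's system: the RELATED_TERMS table, the segment labels (which we simply assume a trained recogniser has already produced) and the min_matches cut-off are all hypothetical.

```python
# Hypothetical term database linking a name to related keywords.
RELATED_TERMS = {
    "ichiro": {"baseball", "united states", "outfielder"},
}

# Hypothetical programme segments: start time in seconds, plus labels
# that a trained video recogniser is assumed to have attached.
segments = [
    {"start": 0, "labels": {"weather", "forecast"}},
    {"start": 300, "labels": {"baseball", "outfielder", "united states"}},
    {"start": 540, "labels": {"baseball", "pitcher"}},
]

def find_segments(query, min_matches=2):
    """Expand the query with related terms, then return the start times of
    segments sharing at least min_matches terms, best match first."""
    key = query.lower()
    wanted = {key} | RELATED_TERMS.get(key, set())
    scored = [(len(wanted & seg["labels"]), seg["start"]) for seg in segments]
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return [start for score, start in scored if score >= min_matches]
```

Searching for "Ichiro" returns only the segment starting at 300 seconds, since it shares three expanded terms with the query, while the pitcher clip shares just one. The hard part, of course, is the bit we assumed away: getting a machine to attach those labels to raw video in the first place.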

It’ll Happen
The need for AV search will (you guessed it) only increase as more people upload images and videos to the Web. In retrospect, it all happened very quickly: where was Flickr or YouTube or Yahoo! Photos just three years ago?

With almost all accessible content being AV, it is, obviously, only a matter of time before AV search evolves to the level of text search. We should mention that AV search will always use metadata to complement the hard technology. What’s not clear is whether it’ll happen gradually, or whether there will be a killer app. We’re betting on the former.

Ram Mohan Rao