The magic behind the science

We have answered the when, who, what and why of augmented reality. It's time now to answer the how.

The past few thousand words were dedicated to telling you about the hardware used in AR. Before that, we talked about the needs, significance and history of AR, and even about the different ways in which AR is implemented. It would have become obvious to you that the technology is not simple; at various levels, different techniques are needed to achieve the goal. In the following few pages, we talk about the technologies that bind the various parts together to produce the mesmerizing effects of AR, going through each type one after another.

Projection based AR

We start off with projection based AR because the number of technologies involved and the complexity of their interrelationships are lowest here. In the case of projection based AR, one has to deal with only a few parameters. The primary point of concern, as we have already said, is the creation of the image sequence to be projected. Depending on the needs, quite a number of tools can be used. Let us begin with non-interactive AR.

Non-interactive AR

If the projection surface is just a plane, it is easy to generate the video sequence. The software required would depend, once again, on the specific need. If only static images are needed, basic image processing tools along with video editing tools suffice.

For more advanced uses with captivating effects, more tools are required. Let us reconsider the case of projection on a cube with an edge length of 2 meters. It is difficult to simply imagine a sequence that would look good on the cube without actually testing it. Let us not forget that the orientation of the cube matters – it may be kept on one of its sides, tilted on an edge or placed on one of its corners. The variety of options depends on the projection surface and its complexity.

In such cases, 3D modelling and animation software can prove immensely useful. Most such software can simulate any 3D environment and allow the designer to find the best image sequences. They are also useful when the experimentation space is not large enough. Since the software can produce animation with 3D objects, it can also help in generating the actual light sequence to be projected. 3D animation software in non-interactive projection based AR can thus serve as a WYSIWYG tool for the result.

Interactive AR

In cases where projection based AR is used for interactive purposes, the target surfaces are mostly planar, and the projection is not a light sequence designed with a video editor or 3D animation tool well before an event; instead, it is produced by software in real-time. For the interaction to happen, at least a basic recognition system needs to be in place. The most important piece of technology, which brings life to the augmentation, is the recognition system – something we will learn about in a short while.

Location based AR

Next in line in terms of technical complexity is location based AR. Here, multiple factors are at play, and location based AR apps take inputs from various sources. These typically include:

•    GPS - It allows the app to pinpoint the location of the device.

•    Acceleration sensor - To find out how much you moved and in which direction. This prediction allows the app to update the view before fresh GPS data is received. Also, since the accuracy of GPS is limited to approximately a couple of meters, it serves you when your movement is below that threshold.

•    Compass - To tell the direction you are looking in.

•    Orientation sensor - It tells the application how much the device is tilted. This helps predict the angle of view from the ground or, in other terms, helps figure out the location and angle of the horizon relative to the device camera.

The application locates points of interest on your screen by performing the following steps:

1 Depending on the location received by the GPS system, the app requests its server to send a list of points-of-interest nearby along with their coordinates.

2 It then reads the direction you are looking in (from the compass) and the angle at which you are looking.

3 Combining these two pieces of information, it loads the data about the points-of-interest that it received from the server and overlays it on the camera view.

May we remind you again that most location based AR apps do not understand what they see – they do not recognise the buildings and mountains (or anything at all). All they do is calculate your view relative to other points in your surroundings and place those points on the screen.

Trigonometry is what puts it all together. The role of the AR application is simple – it has to place the information overlay at the correct places. The accuracy of a location based AR app depends on the accuracy of the data received from the different sensors. If GPS data is not available, location based AR apps cannot work properly. At the same time, an error in any other sensor would result in a shift of the augmentation layer (depending on which sensor has gone kaput).
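The trigonometry involved can be sketched in a few lines of Python. This is only an illustration, not how any particular app is written – the coordinates, field of view and screen width below are hypothetical, and a real app would use more precise geodesic formulae. The idea, though, is exactly the one described above: compute the compass bearing towards a point-of-interest, compare it with the direction the camera faces, and map the difference to a horizontal screen position.

```python
import math

def bearing_to(lat1, lon1, lat2, lon2):
    """Initial compass bearing (degrees) from point 1 to point 2."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dlon = math.radians(lon2 - lon1)
    y = math.sin(dlon) * math.cos(phi2)
    x = math.cos(phi1) * math.sin(phi2) - math.sin(phi1) * math.cos(phi2) * math.cos(dlon)
    return math.degrees(math.atan2(y, x)) % 360

def screen_x(device_heading, poi_bearing, fov_deg, screen_width):
    """Horizontal pixel position of a POI, or None if it is outside the camera's field of view."""
    # Signed angle between where the camera points and where the POI lies
    offset = (poi_bearing - device_heading + 180) % 360 - 180
    if abs(offset) > fov_deg / 2:
        return None  # POI is not in view
    return screen_width / 2 + (offset / fov_deg) * screen_width

# A POI due north of the device, camera facing north: the label lands at screen centre
b = bearing_to(28.6129, 77.2295, 28.7129, 77.2295)  # hypothetical coordinates
print(screen_x(0.0, b, 60.0, 1080))  # -> 540.0
```

The same reasoning, applied vertically with the orientation sensor's tilt angle, gives the y-coordinate of the overlay.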

Recognition based AR

This is the most interesting breed on the list. The technologies that form the base here can aid other AR implementations, and it also happens to be the most researched area for the same reason. The technologies involved in recognition based AR are as varied as the use-cases. Recognition based AR depends on a semantic understanding of the object in front of the camera, and then on replacing it with another object on the output display or overlaying extra information related to it.

The multidimensional usage of recognition systems involves various disciplines, and quite a significant part of that (if not all) is useful for AR. Before we start off, it would be good to keep in mind that not all technologies pertaining to recognition systems apply to a single recognition based AR implementation.

The process of Image Registration allows the faint mountain range in the background to be recognised

Image registration

Image registration is the process of merging different sets of data of the same object into one image. The different data sets can be different images of the same object taken at different times or from different angles, and may include data received from other sources such as an infrared view. When these different sets of data are merged into one, they may reveal a number of attributes which are not normally visible.

The process of registration involves more than one image. One of the images is called the target image and the other the reference image. During the registration process, the target image is transformed so that it aligns with the reference image. The transformation can be done in multiple ways, each best suited to certain types of images. The registration process can try to align the target and reference images based on quite a number of registration methods. Some are:

1 Intensity based: When entire portions of images can be matched.

2 Feature based: When particular parts of images such as lines, contours etc. can be matched.

3 Transformations based: Rotated or scaled versions of images.

4 Multi-modality based: When images produced by different types of sensors are joined together.

The above list of registration methods is not exhaustive, and many methods are named by combining more than one of them, e.g. the spatial registration method involves intensity, feature as well as transformation based methods at the same time.

All these methods of image registration are imperfect, though one of the prime concerns of image registration is to reduce errors. An error in matching the patterns in the received images can result in the target image being overlaid on the reference image with a planar difference between two points which represent the same part of the image (it would appear as if a displaced layer of the same image has been put over the actual one).

Right from merging more than one image of the same object to showing more than one object on screen, this technology plays a role in recognition based AR. If the images received from more than one sensor are not merged properly, the application fails to serve its purpose.
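To make the idea concrete, here is a minimal sketch of one intensity based registration technique: phase correlation, which recovers the translation between a target and a reference image from the peak of their cross-correlation. The images and the shift below are made up for illustration; real registration also has to handle rotation, scaling and noise.

```python
import numpy as np

def estimate_shift(reference, target):
    """Estimate the (row, col) translation aligning target to reference via phase correlation."""
    f_ref = np.fft.fft2(reference)
    f_tgt = np.fft.fft2(target)
    cross_power = f_tgt * np.conj(f_ref)
    # Normalising leaves only the phase difference, whose inverse FFT peaks at the shift
    corr = np.fft.ifft2(cross_power / (np.abs(cross_power) + 1e-9))
    peak = np.unravel_index(np.argmax(np.abs(corr)), corr.shape)
    # Peaks past the midpoint correspond to negative shifts (wrap-around)
    return tuple(int(p) if p <= s // 2 else int(p - s) for p, s in zip(peak, corr.shape))

ref = np.zeros((32, 32))
ref[10:14, 10:14] = 1.0              # a bright patch in the reference image
tgt = np.roll(ref, (3, 5), axis=(0, 1))  # the same patch, displaced in the target image
print(estimate_shift(ref, tgt))      # -> (3, 5)
```

Once the shift is known, the target image can be moved back by that amount so that both data sets line up, which is exactly the alignment step described above.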

Remember how the Terminator recognised objects such as guns and even jacket sizes?

Object recognition

It is very difficult for a person to look at a washing machine and confuse it with a television. Though humans can recognise a large number of objects easily, recognition techniques for computers are still in their infancy. Object recognition has its roots in artificial intelligence and pattern recognition, and is also one of the driving factors for the image registration process. Interestingly enough, the technology has been imagined repeatedly by filmmakers, right from The Terminator to Transformers.

Most object recognition methods depend heavily on objects that are already known. The larger the number of objects known to the system, the better the chances of correct object recognition. Let us first have a closer look at which recognition methods can act as hints in forming a complete understanding of the object. We would start with simple detection techniques and advance as we move forward.

Looking for gradients

Any curved surface reflects light as a gradient, since reflection varies continuously across the surface. The gradient produced by a cylindrical object is different from that of a conical object. Studying the gradients in the images received via the camera helps in figuring out the overall structure of the object. The major challenge here is that the gradient of each shape making up the object varies with lighting conditions.
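A toy example shows the principle. The intensity profiles below are made up: a flat surface under uniform light reflects evenly, so its intensity gradient is zero, while a cylinder's brightness rises and falls smoothly across its width, producing a continuously varying gradient.

```python
import numpy as np

# Hypothetical one-row intensity profiles across two surfaces
flat = np.full(8, 0.6)                                # uniform reflection
cylinder = np.sin(np.linspace(0.2, np.pi - 0.2, 8))   # bright centre, darker edges

print(np.round(np.gradient(flat), 3))      # all zeros: no gradient
print(np.round(np.gradient(cylinder), 3))  # smoothly varying values
```

An algorithm scanning for such smooth, shape-specific intensity variations can guess at the curvature of the surface producing them.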

Comparing grayscale versions

Grayscale versions of images are easier to compare than the coloured versions. Think of it like this: a picture of a pencil box taken under green light looks different from a picture of the same pencil box taken in blue light. But when you convert both images into grayscale (i.e. black and white), the pictures do not look that different. While to the human eye the difference is trivial even in colour, it makes a significant difference to a computer-based algorithm.
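The pencil box example can be sketched with a standard luminance conversion. The two tiny "photos" below are invented 2x2 RGB patches standing in for the same object lit in green and in blue; after conversion, the grayscale versions differ less than the colour versions do.

```python
import numpy as np

# Hypothetical 2x2 RGB patches of the same object under green vs blue light
under_green = np.array([[[40, 200, 60], [35, 180, 50]],
                        [[30, 160, 45], [25, 150, 40]]], dtype=float)
under_blue  = np.array([[[40, 60, 200], [35, 50, 180]],
                        [[30, 45, 160], [25, 40, 150]]], dtype=float)

def to_grayscale(rgb):
    """Standard ITU-R BT.601 luminance weighting of the R, G and B channels."""
    return rgb @ np.array([0.299, 0.587, 0.114])

diff_color = np.abs(under_green - under_blue).mean()
diff_gray = np.abs(to_grayscale(under_green) - to_grayscale(under_blue)).mean()
print(diff_color > diff_gray)  # the grayscale versions are closer than the colour ones
```

This is why many comparison algorithms throw colour away first: the lighting-dependent channel differences collapse into a single brightness value.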

Looking for edges

If you have used any sophisticated image editing program (such as Photoshop, GIMP, Paint.NET etc.), you would know of an effect which highlights edges in pictures. A similar approach serves really well in detecting objects. Unlike gradients, which vary under lighting conditions, edges do not: in most normal lighting conditions, the location and size of edges do not change.
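The edge-highlighting effect those programs offer is usually built on something like the Sobel operator. The sketch below is a deliberately simple version of that idea, run on a made-up image of a dark square on a bright background: edge pixels light up only along the square's boundary.

```python
import numpy as np

def sobel_edges(img, threshold=1.0):
    """Gradient-magnitude edge map using 3x3 Sobel kernels (interior pixels only)."""
    kx = np.array([[-1, 0, 1], [-2, 0, 2], [-1, 0, 1]], dtype=float)
    ky = kx.T
    h, w = img.shape
    gx = np.zeros((h - 2, w - 2))
    gy = np.zeros((h - 2, w - 2))
    for i in range(h - 2):
        for j in range(w - 2):
            patch = img[i:i + 3, j:j + 3]
            gx[i, j] = (patch * kx).sum()  # horizontal intensity change
            gy[i, j] = (patch * ky).sum()  # vertical intensity change
    return np.hypot(gx, gy) > threshold

# A dark square on a bright background: edges appear only at the boundary
img = np.ones((10, 10))
img[3:7, 3:7] = 0.0
edges = sobel_edges(img)
print(edges.any(), edges.all())  # some edge pixels, but not everywhere
```

Dim the whole image by half and the same edge pixels still cross a proportionally scaled threshold, which is the lighting-stability property the text describes.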

The process of detecting line of sight

Comparing the edges in the image received via the camera with entries in a database of images with similar edge patterns can reveal the nature of the object. For example, the edge patterns of a table are going to be different from those of a chair. A change in viewing angle, however, requires that the edge templates of known objects also be available from various angles.

Search in large databases of object models

This is more like an exhaustive search performed on a large set of object models. In this method, the image is submitted for a search in a database which contains models of various objects from multiple angles. This method is analogous to exhaustive search in other realms of data management and processing.

Feature interpretation search

If you aren’t clear on data structure concepts yet, this might not make much sense. In this method, first of all, features are searched for in the image, e.g. a search for a ‘window’, a ‘door’ and a ‘chimney outlet’ in a picture of a ‘house’ should yield hits. Thereafter, comparisons are run for identical placements of such features in the database of known objects. In our example, the object might just get recognised as a house. The search is done on the nodes of a feature tree. The root node of this tree is kept empty (indicating that no feature was matched). Each child node in the tree contains all the features of its parent plus one more. As more and more shapes are found, the traversal goes down the tree to find the correct feature set identified in the image.
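A miniature version of such a feature tree can be written down directly. The tree, its feature names and the ‘house’ label below are all hypothetical, but the traversal mirrors the description: start at the empty root and descend one matched feature at a time until no child node's extra feature is found in the image.

```python
# Each node is a frozenset of features; each child adds exactly one feature to its parent.
TREE = {
    frozenset(): ["window"],
    frozenset({"window"}): ["door"],
    frozenset({"window", "door"}): ["chimney"],
    frozenset({"window", "door", "chimney"}): [],
}
LABELS = {frozenset({"window", "door", "chimney"}): "house"}

def classify(detected):
    """Walk down the feature tree, adding one matched feature per step."""
    node = frozenset()
    while True:
        extensions = [f for f in TREE.get(node, []) if f in detected]
        if not extensions:
            break  # no deeper node matches what was detected
        node = node | {extensions[0]}
    return LABELS.get(node, "unknown")

print(classify({"window", "door", "chimney"}))  # -> house
print(classify({"window", "wheel"}))            # -> unknown
```

A real system would of course also check the relative placement of the features at each node, not just their presence.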

It is notable that features can be built from nested patterns, e.g. a window can be rectangular in shape with two rectangular panes. If each pane contains a circle drawn over it for decorative purposes, then the window can be identified as a rectangle made of two rectangles with a circle inside each. In such cases, the successful recognition of the window would depend on the algorithm and the nodes available in the feature search tree.

Yet another hurdle in this case is the size and relative position of other objects. Suppose that the feature interpretation based search has confirmed that the object in front of the camera is a house, but it also identifies a shoe in the image which happens to be larger than the house itself; such a match would suggest that the house is actually a toy kept beside a real shoe. For the recognition algorithm to correctly peg this, it would require more than feature interpretation.

Hypothesization and testing

In this technique, various features are detected in the image first. The list of features is then searched against the lists of features in known object models, with the overall location of the features also sent as a search parameter. From the resulting set, the object model is then projected in software to decide whether the projection of the object returned as a search result looks similar to what is being seen by the camera. If the hypothetical image (the projection of the object model received as a search result) is sufficiently similar to the image seen by the camera, the hypothesis is accepted and the object is recognised as the one received in the search results.

Scale Invariant Feature Transformation (SIFT)

SIFT works by first extracting key points (features) of an object and storing them in a database. This is done before SIFT is employed for object recognition. When an object has to be recognised, its features are detected along with their positions and alignments, and searched for in the database.


This technique applies to those properties of an object which do not change as the position of the camera changes. The condition applies mostly to images of planar objects.

There are a few other recognition techniques such as Geometric Hashing, Pose Clustering, and Speeded Up Robust Features (SURF). Unfortunately, understanding them requires some complex mathematical theories which we cannot go into for the purposes of this FastTrack.

Challenges of Object Recognition

The biggest challenge in object recognition happens to be the wide array of possibilities available during the pattern search. All the techniques we mentioned need a database of known object models. However, a database of object models for one technique does not serve well for others. For example, the database for searching through features in an image will be different from the one which uses grayscale images for comparison.

Databases would thus have to store images of different objects from different angles in various formats to support more than one type of pattern matching algorithm. While this increases the time taken for object recognition, it improves accuracy and brings down the probability of error, because none of the techniques is deemed perfect.

Further difficulty is posed by the requirement to identify partly visible objects. Going back to our case of detecting a house: from some viewpoints, only limited visibility of the house is possible – e.g. if you look at a house through the window of another house, you may see only part of the object. When the visibility of an object is obstructed, identification becomes more difficult, and the comparison algorithm sitting between the camera and the object database has to be that much cleverer.

In many cases, more than one object is present in front of the camera and one object obstructs the view of another. Under such circumstances, both objects need to be detected, along with which object is obstructing the view of the other. The simple method for such comparisons is to see which object provides an unobstructed (complete) view and which one does not. Adding more objects to the mix may eventually bring out erratic behaviour from the algorithm.

Interpreting the video feed

It is one thing to recognise an object and quite another to keep recognising multiple objects many times every second, when new ones can appear at any time and current ones can disappear, move and rotate. Although object recognition forms the base of recognition based AR, it is still only half the story. The other half lies in interpreting the video feed.

Videos are constructed by taking pictures at a very fast rate. For a video to be watchable, it should run at 25 frames per second or more. Translate this requirement to an AR application and it means that the application should not take more than 1/25th of a second to identify objects. The creation of the augmentation layer must also take place within the same time frame.

These constraints call for better hardware as well as more efficient algorithms at all levels. Typically, an algorithm using dedicated graphics hardware can do pattern recognition at a much faster rate than a completely CPU based algorithm. One method used in recognition based AR is to identify the object once and then track its movement as it happens. Comparing the previous image frame with the current one can suggest the displacement and rotation of objects. In many cases, a 3D wireframe model is present within the application to assist the comparison, which helps in understanding the differences better.
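A crude frame-to-frame tracker can be sketched in a few lines. The frames below are invented binary images with a single bright object; the displacement is estimated simply as the shift of the object's intensity centroid between frames, which is far cheaper than re-running full recognition within the 1/25th of a second budget.

```python
import numpy as np

FRAME_BUDGET = 1.0 / 25  # seconds available per frame at 25 fps

def track_displacement(prev_frame, curr_frame):
    """Estimate object displacement as the shift of its intensity centroid."""
    def centroid(frame):
        ys, xs = np.nonzero(frame)
        return np.array([ys.mean(), xs.mean()])
    return centroid(curr_frame) - centroid(prev_frame)

prev = np.zeros((20, 20)); prev[5:8, 5:8] = 1.0
curr = np.zeros((20, 20)); curr[7:10, 9:12] = 1.0  # object moved down 2, right 4
print(track_displacement(prev, curr))  # -> [2. 4.]
```

Real trackers match many feature points rather than one centroid, which also lets them recover rotation, but the division of labour is the same: recognise once, then track cheaply.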

Building 3D models

By now, it would be obvious that a large part of a recognition based AR implementation depends on 3D models of objects. Building a sufficiently large database of object data is required to provide a better AR experience. While images are easy to add (all you need to do is take pictures), 3D models can be difficult: one needs to design the model of the object before it can be fed into the database. Sure, there are tools available for the job. But the variation in design of even simple objects would lead to an enormous number of models, and modelling all of that in 3D is a daunting task. However, with tools such as VideoTrace (http://punchcard.com.au/), the amount of effort reduces.

Together, all these technologies and methods help in recognising the object that lies in front of the camera. Recognising the object is the core need of recognition based AR. The second part – overlaying the object with another object or presenting extra information about it – is easier. It involves creating a transparent layer on top of the actual view and then filling it with the required information.

Detecting the line of vision

This technique needs special kinds of scanners or algorithm-based cameras to track eye movement. It can be used in both augmented and virtual reality systems for various purposes. Detecting the line of vision can help relocate the augmentation layer to the right position. This is especially useful when the user is completely placed in the augmented space, which can include both HMDs and car windscreens. While in the case of HMDs the entire visible area can be augmented, a car windscreen would cover most of the visible space. Tracking the line of vision can help place important information right in front of the eye rather than at a fixed location.

The outline allows this object to be recognised by the algorithm as a car

So there you have it. To recap, you have seen how the augmentation process takes place: the application pinpoints where it is; considers the angle, orientation and direction; and recognises what it's looking at using the various techniques described above. It then overlays information about what it recognises to complete the augmentation process. Brilliant, isn't it? And what kind of wondrous applications can this result in? Check out the next chapter.