
Thursday, July 5, 2012

Unveiling The Technology Behind Leapmotion


The information in this article is based on guesses drawn from sources on the Internet, including scientific articles and demonstration videos. Mirrors seem not to be used; other, simpler mechanisms are more likely, as noted below near Fig.5.

Introduction

Last month I was surprised, like everybody else, while watching the Leapmotion video. Many well-known people working in the NUI area suspected that it was a fake.
The technology used in Leapmotion is said to be 100x more accurate than the Kinect while using only 2% of the CPU, a complete and unusual break from today's technologies.
Since the launch, I have been searching in my free time for the technology that could be inside Leapmotion, and it appears that I have found how they managed to release such a great device.

Search and Elimination

I went through the technologies that can recognize gestures:

  • Ultrasound: could be used, but at that scale even the best-case precision is no better than 1 cm.
  • Electric field sensors: not precise enough either, and they can't detect non-conductive objects.
  • Structured light: Kinect-like? But the Kinect is less accurate and not as responsive as the Leap.
You can find more guesses here and here.
If it is none of these, what can the technology be?

Tracking the tiny details

Leapmotion has released many videos to demonstrate the gadget. By watching them carefully we can guess the limits of the system and, from those, guess what the system is made of.

Let's start with this photo:
Fig.1: David Holz demonstration

As you can see in Fig.1, David's hand is rendered as a set of dynamic points, but it doesn't look like an exact human hand. The fingers and the palm are fitted inside egg-like shapes. This can only mean one thing: what we see is a model of the hand, not the raw input.

Having a model also means that the input does not carry complete information. And this answers one of my first questions:
How did they manage to get the whole surface of 3D objects, even the parts not facing the sensor?
Engadget was the first website to release another useful detail: the nature of the sensors used.
Fig.2: The sensors used in Leapmotion are just bare VGA cameras!

Having cameras as sensors confirms the use of a model to display the hand, because cameras can only see the surface facing them.


A precision of 0.01 mm? This is another important detail. Getting that precision from bare cameras means that there is more information hidden in the data than can be resolved in the usual RGB pixel space. 0.01 mm = 10 µm, which is not very far from infrared wavelengths.
Here we can no longer speak about pixels but only about frequencies: the space in which the information is computed is the frequency domain, after a Fourier transform.
(And this is how I see it: if you can't solve an equation in the "usual" space, just find another space where it becomes easy, even an imaginary one like the complex plane. Does it remind you of sci-fi movies and parallel worlds? If you have a problem that is difficult to solve in your life, move it to another one, and once you have the solutions, bring them back with a small transformation. ;-)


The Missing Link !

During my M.Sc. in Human-Computer Interaction at Telecom Bretagne, here in France, I learned that it takes about 30 years for a new invention to go from research to mainstream products, which Bill Buxton calls the long nose of innovation. So if the technology is here now, it must have existed for a long time.
Searching through the cloud of innovations I had seen in the past, I remembered a company that made a lot of buzz around this time last year: Lytro.
Lytro made a new product which lets you take pictures and then refocus on any object a posteriori.
Here comes the missing link: if we are able to refocus on objects, we should be able to recover 3D depth from out-of-focus objects. The blur of an unfocused object is a continuous function, limited only by the light's wavelength and the sensor's accuracy, and it is resolved in the frequency space already mentioned.

Once I had the right keywords, I could search the research literature, and it turned out that a lot of papers explain the concept in detail.

Depth of scene from depth of field

The first paper I found was published in 1982, 30 years ago, confirming Buxton's long nose once again.

To picture the concept, take your smartphone, open the camera app and touch the screen: you can focus on any object in the scene. The phone has a function that moves the lens to make an object look sharp; the function simply looks for the lens position where the selected region has the least blur.
Now imagine the inverse: the lenses are fixed at a predefined focus distance and depth of field (DoF). Objects look sharp only when they are inside that DoF; if they are farther or nearer, they look blurry. Using the inverse of the autofocus function found in any camera, you can predict the depth from the blur.
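To make the "least blur" idea concrete, here is a minimal sketch of a focus measure in Python. It is only an illustration under my own assumptions: a 1D step edge stands in for an image, a box filter stands in for lens blur, and the helper names (`box_blur`, `sharpness`) are invented for this post, not taken from any camera firmware.

```python
def box_blur(signal, radius):
    """Blur a 1D signal with a box filter (window truncated at the borders)."""
    n = len(signal)
    out = []
    for i in range(n):
        window = signal[max(0, i - radius):min(n, i + radius + 1)]
        out.append(sum(window) / len(window))
    return out


def sharpness(signal):
    """Focus measure: energy of the discrete second derivative (1D Laplacian)."""
    return sum(
        (signal[i - 1] - 2 * signal[i] + signal[i + 1]) ** 2
        for i in range(1, len(signal) - 1)
    )


# A sharp edge, then the same edge seen through increasing amounts of blur.
edge = [0.0] * 16 + [1.0] * 16
candidates = {radius: box_blur(edge, radius) for radius in range(5)}
scores = {radius: sharpness(img) for radius, img in candidates.items()}

# An autofocus-like search keeps the candidate with the highest score.
best_radius = max(scores, key=scores.get)
```

The score falls steadily as the blur radius grows, so the search picks the unblurred edge. Reading the same curve backwards (from score to blur, then from blur to distance) is the inverse function mentioned above.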

"Along each geometric ray between the image plane and the lens, the image moves from being in relatively poor focus, to a point of best focus, and then back to being out of focus. Thus if we could trace along the path of each incoming ray to find the point of exact focus then we could recover the shape of the 3D world."

That is the concept, but in the real world you need to go deeper into the mathematics and technical details.


 
Fig.3: Images with different DoF from a single shot [levin2007]


The simple key formula used by Alex Paul Pentland for the distance \(D\) to an imaged point is:
\( D=\frac{Fv_0}{v_0-F-\sigma f} \)
where:
\(v_0\): distance between the lens and the image plane
\(f\): f-number of the lens system
\(F\): focal length of the lens system
\(\sigma\): the spatial constant of the point spread function (the radius of the blur circle), which is the only unknown variable.
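As a quick sanity check, the formula above can be typed directly into Python. This is a sketch, not Pentland's code: the function name and the lens values (a 50 mm lens at f/2 with the image plane at 52 mm) are numbers I made up for the example, and all lengths are assumed to be in millimetres.

```python
def depth_from_blur(F, v0, f_number, sigma):
    """Distance D to the imaged point, D = F*v0 / (v0 - F - sigma*f).

    F        focal length of the lens system
    v0       distance between the lens and the image plane
    f_number f-number of the lens system
    sigma    spatial constant of the point spread function (blur radius)
    """
    denominator = v0 - F - sigma * f_number
    if denominator <= 0:
        raise ValueError("blur too large for this lens configuration")
    return F * v0 / denominator


# Hypothetical lens: F = 50 mm, v0 = 52 mm, f/2.
in_focus = depth_from_blur(50.0, 52.0, 2.0, 0.0)   # no blur at all
blurred = depth_from_blur(50.0, 52.0, 2.0, 0.5)    # blur radius of 0.5 mm
```

With \(\sigma = 0\) the expression reduces to the thin-lens in-focus distance \(Fv_0/(v_0-F)\), which gives 1300 mm for these values. Once \(\sigma\) is measured, everything else in the formula is a known constant of the lens.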

Because differing aperture size causes differing focal errors, the same point will be focused differently in the two images. The critical fact is that the magnitude of this difference is a simple function of only one variable: the distance between the viewer and the imaged point. To obtain an estimate of depth, therefore, we need only compare corresponding points in the two images and measure this change in focus.
\( k_1\sigma_2^2 + k_2 \ln \sigma_2 + k_3 = \ln F_1(\lambda) - \ln F_2(\lambda) \)

The difference in localized Fourier power is a monotonic increasing function of the blur in the second image. Or by the first equation, the distance to the imaged point is a monotonic decreasing function of the difference in the localized Fourier Power.
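The monotonic relation between blur and the difference in Fourier power can be checked numerically. The sketch below is not Pentland's estimator (it does not solve for \(k_1\), \(k_2\), \(k_3\)); it only assumes a Gaussian point spread function, a 1D square-wave texture and a circular convolution, and compares the log power of a sharp and a blurred signal at one frequency. All function names are mine.

```python
import cmath
import math


def gaussian_kernel(sigma, radius):
    """Sampled, normalised Gaussian weights on [-radius, radius]."""
    weights = [math.exp(-(x * x) / (2 * sigma * sigma))
               for x in range(-radius, radius + 1)]
    total = sum(weights)
    return [w / total for w in weights]


def circular_blur(signal, sigma):
    """Convolve a periodic 1D signal with a Gaussian of spatial constant sigma."""
    radius = int(4 * sigma) + 1
    kernel = gaussian_kernel(sigma, radius)
    n = len(signal)
    return [
        sum(kernel[j + radius] * signal[(i + j) % n]
            for j in range(-radius, radius + 1))
        for i in range(n)
    ]


def fourier_power(signal, k):
    """Squared magnitude of the k-th DFT coefficient."""
    n = len(signal)
    coeff = sum(s * cmath.exp(-2j * math.pi * k * i / n)
                for i, s in enumerate(signal))
    return abs(coeff) ** 2


def log_power_difference(sharp, blurred, k):
    """ln F1 - ln F2 at frequency index k."""
    return math.log(fourier_power(sharp, k)) - math.log(fourier_power(blurred, k))


n, k = 64, 8
sharp = [1.0 if i % 8 < 4 else 0.0 for i in range(n)]  # texture with period 8
omega = 2 * math.pi * k / n
diffs = [log_power_difference(sharp, circular_blur(sharp, s), k)
         for s in (0.5, 1.0, 1.5)]
```

For a Gaussian blur the difference should be close to \(\sigma^2\omega^2\), and the three values in `diffs` do grow with \(\sigma\). So measuring the change in power at a known frequency gives the blur, and the blur gives the distance.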

And as this post is not a formal scientific article, I won't include a lot of math, only references to it. All you need to retain is that with some tinkering you can get depth from focus and defocus. To learn in detail how this is done, start with [Pentlend87,Pentlend89] and [Xiong93], then follow the newer work that builds on these papers.

How is this used in the Leapmotion?

Fig.4: The optical system of multi-focus scene capture as explained by Pentland.


The Leapmotion uses about 3 cameras. Each camera should see the same frame, to remove the need for calibration, as in Fig.4, so a basic system of mirrors and lenses is needed. As you can see here, the image of the scene enters a system of half-silvered mirrors and is split into 3 paths, each with a lens focused at a different distance. The pictures delivered by the cameras are similar to the ones shown above in Fig.3, but captured simultaneously and in real time.
In contrast to stereo vision, this optical system removes the need for heavy computation to calibrate the images, build a disparity map and match objects.

Fig.5: The possible system used in leapmotion
Update: The Leapmotion may not use a mirror system at all to generate the disparity map. Besides depth from defocus, there are other mechanisms that can detect depth variations with very high precision.


Acceleration of the computation

After all this, we know that they use several cameras to capture the surface, and that the computation happens in the frequency domain. But how did they manage to get a CPU usage of 2%, according to ExtremeTech?
The Leapmotion is reported to use about 2% of the processor. This can be achieved quite easily if we precompute all the values of the main function and store them in a cache. Then, instead of computing on the CPU, we only read the values and use them directly.
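Here is what such a cache could look like, as a rough sketch. I am guessing that the "main function" is something like Pentland's depth formula quoted earlier; the table size, the lens values and the function names are all assumptions made for the example, not anything known about the real device.

```python
def depth_from_blur(F, v0, f_number, sigma):
    """Pentland's formula D = F*v0 / (v0 - F - sigma*f), lengths in mm."""
    return F * v0 / (v0 - F - sigma * f_number)


def build_depth_table(F, v0, f_number, sigma_max, steps):
    """Precompute the depth for evenly spaced blur values in [0, sigma_max]."""
    step = sigma_max / (steps - 1)
    return [depth_from_blur(F, v0, f_number, i * step) for i in range(steps)]


def lookup_depth(table, sigma, sigma_max):
    """Read the nearest precomputed depth instead of evaluating the formula."""
    position = sigma / sigma_max * (len(table) - 1)
    index = min(len(table) - 1, max(0, int(position + 0.5)))
    return table[index]


# Hypothetical lens (F = 50 mm, v0 = 52 mm, f/2), 256 table entries.
SIGMA_MAX = 0.5
TABLE = build_depth_table(50.0, 52.0, 2.0, SIGMA_MAX, 256)

approx = lookup_depth(TABLE, 0.25, SIGMA_MAX)      # one list read per pixel
exact = depth_from_blur(50.0, 52.0, 2.0, 0.25)     # what the table stands in for
```

The table trades a little accuracy for speed: the lookup error depends on how finely the blur range is sampled, so a real system would pick the table size according to the precision it needs.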

The post-Leapmotion era

The introduction of devices this precise and accurate, yet built on simple mathematical models, marks a break at two levels:
  1. The way input should be handled in today's computers and operating systems
  2. The events, and how they are routed inside apps, widgets, daemons... (a post-events abstraction era?)
The first point is mainly a reorganisation of the input subsystem into something more dynamic. We should not forget that Leapmotion is just 3 cameras plus some magic mathematical formulas. I see the math as a filter applied to bare video input from 3 input devices, which gives us more information than meets the eye. Any combination of new "filters" that mix input devices can produce more wonderful "sources".

The second point indicates that we are now standing at the edge of the old model of standard, predefined input events: the model where widgets are subscribed by default to keyboard/mouse events "or similar", take them when they have focus, then propagate them to parent widgets if they don't consume them.

Future applications or "toolboxes" need a new model that allows subscription to new "equivalent sources", with the ability for some of them either to subscribe or to cede access to their internal functions to third-party managers, eliminating even the need to provide a CLI/GUI/...
This is partly discussed in the last post and needs working prototypes.

Linux is the only system where it will be possible to try a new concept of input handling, similar to how StreamInput is currently being discussed: an input mechanism where the sources are chosen in cascade, optimised, then compiled into input pipes and provided for higher-level use. We may only lack the bravery and will to change the way everything works now, but with some help from mainline kernel developers, the Khronos Group and the people working on input and driver factorisation at LII-ENAC, this can one day see the light.
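To give an idea of what I mean by sources, filters and input pipes, here is a toy sketch in Python. It has nothing to do with the actual StreamInput proposal or any kernel API: the three "cameras" are plain lists of made-up blur readings, and `zip_sources`, `map_filter` and `build_pipe` are names invented for this post.

```python
def zip_sources(*sources):
    """Merge several raw sources into one source of per-frame tuples."""
    return zip(*sources)


def map_filter(func):
    """Turn a per-frame function into a filter that maps a source to a source."""
    def apply(source):
        return (func(frame) for frame in source)
    return apply


def build_pipe(source, *filters):
    """Chain filters in cascade; the result is itself a new source."""
    for input_filter in filters:
        source = input_filter(source)
    return source


# Three hypothetical cameras, each reporting one blur estimate per frame.
cam_a = [0.10, 0.20, 0.40]
cam_b = [0.12, 0.22, 0.38]
cam_c = [0.11, 0.18, 0.42]

average = map_filter(lambda frame: sum(frame) / len(frame))
classify = map_filter(lambda blur: "near" if blur < 0.25 else "far")

# A new "equivalent source": 3 devices + 2 filters = one stream of events.
events = list(build_pipe(zip_sources(cam_a, cam_b, cam_c), average, classify))
```

An application would subscribe to `events` without knowing whether they come from three cameras, a mouse, or anything else, which is the whole point of the model.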