
Thursday, July 5, 2012

Unveiling The Technology Behind Leapmotion


The information in this article is based on guesses drawn from Internet sources, including scientific articles and demonstration videos. Mirrors do not seem to be used; simpler mechanisms are more likely, as noted below near Fig.5.

Introduction

Last month I was surprised, like everybody else, while watching the Leapmotion video. Many well-known people working in the NUI area suspected it was a fake.
The technology used in Leapmotion is claimed to be 100x more accurate than the Kinect while using only 2% of the CPU, a complete and unusual break from today's technologies.
Since the launch, I have been searching in my free time for the technology that might be inside Leapmotion. And it appears that I have found how they managed to release such a great device.

Search and Elimination

I have searched for all possible technologies which can recognize gestures:

  • Ultrasound: can be used, but at that scale its precision cannot get below 1 cm, even in the best cases.
  • Electric field sensors: not precise either, and they can't detect non-conductive objects.
  • Structured light: Kinect-like? But the Kinect is less accurate and not as responsive as the Leap.
You can find more guesses here and here
If it is none of these, what can the technology be?

Tracking the tiny details

Leap has released many videos to demonstrate their gadget. After watching them carefully, we can guess the limits of the system and, from those limits, its composition.

Let's start with this photo:
Fig.1: David Holz demonstration

As you can see in Fig.1, David's hand is shown as a set of dynamic points, but it doesn't look like an exact human hand. The fingers and the palm are fitted inside egg-like shapes. This can mean only one thing: what we see is just a model of the hand, not the raw input.

Having a model also means that the input does not carry complete information. And this answers one of my first questions:
How have they managed to get the whole surface of 3D objects, even the part that is not facing the sensor?
Engadget was the first website to release another useful detail: the nature of the sensors used.
Fig.2: Sensors used in Leapmotion are just bare VGA cameras!

Having cameras as sensors confirms the use of a model to show the hand, because cameras can only see the surface facing them.


A precision of 0.01 mm? This is another important detail. Such precision from data provided by bare cameras means that there is more hidden information to be resolved than the usual RGB color space offers. 0.01 mm = 10 µm, which is not very far from infrared wavelengths.
Here we can no longer speak about pixels, only about frequencies. The space in which the information is computed is the frequency domain, after a Fourier transform.
(And this is how I see it: if you can't solve an equation in a "usual" space, just find another space where it becomes easy, even an imaginary one like the complex plane. Does it remind you of sci-fi movies and parallel universes? If you have a problem that is difficult to solve in your life, move it to another one, and once you have the solutions, bring them back with a small transformation. ;-)


The Missing Link !

During my M.Sc. in Human-Computer Interaction at Telecom Bretagne here in France, I learned that it takes about 30 years for a new invention to go from simple research to mainstream products, which Bill Buxton calls the long nose of innovation. So if it is here now, then it should have been around for a long time.
Searching through the cloud of innovations I've seen in the past, I remembered a company that made a lot of buzz last year, during this same period: Lytro.
Lytro made a new product which lets you take pictures and then refocus on any object a posteriori.
Here comes the missing link: if we are able to refocus on objects, we should be able to detect 3D depth from unfocused objects. The blur of an unfocused object is a continuous function, limited only by the light's wavelength and the sensor's accuracy, and resolved in the frequency space already mentioned.

Once I had the right keywords, I could search the research paper databases. And it turned out that a lot of papers explain the concept in detail.

Depth of scene from depth of field

The first paper I found was published in 1982, 30 years ago, confirming again Buxton's law of the long nose.

To picture the concept, take your smartphone, open the camera app and touch the screen: you can focus on any object in the scene. The phone has a function that moves the lens to make an object look sharp; the function simply selects the lens position where the region has the minimum blur.
Now imagine the inverse: the lenses are fixed at a predefined focus distance and depth of field (DoF). Objects look sharp only when they are inside that DoF; if they are farther or nearer, they look blurry. Using the inverse of the function found in any camera, you can predict the depth from the blur.

"Along each geometric ray between the image plane and the lens, the image moves from being in relatively poor focus, to a point of best focus, and then back to being out of focus. Thus if we could trace along the path of each incoming ray to find the point of exact focus then we could recover the shape of the 3D world."

That was the concept, but in the real world you'll need to go deeper into the mathematics and the technical details.
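Before going deeper, the basic idea of "find the point of best focus" can be illustrated with a toy sketch. It assumes we have several 1D intensity profiles of the same edge, each captured with the lens focused at a known distance, and it uses a very crude sharpness score (sum of squared neighbour differences). The function names and the numbers are hypothetical; real systems use 2D images and more robust focus measures.

```python
def sharpness(profile):
    """Crude focus measure: sum of squared differences between neighbours.

    A sharp edge concentrates the intensity change in few pixels,
    which gives a higher score than the same edge spread out by blur.
    """
    return sum((b - a) ** 2 for a, b in zip(profile, profile[1:]))


def best_focus_index(stack):
    """Return the index of the sharpest profile in a focus stack."""
    scores = [sharpness(profile) for profile in stack]
    return scores.index(max(scores))


def estimate_depth(stack, focus_distances):
    """Assume the object sits at the focus distance of the sharpest capture."""
    return focus_distances[best_focus_index(stack)]


# Hypothetical data: the same edge seen with three focus settings (in mm).
stack = [
    [0, 1, 2, 3],  # blurred: the edge is spread over all pixels
    [0, 0, 3, 3],  # sharp: the edge is a single step
    [1, 1, 2, 2],  # low contrast
]
focus_distances = [100, 200, 300]
print(estimate_depth(stack, focus_distances))
```

This is depth from focus in its simplest form: it needs one capture per candidate distance, which is why the formulas below, which estimate depth from the amount of blur itself, are more attractive.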


 
Fig.3: Images with different DoF from a single shot [levin2007]


The simple key formula used by Alex Paul Pentland for the distance \(D\) to an imaged point is:
\( D=\frac{Fv_0}{v_0-F-\sigma f} \)
where:
\(v_0\): distance between the lens and the image plane
\(f\): f-number of the lens system
\(F\): focal length of the lens system
\(\sigma\): the spatial constant of the point spread function (radius of blur circle) which is the only unknown variable.
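As a small worked example, the formula can be written as a function. This is only a sketch of the equation as stated above; the function name and the sample values (a 50 mm lens, all lengths in millimetres) are assumptions chosen for illustration.

```python
def depth_from_blur(F, v0, f_number, sigma):
    """Distance D = F*v0 / (v0 - F - sigma*f), as in the formula above.

    F        -- focal length of the lens system
    v0       -- distance between the lens and the image plane
    f_number -- f-number of the lens system
    sigma    -- spatial constant of the point spread function (blur)
    All lengths must use the same unit; the result is in that unit.
    """
    denominator = v0 - F - sigma * f_number
    if denominator <= 0:
        raise ValueError("blur too large for this lens configuration")
    return F * v0 / denominator


# Hypothetical lens: F = 50 mm, image plane at 52 mm, f/2.
print(depth_from_blur(50.0, 52.0, 2.0, 0.0))  # no blur: the in-focus distance
print(depth_from_blur(50.0, 52.0, 2.0, 0.5))  # some blur: a different distance
```

With zero blur the formula reduces to the thin lens equation, so the first call returns the distance at which this lens is exactly in focus.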

Because differing aperture size causes differing focal errors, the same point will be focused differently in the two images. The critical fact is that the magnitude of this difference is a simple function of only one variable: the distance between the viewer and the imaged point. To obtain an estimate of depth, therefore, we need only compare corresponding points in the two images and measure this change in focus.
\( k_1\sigma_2^2 + k_2 \ln \sigma_2 + k_3 = \ln F_1(\lambda) - \ln F_2(\lambda) \)

The difference in localized Fourier power is a monotonically increasing function of the blur in the second image. Or, by the first equation, the distance to the imaged point is a monotonically decreasing function of the difference in localized Fourier power.

Since this post is not a formal scientific article, I won't include a lot of math, only references to it. All you need to retain is that with some tinkering you can get depth from focus and defocus. To learn in detail how this can be done, you can start by reading [Pentlend87,Pentlend89] and [Xiong93], then follow the newer work that builds on these papers.

How is this used in the Leapmotion?

Fig.4: The optical system of multi-focus scene capture as explained by Pentland.


The Leapmotion uses about 3 cameras. Each camera should see the same picture frame, to remove the need for calibration, so a basic system of mirrors and lenses is needed, as in Fig.4. As you can see, the scene image enters the half-silvered mirror system and is split into 3 paths, each with a lens of a different focal point. The pictures delivered by the cameras are similar to the ones shown above in Fig.3, but captured simultaneously and in real time.
In contrast to stereo vision, this optical system removes the need for massive computation to calibrate the images, build a disparity map and match objects.

Fig.5: The possible system used in leapmotion
Update: The Leapmotion may not even use a mirror system to generate the disparity map. Besides depth from defocus, there are other mechanisms that can provide ultra-precise detection of depth variation.


Acceleration of the computation

After all this, we know that they use several cameras to capture the surface, and that the resolution space is the frequency domain. But how have they managed to get CPU usage down to 2%, according to ExtremeTech?
The Leapmotion is said to use about 2% of the processor. This can be achieved quite easily if we precompute all values of the main function and store them in a cache. Then, instead of computing on the CPU, we only read the values and use them directly.
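A minimal sketch of that caching idea, assuming the "main function" is Pentland's depth formula from earlier in this post and that the blur value is quantized into a fixed number of steps. The function names, table size and lens values are hypothetical, and nothing here is confirmed to be what the Leapmotion actually does.

```python
def build_depth_table(F, v0, f_number, sigma_max, steps):
    """Precompute D = F*v0 / (v0 - F - sigma*f) for evenly spaced blur values.

    The table covers sigma from 0 to sigma_max inclusive.
    """
    table = []
    for i in range(steps):
        sigma = sigma_max * i / (steps - 1)
        table.append(F * v0 / (v0 - F - sigma * f_number))
    return table


def lookup_depth(table, sigma, sigma_max):
    """Read the depth for a blur value from the table instead of computing it."""
    index = round(sigma / sigma_max * (len(table) - 1))
    index = min(max(index, 0), len(table) - 1)  # stay inside the table
    return table[index]


# Hypothetical lens: F = 50 mm, image plane at 52 mm, f/2, blur up to 0.5 mm.
table = build_depth_table(50.0, 52.0, 2.0, sigma_max=0.5, steps=5)
print(lookup_depth(table, 0.25, 0.5))
```

The trade-off is memory and quantization error against computation: the finer the table, the closer the lookup is to the exact formula.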

The post-Leapmotion era

The introduction of devices with such precision and accuracy, and at the same time built on simple mathematical models, marks a break at two levels:
  1. The way input should be handled in today's computers and operating systems
  2. The events and how they are routed inside apps, widgets, daemons... (a post-events abstraction era?)
The first point is mainly a reorganisation of the input subsystem in a more dynamic way. We should not forget that Leapmotion is just 3 cameras plus some magic mathematical formulas. I see the math as a filter applied to bare video input from 3 input devices, which gives us more information than meets the eye. Any combination of new "filters" mixing input devices can produce more wonderful "sources".

The second point indicates that we are now standing at the edge of the old model of standard, predefined input events: the model where a widget is subscribed by default to keyboard/mouse events "or similar", takes them when it has focus, then propagates them to parent widgets if it doesn't consume them.

Future applications or "toolboxes" need a new model that allows subscription to new "equivalent sources", with the ability for some either to subscribe or to cede control of access to their internal functions to third-party managers, eliminating even the need to provide a CLI/GUI/...
This is partly discussed in the last post and needs effective working prototypes.

Linux is the only system where it will be possible to try a new concept of input handling similar to what is currently discussed for StreamInput: an input mechanism where the sources are chosen in cascade, optimised, then compiled into input pipes and provided for upper-level use. We may only lack the bravery and the will to change the way everything works now, but with some help from mainline kernel developers, the Khronos Group and the people working on input and driver factorisation at LII-ENAC, this can one day see the light.



Tuesday, December 13, 2011

Input Pipelines and Sensors Flow Soup

Fig 1. Input Layer Abstraction

By convention, input devices are the peripherals that detect human user input, and sensors are those that detect environmental variation, related or not to the human.
Despite the existence of other definitions, both input devices and sensors are needed to provide rich interaction with users and to allow them to accomplish the tasks they want in less time.

Input devices can be the keyboard and the mouse, which almost every application supports, but the definition also includes touch and multitouch screens, touchpads, joysticks, etc.
From a wider point of view, the keyboard and the mouse seem to be two constants: K1, K2. They are modeled in operating systems using finite state machines, and they are routed to the application without much alteration of their original content (except, for example, the transformation of raw codes into ASCII or Unicode).

The "Why"
Operating systems, and I'll take Linux as an example, have abstraction layers to handle input. Every peripheral generating key presses is seen as a keyboard. Touchpads, mice, touchscreens, pens, etc. are attached to a virtual pointer device and handled as a conventional mouse.
Device drivers may generate other information (finger blob size, blob orientation, etc.), but it is all discarded and doesn't reach the application.

In recent years, we have faced the emergence of (1) an unlimited number of input devices and sensors, each providing very different information, and on the other side (2) rich applications that expose a lot of features, or that have more dimensional variables than those found in a single input device (3D object control in a 3D environment).

With these 2 poles, the problem becomes obvious:
Why is the operating system preventing us from using all of a device's capabilities inside our rich applications?

OK, let's fix this and answer the question with another:
Do you have another solution for mapping input events to application features?


Mapping Input events
Before speaking about mapping, we should study the input itself, taking a few examples.
Touchpad input is treated as a compatible form of mouse input, but the raw information is very noisy. When the raw input is used directly, the pointer trembles instead of moving smoothly. The X.org developers have added a "filter" which smooths the movement (named "Response-augmented Exponentially Weighted Moving Average Filter").
Speaking of mouse input, X.org also adds another "filter" that alters the smoothed input to accelerate its movement, so that you as a user don't need to swipe many times to move the pointer across bigger screens.
Now let's imagine that this flow, represented in Fig 2., ends up as the input that controls the camera in a 3D scene:

Fig 2. Input Flow to control a 3D scene


From this representation, the input system can be seen as a flow: at each point, the input is transformed before being routed to the application. The 3D scene itself takes that input and maps it internally to control the camera view.
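The flow described here can be sketched as three small stages chained together. This is a simplified illustration, not X.org's code: the smoothing stage is a plain exponentially weighted moving average (not the response-augmented variant), the acceleration rule is a made-up threshold and gain, and all names and constants are hypothetical.

```python
def ewma(samples, alpha=0.5):
    """Smooth noisy samples with a plain exponentially weighted moving average."""
    smoothed = []
    previous = None
    for sample in samples:
        if previous is None:
            previous = sample  # the first sample passes through unchanged
        else:
            previous = alpha * sample + (1 - alpha) * previous
        smoothed.append(previous)
    return smoothed


def accelerate(deltas, threshold=2.0, gain=2.0):
    """Amplify movements larger than the threshold; leave small ones alone."""
    return [d * gain if abs(d) > threshold else d for d in deltas]


def map_to_camera(deltas, degrees_per_unit=0.5):
    """Application-side mapping: accumulate pointer deltas into a camera yaw."""
    yaw = 0.0
    for d in deltas:
        yaw += d * degrees_per_unit
    return yaw


# Noisy horizontal deltas from a hypothetical touchpad.
raw = [0.0, 4.0, 4.0]
yaw = map_to_camera(accelerate(ewma(raw)))
print(yaw)
```

Each stage only sees the output of the previous one, which is exactly why information dropped early in the flow can never be recovered by the application.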
The problem in this case is: what if you want your input handling to follow the standard rather than be considered a hack? You can't avoid the virtual pointer, and you can't avoid this predefined set of filters. You also can't select how the input is mapped inside the application (mouse(x,y) → scene(x,y), or → scene(y,z), etc.).
Another problem is that some filters are inside the input device driver, others are inside the input layer, and the rest belong to the application. All of them are hard-coded and can't be changed or remapped without recompiling the full stack.

Any time we have a new device or new events to support in the system, we keep asking this question:
How will we fit the new device's input into this flow with minimum effort and the least loss of information?


Rethinking All the Stack
In 2009, I discovered all of this and found some efforts to simplify input management. Just being able to imagine a possible change takes a lot of bravery... For a system developer who only uses a keyboard, why should he think about rewriting everything? That's a lot of work, man!


From the previous figures, we can imagine that filters should be extracted from device drivers and from any predefined flow in the system, so that we can play with them to modify and transform raw input into any useful form.
The flow becomes a general graph, where a sensor input can control how much a filter transforms one device stream into another. For real-time and other needs, we may also add universal timing to control how long each filter takes for its computation.
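To make the idea concrete, here is a minimal sketch of filters living outside any driver, with sensor readings controlling how strongly a filter acts. It is a linear chain rather than a full graph, it does not model timing, and the class, the filter names and the sensor keys are all hypothetical; it is not the StreamInput API.

```python
class FilterChain:
    """A reconfigurable chain of filters; a real graph could branch and merge."""

    def __init__(self):
        self.filters = []

    def add(self, filter_fn):
        self.filters.append(filter_fn)
        return self  # allow chained calls

    def process(self, value, sensors):
        # Every filter receives the current sensor readings as well as the value.
        for filter_fn in self.filters:
            value = filter_fn(value, sensors)
        return value


def sensor_scaled_gain(value, sensors):
    """A sensor reading decides how much this filter amplifies the stream."""
    return value * sensors.get("gain", 1.0)


def clamp(value, sensors):
    """Keep the value inside a limit that can also come from a sensor."""
    limit = sensors.get("limit", 100.0)
    return max(-limit, min(limit, value))


chain = FilterChain().add(sensor_scaled_gain).add(clamp)
print(chain.process(3.0, {"gain": 2.0}))
print(chain.process(10.0, {"gain": 2.0, "limit": 15.0}))
```

Because the chain is plain data, filters can be added, removed or reordered at run time, without recompiling a driver or the input layer.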

Fig 3. Input Flow as represented by StreamInput Khronos Model.


What about applications ?
Legacy applications take only two sorts of input: keyboard and mouse. These two inputs are still very limited compared to what is possible. But we already have filters which route information through them, including keyboard mapping transformers (the possibility to write in Arabic/French/Chinese with the same keyboard), multitouch injectors for legacy apps like Ginn (included in Ubuntu), and so on.

But wait, shouldn't we rethink the application itself? Why don't applications expose their functionality through a software bus, where we have the freedom to connect the filtered input to the specific action to be performed?
Fig 4. Rethinking application input


By doing this, the biggest task becomes the mapping, and we will need to look for the best ways to do it so that users can reach their goals more effectively.
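A minimal sketch of such a bus, assuming applications register named actions and a separate mapping connects input sources to them. The class, the source names and the action names are hypothetical and only illustrate the separation between exposing functionality and mapping input to it.

```python
class SoftwareBus:
    """Applications expose actions; a mapping routes input sources to them."""

    def __init__(self):
        self.actions = {}  # action name -> callable provided by the application
        self.routes = {}   # source name -> list of action names

    def expose(self, action_name, callback):
        self.actions[action_name] = callback

    def connect(self, source_name, action_name):
        self.routes.setdefault(source_name, []).append(action_name)

    def emit(self, source_name, value):
        # Deliver the value to every action mapped to this source.
        for action_name in self.routes.get(source_name, []):
            self.actions[action_name](value)


# A hypothetical 3D application exposing two camera actions.
camera = {"yaw": 0.0, "zoom": 1.0}
bus = SoftwareBus()
bus.expose("camera.rotate", lambda v: camera.update(yaw=camera["yaw"] + v))
bus.expose("camera.zoom", lambda v: camera.update(zoom=camera["zoom"] * v))

# The mapping lives outside the application and can be changed freely.
bus.connect("touchpad.x", "camera.rotate")
bus.connect("pinch.scale", "camera.zoom")

bus.emit("touchpad.x", 12.5)
bus.emit("pinch.scale", 2.0)
print(camera)
```

Remapping the touchpad to zoom, or plugging in a completely new device, then only requires a different set of connect calls, not a change to the application.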



Scientific background ?
A lot of scientific research has studied these problems, but each work tried to fix either the input flow or the best mapping, not the whole cycle from and to the user.

For input configuration, you can see the work of Pierre Dragicevic: iCon.
Another interesting thesis is the one by Rami Ajaj (PhD in French).
See also the paper "Theoretical and architectural support for input device adaptation".

Many other notable works exist, but I can't cite them all.

And for a deep study of mapping, and to understand the need for filters, it is essential to study what an input device and a sensor are, do some morphological analysis of their design space, and study the current standards implemented in operating systems that describe a large spectrum of devices and usages.

(I have skipped many research areas related to this subject to keep this post clear and simple.)