Sunday, August 18, 2013

Applications as canvas, rethinking how we design applications on Linux

In this article, I will try to imagine a new way to build applications on Linux, starting from questions like: How can applications be made easier to tune, to refactor, and to adapt to new platforms or input devices? Do Linux applications have problems that should be spotted? Would using a Software Bus improve, or radically change, the way we currently develop?

The philosophy of the UNIX system

Ken Thompson made many design decisions while creating UNIX. He wanted a powerful system composed of small pieces of software that "do one thing and do it well" and that can be connected together using pipes in order to accomplish more complex tasks.
Building short, simple, clear, modular, and extensible code has made it easy to maintain the system and understand how it works as a whole.
Linux inherited this powerful philosophy, and this is why it is used everywhere on servers. When something bad happens, you know the cause and can fix the broken component. You can also write scripts that simplify things.
However, GUI applications on Linux don't apply the UNIX philosophy: each application is an island unto itself.

Two Unix Commands, connected using a pipe
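As a refresher, this is all a shell pipe does: the output stream of one small program becomes the input stream of the next. A minimal Python sketch of the same composition, using `ls` and `wc`, so it assumes a Unix-like system:

```python
import subprocess

# Emulate the shell pipeline `ls /usr | wc -l`: the stdout of `ls`
# is connected directly to the stdin of `wc`, exactly as the shell does.
ls = subprocess.Popen(["ls", "/usr"], stdout=subprocess.PIPE)
wc = subprocess.Popen(["wc", "-l"], stdin=ls.stdout, stdout=subprocess.PIPE)
ls.stdout.close()          # let `ls` receive SIGPIPE if `wc` exits first
count = int(wc.communicate()[0])
ls.wait()
print(count)               # number of entries in /usr
```

Each program stays small and ignorant of the other; the composition lives entirely in the plumbing.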

What if we had the possibility to extend this concept to existing GUI applications? We will present some of the scenarios we have come up with, and we will limit this preliminary study to creative tools for designers.

Scenario 1: Scripting GUI Applications

Graphical applications can only be accessed through the interface proposed by their original developers. The devices used are mostly the keyboard and the mouse, and if an advanced user wants to perform a repetitive or specific advanced task, he needs to use macro recording tools.
Macro recording tools are not available inside every application. Some external applications can help, like "snippets" tools, but we still need the GUI to be available.

Why do we still need external tools that depend on the GUI to automate things? Wouldn't it be better to have the possibility to automate applications without relying on the proposed interface?

Many advanced tools on Linux, like Inkscape and GIMP, have tried to provide this: a non-GUI interface that can be accessed via scripts. But sometimes this interface lacks many of the functions, and we can't see what we are doing live until we open the output files.

import dbus

# Connect to the session bus and grab Inkscape's remote object.
# (The object path below is an assumption; it depends on the version of
# the Inkscape DBus extension in use.)
bus = dbus.SessionBus()
ro = bus.get_object("org.inkscape", "/org/inkscape/application")
print(ro.ellipse(0, 0, 100, 100))
print(ro.ellipse(100, 100, 200, 200))
print(ro.ellipse(50, 50, 150, 150))
print(ro.select_all())

Using a simple DBus API like the one above, we get all the benefits at once: we can access the application from outside and script it, we can see the result live on the application canvas, and we can either modify script options or directly manipulate objects for operations that are easier to perform without a script. DBus has many language bindings, so we can use any language we want.
Going in the direction of a DBus solution is like opening a Pandora's box.

Scenario 2: Interactive programming

In the previous scenario, we spoke about scripting applications, but what if we combine some advanced scripts, add a GUI, and make them configurable?
We then have the possibility to create "plugins" which live in another process, are possibly written in another language, and can do a lot more.
We can write a small script, see the result on the canvas, and tweak parameter values, eliminating the developer's blind coding.
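As an illustration, here is a tiny, self-contained sketch of that feedback loop. The `Canvas` class below is a local stand-in for a real remote application object reached over DBus; the method and object names are invented for the example:

```python
# A local stand-in for a remote application object; in the real scenario
# these calls would travel over DBus to the application's canvas.
class Canvas:
    def __init__(self):
        self.objects = {}

    def set_property(self, obj_id, prop, value):
        # The application would redraw immediately, giving live feedback.
        self.objects.setdefault(obj_id, {})[prop] = value

canvas = Canvas()
# Sweep a parameter and push every value to the canvas: instead of blind
# coding, we would watch the circle grow on screen as the loop runs.
for radius in range(10, 60, 10):
    canvas.set_property("circle1", "r", radius)
print(canvas.objects["circle1"]["r"])  # last value pushed: 50
```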

Interactive coding applied to Inkscape objects. We can see the result directly when modifying object properties.

More advanced tools can emerge if we add knowledge of the operating system architecture, like the "magic lenses" concept. In this example we know many things about the application, including its window, its objects, and its functions, so we can provide a semi-transparent window that modifies Inkscape's internal objects.

If we push this concept further, we can make a simple drawing application act like a graph plotting app. We can even use it as an animation tool by modifying object properties like position, color, size, etc.

We can see the animation directly, and we can export a new image each time. The set of images can then be combined to create a video or any animation.

This animation was created in Inkscape by modifying the stars' properties, exporting each frame, and finally creating a full animation from the set of images.

Scenario 3: Factorisation of application programming

More than creating these advanced plugins for applications, we can push the concept to the extreme. What if we eliminate the application's default GUI and gain the ability to show any other interface we want, connected to the core app only through DBus calls?
This may make some of you think about the Ubuntu HUD.
We can then drive an application through another interface that provides more functionality, the way the Ubuntu HUD takes an application's menu out of it and communicates through DBus.

Solving tools inconsistency

Why on earth would we need such a thing? For a first argument, ask the closest designer you know about the tools he uses. Most of the time he will cite a lot of tools; most of them share a common base of functionality (which is drawing), but each one adds other functions, like sketching templates, exporting a mockup scenario, vector drawing, animation making, ...

CAD tools, mockup tools, and animation tools all propose the basic functionality of drawing, in addition to more advanced features depending on the final use.

What if we could share the functionality in a core app, but make the interface pluggable depending on the use? If we want to draw, we show a simple drawing GUI on top of the canvas. If we want sketching, we show another one with sketching templates. If we want to animate, we show an animation timeline.
The benefit here is that a huge amount of code is shared, and developers will need to write less of it.

Solving one device inconsistency

Designers use a lot of tools to do their job, but most of them use a creative suite built by a single company. This means that the interaction logic behind the tools is almost the same. The icons are also the same. If someone uses one tool, he won't be lost using another.
In the free/open source world, tools are created by different communities, and so are the icons, the interaction design, the logic, etc. If you learn one tool, you really need to invest another big amount of time to learn another. So the problem here lies in the interface and the interaction.
What can we do to solve this problem? A company which builds the platform and wants to invest in this can hire a designer to create GUIs for a set of applications, and they will share the same logic. Linking the interfaces with the applications should be very easy if the applications export all their methods on DBus: it becomes a functionality-matching job.

I want to cite libdbusmenu by Canonical here: they have managed to grab any application's menu and show it wherever they want, using matching rules like searching for the menu name. In recent releases they added "a scope" to provide fuzzy matching. As they have done this only for menus, the same can be done for more internal functionality.
Getting rid of the main GUI and providing new ones built by the platform company is the extreme case of this scenario.

Adding more input options

Some time ago, we wanted to add multitouch functionality to Ubuntu Maverick. At the time, a lot of things still needed to be developed: some devices were supported and emitted events from the kernel, but applications simply ignored all of them.
We had really a lot of discussions on the best way to route events through the layers of X and then through libraries until they reach applications. We had to think about raw multitouch events as well as gesture events. A quick solution to show the world what we were doing was to develop Ginn, a gesture injection tool that works without requiring support from libraries, and without modifying the target application itself.

The solution was quite simple:
1. Get the gesture event,
2. Get the active application,
3. Read the wishes of the user (from a configuration file),
4. Convert these advanced events into something the application already understands: keyboard taps and mouse clicks.
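Sketched in Python, the routing logic amounts to a lookup table from (application, gesture) pairs to ordinary shortcuts. The table entries below mimic a Ginn wishes configuration file but are purely illustrative:

```python
# The user's "wishes": which shortcut to synthesize for a gesture on a
# given application. Entries mimic a Ginn configuration file and are
# purely illustrative.
wishes = {
    ("inkscape", "pinch"):  "ctrl+shift+plus",    # e.g. zoom in
    ("inkscape", "rotate"): "ctrl+bracketright",  # e.g. rotate object
}

def route(app, gesture):
    """Translate (app, gesture) into the configured shortcut, if any.

    A real injector would then emit the keystrokes to the application;
    here we only return the mapping.
    """
    return wishes.get((app, gesture))

print(route("inkscape", "pinch"))   # ctrl+shift+plus
print(route("gimp", "pinch"))       # None: no wish configured
```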

That simple solution allowed us to show the world the beauty of what we were doing, but we were limited to mouse and keyboard shortcuts. Besides the performance issues, just imagine the power of matching such events directly to internal application functionality through DBus.

Using hand angles detected by the Leapmotion device to move objects inside Inkscape.

Solving platform inconsistency

Many users have more than one computing device: a computer, a tablet, a phone, a TV. The software on each platform is different, but the tasks we do on them share a common basis.
Let's get back to the example of creative and drawing applications: we may use one drawing application on a desktop computer and another on a tablet. The usual computer interface would be unusable on the tablet, as the input modalities and goals are different. What stays the same is the application logic: drawing, image filters, image operations (crop, resize, ...).

Do we need to create a new application for each platform? Or just one and show a new GUI and interaction model for each one? What if we want to start something on a device and complete/view it on another?

What remains is just the canvas, which serves as the feedback surface for operations, along with the included algorithms. All of the functionality is exposed through the DBus software bus.
The GUI becomes an additional layer displayed on top of the application; it changes depending on the platform in use.
Application developers will not be asked to create platform-specific interfaces; instead the platform development team, especially if they are targeting many platforms, should think about this factorisation of development. No new application should be coded from scratch, just interfaces matched to the application's internal functions published on the bus.

Scenario 4: Application composer

What if we had a platform where a lot of applications export their functions on the bus? In some cases, a user wants to accomplish a task that uses functions dispersed across many applications. An application composer is a meta-application that can accomplish such complex tasks. It connects the output of one application to the input of another, in a similar way to CLI scripts, but here the "running mode" can be visible.

Concept image showing an application composer using a set of GUI applications in order to accomplish a bigger task.

Say I want to get data from a Calligra Sheets table, plot the data in a drawing application, export it to an image, plot a new set of data, export another image, ... and combine the images into an animation. This pipeline can be abstracted and launched by an application composer which connects the applications themselves to accomplish the bigger task.
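A sketch of that composer idea: each stage below is a plain Python function standing in for a DBus call into a real application (the data and stage names are invented), and the composer just chains them like a shell pipe:

```python
# Each stage stands in for a DBus call into a separate application.
def read_sheet():              # e.g. fetch a column from Calligra Sheets
    return [1, 4, 9]

def plot(data):                # e.g. ask a drawing app to plot each value
    return ["bar:%d" % value for value in data]

def export_frames(shapes):     # e.g. export one image per plotted shape
    return len(shapes)

def compose(*stages):
    """Chain stages so each one feeds the next, shell-pipe style."""
    def pipeline():
        value = stages[0]()
        for stage in stages[1:]:
            value = stage(value)
        return value
    return pipeline

animation = compose(read_sheet, plot, export_frames)
print(animation())  # 3 frames exported
```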

Scenario 5: Interactive Documentation

A new way of building applications needs a new way of building documentation. Today there are two popular ways to create tutorials:
1. Record a video of a user using the software,
2. Take screenshots of the steps and write an accompanying article.

In both cases the user needs to switch back and forth between the software and the tutorial. This can be a problem for novice users, as they can lose track of the step they are on, or get fed up with pausing and playing the video each time.

What if we can create a tutorial which is aware of the current step of the user?

Application aware Documentation

With DBus, we don't only get exported methods; we can also connect to signals and get more information back from the application. We can create a tutorial that shows a small action for the user to perform, waits for the signal confirming it is done, and then moves on to the next action.
This frees the user from switching between the application and the tutorial, and avoids losing him in a flood of information.
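A minimal sketch of such a step-aware tutorial. The signal names are invented; a real implementation would subscribe to them with dbus-python's `add_signal_receiver` and a GLib main loop:

```python
# The tutorial advances only when the application signals that the
# expected action was performed; out-of-order signals are ignored.
class Tutorial:
    def __init__(self, steps):
        self.steps = steps
        self.index = 0

    def current(self):
        return self.steps[self.index]

    def on_signal(self, name):
        # In a real version, this would be the DBus signal handler.
        if name == self.current():
            self.index += 1     # show the next instruction

t = Tutorial(["draw_ellipse", "fill_color", "export_png"])
t.on_signal("draw_ellipse")     # user performed step 1: advance
t.on_signal("export_png")       # not the expected step: ignored
print(t.current())              # fill_color
```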

Generating usual documentation

We can still have the old style of documentation, but it can be generated automatically by taking screenshots of the steps or by video recording. If the GUI or the icons change, we regenerate the tutorial.

Concept of recording the steps of a tutorial, using a specific GUI, and generating a video or textual information.

(This article is currently a draft that has stayed in this state for a long time.
Please feel free to help improve the concept and idea with some brainstorming and/or critique.
I may add more information that I have in an independent paper, along with some testing code, in the coming days.)

Things that still need to be discussed:
Standardization, Drawbacks, "microcloud" (per House), ...

Thursday, July 5, 2012

Unveiling The Technology Behind Leapmotion

The information in this article is based on guesses using sources from the Internet, including scientific articles and demonstration videos. Mirrors seem not to be used; other, simpler mechanisms are, as cited below near Fig.5.


Last month I was surprised, like everybody else, while watching the Leapmotion video. Many famous people working in the NUI area suspected it was a fake.
The technology used in the Leapmotion is claimed to be 100x more accurate than the Kinect while using only 2% of the CPU: a complete break from today's technologies.
Since the launch, I have been searching in my free time for the technology possibly being used inside the Leapmotion, and it appears that I have found how they managed to build such a great device.

Search and Elimination

I have searched for all possible technologies which can recognize gestures:

  • Ultrasound: can be used, but at this level the precision can't get below 1 cm even in the best cases.
  • Electric field sensors: not precise either, and can't detect non-conductive objects.
  • Structured light: Kinect-like? But the Kinect is less accurate, and not as responsive as the Leap.
You can find more guesses here and here.
If it's none of these, what can the technology be?

Tracking the tiny details

Leapmotion have released many videos demonstrating their gadget; after watching them carefully we can guess the limits of the system, and from those, its composition.

Let's start with this photo:
Fig.1: David Holz demonstration

As you can see in Fig.1, David's hand is a set of dynamic points, but it also doesn't appear to be an exact human hand: the fingers and the palm are fitted inside some sort of eggs. This can only mean one thing: what we see is just a model of the hand, not the raw input.

Having a model also means that the input does not provide complete information. And this answers one of my first questions:
How have they managed to get the whole surface of 3D objects, even the parts not facing the sensor?
Engadget was the first website to publish another useful detail: the nature of the sensors used.
Fig.2: The sensors used in the Leapmotion are just bare VGA cameras!

Having cameras as sensors confirms the use of a model to show the hand, because cameras can only see a surface.

A precision of 0.01 mm? This is another important detail. Achieving that precision from the data provided by bare cameras means that there is more hidden data to be resolved than the usual RGB color space carries. 0.01 mm = 10 µm, which is not very far from the infrared wavelength.
Here we can no longer speak about pixels, only frequencies: the space where the information is resolved is the frequency domain, after a Fourier transform.
(And this is how I see it: if you can't solve an equation in the "usual" space, just find another space where it becomes easy, even an imaginary space like the complex plane. Reminds you of sci-fi movies and parallel universes? If you have a problem that is difficult to solve in your life, move it to another one, and once you have the solutions, bring them back after a small transformation. ;-))

The Missing Link !

During my M.Sc. in Human-Computer Interaction at Telecom Bretagne here in France, I learned that it takes about 30 years for a new invention to go from basic research to mainstream products, what Bill Buxton calls the long nose of innovation. So if it's here now, the research should have been around for a long time.
Searching through the cloud of innovations I've seen in the past, I remembered a company which made a lot of buzz last year, during this same period: Lytro.
Lytro made a new product which lets you take pictures and then refocus on any object a posteriori.
Here comes the missing link: if we are able to refocus objects, we are able to detect 3D depth from unfocused objects. The blur of an unfocused object is a continuous function, limited only by the light's wavelength and the sensors' accuracy, resolved in the frequency space already mentioned.

Once I had the right keywords, I could search the research literature. And it appears that a lot of papers have explained the concept in detail.

Depth of scene from depth of field

The first paper I found was published in 1982, 30 years ago, confirming again Buxton's long nose of innovation.

To picture the concept, take your smartphone and open the camera app; now touch the screen and you'll be able to focus on any object in the scene. The phone has a function that moves the lenses to make the chosen object look sharp; the function simply searches for the lens position with minimum blur in that region.
Now, imagine we have the inverse of this: the lenses are fixed at a predefined focus distance and depth of field. Images look sharp only when they are within that DoF; if they are farther or nearer, they look blurry. Using the inverse of the function found in any camera, you can predict depth from the blur.

"Along each geometric ray between the image plane and the lens, the image moves from being in relatively poor focus, to a point of best focus, and then back to being out of focus. Thus if we could trace along the path of each incoming ray to find the point of exact focus then we could recover the shape of the 3D world."

That was the concept, but in the real world you'll need to go deeper into the mathematics and technical details.

Fig.3: Images with different DoF from a single shot [Levin2007]

The simple key formula for the distance to an image point, used by Alex Paul Pentland, is:
\( D=\frac{Fv_0}{v_0-F-\sigma f} \)
\(v_0\): distance between the lens and the image plane
\(f\): f-number of the lens system
\(F\): focal length of the lens system
\(\sigma\): the spatial constant of the point spread function (radius of blur circle) which is the only unknown variable.
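To make the formula concrete, here is a small numeric example (all values are made up for illustration): a 50 mm lens focused so that the image plane sits 52 mm behind it, at f/2.8, observing a blur circle of 10 µm:

```python
# Pentland's formula D = F*v0 / (v0 - F - sigma*f) with made-up numbers.
F = 0.050         # focal length: 50 mm
v0 = 0.052        # lens-to-image-plane distance: 52 mm
f = 2.8           # f-number of the lens system
sigma = 0.00001   # radius of the blur circle: 10 µm

D = F * v0 / (v0 - F - sigma * f)
print(round(D, 2))  # estimated distance to the point, in metres
```

With these illustrative numbers the point comes out at roughly 1.3 m, and a change of a few microns in the blur radius shifts the estimate noticeably, which gives a feeling for why near-wavelength blur estimation matters.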

Because differing aperture size causes differing focal errors, the same point will be focused differently in the two images. The critical fact is that the magnitude of this difference is a simple function of only one variable: the distance between the viewer and the imaged point. To obtain an estimate of depth, therefore, we need only compare corresponding points in the two images and measure this change in focus.
\( k_1\sigma_2^2 + k_2 \ln \sigma_2 + k_3 = \ln F_1(\lambda) - \ln F_2(\lambda) \)

The difference in localized Fourier power is a monotonic increasing function of the blur in the second image. Or by the first equation, the distance to the imaged point is a monotonic decreasing function of the difference in the localized Fourier Power.

And as this post is not a formal scientific article, I won't include a lot of math, but rather references to it. You only need to retain that with some bricolage you can get depth from focus and defocus. To learn in detail how this can be done, you can start by reading [Pentland87, Pentland89] and [Xiong93], then follow the newer work building on these papers.

How is this used in the leapmotion ?

Fig.4: The optical system of multi-focus scene capture as explained by Pentland.

The Leapmotion uses ~3 cameras. Each camera should see the same picture frame, removing the need for calibration, as in Fig.4, so a basic system of mirrors and lenses is needed. As you can see, the scene image enters the half-silvered mirror system and is divided into 3 paths; each one has a lens with a different focal point. The resulting pictures transmitted by the cameras are similar to the ones shown above in Fig.3, but captured simultaneously and in real time.
In contrast to a stereo vision mechanism, this optical system removes any need for the massive computations required to calibrate the images, construct a disparity map, and match objects.

Fig.5: The possible system used in leapmotion
Update: The Leapmotion may not even use a mirror system to generate the disparity map. Besides depth from defocus, there are other mechanisms which can provide ultra-precise depth variation detection.

Acceleration of the computation

After all this, we know that it uses several cameras to capture the surface, and that the resolution space is the frequency domain. But how did they manage to get CPU usage down to 2%, according to ExtremeTech?
This can be achieved quite easily if we precompute all the values of the main function and store them in a cache. Then, instead of using the CPU, we just read the values directly.
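The caching idea can be sketched in a few lines: evaluate the expensive function once over a discretised input range, then answer queries by table lookup. The `depth` function below is only a stand-in for the real depth-from-blur math:

```python
import math

def depth(sigma):
    # Stand-in for the expensive depth-from-blur computation.
    return math.exp(-sigma) * 100

# Precompute the function once over a discretised input range...
TABLE = [depth(step / 100.0) for step in range(1000)]

def fast_depth(sigma):
    # ...then each query is a single table lookup instead of real math.
    return TABLE[int(sigma * 100)]

print(fast_depth(0.5) == depth(0.5))  # True
```

The trade-off is memory for CPU, and a quantisation error bounded by the table's step size.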

The post-Leapmotion era

The introduction of devices with such precision and accuracy, built at the same time on simple mathematical models, creates a break at two levels:
  1. The way input should be handled in today's computers and operating systems,
  2. The events and how they are routed to apps, widgets, daemons... (a post-events-abstraction era?)
The first point is mainly a reorganisation of the input subsystem in a more dynamic way. We should not forget that the Leapmotion is just 3 cameras + some magic mathematical formulas: I see the math as a filter over the bare video input of 3 input devices, one that yields more information than meets the eye. Any combination of new "filters" mixing input devices can produce more wonderful "sources".

The second point indicates that we are now standing at the edge of the old model of standard, prefixed input events: the model where widgets are by default subscribed to keyboard/mouse events "or similar", taking them on focus and then propagating them to parent widgets if they don't consume them.

Future applications, or "toolboxes", need a new model that allows subscription to new "equivalent sources", with the ability for an application either to subscribe itself or to cede control of access to its internal functions to third-party managers, eliminating even the need to provide a CLI/GUI/...
This is discussed in part in the last post and needs working prototypes.

Linux is the only system where it will be possible to try a new concept of input handling, similar to how StreamInput is currently being discussed: an input mechanism where sources are chosen in cascade, optimised, then compiled into input pipes and provided for upper-level use. We may only lack enough bravery and will to change the way everything currently works, but with some help from mainstream kernel developers, the Khronos Group, and the people working on input and driver factorisation at LII-ENAC, this can one day see the light.