Unleash the power of Kinect for Windows SDK!

This post is a translation of https://blogs.msdn.com/b/eternalcoding/archive/2011/06/14/fr-prenez-le-contr-244-le-avec-kinect-pour-windows-sdk.aspx


Introduction

Microsoft Research has released the first beta of Kinect for Windows SDK. You can find resources and download the SDK at:

https://research.microsoft.com/kinectsdk

This SDK also installs the drivers for the Kinect sensor. Be aware, however, that it will only install on Windows 7 (x86 and x64).
The first point worth noting is that the SDK actually comes in two flavors: one for C++ developers and one for managed developers.

So nobody is left out. As I prefer the managed environment, my samples will be presented in C# (a matter of taste only).

Regarding licensing, this version released by Microsoft Research is free for private use (basically, anything that is not commercial).

Architecture

Basically, the Kinect sensor sends a set of three streams:

The image stream can be displayed as with any other camera (for example to do augmented reality). The Kinect video sensor can return this stream at two resolutions: 640×480 (at 30 frames per second) or 1280×1024 (but at 15 frames per second).

The depth stream is the key one in our case: it adds to each pixel a depth measured by the sensor. So in addition to the 2D position and color of each pixel, we now have its depth, which will greatly simplify the writing of shape-detection algorithms.

A third stream is sent by the sensor: the audio stream coming from its four microphones (more on this subject at the end of the article).

Therefore, the key point here is the ability of Kinect to give us three-dimensional data. Using the NUI library (which comes with the SDK and stands for Natural User Interface), you will be able to detect the presence of humans in front of the sensor. Kinect can “see” up to 4 people and accurately track 2 of them.

When Kinect tracks a person precisely, it can provide a skeleton made up of key points detected on the user:

As shown in this diagram of the Vitruvian Man, there are 20 key points (which we call joints) that are detected and tracked by the NUI library.

For best results, it is necessary to stand between 4 and 11 feet from the sensor. Outside these limits, the sensor's accuracy decreases quickly. It is also not possible (yet) to track a user sitting in front of his computer.

Getting started

To use the Kinect for Windows SDK in your .NET application, you only have to reference Microsoft.Research.Kinect.dll.


Then you have two new namespaces, one for accessing the video streams and skeletons and one for audio:

    using Microsoft.Research.Kinect.Nui;
    using Microsoft.Research.Kinect.Audio;

To initialize the NUI library, you must instantiate an object from the Runtime class and configure the streams you want to receive:

    kinectRuntime = new Runtime();
    kinectRuntime.Initialize(RuntimeOptions.UseDepthAndPlayerIndex | RuntimeOptions.UseSkeletalTracking | RuntimeOptions.UseColor);

In our example, we initialize the library with support for the depth stream, the video stream and skeleton tracking.
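When the application shuts down, the sensor should also be released. In the beta SDK the runtime exposes an Uninitialize method for this purpose, typically called from the window's Closed handler:

    // Release the Kinect sensor when the application exits.
    kinectRuntime.Uninitialize();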

Video buffer

To use the video stream, you must first define the expected format. To do so, we ask the library to return the data at a given resolution and with a given pixel format:

    kinectRuntime.VideoStream.Open(ImageStreamType.Video, 2, ImageResolution.Resolution640x480, ImageType.Color);

Here, we ask for a resolution of 640×480 with an RGB pixel format. It is also possible to request a resolution of 1280×1024 (with lower performance), and pixels can also be delivered in YUV format. As mentioned earlier, the resolution has an impact on performance; the pixel format does not, so it can be chosen according to whatever suits your application best.
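For example, requesting the same resolution but with YUV data would look like this (assuming the ImageType enumeration exposes a ColorYuv value, as in the beta bits):

    kinectRuntime.VideoStream.Open(ImageStreamType.Video, 2, ImageResolution.Resolution640x480, ImageType.ColorYuv);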

Subsequently, to be informed of the availability of each image, you must subscribe to an event of the library:

    kinectRuntime.VideoFrameReady += kinectRuntime_VideoFrameReady;

In the handler of this event, we can simply produce a BitmapSource to display it in a WPF application:

    public class ColorStreamManager
    {
        public BitmapSource ColorBitmap { get; private set; }

        public void Update(ImageFrameReadyEventArgs e)
        {
            PlanarImage Image = e.ImageFrame.Image;

            ColorBitmap = BitmapSource.Create(Image.Width, Image.Height, 96, 96, PixelFormats.Bgr32, null, Image.Bits, Image.Width * Image.BytesPerPixel);
        }
    }

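For completeness, here is one way the handler could be wired to the display. This is only a sketch: the colorManager field and the kinectDisplay Image control are hypothetical names, not part of the SDK:

    private readonly ColorStreamManager colorManager = new ColorStreamManager();

    void kinectRuntime_VideoFrameReady(object sender, ImageFrameReadyEventArgs e)
    {
        // Build the BitmapSource from the incoming frame...
        colorManager.Update(e);

        // ...and show it in a WPF Image control (named kinectDisplay in the XAML).
        kinectDisplay.Source = colorManager.ColorBitmap;
    }
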
Depth buffer

Besides the video stream, Kinect can send a stream coming from the infrared sensor that gives depth data.

The initialization is similar to that of the video stream:

    kinectRuntime.DepthStream.Open(ImageStreamType.Depth, 2, ImageResolution.Resolution320x240, ImageType.DepthAndPlayerIndex);
    kinectRuntime.DepthFrameReady += kinectRuntime_DepthFrameReady;

The depth data are stored as an array of 16-bit values (two bytes per pixel). The depth stream can be retrieved at 320×240 or 80×60.

  • The 13 high-order bits of each pixel represent the distance from the depth sensor to the closest object, in millimeters.
  • The 3 low-order bits of each pixel represent the index of the tracked user who is visible at the pixel’s x and y coordinates.
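
For example, take a pixel whose low byte is 0x81 and whose high byte is 0x3E: the player index is 0x81 & 0x07 = 1, and the depth is (0x3E << 5) | (0x81 >> 3) = 1984 + 16 = 2000 mm, that is, two meters. This is exactly the extraction performed in the code below.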

So if you want to view the depth stream while coloring the areas occupied by humans, it is possible to use this code:

    void ConvertDepthFrame(ImageFrameReadyEventArgs e)
    {
        depthFrame32 = new byte[e.ImageFrame.Image.Width * e.ImageFrame.Image.Height * 4];

        byte[] depthFrame16 = e.ImageFrame.Image.Bits;

        for (int i16 = 0, i32 = 0; i16 < depthFrame16.Length && i32 < depthFrame32.Length; i16 += 2, i32 += 4)
        {
            // Index of the tracked user (3 low-order bits)
            int user = depthFrame16[i16] & 0x07;

            // Depth (in mm, 13 high-order bits)
            int realDepth = (depthFrame16[i16 + 1] << 5) | (depthFrame16[i16] >> 3);

            // Depth -> intensity
            byte intensity = (byte)(255 - (255 * realDepth / 0x0fff));

            // Start from an opaque black pixel (BGRA layout)
            depthFrame32[i32] = 0;
            depthFrame32[i32 + 1] = 0;
            depthFrame32[i32 + 2] = 0;
            depthFrame32[i32 + 3] = 255;

            // Color the pixel according to the player index
            switch (user)
            {
                case 0: // no one
                    depthFrame32[i32] = (byte)(intensity / 2);
                    depthFrame32[i32 + 1] = (byte)(intensity / 2);
                    depthFrame32[i32 + 2] = (byte)(intensity / 2);
                    break;
                case 1:
                    depthFrame32[i32] = intensity;
                    break;
                case 2:
                    depthFrame32[i32 + 1] = intensity;
                    break;
                case 3:
                    depthFrame32[i32 + 2] = intensity;
                    break;
                case 4:
                    depthFrame32[i32] = intensity;
                    depthFrame32[i32 + 1] = intensity;
                    break;
                case 5:
                    depthFrame32[i32] = intensity;
                    depthFrame32[i32 + 2] = intensity;
                    break;
                case 6:
                    depthFrame32[i32 + 1] = intensity;
                    depthFrame32[i32 + 2] = intensity;
                    break;
                case 7:
                    depthFrame32[i32] = intensity;
                    depthFrame32[i32 + 1] = intensity;
                    depthFrame32[i32 + 2] = intensity;
                    break;
            }
        }
    }

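The resulting 32-bit buffer can then be wrapped in a BitmapSource inside the DepthFrameReady handler, just as we did for the color stream. This is only a sketch; the DepthBitmap property is a hypothetical name:

    void kinectRuntime_DepthFrameReady(object sender, ImageFrameReadyEventArgs e)
    {
        ConvertDepthFrame(e);

        // 4 bytes per pixel (BGRA), so the stride is width * 4.
        DepthBitmap = BitmapSource.Create(e.ImageFrame.Image.Width, e.ImageFrame.Image.Height,
                                          96, 96, PixelFormats.Bgra32, null,
                                          depthFrame32, e.ImageFrame.Image.Width * 4);
    }
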
This stream can be extremely useful for detecting shapes. For example, it is possible to detect and track hand or finger movements to produce new ways of interacting with the PC.

Skeleton tracking

One of the big strengths of the Kinect for Windows SDK is its ability to discover the skeleton of joints of a human standing in front of the sensor. And unlike the hacks that have sprung up on the Internet (like OpenNI), the Kinect for Windows SDK incorporates a very fast recognition system that requires no calibration or training from the user. This is the result of a long machine-learning training process: Microsoft Research fed the recognition system many examples to teach it.

So once you step in front of the sensor (at the right distance, of course), the NUI library will discover your skeleton and raise an event with useful data about it.

To enable the skeleton tracking system, you must have activated the depth stream and subscribe to the appropriate event:

    kinectRuntime.SkeletonFrameReady += kinectRuntime_SkeletonFrameReady;

In the handler for this event, we can loop through all the skeletons found by the system:

    void kinectRuntime_SkeletonFrameReady(object sender, SkeletonFrameReadyEventArgs e)
    {
        SkeletonFrame skeletonFrame = e.SkeletonFrame;

        foreach (SkeletonData data in skeletonFrame.Skeletons)
        {
            if (data.TrackingState == SkeletonTrackingState.Tracked)
            {
                foreach (Joint joint in data.Joints)
                {
                    // leftHandGestureRecognizer, rightHandGestureRecognizer and ToVector3()
                    // are application-side helpers, not part of the SDK.
                    switch (joint.ID)
                    {
                        case JointID.HandLeft:
                            if (joint.Position.W > 0.6f) // Quality check
                                leftHandGestureRecognizer.Add(joint.Position.ToVector3());
                            break;
                        case JointID.HandRight:
                            if (joint.Position.W > 0.6f) // Quality check
                                rightHandGestureRecognizer.Add(joint.Position.ToVector3());
                            break;
                    }
                }
                return;
            }
        }
    }

Several points are noteworthy here:

  • The NUI library cannot track more than 2 skeletons. It is the TrackingState == SkeletonTrackingState.Tracked property that defines whether a skeleton is ‘tracked’ or not. Untracked skeletons only provide their position.
  • Each joint has a Position property defined by a Vector4: (x, y, z, w). The first three attributes define the position in camera space. The last attribute (w) gives the quality level (between 0 and 1) of the position. This allows you to filter and keep only the data about which the library is sufficiently confident.
  • Each skeleton has a TrackingID property which remains the same from frame to frame. This allows us to uniquely identify each skeleton between calls.
  • Each joint is identified by an enum value (JointID) which defines which body part it corresponds to (hands, head, etc.).

It is also possible to poll for the current skeletons with the SkeletonEngine.GetNextFrame() method, as sketched below.
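A minimal polling sketch (the exact signature of GetNextFrame is an assumption here; in the beta it takes a wait timeout in milliseconds):

    // Poll for the next skeleton frame instead of relying on the SkeletonFrameReady event.
    SkeletonFrame skeletonFrame = kinectRuntime.SkeletonEngine.GetNextFrame(100);

    foreach (SkeletonData data in skeletonFrame.Skeletons)
    {
        if (data.TrackingState == SkeletonTrackingState.Tracked)
        {
            // Process the tracked skeleton exactly as in the event handler above.
        }
    }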

Finally, the NUI library provides an algorithm for filtering and smoothing the incoming data from the sensor. By default, the skeleton data are sent without smoothing or filtering, and the Kinect depth sensor does not have sufficient resolution to ensure consistent accuracy over time, so the data seem to jitter around their positions. To correct this problem, you can use the following code:

    kinectRuntime.SkeletonEngine.TransformSmooth = true;
    var parameters = new TransformSmoothParameters
    {
        Smoothing = 1.0f,
        Correction = 0.1f,
        Prediction = 0.1f,
        JitterRadius = 0.05f,
        MaxDeviationRadius = 0.05f
    };
    kinectRuntime.SkeletonEngine.SmoothParameters = parameters;

As we can see, it is possible to smooth and correct the data. Depending on your needs, you should tune these parameters to provide the best possible experience.

It is now up to your imagination to come up with future applications. For example, you could look for gestures to control applications (the famous Jedi-style PowerPoint control) or even have fun with augmented reality…

Kinect and sound

Kinect comes with a group of four microphones (a microphone array) that captures sound at very high quality. A signal processor (DSP) built into the sensor removes background noise and cancels echo.

Moreover, thanks to its microphone array, Kinect can provide the direction of the recorded sound source (beamforming). It then becomes possible, for example, to know who is speaking in a meeting.

The Kinect for Windows SDK can also act as a source for the Microsoft.Speech API, which makes it possible to do voice recognition with Kinect.

So to use all of these services, simply instantiate an object of class KinectAudioSource:

    var source = new KinectAudioSource { SystemMode = SystemMode.OptibeamArrayOnly };

When instantiating it, you can select microphone-array mode or single-microphone mode, with or without acoustic echo cancellation (AEC).

To start capturing, we just ask our KinectAudioSource to start providing us with the audio stream:

    byte[] buffer = new byte[16000];
    var audioStream = source.Start();
    audioStream.Read(buffer, 0, buffer.Length);

The audio is in 16-kHz, 16-bit mono pulse code modulation (PCM).
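Since this format is exactly what the Microsoft.Speech API expects, the stream returned by Start() can be fed straight into a SpeechRecognitionEngine. The sketch below is only an illustration: the grammar words are arbitrary, and in a real application you would select the Kinect acoustic-model recognizer rather than the default one:

    using Microsoft.Speech.AudioFormat;
    using Microsoft.Speech.Recognition;

    // ...

    var recognizer = new SpeechRecognitionEngine();
    recognizer.LoadGrammar(new Grammar(new GrammarBuilder(new Choices("red", "green", "blue"))));
    recognizer.SpeechRecognized += (s, args) => Console.WriteLine(args.Result.Text);

    // 16 kHz, 16-bit, mono PCM: 32000 bytes per second, block align of 2.
    recognizer.SetInputToAudioStream(audioStream,
        new SpeechAudioFormatInfo(EncodingFormat.Pcm, 16000, 16, 1, 32000, 2, null));
    recognizer.RecognizeAsync(RecognizeMode.Multiple);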

To use the beamforming services, use this code:

    source.MicArrayMode = MicArrayMode.MicArrayAdaptiveBeam;

There are several ways to select the sound beam. Here, we let the system itself select the appropriate beam. It would also be possible to make the selection manually, to use only the central beam, or to use the average of all beams.

Then, the system can raise an event when the beam moves:

    source.BeamChanged += source_BeamChanged;

    static void source_BeamChanged(object sender, BeamChangedEventArgs e)
    {
        Console.WriteLine("Angle : {0} radians", e.Angle);
    }

The returned angle is in radians and is relative to the center of your Kinect. If you stand facing the sensor, an angle of 0 indicates that the sound comes from directly in front of it, a negative angle indicates that the sound comes from the left, and a positive angle indicates that it comes from the right.

At any time, it is also possible to request the current value of the angle by reading source.SoundSourcePosition:

    if (source.SoundSourcePositionConfidence > 0.9)
        Console.Write("Position (radians): {0}", source.SoundSourcePosition);

As with the positions of the skeleton joints, the data come with a confidence level that lets us decide whether we want to use them.

Regarding the filters (echo cancellation and noise reduction), the KinectAudioSource class exposes parameters you can adjust to achieve the expected results.

Conclusion

As we have seen, the Kinect for Windows SDK provides many tools to play with. We are about to discover a lot of new kinds of interaction between human and machine. Of course it still lacks some high-level tools (like a gesture library), but it is a safe bet that the next few weeks will bring many projects from the community and from Microsoft (for example the Coding4Fun Kinect Toolkit on CodePlex).

To go further


(Screenshot: Kinect Sabre)