I was recently working on a robotics project that requires using human gestures to navigate a robot remotely. Gesture recognition is one way to help computers understand human body language and interact with us naturally, building a richer bridge between machines and humans than primitive text user interfaces or even GUIs (graphical user interfaces), which still limit the majority of input to the keyboard and mouse.
Dataset
Most gesture-detection studies rely on static image classification. I wanted to recognize more complex gestures composed of a series of motions, such as clockwise or anti-clockwise movement. For complex gestures that take about 80 frames to execute, I could not find any publicly available data. Moreover, since my intent was to use a depth camera, and there is no available dataset based on the Intel RealSense camera.
Considering the amount of model training time required, it is impractical to use all 80 frames. Not every consecutive frame is needed to represent the gesture, and some frames can be discarded, especially when there is no significant variation between two consecutive frames. Different combinations of the number of dropped frames and the total number of frames used to represent the gesture were examined by trial and error. Five classes with 1,000 sets of 20 images each were generated using the Intel RealSense depth camera to form the dataset. To ensure diversity in the dataset and prevent overfitting, the distance from the camera was varied randomly during recording. The dataset was split in a 7:3 ratio into a training set and an independent test set.
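The frame-dropping idea above can be sketched as a simple subsampling step. This is a minimal illustration, not the exact preprocessing used in the project: `keep_every` and `target_len` are assumed values standing in for the numbers found by trial and error.

```python
import numpy as np

def subsample_frames(frames, keep_every=4, target_len=20):
    """Drop intermediate frames so a long gesture clip (~80 frames)
    is represented by target_len frames.

    keep_every and target_len are illustrative; the actual values
    were tuned by trial and error as described above.
    """
    sampled = frames[::keep_every][:target_len]
    return sampled

# Dummy 80-frame clip with the same per-frame shape as the dataset.
clip = np.zeros((80, 128, 128, 3), dtype=np.uint8)
seq = subsample_frames(clip)
print(seq.shape)  # (20, 128, 128, 3)
```

Keeping every fourth frame of an 80-frame clip yields exactly the 20-frame sequences described above, while still covering the full motion.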
Model
I adopted a CNN-RNN-based network to train on the data. The input of 20 images is processed by a shared CNN to extract the respective feature maps, and each feature map is fed into an RNN-based network as a single time step. The input shape is (20, 128, 128, 3), the batch size is 32, and the number of epochs is 30. The Adam optimizer was used for optimization, along with categorical cross-entropy as the loss function. Softmax is used as the last layer's activation.
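A minimal Keras sketch of this architecture is shown below. The specific layer sizes (the Conv2D stack for the shared CNN and a 64-unit LSTM for the RNN part) are my assumptions, since the exact layers are not specified above; only the input shape, class count, optimizer, loss, and softmax output come from the description.

```python
import tensorflow as tf
from tensorflow.keras import layers, models

NUM_CLASSES = 5                 # five gesture classes in the dataset
SEQ_LEN, H, W, C = 20, 128, 128, 3

# Shared CNN: the same weights extract a feature map from every frame.
cnn = models.Sequential([
    layers.Input(shape=(H, W, C)),
    layers.Conv2D(16, 3, activation="relu"),
    layers.MaxPooling2D(),
    layers.Conv2D(32, 3, activation="relu"),
    layers.MaxPooling2D(),
    layers.Flatten(),
])

model = models.Sequential([
    layers.Input(shape=(SEQ_LEN, H, W, C)),
    # TimeDistributed applies the shared CNN to each of the 20 frames,
    # producing one feature vector per time step.
    layers.TimeDistributed(cnn),
    layers.LSTM(64),            # RNN consumes one feature map per step
    layers.Dense(NUM_CLASSES, activation="softmax"),
])

model.compile(optimizer="adam",
              loss="categorical_crossentropy",
              metrics=["accuracy"])
```

Training would then be a call like `model.fit(x_train, y_train, batch_size=32, epochs=30)` on one-hot-encoded labels, matching the hyperparameters listed above.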
Ta da!
Here’s the outcome