We want to use computer vision methods to track the hand. There are two possible camera angles:
- cameras above the keyboard to track hands
- cameras beside the keyboard to track any touches on the surface.
And two types of devices we can use:
- depth cameras
  - there are 60 fps cameras costing $110+
- RGB webcams
We tried top-mounted RGB webcams first. There are 6 main steps:
- camera speed up
- background subtraction
- find index finger
- determine track point (find fingertip)
- depth calculation
- integrating
We are currently working on Step 6, while Step 1 proved to be unnecessary and even made results worse (details in later chapters).
- We used a threaded method to get better speed when the cameras and our algorithm work together. (See reference)
- We tested the speed of two cameras working simultaneously on the same USB bus at different resolutions. (See the statistics below)
(needs update) 320×240: 182 fps; 640×480: 60 fps
display | average fps | std fps |
---|---|---|
0 | 892.3 | 6.5 |
1 | 460.3 | 9.4 |
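The threaded capture mentioned above might look like the following sketch; `ThreadedGrabber` is our name for it, and `source` can be any object with an OpenCV-style `read()` method (e.g. `cv2.VideoCapture(0)`):

```python
import threading

class ThreadedGrabber:
    """Read frames on a background thread so the main processing loop
    never blocks on camera I/O. (A sketch; names are ours, not the
    project's actual code.)"""

    def __init__(self, source):
        self.source = source                    # anything with .read() -> (ok, frame)
        self.ok, self.frame = source.read()     # prime with a first frame
        self.lock = threading.Lock()
        self.stopped = False
        self.thread = threading.Thread(target=self._loop, daemon=True)
        self.thread.start()

    def _loop(self):
        # Keep overwriting the latest frame as fast as the source allows.
        while not self.stopped:
            ok, frame = self.source.read()
            with self.lock:
                self.ok, self.frame = ok, frame

    def read(self):
        # Return the most recent frame without waiting for the camera.
        with self.lock:
            return self.ok, self.frame

    def stop(self):
        self.stopped = True
        self.thread.join()
```

Because `read()` only copies a reference under a lock, the processing loop runs at its own speed, which is how the high "display 0" fps above is possible.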
We spent a week trying different background subtraction algorithms:
*(a checked item has been tried; an unchecked one has not)*
(sorted by time that we tried it)
- opencv's MOG2
- opencv's KNN
- PCA-based dynamic
- use PCA algorithm to find the moving object.
- PCA-based static
- use PCA algorithm to get the matrix that represent the background, then subtract it from the current frame.
- simple background subtraction (SBS)
- use the difference of current frame and average of previously obtained background frames to find the hand.
- dynamic simple background subtraction (D-SBS)
- an improved version of SBS: we use background images under different lighting conditions, and for each pixel the threshold is a multiple of its standard deviation.
- brightness filter
- use grayscale images to find the hand.
- skin color filter
- use skin color's histogram to find the hand.
- 42 of the 43 algorithms in the BGS library
After thorough consideration, the skin color filter proved to be the best algorithm for our task. (Details of the exploration process below.)
Generally speaking, the outcomes of simple subtraction and static PCA are similar in both quality and speed, but lights and shadows strongly affect image quality. D-SBS partly solves this problem; the skin color filter solves it thoroughly. The other algorithms are too slow or not robust, so the skin color filter is our final choice.
We recorded videos to compare the different algorithms. Because the laptop wasn't delivered in the first days (among other reasons), the testing videos changed over the iteration period, but every comparison is based on the same video.
- background (camera0 camera1)
  - pure background, with people moving around causing the light and shadows to change over time.
- moving (camera0 camera1)
  - hands working on the keyboard (pointing and clicking) at different speeds, 10 s each.
- effect: too much noise when patterns change, but it learns.
- speed: 62 fps (single test, 101 s)
- code here
- effect: less noise in background, but when hands in/out still big changes.
- speed: perceptible lag
- code here
- still can't be used with OpenCV 3; needs to be tested.
- there are hand-tracking videos using CNT that seem good.
- references:
- description: stacks frames from the current one back through earlier ones into a big matrix, and applies the basic algorithm to it.
- effect: poor
- speed: very slow
- statistics here (720 frames require almost 1 s of computation.)
- code here
- Method 1: uses the SVD result matrices U, S, V. This turned out to be wrong. The idea was to decompose the new image with vectors from U and keep only the first nnz coefficients, but the result was a mess, so the idea is probably wrong: the original samples were computed from elements of U, S, and V together.
- Method 2: uses the background video to get a low-rank matrix. Calculate the average of the low-rank matrix, then subtract it from the captured frame.
- Method 3: uses many frames and their low-rank backgrounds. Find the stored frame most similar to the captured frame, then subtract the corresponding background from the captured frame.
- concerns: a lot of calculation, and since there are hands in the frames we need very many samples. (Maybe background-only samples will work? It's more like noise reduction.)
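Method 2 above can be sketched with a plain NumPy SVD; the `rank` and the subtraction threshold are our assumptions:

```python
import numpy as np

def lowrank_background(frames, rank=3):
    """Estimate a background image as the rank-k approximation of the
    stacked background frames (a sketch of Method 2; names are ours)."""
    h, w = frames[0].shape
    # Each frame becomes one row of the data matrix.
    X = np.stack([f.ravel() for f in frames]).astype(np.float64)
    U, s, Vt = np.linalg.svd(X, full_matrices=False)
    # Keep only the first `rank` singular components.
    X_low = (U[:, :rank] * s[:rank]) @ Vt[:rank]
    # Average the low-rank rows back into one background image.
    return X_low.mean(axis=0).reshape(h, w)

def subtract(frame, background, thresh=30.0):
    # Foreground mask: pixels that differ from the background by more
    # than a fixed threshold.
    return np.abs(frame.astype(np.float64) - background) > thresh
```

Truncating to a few singular components discards per-frame noise while keeping the stable background structure, which is the "noise reduction" effect mentioned in the concerns above.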
Subtract the background image (mean) from the captured image, then filter. Initially, we used subtraction on RGB images and got the following result:
- effect:
- code here, but on the commit tagged "try rgb subtraction first". bg-st-simple.py produces the outcome video; compare.py compares the difference between the RGB version's and the gray version's outcomes. This is a screenshot from the comparison: (window "frame" is the RGB version minus the gray version; Frame1 is the RGB version, Frame0 is the gray version)
Then we realized that subtraction on grayscale images could be faster; results are below:
- effect:
- speed: average: 253.31fps, std: 4.78, times=10
- disadvantage: when the background changes, you have to resample it manually for 1–2 seconds.
- code here
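The grayscale SBS step reduces to a mean image plus a fixed threshold; a minimal sketch (function name and threshold value are ours):

```python
import numpy as np

def sbs_mask(frame_gray, bg_frames, thresh=25):
    """Simple background subtraction (SBS): threshold the difference
    between the current grayscale frame and the mean of previously
    captured background frames. (A sketch, not the project's exact code.)"""
    bg_mean = np.mean(np.stack(bg_frames), axis=0)
    diff = np.abs(frame_gray.astype(np.float64) - bg_mean)
    # White (255) where the frame departs from the background.
    return (diff > thresh).astype(np.uint8) * 255
```

Working on a single grayscale channel instead of three RGB channels is what makes this variant faster, as noted above.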
After testing PCA-BS and SBS, I found that lighting conditions greatly affect the result. Initially, I thought of the following solutions:
- use unreflective keyboards (put a mask on, maybe)
- use another camera angle and detect any touch within a certain area of the keyboard instead of recognizing the whole hand.
But after discussion, we want our implementation to work on any keyboard, so the first option is out of consideration. The other angle requires the user to treat the keyboard surface as a trackpad, lifting the other fingers while touching with certain fingers; currently, we want users to rest their fingers on the keyboard and use the index finger as the pointer.
So, on Julian's advice, we built dynamic SBS (D-SBS), which calculates a threshold for each pixel separately. The threshold is 6 times that pixel's standard deviation over the background video.
- speed:
- effect: shadows are largely eliminated (compare videos: D-SBS, SBS). However, with a newly recorded video that has larger differences between light and shadow, part of the skin is filtered out as well, while the shadows under the fingers remain. (video)
- code here
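The D-SBS rule above (per-pixel threshold of six background standard deviations) can be sketched in NumPy as follows; the function names and the zero-variance guard are our additions:

```python
import numpy as np

def dsbs_model(bg_frames, k=6.0):
    """Per-pixel background model: mean image and k·std threshold image
    (k = 6 follows the text; names are ours)."""
    stack = np.stack(bg_frames).astype(np.float64)
    return stack.mean(axis=0), k * stack.std(axis=0)

def dsbs_mask(frame_gray, mean, thresh):
    """A pixel is foreground when it deviates from its own mean by more
    than its own threshold. The np.maximum guard keeps zero-variance
    pixels from always firing (our addition, not from the source)."""
    diff = np.abs(frame_gray.astype(np.float64) - mean)
    return (diff > np.maximum(thresh, 1.0)).astype(np.uint8) * 255
```

Pixels that flicker in the background video (shadow regions) get a large std and thus a large threshold, which is why shadows are suppressed better than with a single global threshold.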
This method uses grayscale images: extract pixels above a threshold, filter out small areas, and find the big convex hull. Code refers to Project Shubham. Speeds were tested and are shown below.
- speed (tested with convex hull; speed without convex hull is higher than SBS) To Be Modified
frames per test | test times | average fps | std fps |
---|---|---|---|
500 | 10 | 134.6 | 6.86 |
- effect: video
Hand detection is quite good against a dark background, but locating the index finger needs more work.
- code here
Though D-SBS eliminated some influence of shadows, under certain lighting conditions much of the hand is also filtered away. So we tried the skin color filter:
- Description: we use HSV to decrease the influence of lighting, and use histogram information to pick out the skin.
- resources: http://www.benmeline.com/finger-tracking-with-opencv-and-python/
- Sampling: for this algorithm, sampling the user's hand is important. We ask users to put several parts of their hand (including darker skin, brighter skin, and nails) at a certain place on the keyboard sequentially.
- speed: average 178.15 fps, std 8.51 (raw camera speed 43.80 fps)
- effect: usable(though noise exists) video
- code: realtime version video version
We found a library that implements many algorithms.
- resources:
- result: most of the algorithms are either way too slow or give bad results. The code and comments are here.
After background subtraction, we get an image where hands are white and background is black.
The algorithm of finding the index finger consists of 4 parts:
- Basics: Find the contours of the image, and select the biggest area as valid. Then find the convex hull and center of the contour.
- Get hand direction.
- find the left/right endpoints of the wrist line and the left/right edges of the hand.
- use the one closer to the center as an anchor and find the symmetric point on the edge of the other side.
- add the two vectors to get the opposite of the hand direction.
- Use hand direction to find estimated thumb and index position on convex hull.
- we scan all the edges on the convex hull clockwise from the leftmost point.
- the first point whose edge is longer than _thumb_l and whose angle is smaller than _thumb_a is considered the thumb.
- the first point whose edge is longer than _index_l and whose angle is smaller than _index_a is considered the index.
- angle: we define the vector from the center to the candidate point as the "point direction". The angle refers to the angle between the point direction and the hand direction.
- Confirm: Deal with situations where index is not on convex hull.
- we noticed that the middle finger is sometimes detected as the index finger, especially when the index finger moves closer to the palm, so we decided to use "convex approximation" to handle this situation.
- first get the approximation and, among the approximation's convex points, the nearest points to the estimated thumb and index; mark them as approx thumb and approx index.
- here we tried two versions of the definition of "convex point":
- use center point of hand
- use direction changes
- then check the points on the approximation between approx thumb and approx index. If there are two concave points in between, the convex points between the concaves are considered the index.
- quality
- speed
Another idea is to find the fingers directly from the approximation, but since the previous algorithm works fine, we haven't tried this yet.
- get convex hull of approximation
- use the concave points to separate points into groups.
- estimate the width of each group and how many fingers it likely contains.
- find the second one as the index.
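The convex-hull scan described above (the first hull point whose incoming edge is long enough and whose point direction is close enough to the hand direction) might be sketched like this; `find_finger` and its thresholds are our simplification, not the project's exact code:

```python
import numpy as np

def angle_between(v1, v2):
    """Unsigned angle (radians) between two 2-D vectors."""
    v1 = v1 / np.linalg.norm(v1)
    v2 = v2 / np.linalg.norm(v2)
    return np.arccos(np.clip(np.dot(v1, v2), -1.0, 1.0))

def find_finger(hull_pts, center, hand_dir, min_len, max_angle):
    """Scan hull vertices in order; return the first whose edge from the
    previous vertex is longer than min_len and whose point direction
    (center -> point) is within max_angle of the hand direction.
    (A simplified sketch of the scan above; thresholds map to _index_l
    and _index_a.)"""
    for prev, pt in zip(hull_pts, np.roll(hull_pts, -1, axis=0)):
        edge_len = np.linalg.norm(pt - prev)
        point_dir = pt - center
        if edge_len > min_len and angle_between(point_dir, hand_dir) < max_angle:
            return pt
    return None      # no vertex passed both tests
```

Running the same scan twice with the thumb thresholds and then the index thresholds reproduces the two-pass search described in the text.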
After we know the general position of the index finger, we need a precise tracking point so that both cameras can track the same point and calculate depth. We tried 3 ways of doing that: we first crop a rectangle around the general index position, then apply different algorithms to it to see which works best.
- Box
- Convex Hull
- Edge detection
- length of routes
method | x0 | y0 | x1 | y1 |
---|---|---|---|---|
box | 4245 | 5984 | 3940 | 6067 |
edge | 4153 | 6553 | 4178 | 7347 |
conv | 4309 | 6612 | 3978 | 7284 |
- average
method | x0 | y0 | x1 | y1 |
---|---|---|---|---|
box | 220.22927242 | 136.78680203 | 216.171929825 | 101.250877193 |
edge | 228.097292724 | 137.546531303 | 222.929824561 | 101.954385965 |
conv | 229.175972927 | 138.359560068 | 225.124561404 | 104.349122807 |
- standard deviation
method | x0 | y0 | x1 | y1 |
---|---|---|---|---|
box | 27.9720501541 | 47.0161461458 | 27.0126075463 | 45.025044192 |
edge | 28.1405075516 | 47.4805592132 | 27.1805268588 | 46.4812320932 |
conv | 28.0231563304 | 47.5557931032 | 27.0359460569 | 45.7308971634 |
method | ave | std |
---|---|---|
box | 0.000074422123307 | 0.000013260347968 |
edge | 0.000901667543981 | 0.000359573519302 |
conv | 0.000744257786477 | 0.000259432506332 |
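As a rough illustration of two of the candidate track-point definitions on a binary fingertip crop (the bounding-box top and a top-row edge scan; both are simplified stand-ins for the box and edge-detection methods, not the project's code):

```python
import numpy as np

def tip_box(mask):
    """Bounding-box method (sketch): the top edge of the white region's
    bounding box at its horizontal center."""
    ys, xs = np.nonzero(mask)
    return (int(xs.min()) + int(xs.max())) // 2, int(ys.min())

def tip_edge(mask):
    """Edge-scan method (sketch): the median white pixel in the topmost
    occupied row; our simplified stand-in for edge detection."""
    ys, xs = np.nonzero(mask)
    top = int(ys.min())
    return int(np.median(xs[ys == top])), top
```

The bounding box depends only on the extreme coordinates of the blob, which is consistent with the "box" row showing the smallest per-frame time and the lowest variance in the tables above.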
- timing breakdown (seconds): contours and approx: 0.0010, palm: 0.00014, drawing: 0.0003, prlc: 0.0010, dep: 0.0003
- preprocess: 1. background subtraction 2. palm direction 3. fingers and p_fingers 4.