HAR-Web
1. Introduction

HAR-Web is a web application for carrying out Human Activity Recognition in real time on the web using GPU-enabled devices. The application follows a microservice architecture: the project is divided into 4 basic services, each running on a different port and deployed in Docker containers. I would like to add support for Kubernetes, but as of now it does not support native hardware access the way Docker Swarm does. One way of solving this would be to write a host device plugin like https://github.com/honkiko/k8s-hostdev-plugin, but specifically for webcam access. If anyone has an idea on how to enable local webcam access on K8S, I would love to hear about it.

2. Microservices Architecture

2.1 Services

2.1.1 Frontend Service

Initial page for the app that gives the user the option to train their own model or use an existing one, and routes to the appropriate service accordingly. (I plan to add presets.)

2.1.2 Recorder Service

Video recording service written in NodeJS that allows the user to record videos and stores them as frames for training. By default, videos are recorded at 10 FPS and a total of 300 frames are captured, which provides enough training data for each action.

2.1.3 Recognizer Service

Serves the serialized model via Flask and streams predictions in real time from the webcam video feed.
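
As a rough illustration only (the file names, route and feature helper below are assumptions, not the actual service code), such a Flask endpoint might look like this:

```python
# Illustrative sketch; paths, route names and the feature helper are assumptions.
import pickle

import cv2
from flask import Flask, jsonify

# Hypothetical helper that runs pose estimation + preprocessing on a frame.
from features import frame_to_features  # assumed module, for illustration only

app = Flask(__name__)

with open("model/classifier.pickle", "rb") as f:  # assumed path to the trained model
    classifier = pickle.load(f)

camera = cv2.VideoCapture(0)  # local webcam passed through to the container


@app.route("/predict")
def predict():
    ok, frame = camera.read()
    if not ok:
        return jsonify(error="no frame from webcam"), 503
    label = classifier.predict([frame_to_features(frame)])[0]
    return jsonify(action=str(label))


if __name__ == "__main__":
    app.run(host="0.0.0.0", port=5003)
```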

2.1.4 Training Service

Modifies the config file with the appropriate labels and the model of choice, and then starts the training process in the background, which begins with generating heatmaps (the full 5-step pipeline is described in Section 5).
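
Roughly, the service might rewrite the config and launch training like the sketch below; the file layout, keys and script name are assumptions for illustration, not the repository's actual code.

```python
# Illustrative sketch; config keys, paths and the training script name are assumptions.
import json
import subprocess


def start_training(labels, model_name):
    # Rewrite the config with the user's chosen action labels and model.
    with open("config/config.json") as f:
        config = json.load(f)
    config["classes"] = labels          # e.g. ["wave", "sit", "walk"]
    config["model"] = model_name
    with open("config/config.json", "w") as f:
        json.dump(config, f, indent=2)

    # Launch the multi-step training pipeline in the background so the
    # HTTP request can return immediately.
    return subprocess.Popen(["python", "train.py", "--config", "config/config.json"])
```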

2.2 Architecture

Each service has been containerised using Docker; you can find the Dockerfile for each service in its respective folder. To start the entire system together, simply use the docker-compose file, which describes all the dependencies and the commands to build and start each container.

[Figure: microservices architecture]

Ports for the different services in the docker-compose file are listed below:

| Service    | Port |
|------------|------|
| Frontend   | 5000 |
| Recorder   | 5001 |
| Trainer    | 5002 |
| Recognizer | 5003 |
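
The compose file ties these together roughly as follows. This is an illustrative sketch rather than the exact file in the repo: the service names, build contexts and the webcam device mapping are assumptions.

```yaml
# Illustrative docker-compose sketch; names, paths and device mappings are assumptions.
version: "3"
services:
  frontend:
    build: ./frontend
    ports:
      - "5000:5000"
  recorder:
    build: ./recorder
    ports:
      - "5001:5001"
    devices:
      - /dev/video0:/dev/video0   # pass the host webcam through to the container
  trainer:
    build: ./trainer
    ports:
      - "5002:5002"
  recognizer:
    build: ./recognizer
    ports:
      - "5003:5003"
    devices:
      - /dev/video0:/dev/video0
```

With a file along these lines, `docker-compose up --build` brings up all four services together.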

3. Recognizer

3.1 Human Activity Recognition

Human Activity Recognition is a domain of Computer Vision that deals with identifying what action is being performed by a human in a video feed. Deep learning approaches to human activity recognition have typically relied on 3D-CNNs, LRCNs, and the widely adopted two-stream architecture (https://github.com/jeffreyyihuang/two-stream-action-recognition), which uses both RGB images and optical flow.

HAR-Web is based on the project https://github.com/felixchenfy/Realtime-Action-Recognition, which uses Human Pose Estimation to generate 2D skeletons and then classifies actions from the skeleton coordinates. A big advantage of this approach is the reduced computation needed to carry out action recognition, making it a much more viable approach for identifying human actions in real time.

3.2 Human Pose Estimation

Human 2D pose estimation deals with localizing key human body parts and using these localized points to construct a pose for the person. HAR-Web uses https://github.com/ildoonet/tf-pose-estimation to generate the 2D poses; it is a TensorFlow implementation of OpenPose (originally written in Caffe).
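
A typical way to obtain the skeletons with tf-pose-estimation looks roughly like the following; the model choice and image sizes are illustrative, so check the tf-pose-estimation README for the exact API.

```python
# Illustrative usage of tf-pose-estimation; model name and sizes are assumptions.
import cv2
from tf_pose.estimator import TfPoseEstimator
from tf_pose.networks import get_graph_path

# Load a pretrained pose model (e.g. the lightweight mobilenet_thin variant).
estimator = TfPoseEstimator(get_graph_path("mobilenet_thin"), target_size=(432, 368))

image = cv2.imread("frame.jpg")
humans = estimator.inference(image, resize_to_default=True, upsample_size=4.0)

for human in humans:
    # body_parts maps joint index (0-17) to a part with normalized (x, y) coordinates.
    for idx, part in human.body_parts.items():
        print(idx, part.x, part.y)
```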

The model developed by CMU uses Part Affinity Fields (https://arxiv.org/pdf/1611.08050.pdf) and generates pose predictions through a multi-stage, two-branch process. In the first stage, a neural network makes two simultaneous predictions: a set S of 2D confidence maps for body part locations, and a set L of 2D vector fields of part affinities, which encode the degree of association between different body parts.

Both branches take as input the feature maps F produced by the first 10 layers of VGG-19, with the first branch predicting the set S and the second branch predicting the set L. Each subsequent stage concatenates the previous stage's predictions with the original feature maps F and refines them iteratively.

3.3 Skeleton Data To Actions

The skeleton generated by OpenPose has 18 joints, each with an (x, y) coordinate pair. The following preprocessing is then applied (a sketch of these steps is shown after the list):

  1. Scale the x and y coordinates, since OpenPose outputs them on different scales.
  2. Remove the joints on the head.
  3. Discard frames in which no neck or thigh is detected.
  4. Fill in missing joints.
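
As a rough illustration of what these steps amount to in code, the sketch below uses the OpenPose COCO joint layout; the exact indices, the hip-for-thigh check and the fill heuristic are assumptions, not the project's actual preprocessing code.

```python
# Illustrative preprocessing sketch; joint indices and heuristics are assumptions.
import numpy as np

HEAD_JOINTS = [0, 14, 15, 16, 17]   # nose, eyes and ears in the OpenPose layout
NECK, R_HIP, L_HIP = 1, 8, 11


def preprocess(skeleton, image_w, image_h):
    """skeleton: (18, 2) array of raw OpenPose (x, y) joint coordinates."""
    joints = np.asarray(skeleton, dtype=float)

    # 1. Put x and y on the same scale.
    joints[:, 0] /= image_w
    joints[:, 1] /= image_h

    # 3. Discard frames where the neck or both hips are missing (marked as 0).
    if not joints[NECK].any() or (not joints[R_HIP].any() and not joints[L_HIP].any()):
        return None

    # 2. Drop the head joints.
    joints = np.delete(joints, HEAD_JOINTS, axis=0)

    # 4. Fill missing joints with the neck position as a crude placeholder.
    missing = ~joints.any(axis=1)
    joints[missing] = joints[0]  # index 0 is the neck after removing the head joints

    return joints.flatten()
```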

Features are then extracted by concatenating the skeleton data from a window of 5 frames at a time. The exact feature extraction is described in the original report, which also discusses which specific features were most effective for training.

A feature vector of dimension 314 is created and reduced to 50 dimensions using PCA. This 50-dimensional vector is finally classified into the different actions by a neural network with 3 hidden layers of 100 nodes each.
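
With scikit-learn, the dimensionality reduction and classifier described above look roughly like this; the stand-in data and exact hyperparameters are assumptions and may differ from the original project.

```python
# Illustrative sketch of the PCA + MLP stage; data and hyperparameters are assumptions.
import pickle

import numpy as np
from sklearn.decomposition import PCA
from sklearn.neural_network import MLPClassifier
from sklearn.pipeline import make_pipeline

# Stand-in data; in HAR-Web these come from the 5-frame skeleton windows.
X = np.random.rand(200, 314)
y = np.random.choice(["wave", "walk", "sit"], size=200)

model = make_pipeline(
    PCA(n_components=50),
    MLPClassifier(hidden_layer_sizes=(100, 100, 100), max_iter=300),
)
model.fit(X, y)

# Serialize the trained classifier so the Recognizer's Flask service can load it.
with open("model/classifier.pickle", "wb") as f:
    pickle.dump(model, f)
```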

[Figure: skeleton-feature to action classification pipeline]

4. Recorder

Built using opencv4nodejs, which provides Node.js bindings for native OpenCV, this service records video of a person and stores it as frames for our training data. The base image I used for the Docker container, with a working and compatible version of opencv4nodejs, is available on DockerHub.

A big advantage of opencv4nodejs over simply using Flask for a task like this is its asynchronous API, which provides built-in multithreading without having to rely on something like Flask-Threads to avoid blocking calls. This matters in our application because it lets us record and save the frames on two separate threads, which leads to performance gains.
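
Purely as an illustration of the same idea (capture on one thread, disk writes on another), an equivalent Python/OpenCV sketch might look like the following; the real service is written in Node.js with opencv4nodejs and its async API, and the paths below are assumptions.

```python
# Python/OpenCV illustration of the capture/save split; the real service uses opencv4nodejs.
import os
import queue
import threading
import time

import cv2

os.makedirs("data", exist_ok=True)
frames = queue.Queue()


def writer():
    # Drain the queue and write frames to disk, off the capture thread.
    i = 0
    while True:
        frame = frames.get()
        if frame is None:
            break
        cv2.imwrite(f"data/frame_{i:04d}.jpg", frame)
        i += 1


saver = threading.Thread(target=writer)
saver.start()

cap = cv2.VideoCapture(0)
for _ in range(300):          # 300 frames per action, as described above
    ok, frame = cap.read()
    if ok:
        frames.put(frame)
    time.sleep(0.1)           # roughly 10 FPS

frames.put(None)              # tell the writer thread to finish
saver.join()
cap.release()
```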

5. Trainer

The Trainer microservice uses a config file to carry out a 5-step training process that goes from generating heatmaps, to extracting the specific features mentioned in 3.3, to carrying out PCA to reduce the dimensionality of the feature vector, to training a neural network on it. The config file is modified once the user chooses to train their own model, and the process generates a pickle file for the trained classifier that serves our predictions on the Flask server. I have hardcoded HAR-Web to use frames 20-280 of the 300 frames recorded as training data, but I plan to give the user the freedom to specify exactly how many frames to use and how many frames to record in the first place.
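
The hardcoded frame window amounts to little more than a slice; making the bounds user-configurable would mean reading them from the config instead of the constants in this illustrative sketch (the path pattern is an assumption).

```python
# Illustrative sketch; in the current code the 20-280 bounds are hardcoded.
import glob

recorded = sorted(glob.glob("data/frame_*.jpg"))    # the 300 recorded frames
training_frames = recorded[20:280]                  # drop the warm-up and tail frames
```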

6. Frontend

This is a very basic service that serves the front page for the web app and routes to the particular service requested by the user accordingly.
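
If the frontend were a small Flask app (an assumption; the repo may use a different framework), the routing would amount to little more than redirects to the other services' ports, roughly as sketched below.

```python
# Illustrative routing sketch; the frontend framework, routes and URLs are assumptions.
from flask import Flask, redirect

app = Flask(__name__)


@app.route("/")
def index():
    # Landing page offering "train your own model" vs "use an existing model".
    return '<a href="/train">Train</a> | <a href="/recognize">Recognize</a>'


@app.route("/record")
def record():
    return redirect("http://localhost:5001/")   # Recorder service


@app.route("/train")
def train():
    return redirect("http://localhost:5002/")   # Trainer service


@app.route("/recognize")
def recognize():
    return redirect("http://localhost:5003/")   # Recognizer service


if __name__ == "__main__":
    app.run(host="0.0.0.0", port=5000)
```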

[Figure: frontend landing page]

You can view the service at http://35.240.156.121/, which has been deployed on a Google Cloud Kubernetes cluster.

7. Future Work

Ideally I would like to deploy this on a cloud provider like AWS, where anyone can access the service to recognize actions from groups of 'presets', or train their own classes and use them on the go.

8. References and Links

| Title | Link |
| --- | --- |
| Two-Stream Convolutional Networks | http://papers.nips.cc/paper/5353-two-stream-convolutional |
| Temporal Segment Networks | https://link.springer.com/chapter/10.1007/978-3-319-46484-8_2 |
| TS-LSTM | https://arxiv.org/abs/1703.10667 |
| Human activity recognition from skeleton poses | https://arxiv.org/pdf/1908.08928v1.pdf |
| tf-pose-estimation | https://github.com/ildoonet/tf-pose-estimation |
| Realtime-Action-Recognition | https://github.com/felixchenfy/Realtime-Action-Recognition |
