Scalable Video Coding Based on Content and Gaze

This is an experimental video codec that scales based on content (i.e. physical objects) and point of gaze (i.e. where the viewer is looking). Scalability, in this context, is the ability to reconstruct meaningful video from a partially decoded stream, which helps video systems adapt to the processing power of client devices and to network bandwidth constraints. In particular, this video codec provides quality scalability by adaptively varying the color pattern accuracy of regions in each frame. Regions where the viewer is looking, or that are classified as foreground objects, are reconstructed at higher color pattern accuracy.

Notable Features

  • Hierarchical block matching algorithm (HBMA) for motion estimation
  • RANSAC (random sample consensus) for global motion estimation
  • Hardware acceleration via SSE2 (Streaming SIMD Extensions 2) instructions and instruction-level parallelism (ILP) for mean absolute difference (MAD) calculations
    • MAD is used as the error criterion in HBMA
  • Multithreaded I/O
    • Encoder app has a main thread, a read thread, and a write thread
    • Decoder app has a main thread and a read thread
    • A thread-safe circular queue is used between a producing thread and a consuming thread, e.g. between the read thread and the main thread, or between the main thread and the write thread
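
The repository's queue implementation is not shown here; the following is only a minimal sketch of a bounded, thread-safe circular queue for such a producer/consumer pair (the class name and interface are illustrative assumptions, not this project's actual code):

#include <condition_variable>
#include <cstddef>
#include <mutex>
#include <utility>
#include <vector>

// Illustrative bounded, thread-safe circular queue (not the actual code in
// this repository). Push blocks while the queue is full; Pop blocks while it
// is empty.
template <typename T>
class CircularQueue {
 public:
  explicit CircularQueue(std::size_t capacity)
      : buf_(capacity), head_(0), tail_(0), size_(0) {}

  void Push(T item) {
    std::unique_lock<std::mutex> lock(mutex_);
    not_full_.wait(lock, [this] { return size_ < buf_.size(); });
    buf_[tail_] = std::move(item);
    tail_ = (tail_ + 1) % buf_.size();
    ++size_;
    not_empty_.notify_one();
  }

  T Pop() {
    std::unique_lock<std::mutex> lock(mutex_);
    not_empty_.wait(lock, [this] { return size_ > 0; });
    T item = std::move(buf_[head_]);
    head_ = (head_ + 1) % buf_.size();
    --size_;
    not_full_.notify_one();
    return item;
  }

 private:
  std::vector<T> buf_;
  std::size_t head_, tail_, size_;
  std::mutex mutex_;
  std::condition_variable not_full_, not_empty_;
};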

Description

The encoder first performs block-based motion estimation and then segments the video into regions, which are sets of spatially connected blocks with similar motion. Each region is classified either as part of the background or as a distinct moving (i.e. foreground) object. Afterwards, the encoder applies the Discrete Cosine Transform (DCT), and the resulting coefficients, along with each region's identifier, are written to a stream.

encoder-visualizer_dynamic-background_fall.mov

In a typical streaming architecture, each region would be converted into quantized DCT coefficients based on bandwidth and gaze requirements, and the compressed data would be buffered for streaming. Continuous bandwidth sensing would control the quantization of each region's DCT coefficients, with background transform blocks generally quantized more coarsely than foreground transform blocks. The gaze requirements would determine which transform blocks need to be clearer, with transform blocks in the gaze area quantized less than those outside it. Both the bandwidth and gaze information would be communicated by the client to the streaming encoder.

For the sake of simplicity, I emulate the streaming feature. The encoder computes and stores all the transform coefficients, and the decoder decides the degree of quantization to apply based on where the user is gazing and whether a region corresponds to a foreground object or part of the background. This approach allows me to focus on block-based motion estimation and segmentation without tackling the complexities of streaming. I also emulate the gaze point, which is determined by the position of the mouse cursor.

decoder_dynamic-background_fall.mov

Video Model

  • Illumination model: constant intensity assumption
    • Valid for spatially and temporally invariant ambient illumination sources and diffuse reflecting surfaces
    • The surface reflectance of an object does not change as the object moves
    • Describes neither moving shadows nor reflections due to glossy surfaces
  • Scene model: 2D
    • All objects are flat and lie on the same image plane
    • Objects are limited to moving in a 2D plane
  • Motion model: 2D
    • Motion is translational
    • Applies to camera and objects
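
Taken together, these assumptions say that a pixel's intensity is preserved along a purely translational motion trajectory, which is what block matching exploits. As a rough sketch (not notation used in this repository), with $\psi$ the luminance, $\mathbf{x}$ a pixel position, and $\mathbf{d}$ a 2D displacement between frames $t$ and $t + 1$:

$\psi(\mathbf{x} + \mathbf{d},\, t + 1) = \psi(\mathbf{x},\, t)$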

Encoder Steps

  1. Convert frame from the RGB color space to the YUV color space
  2. Extract Y (i.e. luminance) channel
    • All motion estimation is intensity-based and relies solely on the Y channel
  3. Estimate block-wise motion
    • Achieved by a variation of the hierarchical block matching algorithm (HBMA)
      • For more details, open the file motion.hpp and read the comment block at the top of the file and the comment block for the function EstimateMotionHierarchical; a simplified single-level sketch of block matching is also shown after this list
      • If Streaming SIMD Extensions 2 (SSE2) is supported on your platform, then an SSE2-based HBMA implementation (see function EstimateMotionHierarchical16x16Sse2 in file motion.hpp) is used to improve performance
  4. Estimate global motion using random sample consensus (RANSAC)
    • Global motion is assumed to be the camera motion
    • Inlier group is assumed to be the background motion vectors
    • Outlier group is assumed to be the foreground motion vectors
    • For more details, open the file motion.hpp and read the comment block at the top of the file and the comment block for the function EstimateGlobalMotionRansac
  5. Create foreground mask from outlier group
  6. Improve spatial connectivity and remove noise in the foreground mask
    • Achieved by applying the closing and opening morphological operators in the stated order
      • The structuring element is rectangular
  7. Segment foreground layer into regions
    1. Apply k-means on foreground layer
      • Each feature vector consists of motion vector position and components
      • A cluster is not guaranteed to be spatially connected. Step 7.2 addresses this
    2. Find the connected components of each cluster
      • Each connected component becomes a region
  8. Compute the Discrete Cosine Transform (DCT)
    • The frame in the RGB color space is divided into transform blocks, which undergo the DCT
    • Each channel is processed independently
  9. Write encoded frames to a file
    • Each transform block is assigned a block type
    • The background and each foreground region are each mapped to a unique block type
    • For every transform block, the block type and the DCT coefficients for each channel are written
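
The repository's hierarchical implementation lives in motion.hpp; the sketch below is only a simplified, single-level block matcher using MAD as the error criterion, assuming row-major grayscale frames (all names here are illustrative, not this project's API):

#include <cstdint>
#include <cstdlib>
#include <limits>

// Illustrative only; see motion.hpp for the actual HBMA implementation.
// Mean absolute difference (MAD) between an n x n block of the current frame
// and a candidate block of the previous frame. Frames are row-major grayscale.
float Mad(const std::uint8_t* cur, const std::uint8_t* prev, int stride, int n) {
  int sum = 0;
  for (int y = 0; y < n; ++y) {
    for (int x = 0; x < n; ++x) {
      sum += std::abs(int(cur[y * stride + x]) - int(prev[y * stride + x]));
    }
  }
  return float(sum) / float(n * n);
}

// Exhaustive single-level block matching: find the displacement (dx, dy)
// within a +/- search_range window that minimizes MAD for the n x n block
// whose top-left corner is (bx, by). HBMA repeats this coarse-to-fine on a
// pyramid of downsampled frames.
void MatchBlock(const std::uint8_t* cur_frame, const std::uint8_t* prev_frame,
                int width, int height, int bx, int by, int n, int search_range,
                int* dx, int* dy) {
  float best = std::numeric_limits<float>::max();
  *dx = 0;
  *dy = 0;
  for (int sy = -search_range; sy <= search_range; ++sy) {
    for (int sx = -search_range; sx <= search_range; ++sx) {
      int px = bx + sx;
      int py = by + sy;
      if (px < 0 || py < 0 || px + n > width || py + n > height) continue;
      float err = Mad(cur_frame + by * width + bx,
                      prev_frame + py * width + px, width, n);
      if (err < best) {
        best = err;
        *dx = sx;
        *dy = sy;
      }
    }
  }
}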

Decoder Steps

  1. Emulate bandwidth scalability and gaze control
    • If a block is in the gaze region, then no quantization occurs and the substeps below are skipped (a sketch of this emulation is shown after this list)
    1. The DCT coefficients of a block are quantized by dividing by either foreground-quant-step or background-quant-step (see decoder usage section), depending on whether the block belongs to the foreground or background
    2. The resulting quantized values are rounded off, and the reverse process is applied to obtain the dequantized DCT coefficients
  2. Compute the inverse DCT of dequantized blocks
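
Not the decoder's actual code, but a minimal sketch of step 1 above, assuming one transform block's DCT coefficients are held in a flat vector (the parameter names mirror the foreground-quant-step and background-quant-step options; everything else is illustrative):

#include <cmath>
#include <vector>

// Illustrative only; the decoder's actual code may differ.
// Emulates quality scalability for one transform block of DCT coefficients:
// blocks in the gaze region are left untouched, other blocks are quantized
// (division by a step followed by rounding) and then dequantized, with a
// coarser step for the background than for the foreground.
void QuantizeDequantizeBlock(std::vector<float>& coeffs, bool in_gaze_region,
                             bool is_foreground, float foreground_quant_step,
                             float background_quant_step) {
  if (in_gaze_region) return;  // gaze region: skip quantization entirely
  float step = is_foreground ? foreground_quant_step : background_quant_step;
  for (float& c : coeffs) {
    float q = std::round(c / step);  // quantize and round off
    c = q * step;                    // reverse process: dequantize
  }
}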

Results

The videos I used for testing were from the 2014 IEEE Change Detection Workshop (CDW-2014) dataset, which can be found here: http://changedetection.net/.

The names of the result videos follow the format <executable-name>_<video-category>_<video-name>.mov.

For example, the names of the videos used in the Description section are

  • encoder-visualizer_dynamic-background_fall.mov
  • decoder_dynamic-background_fall.mov

The other result videos are

  • encoder-visualizer_PTZ_continuous-pan.mov
  • decoder_PTZ_continuous-pan.mov
  • encoder-visualizer_PTZ_two-position-PTZ-cam.mov
  • decoder_PTZ_two-position-PTZ-cam.mov

Build Requirements

  • C++ compiler
    • Must support at least C++17
  • CMake >= 3.2
  • OpenCV == 3.4.*

I have not built the project on any platform other than my development environment:

  • Hardware: MacBook Pro (Retina, 13-inch, Early 2015)
  • Operating System: macOS Mojave (version 10.14.6)
  • C++ compiler: Clang == 10.0.1
  • CMake == 3.25.1
  • OpenCV == 3.4.16

However, there is a good chance that the project can be built on your platform without much hassle because I use CMake to generate the build system. CMake is best known for taking a generic project description (see file CMakeLists.txt) and generating a platform-specific build system.

To install OpenCV, follow the OpenCV installation instructions for your platform. Once installed, CMake can use the command find_package to find the OpenCV package and load its package-specific details.

Build Steps

The following build steps assume a POSIX-compliant shell.

The build system must be generated before the project can be built. From the project directory, generate the build system.

mkdir build
cd build
cmake -DCMAKE_BUILD_TYPE:STRING=Release ..

The -G option is omitted in cmake -DCMAKE_BUILD_TYPE:STRING=Release .., so CMake will choose a default build system generator type based on your platform. To learn more about generators, see the CMake docs.

If SSE2 is supported on your platform, then an SSE2-based HBMA implementation is used by default to improve performance. To disable the SSE2-based HBMA implementation and use the fallback implementation instead, execute cmake -D SVC_MOTION_SSE2:BOOL=OFF .., which additionally sets the cache variable SVC_MOTION_SSE2 to OFF.

After having generated the build system, go back to the project directory

cd ..

and build all the targets of the project.

cmake --build build --config Release

The built targets are placed in the directory build.

The executable targets are

  • encoder-visualizer
  • encoder
  • decoder

To build only a specific target, replace mytarget with the name of the target and execute:

cmake --build build --config Release --target mytarget

Usage

Each option provided on the command line must have its name prefixed with "--" and must be followed by its associated argument. Options must also come before positional parameters.

For example, the option kmeans-cluster-count, its associated argument of 12, and the video file path foreman.mp4, a positional parameter, would be passed to encoder like so:

./build/apps/encoder --kmeans-cluster-count 12 foreman.mp4

encoder

encoder writes the encoded video to the standard output stream stdout.

To run encoder with the default configuration and write the encoded video to a file, execute the following command, which redirects output from stdout to encoded_video_file_path:

./build/apps/encoder video_file_path > encoded_video_file_path

If you do not want to create an encoded video file but still want to run the decoder on the encoder output, run encoder and decoder concurrently and connect the stdout of encoder to the stdin of decoder. Achieve this by executing the following command:

./build/apps/encoder video_file_path | ./build/apps/decoder

encoder-visualizer

To run encoder-visualizer with the default configuration, execute the following command:

./build/apps/encoder-visualizer video_file_path

encoder & encoder-visualizer

encoder and encoder-visualizer have the same options.

To see the name and type of each option, search for #options in apps/encoder.cpp. You'll see an array called opts in the function ParseConfig. Each element of the array corresponds to an option and contains the name and type of the option.

To see the default values of the options, search for #default-cfg in apps/encoder.cpp. You'll see some functions containing the default values.

If the SSE2-based HBMA implementation is being used, then the motion vector block dimensions and pyramid level count cannot be set and the default values are used.

decoder

The decoder reads encoded video from the standard input stream stdin.

To run decoder with the default configuration and read encoded video from a file, execute the following command, which redirects stdin to read from encoded_video_file_path:

./build/apps/decoder < encoded_video_file_path

If you want to run decoder on the output of encoder without creating an encoded video file, run encoder and decoder concurrently and connect the stdout of encoder to the stdin of decoder. Achieve this by executing the following command:

./build/apps/encoder video_file_path | ./build/apps/decoder

To see the name and type of each option, search for #options in apps/decoder.cpp. You'll see an array called opts in the function ParseConfig. Each element of the array corresponds to an option and contains the name and type of the option.

To see the default values of the options, search for #default-cfg in apps/decoder.cpp. You'll see some functions containing the default values.

Future Direction

  • Address oversegmentation by merging regions
  • Implement entropy coding
  • Derive prediction error images and compress them using the JPEG pipeline
  • Compress motion vectors using entropy coding
  • Adaptive tuning of certain parameters
  • Implement network streaming
  • Eliminate dependence on OpenCV

References

  • Multimedia Systems: Algorithms, Standards, and Industry Practices
    • By Parag Havaldar and Gerard Medioni
  • Video Processing and Communications
    • By Yao Wang, Jörn Ostermann, and Ya-Qin Zhang
