A Framework for Multisensory Foresight for Embodied Agents

Abstract

Predicting future sensory states is crucial for learning agents such as robots, drones, and autonomous vehicles. In this paper, we couple multiple sensory modalities with exploratory actions and propose a predictive neural network architecture to address this problem. Most existing approaches rely on large, manually annotated datasets, or use visual data as the only modality. In contrast, the unsupervised method presented here uses multi-modal perception to predict future visual frames. As a result, the proposed model is more comprehensive and can better capture the spatio-temporal dynamics of the environment, leading to more accurate visual frame prediction. A second novelty of our framework is the use of sub-networks dedicated to anticipating future haptic, audio, and tactile signals. The framework was tested and validated with a dataset containing 4 sensory modalities (vision, haptic, audio, and tactile) on a humanoid robot performing 9 behaviors multiple times on a large set of objects. While vision is the dominant modality, using the additional non-visual modalities improves the accuracy of predictions.

Environment Setup

pip install -r requirements.txt

Dataset Preparation

Description: https://www.eecs.tufts.edu/~ramtin/pages/2014/CY101Dataset.html

Download: https://tufts.app.box.com/v/DeepMultiSensoryDataset

Preparation:

$ python ./data/make_data.py \
        --data_dir path-to-downloaded-data-directory \
        --out_dir path-to-output-data-directory

Usage

# A comment after a line-continuation backslash breaks the command,
# so the options are collected in a bash array, where comments are allowed.
args=(
        --data_dir path-to-data                # directory containing the data
        --channels 3                           # number of input image channels
        --height 64                            # image height
        --width 64                             # image width
        --output_dir path-to-checkpoint-dir    # directory for model weights
        --pretrained_model path-to-checkpoint  # path of a pretrained model to initialize from
        --sequence_length 10                   # sequence length + context frames
        --context_frames 4                     # number of ground-truth frames to pass in at the start
        --model CDNA                           # model architecture: CDNA | DNA | STP
        --num_masks 10                         # number of masks, usually 1 for DNA, 10 for CDNA, STP
        --device cuda                          # device: cuda | cpu
        --cdna_kern_size 5                     # CDNA kernel size
        --haptic_layer 16                      # HAPTIC_LAYER
        --use_haptic                           # give the haptic input to the model
        --behavior_layer 9                     # number of chosen behaviors
        --use_behavior                         # give the behavior input to the model
        --audio_layer 16                       # AUDIO_LAYER
        --use_audio                            # give the audio input to the model
        --vibro_layer 16                       # VIBRO_LAYER
        --use_vibro                            # give the vibrotactile input to the model
        --aux                                  # employ auxiliary tasks during training
        --print_interval 100                   # number of iterations between loss printouts
        --schedsamp_k 400                      # k hyperparameter for scheduled sampling, -1 for no scheduled sampling
        --batch_size 32                        # batch size for training
        --learning_rate 0.001                  # base learning rate
        --epochs 30                            # total number of training epochs
)
python ./main.py "${args[@]}"
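
The `--schedsamp_k` option controls scheduled sampling. As a rough illustration of what such a `k` typically does, the sketch below uses the inverse-sigmoid decay that is common in video-prediction code: early in training most sequences in a batch are fed ground-truth frames, and the share decays toward zero as training proceeds. This is an assumption about the schedule, not a copy of this repository's code, and the function name `num_ground_truth` is hypothetical.

```python
import math


def num_ground_truth(iteration: int, batch_size: int, k: float) -> int:
    """Number of batch items that receive ground-truth frames at this iteration.

    Assumed schedule (inverse-sigmoid decay): p = k / (k + exp(iteration / k)).
    """
    if k == -1:
        # Scheduled sampling disabled: in this sketch the model always
        # consumes its own predictions once the context frames are used up.
        return 0
    # Clamp the exponent so math.exp cannot overflow late in training.
    exponent = min(iteration / k, 700.0)
    prob = k / (k + math.exp(exponent))
    return int(round(batch_size * prob))


if __name__ == "__main__":
    for it in (0, 1000, 2000, 3000, 5000):
        print(it, num_ground_truth(it, batch_size=32, k=400))
```

Under this assumed schedule, a larger `k` keeps ground-truth frames in the batch for longer before the model has to rely on its own predictions.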

Architecture

The architecture of the proposed model consists of 4 feature encoders (left) and 4 prediction heads (right), one per modality, and 1 fusion module (middle) that merges the representations of the different modalities.

Pipeline of the visual prediction module: the visual feature extractor (left) and the visual prediction network (right).
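
To make the encoder, fusion, and head layout concrete, here is a minimal structural sketch in plain Python. It is not the repository's model: random linear maps stand in for the real sub-networks, fusion is assumed to be a map over the concatenated features, and the class name `MultisensorySketch`, the feature size, and the input sizes are all hypothetical.

```python
import random


def make_linear(in_dim, out_dim, rng):
    """Return a function applying a fixed random linear map (stand-in for a sub-network)."""
    weights = [[rng.uniform(-0.1, 0.1) for _ in range(in_dim)] for _ in range(out_dim)]

    def apply(x):
        return [sum(w * v for w, v in zip(row, x)) for row in weights]

    return apply


class MultisensorySketch:
    """One encoder and one prediction head per modality, with a shared fusion step."""

    def __init__(self, input_dims, feature_dim=16, seed=0):
        rng = random.Random(seed)
        self.modalities = list(input_dims)
        # One feature encoder per modality.
        self.encoders = {m: make_linear(d, feature_dim, rng) for m, d in input_dims.items()}
        # Fusion module: assumed here to act on the concatenated features.
        fused_dim = feature_dim * len(input_dims)
        self.fusion = make_linear(fused_dim, fused_dim, rng)
        # One prediction head per modality, mapping back to that modality's size.
        self.heads = {m: make_linear(fused_dim, d, rng) for m, d in input_dims.items()}

    def predict_next(self, observations):
        features = []
        for m in self.modalities:  # fixed order keeps the concatenation consistent
            features.extend(self.encoders[m](observations[m]))
        fused = self.fusion(features)
        return {m: self.heads[m](fused) for m in self.modalities}


# Hypothetical flattened input sizes, chosen only for illustration.
input_dims = {"vision": 12, "haptic": 7, "audio": 5, "vibro": 3}
model = MultisensorySketch(input_dims)
frame = {m: [0.5] * d for m, d in input_dims.items()}
prediction = model.predict_next(frame)
print({m: len(v) for m, v in prediction.items()})
```

The point of the sketch is only the data flow: every modality contributes to the fused representation, and every head predicts from that shared representation.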

Results

Dataset Visualization

Visualization of the haptic (left), audio (middle), and vibrotactile (right) modalities when the robot drops a bottle.

Training the Network with All Behaviors

Illustrative Example

Sharpness of predicted images when the robot arm performs the lift (left) and push (right) behaviors.

Quantitative Reconstruction Performance

Ablation study on sensory input

Ablation study on behavior input

Training the Network with Individual Behavior

Investigating the performance of different combinations of modalities per individual behavior
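
A study over modality combinations can be scripted by enumerating subsets of the optional modality flags from the Usage section. The sketch below only builds the command lines. It assumes vision is always enabled, and it does not show how a single behavior is selected, since that is not documented here. Which combinations were actually evaluated is likewise not stated, so the full set of eight is an assumption, as are the helper names.

```python
from itertools import combinations

# Optional non-visual modality flags, taken from the Usage section.
AUX_FLAGS = ("--use_haptic", "--use_audio", "--use_vibro")


def modality_flag_sets(flags=AUX_FLAGS):
    """Return every subset of the optional flags, starting with vision only."""
    subsets = []
    for size in range(len(flags) + 1):
        for combo in combinations(flags, size):
            subsets.append(list(combo))
    return subsets


def build_command(flag_set, data_dir="path-to-data", output_dir="path-to-checkpoint-dir"):
    """Build an argument list for main.py; the paths are placeholders."""
    return ["python", "./main.py", "--data_dir", data_dir,
            "--output_dir", output_dir, *flag_set]


for flag_set in modality_flag_sets():
    print(" ".join(build_command(flag_set)))
```

Each printed line can be run as is, or the argument list can be passed to `subprocess.run` to launch the runs from Python.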

Predicting Future Frames of Auxiliary Modalities