Multi-modal Self-Supervised Learning for Autonomous Situation Awareness of Unmanned Systems

Abstract

Visuomotor intelligent agents use vision signals as input to directly predict decisions and actions. Because they require a large corpus of labeled data or environment interactions to reach satisfactory performance, supervised pre-training is often applied to jointly train the perception and control modules in an end-to-end fashion and then transfer them to downstream tasks such as visual navigation and trajectory prediction.

Using " intelligent agent self-driving in extraterrestrial planetary environments" as a case study, the supervised pre-training paradigm suffers from a lack of labeled data and high cost, and inefficient transfer. Dominant self-supervised approaches in computer vision are not applicable due to the lack of translation and view invariance in vision-driven driving tasks, and the input contains irrelevant information for driving. Therefore, the research goal is to design a self-supervised pre-training method applicable to self-driving in extraterrestrial planetary open environments.

Method

Inspired by multimodal learning, we introduce temporal signals such as IMU and odometry to guide the learning of the visual encoder. The visual modality captures the objective conditions for driving decisions, while the temporal-signal modality reflects the driving state and the quality of those decisions. The two are synergistic and complementary: the strong correlation between the modalities makes it theoretically possible to predict semantic information of one modality from the other, while their inherent differences make cross-modal prediction a more challenging and valuable pretext task than within-modality learning.

We propose a cross-modal prediction pre-training method: a visual encoder and a temporal-signal encoder extract features of the two modalities, pseudo-labels for each modality are constructed by clustering the features of the other modality with a scalable K-Means algorithm, and the model is optimized by alternating these clustering and classification steps.
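The sketch below illustrates one round of this procedure in Python. The encoder, data, and hyper-parameter names are hypothetical placeholders; the repository's actual implementation is in main_pretrain.py.

import torch
import torch.nn as nn
import torch.nn.functional as F
from sklearn.cluster import MiniBatchKMeans  # scalable (mini-batch) K-Means

@torch.no_grad()
def extract_features(encoder, batches, device="cuda"):
    # Run the frozen encoder over pre-made batches and stack the (N, D) features.
    encoder.eval()
    return torch.cat([encoder(x.to(device)).cpu() for x in batches])

def crossmodal_round(vision_encoder, sensor_encoder, image_batches, signal_batches,
                     k=16, lr=0.03, device="cuda"):
    # 1) Extract features of both modalities with the current encoders.
    v_feat = extract_features(vision_encoder, image_batches, device)
    s_feat = extract_features(sensor_encoder, signal_batches, device)

    # 2) Cluster each modality's features; the assignments serve as pseudo-labels
    #    for the other modality.
    v_pseudo = torch.as_tensor(MiniBatchKMeans(n_clusters=k).fit_predict(v_feat.numpy()), dtype=torch.long)
    s_pseudo = torch.as_tensor(MiniBatchKMeans(n_clusters=k).fit_predict(s_feat.numpy()), dtype=torch.long)

    # 3) Classification pretext task: each encoder plus a linear head predicts the
    #    cluster assignments produced by the other modality.
    v_head = nn.Linear(v_feat.size(1), k).to(device)
    s_head = nn.Linear(s_feat.size(1), k).to(device)
    params = (list(vision_encoder.parameters()) + list(v_head.parameters())
              + list(sensor_encoder.parameters()) + list(s_head.parameters()))
    opt = torch.optim.SGD(params, lr=lr)
    vision_encoder.train(); sensor_encoder.train()
    sizes = [x.size(0) for x in image_batches]
    for img, sig, v_target, s_target in zip(image_batches, signal_batches,
                                            s_pseudo.split(sizes), v_pseudo.split(sizes)):
        loss = (F.cross_entropy(v_head(vision_encoder(img.to(device))), v_target.to(device))
                + F.cross_entropy(s_head(sensor_encoder(sig.to(device))), s_target.to(device)))
        opt.zero_grad(); loss.backward(); opt.step()
    # Repeating crossmodal_round alternates clustering and classification, as described above.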

Usage

Prerequisites

The main dependencies are as follows:

  • Python == 3.8.16
  • pytorch >= 1.12.1
  • torchvision >= 0.13.1
  • sklearn == 1.2.2
  • pillow == 9.5.0
  • prefetch-generator == 1.0.3
  • tensorboard == 2.12.1
  • seaborn == 0.12.2
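One way to install these with pip is sketched below; the PyPI package names (e.g. scikit-learn for sklearn, torch for pytorch) are assumptions, and GPU builds of PyTorch may require the official install index.

pip install "torch>=1.12.1" "torchvision>=0.13.1" "scikit-learn==1.2.2" \
    "pillow==9.5.0" "prefetch-generator==1.0.3" "tensorboard==2.12.1" "seaborn==0.12.2"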

Self-Supervised Pretraining

This implementation supports multi-GPU DataParallel training, which is faster and simpler; single-GPU training is also supported but not advised.

To run unsupervised pre-training of a ResNet-50 model on ImageNet with a 4-GPU machine, run:

python main_pretrain.py \
  --gpunum 4 \
  --k 16 \
  --model resnet50 \
  --epoch 4 \
  --subepoch 15 \
  --lr 0.03 \
  --batchsize 32 

To test the pretrained models of both modalities, run:

python main_test.py \
  --gpunum 4 \
  --model resnet50 \
  --dir1 ./v1_k16_epoch15_4_pre/vision_encoder_14 \
  --dir2 ./v1_k16_epoch15_4_pre/sensor_encoder_14 

Experiment Result

Few-shot Classification

Loss Curves

CNNs carry the inductive biases suited to the vision modality, but Transformers do not. We therefore initialize the ViTs with weights pretrained on ImageNet so that they can recognize image texture features.
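One way to obtain such an initialization with the torchvision version listed above is sketched here; this is illustrative and not necessarily the repository's exact loading code.

import torch
import torchvision
from torchvision.models import ViT_B_16_Weights

# Build a ViT-B/16 initialized from supervised ImageNet-1k weights (torchvision >= 0.13)
# and replace the classification head so the encoder outputs the pre-logits feature.
vit = torchvision.models.vit_b_16(weights=ViT_B_16_Weights.IMAGENET1K_V1)
vit.heads = torch.nn.Identity()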

[Figures: loss curves for a ResNet trained from scratch, a ViT trained from scratch, and an ImageNet-pretrained ViT]

NMI Criterion
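Normalized Mutual Information (NMI) measures the agreement between two label assignments, for example between the cluster assignments derived from the two modalities. With the scikit-learn version listed above it can be computed as in the following sketch (the arrays are toy placeholders):

from sklearn.metrics import normalized_mutual_info_score

# Toy example: agreement between vision-feature clusters and sensor-feature clusters.
vision_clusters = [0, 0, 1, 1, 2, 2]
sensor_clusters = [0, 0, 1, 2, 2, 2]
print(normalized_mutual_info_score(vision_clusters, sensor_clusters))  # 1.0 means identical partitions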

Visualization

Model Zoo

Vision Backbone   5-shot/%   10-shot/%   Mean/%   Params/M   Download
ResNet-50         56.9       62.2        59.2     25.6       model
ResNet-101        56.1       63.5        59.8     44.5       model
ResNet-152        62.7       67.5        65.1     60.2       model
ViT-B             50.1       55.6        52.9     86.6       model
Swin-T             64.9       70.3        67.6     28.3       model
Swin-S             58.5       64.1        61.3     49.6       model
Swin-B             62.5       69.3        65.9     87.8       model
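A downloaded backbone can then be used to initialize a downstream model. The following is only an illustrative sketch that assumes the checkpoint stores a plain state_dict; the file name is a placeholder and the actual checkpoint format may differ.

import torch
import torchvision

# Build the backbone and restore a downloaded checkpoint (format assumed, not verified).
backbone = torchvision.models.resnet50(weights=None)
state = torch.load("path/to/downloaded_checkpoint.pth", map_location="cpu")
missing, unexpected = backbone.load_state_dict(state, strict=False)
print("missing keys:", missing, "unexpected keys:", unexpected)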
