SCOUT: Task- and Context-Modulated Attention

This is a repository for the following models and data:

  • SCOUT and SCOUT+ models for task- and context-aware driver gaze prediction;
  • corrected and annotated ground truth for DR(eye)VE dataset;
  • extra annotations for drivers' actions and context for DR(eye)VE, BDD-A, and LBW datasets.

More information can be found in the papers listed in the Citation section below.

SCOUT

SCOUT is a model for drivers' gaze prediction that uses task and context information (represented as a set of numeric values and text labels) to modulate the output of the model, simulating top-down attentional mechanisms.

[Figure: SCOUT model diagram]

Since the model is aware of the driver's actions and context, it can anticipate maneuvers and attend to the relevant elements of the context, unlike bottom-up models, which are more reactive. Qualitative results of SCOUT and two state-of-the-art models demonstrate this on scenarios from DR(eye)VE involving maneuvers at intersections. For example, when the driver turns at an unsignalized intersection or merges, the model correctly identifies the intersecting roads or the neighbouring lanes, respectively, as areas that the driver should examine.

[Figure: qualitative samples]

SCOUT+

SCOUT+ is an extension of SCOUT that uses a map and route images instead of task and context labels, which is more similar to the information available to the human driver.

[Figure: SCOUT+ model diagram]

SCOUT+ achieves results similar to those of SCOUT, without relying on precise labels and vehicle sensor information.

[Figure: qualitative samples]

Annotations for DR(eye)VE, BDD-A, and LBW datasets

The extra_annotations folder contains additional annotations for the datasets, as described below.

DR(eye)VE

Corrected and annotated gaze data

extra_annotations/DReyeVE/gaze_data contains .txt files (one for each video in the dataset) with the following columns:

  • frame_etg - frame index of the eye-tracking glasses (ETG) video;
  • frame_gar - frame index of the rooftop camera (GAR) video;
  • X, Y - gaze coordinates in the ETG video;
  • X_gar, Y_gar - gaze coordinates in the GAR video;
  • event_type - type of data point: fixation, saccade, blink, or error;
  • code - timestamp;
  • loc - text labels for gaze location: scene (windshield), in-vehicle (with subcategories such as speedometer, dashboard, passenger, mirrors, etc.), out-of-frame (gaze is outside the GAR camera view), and NA (for blinks, saccades, and errors).

Note that in these files, ETG and GAR videos are temporally realigned. As a result, the correspondences between ETG and GAR frame indices are different from the original files supplied with DR(eye)VE. We recomputed all homographies between pairs of ETG and GAR frames (available here) and manually corrected all outliers.
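
As a minimal illustration (not part of the repository), the per-video files can be loaded with pandas; the file name below is hypothetical, and the whitespace-separated layout with a header row is an assumption to check against the actual files.

import pandas as pd

# Hypothetical file name; assumes whitespace-separated columns with a header row.
gaze = pd.read_csv("extra_annotations/DReyeVE/gaze_data/01.txt", sep=r"\s+")

# Keep only fixations that land on the scene (windshield).
scene_fix = gaze[(gaze["event_type"] == "fixation") & (gaze["loc"] == "scene")]

# Gaze coordinates in the rooftop (GAR) camera view for a given GAR frame.
print(scene_fix.loc[scene_fix["frame_gar"] == 1000, ["X_gar", "Y_gar"]])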

New ground truth saliency maps

To generate the new saliency maps for DR(eye)VE, we did the following:

  • filtered out saccades, blinks, and fixations on the car interior;
  • moved fixations that fell outside the scene frame bounds to the image boundary, to preserve the direction and elevation of the driver's gaze;
  • re-aggregated fixations over a 1 s interval (±12 frames) around each frame using a motion-compensated saliency method based on optical flow.

For more details see scripts/DReyeVE_ground_truth.
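
The full pipeline, including the optical-flow-based motion compensation, lives in scripts/DReyeVE_ground_truth; the snippet below is only a simplified sketch of the final aggregation step (fixation points to a blurred saliency map), with placeholder frame size and blur width.

import numpy as np
from scipy.ndimage import gaussian_filter

def fixations_to_saliency(fixations, height=1080, width=1920, sigma=60):
    """Turn a list of (x, y) fixation coordinates into a saliency map.

    Out-of-bounds coordinates are clipped to the image boundary, mirroring
    how fixations outside the scene frame are handled above.
    """
    sal = np.zeros((height, width), dtype=np.float32)
    for x, y in fixations:
        xi = int(np.clip(round(x), 0, width - 1))
        yi = int(np.clip(round(y), 0, height - 1))
        sal[yi, xi] += 1.0
    sal = gaussian_filter(sal, sigma=sigma)  # spread each fixation into a Gaussian blob
    return sal / sal.max() if sal.max() > 0 else sal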

Maps

SCOUT+ uses street graphs from OpenStreetMap and Valhalla to map-match the GPS coordinates to the street network. See scripts/maps/README.md for more details.

BDD-A

  • extra_annotations/BDD-A/video_labels.xlsx contains video-level labels indicating the recording time, time of day, location, weather, and quality issues;
  • extra_annotations/BDD-A/exclude_videos.json is a list of videos that are excluded from training/evaluation due to missing data or recording quality issues (a loading sketch is shown after this list);
  • extra_annotations/BDD-A/vehicle_data contains Excel spreadsheets with GPS and heading data, as well as annotations for maneuvers and intersections (see the next section);
  • extra_annotations/BDD-A/route_maps contains .png images of OpenStreetMap maps of the local area around the route recorded in each video. See scripts/maps/README.md for more details.
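
A minimal sketch of applying the exclusion list, assuming exclude_videos.json is a flat JSON list of video ids:

import json

with open("extra_annotations/BDD-A/exclude_videos.json") as f:
    excluded = set(json.load(f))  # assumption: a flat list of video ids

def keep(video_id):
    """Return True if the video should be used for training/evaluation."""
    return video_id not in excluded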

LBW

  • extra_annotations/LBW/video_labels.xlsx contains video-level labels indicating the time of day, location, and weather for each video;
  • extra_annotations/LBW/train_test.json is the train/val/test split used in our experiments;
  • extra_annotations/LBW/gaze_data is a set of Excel spreadsheets with gaze information with the following fields:
    • subj_id, vid_id, frame_id - subject, video, and frame ids;
    • segm_id - segment id (in LBW, some frames are missing; frames with consecutive ids belong to the same segment);
    • X, Y - gaze location in the image plane;
    • left and right eye coordinates in 2D and 3D.
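
As an illustration only, the spreadsheets can be read with pandas (requires openpyxl); the file name below is hypothetical, and the column names follow the list above. Grouping by segm_id recovers the contiguous segments:

import pandas as pd

gaze = pd.read_excel("extra_annotations/LBW/gaze_data/subj01_vid01.xlsx")

# Frames within a segment are contiguous; a gap in frame_id starts a new segm_id.
for segm_id, seg in gaze.groupby("segm_id"):
    print(segm_id, seg["frame_id"].min(), seg["frame_id"].max(), len(seg))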

Task and context annotations for all three datasets

We used a combination of processing and manual labeling to identify maneuvers (lane changes and turns) and intersections for each route. This information has been added to the vehicle data for each video in every dataset.

We converted the vehicle information in BDD-A to match the format of DR(eye)VE. Since LBW does not provide vehicle data, it was approximated and is saved in the same format.

Task and context annotations are saved with the vehicle data in extra_annotations/<dataset>/vehicle_data, which contains Excel spreadsheets (one for each video in the dataset) with the following columns:

  • frame - frame id;
  • speed - ego-vehicle speed (km/h);
  • acc - ego-vehicle acceleration (m/s²) derived from speed;
  • course - ego-vehicle heading;
  • lat, lon - original GPS coordinates;
  • lat_m, lon_m - map-matched GPS coordinates;
  • lat action - labels for lateral actions (left/right turn, left/right lane change, U-turn);
  • context - type of intersection (signalized, unsignalized), ego-vehicle priority (right-of-way, yield), and starting frame (the frame where the driver first looked towards the intersection). The three values are separated by semicolons, e.g. unsignalized;right-of-way;1731. A minimal parsing sketch is shown below.
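
As a rough sketch (assuming the spreadsheets can be read with pandas/openpyxl and that frames without maneuvers or intersections have empty cells), the context column can be split into its three parts; the file name is hypothetical and the column names follow the list above.

import pandas as pd

veh = pd.read_excel("extra_annotations/DReyeVE/vehicle_data/01.xlsx")

# Split "unsignalized;right-of-way;1731" into its three components.
ctx = veh["context"].dropna().str.split(";", expand=True)
ctx.columns = ["intersection_type", "priority", "start_frame"]

# Frames where a lateral maneuver is annotated.
maneuvers = veh.loc[veh["lat action"].notna(), ["frame", "lat action"]]
print(ctx.head())
print(maneuvers.head())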

Dataset utility functions

Utility functions for DR(eye)VE, LBW, BDD-A, and MAAD allow printing various dataset statistics and creating data structures for evaluation.

There are also functions to convert gaze and vehicle data from the different datasets to the common format described above.

See data_utils/README.md for more information.

Installation and running instructions

Setting up data

  • Download DR(eye)VE dataset following the instructions on the official webpage

  • Download BDD-A dataset following the instructions on the official webpage

  • Create environment variables DREYEVE_PATH for DR(eye)VE and BDDA_PATH for BDD-A (e.g. add the line export DREYEVE_PATH=/path/to/dreyeve/ to the ~/.bashrc file)

  • Extract frames from DR(eye)VE or BDD-A (see scripts folder, requires ffmpeg)

  • Download the new ground truth from here and extract the archives inside extra_annotations/DReyeVE/new_ground_truth/. Copy the new ground truth to the DR(eye)VE dataset using scripts/copy_DReyeVE_gt.sh.

Installing the models

The instructions below use Docker. To build the container, use the script in the docker folder:

docker/build_docker.py

Update the paths to the datasets (DR(eye)VE or BDD-A), extra_annotations, and SCOUT code folders in the docker/run_docker.py script. Then run the script:

docker/run_docker.py

Note: see comments in the script for available command line options.

If you prefer not to use Docker, the dependencies are listed in docker/requirements.txt.

Training the models

To use the pretrained Video Swin Transformer, download pretrained weights by running download_weights.sh inside the pretrained_weights folder.

To train the model, run the following inside the docker container:

python3 train.py --config <config_yaml> --save_dir <save_dir>

--save_dir is the path where the trained model and results will be saved. If it is not provided, a directory named with the current datetime stamp will be created automatically.

See comments in the configs/SCOUT.yaml and configs/SCOUT+.yaml for available model parameters.

Testing the models

To test a trained model, run:

python3 test.py --config_dir <path_to_dir> --evaluate --save_images

--config_dir is the path to the trained model directory, which must contain the config file and checkpoints.

--evaluate, if specified, evaluates predictions for the best checkpoint and saves the results in an Excel file in the provided config_dir folder.

--save_images, if specified, saves the predicted saliency maps to the config_dir/results/ folder.

Pretrained weights

The following pretrained weights are available here:

  • SCOUT (with task) trained on DR(eye)VE or BDD-A

  • SCOUT+ (with map) trained on DR(eye)VE or BDD-A

To use pretrained weights, download them and place them in train_runs/best_model/.

Note on the KLD metric

The implementation of KL divergence in the DR(eye)VE metrics code produces incorrect results. The script test_saliency_metrics.py demonstrates discrepancies between the DR(eye)VE implementation and two other KLdiv implementations. For evaluating SCOUT and other models, we follow the Fahimi & Bruce implementation. See also the supplementary materials for more details.
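
For reference, a common form of the distribution-based KL divergence used in saliency evaluation is sketched below (both maps normalized to sum to 1, with a small epsilon for numerical stability); this follows the general formulation used in Fahimi & Bruce-style code rather than copying the repository's evaluation script.

import numpy as np

def kldiv(gt_map, pred_map, eps=1e-7):
    """KL divergence between ground-truth and predicted saliency maps."""
    gt = gt_map.astype(np.float64) / (gt_map.sum() + eps)
    pred = pred_map.astype(np.float64) / (pred_map.sum() + eps)
    return float(np.sum(gt * np.log(eps + gt / (pred + eps))))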

Citation

If you used models or data from this repository, please consider citing these papers:

@inproceedings{2024_IV_SCOUT,
    author = {Kotseruba, Iuliia and Tsotsos, John K.},
    title = {Understanding and modeling the effects of task and context on drivers' gaze allocation},
    booktitle = {IV},
    year = {2024}
}

@inproceedings{2024_IV_SCOUT+,
    author = {Kotseruba, Iuliia and Tsotsos, John K.},
    title = {{SCOUT+: Towards practical task-driven drivers’ gaze prediction}},
    booktitle = {IV},
    year = {2024}
}

@inproceedings{2024_IV_data,
    author = {Kotseruba, Iuliia and Tsotsos, John K.},
    title = {Data limitations for modeling top-down effects on drivers’ attention},
    booktitle = {IV},
    year = {2024}
}

References for the DR(eye)VE, BDD-A, MAAD, and LBW datasets:

@article{2018_PAMI_Palazzi,
    author = {Palazzi, Andrea and Abati, Davide and Calderara, Simone and Solera, Francesco and Cucchiara, Rita},
    title = {{Predicting the driver's focus of attention: The DR(eye)VE Project}},
    journal = {IEEE TPAMI},
    volume = {41},
    number = {7},
    pages = {1720--1733},
    year = {2018}
}

@inproceedings{2018_ACCV_Xia,
    author = {Xia, Ye and Zhang, Danqing and Kim, Jinkyu and Nakayama, Ken and Zipser, Karl and Whitney, David},
    title = {Predicting driver attention in critical situations},
    booktitle = {ACCV},
    pages = {658--674},
    year = {2018}
}

@inproceedings{2021_ICCVW_Gopinath,
    author = {Gopinath, Deepak and Rosman, Guy and Stent, Simon and Terahata, Katsuya and Fletcher, Luke and Argall, Brenna and Leonard, John},
    title = {{MAAD: A Model and Dataset for "Attended Awareness" in Driving}},
    booktitle = {ICCVW},
    pages = {3426--3436},
    year = {2021}
}


@inproceedings{2022_ECCV_Kasahara,
    author = {Kasahara, Isaac and Stent, Simon and Park, Hyun Soo},
    title = {{Look Both Ways: Self-supervising driver gaze estimation and road scene saliency}},
    booktitle = {ECCV},
    pages = {126--142},
    year = {2022}
}
