Video Foundation Models & Data for Multimodal Understanding
Code release for "Training a Large Video Model on a Single Machine in a Day"
FreeVA: Offline MLLM as Training-Free Video Assistant
SoccerNet Game State Reconstruction: End-to-End Athlete Tracking and Identification on a Minimap (CVPR24 - CVSports workshop)
[CVPR 2024] MA-LMM: Memory-Augmented Large Multimodal Model for Long-Term Video Understanding
Official code for MiniGPT4-video
Official Repo for CVPR 2024 Paper "FACT: Frame-Action Cross-Attention Temporal Modeling for Efficient Fully-Supervised Action Segmentation"
[IJCNN 2024] Unifying Global and Local Scene Entities Modelling for Precise Action Spotting
A Large Short-video Recommendation Dataset with Raw Text/Audio/Image/Video (invited talk at DeepMind).
[NAACL 2024] Z-GMOT: Zero-shot Generic Multiple Object Tracking
VTC: Improving Video-Text Retrieval with User Comments
Benchmarking Panoptic Video Scene Graph Generation (PVSG), CVPR'23
A curated list of recent diffusion models for video generation, editing, restoration, understanding, etc.
[NAACL 2024] Official Implementation of the paper "Self-Adaptive Sampling for Efficient Video Question Answering on Image-Text Models"
Awesome OVD-OVS - A Survey on Open-Vocabulary Detection and Segmentation: Past, Present, and Future
Official implementation of the paper "ST-LLM: Large Language Models Are Effective Temporal Learners"
Graph learning framework for long-term video understanding
[ICCV 2023] MeViS: A Large-scale Benchmark for Video Segmentation with Motion Expressions
OpenTAD is an open-source temporal action detection (TAD) toolbox based on PyTorch.