This repo collects AI-related research.
Name | Description | Links | Publish Time |
---|---|---|---|
Behavior Vision Suite | BEHAVIOR Vision Suite: Customizable Dataset Generation via Simulation | Project website | 2024 |
Name | Description | Links | Publish Time |
---|---|---|---|
huggingface/lerobot | State-of-the-art Machine Learning for Real-World Robotics in Pytorch | Github | 2024 |
TidyBot | A household cleanup robot done by StanfordAILab. | GitHub | 2023 |
Eureka | Human-Level Reward Design via Coding Large Language Models. Uses LLMs such as GPT-4 to perform in-context evolutionary optimization over reward code; the resulting rewards are used to learn complex low-level manipulation tasks such as dexterous pen spinning. | Github | 2023 |
NOIR | Neural Signal Operated Intelligent Robots for Everyday Activities. Stanford University | Project website | 2023 |
robotics-survey/Awesome-Robotics-Foundation-Models | This repository is largely based on the following paper: Foundation Models in Robotics: Applications, Challenges, and the Future By Stanford University, Princeton University, UT Austin, NVIDIA, Scaled Foundations, Google DeepMind, TU Berlin, Shanghai Jiao Tong University | Github | 2023 |
JeffreyYH/robotics-fm-survey | Survey paper on foundation models for robotics. Paper: Toward General-Purpose Robots via Foundation Models: A Survey and Meta-Analysis. By CMU, Bosch Center for AI, SAIR Lab, Georgia Tech, FAIR at Meta, UC San Diego, Google DeepMind | Github | 2023 |
Name | Description | Links | Publish Time |
---|---|---|---|
mPLUG-DocOwl | Modularized Multimodal Large Language Model for Document Understanding. By Alibaba Group | Github | 2024 |
Name | Description | Links | Publish Time |
---|---|---|---|
DeepSeek-VL | An open-source Vision-Language (VL) Model designed for real-world vision and language understanding applications. DeepSeek-VL possesses general multimodal understanding capabilities, capable of processing logical diagrams, web pages, formula recognition, scientific literature, natural images, and embodied intelligence in complex scenarios. | Github | 2024 |
An Introduction to Vision-Language Modeling | An Introduction to Vision-Language Modeling. By Meta. | URL | 2024 |
Name | Description | Links | Publish Time |
---|---|---|---|
TBC-TJU/MetaBCI | China’s first open-source platform for non-invasive brain computer interface. The project of MetaBCI is led by Prof. Minpeng Xu from Tianjin University, China. | Github | 2022 |
Name | Description | Links | Publish Time |
---|---|---|---|
Awesome-LLMs-Datasets | Summarize existing representative LLMs text datasets. | Github | 2024 |
Name | Description | Links | Publish Time |
---|---|---|---|
mathvista | A benchmark designed to combine challenges from diverse mathematical and visual tasks. By UCLA and Microsoft Research | Project website | 2023 |
hallucination-leaderboard | Leaderboard Comparing LLM Performance at Producing Hallucinations when Summarizing Short Documents. | Github | 2023 |
GAIA | A benchmark for General AI Assistants. By Meta-FAIR, Meta-GenAI, HuggingFace and AutoGPT | Project website | 2023 |
microsoft/promptbench | A Unified Library for Evaluating and Understanding Large Language Models. | Github | 2023 |
Name | Description | Links | Publish Time |
---|---|---|---|
Summarization is (Almost) Dead | Our findings indicate a clear preference among human evaluators for LLM-generated summaries over human-written summaries and summaries generated by fine-tuned models. | https://arxiv.org/pdf/2309.09558.pdf | 2023 |
Name | Description | Links | Publish Time |
---|---|---|---|
VoiceCraft | VoiceCraft is a token-infilling neural codec language model that achieves state-of-the-art performance on both speech editing and zero-shot text-to-speech (TTS) on in-the-wild data including audiobooks, internet videos, and podcasts. To clone or edit an unseen voice, VoiceCraft needs only a few seconds of reference. | Github | 2024 |
Mega-TTS 2 | Input text and reference audio, clone the timbre of the reference audio to generate speech corresponding to the text. By Zhejiang University and ByteDance. Paper: https://arxiv.org/abs/2307.07218 | URL | 2024 |
NaturalSpeech 3 | Zero-Shot Speech Synthesis with Factorized Codec and Diffusion Models. By Microsoft Research Asia. Paper: https://arxiv.org/abs/2403.03100 | URL | 2024 |
BASE TTS | BASE TTS is the largest TTS model to date, trained on 100K hours of public domain speech data, achieving a new state-of-the-art in speech naturalness. By Amazon. Paper: https://arxiv.org/abs/2402.08093 | URL | 2024 |
metavoice-src | Foundational model for human-like, expressive TTS. Zero-shot cloning for American & British voices, with 30s reference audio. | Github | 2024 |
Bark | Multilingual. Demo: https://huggingface.co/spaces/suno/bark Paper: https://arxiv.org/abs/2209.03143 | Github | 2023 |
XTTS | Multilingual. Demo: https://huggingface.co/spaces/coqui/xtts | Github | 2021 |
OpenVoice | ZH + EN. Demo: https://huggingface.co/spaces/myshell-ai/OpenVoice Paper: https://arxiv.org/abs/2312.01479 | Github | 2023 |
TorToiSe TTS | English. Demo: https://huggingface.co/spaces/Manmay/tortoise-tts Paper: https://arxiv.org/abs/2305.07243 | Github | 2022 |
GPT-SoVITS | Multilingual | Github | |
EmotiVoice | ZH + EN | Github | 2023 |
MeloTTS | High-quality multilingual text-to-speech library by MyShell.ai. Supports English, Spanish, French, Chinese, Japanese and Korean. | Github | 2024 |
Tacotron 2 | English. Paper: https://arxiv.org/abs/1712.05884 | Unofficial Repo: Github, GDrive | |
Silero | EN + DE + ES + EA | Github | |
StyleTTS 2 | English. Demo: https://huggingface.co/spaces/styletts2/styletts2 Paper: https://arxiv.org/abs/2306.07691 | Github | 2023 |
Amphion | Demo: https://huggingface.co/amphion Paper: https://arxiv.org/abs/2312.09911 | Github | 2023 |
VALL-E | Paper: https://arxiv.org/abs/2301.02111 | Unofficial Repo: Github | 2023 |
Piper | Multilingual | Github | |
WhisperSpeech | English, Polish. Demo | Github | 2023 |
HierSpeech++ | KR + EN. Demo: https://huggingface.co/spaces/LeeSangHoon/HierSpeech_TTS Paper: https://arxiv.org/abs/2311.12454 | Github | 2023 |
Glow-TTS | English. Demo: https://jaywalnut310.github.io/glow-tts-demo/index.html Paper: https://arxiv.org/abs/2005.11129 | Github | 2020 |
xVASynth | Multilingual. Demo: https://store.steampowered.com/app/1765720/xVASynth/ Paper: https://arxiv.org/abs/2009.14153 | Github | 2023 |
IMS-Toucan | Multilingual. Demo: https://huggingface.co/spaces/Flux9665/IMS-Toucan Paper: https://arxiv.org/abs/2206.12229 | Github | 2023 |
Matcha-TTS | English. Demo: https://huggingface.co/spaces/shivammehta25/Matcha-TTS Paper: https://arxiv.org/abs/2309.03199 | Repo | 2023 |
RAD-TTS | English. Paper: https://openreview.net/pdf?id=0NQwnnwAORi | Github | 2022 |
MahaTTS | English + Indic. Demo: Colab | Github | 2023 |
Neural-HMM TTS | English. Demo: https://shivammehta25.github.io/Neural-HMM/ Paper: https://arxiv.org/abs/2108.13320 | Repo | 2021 |
pflowTTS | English. Paper: https://openreview.net/pdf?id=zNA7u7wtIN | Unofficial Repo | 2023 |
Pheme | English. Demo: https://huggingface.co/spaces/PolyAI/pheme Paper: https://arxiv.org/abs/2401.02839 | Github | 2024 |
TTTS | ZH. Demo: https://colab.research.google.com/github/adelacvg/ttts/blob/master/demo.ipynb | Github | |
VITS/MMS-TTS | English. Demo: https://huggingface.co/spaces/kakao-enterprise/vits Paper: https://arxiv.org/abs/2106.06103 | Github | 2021 |
OverFlow TTS | English. Demo: https://shivammehta25.github.io/OverFlow/ Paper: https://arxiv.org/abs/2211.06892 | Github | 2022 |
Name | Description | Links | Publish Time |
---|---|---|---|
AnyText | Multilingual Visual Text Generation And Editing. By Alibaba Group | Github | 2023 |
InstantID | InstantID is a new state-of-the-art tuning-free method to achieve ID-preserving generation with only a single image, supporting various downstream tasks. | Github | 2023 |
apple/ml-mgie | Guiding Instruction-based Image Editing via Multimodal Large Language Models. By Apple. | Github | 2024 |
lllyasviel/IC-Light | IC-Light is a project to manipulate the illumination of images. Demo: https://huggingface.co/spaces/lllyasviel/IC-Light | Github | 2024 |
Name | Description | Links | Publish Time |
---|---|---|---|
MusePose | MusePose is a diffusion-based and pose-guided virtual human video generation framework. By Tencent. | Github | 2024 |
ProPainter | Improving Propagation and Transformer for Video Inpainting. S-Lab, Nanyang Technological University | Github | 2023 |
Emu Edit/Emu Video | Emu Edit is an AI image generation model that supports modifying local content of images through text instructions; Emu Video is an AI video generation model that likewise supports text-based modification of local content in videos. | Project website | 2023 |
PixelDance | A novel approach based on diffusion models that incorporates image instructions for both the first and last frames in conjunction with text instructions for video generation. By ByteDance Research | Project website | 2023 |
MagicDance | Realistic Human Dance Video Generation with Motions & Facial Expressions Transfer. By University of Southern California | Github | 2023 |
TencentARC/MotionCtrl | A Unified and Flexible Motion Controller for Video Generation | Github | 2023 |
DreaMoving | A Human Video Generation Framework based on Diffusion Models. By Alibaba Group | Github | 2023 |
magicvideov2 | Multi-Stage High-Aesthetic Video Generation by ByteDance | URL | 2024 |
Boximator | Generating Rich and Controllable Motions for Video Synthesis. By ByteDance | URL | 2024 |
fudan-generative-vision/champ | Controllable and Consistent Human Image Animation with 3D Parametric Guidance | Github | 2024 |
TaoHuUMD/SurMo | Surface-based 4D Motion Modeling for Dynamic Human | Github | 2024 |
ToonCrafter | A research paper for generative cartoon interpolation | Github | 2024 |
Name | Description | Links | Publish Time |
---|---|---|---|
V-Express | V-Express aims to generate a talking head video under the control of a reference image, an audio, and a sequence of V-Kps images. By Tencent. | Github | 2024 |
InstructAvatar | InstructAvatar: Text-Guided Emotion and Motion Control for Avatar Generation. By Peking University | Project website | 2024 |
X-LANCE/AniTalker | Animate Vivid and Diverse Talking Faces through Identity-Decoupled Facial Motion Encoding | Github | 2024 |
VASA-1 | Lifelike Audio-Driven Talking Faces Generated in Real Time. By Microsoft. Paper: https://arxiv.org/abs/2404.10667 | Project Website | 2024 |
GeneFace | Generalized and High-Fidelity 3D Talking Face Synthesis. Zhejiang University, ByteDance | Github | 2023 |
GAIA | Zero-shot talking avatar generation, which synthesizes natural talking videos from speech and a single portrait image. GAIA (Generative AI for Avatar) eliminates the domain priors in talking avatar generation. By Microsoft | Project Website | 2023 |
Name | Description | Links | Publish Time |
---|---|---|---|
PKU-YuanGroup/Video-LLaVA | Video-LLaVA: Learning United Visual Representation by Alignment Before Projection | Github | 2023 |
Name | Description | Links | Publish Time |
---|---|---|---|
cleanlab | The standard data-centric AI package for data quality and machine learning with messy, real-world data and labels. | Github | |
Name | Description | Links | Publish Time |
---|---|---|---|
DMV3D | Denoising Multi-View Diffusion using 3D Large Reconstruction Model. A single-stage approach for high-quality text-to-3D generation and single-image reconstruction in 30s. By Adobe, Stanford, etc | Project website | 2023 |
Make-A-Character | High Quality Text-to-3D Character Generation within Minutes. By Alibaba | Github | 2023 |
Name | Description | Links | Publish Time |
---|---|---|---|
open-mmlab/mmdetection | MMDetection is an open source object detection toolbox based on PyTorch. | Github | |
AILab-CVC/YOLO-World | Real-Time Open-Vocabulary Object Detection. By Tencent. | Github | 2024 |
LiheYoung/Depth-Anything | Depth Anything: Unleashing the Power of Large-Scale Unlabeled Data. Foundation Model for Monocular Depth Estimation. By The University of Hong Kong, TikTok, Zhejiang Lab, and Zhejiang University | Github | 2024 |
t-rex | Towards Generic Object Detection via Text-Visual Prompt Synergy. | Github | 2024 |
Name | Description | Links | Publish Time |
---|---|---|---|
CodeFormer | Towards Robust Blind Face Restoration with Codebook Lookup Transformer (NeurIPS 2022). By S-Lab, Nanyang Technological University | Github | 2023 |
Name | Description | Links | Publish Time |
---|---|---|---|
Upscale-A-Video | Upscale-A-Video is a diffusion-based model that upscales videos by taking the low-resolution video and text prompts as inputs. S-Lab, Nanyang Technological University | Github | 2023 |
ComfyUI-SUPIR | SUPIR upscaling wrapper for ComfyUI | Github | 2024 |
APISR | APISR: Anime Production Inspired Real-World Anime Super-Resolution (CVPR 2024). APISR aims at restoring and enhancing low-quality low-resolution anime images and video sources with various degradations from real-world scenarios. | Github | 2024 |
Name | Description | Links | Publish Time |
---|---|---|---|
OutfitAnyone | Outfit Anyone: Ultra-high quality virtual try-on for Any Clothing and Any Person. Institute for Intelligent Computing, Alibaba Group | Github | 2023 |
OOTDiffusion | Official implementation of OOTDiffusion: Outfitting Fusion based Latent Diffusion for Controllable Virtual Try-on | Github, Demo: https://ootd.ibot.cn/ | 2024 |
ViViD | ViViD: Video Virtual Try-on using Diffusion Models. By Alibaba | Github | 2024 |
Name | Description | Links | Publish Time |
---|---|---|---|
StemGen | StemGen: A music generation model that listens. By ByteDance Inc. | Project Website | 2023 |
Name | Description | Links | Publish Time |
---|---|---|---|
Retrieval-Augmented Generation for Large Language Models: A Survey | Shanghai Research Institute for Intelligent Autonomous Systems | URL | 2023 |
Name | Description | Links | Publish Time |
---|---|---|---|
surya | Surya is a multilingual document OCR toolkit. It can do: Accurate line-level text detection | Github | 2024 |
Name | Description | Links | Publish Time |
---|---|---|---|
sally-sh/vsp-llm | Visual Speech Processing incorporated with LLMs. Paper: https://arxiv.org/abs/2402.15151v1 | Github | 2024 |
Name | Description | Links | Publish Time |
---|---|---|---|
NationalGAILab/HoT | Hourglass Tokenizer for Efficient Transformer-Based 3D Human Pose Estimation | Github | 2024 |
Name | Description | Links | Publish Time |
---|---|---|---|
GeneOH-Diffusion | Towards Generalizable Hand-Object Interaction Denoising via Denoising Diffusion | Github | 2024 |
Efficient-Large-Model/VILA | VILA - a multi-image visual language model with training, inference and evaluation recipe, deployable from cloud to edge (Jetson Orin and laptops) | Github | 2024 |
If you like this project, you can support us with a tip. Thank you for your support!