$\textbf{Lumina-T2X}$: Transforming Text into Any Modality, Resolution, and Duration via Flow-based Large Diffusion Transformers

📰 News

[2024-05-21] 🚀🚀🚀 Lumina-Next-T2I now supports a higher-order solver and can generate images in just 10 steps without any distillation. Try our demos. The inference code will be released soon.
[2024-05-18] 🤩🤩🤩 We released training scripts for Lumina-T2I 5B. README
[2024-05-16] ❗❗❗ We have converted the .pth weights to .safetensors weights. Please pull the latest code and use demo.py for inference.
[2024-05-14] 🔥🔥🔥 Lumina-Next now supports simple text-to-music generation (examples), high-resolution (1024×4096) panorama generation conditioned on text (examples), and 3D point cloud generation conditioned on labels (examples).
[2024-05-13] 🔥🔥🔥 We added examples demonstrating that Lumina-T2X supports multilingual prompts, including prompts containing emojis.
[2024-05-12] 🤩🤩🤩 We are excited to release our Lumina-Next-T2I model (checkpoint), which uses a 2B Next-DiT model as the backbone and Gemma-2B as the text encoder. Try it out at demo1 & demo2 & demo3.
[2024-05-10] 🔥🔥🔥 We released the technical report on arXiv.
[2024-05-09] 🚀🚀🚀 We released Lumina-T2A (Text-to-Audio) Demos. Examples
[2024-04-29] 🔥 We released the 5B model checkpoint and demo built upon it for text-to-image generation.
[2024-04-25] 🔥 Added support for 720P video generation with arbitrary aspect ratios. Examples 🚀🚀🚀
[2024-04-19] Demo examples released.
[2024-04-05] Code released for Lumina-T2I.
[2024-04-01] We released the initial version of Lumina-T2I for text-to-image generation.

🚀 Quick Start

Warning

Since we are updating the code frequently, please pull the latest code:

git pull origin main

To help you try our model quickly, we built several versions of the GUI demo site.

Lumina-Next-T2I 2B model demo:

[node1] [node2] [node3]

For more details about training and inference, please refer to Lumina-T2I and Lumina-Next-T2I.

Warning

Lumina-T2X employs FSDP for training large diffusion models. FSDP shards parameters, optimizer states, and gradients across GPUs. Thus, at least 8 GPUs are required for full fine-tuning of the Lumina-T2X 5B model. Parameter-efficient fine-tuning of Lumina-T2X will be released soon.
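To give a rough feel for why sharding matters at this scale, the sketch below estimates the per-GPU memory taken by parameters, gradients, and optimizer states for a 5B-parameter model. The figure of 16 bytes per parameter is an assumption (a common rule of thumb for mixed-precision Adam: 2 bytes for weights, 2 for gradients, 12 for fp32 master weights and the two Adam moments), not a number from this repository. Activations and buffers are ignored, and `training_state_gib` is a hypothetical helper.

```python
def training_state_gib(num_params, num_gpus, bytes_per_param=16):
    """Approximate per-GPU memory (GiB) for params, grads, and optimizer
    states when all three are sharded evenly across num_gpus.

    bytes_per_param=16 is an assumed rule of thumb for mixed-precision Adam;
    activations and temporary buffers are NOT included.
    """
    total_bytes = num_params * bytes_per_param
    return total_bytes / num_gpus / 2**30


# 5e9 parameters stands in for the 5B model mentioned above.
for gpus in (1, 2, 4, 8):
    print(f"{gpus} GPU(s): {training_state_gib(5e9, gpus):.1f} GiB per GPU")
```

Under these assumptions the training state alone is roughly 75 GiB on a single device and a little over 9 GiB per device when sharded across 8, which leaves room for activations.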

Install in your environment:

pip install git+https://github.com/Alpha-VLLM/Lumina-T2X

📑 Open-source Plan

Introduction

We introduce the $\textbf{Lumina-T2X}$ family, a series of text-conditioned Diffusion Transformers (DiT) capable of transforming textual descriptions into vivid images, dynamic videos, detailed multi-view 3D images, and synthesized speech. At the core of Lumina-T2X lies the Flow-based Large Diffusion Transformer (Flag-DiT)—a robust engine that supports up to 7 billion parameters and extends sequence lengths to 128,000 tokens. Drawing inspiration from Sora, Lumina-T2X integrates images, videos, multi-views of 3D objects, and speech spectrograms within a spatial-temporal latent token space, and can generate outputs at any resolution, aspect ratio, and duration.
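As a minimal sketch of what "flow-based" means here, the example below uses the common straight-line (rectified-flow style) path between noise and data. This section does not spell out the exact path or loss used by Flag-DiT, so treat that choice as an assumption. All function names are hypothetical, and a short list of floats stands in for a latent.

```python
import random


def interpolate(noise, data, t):
    """Point on the straight path from noise (t=0) to data (t=1)."""
    return [(1.0 - t) * n + t * d for n, d in zip(noise, data)]


def velocity_target(noise, data):
    """Regression target: the constant velocity of the straight path."""
    return [d - n for n, d in zip(noise, data)]


def flow_matching_loss(predicted_velocity, noise, data):
    """Mean squared error between a predicted velocity and the target."""
    target = velocity_target(noise, data)
    return sum((p - v) ** 2 for p, v in zip(predicted_velocity, target)) / len(target)


def euler_sample(noise, velocity_fn, num_steps=10):
    """Integrate dx/dt = velocity_fn(x, t) from t=0 to t=1 with Euler steps."""
    x = list(noise)
    dt = 1.0 / num_steps
    for step in range(num_steps):
        t = step * dt
        v = velocity_fn(x, t)
        x = [xi + dt * vi for xi, vi in zip(x, v)]
    return x


random.seed(0)
data = [0.5, -1.0, 2.0]  # stand-in for a latent vector
noise = [random.gauss(0.0, 1.0) for _ in data]

# An "oracle" velocity field that already knows the target;
# a trained network would only approximate it.
oracle = lambda x, t: velocity_target(noise, data)
sample = euler_sample(noise, oracle)
```

With the oracle field the Euler integration lands on `data` up to floating-point error. A learned velocity field is only approximately straight, which is where the number of steps and the order of the solver start to matter.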

🌟 Features:

Flow-based Large Diffusion Transformer (Flag-DiT): Lumina-T2X adopts the flow matching formulation and is equipped with many advanced techniques, such as RoPE, RMSNorm, and KQ-norm, demonstrating faster training convergence, stable training dynamics, and a simplified pipeline.
Any Modalities, Resolution, and Duration within One Framework:
1. $\textbf{Lumina-T2X}$ can encode any modality, including images, videos, multi-views of 3D objects, and spectrograms, into a unified 1-D token sequence at any resolution, aspect ratio, and temporal duration.
2. By introducing the [nextline] and [nextframe] tokens, our model can support resolution extrapolation, i.e., generating images/videos with out-of-domain resolutions not encountered during training, such as images from 768x768 to 1792x1792 pixels.
Low Training Resources: Our empirical observations indicate that employing larger models, high-resolution images, and longer-duration video clips can significantly accelerate the convergence speed of diffusion transformers. Moreover, by training on meticulously curated text-image and text-video pairs featuring high aesthetic quality frames and detailed captions, our $\textbf{Lumina-T2X}$ model learns to generate high-resolution images and coherent videos with minimal computational demands. Remarkably, the default Lumina-T2I configuration, equipped with a 5B Flag-DiT and a 7B LLaMA as the text encoder, requires only 35% of the computational resources of PixArt-$\alpha$.
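The unified 1-D token sequence with [nextline] and [nextframe] markers described in the list above can be illustrated with a small sketch. The exact placement of the markers in the real model (for example, whether a single image ends with [nextframe]) is an assumption here. Strings stand in for latent patch tokens, and `flatten_latents` and `unflatten` are hypothetical helpers.

```python
NEXTLINE = "[nextline]"
NEXTFRAME = "[nextframe]"


def flatten_latents(frames):
    """Flatten frames (each a list of rows of patch tokens) into one 1-D list."""
    sequence = []
    for frame in frames:
        for row in frame:
            sequence.extend(row)
            sequence.append(NEXTLINE)  # marks the end of a row of patches
        sequence.append(NEXTFRAME)  # marks the end of a frame
    return sequence


def unflatten(sequence):
    """Recover the frame/row layout from the markers alone."""
    frames, frame, row = [], [], []
    for token in sequence:
        if token == NEXTLINE:
            frame.append(row)
            row = []
        elif token == NEXTFRAME:
            frames.append(frame)
            frame = []
        else:
            row.append(token)
    return frames


wide = [[["a", "b", "c"], ["d", "e", "f"]]]    # one frame, 2 rows x 3 patches
tall = [[["a", "b"], ["c", "d"], ["e", "f"]]]  # one frame, 3 rows x 2 patches

print(flatten_latents(wide))
print(flatten_latents(tall))
```

Both inputs hold the same six patches, yet the sequences differ in where the markers fall. The layout is therefore carried by the sequence itself rather than by a fixed grid size, which is the property that makes unseen resolutions and durations expressible.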

📽️ Demo Examples

Text-to-Image Generation

Panorama Generation

Text-to-Video Generation

720P Videos:

Prompt: The majestic beauty of a waterfall cascading down a cliff into a serene lake.

video_720p_1.mp4

video_720p_2.mp4

Prompt: A stylish woman walks down a Tokyo street filled with warm glowing neon and animated city signage. She wears a black leather jacket, a long red dress, and black boots, and carries a black purse. She wears sunglasses and red lipstick. She walks confidently and casually. The street is damp and reflective, creating a mirror effect of the colorful lights. Many pedestrians walk about.

video_tokyo_woman.mp4

360P Videos:

video_360p.mp4

Text-to-3D Generation

multi_view.mp4

Point Cloud Generation

Text-to-Audio Generation

Note

Hover over the playbar and click the audio button to unmute the clip.

Prompt: Semiautomatic gunfire occurs with slight echo

Generated Audio:

semiautomatic_gunfire_occurs_with_slight_echo.mp4

Groundtruth:

semiautomatic_gunfire_occurs_with_slight_echo_gt.mp4

Prompt: A telephone bell rings

Generated Audio:

a_telephone_bell_rings.mp4

Groundtruth:

a_telephone_bell_rings_gt.mp4

Prompt: An engine running followed by the engine revving and tires screeching

Generated Audio:

an_engine_running_followed_by_the_engine_revving_and_tires_screeching.mp4

Groundtruth:

an_engine_running_followed_by_the_engine_revving_and_tires_screeching_gt.mp4

Prompt: Birds chirping with insects buzzing and outdoor ambiance

Generated Audio:

birds_chirping_repeatedly.mp4

Groundtruth:

birds_chirping_repeatedly_gt.mp4

Text-to-Music Generation

Prompt: An electrifying ska tune with prominent saxophone riffs, energetic e-guitar and acoustic drums, lively percussion, soulful keys, groovy e-bass, and a fast tempo that exudes uplifting energy.

Generated Music:

electrifying.ska.mp4

Prompt: A high-energy synth rock/pop song with fast-paced acoustic drums, a triumphant brass/string section, and a thrilling synth lead sound that creates an adventurous atmosphere.

Generated Music:

high_energy.song.mp4

Prompt: An uptempo electronic pop song that incorporates digital drums, digital bass and synthpad sounds.

Generated Music:

uptempo-electronic.mp4

Prompt: A medium-tempo digital keyboard song with a jazzy backing track featuring digital drums, piano, e-bass, trumpet, and acoustic guitar.

Generated Music:

medium-tempo.mp4

Prompt: This low-quality folk song features groovy wooden percussion, bass, piano, and flute melodies, as well as sustained strings and shimmering shakers that create a passionate, happy, and joyful atmosphere.

Generated Music:

low-quality-folk.mp4

Multilingual Generation

We present three multilingual capabilities of Lumina-Next-2B.

Generating Images conditioned on Chinese poems:

Generating Images with multilingual prompts:

Generating Images with emojis:

⚙️ Diverse Configurations

We support diverse configurations, including text encoders, DiTs of different parameter sizes, inference methods, and VAE encoders. Additionally, we offer features such as 1D-RoPE, image enhancement, and more.
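For readers unfamiliar with the 1D-RoPE option, here is a minimal sketch of rotary position embedding over a 1-D position index. It follows the standard RoPE formulation. How Lumina-T2X wires it into attention is not described in this section, and the `base` value of 10000 is the commonly used default, assumed here. `rope_1d` and `dot` are hypothetical helpers.

```python
import math


def rope_1d(vector, position, base=10000.0):
    """Rotate consecutive pairs of `vector` by angles proportional to `position`."""
    dim = len(vector)
    if dim % 2 != 0:
        raise ValueError("vector length must be even")
    rotated = []
    for i in range(0, dim, 2):
        theta = position * base ** (-i / dim)  # lower frequency for later pairs
        c, s = math.cos(theta), math.sin(theta)
        x, y = vector[i], vector[i + 1]
        rotated.extend([x * c - y * s, x * s + y * c])
    return rotated


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


q = [0.3, -1.2, 0.7, 0.1]
k = [1.0, 0.5, -0.4, 0.9]

# The attention score depends only on the relative offset (2 in both cases).
near = dot(rope_1d(q, 3), rope_1d(k, 1))
far = dot(rope_1d(q, 12), rope_1d(k, 10))
```

Because each pair is only rotated, vector norms are unchanged and `near` equals `far` up to floating-point error. Scores depending on relative rather than absolute position is what makes RoPE a natural fit for sequences whose length varies with resolution and duration.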

📄 Citation

@article{gao2024lumina,
      title={Lumina-T2X: Transforming Text into Any Modality, Resolution, and Duration via Flow-based Large Diffusion Transformers}, 
      author={Gao, Peng and Zhuo, Le and Lin, Ziyi and Liu, Dongyang and Du, Ruoyi and Luo, Xu and Qiu, Longtian and Zhang, Yuhang and others},
      journal={arXiv preprint arXiv:2405.05945},
      year={2024}
}
