
中文 | English


Fengshenbang Achievements

Fengshenbang 1.0: The Fengshenbang 1.0 bilingual general paper, which aims to be the Foundation of Chinese Cognitive Intelligence.

BioBART: A generative language model for the biomedical domain, provided by Tsinghua University together with the IDEA Institute. (BioNLP 2022)

UniMC: A unified model for zero-shot scenarios based on labeled datasets. (EMNLP 2022)

FMIT: A single-tower multimodal named entity recognition model based on relative position encoding. (COLING 2022)

UniEX: A Natural Language Understanding Model for Unified Extraction Tasks. (ACL 2023)

Solving Math Word Problems via Cooperative Reasoning induced Language Models. (ACL 2023)

MVP-Tuning: Multi-View Knowledge Retrieval with Prompt Tuning for Commonsense Reasoning. (ACL 2023)

Fengshenbang Major Events

Navigation

Model Information

| Series | Demand | Task | Parameter Scale | Extra |
| --- | --- | --- | --- | --- |
| Ziya | General | AGI | >7B | Ziya has the capabilities of translation, programming, text classification, information extraction, summarization, copy generation, commonsense question answering, and mathematical calculation. |
| Erlangshen | General | NLU | 97M-3.9B | Erlangshen was designed to solve NLU tasks; the largest BERT when publicly released; SOTA on FewCLUE and ZeroCLUE in 2021. |
| Wenzhong | General | NLG | 1B-3.5B | Wenzhong focuses on NLG tasks and provides several generative models of different scales, such as GPT2. |
| Randeng | General | NLT | 770M-5B | Randeng handles natural language transformation (NLT) tasks that convert source text to target text, such as machine translation and text summarization. |
| Taiyi | Special | MultiModal | 87M-1B | Taiyi is applied to cross-modal scenarios, including text-to-image generation, protein structure prediction, speech-text representation, etc. |
| Yuyuan | Special | Domain | 0.1B-3.5B | Yuyuan is applied to specific domains such as healthcare, finance, law, and programming; the largest open-source GPT2 medical model. |
| -TBD- | Special | Exploration | -Unknown- | This series develops experimental NLP models together with various technology companies and universities. Currently includes: Zhouwenwang. |

Download URLs of Fengshenbang

Fengshenbang Model training and fine-tuning code script

Handbook of Fengshenbang

Fengshenbang-LM

Remarkable advances in Artificial Intelligence (AI) have produced great models; in particular, pre-trained foundation models have become an emerging paradigm. In contrast to traditional AI models that must be trained on vast datasets for one or a few scenarios, foundation models can be adapted to a wide range of downstream tasks, greatly lowering the resources needed to get an AI venture off the ground. Moreover, these models are growing rapidly, roughly 10 times larger each year: BERT has 100 million parameters, while GPT-3 has over 100 billion. Many of the forefront challenges in AI, especially generalization ability, are becoming achievable thanks to this inspiring trend.

Foundation models, most notably language models, are dominated by the English-language community. Chinese, despite being the language with the most native speakers in the world, has no systematic research resources to support it, which leaves progress in the Chinese-language domain lagging behind others.

And the world needs an answer for this.

On November 22nd, 2021, Harry Shum, Founder and Chairman of IDEA (International Digital Economy Academy), officially announced the launch of the "Fengshenbang" open-source project: a Chinese-language-driven foundation ecosystem that incorporates pre-trained models, task-specific fine-tuned applications, benchmarks, and datasets.

Fengshenbang Model

"Fengshenbang Model" will open-source a series of NLP-related pre-trained models in all aspects. There are a wide range of research tasks in the NLP community, which can be divided into two categories: general demands and special demands. In general demands, there are common NLP tasks, which are classified into Natural Language Understanding (NLU), Natural Language Generation (NLG), and Natural Language Transformation (NLT). Due to the fast development, NLP community brings special demands to the entire AI community, which are often assigned to MultiModal (MM), Domains and Exploration. We consider all of these tasks and provide models that are fine tuning for downstream tasks, making our base model easy to use for users with limited computing resources. We consider all of these demands and provide models that are fine-tuned for downstream tasks, making our base model easy to use for users with limited computing resources. Moreover, we guarantee that we will optimize the models continuously with new datasets and latest algorithms. We aim to build universal infrastructure for Chinese cognitive intelligence and prevent duplicative construction, and hence save computing resources for the community.


We also call for businesses, universities, and institutions to join the project and build this system of large-scale open-source models collaboratively. We envision that, in the near future, the first choice when a new pre-trained model is needed will be to select the one closest to the desired scale, architecture, and domain from the series, and then continue training it. After the new model is trained, it is added back to the series of open-source models for future use. In this way we build the open-source system iteratively and collaboratively, while individuals obtain the models they need with minimal computing resources.

For a better open-source experience, all models of the Fengshenbang series are synchronized with the Hugging Face community and can be obtained with a few lines of code, as sketched below. Welcome to download and use our models from our IDEA-CCNL repo on Hugging Face.
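
For instance, a minimal sketch of loading one of the Erlangshen checkpoints that appears later in this README (any Fengshenbang checkpoint on the Hub can be loaded the same way):

from transformers import AutoTokenizer, AutoModel

# Download the checkpoint from the Hugging Face Hub and load the tokenizer and encoder
tokenizer = AutoTokenizer.from_pretrained("IDEA-CCNL/Erlangshen-Roberta-110M-Similarity")
model = AutoModel.from_pretrained("IDEA-CCNL/Erlangshen-Roberta-110M-Similarity")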

Ziya

The general-purpose large-scale model series "Ziya" has the capabilities of translation, programming, text classification, information extraction, summarization, copy generation, commonsense question answering, and mathematical calculation. At present, the Ziya general-purpose large model (v1/v1.1) has completed a three-stage training process of large-scale pre-training, multi-task supervised fine-tuning, and learning from human feedback. The Ziya series includes the following models:

Example of Usage

Refer to Ziya-LLaMA-13B-v1

Online Demo

Finetune Example

Refer to ziya_finetune

Inference & Quantization Example

Refer to ziya_inference

Erlangshen

This series focuses on using bidirectional encoder language models to solve multiple natural language understanding tasks. Erlangshen-MegatronBert-1.3B is the largest open-source Chinese model with a BERT structure. It contains 1.3 billion parameters and was trained on 280 GB of data with 32 A100 GPUs for 14 days. It topped the Chinese natural language understanding benchmark FewCLUE on Nov 10th, 2021. Among the FewCLUE tasks, Erlangshen-1.3B beat human performance on CHID (Chinese idiom cloze test) and TNEWS (news classification), and achieved SOTA on CHID, CSLDCP (academic literature classification), and OCNLI (natural language inference), refreshing the records of few-shot learning. We will continue to optimize the Erlangshen series with respect to model scale, knowledge fusion, auxiliary supervision tasks, etc.


Erlangshen-MRC topped the Chinese language comprehension benchmark ZeroCLUE on Jan 24th, 2022. Among the ZeroCLUE tasks, it achieved SOTA on CSLDCP (discipline literature classification), TNEWS (news classification), IFLYTEK (application description classification), CSL (abstract keyword recognition), and CLUEWSC (coreference resolution).


Download the Models

Huggingface Erlangshen-MegatronBert-1.3B

Load the Models

from transformers import MegatronBertConfig, MegatronBertModel
from transformers import BertTokenizer

tokenizer = BertTokenizer.from_pretrained("IDEA-CCNL/Erlangshen-MegatronBert-1.3B")
config = MegatronBertConfig.from_pretrained("IDEA-CCNL/Erlangshen-MegatronBert-1.3B")
model = MegatronBertModel.from_pretrained("IDEA-CCNL/Erlangshen-MegatronBert-1.3B")
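
As a quick sanity check that the weights are loaded correctly, a minimal forward pass might look like the following (the sample sentence is only illustrative):

import torch

# "今天天气真好" = "The weather is really nice today"
inputs = tokenizer("今天天气真好", return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs)
print(outputs.last_hidden_state.shape)  # (batch size, sequence length, hidden size)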

Example of Usage

For the convenience of developers, we offer an example script for downstream finetuning. The script uses the tnews dataset from CLUE.

1. First, modify the MODEL_TYPE and PRETRAINED_MODEL_PATH parameters of the finetune script; other parameters can be adjusted according to your specific device.

MODEL_TYPE=huggingface-megatron_bert
PRETRAINED_MODEL_PATH=IDEA-CCNL/Erlangshen-MegatronBert-1.3B

2. Then, run

sh finetune_classification.sh

Performance on Downstream Tasks

| Model | afqmc | tnews | iflytek | ocnli | cmnli | wsc | csl |
| --- | --- | --- | --- | --- | --- | --- | --- |
| roberta-wwm-ext-large | 0.7514 | 0.5872 | 0.6152 | 0.777 | 0.814 | 0.8914 | 0.86 |
| Erlangshen-MegatronBert-1.3B | 0.7608 | 0.5996 | 0.6234 | 0.7917 | 0.81 | 0.9243 | 0.872 |

Taiyi

Taiyi series models are mainly used in cross-modal scenarios, including text-to-image generation, protein structure prediction, speech-text representation, etc. On November 1, 2022, Fengshenbang released the first Chinese version of the Stable Diffusion model, "Taiyi Stable Diffusion".

Download the Models

Taiyi Stable Diffusion Chinese

Taiyi Stable Diffusion Chinese&English Bilingual

Example of Usage

from diffusers import StableDiffusionPipeline

pipe = StableDiffusionPipeline.from_pretrained("IDEA-CCNL/Taiyi-Stable-Diffusion-1B-Chinese-v0.1").to("cuda")

# Prompt: "A waterfall plunging straight down three thousand feet, oil painting"
prompt = '飞流直下三千尺,油画'
image = pipe(prompt, guidance_scale=7.5).images[0]
image.save("飞流.png")

Performance

Prompts for the example images (left to right): 铁马冰河入梦来, 3D绘画 (armored horses and frozen rivers enter my dreams, 3D painting); 飞流直下三千尺, 油画 (a waterfall plunging straight down three thousand feet, oil painting); 女孩背影, 日落, 唯美插画 (a girl seen from behind at sunset, aesthetic illustration).

Advanced Prompt

Prompts for the example images (left to right): 铁马冰河入梦来, 概念画, 科幻, 玄幻, 3D (armored horses and frozen rivers enter my dreams, concept art, sci-fi, fantasy, 3D); 中国海边城市, 科幻, 未来感, 唯美, 插画 (a Chinese seaside city, sci-fi, futuristic, aesthetic, illustration); 那人却在灯火阑珊处, 色彩艳丽, 古风, 资深插画师作品, 桌面高清壁纸 (yet there she stands where the lantern light is dim, vivid colors, ancient Chinese style, work of a senior illustrator, HD desktop wallpaper).

Handbook for Taiyi

https://github.com/IDEA-CCNL/Fengshenbang-LM/blob/main/fengshen/examples/stable_diffusion_chinese/taiyi_handbook.md

How to finetune

https://github.com/IDEA-CCNL/Fengshenbang-LM/tree/main/fengshen/examples/finetune_taiyi_stable_diffusion

Configure webui

https://github.com/IDEA-CCNL/stable-diffusion-webui/blob/master/README.md

DreamBooth

https://github.com/IDEA-CCNL/Fengshenbang-LM/tree/main/fengshen/examples/stable_diffusion_dreambooth

Fengshen Framework

To make it easy for everyone to use the Fengshenbang models and to participate in the continued training and downstream application of the large models, we simultaneously open-source the user-centered FengShen framework. For details, please also see: Fengshen Framework.

Drawing on other excellent open-source frameworks (including HuggingFace, Megatron-LM, PyTorch Lightning, and DeepSpeed) and combining them with the characteristics of the NLP field, we redesigned FengShen with PyTorch as the base framework and PyTorch Lightning as the pipeline layer. FengShen can be applied to pre-training large models (tens of billions of parameters) on massive data (terabytes) and to fine-tuning on various downstream tasks. Users can enable distributed training and memory-saving techniques through configuration alone, and thus focus on model implementation and innovation. FengShen can also continue training model structures taken directly from HuggingFace, which facilitates domain transfer for users. FengShen provides rich and realistic source code and examples. We will continue to optimize the FengShen framework as the Fengshenbang models are trained and applied. Stay tuned.

Installation

Installing in an existing environment

git clone https://github.com/IDEA-CCNL/Fengshenbang-LM.git
cd Fengshenbang-LM
git submodule init
git submodule update
# The submodule is fs_datasets, which we use to manage the datasets. It is pulled over ssh,
# which may fail if you do not have an ssh key configured on this machine.
# If the pull fails, edit the .gitmodules file and change the ssh address to an https address.
pip install --editable .
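
To verify that the editable install succeeded (assuming the framework is importable as the fengshen package):

python -c "import fengshen"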

Using Docker

We provide a simple Docker image, which contains the torch and CUDA environment needed to run our framework.

sudo docker run --runtime=nvidia --rm -itd --ipc=host --name fengshen fengshenbang/pytorch:1.10-cuda11.1-cudann8-devel
sudo docker exec -it fengshen bash
cd Fengshenbang-LM
# Update the code. The code in docker may not be up to date
git pull
git submodule foreach 'git pull origin master' 
# Now you're ready to use our framework in docker

Pipelines

The Fengshen framework currently wraps various downstream tasks as pipelines, supporting one-click prediction and fine-tuning from the command line. Take text classification as an example:

# predict
fengshen-pipeline text_classification predict --model='IDEA-CCNL/Erlangshen-Roberta-110M-Similarity' --text='今天心情不好[SEP]今天很开心'
[{'label': 'not similar', 'score': 0.9988130331039429}]

# train
fengshen-pipeline text_classification train --model='IDEA-CCNL/Erlangshen-Roberta-110M-Similarity' --datasets='IDEA-CCNL/AFQMC' --gpus=0 --texta_name=sentence1 --strategy=ddp

Get Started with Fengshen in 3 Minutes

Fengshenbang Series Articles

Fengshen Series: Getting Started on Training Large Model with Data Parallelism

Fengshen Series: It is Time to Accelerate your Training Process!

Fengshen Series: Chinese PEGASUS Model Pre-training

Fengshen Series: Just a Simple Finetune, Erlangshen Accidentally Took the First Place

Fengshen Series: Quickly Build Your Algorithm Demo

2022 AIWIN World Artificial Intelligence Innovation Competition: Small Sample Multi-Task Track Winner Solution

Citation

If you are using our resources for your work, please cite our paper:

@article{fengshenbang,
  author    = {Junjie Wang and Yuxiang Zhang and Lin Zhang and Ping Yang and Xinyu Gao and Ziwei Wu and Xiaoqun Dong and Junqing He and Jianheng Zhuo and Qi Yang and Yongfeng Huang and Xiayu Li and Yanghan Wu and Junyu Lu and Xinyu Zhu and Weifeng Chen and Ting Han and Kunhao Pan and Rui Wang and Hao Wang and Xiaojun Wu and Zhongshen Zeng and Chongpei Chen and Ruyi Gan and Jiaxing Zhang},
  title     = {Fengshenbang 1.0: Being the Foundation of Chinese Cognitive Intelligence},
  journal   = {CoRR},
  volume    = {abs/2209.02970},
  year      = {2022}
}

You can also cite our website:

@misc{Fengshenbang-LM,
  title={Fengshenbang-LM},
  author={IDEA-CCNL},
  year={2021},
  howpublished={\url{https://github.com/IDEA-CCNL/Fengshenbang-LM}},
}

Contact

The IDEA-CCNL team has created the Fengshenbang open-source discussion group, where we will post updates and release new Fengshenbang models and articles from time to time. Please scan the QR code below or search "fengshenbang-lm" on WeChat to add the Fengshen space assistant and join the group!

(QR code of the Fengshenbang discussion group)

We are also continuously recruiting, so feel free to send in your resume!


License

Apache License 2.0