
This is the 10th Place Solution for the Kaggle LLM Science Exam Competition

Dataset

  • First of all, we want to express our gratitude to all the amazing Kagglers who have contributed to creating a larger collection of public datasets.
  • We generated around 150k samples with GPT-3.5 based on a large number of Wikipedia page contents, and combined them with the 99k Dataset, 60k Dataset, 17k MMLU Dataset and 70k Dataset (a minimal generation sketch follows this list).
  • In total this gave us around 360k samples, but because of time constraints we only used approximately 303k of them to fine-tune our models.
  • Here: Code To Generate Dataset and Whole Dataset
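The snippet below is only a rough sketch of the GPT-3.5 generation step, assuming the legacy openai (<1.0) Python client; the prompt wording and helper name are illustrative, and the real code is in the linked generation repository.

```python
# Hypothetical sketch of generating one multiple-choice question from a
# Wikipedia passage with GPT-3.5 (legacy openai<1.0 client); prompt wording
# and function name are illustrative, not the competition's actual code.
import openai

PROMPT = (
    "Based on the following Wikipedia text, write one multiple-choice question "
    "with five options (A-E) and indicate the correct answer.\n\nText:\n{passage}"
)

def generate_question(passage: str) -> str:
    response = openai.ChatCompletion.create(
        model="gpt-3.5-turbo",
        messages=[{"role": "user", "content": PROMPT.format(passage=passage)}],
        temperature=0.7,
    )
    # The raw model output still needs to be parsed into question/options/answer.
    return response["choices"][0]["message"]["content"]
```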

Sentence Transformer Model

  • We tried all-MiniLM-L6, all-MiniLM-L12, gte-small and bge-small-en.

  • We tested the different models when our public leaderboard score was 0.843.

    | Model          | Public LB before | Public LB after |
    | -------------- | ---------------- | --------------- |
    | all-MiniLM-L6  | 0.843            | (baseline)      |
    | all-MiniLM-L12 | 0.843            | 0.837           |
    | gte-small      | 0.843            | 0.851           |
    | bge-small-en   | 0.843            | 0.822           |

    Finally, we chose gte-small as our sentence transformer model; a minimal sketch of building an index with it follows this list.

  • Here: Notebook To Create Your Own Index.
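A minimal sketch of what such an index-building step can look like with gte-small, assuming a FAISS flat inner-product index over normalized embeddings; the passages and file name are placeholders, and the linked notebook contains the actual pipeline.

```python
# Minimal index-building sketch (assumptions: FAISS flat inner-product index,
# placeholder passages and file name); see the linked notebook for the real code.
import faiss
import numpy as np
from sentence_transformers import SentenceTransformer

encoder = SentenceTransformer("thenlper/gte-small")  # the model chosen above

passages = ["First Wikipedia passage ...", "Second Wikipedia passage ..."]  # placeholder corpus
embeddings = np.asarray(
    encoder.encode(passages, normalize_embeddings=True), dtype="float32"
)

# Inner product over L2-normalized vectors is equivalent to cosine similarity.
index = faiss.IndexFlatIP(embeddings.shape[1])
index.add(embeddings)
faiss.write_index(index, "wikipedia_gte_small.index")
```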

Retrieval

  • We improved our models' performance by extracting new contexts for each dataset, using the parameters pages=10 and sentences=30.
  • To ensure that the searched documents contain more relevant knowledge when running on the private test set, we extracted the contents of Wikipedia pages directly from their HTML and re-clustered them to obtain additional articles, thereby avoiding inconsistencies in the indexed text.
  • We combined three types of retrievers to obtain the final result (a retrieval sketch follows this list):
    1. Extracting context from the entire Wikipedia page.
    2. and 3. The two other retrievers proposed by MB. We expanded the target_articles value in MB's shared notebook: we manually extracted the titles from our dataset (which served as a useful validation) and added them to target_articles, and we selected more clusters to surface more relevant data, increasing the corpus from 270k to approximately 6 million rows. Both TF-IDF and sentence-transformer techniques are used for context extraction.
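Below is a minimal sketch of the TF-IDF plus sentence-transformer context extraction mentioned above; the equal blending weights, function name, and handling of the sentences parameter are assumptions, not the exact competition code.

```python
# Minimal retrieval sketch (assumed): score candidate sentences with both
# TF-IDF and gte-small embeddings, then keep the top `sentences` as context.
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity
from sentence_transformers import SentenceTransformer

encoder = SentenceTransformer("thenlper/gte-small")

def extract_context(question: str, candidate_sentences: list, sentences: int = 30) -> str:
    # TF-IDF similarity between the question and every candidate sentence.
    vectorizer = TfidfVectorizer(stop_words="english")
    tfidf = vectorizer.fit_transform(candidate_sentences + [question])
    tfidf_scores = cosine_similarity(tfidf[-1], tfidf[:-1]).ravel()

    # Sentence-transformer similarity with normalized embeddings.
    emb = encoder.encode(candidate_sentences + [question], normalize_embeddings=True)
    st_scores = emb[-1] @ emb[:-1].T

    # Equal-weight blend (assumed); keep the highest-scoring sentences in order.
    scores = 0.5 * tfidf_scores + 0.5 * st_scores
    top_idx = sorted(np.argsort(-scores)[:sentences])
    return " ".join(candidate_sentences[i] for i in top_idx)
```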

Training

  • Training Dataset: We divided the entire dataset into multiple chunks to improve the robustness of our ensemble. For example, we split the dataset into four parts and trained four models, each on the larger dataset formed by combining three of the four portions. We also trained a model on the entire dataset for comparison. Here: 222k-133k-5fold, 222k-148k-3fold, 303k-202k-3fold.

  • Validation Dataset: We randomly selected 2k samples from the data we generated with GPT-3.5 and combined them with the 200 samples of the original training set to create our validation dataset. In the end, we chose our final submissions according to their performance on a 500-sample validation set.

  • Parameters: We are incredibly grateful that our teammate has 8 * A100 GPUs. This allowed us to set max_length to either 512 or 768 when training different models, whichever was more suitable. We also trained some models in fp32, which took approximately four days for 222k samples on a single A100 GPU (a rough configuration sketch follows this item). Here: 222k-133k-models, 222k-148k-models, 303k-202k-models, 222k-768-fp32-model, 303k+50k-finetune-model
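As a rough illustration only, the Hugging Face TrainingArguments below mirror the values in the hyperparameter table further down; the output directory and scheduler wiring are assumptions rather than the exact training script.

```python
# Rough TrainingArguments sketch mirroring the hyperparameter table below;
# output_dir and the scheduler wiring toward lr_end are assumptions.
from transformers import TrainingArguments

training_args = TrainingArguments(
    output_dir="outputs",             # placeholder
    per_device_train_batch_size=2,
    per_device_eval_batch_size=1,
    learning_rate=4e-6,
    num_train_epochs=3,               # 2 or 3 depending on the run
    gradient_accumulation_steps=16,
    weight_decay=1e-4,
    lr_scheduler_type="polynomial",   # decays toward lr_end=8e-7 (assumed wiring)
    report_to="none",
)
# max_length (512 or 768) is applied at tokenization time, not here.
```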

  • Adversarial Weight Perturbation (AWP): We incorporated adversarial weight perturbation into training to boost the robustness of each individual model, which further reduced our training loss; a minimal AWP sketch appears after the parameter table below. Here: Training Notebook.

    | Parameter                   | Value      |
    | --------------------------- | ---------- |
    | per_device_train_batch_size | 2          |
    | per_device_eval_batch_size  | 1          |
    | learning_rate               | 4e-6       |
    | lr_end                      | 8e-7       |
    | num_train_epochs            | 2 or 3     |
    | gradient_accumulation_steps | 16         |
    | weight_decay                | 1e-4       |
    | MAX_INPUT                   | 512 or 768 |
    | dropout_rate                | 0          |
    | awp_lr                      | 0.1        |
    | awp_eps                     | 1e-4       |
    | awp_start_epoch             | 0.5        |

    The parameters above are the best we found when testing on a single A100 GPU.
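The sketch below follows the common Kaggle AWP recipe (back up the weights, step them in the adversarial direction bounded by awp_eps, backpropagate the adversarial loss, restore); class and method names are illustrative, and the linked training notebook contains the actual implementation.

```python
# Minimal AWP sketch (common Kaggle recipe, names illustrative): perturb the
# weights adversarially for an extra backward pass, then restore them.
import torch

class AWP:
    def __init__(self, model, adv_lr=0.1, adv_eps=1e-4, adv_param="weight"):
        self.model = model
        self.adv_lr = adv_lr        # awp_lr in the table above
        self.adv_eps = adv_eps      # awp_eps in the table above
        self.adv_param = adv_param  # only perturb parameters whose name contains this
        self.backup = {}

    def attack_backward(self, batch):
        self._save()                          # back up the current weights
        self._attack_step()                   # move them in the adversarial direction
        adv_loss = self.model(**batch).loss   # loss under the perturbed weights
        adv_loss.backward()                   # accumulate adversarial gradients
        self._restore()                       # put the original weights back
        return adv_loss

    def _attack_step(self, eps=1e-6):
        for name, param in self.model.named_parameters():
            if param.requires_grad and param.grad is not None and self.adv_param in name:
                grad_norm = torch.norm(param.grad)
                data_norm = torch.norm(param.detach())
                if grad_norm != 0 and not torch.isnan(grad_norm):
                    # Step along the gradient, clamped to an adv_eps ball around the backup.
                    r_at = self.adv_lr * param.grad / (grad_norm + eps) * (data_norm + eps)
                    param.data.add_(r_at)
                    param.data = torch.min(
                        torch.max(param.data, self.backup[name] - self.adv_eps),
                        self.backup[name] + self.adv_eps,
                    )

    def _save(self):
        for name, param in self.model.named_parameters():
            if param.requires_grad and param.grad is not None and self.adv_param in name:
                self.backup[name] = param.data.clone()

    def _restore(self):
        for name, param in self.model.named_parameters():
            if name in self.backup:
                param.data = self.backup[name]
        self.backup = {}
```

In a typical training loop this is invoked after the normal forward/backward pass, once training has passed awp_start_epoch and before optimizer.step().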

  • Ensemble Model: We used Optuna to search ensemble weights on the 500-sample validation set; a minimal weight-search sketch follows the table below. We achieved a score of 0.915 on the GPT-3.5 500-sample set and 0.909 on the GPT-4 500-sample set. Because the test data was generated with GPT-3.5, we chose the combination with the higher score on the GPT-3.5 set. Submissions: Submission1, Submission2.

    | Combination | Ratio | Public LB | Private LB | Chosen |
    | ----------- | ----- | --------- | ---------- | ------ |
    | 222k-full/checkpoint-12300 + 133k fold5/checkpoint-7100 | 2 : 1 | 0.929 | 0.921 | No |
    | 222k-full/checkpoint-12300 + 768_fp32_222k/checkpoint-27800 + 133k fold5/checkpoint-7100 | 2 : 1.5 : 1 | 0.930 | 0.923 | Yes |
    | 222k_finetune/checkpoint-800 + 222k_relevant/checkpoint-12700 | 1 : 1 | 0.930 | 0.924 | No |
    | 148k fold1/checkpoint-5300 + 202k_fold3_512/checkpoint-12500 + 202k_fold2_512/checkpoint-12500 + 768-300k-best-para-epoch2/checkpoint-15600 | 1 : 1 : 1 : 1 | 0.929 | 0.922 | No |
    | 222k-full/checkpoint-12300 + 768_fp32_222k/checkpoint-27800 + 133k fold5/checkpoint-7100 | 2 : 1.5 : 1 | 0.929 | 0.924 | No |
    | 133k fold5/checkpoint-7100 + 303k-50k-finetune/checkpoint-2800 + singlegpu-200k-run-sft-awp-0-1-ls-dropout-4e6/checkpoint-11300 + 768-300k-best-para-epoch2/checkpoint-18900 | 1 : 1 : 1 : 1 | 0.930 | 0.923 | Yes |
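As a sketch of how such a weight search can be set up with Optuna: the `val_probs`/`val_labels` placeholders and the MAP@3 helper below are assumptions, not the actual ensembling script.

```python
# Hypothetical Optuna weight search: maximize MAP@3 of the blended predictions
# on the validation set. `val_probs` (list of (n, 5) arrays, one per model)
# and `val_labels` (correct option indices) are placeholders.
import numpy as np
import optuna

def map_at_3(probs, labels):
    # Competition metric: mean average precision over the top-3 predictions.
    top3 = np.argsort(-probs, axis=1)[:, :3]
    score = 0.0
    for rank in range(3):
        score += np.mean(top3[:, rank] == labels) / (rank + 1)
    return score

def objective(trial, val_probs, val_labels):
    weights = [trial.suggest_float(f"w{i}", 0.0, 2.0) for i in range(len(val_probs))]
    blended = sum(w * p for w, p in zip(weights, val_probs))
    return map_at_3(blended, val_labels)

# Example usage with placeholder predictions:
# study = optuna.create_study(direction="maximize")
# study.optimize(lambda t: objective(t, val_probs, val_labels), n_trials=200)
# print(study.best_params)
```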