Fine-tuned Longformer for Summarization of Machine Learning Articles

Check the Poster first

Model

The led-base-7168-ml is a Longformer model fine-tuned on the ML_arxiv dataset, a collection of articles on machine-learning topics, for the long-document summarization task.

The led-base-7168-ml is available on Hugging Face. It is the led-base-16384 pretrained transformer model fine-tuned on the ML_arxiv dataset for the summarization task. The model can generate a coherent and consistent summary of a long machine-learning article (up to 16384 tokens). To use it, try something like the following:

import torch
from transformers import LEDTokenizer, LEDForConditionalGeneration

tokenizer = LEDTokenizer.from_pretrained("bakhitovd/led-base-7168-ml")
model = LEDForConditionalGeneration.from_pretrained("bakhitovd/led-base-7168-ml")
device = "cuda" if torch.cuda.is_available() else "cpu"
model = model.to(device)

article = "... long document ..."

# Tokenize the article; LED accepts inputs of up to 16384 tokens
inputs_dict = tokenizer(article, padding="max_length", max_length=16384, return_tensors="pt", truncation=True)
input_ids = inputs_dict.input_ids.to(device)
attention_mask = inputs_dict.attention_mask.to(device)

# LED expects global attention on at least the first token
global_attention_mask = torch.zeros_like(attention_mask)
global_attention_mask[:, 0] = 1

predicted_abstract_ids = model.generate(input_ids, attention_mask=attention_mask, global_attention_mask=global_attention_mask, max_length=512)
summary = tokenizer.decode(predicted_abstract_ids[0], skip_special_tokens=True)
print(summary)

Alternatively, you can use the summarization.ipynb notebook. It extracts the content of an online article, generates a summary using the fine-tuned led-base-7168-ml model, and displays the summary as HTML in an IPython environment. The notebook uses BeautifulSoup for web scraping, requests for HTTP requests, and PyTorch along with the transformers library for summarization.
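The sketch below approximates that flow, assuming a generic article URL and that the page's prose lives in paragraph tags; the notebook itself may structure these steps differently:

import requests
import torch
from bs4 import BeautifulSoup
from IPython.display import HTML, display
from transformers import LEDTokenizer, LEDForConditionalGeneration

# Placeholder URL; substitute a real article page
url = "https://example.com/some-ml-article"
soup = BeautifulSoup(requests.get(url, timeout=30).text, "html.parser")
article = " ".join(p.get_text(strip=True) for p in soup.find_all("p"))

tokenizer = LEDTokenizer.from_pretrained("bakhitovd/led-base-7168-ml")
model = LEDForConditionalGeneration.from_pretrained("bakhitovd/led-base-7168-ml")

inputs = tokenizer(article, return_tensors="pt", truncation=True, max_length=16384)
global_attention_mask = torch.zeros_like(inputs.attention_mask)
global_attention_mask[:, 0] = 1

summary_ids = model.generate(inputs.input_ids, attention_mask=inputs.attention_mask, global_attention_mask=global_attention_mask, max_length=512)
summary = tokenizer.decode(summary_ids[0], skip_special_tokens=True)

# Render the summary as HTML in the notebook
display(HTML(f"<h3>Summary</h3><p>{summary}</p>"))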

Dataset

ML_arxiv is a dataset of long, structured documents obtained from the arXiv open-access repository. It is a subset of the scientific papers dataset introduced in "A Discourse-Aware Attention Model for Abstractive Summarization of Long Documents" and widely used for summarization of long scientific documents.

The 'machine learning' articles were extracted by clustering embeddings of article abstracts. First, 2,077 machine-learning-related articles were identified within the scientific papers dataset. Then all articles in the scientific papers dataset were clustered into six clusters, and the cluster closest by cosine similarity to the 2,077 machine-learning-related articles was selected.

It is not guaranteed that ML_arxiv contains only articles about machine learning, but it does contain the 32,621 instances of the scientific papers dataset that are closest, in semantics, vocabulary, and structure, to articles describing machine learning.
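For illustration only, the selection step might look like the sketch below; the embedding model (all-MiniLM-L6-v2) and the placeholder data are assumptions, not the actual construction code:

import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics.pairwise import cosine_similarity
from sentence_transformers import SentenceTransformer

# Placeholders: replace with the real abstracts (at least 6 are needed for k=6)
abstracts = ["abstract of paper 1 ...", "abstract of paper 2 ..."]  # all papers
ml_abstracts = ["abstract of a known ML paper ..."]                 # the 2,077 seed articles

encoder = SentenceTransformer("all-MiniLM-L6-v2")  # assumed embedding model
embeddings = encoder.encode(abstracts)
ml_centroid = encoder.encode(ml_abstracts).mean(axis=0, keepdims=True)

# Cluster all abstracts into 6 groups
kmeans = KMeans(n_clusters=6, random_state=0).fit(embeddings)

# Keep the cluster whose centroid is most cosine-similar to the ML seed centroid
similarities = cosine_similarity(kmeans.cluster_centers_, ml_centroid).ravel()
ml_cluster = int(np.argmax(similarities))
selected = [a for a, label in zip(abstracts, kmeans.labels_) if label == ml_cluster]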

This dataset can be used to train models for summarization of long texts in the machine-learning domain.

You can access the dataset through the Hugging Face datasets library:

https://huggingface.co/datasets/bakhitovd/ML_arxiv
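For example (the split layout is whatever the dataset card defines):

from datasets import load_dataset

dataset = load_dataset("bakhitovd/ML_arxiv")
print(dataset)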

Evaluation

The performance of the fine-tuned model was evaluated using the ROUGE metrics, which measure the overlap between the ground-truth summary and the generated summary in terms of unigrams (ROUGE-1), bigrams (ROUGE-2), and the longest common subsequence (ROUGE-L).
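A minimal way to reproduce this kind of scoring, using the Hugging Face evaluate library (an assumption; the original evaluation code may differ):

import evaluate

rouge = evaluate.load("rouge")
predictions = ["a model-generated summary ..."]  # placeholder
references = ["the ground-truth abstract ..."]   # placeholder
scores = rouge.compute(predictions=predictions, references=references)
print(scores)  # reports rouge1, rouge2, and rougeL, among others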

[Figure: ROUGE evaluation results]
