
Extractive and Abstractive Text Summarization Techniques using Transformer Models

Text summarization is the process of automatically generating a shortened version of a given document or text. In this era of big data and information overload, text summarization has become an essential tool for processing and presenting information in a concise and meaningful way. There are two main approaches to text summarization: extractive and abstractive.

Extractive Summarization of Web Articles Using Transformer-based Models

Extractive summarization involves selecting the most important sentences or phrases from the original text and presenting them as a summary. The summary consists of verbatim sentences or phrases extracted from the original text, and the goal is to maintain the coherence and meaning of the original content. Extractive summarization is relatively straightforward and requires less linguistic knowledge than abstractive summarization. This report presents a project that aims to perform extractive summarization of web articles using a pre-trained Transformer model available through the Hugging Face library. The process begins with scraping the content from the target URL, preprocessing the data, and chunking the text into smaller segments. The pre-trained summarization pipeline is then applied to each chunk to generate an extractive summary.

Objective:

The objective of this project is to develop a workflow for extractive summarization of web articles, allowing users to quickly understand the most important information without reading the entire article.

Step-by-Step Process:

a. Installed required packages, including transformers, BeautifulSoup, requests, and other optional libraries for parsing web pages.

b. Imported necessary libraries, such as the summarization pipeline from transformers, BeautifulSoup, and requests.

c. Provided a target URL to scrape the content.

d. Requested the website content using the requests library.

e. Parsed the webpage content with BeautifulSoup, extracting headings and text.

f. Removed labels from the extracted text and joined it into a single string.

g. Normalized punctuation and split the article into sentences.

h. Defined a maximum chunk size and chunked the text into smaller segments.

i. Loaded the summarization pipeline from the pre-trained Transformer model.

j. Applied the summarization pipeline to each chunk, generating a summary with a specified maximum and minimum length.

k. Joined the separate summaries to create a comprehensive extractive summary, as shown in the sketch below.
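
A minimal sketch of the full workflow under a few assumptions: the URL is a placeholder, the pipeline falls back to its default summarization checkpoint since the report does not name the model, and the chunk size and minimum length are illustrative values (the 80-word maximum matches the per-chunk summaries described below):

```python
import requests
from bs4 import BeautifulSoup
from transformers import pipeline

URL = "https://example.com/article"  # placeholder target URL

# Fetch the page and keep only headings and paragraph text.
page = requests.get(URL)
soup = BeautifulSoup(page.text, "html.parser")
article = " ".join(el.get_text() for el in soup.find_all(["h1", "h2", "p"]))

# Mark sentence boundaries on end punctuation, then split into sentences.
for mark in (".", "!", "?"):
    article = article.replace(mark, mark + "<eos>")
sentences = article.split("<eos>")

# Group sentences into chunks of at most max_chunk words.
max_chunk = 500  # assumed chunk size; tune per article
chunks, current, length = [], [], 0
for sentence in sentences:
    words = len(sentence.split())
    if current and length + words > max_chunk:
        chunks.append(" ".join(current))
        current, length = [], 0
    current.append(sentence)
    length += words
if current:
    chunks.append(" ".join(current))

# Summarize each chunk and join the pieces into one extractive summary.
summarizer = pipeline("summarization")
results = summarizer(chunks, max_length=80, min_length=30, do_sample=False)
print(" ".join(r["summary_text"] for r in results))
```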

Results and Analysis:

The project successfully extracted and summarized a web article using the pre-trained Transformer model. The model produced a summary of up to 80 words for each chunk, and these chunk summaries were combined into a comprehensive extractive summary that focused on the most relevant and important information from the original article.

Optimization:

To optimize the performance of the summarization model, the chunk size, minimum length, and maximum length parameters can be adjusted based on the specific content and summarization requirements. Additionally, domain-specific fine-tuning of the pre-trained model can improve its performance on specific types of articles.
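
As an illustration, swapping in a fine-tuned checkpoint and tightening the length bounds only requires changing the pipeline call; the model id below is hypothetical:

```python
from transformers import pipeline

# Hypothetical domain-specific checkpoint; substitute a real model id.
summarizer = pipeline("summarization", model="my-org/news-summarizer")
results = summarizer(chunks, max_length=60, min_length=20, do_sample=False)
```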

Abstractive Summarization Using the Pegasus-xsum and Pegasus-large Models:

Abstractive summarization, on the other hand, involves generating a summary that may contain new phrases and sentences not present in the original text. The process requires a deeper understanding of the text's content and context, as the summary is generated by rephrasing and restructuring the original text while preserving its meaning. Abstractive summarization is more challenging than extractive summarization but can produce more concise and readable summaries. This report details a project that aims to scrape scientific articles and generate abstractive summaries using the Pegasus-xsum and Pegasus-large models. These models, developed by Google and available through the Hugging Face Transformers library, provide high-quality summaries in a relatively short time. The first step is to scrape the content from the web and preprocess the data before using the pre-trained Pegasus models to create summaries.

Objective:

The objective of this project is to evaluate the performance of Pegasus-xsum and Pegasus-large models in generating abstractive summaries of scientific articles and to compare the results produced by the two models.

Step-by-Step Process:

a. Installed necessary libraries, including PyTorch, sentencepiece, and transformers.

b. Imported the BeautifulSoup library for web scraping.

c. Scraped the content from the target URL using the requests library.

d. Parsed the content with BeautifulSoup, extracting the required headings and text.

e. Removed labels from the extracted text and joined it into a single string.

f. Performed abstractive summarization using the Pegasus-xsum and Pegasus-large models (a sketch follows these sub-steps):

  i. Imported dependencies such as PegasusForConditionalGeneration and PegasusTokenizer.
  
  ii. Created a tokenizer for each model.
  
  iii. Created tokens as the numerical representation of the text.
  
  iv. Loaded the respective Pegasus models.
  
  v. Generated and decoded the summaries.
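
A minimal sketch of these sub-steps, assuming the public google/pegasus-xsum and google/pegasus-large checkpoints and using a placeholder for the article text scraped in the earlier steps:

```python
from transformers import PegasusForConditionalGeneration, PegasusTokenizer

# Placeholder: in the project, `text` is the article scraped from the web.
text = "Auto-generated summaries help readers grasp long documents quickly."

def pegasus_summarize(text: str, checkpoint: str) -> str:
    # Create a tokenizer and load the model for the given checkpoint.
    tokenizer = PegasusTokenizer.from_pretrained(checkpoint)
    model = PegasusForConditionalGeneration.from_pretrained(checkpoint)
    # Create tokens: the numerical representation of the text.
    tokens = tokenizer(text, truncation=True, padding="longest",
                       return_tensors="pt")
    # Generate and decode the summary.
    summary_ids = model.generate(**tokens)
    return tokenizer.decode(summary_ids[0], skip_special_tokens=True)

for checkpoint in ("google/pegasus-xsum", "google/pegasus-large"):
    print(checkpoint, "->", pegasus_summarize(text, checkpoint))
```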

Results and Analysis:

The Pegasus-xsum and Pegasus-large models generated abstractive summaries for the scientific article at https://ai.googleblog.com/2022/03/auto-generated-summaries-in-google-docs.html. The Pegasus-xsum model produced a single sentence as its summary, while the Pegasus-large model generated a more comprehensive summary of 5-8 sentences. Because PEGASUS is pre-trained with gap-sentence generation, in which whole sentences are masked and then reconstructed, the generated summaries read more like paraphrases of the text than verbatim sentence overlaps.

Optimization:

To optimize the performance of the models, fine-tuning can be performed on a domain-specific corpus. Additionally, adjustments to the tokenizer's parameters can be made to better suit the content.
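
For instance, the truncation limit and generation parameters from the sketch above could be adjusted as follows; the specific values are assumptions, not tuned settings:

```python
# Allow longer inputs and steer output length during generation
# (tokenizer and model are created as in the earlier sketch).
tokens = tokenizer(text, truncation=True, max_length=1024,
                   padding="longest", return_tensors="pt")
summary_ids = model.generate(**tokens, num_beams=4, max_new_tokens=128)
print(tokenizer.decode(summary_ids[0], skip_special_tokens=True))
```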

Performance Comparison:

Compared to abstractive summarization, the extractive approach preserves the original wording and structure of the source text. This can represent the content more faithfully, but the result is usually less concise than an abstractive summary, which paraphrases the content. The choice between the two largely depends on the requirements of the summarization task: extractive summarization suits applications that demand high fidelity to the original text, while abstractive summarization is a better fit when brevity and paraphrasing are prioritized. Between the two abstractive models, Pegasus-xsum produced the more concise summary, suitable for quickly grasping the main point, while Pegasus-large generated a more in-depth summary that gives a broader understanding of the article's content.

Overall Conclusion:

In conclusion, both extractive and abstractive summarization techniques provide valuable methods for condensing web articles into shorter, more manageable summaries. A comprehensive understanding of both extractive and abstractive summarization techniques, as well as the pre-trained models and tools available, can enable developers to make informed decisions on the most appropriate method to use for their specific text summarization needs.

Citations:

  1. H. P. Luhn, "The Automatic Creation of Literature Abstracts," IBM Journal of Research and Development, vol. 2, no. 2, pp. 159-165, 1958.

  2. R. Nallapati, F. Zhai, and B. Zhou, "SummaRuNNer: A Recurrent Neural Network Based Sequence Model for Extractive Summarization of Documents," in Proceedings of the Thirty-First AAAI Conference on Artificial Intelligence, San Francisco, CA, 2017.

  3. A. M. Rush, S. Chopra, and J. Weston, "A Neural Attention Model for Abstractive Sentence Summarization," in Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing, Lisbon, Portugal, 2015, pp. 379-389.

  4. S. Karami and A. Sarkar, "A Review of Text Summarization Techniques," Artificial Intelligence Review, vol. 53, no. 4, pp. 2371-2404, 2020.

  5. S. S. Barik, P. R. Tripathy, and A. K. Rath, "A Comparative Study of Extractive and Abstractive Text Summarization Techniques," International Journal of Computer Applications, vol. 179, no. 46, pp. 6-13, 2018.

  6. J. Zhang, Y. Zhao, M. Saleh, and P. J. Liu, "PEGASUS: Pre-training with Extracted Gap-Sentences for Abstractive Summarization," in Proceedings of the 37th International Conference on Machine Learning, 2020. Available: https://arxiv.org/abs/1912.08777