VivekChoudhary77/Textify-text-Preprocessing

Textify: A Text Preprocessing Web Application

A text preprocessing web application that produces a summary of an article and also includes a text generator that generates text based on user input.

Technologies used


How the application works

The application runs from the command line in a Python environment and uses Flask, a popular Python micro web framework. It consists of two main pages served on localhost: the user is first shown a form, and on submission the entered content is summarized with the help of several Python packages. The summarized data is then passed to the next, connected page.
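As a rough sketch of that two-page flow (the route names and the naive truncation "summary" below are illustrative placeholders, not the project's actual code):

```python
from flask import Flask, request, render_template_string

app = Flask(__name__)

# Landing page: a simple form that posts the article text to /summary
FORM_PAGE = """
<form method="post" action="/summary">
  <textarea name="article"></textarea>
  <input type="submit" value="Summarize">
</form>
"""

@app.route("/")
def landing():
    return render_template_string(FORM_PAGE)

@app.route("/summary", methods=["POST"])
def summary():
    article = request.form["article"]
    # Placeholder: the real app summarizes with nltk; here we just truncate
    summary_text = article[:200]
    return render_template_string("<p>{{ s }}</p>", s=summary_text)

if __name__ == "__main__":
    app.run(debug=True)  # serves on http://127.0.0.1:5000/
```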

For text generation, the application uses Hugging Face, an NLP-focused startup with a large open-source community that provides an open-source library for Natural Language Processing. Their core offering is Transformers, a Python library that exposes an API for many well-known architectures achieving state-of-the-art results on NLP tasks such as text classification, information extraction, question answering, and text generation. All the architectures come with pre-trained deep learning weights, which makes these tasks easy to get running. These transformer models come in architectures of different shapes and sizes, each with its own way of tokenizing input data: a tokenizer takes an input word and encodes it as a number, allowing faster processing.

Tokenizers in Python: tokenization plays a vital role in both text-processing parts. In Python, tokenization refers to splitting a larger body of text into smaller units such as sentences or words, including for non-English languages. The nltk module has several tokenization functions built in that can be used directly in programs.

This project comprises 3 modules, namely:

  • Landing Page
  • Summarized Content as Output
  • Text Generation as Output

Run these commands in the Windows terminal:

Note: before running the text summarization, run this command:

pip install nltk

For exporting and processing the data, run the following script in a new .py file before running the application:

   import nltk
   nltk.download('stopwords')
   nltk.download('punkt')  # provides word_tokenize and sent_tokenize

To run and initialize the application there are 2 alternative methods:

  • Method 1: run from the editor in a venv and view the application in any browser at http://127.0.0.1:5000/
  • Method 2: run from the command prompt, from the project directory, with the following command
 python __init__.py

Landing Page

alt text

Summarisation (input form, before summarisation)

alt text

Output (Summarised content of Article)

alt text


For the text generation part

Run these commands before running the text generation:

 pip install tensorflow
 pip install transformers
 pip3 install torch torchvision torchaudio

For Conda

conda install pytorch torchvision torchaudio cpuonly -c pytorch

Note: when running the text generator for the first time, the required model files (the GPT-2 model) are downloaded automatically.

Some terms and their meaning in the project

  • max_length : the maximum number of tokens to produce while generating the text.
  • input_ids : indices of the input sequence tokens in the vocabulary.
  • pad_token_id : if a pad_token_id is defined in the configuration, it is used to identify the last token in each row that is not a padding token.
  • num_beams : the beam search width used to find the next appropriate words in the sequence.
  • no_repeat_ngram_size : stops certain sequences from repeating over and over again (basically, it stops the model repeating words or phrases).
  • early_stopping : stops generation early once the model is no longer producing better output.
  • skip_special_tokens : should always be True, because we want the returned sentences to contain only words, not end-of-sequence or other special tokens.
  • return_tensors : 'pt' means PyTorch tensors.

Text-Generator (landing page)

alt_text

Text-Generator (Output)

alt_text


This concludes the project: it generates output matching the keywords the user has entered.