VivekChoudhary77/Textify-text-Preprocessing

Textify: A Text Preprocessing Web Application

A text preprocessing web application that produces a summary of an article and also includes a text generator that generates text based on user input.

Technologies used


How the application works

The application runs from the command line in a Python environment and uses Flask, a popular Python micro web framework. It consists of two main pages served on localhost: the user is first shown a form, and on submission the entered content is summarized with the help of several Python packages. The summarized data is then passed to the next, connected page.
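As a rough sketch of that two-page flow (the route names and the naive truncation "summary" below are illustrative placeholders, not the project's actual code):

```python
from flask import Flask, request, render_template_string

app = Flask(__name__)

# Landing page: a simple form that posts the article text to /summary
FORM_PAGE = """
<form method="post" action="/summary">
  <textarea name="article"></textarea>
  <input type="submit" value="Summarize">
</form>
"""

@app.route("/")
def landing():
    return render_template_string(FORM_PAGE)

@app.route("/summary", methods=["POST"])
def summary():
    article = request.form["article"]
    # Placeholder: the real app summarizes with nltk; here we just truncate
    summary_text = article[:200]
    return render_template_string("<p>{{ s }}</p>", s=summary_text)

if __name__ == "__main__":
    app.run(debug=True)  # serves on http://127.0.0.1:5000/
```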

For text generation, the application uses Hugging Face, an NLP-focused startup with a large open-source community that provides an open-source library for Natural Language Processing. Their core offering is Transformers, a Python library that exposes an API for many well-known architectures achieving state-of-the-art results on NLP tasks such as text classification, information extraction, question answering, and text generation. All the architectures come with pre-trained deep learning weights, which makes these tasks easy to get running. These transformer models come in architectures of different shapes and sizes, each with its own way of tokenizing input data: a tokenizer takes an input word and encodes it as a number, allowing faster processing.

Tokenizers in Python: tokenization plays a vital role in both text-processing parts. In Python, tokenization refers to splitting a larger body of text into smaller units such as sentences or words, including for non-English languages. The nltk module has several tokenization functions built in that can be used directly in programs.

This project comprises 3 modules, namely:

  • Landing Page
  • Summarized Content as Output
  • Text Generation as Output

Run these commands in the Windows terminal:

Note: before running the text summarization, run this command:

pip install nltk

For exporting and processing the data, run the following script in a new .py file before running the application:

   import nltk
   nltk.download('stopwords')
   nltk.download('punkt')  # provides word_tokenize and sent_tokenize

To run and initialize the application there are 2 alternative methods:

  • Method 1: run from the editor in a venv and view the application in any browser at http://127.0.0.1:5000/
  • Method 2: run from the command prompt, from the project directory, with the following command
 python __init__.py

Landing Page

alt text

Summarisation (input form, before summarisation)

alt text

Output (Summarised content of Article)

alt text


For the text generation part

Run these commands before running the text generation:

 pip install tensorflow
 pip install transformers
 pip3 install torch torchvision torchaudio

For Conda

conda install pytorch torchvision torchaudio cpuonly -c pytorch

Note: when running the text generator for the first time, the required model files (the GPT-2 model) are downloaded automatically.

Some terms and their meaning in the project

  • max_length : the maximum number of tokens to produce while generating the text.
  • input_ids : indices of the input sequence tokens in the vocabulary.
  • pad_token_id : if a pad_token_id is defined in the configuration, it is used to identify the last token in each row that is not a padding token.
  • num_beams : the beam search width used to find the next appropriate words in the sequence.
  • no_repeat_ngram_size : stops certain sequences from repeating over and over again (basically, it stops the model repeating words or phrases).
  • early_stopping : stops generation early once the model is no longer producing better output.
  • skip_special_tokens : should always be True, because we want the returned sentences to contain only words, not end-of-sequence or other special tokens.
  • return_tensors : 'pt' means PyTorch tensors.

Text-Generator (landing page)

alt_text

Text-Generator (Output)

alt_text


This concludes the project: it generates output matching the keywords the user has entered.