Text preprocessing for Natural Language Processing

A Python package for common text preprocessing tasks in natural language processing.

Usage

To use this text preprocessing package, first install it using pip:

pip install text-preprocessing

Then, import the package in your Python script and call the appropriate functions:

from text_preprocessing import preprocess_text
from text_preprocessing import to_lower, remove_email, remove_url, remove_punctuation, lemmatize_word

# Preprocess text using default preprocess functions in the pipeline
text_to_process = 'Helllo, I am John Doe!!! My email is john.doe@email.com. Visit our website www.johndoe.com'
preprocessed_text = preprocess_text(text_to_process)
print(preprocessed_text)
# output: hello email visit website

# Preprocess text using custom preprocess functions in the pipeline
preprocess_functions = [to_lower, remove_email, remove_url, remove_punctuation, lemmatize_word]
preprocessed_text = preprocess_text(text_to_process, preprocess_functions)
print(preprocessed_text)
# output: helllo i am john doe my email is visit our website
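Under the hood, `preprocess_text` behaves like simple function composition: each function in the list receives the output of the previous one. A minimal stand-alone sketch of that chaining idea (illustrative only, not the package's actual implementation):

```python
import string

def to_lower(text: str) -> str:
    """Lower-case the entire string."""
    return text.lower()

def remove_punctuation(text: str) -> str:
    """Strip ASCII punctuation characters."""
    return text.translate(str.maketrans('', '', string.punctuation))

def preprocess(text: str, functions) -> str:
    """Apply each preprocessing function in order, feeding each output to the next."""
    for fn in functions:
        text = fn(text)
    return text

print(preprocess('Hello, World!!!', [to_lower, remove_punctuation]))
# output: hello world
```

Because the pipeline is just an ordered list of callables, the order matters: for instance, lower-casing before spell checking can change which corrections are applied.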

Features

=============================================================  ====================================
Feature                                                        Function
=============================================================  ====================================
convert to lower case                                          to_lower
convert to upper case                                          to_upper
keep only alphabetic and numerical characters                  keep_alpha_numeric
check and correct spellings                                    check_spelling
expand contractions                                            expand_contraction
remove URLs                                                    remove_url
remove names                                                   remove_name
remove emails                                                  remove_email
remove phone numbers                                           remove_phone_number
remove SSNs                                                    remove_ssn
remove credit card numbers                                     remove_credit_card_number
remove numbers                                                 remove_number
remove bullets and numbering                                   remove_itemized_bullet_and_numbering
remove special characters                                      remove_special_character
remove punctuation                                             remove_punctuation
remove extra whitespace                                        remove_whitespace
normalize unicode (e.g., café -> cafe)                         normalize_unicode
remove stop words                                              remove_stopword
tokenize words                                                 tokenize_word
tokenize sentences                                             tokenize_sentence
substitute custom words (e.g., vs -> versus)                   substitute_token
stem words                                                     stem_word
lemmatize words                                                lemmatize_word
preprocess text through a sequence of preprocessing functions  preprocess_text
=============================================================  ====================================
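To give a feel for what features like URL and email removal do, here is a regex-based sketch of two of them. These are illustrative approximations; the package's own implementations may use different patterns:

```python
import re

def remove_url(text: str) -> str:
    """Drop http(s):// and www.-style URLs (simplified pattern)."""
    return re.sub(r'(https?://\S+|www\.\S+)', '', text)

def remove_email(text: str) -> str:
    """Drop simple email addresses (simplified pattern)."""
    return re.sub(r'\S+@\S+\.\S+', '', text)

text = 'Contact john.doe@email.com or visit www.johndoe.com today'
print(remove_email(remove_url(text)))
```

Real-world URLs and email addresses are harder to match exhaustively than these patterns suggest, which is one reason to reuse tested preprocessing functions rather than ad-hoc regexes.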