A python package for text preprocessing task in natural language processing.
To use this text preprocessing package, first install it using pip:
pip install text-preprocessing
Then, import the package in your python script and call appropriate functions:
from text_preprocessing import preprocess_text
from text_preprocessing import to_lower, remove_email, remove_url, remove_punctuation, lemmatize_word
# Preprocess text using default preprocess functions in the pipeline
text_to_process = 'Helllo, I am John Doe!!! My email is john.doe@email.com. Visit our website www.johndoe.com'
preprocessed_text = preprocess_text(text_to_process)
print(preprocessed_text)
# output: hello email visit website
# Preprocess text using custom preprocess functions in the pipeline
preprocess_functions = [to_lower, remove_email, remove_url, remove_punctuation, lemmatize_word]
preprocessed_text = preprocess_text(text_to_process, preprocess_functions)
print(preprocessed_text)
# output: helllo i am john doe my email is visit our website
Feature | Function |
---|---|
convert to lower case | to_lower |
convert to upper case | to_upper |
keep only alphabetic and numerical characters | keep_alpha_numeric |
check and correct spellings | check_spelling |
expand contractions | expand_contraction |
remove URLs | remove_url |
remove names | remove_name |
remove emails | remove_email |
remove phone numbers | remove_phone_number |
remove SSNs | remove_ssn |
remove credit card numbers | remove_credit_card_number |
remove numbers | remove_number |
remove bullets and numbering | remove_itemized_bullet_and_numbering |
remove special characters | remove_special_character |
remove punctuations | remove_punctuation |
remove extra whitespace | remove_whitespace |
normalize unicode (e.g., café -> cafe) | normalize_unicode |
remove stop words | remove_stopword |
tokenize words | tokenize_word |
tokenize sentences | tokenize_sentence |
substitute custom words (e.g., vs -> versus) | substitute_token |
stem words | stem_word |
lemmatize words | lemmatize_word |
preprocess text through a sequence of preprocessing functions | preprocess_text |