
NLP Data Augmentation

(Augmenting Textual Data Using NLP Libraries)

“Augmentation” is the process of enlarging something in size or amount; in this article, we work out how to increase the size of textual data using data augmentation techniques. Since neural architectures rely on large parallel corpora, synthetically generating data (which is called data augmentation) can be of huge help.

As mentioned in “A Survey of Data Augmentation Approaches for NLP” [b], some of the data augmentation techniques are:

  1. Rule-Based Techniques: Easy Data Augmentation (EDA)
  2. Example Interpolation Techniques: MIXUP, SEQ2MIXUP
  3. Model-Based Techniques: seq2seq, language models, back-translation, fine-tuning GPT-2, paraphrasing

Under the rule-based category, the most basic and commonly used technique is Easy Data Augmentation (EDA). The EDA techniques are listed below, followed by a rough from-scratch sketch:

  1. Synonym Replacement: Randomly choose n words from the sentence that are not stop words. Replace each of these words with one of its synonyms chosen at random.
  2. Random Deletion: Randomly remove each word in the sentence with probability p.
  3. Random Swap: Randomly choose two words in the sentence and swap their positions. Do this n times.
  4. Random Insertion: Find a random synonym of a random word in the sentence that is not a stop word. Insert that synonym into a random position in the sentence. Do this n times.
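
As a rough illustration, here is a minimal from-scratch sketch of the four EDA operations, assuming NLTK with the WordNet and stopwords corpora downloaded (see also the EDA-from-scratch notebook in [c]):

```python
import random
from nltk.corpus import stopwords, wordnet

STOP_WORDS = set(stopwords.words("english"))

def synonyms(word):
    # All WordNet lemma names for the word, excluding the word itself.
    names = {l.name().replace("_", " ")
             for s in wordnet.synsets(word) for l in s.lemmas()}
    names.discard(word)
    return sorted(names)

def synonym_replacement(words, n):
    # Replace up to n non-stop-words with a randomly chosen synonym.
    out = words[:]
    candidates = [i for i, w in enumerate(out)
                  if w.lower() not in STOP_WORDS and synonyms(w)]
    for i in random.sample(candidates, min(n, len(candidates))):
        out[i] = random.choice(synonyms(out[i]))
    return out

def random_deletion(words, p):
    # Drop each word independently with probability p; keep at least one word.
    kept = [w for w in words if random.random() > p]
    return kept or [random.choice(words)]

def random_swap(words, n):
    # Swap the positions of two randomly chosen words, n times.
    out = words[:]
    if len(out) < 2:
        return out
    for _ in range(n):
        i, j = random.sample(range(len(out)), 2)
        out[i], out[j] = out[j], out[i]
    return out

def random_insertion(words, n):
    # Insert a synonym of a random non-stop-word at a random position, n times.
    out = words[:]
    for _ in range(n):
        candidates = [w for w in out
                      if w.lower() not in STOP_WORDS and synonyms(w)]
        if not candidates:
            break
        out.insert(random.randrange(len(out) + 1),
                   random.choice(synonyms(random.choice(candidates))))
    return out

sentence = "the quick brown fox jumps over the lazy dog".split()
print(" ".join(synonym_replacement(sentence, 2)))
```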

Various Data Augmentation Tasks:

  1. Summarization
  2. Question Answering
  3. Sequence Tagging
  4. Parsing
  5. Grammatical Error Correction
  6. Neural Machine Translation
  7. Data to Text
  8. Dialogue

Various Libraries available:

  1. TextAugment
  2. Augly
  3. NLPAug
  4. Parrot paraphrase
  5. Pegasus paraphrase

Working code for each library can be found in this repository.

TextAugment

TextAugment is a Python 3 library for augmenting text for natural language processing applications. TextAugment stands on the giant shoulders of NLTK, Gensim, and TextBlob and plays nicely with them.
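
A minimal usage sketch, following the EDA interface documented in the TextAugment repo [d] (it requires the NLTK wordnet and stopwords downloads):

```python
from textaugment import EDA

t = EDA()
print(t.synonym_replacement("John is going to town"))
print(t.random_deletion("John is going to town", p=0.2))
print(t.random_swap("John is going to town"))
print(t.random_insertion("John is going to town"))
```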


Augly

Facebook recently open-sourced the AugLy package. The AugLy library is divided into four sub-libraries, one for each data modality (audio, images, videos, and text).
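
A small sketch of the text sub-library, using two transforms from augly.text as documented in the AugLy repo [e] (function names per its docs; exact defaults may differ across versions):

```python
import augly.text as textaugs

texts = ["hello world, how are you today?"]
print(textaugs.simulate_typos(texts))            # keyboard-typo style noise
print(textaugs.insert_punctuation_chars(texts))  # punctuation inserted between characters
```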

NLPAug

NLPAug is a library for textual augmentation in machine learning experiments. The goal is to improve deep learning model performance by generating augmented textual data.
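
For example, a WordNet-based synonym augmenter, following nlpaug's documented word-level API [f]:

```python
import nlpaug.augmenter.word as naw

aug = naw.SynonymAug(aug_src="wordnet")
print(aug.augment("The quick brown fox jumps over the lazy dog"))
```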

Back translation involves translating a text into another language and then translating that result back into the original language; because the round trip rarely reproduces the exact wording, it yields a natural paraphrase of the original text.
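
nlpaug implements this with machine translation models. A sketch using its back-translation augmenter, with the model names taken from the library's own README example [f]:

```python
import nlpaug.augmenter.word as naw

back_translation = naw.BackTranslationAug(
    from_model_name="facebook/wmt19-en-de",  # English -> German
    to_model_name="facebook/wmt19-de-en",    # German -> English
)
print(back_translation.augment("The quick brown fox jumps over the lazy dog"))
```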

Parrot paraphrase

Parrot is a paraphrase-based utterance augmentation framework purpose-built to accelerate training NLU models. A paraphrase framework is more than just a paraphrasing model.
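
A usage sketch based on the interface documented in the Parrot repo [g] (the T5 paraphrase model is downloaded on first run):

```python
from parrot import Parrot

parrot = Parrot(model_tag="prithivida/parrot_paraphraser_on_T5", use_gpu=False)
phrases = parrot.augment(input_phrase="Can you recommend some upscale restaurants in town?")
for paraphrase in phrases or []:
    print(paraphrase)
```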

Pegasus paraphrase

PEGASUS is a standard Transformer encoder-decoder, pre-trained with gap-sentence generation (GSG) on large corpora of documents [j].
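
A paraphrasing sketch with the tuner007/pegasus_paraphrase checkpoint, adapted from its Hugging Face model card [h]:

```python
from transformers import PegasusForConditionalGeneration, PegasusTokenizer

model_name = "tuner007/pegasus_paraphrase"
tokenizer = PegasusTokenizer.from_pretrained(model_name)
model = PegasusForConditionalGeneration.from_pretrained(model_name)

def paraphrase(text, num_return_sequences=3):
    # Encode, generate with beam search, and decode the candidate paraphrases.
    batch = tokenizer([text], truncation=True, padding="longest",
                      max_length=60, return_tensors="pt")
    generated = model.generate(**batch, max_length=60, num_beams=10,
                               num_return_sequences=num_return_sequences)
    return tokenizer.batch_decode(generated, skip_special_tokens=True)

print(paraphrase("Data augmentation increases the size of a training set."))
```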

REF:

[a] Surangika Ranathunga, En-Shiun Annie Lee, Marjana Prifti Skenduli, Ravi Shekhar, Mehreen Alam, Rishemjit Kaur. 2021. Neural Machine Translation for Low-Resource Languages: A Survey.

[b] Steven Y. Feng, Varun Gangal, Jason Wei, Sarath Chandar, Soroush Vosoughi, Teruko Mitamura, Eduard Hovy. 2021. A Survey of Data Augmentation Approaches for NLP.

[c] EDA from scratch: https://jovian.ai/abdulmajee/eda-data-augmentation-techniques-for-text-nlp

[d] TextAugment https://github.com/dsfsi/textaugment

[e] Augly https://analyticsarora.com/how-to-use-augly-on-image-video-audio-and-text/

[f] nlpaug https://github.com/makcedward/nlpaug

[g] Parrot Paraphraser https://github.com/PrithivirajDamodaran/Parrot_Paraphraser

[h] Pegasus Paraphraser https://huggingface.co/tuner007/pegasus_paraphrase

[i] Improving short text classification through global augmentation methods.

[j] PEGASUS: Pre-training with Extracted Gap-sentences for Abstractive Summarization https://arxiv.org/abs/1912.08777