
applied_nlp_demos in PyTorch

Modern NLP

  1. Pretrain T5 v1.1: Pre-Train T5 on the C4 Dataset. Code. Thanks to HuggingFace, this code is substantially simpler, more readable, and truer to Google's implementation than other available options. The T5 v1.1 paper reports a loss of 1.942 at 65,536 steps; with this code, a single RTX 4090 reaches a comparable test-set loss (2.08) in roughly 18.5 hours of training (see the loss figure below). Pretraining on your own dataset is as simple as swapping the existing Dataset for your own; a minimal sketch follows the figure.

T5 Pretraining Loss
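
The swap described above touches only the data-loading step. A minimal sketch, assuming the HuggingFace datasets library; the allenai/c4 dataset name, the google/t5-v1_1-base checkpoint, and the my_corpus.txt path are illustrative, not the repo's exact code:

```python
from datasets import load_dataset
from transformers import T5TokenizerFast

# Stream C4 so the full corpus never has to sit on disk.
c4 = load_dataset("allenai/c4", "en", split="train", streaming=True)

# To pretrain on your own data instead, point at your own text files;
# "my_corpus.txt" is a hypothetical path.
my_data = load_dataset("text", data_files={"train": "my_corpus.txt"}, split="train")

tokenizer = T5TokenizerFast.from_pretrained("google/t5-v1_1-base")

def tokenize(batch):
    # Span-corruption inputs and targets are built later from these token ids.
    return tokenizer(batch["text"], truncation=True, max_length=512)

tokenized = my_data.map(tokenize, batched=True, remove_columns=["text"])
```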

  2. Seq2Seq (Chatbot): Fine-Tune Flan-T5 on Alpaca. Code

  3. Seq2Seq: Fine-Tune Flan-T5 on Data Using the HuggingFace Dataset Framework. Code
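
Both notebooks follow the standard HuggingFace seq2seq recipe. A condensed sketch, assuming the tatsu-lab/alpaca dataset and illustrative hyperparameters; the linked notebooks differ in their details:

```python
from datasets import load_dataset
from transformers import (AutoTokenizer, AutoModelForSeq2SeqLM,
                          DataCollatorForSeq2Seq, Seq2SeqTrainer,
                          Seq2SeqTrainingArguments)

model_name = "google/flan-t5-large"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSeq2SeqLM.from_pretrained(model_name)

alpaca = load_dataset("tatsu-lab/alpaca", split="train")

def preprocess(batch):
    # Fold the optional input field into the prompt, Alpaca-style.
    prompts = [f"{ins}\n{inp}".strip()
               for ins, inp in zip(batch["instruction"], batch["input"])]
    enc = tokenizer(prompts, truncation=True, max_length=512)
    labels = tokenizer(text_target=batch["output"], truncation=True, max_length=256)
    enc["labels"] = labels["input_ids"]
    return enc

train = alpaca.map(preprocess, batched=True, remove_columns=alpaca.column_names)

trainer = Seq2SeqTrainer(
    model=model,
    args=Seq2SeqTrainingArguments(output_dir="flan-t5-alpaca",
                                  per_device_train_batch_size=8,
                                  learning_rate=1e-4,
                                  num_train_epochs=1),
    data_collator=DataCollatorForSeq2Seq(tokenizer, model=model),
    train_dataset=train,
)
trainer.train()
```

The base-model and fine-tuned outputs below were produced from the same prompts, so the comparison isolates the effect of fine-tuning.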

google/flan-t5-large

input sentence: Given a set of numbers, find the maximum value.
{10, 3, 25, 6, 16}
response: 25

input sentence: Convert from celsius to fahrenheit.
Temperature in Celsius: 15
response: Fahrenheit

input sentence: Arrange the given numbers in ascending order.
2, 4, 0, 8, 3
response: 0, 3, 4, 8

input sentence: What is the capital of France?
response: paris

input sentence: Name two types of desert biomes.
response: sahara

google/flan-t5-large: Fine-tuned on Alpaca

input sentence: Given a set of numbers, find the maximum value.
{10, 3, 25, 6, 16}
response: 25

input sentence: Convert from celsius to fahrenheit.
Temperature in Celsius: 15
response: 77

input sentence: Arrange the given numbers in ascending order.
2, 4, 0, 8, 3
response: 0, 2, 3, 4, 8

input sentence: What is the capital of France?
response: Paris

input sentence: Name two types of desert biomes.
response: Desert biomes can be divided into two main types: arid and semi-arid. Arid deserts are characterized by high levels of deforestation, sparse vegetation, and limited water availability. Semi-desert deserts, on the other hand, are relatively dry deserts with little to no vegetation.
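
For reference, responses like those above can be reproduced in a few lines; the decoding settings here are assumptions rather than the repo's exact configuration:

```python
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

tokenizer = AutoTokenizer.from_pretrained("google/flan-t5-large")
model = AutoModelForSeq2SeqLM.from_pretrained("google/flan-t5-large")

prompt = "Arrange the given numbers in ascending order.\n2, 4, 0, 8, 3"
inputs = tokenizer(prompt, return_tensors="pt")
output_ids = model.generate(**inputs, max_new_tokens=64)
print(tokenizer.decode(output_ids[0], skip_special_tokens=True))
```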

Legacy NLP

This repository contains start-to-finish data processing and NLP implementations in PyTorch, often using HuggingFace Transformers, for the following models:

  1. Paper: Hierarchical Attention Networks. PyTorch Implementation: Code

  2. Paper: BERT. PyTorch Implementation: Code

  3. BERT-CNN Ensemble. PyTorch Implementation: Code

  4. Paper: Character-level CNN. PyTorch Implementation: Code

  5. Paper: DistilBERT. PyTorch Implementation: Code

  6. DistilGPT-2. PyTorch Implementation: Code

  7. Paper: Convolutional Neural Networks for Sentence Classification. PyTorch Implementation: Code

  8. Paper: T5-Classification. PyTorch Implementation: Code

  9. Paper: T5-Summarization. PyTorch Implementation: Code

  10. Building a Corpus: Search Text Files. Code: Code

  11. Paper: Heinsen Routing. TorchText Implementation: Code

  12. Entity Embeddings and Lazy Loading. Code: Code

  13. Semantic Similarity. Code: Code (a sketch follows this list)

  14. SQuAD 2.0 BERT Embeddings Emissions in PyTorch. Code: Code

  15. SST-5 BERT Embeddings Emissions in PyTorch. Code: Code
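
As one example of what these notebooks cover, semantic similarity (item 13) can be approximated by mean-pooling BERT's last hidden states and comparing sentences with cosine similarity. A hedged sketch, assuming bert-base-uncased; the linked notebook's approach may differ:

```python
import torch
from transformers import AutoTokenizer, AutoModel

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModel.from_pretrained("bert-base-uncased")

def embed(sentence):
    # Mean-pool the last hidden state into a single sentence vector.
    enc = tokenizer(sentence, return_tensors="pt")
    with torch.no_grad():
        hidden = model(**enc).last_hidden_state  # [1, seq_len, 768]
    return hidden.mean(dim=1).squeeze(0)

a = embed("The cat sat on the mat.")
b = embed("A cat is sitting on a rug.")
print(torch.cosine_similarity(a, b, dim=0).item())
```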

Credits: The Hedwig group has been instrumental in helping me learn many of these models.

Nicer-looking R Markdown outputs can be found here: http://seekinginference.com/NLP/