Topic-Modeling-on-Jeopardy-Trivia

Applied NLP techniques (NMF, TF-IDF, topic modeling) to discover latent meta-categories from 200k+ questions in the Jeopardy! archives.

The Goal

**The goal of this analysis is to apply NLP techniques to categorize Jeopardy! questions into meta-categories. These will help form a study app that home viewers and aspiring contestants can use to study, or just have fun!**

Background

Jeopardy! is a trivia game show where contestants are presented with general knowledge clues in the form of answers, and must phrase their responses in the form of questions. Clues are concealed under varying dollar amounts on the game board, ostensibly reflecting varying levels of difficulty, and come from different categories. Popular categories include Literature, Science, Word Origins, and Before & After, a tricky category that asks contestants to combine clues and answers.

Hell, we'll take Before & After for $200!

Have a think on that.

Key Terms

Clue: What I will be calling a question-answer combination. A single clue can be treated as one text document.
J-Category: The category as defined by "Jeopardy!". In the image above, 'BEFORE & AFTER', 'SAY "WH"', etc. are the J-Categories of one round.
Meta-category: An overarching topic that describes each clue's context, also referred to as a hidden theme. For example, clues from the J-Category "BEFORE & AFTER" seen above might belong to the potential meta-categories "Literature" and "History". In data science, we can also think of a meta-category as a latent topic.

The Data

The original dataset is a .csv file with 216,929 rows and 7 columns. Each row holds the information for a single clue from an episode aired between 1984 and 2012.
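
As a rough illustration of how each row can become one "clue" document (question plus answer), here is a minimal sketch using Pandas. The column names, the placeholder metadata values, and the `load_clues` helper are assumptions made for this example; the real file's headers may differ.

```python
import io

import pandas as pd

# Tiny stand-in for the real .csv. Column names are assumed, and the
# show number, air date, category, and value fields are placeholders.
SAMPLE_CSV = '''show_number,air_date,round,category,value,question,answer
1,1900-01-01,Jeopardy!,PLACEHOLDER,$200,"4 treaties to mitigate the horrors of war were signed in this city in August, 1949.",Geneva
2,1900-01-01,Jeopardy!,PLACEHOLDER,$400,"The last of the 13 colonies to be founded, its 'Mother City', Savannah, was settled in 1733.",Georgia
3,1900-01-01,Jeopardy!,PLACEHOLDER,$600,"A clue with no recorded answer.",
'''


def load_clues(csv_source):
    """Read the clue table and add a 'clue' column (question + answer)."""
    df = pd.read_csv(csv_source)
    # Normalize header names so stray spaces or capitals do not matter.
    df.columns = df.columns.str.strip().str.lower()
    # Drop rows that are missing either half of the clue.
    df = df.dropna(subset=["question", "answer"])
    # One text document per clue: the question followed by its answer.
    return df.assign(
        clue=df["question"].str.strip() + " " + df["answer"].str.strip()
    )


clues = load_clues(io.StringIO(SAMPLE_CSV))
print(clues[["category", "clue"]])
```

With the real dataset, the same helper could be pointed at the file path instead of the in-memory sample.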

Algorithms

Text Pre-processing: I used lemmatization, part-of-speech tagging, and custom stopword removal to filter words. I then used regular expressions to remove punctuation and dropped clues that included images or video.
Vectorize Text: I used tf-idf (Term Frequency × Inverse Document Frequency) to vectorize the text from each clue. In other words, I turned the raw text of the "Jeopardy!" questions and answers into a matrix whose entries are the tf-idf weights of each word in each clue.
Dimensionality Reduction: I then used Non-Negative Matrix Factorization (NMF) to group words into clusters, where each cluster can be thought of as a meta-category or latent topic. Finding these topics is one of the goals of this analysis.

Tools

Numpy & Pandas for data processing
Matplotlib, WordCloud for visualization
Scikit-learn for machine learning
NLTK for natural language processing

Results

Initial data exploration revealed the most popular categories:

In total, there are 27,295 distinct Jeopardy! categories from 1984 until 2012. From these, my analysis constructed 13 reasonably cohesive meta-categories, such as PEOPLE, FILM & TV, WORDS, and HISTORY.
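
Counts like these can be obtained with a couple of Pandas calls. The sketch below uses a small made-up frame and an assumed `category` column name, so the numbers it prints are not the project's results.

```python
import pandas as pd

# Toy data for illustration; the real dataset has 200k+ rows.
df = pd.DataFrame(
    {
        "category": [
            "LITERATURE",
            "SCIENCE",
            "LITERATURE",
            "WORD ORIGINS",
            "BEFORE & AFTER",
            "SCIENCE",
            "LITERATURE",
        ]
    }
)

# Number of distinct J-Categories.
n_categories = df["category"].nunique()

# Most frequent J-Categories, sorted from most to least common.
top_categories = df["category"].value_counts().head(2)

print(n_categories)
print(top_categories)
```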

Below is a wordcloud based on terms appearing in the CITY meta-category, along with two representative clues:

"4 treaties to mitigate the horrors of war were signed in this city in August, 1949." ('What is Geneva')
"The last of the 13 colonies to be founded, its ‘Mother City’, Savannah, was settled in 1733." (What is Georgia)

As another example, take the 'BOOKS' meta-category:

Weakened by scarlet fever, Beth March was sentenced to death by this author.
In a 14-line poem Wordsworth wrote "Scorn not" this form of poetry.
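
One way representative clues like these could be picked is to rank clues by their weight for a given topic in the clue-by-topic matrix that NMF produces. The sketch below assumes that approach; the `representative_clues` helper, the weight matrix, and the clue labels are hypothetical, and the README does not state how its examples were actually chosen.

```python
import numpy as np


def representative_clues(W, clues, topic, n=2):
    """Return the n clues with the largest weight in one topic column of W."""
    order = np.argsort(W[:, topic])[::-1][:n]
    return [clues[i] for i in order]


# Hypothetical clue-by-topic weights for 4 clues and 2 topics.
W = np.array(
    [
        [0.9, 0.0],
        [0.1, 0.7],
        [0.6, 0.2],
        [0.0, 0.8],
    ]
)
clues = ["clue A", "clue B", "clue C", "clue D"]

top_for_topic_0 = representative_clues(W, clues, topic=0)
top_for_topic_1 = representative_clues(W, clues, topic=1)
print(top_for_topic_0, top_for_topic_1)
```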

You can interact with the live Jeopardy! Quiz Game app here.

* The answer is 'The Birth of Venus Williams'.

lizzynaameh/Topic-Modeling-on-Jeopardy-Trivia
