create clusters based on most similar course descriptions #415

Open · wants to merge 2 commits into `master`
Conversation

@kembayeb commented Nov 1, 2023

Summary

Created clusters based on the similarity of course descriptions, using a library called gensim to do those calculations for us.

Clusters are based on all courses offered in FA23.

Each cluster has 50 courses inside, or slightly more if an outlier was merged in. There are currently 87 clusters.

NOTES
Currently, courses that don't have a course description (97 courses) aren't accounted for.
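For context, here is a minimal sketch of the kind of gensim pipeline the summary describes; it is not the PR's actual code. Doc2Vec embeddings plus scikit-learn's KMeans are assumptions about the approach, and the fixed ~50-course cluster size with outlier merging would need extra logic beyond plain KMeans.

```python
# Hypothetical sketch, not the PR's code: embed descriptions with gensim's
# Doc2Vec, then cluster with scikit-learn's KMeans (87 clusters in the PR).
from gensim.models.doc2vec import Doc2Vec, TaggedDocument
from sklearn.cluster import KMeans

descriptions = [
    "Introduction to computer programming and problem solving.",
    "Limits, derivatives, and integrals of functions of one variable.",
    # ...one entry per FA23 course description
]

# One tagged document per course; integer tags index back into descriptions.
tagged = [TaggedDocument(words=d.lower().split(), tags=[i])
          for i, d in enumerate(descriptions)]
model = Doc2Vec(tagged, vector_size=64, min_count=1, epochs=40)
vectors = [model.dv[i] for i in range(len(tagged))]

# The PR reports 87 clusters; capped here so the toy example still runs.
n_clusters = min(87, len(descriptions))
labels = KMeans(n_clusters=n_clusters, n_init=10, random_state=0).fit_predict(vectors)
```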

@kembayeb requested a review from a team as a code owner November 1, 2023 23:13
@dti-github-bot (Member) commented Nov 1, 2023

[diff-counting] Significant lines: 140.

@michelleli01 (Contributor) left a comment

It looks great, Kemba! I'm so excited to see progress on the recommendations feature. I left some minor comments about separating certain sections within your `clusters.py` file.

Comment on lines 5 to 6:

```python
subjects_url = "https://classes.cornell.edu/api/2.0/config/subjects.json?roster=FA23"
subjects = getSubjects(subjects_url)
```

Can we move this section into a new file and import the `subjects` variable?

Comment on lines 8 to 18:

```python
urls = [f"https://classes.cornell.edu/api/2.0/search/classes.json?roster=FA23&subject={sub}"
        for sub in subjects]

def fetchDescriptions(urls):
    descriptions = []
    for x in urls:
        descriptions += loadCourseDesc(x)
    print(len(descriptions))
    return descriptions


course_descriptions = fetchDescriptions(urls)
```

Can we also move this to a separate file? We can add the `getSubjects` and `fetchDescriptions` functions into a separate `scraper.py` file.
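A sketch of what the proposed `scraper.py` might look like, folding in the earlier suggestion to make `subjects` importable. The helper bodies and response shapes are assumptions; the PR's `getSubjects`/`loadCourseDesc` implementations aren't shown in the diff.

```python
# scraper.py (hypothetical layout for the suggested split).
import requests

def getSubjects(url):
    """Return the subject codes (e.g. "CS") for the roster.

    Assumes the Class Roster API nests subjects under data.subjects.
    """
    data = requests.get(url).json()
    return [s["value"] for s in data["data"]["subjects"]]

def loadCourseDesc(url):
    """Return course descriptions from one subject's class listing.

    Assumes each class object carries a top-level "description" field.
    """
    data = requests.get(url).json()
    return [c.get("description") for c in data["data"]["classes"]]

def fetchDescriptions(urls):
    """Collect descriptions across all subject URLs."""
    descriptions = []
    for url in urls:
        descriptions += loadCourseDesc(url)
    return descriptions

# Module-level values clusters.py can import, per the earlier comment:
#   from scraper import subjects, course_descriptions
subjects_url = "https://classes.cornell.edu/api/2.0/config/subjects.json?roster=FA23"
subjects = getSubjects(subjects_url)
urls = [
    f"https://classes.cornell.edu/api/2.0/search/classes.json?roster=FA23&subject={sub}"
    for sub in subjects
]
course_descriptions = fetchDescriptions(urls)
```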

Comment on lines 19 to 44:

```python
def preprocess_text(text):
    """Tokenization and preprocessing with custom stopwords."""
    # Requires `import string` at the top of the file.
    custom_stopwords = ["of", "the", "in", "and", "on", "an", "a", "to"]
    # Note: entries are matched against lowercased single tokens, so the
    # capitalized and multi-word entries below ("Artificial Intelligence",
    # "First-Year Writing", "US", "data structures") never match as written.
    strong_words = ["technology", "calculus", "business", "Artificial Intelligence",
                    "First-Year Writing", "computer", "python", "java", "economics",
                    "US", "writing", "biology", "chemistry", "physics", "engineering",
                    "ancient", "programming", "algorithms", "data structures", "art",
                    "software", "anthropology", "databases", "fiction", "mathematics",
                    "history", "civilization"]

    translator = str.maketrans("", "", string.punctuation)
    text = text.lower()
    text = text.translate(translator)
    tokens = text.split()
    tokens = [token for token in tokens if token not in custom_stopwords]
    # Upweight domain-signal words by repeating them 10x before vectorizing.
    for word in strong_words:
        if word in tokens:
            tokens += [word] * 10

    return " ".join(tokens)

def removeStopwords():
    """Removes stopwords from all the descriptions."""
    preprocessed = []
    for desc in course_descriptions:
        if desc:
            preprocessed.append(preprocess_text(desc))
    return preprocessed

preprocessed_descriptions = removeStopwords()
```

Let's also move this preprocessing out. Separate the preprocessing/web scraping from the actual model itself. Can we also create some persistent objects to store the web-scraping data and the preprocessing output so we don't always have to redo that?
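One way the persistence could look, as a rough sketch: cache the scraped and preprocessed descriptions to disk so reruns skip the network and tokenization. The module and file names here are hypothetical, and `fetchDescriptions`/`preprocess_text` are assumed to live in the split-out modules suggested above.

```python
# cache.py (hypothetical): persist the expensive scraping + preprocessing
# steps so they run only once per roster.
import json
from pathlib import Path

from scraper import fetchDescriptions, urls        # per the proposed split
from preprocessing import preprocess_text          # hypothetical module name

CACHE_PATH = Path("preprocessed_FA23.json")

def load_or_build_descriptions():
    """Return preprocessed descriptions, rebuilding only on a cache miss."""
    if CACHE_PATH.exists():
        return json.loads(CACHE_PATH.read_text())
    raw = fetchDescriptions(urls)
    preprocessed = [preprocess_text(d) for d in raw if d]
    CACHE_PATH.write_text(json.dumps(preprocessed))
    return preprocessed
```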
