create clusters based on most similar course descriptions #415

Open · wants to merge 2 commits into `master`
Conversation

@kembayeb commented Nov 1, 2023

Summary

Created clusters based on the similarity of course descriptions, using a library called gensim to do those calculations for us.

Clusters are based on all courses offered in FA23.

Each cluster has 50 courses inside, or slightly more if an outlier was merged in. There are currently 87 clusters.

NOTES
Currently, courses that don't have a course description (97 courses) aren't accounted for.
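For context, here is a minimal sketch of the kind of gensim pipeline the summary describes; it is not the PR's actual code. Doc2Vec embeddings plus scikit-learn's KMeans are assumptions about the approach, and the fixed ~50-course cluster size with outlier merging would need extra logic beyond plain KMeans.

```python
# Hypothetical sketch, not the PR's code: embed descriptions with gensim's
# Doc2Vec, then cluster with scikit-learn's KMeans (87 clusters in the PR).
from gensim.models.doc2vec import Doc2Vec, TaggedDocument
from sklearn.cluster import KMeans

descriptions = [
    "Introduction to computer programming and problem solving.",
    "Limits, derivatives, and integrals of functions of one variable.",
    # ...one entry per FA23 course description
]

# One tagged document per course; integer tags index back into descriptions.
tagged = [TaggedDocument(words=d.lower().split(), tags=[i])
          for i, d in enumerate(descriptions)]
model = Doc2Vec(tagged, vector_size=64, min_count=1, epochs=40)
vectors = [model.dv[i] for i in range(len(tagged))]

# The PR reports 87 clusters; capped here so the toy example still runs.
n_clusters = min(87, len(descriptions))
labels = KMeans(n_clusters=n_clusters, n_init=10, random_state=0).fit_predict(vectors)
```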

@kembayeb requested a review from a team as a code owner November 1, 2023 23:13
@dti-github-bot (Member) commented Nov 1, 2023

[diff-counting] Significant lines: 140.

@michelleli01 (Contributor) left a comment

It looks great, Kemba! I'm so excited to see progress on the recommendations feature. I left some minor comments about separating certain sections within your `clusters.py` file.

Comment on lines 5 to 6:

```python
subjects_url = "https://classes.cornell.edu/api/2.0/config/subjects.json?roster=FA23"
subjects = getSubjects(subjects_url)
```

Can we move this section into a new file and import the `subjects` variable?

Comment on lines 8 to 18:

```python
urls = [f"https://classes.cornell.edu/api/2.0/search/classes.json?roster=FA23&subject={sub}"
        for sub in subjects]

def fetchDescriptions(urls):
    descriptions = []
    for x in urls:
        descriptions += loadCourseDesc(x)
    print(len(descriptions))
    return descriptions


course_descriptions = fetchDescriptions(urls)
```

Can we also move this to a separate file? We can add the `getSubjects` and `fetchDescriptions` functions into a separate `scraper.py` file.
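A sketch of what the proposed `scraper.py` might look like, folding in the earlier suggestion to make `subjects` importable. The helper bodies and response shapes are assumptions; the PR's `getSubjects`/`loadCourseDesc` implementations aren't shown in the diff.

```python
# scraper.py (hypothetical layout for the suggested split).
import requests

def getSubjects(url):
    """Return the subject codes (e.g. "CS") for the roster.

    Assumes the Class Roster API nests subjects under data.subjects.
    """
    data = requests.get(url).json()
    return [s["value"] for s in data["data"]["subjects"]]

def loadCourseDesc(url):
    """Return course descriptions from one subject's class listing.

    Assumes each class object carries a top-level "description" field.
    """
    data = requests.get(url).json()
    return [c.get("description") for c in data["data"]["classes"]]

def fetchDescriptions(urls):
    """Collect descriptions across all subject URLs."""
    descriptions = []
    for url in urls:
        descriptions += loadCourseDesc(url)
    return descriptions

# Module-level values clusters.py can import, per the earlier comment:
#   from scraper import subjects, course_descriptions
subjects_url = "https://classes.cornell.edu/api/2.0/config/subjects.json?roster=FA23"
subjects = getSubjects(subjects_url)
urls = [
    f"https://classes.cornell.edu/api/2.0/search/classes.json?roster=FA23&subject={sub}"
    for sub in subjects
]
course_descriptions = fetchDescriptions(urls)
```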

Comment on lines 19 to 44:

```python
def preprocess_text(text):
    """Tokenization and preprocessing with custom stopwords."""
    # Requires `import string` at the top of the file.
    custom_stopwords = ["of", "the", "in", "and", "on", "an", "a", "to"]
    # Note: entries are matched against lowercased single tokens, so the
    # capitalized and multi-word entries below ("Artificial Intelligence",
    # "First-Year Writing", "US", "data structures") never match as written.
    strong_words = ["technology", "calculus", "business", "Artificial Intelligence",
                    "First-Year Writing", "computer", "python", "java", "economics",
                    "US", "writing", "biology", "chemistry", "physics", "engineering",
                    "ancient", "programming", "algorithms", "data structures", "art",
                    "software", "anthropology", "databases", "fiction", "mathematics",
                    "history", "civilization"]

    translator = str.maketrans("", "", string.punctuation)
    text = text.lower()
    text = text.translate(translator)
    tokens = text.split()
    tokens = [token for token in tokens if token not in custom_stopwords]
    # Upweight domain-signal words by repeating them 10x before vectorizing.
    for word in strong_words:
        if word in tokens:
            tokens += [word] * 10

    return " ".join(tokens)

def removeStopwords():
    """Removes stopwords from all the descriptions."""
    preprocessed = []
    for desc in course_descriptions:
        if desc:
            preprocessed.append(preprocess_text(desc))
    return preprocessed

preprocessed_descriptions = removeStopwords()
```

Let's also move this preprocessing out. Separate the preprocessing/web scraping from the actual model itself. Can we also create some persistent objects to store the web-scraping data and the preprocessing output so we don't always have to redo that?
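One way the persistence could look, as a rough sketch: cache the scraped and preprocessed descriptions to disk so reruns skip the network and tokenization. The module and file names here are hypothetical, and `fetchDescriptions`/`preprocess_text` are assumed to live in the split-out modules suggested above.

```python
# cache.py (hypothetical): persist the expensive scraping + preprocessing
# steps so they run only once per roster.
import json
from pathlib import Path

from scraper import fetchDescriptions, urls        # per the proposed split
from preprocessing import preprocess_text          # hypothetical module name

CACHE_PATH = Path("preprocessed_FA23.json")

def load_or_build_descriptions():
    """Return preprocessed descriptions, rebuilding only on a cache miss."""
    if CACHE_PATH.exists():
        return json.loads(CACHE_PATH.read_text())
    raw = fetchDescriptions(urls)
    preprocessed = [preprocess_text(d) for d in raw if d]
    CACHE_PATH.write_text(json.dumps(preprocessed))
    return preprocessed
```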
