Discerning Suicidal Ideation From the Language of Depression

Problem Statement: Can Natural Language Processing be used to tell suicidal ideation apart from depression in a collection of electronic records?

Target Audience: A psychiatric treatment center is investigating the association between suicidal ideation and depression. A large number of incomplete health records are available for clinical research, but first they need to be encoded correctly. I have been hired to quantify the distinction between the two and to help the center determine whether the encoding process can be automated.

-Electronic Health Records (EHR): construction of a depression "lexicon"
-Automation of text screening for depression, proactive screening for suicidality

Datasets

Provided Data

Reddit-API_Doc.html: Pushshift API for (source | data dictionary)

Additional Data

depression.csv: All the data pulled from r/depression with Pushshift API
depression_6.csv: Subset of data from r/depression, just the six columns of interest
depression_clean.csv: Depression data, final form ready to model
suicide_watch.csv: All the data pulled from r/SuicideWatch with Pushshift API
suicide_watch_6.csv: Subset of data from r/SuicideWatch, just the six columns of interest
suicide_watch_clean.csv: Suicide data, final form ready to model

Data Dictionary

Feature	Type	Dataset	Category	Description

Notebook 01: Web Scraping

Use the Pushshift API to scrape submissions off Reddit, specifically subreddits r/depression and r/SuicideWatch
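As a rough illustration of the scraping step, the sketch below only builds the request URLs and does not send them. The endpoint and the `subreddit`, `size`, and `before` parameters are assumptions based on the public Pushshift submission search, and `build_query` is a hypothetical helper, not code from the notebook.

```python
from urllib.parse import urlencode

# Assumed endpoint for Pushshift submission search; confirm it is still available.
PUSHSHIFT_URL = "https://api.pushshift.io/reddit/search/submission"

def build_query(subreddit, size=100, before=None):
    """Return the request URL for one page of submissions from a subreddit."""
    params = {"subreddit": subreddit, "size": size}
    if before is not None:
        # 'before' is an epoch timestamp; passing the oldest timestamp
        # seen so far lets the next request page further back in time.
        params["before"] = before
    return f"{PUSHSHIFT_URL}?{urlencode(params)}"

# One starting URL per subreddit used in this project.
urls = [build_query(name) for name in ("depression", "SuicideWatch")]
```

Each URL could then be fetched with any HTTP client, with the returned submissions appended to a DataFrame and saved to CSV.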

Notebook 02: EDA and Data Cleaning

Notebook 03: Model Benchmarks

Main Takeaway: CountVectorizer was better at weighting the tokenized subreddit submission text than TF-IDF.

After loading the "clean" subreddit CSV from Notebook 02 into a DataFrame, the data was split into train and test sets. After struggling all weekend to work with the entire dataset of over 200,000 rows, I took Hov's advice and broke off a random subset of the data to speed up the workflow.

Multinomial Naive Bayes was chosen, for no particular reason, as the estimator used to score and tune the transformation of submission text in a pipeline. Care was taken to identify stop-words and to analyze the most frequent words, both across the entire dataset and within the individual subreddits.

What was frustrating was that using no stop-words did worse than the standard 'english' CountVectorizer list of stop-words, yet the standard 'english' set did better than any set of words I chose myself. I tried restricting the stop-words to the top 1,000 words that were also standard stop-words, along with various alternatives, but they always performed worse.

I think the most interesting discovery was that a direct comparison of the top 500 words in each subreddit revealed that 460 were common to both. In other words, each subreddit had 40 words in its top 500 that did not appear in the other's top 500. I believe these 80 words were primarily responsible for an accuracy of over 70%, which, compared to the null baseline of 50%, was a pretty good jump in predictive power. Lastly, I thought it was illuminating that, when comparing the top 1,000 words from each subreddit, the ONLY word in r/depression that wasn't a top 1,000 word in r/SuicideWatch was the word 'depression'.
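The top-word comparison can be sketched with plain Python sets. This is a minimal illustration on made-up toy sentences, assuming simple whitespace tokenization; `top_words` and `compare_vocab` are hypothetical helper names, and the notebook itself may have used CountVectorizer counts instead.

```python
from collections import Counter

def top_words(docs, n):
    """Return the set of the n most frequent lowercase tokens in docs."""
    counts = Counter(word for doc in docs for word in doc.lower().split())
    return {word for word, _ in counts.most_common(n)}

def compare_vocab(docs_a, docs_b, n):
    """Split the two top-n vocabularies into shared and one-sided words."""
    top_a, top_b = top_words(docs_a, n), top_words(docs_b, n)
    return top_a & top_b, top_a - top_b, top_b - top_a

# Toy stand-ins for the two subreddits (not real submissions).
depression_docs = ["i feel tired and empty", "tired of feeling empty"]
suicide_docs = ["i feel done and tired", "done with feeling tired"]

shared, only_depression, only_suicidewatch = compare_vocab(
    depression_docs, suicide_docs, n=10
)
```

When both top-n sets have the same size, the two one-sided sets are necessarily the same size as well, which is why 460 shared words out of 500 leaves 40 on each side.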

Notebook 04: Model Tuning

In this notebook a number of estimators were run through a pipeline workflow, thanks to lesson notes from the weeks prior, the notes posted on advanced pipelines in the WCBC directory, and Stack Overflow. Random Forest, Logistic Regression, Support Vector Machine, and Multinomial Naive Bayes estimators are instantiated in a pipe with CountVectorizer tokenizing the subreddit text, as it consistently outperformed Term Frequency-Inverse Document Frequency (TF-IDF) across several variations of hyperparameters.
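A pipeline of this kind might look roughly like the sketch below, which lets a grid search swap both the vectorizer and the estimator. It assumes scikit-learn is installed, uses made-up toy sentences rather than the real submissions, and covers only two of the four estimators to stay short; the step names `vec` and `clf` and the grid values are illustrative, not the notebook's actual settings.

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline

# Toy stand-ins for submissions: 0 = r/depression, 1 = r/SuicideWatch.
texts = [
    "feeling empty and tired every day",
    "no motivation to get out of bed",
    "tired of feeling sad all the time",
    "cannot sleep and feel numb",
    "i want to end it all tonight",
    "thinking about ending my life",
    "i have a plan to die",
    "no reason to keep living anymore",
]
labels = [0, 0, 0, 0, 1, 1, 1, 1]

pipe = Pipeline([("vec", CountVectorizer()), ("clf", MultinomialNB())])

vectorizers = [
    CountVectorizer(stop_words="english"),
    TfidfVectorizer(stop_words="english"),
]
param_grid = [
    # Naive Bayes: try both vectorizers and two smoothing values.
    {"vec": vectorizers, "vec__max_features": [1000],
     "clf": [MultinomialNB()], "clf__alpha": [0.01, 1.0]},
    # Logistic Regression: try both vectorizers with default regularization.
    {"vec": vectorizers, "vec__max_features": [1000],
     "clf": [LogisticRegression(max_iter=1000)]},
]

grid = GridSearchCV(pipe, param_grid, cv=2, scoring="accuracy")
grid.fit(texts, labels)
best_vectorizer = type(grid.best_params_["vec"]).__name__
```

Passing whole estimator objects as grid values keeps one pipeline definition for every combination, so CountVectorizer and TF-IDF are scored on identical folds.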

Run times were far too long, so features were limited to 1,000 with fewer than 20,000 randomly selected submissions. The so-called "smoothing parameter" in Naive Bayes improved the score by a couple of percent at alpha=0.01. I understand conceptually why it is necessary as a placeholder count: if a term in the test set is not in the train set, a probability of zero would negate any useful information provided by the other words in the document (Reddit submission/row). I understand why a number between 0 and 1 is appropriate, but I do not know exactly why 0.01 is optimal.
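The effect of the smoothing parameter can be seen in a small hand-rolled example. This is a sketch of additive smoothing for a single class, P(w|c) = (count + alpha) / (total + alpha * V), on made-up counts; `log_likelihood` is a hypothetical helper and not scikit-learn's implementation.

```python
import math
from collections import Counter

def log_likelihood(tokens, class_counts, vocab_size, alpha):
    """Sum of log P(token | class) with additive smoothing alpha."""
    total = sum(class_counts.values())
    score = 0.0
    for token in tokens:
        numerator = class_counts.get(token, 0) + alpha
        if numerator == 0:
            # Unseen term with no smoothing: a zero probability
            # wipes out the evidence from every other word.
            return float("-inf")
        score += math.log(numerator / (total + alpha * vocab_size))
    return score

# Toy training counts for one class; "hopeless" never appears in training.
train_counts = Counter("tired empty tired alone".split())
vocabulary = set(train_counts) | {"hopeless"}
document = ["tired", "hopeless"]

unsmoothed = log_likelihood(document, train_counts, len(vocabulary), alpha=0.0)
smoothed = log_likelihood(document, train_counts, len(vocabulary), alpha=0.01)
```

A small alpha keeps the estimate close to the observed counts while still avoiding zeros, so the best value is likely an empirical property of this dataset found by cross-validation rather than something derivable in advance.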

Notebook 05: Production Model

Business Recommendations

-Yes, suicidal ideation can be distinguished from depression with a sensitivity of 90% when a threshold of P(Y=1) ≤ 0.39 is enforced
-A hybrid Machine Learning and rule-based approach may be necessary

The severity of suicidal ideation has been proposed by Beck et al. as an indicator of suicidal risk.
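Choosing a probability threshold to hit a target sensitivity can be sketched as follows. The example assumes the positive class is predicted whenever the predicted probability meets or exceeds the threshold, and it uses made-up probabilities, so it does not reproduce the 0.39 figure above; both function names are hypothetical.

```python
def sensitivity(y_true, probs, threshold):
    """Share of true positives caught when predicting 1 for probs >= threshold."""
    positives = [p for y, p in zip(y_true, probs) if y == 1]
    caught = sum(p >= threshold for p in positives)
    return caught / len(positives)

def highest_threshold_for(y_true, probs, target):
    """Largest observed probability that still meets the target sensitivity."""
    for threshold in sorted(set(probs), reverse=True):
        if sensitivity(y_true, probs, threshold) >= target:
            return threshold
    return 0.0

# Toy predicted probabilities: ten positive cases followed by five negatives.
y_true = [1] * 10 + [0] * 5
probs = [0.95, 0.9, 0.85, 0.8, 0.7, 0.6, 0.55, 0.45, 0.3, 0.1,
         0.5, 0.4, 0.25, 0.2, 0.05]

threshold = highest_threshold_for(y_true, probs, target=0.9)
```

Lowering the threshold below the default 0.5 trades more false positives for fewer missed cases, which is usually the preferred error in a screening setting.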

Resources:

Beck, A. T., Kovacs, M., & Weissman, A. (1979). Assessment of suicidal intention: The Scale for Suicide Ideation. Journal of Consulting and Clinical Psychology, 47(2), 343–352. https://doi.org/10.1037/0022-006X.47.2.343
