Using machine learning to analyze app reviews for suicide prevention apps
Using Apple app store and Google play store api to obtain the user reviews of suicide prevention related apps.
- Positive:85665
- Negative:21141
- Neutral:4216
- Stripped punctuation, special symbols, and unnecessary spaces.
- Normalized excessive character repetitions (for instance, transforming “toooo goooood” to “too good”).
- Excluded numerical values.
- Substituted slang terms with their standard English equivalents using the referenced no slang dictionary.
- Unpacked contractions (such as converting “oughtn’t” to “ought not” and “there’s” to “there is”).
- Transitioned all words to lowercase.
- Eliminated stop words like "the", "an", "will", and so on.
- Employed the WordNet Lemmatizer, a feature of Python's nltk module that leverages WordNet, to return words to their base form. For instance, “better” was changed to “good” and “regretted” to “regret”.
- Omitted duplicate entries. Following these preprocessing steps, the review count condensed to 110337.
- Positive: 63912
- Negative: 15764
- Neutral: 3077 In order to balance the data set, we need to have a equal number of positive and negative reviews, after removing the NaN values after cleaned_text, we have
- Positive: 15764
- Negative: 15764
I used 5 models which are popular in the NLP field, which are Logistic Regression,Random Forest,Gaussian NB, MultinomialNB and SGDClassifier.
All five classifiers surpassed the chance baseline of 50%
. Random Forest stood out with an impressive F1 score of 87.19%
, closely followed by SGDClassifier, Logistic Regression, and MultinomialNB with 86.47%
, 86.04%
, and 86.07%
respectively. In addition, RF achieved a high precision and recall of 88.4%
and 85%
respectively for negative reviews with 85.6%
and 88.9%
for positive reviews . Thus, the RF classifier was able to correctly predict the sentiment.
Next, we applied the best performing ML classifier (i.e.,RF) to classify the 27584 reviews that were not labelled. Based on the prediction results, 19898 reviews were classified as positive, while 7686 reviews were classified as negative.