
Beyond Words - predicting user decisions with text data

Executive Summary

  • Software as a service (SaaS) is a major sector of the cloud computing business. To thrive in this competitive market, growing the user base is a crucial driver of business. Predicting and understanding customer decisions is imperative to help a company adapt its service in time to meet users’ needs.
  • Performing sentiment and text analysis on user communication data can be an effective way to gauge user experience and satisfaction. An algorithmic approach based on user text data was carried out to predict when a user is about to subscribe or unsubscribe. The client of this consulting project is a startup specializing in platforms that let content creators build their own mobile apps. The data are generated from in-app user communication.
  • 60 features were extracted from the text data, marked with different time periods, including sentiment, number of characters, number of words, etc. Machine learning models such as Random Forest and XGBoost were trained on these features to predict user decisions. Specifically, the model can 1) forecast users at high risk of churning 4 weeks in advance with 0.87 AUC and 2) estimate the user lifecycle, which was corroborated by additional time-series analysis. This gives my client valuable time to take action (e.g. sending out targeted surveys and in-app perks), and the model can then evaluate the performance of these strategies. Therefore, this machine learning approach can help my client grow their premium user base through prediction and evaluation.

Key Procedures

  1. Preprocessing text data for machines to read

    • Convert emoji and emoticons with the emoji and emot packages, respectively.
    • Note: although emot can also process emoji, its emoji database is not as comprehensive as emoji's.
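The conversion step above can be sketched as follows. This is a minimal stand-in using hand-rolled mappings rather than the emoji and emot packages the project actually used; the tiny `EMOJI_MAP` and `EMOTICON_MAP` dictionaries here are illustrative assumptions, not the packages' real databases.

```python
import re

# Tiny stand-in mappings; the real project relied on the far more
# comprehensive dictionaries in the `emoji` and `emot` packages.
EMOJI_MAP = {"\U0001F600": " grinning_face ", "\U0001F44D": " thumbs_up "}
EMOTICON_MAP = {":)": " smile ", ":(": " frown ", ":D": " laugh "}

def convert_symbols(text):
    """Replace emoji and emoticons with word tokens a model can read."""
    for symbol, token in EMOJI_MAP.items():
        text = text.replace(symbol, token)
    for symbol, token in EMOTICON_MAP.items():
        text = text.replace(symbol, token)
    # Collapse the extra whitespace introduced by the replacements.
    return re.sub(r"\s+", " ", text).strip()

print(convert_symbols("great job \U0001F44D :)"))  # great job thumbs_up smile
```

Replacing symbols with word tokens (rather than stripping them) preserves their sentiment signal for the downstream models.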
  2. Choosing the right natural language processing (NLP) models

    • Test unsupervised NLP: TextBlob and VADER
    • Test supervised NLP: off-the-shelf pretrained BERT (state-of-the-art)
    • Highly skewed data: user texts were overwhelmingly positive and supportive, which makes them unsuitable for existing unsupervised models or off-the-shelf supervised models.
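The skew problem can be made concrete with a majority-class baseline: when one class dominates, a model that always predicts that class already scores high, so off-the-shelf sentiment scores add little. The label counts below are a hypothetical distribution for illustration, not the project's actual data.

```python
from collections import Counter

# Hypothetical label distribution mirroring the observation that
# user messages were overwhelmingly positive.
labels = ["positive"] * 90 + ["neutral"] * 7 + ["negative"] * 3

counts = Counter(labels)
majority_label, majority_count = counts.most_common(1)[0]
baseline_accuracy = majority_count / len(labels)

print(majority_label, baseline_accuracy)  # positive 0.9
```

With a 0.9 majority baseline, raw accuracy is misleading, which motivates the custom labelling and fine-tuning in the next step.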
  3. Tuning BERT model with proper labelling

    • Create two types of labels for each text: Tone (positive/neutral/negative) and Content (rich/partial/none)
    • Fine-tune two BERT models through ktrain for each label class separately
    • Achieved accuracy scores of 0.85 and 0.78 for Tone and Content, respectively
    • Note: another approach is to merge the two label classes into one (2x3) and train a single model (less costly but weakened prediction: accuracy score 0.67 due to data imbalance)
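The merged-label alternative can be sketched as follows: crossing the two three-way label classes yields up to nine combined labels for a single classifier, which are easy to split back into the original pair. The `"tone|content"` string encoding is an assumption for illustration, not the project's actual scheme.

```python
from itertools import product

TONES = ["positive", "neutral", "negative"]
CONTENTS = ["rich", "partial", "none"]

# Merged scheme: one label string per (tone, content) pair, so a single
# classifier predicts both attributes at once.
merged_labels = [f"{t}|{c}" for t, c in product(TONES, CONTENTS)]

def split_label(label):
    """Recover the two original label classes from a merged prediction."""
    tone, content = label.split("|")
    return tone, content

print(len(merged_labels))            # 9
print(split_label("positive|rich"))  # ('positive', 'rich')
```

The trade-off noted above follows directly: one model is cheaper to train, but spreading the same data over more combined classes aggravates the imbalance.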
  4. Predicting user churn and bounce

    • Only use text data generated before user decisions
    • Extract text features, including numbers of words and characters and texts of different time periods, for each user
    • Combine text features and sentiment features (60 features)
    • Applied classification models and a stacking ensemble (KNN, RF, and XGB combined by Logistic Regression)
    • Achieved 0.89 and 0.76 accuracy for churn and bounce, respectively.
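A stacking ensemble of this shape can be sketched with scikit-learn. This is not the project's code: the data are synthetic (60 numeric features standing in for the text/sentiment features), and `GradientBoostingClassifier` stands in for XGBoost to keep the sketch scikit-learn-only.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import (GradientBoostingClassifier,
                              RandomForestClassifier, StackingClassifier)
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Synthetic stand-in data: 60 features per user, binary churn label.
X, y = make_classification(n_samples=500, n_features=60, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# KNN, RF, and a gradient-boosting stand-in for XGB as base learners,
# combined by a Logistic Regression meta-learner.
stack = StackingClassifier(
    estimators=[
        ("knn", KNeighborsClassifier()),
        ("rf", RandomForestClassifier(random_state=0)),
        ("gb", GradientBoostingClassifier(random_state=0)),
    ],
    final_estimator=LogisticRegression(),
)
stack.fit(X_train, y_train)
print(round(stack.score(X_test, y_test), 2))
```

The meta-learner sees the base models' out-of-fold predictions, so it can weight each model where it is most reliable rather than averaging them blindly.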
  5. Takeaways

    • Strong correlation between text and sentiment features
      • Text meta-features alone are good enough to predict user decisions (easy to scale up for big data)
    • User engagement level is a key indicator of user decisions
      • The model can predict user churn 4 weeks before the user decision
      • Premium users have a lifetime of 3-4 months
    • With more data
      • Real-time prediction and evaluation via a sliding-window approach
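The sliding-window idea mentioned above can be sketched in plain Python: slide a fixed-width time window over each user's time-stamped messages and recompute features per window, so predictions refresh as new data arrive. The function name and the `(timestamp, value)` event format are assumptions for illustration.

```python
def sliding_windows(events, window, step):
    """Yield (start, end, items) windows over time-stamped events.

    events: list of (timestamp, value) pairs sorted by timestamp;
    window and step are in the same time unit (e.g. days).
    """
    if not events:
        return
    start, t_end = events[0][0], events[-1][0]
    while start <= t_end:
        end = start + window
        # Collect the events falling inside the current window.
        items = [v for t, v in events if start <= t < end]
        yield start, end, items
        start += step

# Messages on days 0, 3, 8, and 10, scanned with weekly windows.
events = [(0, "msg"), (3, "msg"), (8, "msg"), (10, "msg")]
for start, end, items in sliding_windows(events, window=7, step=7):
    print(start, end, len(items))  # 0 7 2, then 7 14 2
```

Choosing a step smaller than the window would give overlapping windows and smoother, more frequent prediction updates.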

Presentation: YouTube and slides

Examples

Click to show an example of Emoji and Emoticon Conversion

Click to show the Sanity Check of sentiment analysis by different NLP models

NLP Models Performance Comparison, OTS: off-the-shelf

Last update 2020/11/05

Created 2020/10/08


License: CC BY 4.0

About

A technical summary of my consulting project "Beyond Words" at Insight Program 20C.DS.SV
