
Social Media Filtering with Amazon Machine Learning

Amazon Machine Learning can help your company make better use of social media. This example application automatically analyzes Twitter content to identify customer support issues: it continuously monitors all tweets that mention your company's Twitter handle and predicts whether your company's customer support team should reach out to the poster. By using a machine learning (ML) model as your first tier of support, you can lower support costs and increase customer satisfaction. The application integrates Amazon Machine Learning with Amazon Mechanical Turk, Amazon Kinesis, AWS Lambda, and Amazon Simple Notification Service (Amazon SNS).

The example walks through the following steps:

  1. Gather training data
  2. Label training data with Amazon Mechanical Turk
  3. Create the ML Model
  4. Configure the model
  5. Set up continuous monitoring

Skipping ahead

Note that this repository includes examples of the output of the first two steps (gathering and labeling training data), so if you're anxious to get going and see ML working, you can jump ahead to step 3. Just download a sample of labeled training data from the @awscloud account, which is on S3 at https://aml-sample-data.s3.amazonaws.com/social-media/aml_training_dataset.csv (its S3 URL is s3://aml-sample-data/social-media/aml_training_dataset.csv), and save it in your local directory as aml_training_dataset.csv; this file is the final output of step 2.
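
If you'd rather script the download, here is a minimal sketch (assuming Python 3 and network access to the public S3 object):

from urllib.request import urlretrieve

# Fetch the sample labeled training data into the working directory.
urlretrieve(
    'https://aml-sample-data.s3.amazonaws.com/social-media/aml_training_dataset.csv',
    'aml_training_dataset.csv')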

You might be tempted to try using a model trained on our example data for your own application, but we don't recommend it. In ML, the quality of data is the most important thing. So if you use somebody else's data to analyze your customers, it's probably not going to work very well.

Step 0: Setting up your environment

You will need Python virtualenv and the npm Node.js package manager. On Linux machines with apt-get, you can install them with the following commands:

sudo apt-get update
sudo apt-get install python-virtualenv python-dev libffi-dev npm

Once those are installed, execute

source setup.sh

The script uses npm and Python's virtualenv to set up the required dependencies and environment variables in the current shell.

The following scripts depend on the Python boto library. Set up your boto credentials in ~/.aws/credentials (see the boto documentation for how to configure credentials, and the AWS documentation for how to obtain them). The AWS user that you choose needs access to a subset of the following policy to run the scripts:

{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Action": [
                "iam:*",
                "machinelearning:*",
                "kinesis:*",
                "lambda:*",
                "s3:*",
                "sns:*"
            ],
            "Resource": [
                "*"
            ]
        }
    ]
}
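
For reference, boto reads credentials from ~/.aws/credentials in a format like the following (placeholder values shown):

[default]
aws_access_key_id = YOUR_ACCESS_KEY_ID
aws_secret_access_key = YOUR_SECRET_ACCESS_KEY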

Step 1: Gathering training data

To gather the training data, run the following command:

python gather-data.py @awscloud

Substitute your company's Twitter handle for @awscloud, and configure your Twitter API credentials in config.py. See the Twitter developer documentation to learn how to obtain these credentials.
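
As a rough sketch, the Twitter credential entries in config.py look something like the following; the exact variable names are defined by the scripts in this repository, so treat these as placeholders:

# Placeholder Twitter API credentials -- use the variable names that config.py already defines.
twitterConsumerKey = 'YOUR_CONSUMER_KEY'
twitterConsumerSecret = 'YOUR_CONSUMER_SECRET'
twitterAccessToken = 'YOUR_ACCESS_TOKEN'
twitterAccessTokenSecret = 'YOUR_ACCESS_TOKEN_SECRET'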

This will produce a file called line_separated_tweets_json.txt that other scripts will read later.

Step 2: Label training data with Mechanical Turk

In this application, as in many ML applications, we're trying to build an ML model that mimics the behavior or opinions of humans. Building a good model requires lots of examples of the choices that humans would make. Doing this yourself is always an option, but often too slow or expensive to be practical. In supervised machine learning, these opinions are called the labels, or the target of the model.

Amazon Mechanical Turk (MTurk) is a great way to quickly and economically label large quantities of data. This section will walk through that process.

Step 2a: Prepare tweets for labelling with MTurk

The first step is to take the raw JSON data that we have received from the Twitter API and convert it to a CSV format that Mechanical Turk can use. Do this by running:

python build-mturk-csv.py

This will consume the line_separated_tweets_json.txt file and output a file called mturk_unlabeled_dataset.csv.
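
For intuition, here is a minimal sketch of the kind of conversion this script performs, assuming Python 3, that each input line is a tweet JSON object with a text field, and that the output column is named tweet to match the ${tweet} parameter in the HIT template (the real script may do more):

import csv
import json

# Read one tweet JSON object per line and write a one-column CSV for Mechanical Turk.
with open('line_separated_tweets_json.txt') as src, \
        open('mturk_unlabeled_dataset.csv', 'w', newline='') as dst:
    writer = csv.writer(dst)
    writer.writerow(['tweet'])  # assumed column name, matching ${tweet} in the HIT template
    for line in src:
        if line.strip():
            writer.writerow([json.loads(line)['text']])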

Step 2b: Submit the job to MTurk

Use the Mechanical Turk Console to create a set of Human Intelligence Tasks (HITs) to assign labels to these tweets. Turkers will be asked to pick which of the following labels best applies to each tweet:

  • Request
  • Question
  • Problem Report
  • Angry
  • None of the above (i.e. non-actionable)

These different categories will be collapsed into a single binary attribute of actionable / non-actionable. But asking for more detail can help the Turkers focus better on the content, and leaves open the opportunity to create more sophisticated ML models later with the same data.

For each tweet, we will request 3 assignments so that we can automatically determine the consensus among three different human opinions on the tweet.

Detailed steps for generating training labels using MTurk

  1. Create an account with Mechanical Turk
  2. Start a new project
  3. Select Other from the options and click Create Project
  4. Enter the properties on the next page. Suggested values (unless you know better):
  • Project Name: Labeling of tweets
  • Title: Categorize the tweet (WARNING: This HIT may contain adult content. Worker discretion is advised.)
  • Description: Categorize the tweet into 1 of 5 categories.
  • Keywords: tweet, tweets, categorization, labeling, sentiment
  • Checkbox for adult content: check it, because the content may contain offensive tweets.
  • Reward per assignment: higher values can yield faster results.
  • Number of assignments per HIT: 3
  • Time allotted per assignment: 2
  • HIT expires in: 7 days
  • Auto-approve and pay Workers in: 2 Hours
  5. On the design layout page, click the Source button and paste in the contents of mturk-project-template.xml. You may preview and edit it as you see fit, but leave the parameter ${tweet} and the checkbox values unmodified, as later steps depend on them.
  6. Preview and finish. This creates the project template.
  7. Go to Create New Batch with an Existing Project.
  8. Select Publish Batch for the project you just created.
  9. Follow the instructions on the following screens. You will use the CSV file produced by build-mturk-csv.py as part of this step.
  10. Preview the HITs and submit the batch for labeling. This step will cost you money.

Step 2c: Processing the output from MTurk

Once all of your Turk HITs are complete, download the results into a file called mturk_labeled_dataset.csv. Then run the script

python build-aml-training-dataset.py

to convert the 3 HIT responses for each tweet into a single dataset with a binary attribute.
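
Conceptually, the script performs a majority vote over the three assignments for each tweet. The sketch below illustrates the idea; the column names Input.tweet and Answer.category are assumptions based on the HIT template, not necessarily what the repository's script expects:

import csv
from collections import defaultdict

# Collect the three worker opinions for each tweet.
votes = defaultdict(list)
with open('mturk_labeled_dataset.csv', newline='') as f:
    for row in csv.DictReader(f):
        # Any category other than "None of the above" counts as actionable.
        votes[row['Input.tweet']].append(row['Answer.category'] != 'None of the above')

# Collapse the votes into a single binary label per tweet.
with open('aml_training_dataset.csv', 'w', newline='') as out:
    writer = csv.writer(out)
    writer.writerow(['text', 'trainingLabel'])  # illustrative column names
    for tweet, opinions in votes.items():
        writer.writerow([tweet, int(sum(opinions) >= 2)])  # actionable if at least 2 of 3 agree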

Step 3: Create the ML Model

Once you have your labelled training data in CSV format, creating the ML model requires a few API calls, which are automated in this script:

python create-aml-model.py aml_training_dataset.csv aml_training_dataset.csv.schema s3-bucket-name s3-key-name

This utility creates a machine learning model that performs binary classification. It requires the input dataset and its corresponding schema, specified by the file names in the parameters. The utility splits the dataset into two pieces: 70% of the dataset is used for training and 30% is used for evaluation. Once training and evaluation are successful, the AUC is printed, which indicates the quality of the model -- the closer to 1.0, the better.
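
Under the hood, the workflow roughly corresponds to the following Amazon ML API calls, shown here with boto3 for illustration; the IDs, names, and S3 locations are placeholders, and the real script also handles uploading the data and polling for completion:

import json
import boto3

ml = boto3.client('machinelearning')

# Placeholder S3 locations for the uploaded CSV and schema.
spec = {
    'DataLocationS3': 's3://s3-bucket-name/s3-key-name/aml_training_dataset.csv',
    'DataSchemaLocationS3': 's3://s3-bucket-name/s3-key-name/aml_training_dataset.csv.schema',
}
train_split = json.dumps({'splitting': {'percentBegin': 0, 'percentEnd': 70}})
eval_split = json.dumps({'splitting': {'percentBegin': 70, 'percentEnd': 100}})

# 70% of the data for training, 30% held out for evaluation.
ml.create_data_source_from_s3(DataSourceId='tweets-train-ds', ComputeStatistics=True,
                              DataSpec=dict(spec, DataRearrangement=train_split))
ml.create_data_source_from_s3(DataSourceId='tweets-eval-ds', ComputeStatistics=False,
                              DataSpec=dict(spec, DataRearrangement=eval_split))

# Train a binary classification model and evaluate it on the held-out portion.
ml.create_ml_model(MLModelId='tweets-model', MLModelType='BINARY',
                   TrainingDataSourceId='tweets-train-ds')
ml.create_evaluation(EvaluationId='tweets-eval', MLModelId='tweets-model',
                     EvaluationDataSourceId='tweets-eval-ds')

# Once the evaluation has completed, read back the AUC.
metrics = ml.get_evaluation(EvaluationId='tweets-eval')['PerformanceMetrics']
print('AUC:', metrics['Properties']['BinaryAUC'])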

Step 4: Configuring the model

Once your model is built, you need to decide how sensitive it should be. The model summary page provides the configuration options to adjust the model's sensitivity. A link to the model summary is printed when you run the tool in the previous step; you can also look up the model in the Amazon ML web console. Here you can set the score threshold. A lower value means more tweets will be classified as actionable, but there will also be more "false positives," where the model predicts that something is actionable when it isn't -- and vice versa.
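
If you prefer to script it, the same setting can be changed through the API. A minimal boto3 sketch, with a placeholder model ID and example threshold:

import boto3

ml = boto3.client('machinelearning')
# Lower thresholds classify more tweets as actionable (more false positives); higher thresholds, fewer.
ml.update_ml_model(MLModelId='ml-XXXXXXXXXXXX', ScoreThreshold=0.3)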

Step 5: Set up continuous monitoring

Continuous monitoring requires the following parts:

  1. A receiver of tweets from the Twitter streaming API.
  2. A Kinesis stream to which the receiver pushes the tweets.
  3. A Lambda function that processes records from the Kinesis stream.
  4. A real-time machine learning endpoint that the Lambda function calls to make predictions on the incoming tweets.
  5. An SNS topic to which the Lambda function pushes notifications when a tweet requires a response from customer service.

NOTE: Components being set up in this step have ONGOING costs associated with them. Please check the respective pricing pages for details.
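
To make the data flow concrete, here is a rough Python sketch of the kind of processing the Lambda function performs for each Kinesis record; the model ID, endpoint URL, topic ARN, and the name of the text attribute are placeholders, and the function deployed by this project may differ in detail:

import base64
import json
import boto3

ml = boto3.client('machinelearning')
sns = boto3.client('sns')

ML_MODEL_ID = 'ml-XXXXXXXXXXXX'  # placeholder: the model ID from Step 3
PREDICT_ENDPOINT = 'https://realtime.machinelearning.us-east-1.amazonaws.com'  # placeholder endpoint
SNS_TOPIC_ARN = 'arn:aws:sns:us-east-1:123456789012:tweet-alerts'  # placeholder topic ARN

def handler(event, context):
    for record in event['Records']:
        # Kinesis delivers each tweet JSON object base64-encoded.
        tweet = json.loads(base64.b64decode(record['kinesis']['data']))
        prediction = ml.predict(MLModelId=ML_MODEL_ID,
                                Record={'text': tweet.get('text', '')},  # attribute name is an assumption
                                PredictEndpoint=PREDICT_ENDPOINT)
        if prediction['Prediction']['predictedLabel'] == '1':
            # Notify customer service with a link to the actionable tweet.
            link = 'https://twitter.com/{}/status/{}'.format(
                tweet['user']['screen_name'], tweet['id_str'])
            sns.publish(TopicArn=SNS_TOPIC_ARN,
                        Subject='Actionable tweet detected',
                        Message=link)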

Step 5a: Setting up Kinesis/Lambda/Machine Learning realtime endpoint/SNS

Use the following script to automate the creation of the Kinesis stream, the Lambda function, the machine learning real-time endpoint, and the SNS topic.

python create-lambda-function.py

This script requires that config.py is present and contains appropriate values. The configuration required in config.py is described below:

  • awsAccountId : The AWS account ID corresponding to the credentials being used with boto. See docs for details.
  • kinesisStream : The name to give the Kinesis stream. See docs for constraints.
  • lambdaFunctionName : The name to give the Lambda function. See docs for constraints.
  • lambdaExecutionRole : The name to give the execution role used by the Lambda function. See docs for details and constraints.
  • mlModelId : The ID of the machine learning model used to perform predictions on the tweets. This is the ID of the model generated in Step 3.
  • region : The AWS region used for each of the services. See docs for details.
  • snsTopic : The name of the Simple Notification Service topic to create. See docs for constraints.

Step 5b: Testing the continuous monitoring setup

After this script executes successfully, the Kinesis stream is ready to accept tweet data. Use the following script to test that the setup works.

python push-json-to-kinesis.py line_separated_json.txt kinesisStreamName interval

The following parameters are needed to run this script:

  • line_separated_json.txt : A file that contains line-separated JSON data.
  • kinesisStreamName : The name of the stream to which the data is pushed.
  • interval : The interval in milliseconds between successive calls to the Kinesis stream.

This script simply pushes JSON data to the given Kinesis stream. Since previous steps already produced a file of line-separated tweet JSON data, we reuse it here for testing.
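
The core of the script is just a loop around the Kinesis PutRecord API. A minimal boto3 sketch (the partition key is an arbitrary example and error handling is omitted):

import sys
import time
import boto3

kinesis = boto3.client('kinesis')

def push_file(path, stream_name, interval_ms):
    with open(path) as f:
        for line in f:
            if line.strip():
                # Push one JSON record per line to the stream.
                kinesis.put_record(StreamName=stream_name,
                                   Data=line.encode('utf-8'),
                                   PartitionKey='tweets')
                time.sleep(interval_ms / 1000.0)

if __name__ == '__main__':
    push_file(sys.argv[1], sys.argv[2], int(sys.argv[3]))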

Step 5c: Pushing tweets into Kinesis using Twitter's Streaming APIs

This project includes a sample app that uses Twitter's public streaming API to push tweets matching a simple filter into Kinesis. For a production system, you can work with Gnip to consume streams; refer to their blog post on the subject, or their open source code on GitHub.

You'll need a Twitter library that supports streaming:

pip install twitter

Modify config.py to add a Kinesis partition name, the Twitter text filter you'd like to search for, and your Twitter credentials if you haven't already done so. Then simply call the sample scanner:

python scanner.py

Tweets that match your filter will be processed in real time and pushed to the Kinesis stream. The Lambda function will use the ML model to classify these tweets and publish a notification to the configured SNS topic, with a link to any tweet that is considered actionable. The easiest way to receive these notifications is to subscribe your email address to the SNS topic.
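
Subscribing can be done from the SNS console, or with a short boto3 call like the sketch below (the topic ARN and email address are placeholders); SNS then sends a confirmation email that you must accept before notifications start arriving.

import boto3

sns = boto3.client('sns')
# Subscribe an email address to the topic created in Step 5a.
sns.subscribe(TopicArn='arn:aws:sns:us-east-1:123456789012:your-sns-topic',
              Protocol='email',
              Endpoint='you@example.com')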