AitoDotAI / sarcasm-on-reddit Public

Information and exercises to explore Sarcasm on Reddit dataset in Kaggle.

NOTE: This example repository is not actively maintained. For more recent examples, see the Aito Docs.

Sarcasm on Reddit

This repository contains information and exercises to explore the Sarcasm on Reddit dataset.

Key information

- The read-only API key is 9Ik1wJQ1tq86vMQG7taDB2cgfpSogUFu69lBGTnV
- The environment is served at https://aito-reddit-sarcasm.api.aito.ai
- The Aito documentation and API docs are good resources
- You can use our hosted Swagger UI to run queries from the browser
- Data preparation and upload are explained in the Initial Aito setup chapter

Curl reference

Here's a curl command to list rows in the comments table (the default limit is 10 results).

curl -X POST \
  https://aito-reddit-sarcasm.api.aito.ai/api/v1/_search \
  -H "content-type: application/json" \
  -H "x-api-key: 9Ik1wJQ1tq86vMQG7taDB2cgfpSogUFu69lBGTnV" \
  -d '
  {
    "from": "comments"
  }
  '

Exploration exercises

This section contains exercises you can try to explore the Sarcasm on Reddit dataset. Note that we've reduced the data to 10k comments (the original has 1.3M) in the publicly shared instance.

1. Search for comments which are labeled sarcastic

Hint: API documentation. This helps to understand the labeling distribution.

The response you should see:

{
  "offset" : 0,
  "total" : 5000,
  "hits" : [ {
    "author" : "Trumpbart",
    "comment" : "NC and NH.",
    "comment_2grams" : "nc-and and-nh.",
    "comment_has_upper_case_word" : true,
    "comment_whitespace" : "NC and NH.",
    "date" : "2016-10",
    "downs" : -1,
    "label" : 0,
    "parent_comment" : "Yeah, I get that argument. At this point, I'd prefer is she lived in NC as well.",
    "score" : 2,
    "subreddit" : "politics",
    "ups" : -1
  },
  ...

2. Search for comments which have the word "cool" in them

Hint: Text operators

The response you should see:

{
  "offset" : 0,
  "total" : 33,
  "hits" : [ {
    "author" : "Lolwhatisfire",
    "comment" : "Zip lines are cool, but I'm more interested in what appears to be a swimming pool underneath her that is larger than my hometown.",
    "comment_2grams" : "zip-lines lines-are are-cool, cool,-but but-i'm i'm-more more-interested interested-in in-what what-appears appears-to to-be be-a a-swimming swimming-pool pool-underneath underneath-her her-that that-is is-larger larger-than than-my my-hometown.",
    "comment_has_upper_case_word" : false,
    "comment_whitespace" : "Zip lines are cool, but I'm more interested in what appears to be a swimming pool underneath her that is larger than my hometown.",
    "date" : "2016-11",
    "downs" : -1,
    "label" : 0,
    "parent_comment" : "Dubai Zipline",
    "score" : 15,
    "subreddit" : "gifs",
    "ups" : -1
  },
  ...

3. Search for the most upvoted sarcastic comment in "AskReddit" subreddit

No hint.

The response you should see:

{
  "offset" : 0,
  "total" : 262,
  "hits" : [ {
    "$sort" : 90,
    "author" : "BlackIronSpectre",
    "comment" : "You're a bloody traitor you kilt sniffing cunt!",
    "comment_2grams" : "you're-a a-bloody bloody-traitor traitor-you you-kilt kilt-sniffing sniffing-cunt!",
    "comment_has_upper_case_word" : false,
    "comment_whitespace" : "You're a bloody traitor you kilt sniffing cunt!",
    "date" : "2016-09",
    "downs" : 0,
    "label" : 1,
    "parent_comment" : "English here.. bloody love Irn Bru. But then again.. I love Scotland and everything about it. I think I should have been born Scottish.",
    "score" : 90,
    "subreddit" : "AskReddit",
    "ups" : 90
  } ]
}

4. Predict if "wow you are smart" comment is sarcastic or not

No hint.

The response you should see:

{
  "offset" : 0,
  "total" : 2,
  "hits" : [ {
    "$p" : 0.9309766190277114,
    "field" : "label",
    "feature" : 1
  }, {
    "$p" : 0.06902338097228855,
    "field" : "label",
    "feature" : 0
  } ]
}

The probability of the text being sarcastic is 93.1% based on Aito's prediction.

5. Explain the results of the last prediction

Hint: select $why to get statistical information about predictions.

The response you should see:

{
  "offset" : 0,
  "total" : 2,
  "hits" : [ {
    "$why" : {
      "type" : "product",
      "factors" : [ {
        "type" : "baseP",
        "value" : 0.5
      }, {
        "type" : "normalizer",
        "name" : "exclusiveness",
        "value" : 1.0
      }, {
        "type" : "relatedVariableLift",
        "variable" : "comment:wow",
        "value" : 1.6189729371414923
      }, {
        "type" : "relatedVariableLift",
        "variable" : "comment:smart",
        "value" : 1.520886821001524
      } ]
    }
  },
  ...

Comments which have the word "wow" in them are 1.6x more likely to be sarcastic than an average comment.

6. Evaluate how accurately Aito could predict if a comment is sarcastic based on just the comment

Hint: the goal is to use 90% of the data in Aito for training and 10% for testing the accuracy. The 10% of data will be tested as if Aito didn't know if the comments are sarcastic or not. See Evaluate in API docs.

The response you should see:

{
  "mxe" : 0.9271739795298308,
  "baseAccuracy" : 0.5,
  "meanUs" : 11184.48125,
  "accuracyGain" : 0.12,
  "n" : 1000,
  "rankGain" : 0.12,
  "warmingMs" : 0.0,
  "features" : 119133.0,
  "accuracy" : 0.62,
  "trainSamples" : 9000.0,
  "geomMeanP" : 0.5258874672093419,
  "baseGmp" : 0.5,
  "meanMs" : 11.18448125,
  "error" : 0.38,
  "baseError" : 0.5,
  "testSamples" : 1000,
  "geomMeanLift" : 1.0517749344186837,
  "meanRank" : 0.38,
  "meanNs" : 1.118448125E7,
  "h" : 1.0,
  "informationGain" : 0.07282602047016917,
  "baseMeanRank" : 0.5
}

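The field to look at here is accuracy: Aito labeled 62% of the 1000 test comments correctly, compared with a 50% baseline (baseAccuracy).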
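Several other fields in this response appear to be simple functions of each other. The sketch below only reuses numbers printed in the response above; the formulas are inferred from those values rather than taken from the Aito docs, and the variable names are illustrative.

```shell
# Values copied from the evaluate response above.
accuracy=0.62
base_accuracy=0.5
geom_mean_p=0.5258874672093419
base_gmp=0.5

# Assumed relationships (inferred from the numbers, not from the API docs):
#   accuracyGain = accuracy - baseAccuracy
#   error        = 1 - accuracy
#   geomMeanLift = geomMeanP / baseGmp
accuracy_gain=$(awk -v a="$accuracy" -v b="$base_accuracy" 'BEGIN { printf "%.2f", a - b }')
error=$(awk -v a="$accuracy" 'BEGIN { printf "%.2f", 1 - a }')
geom_mean_lift=$(awk -v p="$geom_mean_p" -v b="$base_gmp" 'BEGIN { printf "%.4f", p / b }')

echo "accuracyGain=$accuracy_gain error=$error geomMeanLift=$geom_mean_lift"
# prints: accuracyGain=0.12 error=0.38 geomMeanLift=1.0518
```

The printed values match the accuracyGain, error and geomMeanLift fields of the response (the last one rounded to four decimals).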

7. Explain what features make comments sarcastic

Hint: see Relate query in API docs.

The response you should see:

{
  "offset" : 0,
  "total" : 99,
  "hits" : [ {

    ...

    {
      "related" : "label:1",
      "lift" : 1.5481235278381007,
      "condition" : "comment:yeah",
      "fs" : {
        "f" : 5000,
        "fOnCondition" : 275,
        "fOnNotCondition" : 4725,
        "fCondition" : 354,
        "n" : 10000
      },
      "ps" : {
        "p" : 0.5,
        "pOnCondition" : 0.7740617639190503,
        "pOnNotCondition" : 0.4899421867374226,
        "pCondition" : 0.035399930417845726
      },
      "info" : {
        "h" : 1.0,
        "mi" : 0.22913674121873823,
        "miTrue" : 0.4880618812947286,
        "miFalse" : -0.2589251400759904
      },
      "relation" : {
        "n" : 10000,
        "varFs" : [ 354, 5000 ],
        "stateFs" : [ 4921, 79, 4725, 275 ],
        "mi" : 0.008392996906335543
      }
    },
    ...

By default, results are sorted by the strength of the correlation, strongest first. Comments which have the word "yeah" in them are 1.5x more likely to be sarcastic than an average comment.
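The 1.5x figure can be reproduced from the ps block of the hit. This is a small sketch using only the values shown in the response; the reading of lift as pOnCondition divided by p is inferred from those numbers, and the variable names are illustrative.

```shell
# Values copied from the "comment:yeah" hit above.
p=0.5                              # ps.p: share of sarcastic comments overall
p_on_condition=0.7740617639190503  # ps.pOnCondition: share among comments containing "yeah"
f_on_condition=275                 # fs.fOnCondition: sarcastic comments containing "yeah"
f_condition=354                    # fs.fCondition: all comments containing "yeah"

# Lift as the ratio of the conditional probability to the base probability.
lift=$(awk -v c="$p_on_condition" -v p="$p" 'BEGIN { printf "%.4f", c / p }')

# Raw frequency ratio, for comparison with pOnCondition.
raw_ratio=$(awk -v a="$f_on_condition" -v b="$f_condition" 'BEGIN { printf "%.4f", a / b }')

echo "lift=$lift raw_ratio=$raw_ratio"
# prints: lift=1.5481 raw_ratio=0.7768
```

The raw ratio (0.7768) is close to, but not the same as, pOnCondition (0.7741), so the reported probabilities are presumably not plain frequency ratios.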

Initial Aito setup

This can be used as a guide to upload this same dataset into your own environment.

Full speed

Warning: this deletes all data in the Aito environment.

Set the API_KEY environment variable
Change AITO_ENV in upload.sh
Run npm install for transform.js
Run bash upload.sh

Steps explained

Download the Sarcasm on Reddit dataset.
Unzip the package. We'll be using the train-balanced-sarcasm.csv file.
Cut the data size

The balanced data set contains 50% sarcastic and 50% not sarcastic comments. We want to maintain this balance.

Doing a simple head -n 10001 train-balanced-sarcasm.csv > train-balanced-sarcasm-small.csv results in the wrong balance: 3710 sarcastic comments out of 10 000.

The csv fortunately has the label information as the first character on each line, so we can split it based on that:
```
head -n 1 train-balanced-sarcasm.csv > train-balanced-sarcasm-small.csv

grep -i "^0" train-balanced-sarcasm.csv | head -n 5000 >> train-balanced-sarcasm-small.csv
grep -i "^1" train-balanced-sarcasm.csv | head -n 5000 >> train-balanced-sarcasm-small.csv
```
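To check that the split really is balanced, you can count the rows per label afterwards. The sketch below runs the same approach on a tiny made-up file (the rows and temporary file names are illustrative, not from the dataset), keeping 2 rows per label instead of 5000. Like the commands above, it assumes every record starts on its own line.

```shell
# Tiny stand-in for train-balanced-sarcasm.csv (made-up rows, label first).
sample=$(mktemp)
small=$(mktemp)
printf '%s\n' 'label,comment' '0,a' '0,b' '1,c' '0,d' '1,e' '1,f' > "$sample"

# Same approach as above: keep the header, then the first N rows of each label.
head -n 1 "$sample" > "$small"
grep "^0" "$sample" | head -n 2 >> "$small"
grep "^1" "$sample" | head -n 2 >> "$small"

# Count rows per label; the two counts should be equal.
not_sarcastic=$(grep -c "^0" "$small")
sarcastic=$(grep -c "^1" "$small")
echo "not sarcastic: $not_sarcastic, sarcastic: $sarcastic"

rm -f "$sample" "$small"
```

On the real file, the same two grep -c commands against train-balanced-sarcasm-small.csv should each report 5000.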
With csvtojson, run csvtojson train-balanced-sarcasm-small.csv > comments.json

You could auto-convert types with csvtojson --checkType=true train-balanced-sarcasm-small.csv > comments.json, but in this case it didn't work properly: comments whose text is just a number are converted too, and then they no longer comply with the schema.
With jq, convert numbers to the correct types and also convert JSON to NDJSON:
```
jq -c '.[] | . + {label: (.label)|tonumber, score: (.score)|tonumber, ups: (.ups)|tonumber, downs: (.downs)|tonumber}' comments.json > comments.ndjson
```
Breaking it down:
- -c: compact output (removes extra whitespace, one object per line)
- .[]: iterates the JSON array and passes each object through the pipe (this is what ultimately causes the JSON to NDJSON conversion)
- |: similar to a shell pipe
- . +: . refers to the individual object, and + extends it
- {label: (.label)|tonumber, score: (.score)|tonumber, ups: (.ups)|tonumber, downs: (.downs)|tonumber}: the new object, which overrides the string values with numerical ones
Enrich the data with transform.js
- Add 2-grams of the comment
- Duplicate comment to a comment_whitespace so we can try Whitespace analyzer instead of English. Whitespace analyzer preserves for example capital letters.
- Add comment_has_upper_case_word boolean as an example
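As an illustration of the 2-gram enrichment, here is a minimal shell sketch that reproduces the comment_2grams values shown in the exercise responses. It is not the code in transform.js: the two_grams helper is hypothetical, and the rule (lowercase, split on whitespace, join adjacent words with "-") is inferred from the sample responses.

```shell
# Hypothetical helper, not taken from transform.js: lowercases the text,
# splits it on whitespace and joins each adjacent pair of words with "-".
two_grams() {
  printf '%s\n' "$1" | awk '{
    $0 = tolower($0)            # reassigning $0 re-splits the fields
    out = ""
    sep = ""
    for (i = 1; i < NF; i++) {
      out = out sep $i "-" $(i + 1)
      sep = " "
    }
    print out
  }'
}

two_grams "NC and NH."
# prints: nc-and and-nh.
```

Punctuation stays attached to the words, which matches values such as "and-nh." in the responses above.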

Create the schema to the environment

Add export API_KEY=READ_WRITE_KEY to a .env file.

source .env

curl -X DELETE \
  https://aito-reddit-sarcasm.api.aito.ai/api/v1/schema \
  -H "x-api-key: $API_KEY"

curl -X PUT \
  https://aito-reddit-sarcasm.api.aito.ai/api/v1/schema \
  -H "content-type: application/json" \
  -H "x-api-key: $API_KEY" \
  -d@schema.json

With upload-file.sh, upload comments.ndjson to the environment.

Make sure the API_KEY environment variable is set; upload-file.sh uses it.
```
bash upload-file.sh comments.ndjson https://aito-reddit-sarcasm.api.aito.ai
```
Now the data should be uploaded!
