Pride and Prejudice and Innuendo #146

Open
michelleful opened this issue Dec 1, 2016 · 1 comment


michelleful commented Dec 1, 2016

Time having run out for my other grander ideas, I am reduced to (once again) taking Jane Austen's great work and injecting puerile humour into it.

This time I tried to find words containing innuendo, generally of the sexual variety, and italicise them in a nudge-wink kind of way. After experimenting with a few ways of obtaining the words (chiefly using sense2vec to find words used in similar contexts to actual swear words), I settled on searching Urban Dictionary for words whose primary (meaning most upvoted, I think) dictionary entry contained the word 'sex'. In addition, I replaced some perfectly innocent words with grawlixes for giggles.
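The selection rule described above can be sketched roughly as follows. This is a minimal illustration, not the code used for the novel: the `ENTRIES` data is invented for the example, the (definition, upvotes) layout and the `primary_definition` and `is_innuendo` helpers are hypothetical and do not reflect what py-urbandict actually returns, "primary" is assumed to mean "most upvoted", and matching 'sex' as a whole word is also an assumption.

```python
import re

# Invented sample data for illustration only: word -> list of (definition, upvotes).
ENTRIES = {
    "intercourse": [("Another word for sex.", 120), ("Conversation between people.", 15)],
    "friend": [("Someone you like spending time with.", 300)],
    "ball": [("A formal dance.", 40), ("To have sex.", 35)],
}


def primary_definition(definitions):
    """Return the text of the most upvoted definition (assumed meaning of 'primary')."""
    return max(definitions, key=lambda d: d[1])[0]


def is_innuendo(word, entries=ENTRIES, keyword="sex"):
    """True if the word's primary definition mentions the keyword as a whole word."""
    definitions = entries.get(word.lower())
    if not definitions:
        return False
    pattern = rf"\b{re.escape(keyword)}\b"
    return re.search(pattern, primary_definition(definitions), re.IGNORECASE) is not None
```

With this sample data, `ball` is not flagged because its top-voted sense is the innocent one, which is the same multiple-meanings problem noted in the lessons learned.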

Complete novel

Sample output:

IT is a truth universally acknowledged, that a single man in possession of a good fortune must be in want of a &@$%.

"It will be no use to us if twenty such should come, since you will not %&@# them."

"Depend upon it, my dear, that when there are twenty I will %@#& them all."

"Indeed, Sir, I have not the least intention of &$@*ing. — I entreat you not to suppose that I moved this way in order to beg for a partner."

He was most highly esteemed by Mr. Darcy, a most intimate, confidential friend.

I do not pretend to regret any thing I shall leave in Hertfordshire, except your society, my dearest friend; but we will hope at some future period, to enjoy many returns of the delightful intercourse we have known...

Tools used/lessons learned

  • @#&% is called a grawlix.
  • spaCy (Python)
    • Chiefly for part-of-speech tagging and (very little) dependency parsing.
    • Its token.text_with_ws attribute is especially useful for maintaining good spacing.
    • There's still room for a Python library to do intelligent text replacement (e.g. handling a/an, conjugation, plurals, phrasal verbs, etc.) though.
  • Urban Dictionary and py-urbandict
    • There are a lot of very common and completely innocuous words (and innocuous definitions) in UD, which I didn't expect.
    • I would have liked to use its word combinations but I wound up just using solitary words.
    • Urban Dictionary really could use part-of-speech information.
  • (earlier versions): sense2vec word embeddings
    • Does word2vec on (word, part-of-speech) combinations.
    • Trained on Reddit comments, which I was hoping would know swear words well.
    • Still very hard to triangulate words with multiple meanings like ball, which wasn't close to dance and a bunch of other likely words I tried. Further word sense disambiguation would still be useful.
  • Identifying words with innuendo is really hard and people are doing actual research on this.
    • Expanding the list of words searched for beyond 'sex' would be a good next step.
    • Maybe training a classifier on Urban Dictionary entries would work even better, incorporating other information like whether a word is used in sex-related subreddits versus other subreddits...
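
As a rough illustration of the spacing point in the spaCy notes above, here is a minimal sketch of swapping words for grawlixes while keeping the original whitespace. It uses only the standard library and a simple regex tokenizer in place of spaCy, pairing each token with its trailing whitespace in the spirit of token.text_with_ws. The names `grawlix`, `replace_words`, `TOKEN_RE` and `GRAWLIX_CHARS` are hypothetical, and the symbol set is an assumption.

```python
import random
import re

GRAWLIX_CHARS = "@#$%&*"  # assumed symbol set

# One token = a word or a single punctuation mark, plus any whitespace after it.
TOKEN_RE = re.compile(r"(\w+|[^\w\s])(\s*)")


def grawlix(length=4, rng=random):
    """Return `length` distinct symbols in random order."""
    return "".join(rng.sample(GRAWLIX_CHARS, length))


def replace_words(text, targets, rng=random):
    """Swap every word in `targets` for a grawlix, keeping the original spacing."""
    # Leading whitespace precedes the first token, so carry it over separately.
    pieces = [text[: len(text) - len(text.lstrip())]]
    for word, trailing_ws in TOKEN_RE.findall(text):
        if word.lower() in targets:
            word = grawlix(rng=rng)
        pieces.append(word + trailing_ws)
    return "".join(pieces)


if __name__ == "__main__":
    print(replace_words("must be in want of a wife.", {"wife"}))
```

Because every token carries its own trailing whitespace, joining the pieces reproduces the input exactly wherever nothing was replaced. This sketch does not attempt the harder cases mentioned above, such as a/an agreement, conjugation or plurals.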
lizadaly commented Dec 1, 2016

🎉 🍆
