1001 plots #112

Open
WhiteFangs opened this issue Nov 22, 2017 · 5 comments
Labels
completed For completed novels!

Comments

@WhiteFangs

My idea for this year is to generate 1001 plots of around 50 words each, plus their titles, using the WikiPlots dataset and simple Markov chains.

I didn't think I would find time in November to join this year's edition, but I found one free evening and got started. My handicap is that I plan to do this in only a few hours, and in PHP (for a lot of not-very-good reasons).

Anyway, I started a few hours ago and struggled to build the statistical model for my Markov chain generator from a 220 MB text file containing all the plots, but I found a way (basically by cutting it into smaller files). Now I'm stuck with a >200 MB PHP array that I will try to use to generate the short plots. Let's hope it works; pray for my RAM.
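For reference, here's a minimal PHP sketch of the kind of chunked, order-1 model-building described above; the file name and variable names are illustrative, not the actual repo code:

```php
<?php
// Build an order-1 transition table ($word => [nextWord => count])
// by streaming the corpus line by line instead of loading the whole
// 220 MB file into memory at once.
$model = [];
$prev = null;
$handle = fopen('plots.txt', 'r'); // placeholder for the WikiPlots dump

while (($line = fgets($handle)) !== false) {
    $words = preg_split('/\s+/', trim($line), -1, PREG_SPLIT_NO_EMPTY);
    foreach ($words as $word) {
        if ($prev !== null) {
            $model[$prev][$word] = ($model[$prev][$word] ?? 0) + 1;
        }
        $prev = $word;
    }
}
fclose($handle);
```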

I plan to release the array-generation code as well as the text-generation code (but not the full data, because it's a bit heavy and can be rebuilt from the dataset).

@WhiteFangs
Author

WhiteFangs commented Nov 23, 2017

I ended up using a lighter version of my PHP array; it was still more than 70 MB, but reasonably usable.
Here's the code: https://github.com/WhiteFangs/1001plots
The resulting text is in the 1001-plots.html file, and also on my website: http://louphole.com/divers/1001-plots.html

I was hoping for more readable plots, but I fear Markov chains weren't sufficient this time. Anyway, I plan to update the README later and may try generating another sample with the full array, although I doubt the results will be much better.

@hugovk added the completed (For completed novels!) label Nov 23, 2017
@WhiteFangs
Author

I have an (easy) idea to (maybe) improve my model without making it heavier. I'll train it on a subset of the corpus to collect the words (as I did for my light array), then run it through the rest of the corpus without adding the new words it encounters (that's what makes the model much heavier with each pass). It will only increment the occurrence counts of already-known words, improving the statistical model without growing it. I hope to get more human-readable results this way.
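A minimal sketch of that two-pass idea, assuming the $model table from the first pass; the file name is a placeholder:

```php
<?php
// Second pass: walk the rest of the corpus, but only bump counts for
// transitions that already exist in $model, so the table never grows.
$prev = null;
$handle = fopen('plots_rest.txt', 'r'); // placeholder for the remaining corpus

while (($line = fgets($handle)) !== false) {
    foreach (preg_split('/\s+/', trim($line), -1, PREG_SPLIT_NO_EMPTY) as $word) {
        if ($prev !== null && isset($model[$prev][$word])) {
            $model[$prev][$word]++; // refine existing counts only
        }
        $prev = $word;
    }
}
fclose($handle);
```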
I'll keep the thread updated.

@WhiteFangs
Author

WhiteFangs commented Nov 23, 2017

So I ended up using an even lighter version of the model: 3,000 plots to collect the words, and the rest of the corpus to refine the counts. The result seems better; still not very readable, but sometimes funny. It reads like plots told by a child with no proper grammar but a good enough vocabulary. Think of it that way and it can actually make some sense =)

I also changed the length of the plots, to between 50 and 250 words each. The result is here: http://louphole.com/divers/1001-plots.html
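For illustration, a sketch of generating one plot in that 50-250-word range from an order-1 table like the one above; the pickNext() helper doing a count-weighted draw is hypothetical, not the repo's actual function:

```php
<?php
// Draw one word from [word => count] with probability proportional to count.
function pickNext(array $choices): string {
    $r = mt_rand(1, array_sum($choices));
    foreach ($choices as $word => $count) {
        $r -= $count;
        if ($r <= 0) {
            return $word;
        }
    }
    return array_key_first($choices); // fallback; not normally reached
}

$target = mt_rand(50, 250);     // random plot length in the stated range
$word = array_rand($model);     // random starting word
$plot = [$word];
while (count($plot) < $target && isset($model[$word])) {
    $word = pickNext($model[$word]);
    $plot[] = $word;
}
echo implode(' ', $plot), "\n";
```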

@greg-kennedy

Not sure how much you've done with Markov chains before, but ... grammar quality is basically controlled by the (word) length of each phrase in your lookup hash. This is called "order" in technical terms, at least according to Wikipedia.

I looked over your code and it seems like your table is "word1" -> pick_random_of("word2","word3","word4"), which is essentially just Order == 1.

To get better results, your seed phrase "word1" should be a two- or three-word phrase, so the follow-up word makes more sense in context. That way, instead of picking the next word based on the single word before it, you pick it based on the previous two or three words:

"word1 word2" -> array("word3", "word5"),
"word2 word3" -> array("word4"),
etc
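In PHP terms (since the project is PHP rather than Perl), an order-2 table keyed on two-word phrases might look like this sketch, assuming $text holds the corpus and pickNext() is the weighted draw from earlier:

```php
<?php
// Order-2 table: "word1 word2" => [nextWord => count].
$model = [];
$w1 = $w2 = null;
foreach (preg_split('/\s+/', $text, -1, PREG_SPLIT_NO_EMPTY) as $word) {
    if ($w1 !== null && $w2 !== null) {
        $key = "$w1 $w2";
        $model[$key][$word] = ($model[$key][$word] ?? 0) + 1;
    }
    [$w1, $w2] = [$w2, $word];
}

// Generation then slides the two-word window along:
//   $next = pickNext($model["$w1 $w2"]);
//   [$w1, $w2] = [$w2, $next];
```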

If you're familiar with Perl at all, maybe give this a look over: I wrote a Markov Perl module for an entry a couple of years ago, and you can steal ideas from it.

https://github.com/greg-kennedy/MarkovChain

@WhiteFangs
Author

I was aware of the order parameter for Markov chains, but the WikiPlots corpus contains many (many) proper nouns, and I feared they would bias the model into copying existing sentences. Also, I didn't have time to test whether the results would be better.

I also did the same thing as you, but in PHP; that's more or less the code I used for my model and text generation: https://github.com/WhiteFangs/WordBasedMarkov

Thanks for your advice though!
