Pride, Prejudice by @hugovk #130

Open
hugovk opened this issue Nov 30, 2017 · 13 comments
Labels: completed (For completed novels!), preview (There is an excerpt somewhere in the thread!)

Comments

@hugovk
Member

hugovk commented Nov 30, 2017

Pride, Prejudice

Generated output

What it does

The problem isn't generating over 50,000 words. The problem is existing books are too long. Pride and Prejudice is 130,000 words, Moby Dick is 215,136 words (or 215,136 meows). And we all know 50,000 is the gold standard for a novel! So how can we reduce the word count?

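A first pass of word-level tactics (the deboilerplatify, remove_quote_things, deveryify, decontractify and dehonorify steps in the example log below) reduces Pride and Prejudice by about 14%, to 111,633 words.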
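
As an illustration of one word-level tactic, removing honorifics could be sketched as below. This is a minimal sketch, not the code in reducifier.py: the function name mirrors the dehonorify step in the example log, the honorific list (Mr., Mrs., Miss, Dr.) is the one quoted later in this thread, and the regular expression is an assumption.

```python
import re

# Honorifics as quoted in the thread; the real script's list may differ (assumption).
HONORIFICS = ("Mr.", "Mrs.", "Miss", "Dr.")

# Match any honorific at a word boundary, plus the whitespace that follows it.
_HONORIFIC_RE = re.compile(
    r"\b(?:" + "|".join(re.escape(h) for h in HONORIFICS) + r")\s+"
)


def dehonorify(text):
    """Drop honorifics so that 'Mr. Bennet' becomes 'Bennet' (one word fewer)."""
    return _HONORIFIC_RE.sub("", text)


if __name__ == "__main__":
    print(dehonorify("Mr. Bennet and Mrs. Bennet met Miss Bingley."))
    # prints: Bennet and Bennet met Bingley.
```

A pattern this simple also strips "Miss" when it is used as a verb at the start of a phrase ("Miss you"), so a careful version would need to check the following word.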

Next we work out the ratio of the words we have left to 50k, count how many sentences we have, work out how many sentences to keep to approach 50k, and use a text summariser to chop out the dead wood.
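
The sentence budget can be sketched as follows. This is a hedged reconstruction, not the script's actual code: it assumes the ratio is rounded up and the budget is an integer division, which is consistent with the figures in the example log (111,633 words and 4588 sentences give a ratio of 3 and 1529 sentences to keep). `naive_summarise` is a hypothetical stand-in for a real text summariser, shown only to make the "keep N sentences" step concrete.

```python
import math
import re
from collections import Counter

TARGET_WORDS = 50_000


def sentences_to_keep(word_count, sentence_count, target=TARGET_WORDS):
    """Return (ratio, keep) for a given word and sentence count."""
    ratio = math.ceil(word_count / target)  # assumption: ratio is rounded up
    return ratio, sentence_count // ratio


def naive_summarise(sentences, keep):
    """Hypothetical stand-in for a summariser.

    Scores each sentence by the average corpus frequency of its words and
    returns the `keep` best-scoring sentences in their original order.
    """
    def words(sentence):
        return re.findall(r"\w+", sentence.lower())

    freq = Counter(w for s in sentences for w in words(s))

    def score(sentence):
        ws = words(sentence)
        return sum(freq[w] for w in ws) / max(len(ws), 1)

    ranked = sorted(range(len(sentences)),
                    key=lambda i: score(sentences[i]), reverse=True)
    return [sentences[i] for i in sorted(ranked[:keep])]


if __name__ == "__main__":
    print(sentences_to_keep(111_633, 4588))  # prints: (3, 1529)
```

Rounding the ratio up makes the cut more aggressive than the exact ratio (about 2.23) would, which may be why the summarised text lands near 54k words rather than far above 50k.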

How to do it

Run:

pip install -r requirements.txt

python reducifier.py

Example:

python reducifier.py
open
word count: 130,000
word count: 126,936	diff: 97.643%	deboilerplatify
word count: 125,438	diff: 96.491%	remove_quote_things
word count: 121,549	diff: 93.499%	deveryify
word count: 121,018	diff: 93.091%	decontractify
word count: 111,633	diff: 85.872%	dehonorify
Ratio (words/50k):	 3
Number of sentences:	 4588
Number to keep:		 1529
word count: 54,273	diff: 41.748%	summarise
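
The diff column in the log above is consistent with the current word count expressed as a percentage of the original 130,000 words. A minimal sketch of that arithmetic, using a hypothetical helper name:

```python
ORIGINAL_WORDS = 130_000  # starting word count from the log above


def diff_percent(current_words, original_words=ORIGINAL_WORDS):
    """Format the current word count as a percentage of the original."""
    return f"{current_words / original_words * 100:.3f}%"


if __name__ == "__main__":
    print(diff_percent(126_936))  # prints: 97.643%
    print(diff_percent(111_633))  # prints: 85.872%
    print(diff_percent(54_273))   # prints: 41.748%
```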

This produces output.txt before the summariser, and output2.txt after the summariser.

Works at least on macOS High Sierra with Python 3.6.3.

Example

Here's a diff of Pride and Prejudice and the first pass output.txt:

'tis a truth universally acknowledged, that a single man in possession of a good fortune, must be in want of a wife.

Source code

https://github.com/hugovk/NaNoGenMo-2017/tree/master/03-reducifier

hugovk added the completed and preview labels Nov 30, 2017
@hugovk hugovk mentioned this issue Nov 30, 2017
8 tasks
@janelleshane

Ha, this is great! 60% reduced Pride and Prejudice is still totally readable.

Too bad the summarizer took out all the damns.

@henrikh

henrikh commented Dec 4, 2017

"Remove honorifics (Mr., Mrs., Miss, Dr.)" 😱 How can I then tell the "Bennet"s apart?!

@alexyuriev

@janelleshane Cliff-notes are also readable.

@sandes

sandes commented Dec 4, 2017

Great

@danesparza

danesparza commented Dec 4, 2017

@henrikh Agreed -- lines like this become ... odd.

😬

@hugovk
Member Author

hugovk commented Dec 4, 2017

@henrikh @danesparza Yep, I did realise that but unfortunately they just had to go to reduce the word count :) I should have replaced "Mrs. Bennet" with her maiden name, "Gardiner"!

@bryanrasmussen

Sometimes you will see major characters referred to by a shortened version of their name after they are introduced. I would suggest calling Mrs. Bennet "Mrs. B" and Mr. Bennet "Mr. B". That keeps the honorifics, so it doesn't reduce the word count, but it does reduce the character count.

@bryanrasmussen

bryanrasmussen commented Dec 5, 2017

Actually considering the patriarchy Mr. B can just be B.

on edit: Ms can be used in place of Mrs. in modern times of course.

@hugovk
Member Author

hugovk commented Dec 5, 2017

@bryanrasmussen Word count is all that matters :)

@bryanrasmussen

bryanrasmussen commented Dec 5, 2017 via email

@hugovk
Member Author

hugovk commented Dec 5, 2017

PS. Using the 't' contraction instead of 'the' makes this really hard to parse.

Only in some cases.

"...by a young man of large fortune from t'north of England;"[1]

This is just about the perfect edit.

[1] https://github.com/hugovk/NaNoGenMo-2017/blob/master/03-reducifier/output.txt#L35

:)

See https://news.ycombinator.com/item?id=15823499 for more discussion.

@philsnow

philsnow commented Dec 5, 2017

@henrikh you'd have to make do with context, I suppose, but that's not all that different from the base text, because only the eldest daughter is addressed by her surname alone ("Miss Bennet"), whereas the younger daughters are addressed by either their first or full names ("Miss Elizabeth" / "Miss Elizabeth Bennet"). I haven't read Pride and Prejudice in a while; are there any examples where the reader must discern identity (among Bennets or any other family) from context?

@henrikh

henrikh commented Dec 5, 2017

@philsnow As far as I recall, Elizabeth is actually referred to as "Miss Bennet" when addressed directly by Mr Darcy and Mr Wickham -- but, of course, in those situations there would be no doubt 😉
