Cross product? #40

Open
bensteinberg opened this issue Nov 27, 2022 · 6 comments

@bensteinberg

This is a late entry from the half-bakery, a rough idea about squashing multiple texts together. It is, I think, similar to but not the same as an idea @rebeccacremona has mentioned. I have in mind some pseudo-mathematical ideas, along with phrases like "cross product" and "convolution", though I doubt the result will be any of those.

@bensteinberg
Author

The repo cross-product includes code and an example text of some 60,539 words, made by combining Moby Dick, The History of Tom Jones, a Foundling, and Middlemarch.

An early passage reads,

“In her sure south, could would Spermacetti thinking honor.” did an was any vice; the till she of of much with sea obstinacy had covered to at.
“In obedience never element force him swam, will, girl dived, may he and an chace, pains, marriage Fishes when for colour, a bookworm kind; relation, except, friend, a paint, we of Had grumbling for from with of dislike.

The mechanism produces a text with as many sentences as the shortest input. Each output sentence is as long as the shortest input sentence at that position, and takes its words from each input in turn.
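A minimal sketch of that mechanism, not the code in the repo: the function names `split_sentences` and `cross` are hypothetical, the sentence splitter is a naive regex, and it assumes that word *i* of an output sentence is word *i* of input number *i* mod *k* (one possible reading of "alternately").

```python
import re


def split_sentences(text):
    """Naive splitter (an assumption): break after ., ! or ? followed by whitespace."""
    return [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]


def cross(texts):
    """Interleave the words of several texts, sentence by sentence."""
    # One list of sentences per text; each sentence is a list of words.
    per_text = [[s.split() for s in split_sentences(t)] for t in texts]
    output = []
    # zip() stops at the text with the fewest sentences.
    for group in zip(*per_text):
        # Each output sentence is as long as the shortest sentence in the group.
        length = min(len(words) for words in group)
        # Word i is taken from input number (i mod k), at position i.
        output.append(" ".join(group[i % len(group)][i] for i in range(length)))
    return output
```

For example, `cross(["One two three four. Alpha beta.", "A b c. Gamma delta epsilon. Extra sentence here."])` yields two sentences, because the first text has only two.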

@bensteinberg
Author

(I imagine this or something like it has been done before.)

@bensteinberg
Author

The code takes local text files as inputs. It might be nice to retrieve texts from Project Gutenberg over the network, which would be a chance to get familiar with PG's machine-readable metadata.

@bensteinberg
Author

This change allows the use of Project Gutenberg text numbers as inputs, caching metadata and text files. The program is now somewhat more error-prone. There is no cache invalidation.
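A rough sketch of caching without invalidation, under stated assumptions: `cached_text` is a hypothetical helper, not the repo's function, and the download step is passed in as a `fetch` callable so that no Project Gutenberg URL or metadata format is assumed here.

```python
from pathlib import Path


def cached_text(book_id, cache_dir, fetch):
    """Return the text for book_id, calling fetch(book_id) only on a cache miss.

    fetch is any callable that returns the text as a string (hypothetical;
    the real download is left out). A cached file is never refreshed,
    so there is no cache invalidation.
    """
    cache_dir = Path(cache_dir)
    cache_dir.mkdir(parents=True, exist_ok=True)
    path = cache_dir / f"{book_id}.txt"
    if path.exists():
        # Cache hit: reuse the stored copy, however old it is.
        return path.read_text(encoding="utf-8")
    text = fetch(book_id)
    path.write_text(text, encoding="utf-8")
    return text
```

Once a file is in the cache directory, the only way to pick up a corrected upstream text is to delete the file by hand.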

I went down the wrong path at first, beating my head against XPath and lxml until I realized that the catalog file hadn't been updated since 2014. The current catalog, a CSV file, is much easier to deal with (though I'm not using it at the moment), but the head-beating was useful, as I still had to handle the individual works' RDF files.

@bensteinberg
Author

Almost any result is fun:

$ poetry run cross 5678 9987 9101 | head
Let they pie why fairy.
What is.

"Vows!" COURTSHIP a nurse, for.
"Is these 100 we.
To-day the the all that the there Eochaid them one stronghold to you in table-spoonful but the and but Fremain of "Never," moreover them.
"Worse a another returned.
To-day the drain be that and as of in you, to water, it that bear another not.
What Eochaid Oysters going the own.
That at quart young, of oysters.

@hugovk added the preview label Nov 28, 2022

@bensteinberg
Author

Another sample output, of about 65,526 words, was produced by squashing War and Peace, Crime and Punishment, and Anna Karenina:

$ poetry run cross 2600 2554 1399 > war-crime-karenina.txt

from which

You I.
They that’s you.
They going jumped the going.
How.
In not said.
In have am, years, be don’t be.

I've also added some input validation; I think I'll call this done.
