Cross product? #40

bensteinberg · 2022-11-27T01:24:38Z

This is a late entry from the half-bakery, a rough idea about squashing multiple texts together. It is, I think, similar to, but not the same as, an idea @rebeccacremona has mentioned. I have in mind some pseudo-mathematical ideas, along with phrases like "cross product" and "convolution", though I doubt the result will be any of those.

bensteinberg · 2022-11-27T02:12:34Z

The repo cross-product includes code and an example text of some 60,539 words, made by combining Moby Dick, The History of Tom Jones, a Foundling, and Middlemarch.

An early passage reads,

“In her sure south, could would Spermacetti thinking honor.” did an was any vice; the till she of of much with sea obstinacy had covered to at.
“In obedience never element force him swam, will, girl dived, may he and an chace, pains, marriage Fishes when for colour, a bookworm kind; relation, except, friend, a paint, we of Had grumbling for from with of dislike.

The mechanism produces a text with as many sentences as the shortest input. Each output sentence is as long as the shortest input sentence at that position, and takes its words from each input in turn.
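For readers who want to see the mechanism in code, here is a minimal sketch. It is not the code in the repo. The function names `split_sentences` and `cross` are made up for illustration, the sentence splitter is deliberately naive, and it assumes that word j of an output sentence is word j of input (j mod n), which is one plausible reading of "alternately".

```python
import re
from typing import List


def split_sentences(text: str) -> List[List[str]]:
    """Split text into sentences, each a list of whitespace-separated words.

    Naive assumption: a sentence ends at '.', '!' or '?' followed by whitespace.
    """
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    return [s.split() for s in sentences if s]


def cross(texts: List[str]) -> List[str]:
    """Interleave several texts sentence by sentence, word by word."""
    per_text = [split_sentences(t) for t in texts]
    output = []
    # zip stops at the input with the fewest sentences
    for group in zip(*per_text):
        # the output sentence is as long as the shortest sentence in the group
        length = min(len(sentence) for sentence in group)
        # word j comes from input (j mod n), at position j (an assumption)
        words = [group[j % len(group)][j] for j in range(length)]
        output.append(" ".join(words))
    return output


if __name__ == "__main__":
    for line in cross(["a b c d. e f g.", "A B C. E F G H. X Y."]):
        print(line)
```

Punctuation stays attached to its word, so a full stop survives only when the last word chosen happens to carry one.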

bensteinberg · 2022-11-27T02:19:58Z

(I imagine this or something like it has been done before.)

bensteinberg · 2022-11-27T02:28:13Z

The code takes local text files as inputs. It might be nice to retrieve texts from Project Gutenberg over the network, which would be a chance to get familiar with PG's machine-readable metadata.

bensteinberg · 2022-11-28T04:24:42Z

This change allows the use of Project Gutenberg text numbers as inputs, caching metadata and text files. The program is now somewhat more error-prone. There is no cache invalidation.

I went down the wrong path at first, beating my head against XPath and lxml until I realized that the catalog file hadn't been updated since 2014. The current catalog, a CSV file, is much easier to deal with (though I'm not using it at the moment), but the head-beating was useful, as I still had to handle the individual works' RDF files.
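A rough sketch of caching without invalidation, for illustration only and not taken from the repo: the names `text_url`, `fetch_over_network` and `cached_text` are hypothetical, and the URL pattern is an assumption about where Project Gutenberg serves plain-text files. A file that is already in the cache is never fetched again.

```python
from pathlib import Path
from typing import Callable
from urllib.request import urlopen


def text_url(number: int) -> str:
    # Assumed URL pattern for a work's plain-text file; check it before relying on it.
    return f"https://www.gutenberg.org/cache/epub/{number}/pg{number}.txt"


def fetch_over_network(number: int) -> bytes:
    """Download the raw bytes of the text with the given number."""
    with urlopen(text_url(number)) as response:
        return response.read()


def cached_text(
    number: int,
    cache_dir: Path,
    fetch: Callable[[int], bytes] = fetch_over_network,
) -> str:
    """Return the text for a work, downloading it only on the first request."""
    cache_dir.mkdir(parents=True, exist_ok=True)
    path = cache_dir / f"{number}.txt"
    if not path.exists():
        # Cache miss: fetch once and keep the file forever (no invalidation).
        path.write_bytes(fetch(number))
    return path.read_text(encoding="utf-8")
```

Passing `fetch` as a parameter makes it possible to exercise the cache without touching the network.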

bensteinberg · 2022-11-28T04:30:49Z

Almost any result is fun:

$ poetry run cross 5678 9987 9101 | head
Let they pie why fairy.
What is.

"Vows!" COURTSHIP a nurse, for.
"Is these 100 we.
To-day the the all that the there Eochaid them one stronghold to you in table-spoonful but the and but Fremain of "Never," moreover them.
"Worse a another returned.
To-day the drain be that and as of in you, to water, it that bear another not.
What Eochaid Oysters going the own.
That at quart young, of oysters.

bensteinberg · 2022-11-29T02:50:48Z

Another sample output, of about 65,526 words, was produced by squashing War and Peace, Crime and Punishment, and Anna Karenina:

poetry run cross 2600 2554 1399 > war-crime-karenina.txt

from which

You I.
They that’s you.
They going jumped the going.
How.
In not said.
In have am, years, be don’t be.

I've also added some input validation; I think I'll call this done.

hugovk added the preview label Nov 28, 2022

hugovk added the completed label Nov 29, 2022
