Whalequest #32

Open
mkremins opened this issue Nov 28, 2023 · 4 comments

Comments

@mkremins

I didn't intend to submit anything to NaNoGenMo this year. But I've been helping out a bit with @ikarth's hlockey commentary and playing around a lot with sentence embeddings, and an idea popped into my head that I couldn't help but realize. So here it is.

Whalequest sorts the sentences in Moby-Dick by their semantic similarity to the word "whale", with the least whaley sentences at the beginning and the most whaley sentences at the end. The novel is thus reordered into a quest through semantic space for the whale, with the reader progressing closer and closer to the ultimate object of their quest as they read.

Samples

The beginning of the novel ends up as a pile of mostly stage directions:

(ENTER AHAB: THEN, ALL) BOOK I. (FOLIO) CHAPTER IV. (SULKY AND SLEEPY) (AHAB GOES; PIP STEPS ONE STEP FORWARD.) (ASIDE) (QUIETLY SMOKING)

(DANCING) AHAB (ADVANCING) (DURING THE ENSUING SCENE, THE CARPENTER CONTINUES SNEEZING AT TIMES)

(OCTAVO), CHAPTER IV. BOOK I. (FOLIO), CHAPTER II. (OCTAVO), CHAPTER II. BOOK III. (DUODECIMO), CHAPTER 1.

BOOK I. (FOLIO), CHAPTER III.

(OCTAVO), CHAPTER III. BOOK III. (DUODECIMO), CHAPTER II. BOOK III. (DUODECIMO), CHAPTER III. PIP (SHRINKING UNDER THE WINDLASS). (SINGS, AND ALL FOLLOW) CARPENTER (RESUMING HIS WORK).

BOOK I. (FOLIO), CHAPTER VI. (THEY SCATTER.) DAGGOO (GRIMLY). DAGGOO (SPRINGING). (RECLINING.)

As we progress further, we continue to encounter little clusters of strongly similar sentences, like this catalog of sailors:

NANTUCKET SAILOR. MALTESE SAILOR. SICILIAN SAILOR. AZORE SAILOR. AZORE SAILOR. LASCAR SAILOR. MALTESE SAILOR. SICILIAN SAILOR. TAHITAN SAILOR.

Or this bit, where the narrator becomes very surprised and takes a moment to gather their wits:

By heavens! "Think, think of the fevers, yellow and bilious!" In the present instance, too, this dignity is heightened by the pepper and salt colour of his head at the summit, giving token of advanced age and large experience. Nothing will content them but the extremest limit of the land; loitering under the shady lee of yonder warehouses will not suffice. Upon which Ahab retired, and Stubb returned to the Frenchman. These lines run whole, and whirling out: come in broken, and dragging slow. "Well done, old Fleece!" cried Stubb, "that's Christianity; go on." "They are stowaways, Mr. Flask." Oh! Oh!

Oh!

Oh! Oh! Oh! Oh! Oh! Oh! Oh! Oh! Oh! Oh! Oh!

Oh! Oh! Oh!

Oh!

Oh! Oh! Oh! Stand by to come close alongside, now! "Look ye, blacksmith, these are the gathered nail-stubbs of the steel shoes of racing horses." Improving his mind, poor fellow!

Certain juxtapositions linger in the mind:

This world pays dividends. No, no; only black water!

Haul in the chains! "But what are you holding YOURS for?"

It doesn't take long at all before things start to become a little whaley:

To the finny people's king. Come down, all of ye, but the regular look outs! And though this also holds true of merchant vessels, yet not by any means to the same extent as with whalemen. In his fiery eyes of scorn and triumph, you then saw Ahab in all his fatal pride. I take it that this part of the bill must have run something like this: "GRAND CONTESTED ELECTION FOR THE PRESIDENCY OF THE UNITED STATES. "Twill soon be out." But will any whaleman believe these stories? Yet is it no easy task. murmured Ahab, as Starbuck disappeared. The sound came nearer and nearer; the thick mists were dimly parted by a huge, vague form. As it seemed to me at the time, such a grand embodiment of adoration of the gods was never beheld, even in Persia, the home of the fire worshippers. my dear fellow beings, why should we longer cherish any social acerbities, or know the slightest ill-humor or envy! Pull, then, do pull; never mind the brimstone--devils are good fellows enough. ye but strike a thing without a lid; and no coffin and no hearse can be mine:--and where go ye now? The Hindoo whale referred to, occurs in a separate department of the wall, depicting the incarnation of Vishnu in the form of leviathan, learnedly known as the Matse Avatar. Here then, from three impartial witnesses, I had a deliberate statement of the entire case. "Won't the Duke be content with a quarter or a half?"

But the whaleyness remains substantially limited (albeit gradually increasing) for most of the text. Just the final 20% of the novel contains fully half of the novel's 1672 total instances of the substring "whale":

"The mighty whales which swim in a sea of water, and have a sea of oil swimming in them." (HUMP-BACK).--This whale is often seen on the northern American coast. "did you ever strike a fish?" It was the whiteness of the whale that above all things appalled me. But that pipe, poor whale, was thy last. Now, art thou the man to pitch a harpoon down a live whale's throat, and then jump after it? It seemed not a whale; and yet is this Moby Dick? thought Daggoo. This whale averages some sixteen or eighteen feet in length. Now, in the whale ship, it is not every one that goes in the boats. "All ye mast-headers have before now heard me give orders about a white whale." It is used for catching turns with the whale line.

Here have I heard all their chat just now, and the white whale--shirr! How is it, then, with the whale?
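The "final 20% holds half the whales" figure can be checked with a few lines of Python. This is only a sketch under assumptions: it measures the final 20% by sentence count rather than by characters, it matches the substring case-sensitively, and the helper name `whale_share` is made up for illustration.

```python
def whale_share(sentences, tail_fraction=0.2, needle="whale"):
    """Return the fraction of all `needle` substring hits that fall in the
    last `tail_fraction` of the sentence list (measured by sentence count)."""
    counts = [s.count(needle) for s in sentences]
    total = sum(counts)
    if total == 0:
        return 0.0
    # Number of sentences that make up the tail, e.g. the final 20%.
    tail_len = round(len(counts) * tail_fraction)
    tail_start = len(counts) - tail_len
    return sum(counts[tail_start:]) / total


# Toy ranked "novel": five sentences, so the final 20% is the last one.
toy = ["Oh!", "By heavens!", "Haul in the chains!",
       "But will any whaleman believe these stories?",
       "The whale, the whale, the white whale!"]
print(whale_share(toy))
```

Passing the ranked sentence texts in place of the toy list would give the share for the real output.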

I won't spoil the ending, but as the reader approaches their goal things start to become very whaley indeed, and I was delighted by the shift in the texture of the prose during the final section of the book.

Implementation details

Whalequest's implementation is kinda ridiculously simple. At a high level, we split the text up into chapters on the runs of blank lines that separate them, strip each chapter heading with a simple regex, split each chapter into sentences according to the default behavior of the Python spaCy library, and then use spaCy's built-in semantic similarity features for the actual sorting of sentences by their similarity to the goal. A bit of extra work is done to fix the most glaring errors with mismatched quote characters, and to split the text up randomly into paragraphs for readability. Then we write out the processed text to a new file and we're done!

The whole thing is about 35 lines of Python:

import random
import re
import spacy

def fix_whitespace(text):
  return re.sub(r"\s+", " ", text)

def fix_quotes(text):
  if text[0] == '"' and text[-1] != '"':
    text = text + '"'
  if text[-1] == '"' and text[0] != '"':
    text = '"' + text
  return text

nlp = spacy.load("en_core_web_lg")
goal = nlp("whale")
all_sents = []

with open("./mobydick.txt", "r") as infile:
  fulltext = infile.read()
  chaps = fulltext.split("\n\n\n\n")
  for chap in chaps:
    chaptext = fix_whitespace(re.sub(r"^CHAPTER[^\.]+\.", "", chap, count=1).strip())
    doc = nlp(chaptext)
    for sent in doc.sents:
      all_sents.append(sent)

ranked = sorted(all_sents, key=lambda sent: sent.similarity(goal))

with open("./whalequest.txt", "w") as outfile:
  for sent in ranked:
    outfile.write(fix_quotes(sent.text) + " ")
    if random.random() < 0.2:
      outfile.write("\n\n")

For our input file, we use the standard txt file of Moby-Dick furnished by Project Gutenberg, with some of the header material removed to ensure that only the most evocative of textual flotsam is scattered throughout the book.

If I were going to spend any more time on this, I'd consider switching over to a better sentence embeddings model (e.g., one of the models provided by the SentenceTransformers library), since spaCy's built-in similarity features actually use an older and generally worse-performing method based on averaged word vectors. I'd also try to improve the handling of quotation marks and whitespace a bit. But I'm not going to spend any more time on this, so that'll have to wait.
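For anyone who wants to try the SentenceTransformers swap, here is a minimal sketch of how the ranking step might look. It is not part of Whalequest: the helper names (`cosine`, `rank_by_similarity`, `sentence_transformer_embed`) are made up for illustration, and the model name `all-MiniLM-L6-v2` is just one commonly used choice, assumed here rather than tested on the novel. The ranking function takes any `embed` callable, so it can be exercised with toy vectors without downloading a model.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def rank_by_similarity(sentences, goal, embed):
    """Sort sentences from least to most similar to `goal`.

    `embed` is any callable mapping a list of strings to a list of vectors.
    """
    vectors = embed(list(sentences) + [goal])
    goal_vec = vectors[-1]
    scored = [(cosine(vec, goal_vec), s) for s, vec in zip(sentences, vectors)]
    return [s for _, s in sorted(scored, key=lambda pair: pair[0])]


def sentence_transformer_embed(model_name="all-MiniLM-L6-v2"):
    """Build an `embed` callable backed by SentenceTransformers (assumed installed)."""
    from sentence_transformers import SentenceTransformer  # imported lazily
    model = SentenceTransformer(model_name)
    return lambda texts: model.encode(texts).tolist()


# Toy embedding for demonstration only: hand-picked 2-D vectors.
toy_vectors = {
    "Oh!": [0.0, 1.0],
    "Haul in the chains!": [1.0, 1.0],
    "How is it, then, with the whale?": [1.0, 0.1],
    "whale": [1.0, 0.0],
}
toy_ranked = rank_by_similarity(
    ["How is it, then, with the whale?", "Oh!", "Haul in the chains!"],
    "whale",
    lambda texts: [toy_vectors[t] for t in texts],
)
print(toy_ranked)
```

With the library installed, passing `sentence_transformer_embed()` as `embed` and the spaCy sentence texts as `sentences` would replace the `sent.similarity(goal)` sort key.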

@lizadaly

This is great!

@arnicas

arnicas commented Nov 30, 2023

Hah - I had a whole couple entries planned for semantic similarity (I've been using it on fairy tales) but didn't get to them, because my first idea got too involved. Nice!

@notquitethere

I loved the ending.

@bibliotechy

🐳 🐳 🐳 Whale whale whale, you've really outdone yourself this time. 🐳 🐳 🐳
