Don't require a period at the end of a line for Scansion().scan_text() #1247

sjhuskey · 2024-01-25T20:10:18Z

Is your feature request related to a problem? Please describe.
The Scansion().scan_text() method won't produce a result unless the text string to be scanned ends in a period ('.').

Describe the solution you'd like
It would be helpful to allow the user to specify the delimiter. In my case, I'd like to use the newline character (e.g., \n) so that I can scan multiple lines of poetry.

Describe alternatives you've considered
I have resorted to inserting a period at the end of every line, but that's tedious. It's also problematic if you're trying to study the relationship between line breaks and sentence termination.
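For anyone who needs the same workaround in the meantime, here is a minimal sketch that automates it without touching the original text. It scans each line separately and appends a period only to the copy handed to the scanner. The helper name `scan_by_line` and its `stop` parameter are hypothetical (not part of CLTK), and the sketch assumes the scanner is a callable that takes a string and returns a list, as `Scansion().scan_text()` does in this thread.

```python
from typing import Callable, List


def scan_by_line(
    text: str,
    scan: Callable[[str], List[str]],
    stop: str = ".",
) -> List[List[str]]:
    """Scan each non-empty line of `text` on its own.

    `scan` is any callable that accepts a string and returns a list
    (hypothetical usage: `Scansion().scan_text`). `stop` is appended to a
    line only when the line does not already end with it.
    """
    results = []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue  # skip blank lines
        if not line.endswith(stop):
            line += stop  # the terminator is added to a copy, not the source text
        results.append(scan(line))
    return results


# Hypothetical usage, assuming CLTK is installed:
# from cltk.prosody.grc import Scansion
# scan_by_line(poem, Scansion().scan_text)
```

Because the period is added only to the string passed to the scanner, the source text keeps its original punctuation. The result is one list per line, so line breaks stay visible in the output.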

Additional context
None.

kylepjohnson · 2024-04-04T03:01:40Z

Hi @sjhuskey, I hope my radio silence doesn't come off as me not appreciating your bug reports. I truly do, but as I explained in another issue, I simply lack the time to work on these properly. Would you consider submitting a pull request? You could give it a shot and I would make light edits if necessary.

sjhuskey · 2024-04-10T00:49:06Z

No worries, @kylepjohnson. I understand about having to manage several projects, and I'm grateful for the resource that you and the CLTK team have created. Yes, I can probably come up with a patch for this, but probably not for a few weeks.

SDCLA · 2024-04-22T11:36:07Z

The (quick and dirty) solution for this would be to just put "\n" in the list of stops recognised by the _tokenizer function. For me this gives the following result.
input:
Scansion().scan_text("""Εἴθ᾽ ὤφελ᾽ Ἀργοῦς μὴ διαπτάσθαι σκάφος
Κόλχων ἐς αἶαν κυανέας Συμπληγάδας,
μηδ᾽ ἐν νάπαισι Πηλίου πεσεῖν ποτε
τμηθεῖσα πεύκη, μηδ᾽ ἐρετμῶσαι χέρας """)
output:
['¯¯˘¯¯¯˘¯¯¯˘˘¯x', '˘¯¯˘˘˘¯¯¯˘˘x', '¯˘¯˘¯˘¯˘¯˘˘¯¯x']

With a \n manually added at the end of the last line, the fourth line is scanned as well:
τμηθεῖσα πεύκη, μηδ᾽ ἐρετμῶσαι χέρας\n""")
['¯¯˘¯¯¯˘¯¯¯˘˘¯x', '˘¯¯˘˘˘¯¯¯˘˘x', '¯˘¯˘¯˘¯˘¯˘˘¯¯x', '¯¯¯˘¯¯¯˘x']

But this would make it a bit of a pain to run the module on other blocks of text with odd formatting, e.g. those copied from the TLG or Diogenes with Teubner line divisions. Maybe a class argument to set the scansion module to poetry mode?
Alternatively, the tokenizer could be set to recognise the last word in the input string and append it to the token list, and the lines of a poem could be scanned as individual items.
The better long-term solution might be to just use the cltk_doc.tokens tokenizer, so as to go cltk_doc.tokens > Scansion._clean_accents() > Scansion._make_syllables() > Scansion._syllable_condenser() > Scansion._scansion().
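To make the delimiter idea concrete, here is a small self-contained sketch of splitting on a configurable set of stops. It is not the CLTK tokenizer. The function name `split_on_stops` and the default stop characters are assumptions for illustration only. Unlike the behaviour shown above, it keeps the final chunk even when the input does not end in a stop.

```python
from typing import Iterable, List

# Assumed defaults for illustration; not taken from the CLTK source.
DEFAULT_STOPS = (".", ";", "·")


def split_on_stops(text: str, stops: Iterable[str] = DEFAULT_STOPS) -> List[str]:
    """Split `text` into chunks at any single-character stop in `stops`.

    Empty chunks are dropped, and trailing text without a final stop
    is kept as the last chunk.
    """
    stop_set = set(stops)
    chunks: List[str] = []
    current: List[str] = []
    for char in text:
        if char in stop_set:
            chunk = "".join(current).strip()
            if chunk:
                chunks.append(chunk)
            current = []
        else:
            current.append(char)
    tail = "".join(current).strip()
    if tail:
        chunks.append(tail)  # keep the unterminated last line
    return chunks


# "Poetry mode" would amount to passing "\n" as an extra stop:
# split_on_stops(poem, stops=DEFAULT_STOPS + ("\n",))
```

With "\n" among the stops, each verse becomes its own chunk. Without it, a newline is just ordinary whitespace inside a sentence, which is closer to what prose copied with odd line divisions would need.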

sjhuskey · 2024-04-22T14:03:49Z

Yes, I had been thinking about that quick and dirty solution, and I had the same qualms about it. I think your long-term solution is best. I'll give that a try.

sjhuskey · 2024-06-02T22:58:34Z

I finally had a chance to try the long-term solution suggested by @SDCLA.

I'm using Aeschylus Ag. 55–59 as text_string:

ὕπατος δʼ ἀίων ἤ τις Ἀπόλλων
ἢ Πὰν ἢ Ζεὺς οἰωνόθροον
γόον ὀξυβόαν τῶνδε μετοίκων
ὑστερόποινον
πέμπει παραβᾶσιν Ἐρινύν.

I did this:

# Import the modules
from cltk import NLP
from cltk.prosody.grc import Scansion

cltk_nlp = NLP(language="grc")
cltk_doc = cltk_nlp.analyze(text_string)

# Implement the suggested pipeline
tokens = cltk_doc.tokens
clean_accents = Scansion()._clean_accents(tokens)
syllables = Scansion()._make_syllables(clean_accents)
condensed = Scansion()._syllable_condenser(syllables)
scanned = Scansion()._scansion(condensed)

The output is ['˘˘¯˘¯¯˘˘¯¯¯˘¯¯¯¯¯˘¯˘˘¯˘˘¯¯˘˘¯¯¯˘˘¯¯¯¯˘˘¯˘˘˘x']. When I take the time to divide that up by line, it's still not accurate, but that's a different issue.

I inserted a period at the end of each line of text_string, and the result was this:

ὕπατος δʼ ἀίων ἤ τις Ἀπόλλων
˘˘˘˘˘˘˘˘˘˘¯˘¯˘˘˘˘˘˘˘˘¯˘
ἢ Πὰν ἢ Ζεὺς οἰωνόθροον
¯˘˘˘¯˘˘˘˘˘˘¯˘˘˘˘˘˘˘
γόον ὀξυβόαν τῶνδε μετοίκων
˘˘˘˘˘˘˘˘˘˘˘˘¯˘˘˘˘˘˘˘˘˘¯˘
ὑστερόποινον
˘˘˘˘˘˘˘˘˘˘˘˘
πέμπει παραβᾶσιν Ἐρινύν.
˘˘˘˘˘˘˘˘˘˘˘¯˘˘˘˘˘˘˘˘˘˘

At least it treated each line individually, but the scansion is still incorrect.

I'm going to think about this some more.

kylepjohnson · 2024-06-02T23:37:47Z

Thanks for working on this, @sjhuskey. It's been 15 years since I last scanned anything, otherwise I would help. I trust you and @SDCLA and will accept your PR when you have it ready.

sjhuskey added the feature-request label Jan 25, 2024
