Don't require a period at the end of a line for Scansion().scan_text() #1247

Open
sjhuskey opened this issue Jan 25, 2024 · 6 comments
Comments

@sjhuskey

Is your feature request related to a problem? Please describe.
The Scansion().scan_text() method won't produce a result unless the text string to be scanned ends in a period ('.').

Describe the solution you'd like
It would be helpful to allow the user to specify the delimiter. In my case, I'd like to use the newline character (e.g., \n) so that I can scan multiple lines of poetry.

Describe alternatives you've considered
I have resorted to inserting a period at the end of every line, but that's tedious. It's also problematic if you're trying to study the relationship between line breaks and sentence termination.

Additional context
None.
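
A minimal sketch of the workaround described above, automated: it splits the text on a user-chosen delimiter and appends a period to each line that lacks one, while recording which lines already ended in a period so that the relationship between line breaks and sentence ends is not lost. The name `prepare_lines` and its parameters are hypothetical and not part of CLTK.

```python
def prepare_lines(text, delimiter="\n", stop="."):
    """Split `text` on `delimiter` and make sure every line ends in `stop`.

    Returns (padded_text, had_stop), where had_stop[i] is True if line i
    already ended in `stop` before padding.
    """
    # Drop surrounding whitespace and skip empty lines.
    lines = [part.strip() for part in text.split(delimiter)]
    lines = [line for line in lines if line]
    # Remember which lines were already sentence-final.
    had_stop = [line.endswith(stop) for line in lines]
    padded = [
        line if flag else line + stop
        for line, flag in zip(lines, had_stop)
    ]
    return " ".join(padded), had_stop
```

The padded string could then be passed to Scansion().scan_text(), and `had_stop` consulted afterwards to see which periods were original.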

@kylepjohnson
Member

Hi @sjhuskey, I hope my radio silence doesn't come off as me not appreciating your bug reports. I truly do, but as I explained in another issue, I simply lack the time to work on these properly. Would you consider submitting a pull request? You could give it a shot and I would make light edits if necessary.

@sjhuskey
Author

No worries, @kylepjohnson. I understand about having to manage several projects, and I'm grateful for the resource that you and the CLTK team have created. Yes, I can probably come up with a patch for this, but probably not for a few weeks.

@SDCLA
Contributor

SDCLA commented Apr 22, 2024

The (quick and dirty) solution for this would be to just put "\n" in the list of stops recognised by the _tokenizer function. With that change, I get the following.
input:
Scansion().scan_text("""Εἴθ᾽ ὤφελ᾽ Ἀργοῦς μὴ διαπτάσθαι σκάφος
Κόλχων ἐς αἶαν κυανέας Συμπληγάδας,
μηδ᾽ ἐν νάπαισι Πηλίου πεσεῖν ποτε
τμηθεῖσα πεύκη, μηδ᾽ ἐρετμῶσαι χέρας """)
output:
['¯¯˘¯¯¯˘¯¯¯˘˘¯x', '˘¯¯˘˘˘¯¯¯˘˘x', '¯˘¯˘¯˘¯˘¯˘˘¯¯x']

and then with a \n manually added at the end it would give:
τμηθεῖσα πεύκη, μηδ᾽ ἐρετμῶσαι χέρας\n""")
['¯¯˘¯¯¯˘¯¯¯˘˘¯x', '˘¯¯˘˘˘¯¯¯˘˘x', '¯˘¯˘¯˘¯˘¯˘˘¯¯x', '¯¯¯˘¯¯¯˘x']

But this would make it a bit of a pain to run the module on other blocks of text with odd formatting, e.g. those copied from the TLG or Diogenes with Teubner line divisions. Maybe a class argument to set the scansion module to poetry mode?
Alternatively, the tokenizer could be set to recognise the last word in the input string and append the token list, so that the lines of a poem could be scanned as individual items.
The better long-term solution might be to just use the cltk_doc.tokens tokenizer so as to go cltk_doc.tokens > Scansion._clean_accents() > Scansion._make_syllables() > Scansion._syllable_condenser() > Scansion._scansion()
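
To illustrate why the last line goes missing unless a stop follows it, here is a toy model of a stop-based splitter. It is only a sketch under the assumption that the tokenizer emits a chunk when it meets a stop character; it is not the actual _tokenizer code. The `flush_tail` flag is a hypothetical way of showing the "recognise the last word" alternative.

```python
def split_on_stops(text, stops=(".",), flush_tail=False):
    """Toy splitter: emit a chunk each time a stop character is seen.

    With flush_tail=False, text after the final stop is silently dropped,
    which mirrors the behaviour reported in this thread.
    """
    chunks, current = [], []
    for char in text:
        if char in stops:
            chunk = "".join(current).strip()
            if chunk:
                chunks.append(chunk)
            current = []
        else:
            current.append(char)
    if flush_tail:
        # Keep whatever is left over after the last stop.
        tail = "".join(current).strip()
        if tail:
            chunks.append(tail)
    return chunks


poem = "line one\nline two\nline three"
print(split_on_stops(poem, stops=(".", "\n")))                   # last line dropped
print(split_on_stops(poem + "\n", stops=(".", "\n")))            # trailing newline added
print(split_on_stops(poem, stops=(".", "\n"), flush_tail=True))  # tail flushed
```

In this model, adding "\n" to the stops gives one chunk per line, but only flushing the tail (or a trailing newline) recovers the final line.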

@sjhuskey
Author

Yes, I had been thinking about that quick and dirty solution, and I had the same qualms about it. I think your long-term solution is best. I'll give that a try.

@sjhuskey
Author

sjhuskey commented Jun 2, 2024

I finally had a chance to try the long-term solution suggested by @SDCLA.

I'm using Aeschylus Ag. 55–59 as the test string (text_string):

ὕπατος δʼ ἀίων ἤ τις Ἀπόλλων
ἢ Πὰν ἢ Ζεὺς οἰωνόθροον
γόον ὀξυβόαν τῶνδε μετοίκων
ὑστερόποινον
πέμπει παραβᾶσιν Ἐρινύν.

I did this:

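# Import the modules
from cltk import NLP
from cltk.prosody.grc import Scansion

# The test passage (Aeschylus Ag. 55–59)
text_string = """ὕπατος δʼ ἀίων ἤ τις Ἀπόλλων
ἢ Πὰν ἢ Ζεὺς οἰωνόθροον
γόον ὀξυβόαν τῶνδε μετοίκων
ὑστερόποινον
πέμπει παραβᾶσιν Ἐρινύν."""

cltk_nlp = NLP(language="grc")
cltk_doc = cltk_nlp.analyze(text_string)

# Implement the suggested pipeline, reusing a single Scansion instance
scanner = Scansion()
tokens = cltk_doc.tokens
clean_accents = scanner._clean_accents(tokens)
syllables = scanner._make_syllables(clean_accents)
condensed = scanner._syllable_condenser(syllables)
scanned = scanner._scansion(condensed)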

The output is ['˘˘¯˘¯¯˘˘¯¯¯˘¯¯¯¯¯˘¯˘˘¯˘˘¯¯˘˘¯¯¯˘˘¯¯¯¯˘˘¯˘˘˘x']. When I take the time to divide that up by line, it's still not accurate, but that's a different issue.

I inserted a period at the end of each line of text_string, and the result was this:

ὕπατος δʼ ἀίων ἤ τις Ἀπόλλων
˘˘˘˘˘˘˘˘˘˘¯˘¯˘˘˘˘˘˘˘˘¯˘
ἢ Πὰν ἢ Ζεὺς οἰωνόθροον
¯˘˘˘¯˘˘˘˘˘˘¯˘˘˘˘˘˘˘
γόον ὀξυβόαν τῶνδε μετοίκων
˘˘˘˘˘˘˘˘˘˘˘˘¯˘˘˘˘˘˘˘˘˘¯˘
ὑστερόποινον
˘˘˘˘˘˘˘˘˘˘˘˘
πέμπει παραβᾶσιν Ἐρινύν.
˘˘˘˘˘˘˘˘˘˘˘¯˘˘˘˘˘˘˘˘˘˘

At least it treated each line individually, but the scansion is still incorrect.

I'm going to think about this some more.
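
For reference, the per-line experiment above can be sketched as a small helper. This is an illustration only: `scan_by_line` is a hypothetical name, and the scanner is passed in as a callable (for example Scansion().scan_text) that is assumed to take a string and return a list of pattern strings.

```python
def scan_by_line(text, scan_fn):
    """Scan each non-empty line separately; return (line, pattern) pairs.

    `scan_fn` is any callable that takes a string and returns a list of
    scansion strings.
    """
    results = []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        # Workaround from this thread: the scanner is reported to need
        # a period at the end of its input.
        to_scan = line if line.endswith(".") else line + "."
        patterns = scan_fn(to_scan)
        results.append((line, "".join(patterns)))
    return results


def format_pairs(pairs):
    """Interleave each line with its pattern, one per row."""
    return "\n".join(f"{line}\n{pattern}" for line, pattern in pairs)
```

Assuming CLTK is installed, `print(format_pairs(scan_by_line(text_string, Scansion().scan_text)))` would produce a line-by-line listing in the same layout. It does nothing to fix the accuracy of the scansion itself.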

@kylepjohnson
Member

kylepjohnson commented Jun 2, 2024

Thanks for working on this, @sjhuskey. It's been 15 years since I last scanned anything, otherwise I would help. I trust you and @SDCLA and will accept your PR when you have it ready.
