Fix word wise for stressed Russian epubs #192

Open · wants to merge 1 commit into base: master

Conversation

@Vuizur (Contributor) commented Feb 23, 2024

I fixed the code for stressed epubs by using (only in this special case) two spaCy docs: one containing the original text and one for lemmatization/POS detection. I tested it on one Russian and one non-Russian book so far, and it seemed to work.
Should I add the same for Kindle? (I just can't test it.)
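For reference, the two-doc approach could look roughly like this; a minimal sketch, not the actual diff in this PR (the function name and the accent-stripping step are illustrative):

```python
import spacy

ACUTE = "\u0301"  # combining acute accent used to mark Russian stress

nlp = spacy.load("ru_core_news_sm")

def analyze_stressed(text: str):
    """Sketch: lemmas/POS come from the unstressed text, offsets from the original."""
    doc_text = nlp.make_doc(text)                 # tokens keep the stressed text's offsets
    doc_analysis = nlp(text.replace(ACUTE, ""))   # pipeline runs on the unstressed text
    # Token order and count are identical in both docs, because stripping a
    # combining mark never splits or merges a token.
    for orig_token, ana_token in zip(doc_text, doc_analysis):
        yield orig_token.idx, orig_token.text, ana_token.lemma_, ana_token.pos_
```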

@xxyzz (Owner) commented Feb 24, 2024

Only one character needs to be removed? It could be added here:

WordDumb/epub.py, lines 137 to 144 in 460db47:

```python
with xhtml_path.open("r", encoding="utf-8") as f:
    # remove soft hyphen, byte order mark, word joiner
    xhtml_text = re.sub(
        r"\xad|&shy;|&#xad;|&#173;|\ufeff|\u2060|&#8288;",
        "",
        f.read(),
        flags=re.I,
    )
```

@Vuizur (Contributor, Author) commented Feb 24, 2024

Removing that one character works fine for books created by my program; for general use one should maybe use the more sophisticated remove_accents function from Proficiency, which can also remove grave accents.
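Something along those lines could look like this; a rough sketch, not the actual Proficiency implementation:

```python
import unicodedata

def remove_accents(text: str) -> str:
    # Decompose, drop combining acute (U+0301) and grave (U+0300) accents,
    # then recompose so letters like "й" survive intact.
    decomposed = unicodedata.normalize("NFD", text)
    stripped = "".join(ch for ch in decomposed if ch not in ("\u0301", "\u0300"))
    return unicodedata.normalize("NFC", stripped)
```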

I don't 100 percent understand the code, but removing it at the place you suggested will also remove the character from the output epub, right? I would want to keep it, so that in the end it looks like the screenshot below. Currently spaCy can't perform lemmatization and POS analysis on stressed text:
[screenshot]

@xxyzz (Owner) commented Feb 24, 2024

Yes, that'll change the book text; I forgot you want to keep the stress marker...

This issue should be fixed in spaCy's Russian lemmatizer; changing the text with str.replace breaks the word locations and would make the footnotes get added in the wrong places.
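To illustrate the offset problem, a toy example (not code from the plugin):

```python
text = "приве́т мир"                   # "приве́т" carries a combining acute accent
stripped = text.replace("\u0301", "")
print(text.index("мир"))      # 8 in the stressed text
print(stripped.index("мир"))  # 7 after stripping, so stored offsets no longer match
```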

@Vuizur (Contributor, Author) commented Feb 24, 2024

There is an issue on the spaCy repo related to fixing the lemmatizer: explosion/spaCy#12530. It seems terribly complicated.
I think my workaround works fine: it aligns tokens by their index positions, which don't change between stressed and unstressed text, so the alignment is kept.

@xxyzz (Owner) commented Feb 24, 2024

I think it's better to wait for spaCy's PR, sorry. This patch runs the model pipeline again on words that have stress marks...

@Vuizur (Contributor, Author) commented Feb 24, 2024

Only on Russian words with stress marks, so it doesn't affect non-Russian books or normal Russian books. It probably won't even trigger on normal Russian books that contain French quotations like "À mauvais ouvrier point de bon outil.", because it only detects combining accent marks (the accent in a precomposed letter like "À" is not a combining mark). So the performance impact should be negligible in all cases except stressed Russian books, where the program is currently broken.
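A combining-mark check of that kind could look like this; a sketch, not necessarily the exact test in the patch:

```python
import unicodedata

def has_combining_marks(text: str) -> bool:
    # True only for combining marks such as U+0301 (combining acute accent);
    # precomposed letters like "À" (U+00C0) are not combining marks.
    return any(unicodedata.combining(ch) for ch in text)

print(has_combining_marks("приве́т"))                                 # True
print(has_combining_marks("À mauvais ouvrier point de bon outil."))  # False
```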

The problem with the spaCy PR is that the original one has been sitting for almost a year and only fixes lemmatization, not POS detection. To fix POS detection we would apparently have to host our own unstressed_core_news_* models and implement a custom language, which would probably result in more convoluted code changes than this PR.

@xxyzz (Owner) commented Feb 24, 2024

Doesn't the English Wiktionary have forms that have stress marks? Have you tried disabling the "Use POS type" feature? Stressed forms should be matched if they are in the Word Wise db.

@Vuizur (Contributor, Author) commented Feb 24, 2024

> Doesn't the English Wiktionary have forms that have stress marks? Have you tried disabling the "Use POS type" feature? Stressed forms should be matched if they are in the Word Wise db.

True, this works.

@xxyzz (Owner) commented Feb 24, 2024

Another way to fix this is to add the Word Wise notes first and then add the stress marks...

@Vuizur (Contributor, Author) commented Feb 24, 2024

> Another way to fix this is to add the Word Wise notes first and then add the stress marks...

That's possible, although implementing the processing of Word Wise files would be hard. 🗿
