Exclude ruby>rt from translated text #120

Open
cspotcode opened this issue Apr 19, 2024 · 1 comment


I am requesting an option to exclude nested <rt> tags from the selected text when translating and generating text-to-speech.

An <rt> nested inside <ruby> is an annotation that gives the phonetic pronunciation of a word; it is not a separate word in the sentence.

https://www.w3schools.com/tags/tag_rt.asp


Caveat: I'm a beginner Japanese learner, so I'm not an expert on kanji or on <ruby>/<rt> tags.

I'm using https://www3.nhk.or.jp/news/easy/ to learn Japanese. This site uses <ruby> and <rt> tags to annotate kanji with their hiragana equivalent. It appears like this on the site:

image

The kanji word "地震" (earthquake) is pronounced like "じしん". If you don't understand the kanji, you can sound it out using the hiragana.

When this text is copy-pasted or extracted with window.getSelection().toString(), you get the following string. Note that the annotation above "地震" appears within the copied text.

この地震じしんでけ

When text-to-speech reads this aloud, the word is spoken twice, which is wrong: it says "earthquake earthquake". The word should be spoken only once, because the annotation is not a separate word. Translations may also be wrong because the word appears twice.

The HTML looks roughly like this, where the annotated text is wrapped in <ruby> and the annotation is nested as <rt>:

この<ruby>地震<rt>じしん</rt></ruby>でけ

Here is an alternative method of extracting the text that removes <rt> elements first, stripping the annotations.

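// Clone the selected content so the live page is left untouched.
const selectionFragment = window.getSelection().getRangeAt(0).cloneContents();
// Remove each <rt> annotation that is a direct child of a <ruby> element.
for (const rtNode of selectionFragment.querySelectorAll('ruby>rt')) {
  rtNode.parentNode.removeChild(rtNode);
}
return selectionFragment.textContent;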

Thank you for the excellent browser extension!


cspotcode commented Apr 25, 2024

I hit a similar issue with Satori Reader. For each kanji, its markup includes multiple possible representations (kana and kanji), and CSS rules selectively show only one of them. But when the text is extracted, all of the representations are included, so the word is repeated, which is wrong.

Maybe there's a different copy-paste mechanism that only includes visible things, similar to what a user gets when they highlight text and hit ctrl+c.
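
One way the "only visible things" idea might look is a tree walk that skips hidden elements. This is a rough sketch, not a tested fix for Satori Reader: the function names collectVisibleText and isHiddenByCss are made up for illustration, and it assumes the extra representations are hidden with display: none, which I have not confirmed. Note that getComputedStyle only gives useful answers for elements attached to the page, so this would have to walk the live DOM rather than a fragment from cloneContents().

```javascript
// Node type values, equal to Node.ELEMENT_NODE and Node.TEXT_NODE in the DOM.
const ELEMENT_NODE = 1;
const TEXT_NODE = 3;

// Hypothetical helper: concatenates the text under `node`, skipping the whole
// subtree of any element for which `isHidden(element)` returns true.
function collectVisibleText(node, isHidden) {
  if (node.nodeType === TEXT_NODE) {
    return node.nodeValue;
  }
  if (node.nodeType === ELEMENT_NODE && isHidden(node)) {
    return '';
  }
  let text = '';
  for (const child of node.childNodes) {
    text += collectVisibleText(child, isHidden);
  }
  return text;
}

// Browser-only predicate (assumption: hidden variants use display: none).
function isHiddenByCss(element) {
  return window.getComputedStyle(element).display === 'none';
}
```

In a page this could be called as collectVisibleText(someElement, isHiddenByCss). It returns all visible text under that element, so limiting it to the selected range would need extra work.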

Or maybe it's easiest to reuse the stripping approach I showed for ruby>rt: add CSS selectors for elements that should be deleted before the text is extracted. This could be a configurable list.
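
A minimal sketch of the configurable-list idea, generalizing the ruby>rt snippet from my first comment. The names DEFAULT_EXCLUDED_SELECTORS, extractTextExcluding and getSelectedTextExcluding are hypothetical and not part of the extension. Only 'ruby > rt' comes from this issue; any other selectors would be supplied by the user.

```javascript
// Hypothetical default list. Users could append site-specific selectors.
const DEFAULT_EXCLUDED_SELECTORS = ['ruby > rt'];

// Removes every element under `root` that matches one of the selectors, then
// returns the remaining text. `root` should be a detached copy (for example
// the DocumentFragment from Range.cloneContents()) because it is modified.
function extractTextExcluding(root, excludedSelectors = DEFAULT_EXCLUDED_SELECTORS) {
  for (const selector of excludedSelectors) {
    for (const node of root.querySelectorAll(selector)) {
      node.remove();
    }
  }
  return root.textContent;
}

// Browser-only wrapper: returns '' when nothing is selected.
function getSelectedTextExcluding(excludedSelectors = DEFAULT_EXCLUDED_SELECTORS) {
  const selection = window.getSelection();
  if (!selection || selection.rangeCount === 0) {
    return '';
  }
  const fragment = selection.getRangeAt(0).cloneContents();
  return extractTextExcluding(fragment, excludedSelectors);
}
```

One caveat: if the selection lies entirely inside a single <ruby> element, the cloned fragment may not contain the <ruby> wrapper itself, so a child selector like 'ruby > rt' would not match there.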

Here is an example of the markup from Satori Reader.
image
image
