Exclude ruby>rt from translated text #120

cspotcode · 2024-04-19T20:45:14Z

I am requesting an option to exclude nested <rt> tags from the selected text when translating and generating text-to-speech.

<rt> nested inside <ruby> is an annotation that gives the phonetic pronunciation of a word; it is not a separate word in the sentence.

https://www.w3schools.com/tags/tag_rt.asp

Caveat: I'm a beginner Japanese learner, so I'm not an expert on kanji or on <ruby> and <rt> tags.

I'm using https://www3.nhk.or.jp/news/easy/ to learn Japanese. This site uses <ruby> and <rt> tags to annotate kanji with their hiragana readings. It appears like this on the site:

The kanji word "地震" (earthquake) is pronounced like "じしん". If you don't understand the kanji, you can sound it out using the hiragana.

When this text is copy-pasted or extracted with window.getSelection().toString(), you get the following string. Note that the annotation above "地震" appears within the copied text.

この地震じしんでけ

When text-to-speech reads this aloud, the word is spoken twice, which is wrong: it says "earthquake earthquake". The word should be spoken only once, because the annotation is not separate text. Translations may also be wrong because the word appears twice in the input.

The HTML looks roughly like this, where the annotated text is wrapped in <ruby> and the annotation is nested as <rt>:

この<ruby>地震<rt>じしん</rt></ruby>でけ

Here is an alternative method of extracting the text that removes <rt> elements first, stripping the annotations.

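function getSelectedTextWithoutRt() {
  // Clone the selected DOM so the page itself is not modified.
  const selectionFragment = window.getSelection().getRangeAt(0).cloneContents();
  // Remove every <rt> annotation nested directly inside a <ruby>.
  for (const rtNode of selectionFragment.querySelectorAll('ruby>rt')) {
    rtNode.parentNode.removeChild(rtNode);
  }
  return selectionFragment.textContent;
}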

Thank you for the excellent browser extension!


cspotcode · 2024-04-25T16:26:56Z

I hit a similar issue with Satori Reader. For each kanji, it uses markup to include multiple possible representations: kana and kanji. CSS rules selectively show only one of them. But when the text is extracted, every representation is included, so the word is repeated, which is wrong.

Maybe there's a different copy-paste mechanism that only includes visible things, similar to what a user gets when they highlight text and hit ctrl+c.

Or maybe it's easiest to reuse the stripping mechanism I implemented for ruby>rt: accept additional CSS selectors for elements to be deleted before the text is extracted. This could be a configurable list.
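To make the configurable-list idea concrete, here is a minimal sketch of how it might look. The names `DEFAULT_EXCLUDED_SELECTORS`, `extractTextExcluding`, and `getSelectedTextExcluding` are hypothetical and are not part of the extension; only the `ruby > rt` selector comes from this issue, and any Satori Reader selectors would have to be supplied by the user.

```javascript
// Hypothetical default list: only 'ruby > rt' comes from this issue.
// Users could append site-specific selectors through a settings page.
const DEFAULT_EXCLUDED_SELECTORS = ['ruby > rt'];

// Removes every element matching any excluded selector from the given
// fragment, then returns the remaining text. The fragment is mutated,
// so pass a clone rather than live page content.
function extractTextExcluding(fragment, excludedSelectors = DEFAULT_EXCLUDED_SELECTORS) {
  for (const selector of excludedSelectors) {
    for (const node of fragment.querySelectorAll(selector)) {
      node.remove();
    }
  }
  return fragment.textContent;
}

// Browser-only wrapper (assumption: called from a content script).
// Clones the current selection so the page itself is left untouched.
function getSelectedTextExcluding(excludedSelectors = DEFAULT_EXCLUDED_SELECTORS) {
  const selection = window.getSelection();
  if (!selection || selection.rangeCount === 0) {
    return '';
  }
  const fragment = selection.getRangeAt(0).cloneContents();
  return extractTextExcluding(fragment, excludedSelectors);
}
```

One caveat with this approach: a cloned selection does not include the range's common ancestor, so if the selection sits entirely inside a single <ruby> element, the <rt> probably ends up as a top-level child of the fragment and `ruby > rt` will not match it. A plain `rt` selector may be the safer default.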

Here is an example of the markup from Satori Reader.
