Caveat: I'm a beginner Japanese learner, so I'm not an expert on kanji or on <ruby> and <rt> tags.
I'm using https://www3.nhk.or.jp/news/easy/ to learn Japanese. This site uses <ruby> and <rt> tags to annotate kanji with their hiragana equivalent. It appears like this on the site:
The kanji word "地震" (earthquake) is pronounced like "じしん". If you don't understand the kanji, you can sound it out using the hiragana.
When this text is copy-pasted or extracted with window.getSelection().toString(), you get the following string. Note that the annotation above "地震" appears within the copied text.
この地震じしんでけ
When text-to-speech reads this aloud, the word is spoken twice, which is wrong: it says "earthquake earthquake". The word should be spoken only once, because the annotation is a pronunciation aid and not separate text. Translations may also be wrong because the word appears twice in the input.
The HTML looks roughly like this, where the annotated text is wrapped in <ruby> and the annotation is nested as <rt>:
この<ruby>地震<rt>じしん</rt></ruby>でけ
Here is an alternative method of extracting the text that removes <rt> elements first, stripping the annotations.
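A minimal sketch of that idea in JavaScript, not necessarily the code originally attached to this issue. The names textWithoutAnnotations and selectionTextWithoutAnnotations are hypothetical, and skipping <rp> along with <rt> is an assumption (it holds the fallback parentheses some sites add around annotations). It walks a copy of the selection and ignores annotation elements instead of calling window.getSelection().toString():

```javascript
// Tags whose contents are annotations rather than sentence text.
// Including RP (fallback parentheses) is an assumption, not part of the request.
const SKIPPED_TAGS = new Set(["RT", "RP"]);

// Collect the text of a DOM subtree, skipping annotation elements.
// Uses only nodeType, nodeName, nodeValue and childNodes.
function textWithoutAnnotations(node, skipped = SKIPPED_TAGS) {
  const ELEMENT_NODE = 1;
  const TEXT_NODE = 3;
  if (node.nodeType === TEXT_NODE) {
    return node.nodeValue;
  }
  if (node.nodeType === ELEMENT_NODE && skipped.has(node.nodeName.toUpperCase())) {
    return ""; // drop the annotation and everything inside it
  }
  let text = "";
  for (const child of node.childNodes) {
    text += textWithoutAnnotations(child, skipped);
  }
  return text;
}

// Hypothetical helper: text of the current selection without annotations.
function selectionTextWithoutAnnotations() {
  const selection = window.getSelection();
  if (!selection || selection.rangeCount === 0) {
    return "";
  }
  // cloneContents() returns a detached DocumentFragment, so the page is not modified.
  return textWithoutAnnotations(selection.getRangeAt(0).cloneContents());
}
```

Unlike selection.toString(), this walk does not take CSS visibility into account, so it only addresses the <rt> case.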
I hit a similar issue with Satori Reader. For each kanji, it uses markup to include multiple possible representations: kana and kanji. CSS rules selectively show only one of them. But when the text is extracted, every representation is included, so the word is repeated.
Maybe there's a different copy-paste mechanism that only includes visible things, similar to what a user gets when they highlight text and hit ctrl+c.
Or maybe it's easiest to use the same stripping mechanism as I implemented for ruby>rt:
Add additional CSS selectors for elements to be deleted before text is extracted. This could be a configurable list.
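As a rough sketch of what such a configurable list could look like (JavaScript; the function names and the default list are hypothetical, and no Satori Reader class names are assumed here), the selectors are joined into one query, the matching elements are removed from a copy of the selection, and the remaining text is returned:

```javascript
// Hypothetical default list; a user setting could add site-specific selectors.
const DEFAULT_EXCLUDED_SELECTORS = ["rt", "rp"];

// Remove every element matching one of the selectors from root (in place),
// then return the remaining text. Pass a clone, never the live page.
function textExcludingSelectors(root, selectors = DEFAULT_EXCLUDED_SELECTORS) {
  if (selectors.length > 0) {
    // querySelectorAll returns a static list, so removing while iterating is safe.
    for (const element of root.querySelectorAll(selectors.join(", "))) {
      element.remove();
    }
  }
  return root.textContent;
}

// Hypothetical helper: apply the list to a copy of the current selection.
function selectionTextExcluding(selectors) {
  const selection = window.getSelection();
  if (!selection || selection.rangeCount === 0) {
    return "";
  }
  const fragment = selection.getRangeAt(0).cloneContents();
  return textExcludingSelectors(fragment, selectors);
}
```

For a site that hides alternate readings with CSS, the call might look like selectionTextExcluding(["rt", "rp", ".some-hidden-reading"]), where the class name is a placeholder for whatever that site actually uses.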
Here is an example of the markup from Satori Reader.
I am requesting an option to exclude nested <rt> tags from the selected text when translating and generating text-to-speech. <rt> nested inside <ruby> is an annotation to explain the phonetic pronunciation of a word; it is not a separate word in the sentence. https://www.w3schools.com/tags/tag_rt.asp
Thank you for the excellent browser extension!