[Feature Request] Support for phrases #21

Open
seth-js opened this issue Jun 8, 2022 · 7 comments

@seth-js

seth-js commented Jun 8, 2022

In the example sentence: Как не сойти с ума, когда вокруг одна лишь тьма?, the sentence contains the phrase сойти с ума. There is a Wiktionary definition available for this phrase.

However, when I double-click сойти, I only get the definition for that one word, not the phrase. Ideally, the word search feature would look ahead until it reaches a punctuation character, then look up progressively shorter word sequences, working back down to the word you first selected.

Here's an example of how that would work:

Sentence: чтобы хотя бы так поддержать твои старания

  • I double-click хотя.
  • It looks up хотя бы так поддержать твои старания to see if it's a lemma or non-lemma Russian word.
  • No match found, continuing.
  • Look up хотя бы так поддержать твои
  • No match found, continuing.
  • It does this until it looks up хотя бы
  • I get a match, and it gives me the definition for this phrase.

This approach isn't viable with an online dictionary, since a single click can trigger many lookups. If this feature were added, it would need to require an offline dictionary.
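The look-ahead described above can be sketched roughly as follows. This is only an illustration, not vocabsieve's code: the `DICTIONARY` set and both function names are hypothetical, and a real offline dictionary would replace the in-memory set.

```python
import re

# Hypothetical offline dictionary: a set of known headwords and phrases.
DICTIONARY = {"хотя", "хотя бы", "сойти", "сойти с ума"}


def tokens_until_punctuation(sentence, start_word):
    """Return the words from start_word up to the next punctuation mark."""
    # Split the sentence into clauses at punctuation, then use the first
    # clause that contains the clicked word.
    for clause in re.split(r"[^\w\s]+", sentence):
        words = clause.split()
        if start_word in words:
            return words[words.index(start_word):]
    return []


def longest_phrase(sentence, start_word, dictionary=DICTIONARY):
    """Try the longest candidate first, dropping one trailing word per step."""
    words = tokens_until_punctuation(sentence, start_word)
    for end in range(len(words), 0, -1):
        candidate = " ".join(words[:end])
        if candidate in dictionary:
            return candidate
    return None
```

With a local set or hash table each candidate lookup is cheap, which is why the idea fits an offline dictionary better than one network request per candidate.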

@1over137
Contributor

1over137 commented Jun 8, 2022

This is an interesting feature.
Actually, for the example you gave, it would work just fine if you manually selected сойти с ума and then pressed "define direct".
Still, this would not work if it involves a conjugated or inflected word.

I'm not sure how to address this lemma problem though. Simply lemmatizing each word would not really work, since your example would become сойти с ум, which is not in the dictionary.

@seth-js
Author

seth-js commented Jun 8, 2022

It would search the surface form first, сойти с ума, then if there was no match it would search сойти с ум. The issue with manually highlighting сойти с ума and clicking define is that I would have to know that сойти с ума is a phrase to begin with. Yomichan correctly handles Russian phrases I throw at it, so perhaps I'll build a full dictionary for it instead.
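A minimal sketch of that two-step lookup, assuming a phrase table and a word-form-to-lemma table are available offline. `PHRASES`, `LEMMAS`, and the function names are hypothetical and only hold sample data:

```python
# Hypothetical sample data: phrase glosses and a word-form -> lemma table.
PHRASES = {"сойти с ума": "to go crazy", "чёрный рынок": "black market"}
LEMMAS = {"ума": "ум", "чёрного": "чёрный", "рынка": "рынок"}


def lemmatize(word):
    """Return the lemma for a known form, or the word itself otherwise."""
    return LEMMAS.get(word, word)


def lookup_phrase(words):
    """Search the surface form first, then the fully lemmatized form."""
    surface = " ".join(words)
    if surface in PHRASES:
        return surface
    lemmatized = " ".join(lemmatize(w) for w in words)
    if lemmatized in PHRASES:
        return lemmatized
    return None
```

Here сойти с ума is found by its surface form, so the сойти с ум fallback is never reached, while an inflected чёрного рынка is found only after lemmatizing.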

@1over137
Contributor

You managed to build dictionaries for Yomichan?? How did you do that? Is there an article somewhere?

@seth-js
Author

seth-js commented Jun 11, 2022

I'm really excited too. This will most likely be the first European language Yomichan dictionary ever.

I had to make a custom version of Yomichan that parses words based on space separators rather than character by character. There was also the issue where Yomichan parsed words in each element after the one I hovered over even if the element didn't have the same parent.

Making the dictionary was the easiest part. I have Wiktionary JSONs for multiple European languages but Russian is the most fleshed out one. Previously, I was basically reinventing the wheel and remaking an Electron version of Yomichan Search for Russian, but it looks like I can drop that project and work on this instead.

There are still some features I want to add, and it needs to undergo testing, but here's what I have so far:

[Image 1]

[Image 2]

I connected a Forvo audio server, and that's working too.

@1over137
Contributor

1over137 commented Jun 11, 2022

Great work!
Do you have the code for your project published somewhere?
On another note, can you share the Wiktionary JSONs and/or the code used to generate them?
Regarding your original request, I think it involves a somewhat complicated change to the way this tool works. Still, it sounds useful enough that I'll try to implement it at some point. (Or you are welcome to try it, of course.)
Still, I think there are going to be some more difficulties, due to some peculiarities with Russian grammar.

  1. inflections, as mentioned before. This can be addressed by brute forcing every combination of words, I suppose. How will you address this with Yomichan? Japanese does not have cases.
  2. flexible word order. As an example, the phrase "собаку съесть" involves a noun and a verb. Their order doesn't really matter, however: saying "съесть собаку" is possible too.
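One way to picture the brute-force idea covering both points is sketched below. It is an assumption about how such a search could look, not an existing implementation; `PHRASES`, `LEMMAS`, and the function names are hypothetical sample data and helpers.

```python
from itertools import permutations, product

# Hypothetical sample data.
PHRASES = {"сойти с ума", "собаку съесть"}
LEMMAS = {"сошёл": "сойти", "съел": "съесть", "ума": "ум", "собаку": "собака"}


def candidate_forms(words):
    """Yield every surface/lemma combination in every word order."""
    # For each word, allow either the form as typed or its lemma
    # (deduplicated when the two are identical).
    options = [tuple(dict.fromkeys((w, LEMMAS.get(w, w)))) for w in words]
    for combo in product(*options):
        for order in permutations(combo):
            yield " ".join(order)


def find_phrase(words):
    """Return the first candidate that is a known phrase, if any."""
    for candidate in candidate_forms(words):
        if candidate in PHRASES:
            return candidate
    return None
```

The number of candidates grows as roughly 2^n · n! for n words, so this only seems practical for short windows of two to four words.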

@seth-js
Author

seth-js commented Jun 11, 2022

All my work is on my computer, but I'm planning on putting it in a couple of repositories when I'm satisfied with this Yomichan project. The code for generating Wiktionary JSONs will also be available.

I don't plan on contributing to vocabsieve since this project should cover all my needs.

Inflections are easily handled since form data is provided in the JSONs. I get the JSONs from here (raw Wiktionary data), then I run a script that pulls out all the useful information into a new JSON file. Then I run another script to create dictionary entries. It's kind of complicated, so I'll have to clean that up a bit. Here's what the data looks like after all that:

[Image 3]

From there, I run another script on that JSON to create a compatible Yomichan dictionary.
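As a rough illustration of the extraction step, the sketch below reads JSON Lines records and builds a lemma table plus a form-to-lemma index. The field names (`word`, `senses`, `glosses`, `forms`, `form`, `tags`) are assumptions modeled on wiktextract-style output, the sample records are made up for the example, and this is not the author's script.

```python
import json

# Made-up sample records, one JSON object per line. The first form carries
# a combining stress mark (U+0301), which the sketch assumes may occur.
SAMPLE_JSONL = """\
{"word": "ум", "senses": [{"glosses": ["mind, intellect"]}], "forms": [{"form": "ума\u0301", "tags": ["genitive", "singular"]}, {"form": "уму", "tags": ["dative", "singular"]}]}
{"word": "сойти с ума", "senses": [{"glosses": ["to go crazy"]}], "forms": []}
"""


def extract_entries(jsonl_text):
    """Return (lemma -> glosses, inflected form -> set of lemmas)."""
    lemmas = {}
    form_index = {}
    for line in jsonl_text.splitlines():
        if not line.strip():
            continue
        record = json.loads(line)
        word = record["word"]
        glosses = [
            gloss
            for sense in record.get("senses", [])
            for gloss in sense.get("glosses", [])
        ]
        lemmas.setdefault(word, []).extend(glosses)
        for form in record.get("forms", []):
            # Strip the stress mark so forms match ordinary running text.
            text = form["form"].replace("\u0301", "")
            if text != word:
                form_index.setdefault(text, set()).add(word)
    return lemmas, form_index


lemmas, form_index = extract_entries(SAMPLE_JSONL)
```

A form index like this is what makes inflected forms easy to handle: an inflected word such as ума resolves to its lemma with a single dictionary access.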

Non-lemma words get their own definition in Yomichan. If you turn on Allow scanning popup content in the Yomichan settings, you can look up the lemmas that they point to. This was a bit tedious, so I modified Yomichan to also automatically do this for you and add it to the definitions list:

[Image 4]
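The automatic lemma lookup can be pictured along these lines. This is a sketch of the behavior only, written in Python rather than as Yomichan code; the `ENTRIES` layout and the `form_of` key are hypothetical.

```python
# Hypothetical entries: a non-lemma form points at its lemma via "form_of".
ENTRIES = {
    "ум": {"glosses": ["mind, intellect"]},
    "ума": {"glosses": ["genitive singular of ум"], "form_of": ["ум"]},
}


def define(word, entries=ENTRIES):
    """Return the entry for word followed by the entries it points to."""
    results = []
    seen = set()
    queue = [word]
    while queue:
        current = queue.pop(0)
        # Skip unknown words and anything already listed (avoids cycles).
        if current in seen or current not in entries:
            continue
        seen.add(current)
        entry = entries[current]
        results.append((current, entry["glosses"]))
        queue.extend(entry.get("form_of", []))
    return results
```

Looking up ума then lists its own non-lemma definition first, followed by the definition of ум.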

Unfortunately съесть собаку and собаку съесть don't have a definition since there's no entry for that phrase on Wiktionary (for English).

@seth-js
Author

seth-js commented Jun 17, 2022

Here's the GitHub repository for extracting relevant data from the Kaikki Wiktionary rip. Let me know if you have any problems with it.
