Replace mecab with a deconjugation parser and display inflection explanations on definitions #210
base: master
Conversation
@BigBoyBarney I still don't understand the extent of your suggestion. Do you mean to parse the output of mecab to derive yomichan-style inflection explanations (
Unidic is pretty big at ~500 MB zipped, ~1 GB unzipped for the lightweight version.
I'm not clear on what those issues are.
I don't believe performance of either mecab or a deinflector is an issue. I did some very quick benchmarks and they were on the same order of performance, and the dictionary queries as a whole were dwarfed by the Qt UI code that took ~90-99% of the runtime.
I don't really have any preferences on this feature, as I have been fine without it until now. Just some of my thoughts here:

As I mentioned a long time ago in #109 (comment), I don't think it is very productive to try to make MeCab's output match western-style explanations (often known collectively as 日本語教育文法, though the details differ between instructional materials), as MeCab's output is aligned with the more traditional 学校文法 that's taught to native speakers. Here's a table listing some of the differences (excerpt taken from 考えて、解いて、学ぶ 日本語教育の文法, https://www.3anet.co.jp/np/books/4692/).

I really do think natural language has too many edge cases, and no rule-based parser (MeCab included) has been accurate enough for me to rely on it during learning. If we really add this feature, and don't want users to learn school grammar, I think a custom parser would probably be better than trying to shoehorn MeCab output. However, to be honest, I feel the most helpful thing in this day and age is probably to pass the context to some LLM server you have running and ask it to explain the usage of the word being looked up.
For a very simple example where MeCab fails:
MeCab with IPAdic fails to segment the text properly at "人に委ねるな" and "ってことですか", and thinks there's a なる. Bad segmentation from MeCab is really the issue that affects Memento, though the addition of the search/lookup feature ameliorates that: if the word we want to look up cannot be selected, just type it with Ctrl-R.

With Unidic it works, and it's probably the best thing for inflections, but I also understand the reasons this project doesn't want to use it.
That was my initial point. Western grammar is not, and will not be, applicable to Japanese to a satisfactory extent. I don't like Yomi's deinflection explanations either, so I was not suggesting implementing the exact same functionality with MeCab. This misunderstanding was possibly a mistake on my part; I should have clarified my point more.
Unidic, given the appropriate dictionary (現代書き言葉, 現代話し言葉 or a version of 古文用 if you're reading classical Japanese), always works. It might not be the first entry that's applicable to the given context, but if you use a
My main gripe with it is that it's not how the language works. I would much prefer simply tabulating the MeCab output directly in a dropdown of some sort, so it can be viewed from within Memento. I'm already using it in a separate window while watching or reading, so this would streamline my workflow. However, I understand that a lot of people, allegedly the majority who use Memento, would prefer direct western grammar equivalents. I only ask that if this ends up being merged, please keep the current MeCab parser as a togglable option as well.
Ultimately, prescriptive grammars are academic constructs, and I personally do not believe it is necessary to learn either system to be able to accurately understand contemporary Japanese, given that one consumes enough content, which after all is the postulate of language acquisition through mass immersion. I must say the case for 学校文法, besides the fact that most academic texts in Japanese use its terms, is that it is indispensable for analyzing classical Japanese, whereas 教育文法 is only fitted to contemporary Japanese and does not generalize. Though I doubt that is the main use case for Memento for most users.

@BigBoyBarney If you have a workflow and materials that you learn classical Japanese in Memento with ([lesson?] videos and accurate transcriptions / subtitles), please send me an email; I'd be very interested. In that case, again, Memento could use better segmentation or just custom selection of text to look up, but that is a different matter.
I don't think there's a current MeCab parser option. Unless I've lost track of recent development, Memento uses MeCab to do segmentation only, to get the substring to look up when you mouse over it. I also hope this can still be an option. This PR adds a "Deconjugator", which Memento never had before. @spacehamster Actually, how do you mean replacing MeCab with this? I assume you are not removing it. Are you giving progressively longer inputs to your deconjugator?
The initial idea was to remove mecab completely; the code that uses mecab is currently commented out in the PR. Mecab is only used to find any possible deconjugated forms in a sentence, so if you give it 行かれる途中でしたか?, mecab will generate 行く, 行かれる途中です, 行かれる途中でる, 行かれる途中だ, and 行かる, which are then used to search the dictionary. It's more like partial segmentation, because memento relies on the user's cursor position to find the start of a word. The deconjugator does the same thing, but generates inflection explanations while it's at it.
Yes, for the sentence 行かれる途中でしたか? it will generate queries like 行かれる途中でしたかる. I've been thinking about it more, and I think it may be a good idea to leave mecab in and have it be an option to choose one method or the other. Classical Japanese is a bit of a niche example, but I think a more likely scenario is obscure grammar points and dialects like kansai-ben; yomichan cannot parse stuff like 戦わざるをえない or 行かへん, which mecab can manage.
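For anyone following along, the core idea of a rule-based deconjugator can be sketched as a recursive suffix-stripping pass. This is a minimal illustration only; the rule table and field names below are made up for the example and are not the actual rule set used in this PR.

```javascript
// Minimal sketch of rule-based deconjugation as discussed above.
// The rule table is illustrative only, not the PR's actual data.
const rules = [
    { suffix: "かれる", replace: "く", explanation: "passive" },
    { suffix: "れる", replace: "る", explanation: "passive" },
    { suffix: "でした", replace: "です", explanation: "polite past" },
];

// Strip one matching suffix at a time, accumulating the explanation chain.
// Every intermediate form is emitted so each candidate can be tried
// against the dictionary database.
function deconjugate(word, chain = []) {
    const results = [{ deconj: word, explanations: chain }];
    for (const rule of rules) {
        if (word.endsWith(rule.suffix)) {
            const base =
                word.slice(0, word.length - rule.suffix.length) + rule.replace;
            results.push(...deconjugate(base, [...chain, rule.explanation]));
        }
    }
    return results;
}
```

With these toy rules, `deconjugate("行かれる")` yields 行く tagged "passive" alongside spurious candidates like 行かる, which simply won't match anything in the dictionary, mirroring how the over-generated queries discussed above are harmless.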
@spacehamster I think this sounds good. I personally won't be needing it, to save some cycles given how intensive Qt + mpv already is. I do wonder about the performance when starting to parse from the leftmost position. Though the length of a line should be bounded, my dictionary db is almost 3 GB.

More than anything, I think we should sit back and make sense of the terminology in this discussion. First, MeCab is a tokenizer and morphological analyzer. If I understand correctly, your "deconjugator" is also a morphological analyzer that needs the stem/root form of the word. You still use MeCab's morphological analyzer to get the stem; it's just that you want to generate the western-style explanation instead of using MeCab's output. Can you also clarify if/how you are using MeCab's tokenization? I think we should clearly define things and figure out consistent terminology to use, then decide the direction we should take.
@Calvin-Xu I'm primarily motivated by seeing inflection explanations implemented. I used a deconjugator approach because:
I'm interpreting stem/root form to mean the plain/dictionary form such as 食べる and not 食べ. The deconjugator doesn't take the dictionary form as input, but emits it as output, so it replaces mecab's functionality and doesn't use mecab at all. A pseudocode simplification of how searchTerms in dictionary.cpp currently works:

```
function searchTerms(query, subtitle)
{
    //query is a substring of subtitle, in this example
    //query=行かれる途中でしたか?
    //subtitle=王立学園に行かれる途中でしたか?
    let terms = [];
    for (let i = 1; i < query.length; i++)
    {
        //Search for exact matches
        //行, 行か, 行かれ, etc.
        terms += db.search(query.substring(0, i));
    }
    let mecabQueries = generateMecab(query);
    //generateMecab produces a list of surface and deconj forms using mecab
    //only deconj is used for searching the db; surface is used for highlighting the subtitle and cloze body
    //[{ deconj: "行く", surface: "行か" },
    // { deconj: "行かれる途中です", surface: "行かれる途中でし" },
    //...snip
    // { deconj: "行かる", surface: "行かれ" }]
    for (let mecabQuery of mecabQueries)
    {
        terms += db.search(mecabQuery.deconj);
    }
    terms = terms.sorted();
    emit termsChanged(terms); //Let the popup widget know there are new terms
}
```

and the proposed change:

```
function searchTerms(query, subtitle)
{
    //query is a substring of subtitle, in this example
    //query=行かれる途中でしたか?
    //subtitle=王立学園に行かれる途中でしたか?
    let terms = [];
    for (let i = 1; i < query.length; i++)
    {
        //Search for exact matches
        //行, 行か, 行かれ, etc.
        terms += db.search(query.substring(0, i));
    }
    let deconjugatorQueries = deconjugate(query);
    //deconjugate produces a list of deconj forms, surface forms, and inflection explanations
    //without using mecab
    //only deconj is used for searching the db; surface is used for highlighting the subtitle and cloze body
    //[{ deconj: "行かれる途中でしたかる", surface: "行かれる途中でしたか", explanation: "masu stem" },
    // { deconj: "行かれる途中でしたる", surface: "行かれる途中でした", explanation: "masu stem" },
    //...snip
    // { deconj: "行く", surface: "行かれる", explanation: "passive" },
    // { deconj: "行る", surface: "行", explanation: "masu stem" },
    // { deconj: "行かる", surface: "行かれ", explanation: "imperative" }]
    for (let deconjugatorQuery of deconjugatorQueries)
    {
        terms += db.search(deconjugatorQuery.deconj);
    }
    terms = terms.sorted();
    emit termsChanged(terms); //Let the popup widget know there are new terms
}
```
Are you imagining the search is done progressively, showing the user results one by one as they are found? The current implementation (both mecab and deconjugator) searches the database in one go and collects all the definitions before showing them to the user.
Searches from mousing over subtitles are currently bounded with a max length of 37 characters (for both mecab and deconjugator); I don't believe the search widget has a bound. I did some quick performance logging for what I believe is a typical case, which was enough to convince me performance was not a major concern, as the Qt UI was the main bottleneck. I have not looked at pathological worst cases; my sqlite dictionary is 600 MB, so it may get worse if it's 3 GB with very large inputs. I think it would be a good idea to have a global settings flag to switch between the original method and the new method, because there are cases where a deconjugator will fail, like kansai-ben 行かへん, but as a bonus it would also allow you to compare performance side by side.
Great. Let's just wait for what @ripose-jp thinks. I think this accomplishes the goal of having the feature Yomichan/Yomitan has. Given the renewed interest in UniDic, I have opened another issue on supporting it #211 |
I'm not interested in discussing the merits of 教育文法 vs 学校文法 because I am unqualified to do so. I read Tae Kim's Grammar Guide and have mostly just learned by feel since. That said, if this works well, it will replace MeCab as the default. This style of deconjugation is what most Memento users will likely want, as it is more consistent with the way most people learn Japanese as a second language. I'll probably make enabling MeCab support a compile-time option so I can drop the dependency from the binaries I distribute. I promise that MeCab support will remain for those that want it.

To achieve this multiple-deconjugation approach, I believe it's for the best that I rework how searching currently works. Right now it's a bit of a rat's nest that hard-codes a start-to-finish algorithm. I think a pipeline approach would be best. It would look something like:

Query Generator → Query Merge → Duplicate Filter → Database Query → Results

The query generation step is the most relevant to this PR. I'll probably make an interface class that query generators can implement, then rework the code so existing generators are implemented in something like ExactGenerator, MeCabGenerator, and MultiGenerator. This way query generators are much more modular and don't require so much additional code to implement. This PR would be mostly contained to creating a DeconjugationGenerator, save for a few potential changes to the Database Query step to fix some problems with the PR. This also opens the door to other generators, such as a UniDic generator, in the future.

I don't expect you to implement this yourself, so I ask that you hold off sweating the details of how your code fits into Memento at the moment. I'll try to get this done this weekend.
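A rough sketch of how that pipeline could fit together, in the same pseudocode style used earlier in this thread. The class and method names here (QueryGenerator, generateQueries, runPipeline) are my own guesses for illustration, not Memento's actual API.

```javascript
// Hypothetical sketch of the proposed pipeline:
// Query Generator → Query Merge → Duplicate Filter → Database Query → Results.
// All names below are illustrative, not Memento's real classes.
class QueryGenerator {
    generateQueries(text) { throw new Error("not implemented"); }
}

// Equivalent of the exact-match loop: every prefix of the text is a query.
class ExactGenerator extends QueryGenerator {
    generateQueries(text) {
        const queries = [];
        for (let i = 1; i <= text.length; i++) {
            queries.push({ deconj: text.slice(0, i), surface: text.slice(0, i) });
        }
        return queries;
    }
}

// Merge the output of all enabled generators, drop duplicates, query the db.
function runPipeline(generators, text, db) {
    const merged = generators.flatMap(g => g.generateQueries(text));
    const seen = new Set();
    const unique = merged.filter(q => {
        const key = `${q.deconj}\u0000${q.surface}`;
        if (seen.has(key)) return false;
        seen.add(key);
        return true;
    });
    return unique.flatMap(q => db.search(q.deconj));
}
```

Under this shape, swapping MeCab for the deconjugator just means passing a different generator list to runPipeline; the merge, dedup, and database steps stay shared.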
I agree. There is probably something that can be changed in the sorting algorithm to make this happen. I don't expect it to be too difficult.
Yomichan does a fairly poor job of finding しよう as する, considering it doesn't even rank. I don't think this is such a big deal, considering that しよう is a fairly common word that happens to be an exception to standard rules, so most people will learn it on its own rather than in the greater grammatical framework. Likewise, most dictionaries will have an entry for けれど, so as long as that ranks above any irrelevant results, I believe it's not a big issue. Of course, try to fix these things if you can, but don't sweat them if you can't.
Your deconjugator is providing valuable information, so the vertical space is well warranted. Overall, great PR and I have no reason to believe this won't be merged eventually.
I've pushed the current state that I have, which has some very minor changes; I've been holding off on making any big changes. I've added some more deconjugation forms. I added a quick and easy query filter because the duplicates were annoying me. I added a config flag that allows for choosing between mecab and the deconjugator. I don't think this is a great option for normal users, because it requires technical understanding of the difference between parsing methods to know what it does, but I found it useful for switching between the two quickly while testing.
Create a new class called QueryGenerator that can generate queries from a string of text. ExactQueryGenerator implements this for searching exact text, and MeCabQueryGenerator implements this for deconjugating text with MeCab. This makes it so these two methods of generating queries can share all the same code for filtering duplicates, generating cloze info, and searching the database. This also makes it easier to implement new query generators in the future, such as the deconjugator in #210 or the proposed MeCab with UniDic option in #211. This commit removes all multithreading from searches. I have known for a while that it provides little to no performance improvement due to most time being spent querying a synchronized database. It may also harm performance on Windows. I don't see any need to add it back, but it can be if there is a compelling reason to. Queries seem plenty fast with this change.
Add option MECAB_SUPPORT that allows Memento to be built without MeCab. This will be ON by default for now, but when #210 is merged, it will be OFF by default.
I've pushed my update that makes adding new deconjugators much easier. It does all the work of merging results and filtering duplicates later in the pipeline, so you just need to focus on outputting queries. Most of your code should be moved into a class that inherits from
I've pushed the latest changes. I think the output is in a pretty good state now. I think there's room for refactoring to improve performance and readability; I've only been trying to ensure the output was correct. I think I've got all the common forms now. Some notes:
Overall, very solid code. Most of my comments are nitpicks about style and best practices.
Not included in my other comments is that there is some trailing whitespace in some files that needs to be removed. I prefer to use VSCode with the "Remove Trailing Whitespace" option set, but you can also remove whitespace with awk.
Commit message should be changed to conform to what's written in CONTRIBUTING.md. Most of the explanation can probably be copied or paraphrased from your initial PR description.
I'd not seen the -とる contraction before; I'll add it. There were a few forms I found that yomichan didn't have, like 食べん and 食べなきゃ. Yomichan conjugates 食べてった as progressive or perfect, which seems to be a bug. I'm not a fan of how 食べていく is deconjugated as "-te < progressive or perfect < masu stem"; it'd make more sense to just do "-te", but I think that'd require hardcoding an exception for that particular form. The spacing and the '<' looked different on my machine. I think I was trying to get rid of the padding within the labels, as the spacing between the headword and the deconjugation seemed excessive. I'll change the spacing back and see if I can fix label padding with the custom style sheets, or if the flowlayout helps, or something. Layout spacing was the wrong place to change anyway. How is " « " for the delimiter?
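For what it's worth, rendering the chain with that delimiter is a one-liner; a tiny sketch, where the array-of-strings shape for the step chain is an assumption for illustration rather than the PR's actual data structure:

```javascript
// Join a chain of deconjugation steps with the " « " delimiter proposed above.
// The array-of-strings input shape is assumed for illustration.
function formatExplanation(chain) {
    return chain.join(" « ");
}

// e.g. ["-te", "progressive or perfect", "masu stem"]
// → "-te « progressive or perfect « masu stem"
```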
I know I was quite opposed to this implementation in the other related PR, but this seems to work a lot better than I had initially thought. My apologies.
@BigBoyBarney I thought -ておる was a humble equivalent of -ている in standard Japanese, and the standard form of -ている in 関西弁? I did find some Stack Exchange posts mentioning that it would be weird to hear -とる in modern standard Japanese, but either way, just quickly searching my own media, it seems to pop up a fair bit, particularly in historical fantasy from older characters. I think in general, including dialectal forms is out of scope; my knowledge of dialectal Japanese is essentially zero and I wouldn't be likely to use it. I think it would make sense to include exceptions for forms you're likely to see in media targeting standard-Japanese viewers. A couple strategies I can think of for tackling non-standard Japanese:
You are correct.
I agree; maybe hardcoding some of the more common forms could provide reasonable coverage? How does Yomitan handle this? 関西弁 is quite common, and there are numerous shows where certain characters speak exclusively in it; however, I think it would be important and nice to have some kind of indicator that the given form is not necessarily standard Japanese.
Assuming UniDic, MeCab parses 関西弁 correctly but gives no explanation as to whether it's dialectal or not. However, given the choice between (potentially) incorrect grammar information (as might be the case with Yomi and dialectal Japanese) and none, I personally would prefer the latter. Maybe an indicator for a "guessed" grammar pattern that does not fit the existing inflections could be an option?
I'd just try adding the deconjugation label as the last element of
Perfect. Regarding the whole debate about whether or not to deconjugate とる and other "non-standard" ways of speaking, I think the deconjugator should support as many forms as possible. What's important is that the user understands what's being said, not necessarily the right context in which to use it themselves. There are other resources better suited to that than Memento. For strange ways of speaking, it should be pretty obvious to listeners that the character throwing around わし, とる, じゃ, and the like isn't speaking like a normal person. The likelihood of a JSL learning the language like an ESL only reading Shakespeare is very slim. Frankly if they do something like that, they get what they deserve. You have my complete support adding any regional forms you come across in media, so long as you deconjugate them correctly.
@ripose-jp I was playing around with different layouts for the explanation label. It doesn't seem like FlowLayout supports text wrapping around objects or splitting across lines, so it doesn't seem like a good fit. I thought of three ways of doing it: A) Having a single label for kanji + kana + explanation and styling it with rich text. I don't like option A btw.
Doesn't work because Qt doesn't support
This would be cool if you could pull it off.
I'm okay with the third, but I think the way you've laid it out could be better. Instead of using a grid layout, I think you should stack layouts since it will keep the kana and kanji labels closer together if they are a single element in their own vertical layout. Here's a screenshot of how I might do it. For terms with kana and kanji, I think this is a great use of space. For terms that are just kana though, I think it would be better if the deconjugation was on the line beneath since the kana will line up with the buttons better and there's less empty space on the top and bottom.
Could this potentially be solved with QtWebKit?
I explicitly ban QtWebEngine as a dependency in CONTRIBUTING.md. Pulling in the entirety of Chromium just for
To be clear, I wasn't talking about ruby, but something like `qlabel.setText("<span style=\"font-size:12px\">かな</span><br><span style=\"font-size:18px\">漢字</span><span style=\"font-size:12px\">deconjugation « explanation</span>")`, because it lets deconjugation explanations sit on the same line as the kanji and wrap directly below it. I don't think it's a good fit here. WebKit UI layouts would be nice, but at that point you may as well use Electron and write the whole thing in JavaScript.

I've changed the way queries are merged to prevent incorrect deconjugation explanations from being shown; this means it will generate multiple queries with the same deconj string. I've changed the location of the deconj string. There seems to be a bug in how Qt handles word wrap and layouts, which causes extra space between layouts when the window gets too small. This only seems to show up in the search tool because it has a smaller width; setting the deconj inline minimum width to 180px seems to solve it for most reasonable cases, though it'll still pop up in extreme cases. An alternative might be putting the main tags on the top row and the deconjugation string under.

I briefly tried to revert def.rules back to a taglist, but that actually broke things because m_tagCache returns an empty tag, so def.rules don't get populated.
If you could get your label HTML to work, it wouldn't be a terrible idea. I don't think it will work though since you have to somehow center the kana over the kanji and I don't see a way to do that using the HTML subset supported by Qt. The two pictures you showed look excellent. I'd be fine merging those as is.
I don't find this concerning, since 180px width is an extreme case and I don't see the need to make it a great user experience. Just to note, it might be worth rebasing on master, since there was a bug in the duplicate filtering that was fixed by 923b7d2. I don't want you to think that's your code's fault and waste time on it.
Looks pretty good. Some things remain outstanding. Feel free to unmark this PR as a draft whenever you're ready.
I noticed that kana-only entries have some extra padding above and below that they didn't have before. I've modified your code a bit to fix this. Here's the diff.
diff --git a/src/gui/widgets/definition/termwidget.cpp b/src/gui/widgets/definition/termwidget.cpp
index 4ee8aeb..6f4980f 100644
--- a/src/gui/widgets/definition/termwidget.cpp
+++ b/src/gui/widgets/definition/termwidget.cpp
@@ -94,8 +94,6 @@ TermWidget::TermWidget(
m_ui->labelKanji->setStyleSheet(EXPRESSION_STYLE);
m_ui->labelKana->setStyleSheet(READING_STYLE);
- m_ui->labelDeconjUnder->setStyleSheet(CONJUGATION_STYLE);
- m_ui->labelDeconjInline->setStyleSheet(CONJUGATION_STYLE);
m_layoutTermTags = new FlowLayout(-1, 6);
m_layoutFreqTags = new FlowLayout(-1, 6);
@@ -212,19 +210,21 @@ void TermWidget::initUi(
}
m_ui->labelKanji->setText(kanjiLabelText);
- if (term.reading.isEmpty() || term.conjugationExplanation.size() > 100)
+ if (!term.conjugationExplanation.isEmpty())
{
- m_ui->labelDeconjInline->hide();
- // Do I even need to call show?
- m_ui->labelDeconjUnder->show();
- m_ui->labelDeconjUnder->setText(term.conjugationExplanation);
- }
- else
- {
- m_ui->labelDeconjUnder->hide();
- // Do I even need to call show?
- m_ui->labelDeconjInline->show();
- m_ui->labelDeconjInline->setText(term.conjugationExplanation);
+ m_labelDeconj = new QLabel;
+ m_labelDeconj->setStyleSheet(CONJUGATION_STYLE);
+ m_labelDeconj->setWordWrap(true);
+ m_labelDeconj->setTextInteractionFlags(Qt::TextSelectableByMouse);
+ if (term.reading.isEmpty() || term.conjugationExplanation.size() > 100)
+ {
+ m_ui->layoutTermWidget->insertWidget(1, m_labelDeconj);
+ }
+ else
+ {
+ m_ui->layoutButtonsDeconj->addWidget(m_labelDeconj);
+ }
+ m_labelDeconj->setText(term.conjugationExplanation);
}
for (const Frequency &freq : term.frequencies)
diff --git a/src/gui/widgets/definition/termwidget.h b/src/gui/widgets/definition/termwidget.h
index bfb6569..7dac42a 100644
--- a/src/gui/widgets/definition/termwidget.h
+++ b/src/gui/widgets/definition/termwidget.h
@@ -23,6 +23,7 @@
#include <QWidget>
+#include <QLabel>
#include <QMutex>
#include "definitionstate.h"
@@ -228,6 +229,9 @@ private:
/* Lock JSON sources */
QMutex m_lockJsonSources;
+ /* Label used for displaying deconjugation information */
+ QLabel *m_labelDeconj{nullptr};
+
/* Layout used for holding term tags. */
FlowLayout *m_layoutTermTags;
diff --git a/src/gui/widgets/definition/termwidget.ui b/src/gui/widgets/definition/termwidget.ui
index b036c5a..b7a41d4 100644
--- a/src/gui/widgets/definition/termwidget.ui
+++ b/src/gui/widgets/definition/termwidget.ui
@@ -7,7 +7,7 @@
<x>0</x>
<y>0</y>
<width>400</width>
- <height>76</height>
+ <height>65</height>
</rect>
</property>
<property name="sizePolicy">
@@ -25,7 +25,7 @@
<property name="windowTitle">
<string>Form</string>
</property>
- <layout class="QVBoxLayout" name="verticalLayout">
+ <layout class="QVBoxLayout" name="layoutTermWidget">
<item>
<layout class="QHBoxLayout" name="layoutTop">
<item>
@@ -73,7 +73,7 @@
<enum>Qt::Horizontal</enum>
</property>
<property name="sizeType">
- <enum>QSizePolicy::Policy::Fixed</enum>
+ <enum>QSizePolicy::Fixed</enum>
</property>
<property name="sizeHint" stdset="0">
<size>
@@ -179,54 +179,10 @@
</item>
</layout>
</item>
- <item>
- <widget class="QLabel" name="labelDeconjInline">
- <property name="minimumSize">
- <size>
- <width>180</width>
- <height>0</height>
- </size>
- </property>
- <property name="text">
- <string/>
- </property>
- <property name="alignment">
- <set>Qt::AlignBottom|Qt::AlignLeading|Qt::AlignLeft</set>
- </property>
- <property name="wordWrap">
- <bool>true</bool>
- </property>
- </widget>
- </item>
</layout>
</item>
</layout>
</item>
- <item>
- <widget class="QLabel" name="labelDeconjUnder">
- <property name="sizePolicy">
- <sizepolicy hsizetype="Ignored" vsizetype="Ignored">
- <horstretch>0</horstretch>
- <verstretch>0</verstretch>
- </sizepolicy>
- </property>
- <property name="minimumSize">
- <size>
- <width>0</width>
- <height>0</height>
- </size>
- </property>
- <property name="alignment">
- <set>Qt::AlignBottom|Qt::AlignLeading|Qt::AlignLeft</set>
- </property>
- <property name="wordWrap">
- <bool>true</bool>
- </property>
- <property name="textInteractionFlags">
- <set>Qt::LinksAccessibleByMouse|Qt::TextSelectableByMouse</set>
- </property>
- </widget>
- </item>
<item>
<widget class="QWidget" name="glossaryContainer" native="true">
<property name="sizePolicy">
force-pushed from 305920c to 4fcf1d6
I wanted to check in to see if you're ready to unmark this as a draft. I've used it for the past few weeks and functionally it has worked perfectly. |
Yes. I also haven't noticed any issues so I think it is good enough now. |
I've gone through the code and made a few changes related to optimization, style, trailing whitespace, etc. I've included the diff below. It looks like it functions the same as the current code, but I recommend you test it. I agree with you about rules now, so no need to make any changes related to that. Just update the commit with the diff and make sure the message complies with the guidelines in CONTRIBUTING.md. I'll merge it once that's done. Thanks for all the hard work.

Changes Diff

diff --git a/src/dict/databasemanager.cpp b/src/dict/databasemanager.cpp
index 37e35cb..c2808a4 100644
--- a/src/dict/databasemanager.cpp
+++ b/src/dict/databasemanager.cpp
@@ -618,7 +618,13 @@ int DatabaseManager::populateTerms(const QList<SharedTerm> &terms) const
(const char *)sqlite3_column_text(stmt, COLUMN_DEF_TAGS),
def.tags
);
- def.rules = (const char *)sqlite3_column_text(stmt, COLUMN_RULES);
+ QStringList rules = QString(
+ (const char *)sqlite3_column_text(stmt, COLUMN_RULES)
+ ).split(' ');
+ def.rules = {
+ std::move_iterator{std::begin(rules)},
+ std::move_iterator{std::end(rules)}
+ };
term->definitions.append(def);
}
if (isStepError(step))
diff --git a/src/dict/deconjugationquerygenerator.cpp b/src/dict/deconjugationquerygenerator.cpp
index af3e2a8..fa90d49 100644
--- a/src/dict/deconjugationquerygenerator.cpp
+++ b/src/dict/deconjugationquerygenerator.cpp
@@ -21,7 +21,6 @@
#include "deconjugationquerygenerator.h"
#include "deconjugator.h"
-#include <QDebug>
#include <QList>
#include "util/utils.h"
@@ -54,34 +53,37 @@ std::vector<SearchQuery> DeconjugationQueryGenerator::generateQueries(
{
return {};
}
+
QList<ConjugationInfo> deconjQueries = deconjugate(text);
std::vector<SearchQuery> result;
for (ConjugationInfo &info : deconjQueries)
{
QString rule = convertWordformToRule(info.derivations[0]);
auto duplicateIt = std::find_if(
- result.begin(),
+ result.begin(),
result.end(),
- [=](SearchQuery o) {
- return o.deconj == info.base &&
- (o.ruleFilter.contains(rule) ||
- o.conjugationExplanation == info.derivationDisplay);
+ [&] (const SearchQuery &o)
+ {
+ if (o.deconj == info.base)
+ {
+ return o.ruleFilter.contains(rule) ||
+ o.conjugationExplanation == info.derivationDisplay;
+ }
+ return false;
}
);
if (duplicateIt != result.end())
{
- if(!duplicateIt->ruleFilter.contains(rule))
- {
- duplicateIt->ruleFilter.push_back(rule);
- }
+ duplicateIt->ruleFilter.insert(std::move(rule));
}
else
{
- result.push_back({
- info.base,
- info.conjugated,
+ result.emplace_back(SearchQuery{
+ info.base,
+ info.conjugated,
{ rule },
- info.derivationDisplay });
+ info.derivationDisplay
+ });
}
}
diff --git a/src/dict/deconjugator.cpp b/src/dict/deconjugator.cpp
index adcc8e5..6b7b213 100644
--- a/src/dict/deconjugator.cpp
+++ b/src/dict/deconjugator.cpp
@@ -33,7 +33,7 @@ struct Rule
WordForm conjugatedType;
};
-const QList<Rule> silentRules =
+static const QList<Rule> silentRules =
{
{ u8"ない", u8"ない", WordForm::negative, WordForm::adjective },
{ u8"たい", u8"たい", WordForm::tai, WordForm::adjective },
@@ -50,7 +50,7 @@ const QList<Rule> silentRules =
{ u8"とく", u8"とく", WordForm::toku, WordForm::godanVerb },
};
-const QList<Rule> rules = {
+static const QList<Rule> rules = {
//Negative
{ u8"る", u8"らない", WordForm::godanVerb, WordForm::negative },
{ u8"う", u8"わない", WordForm::godanVerb, WordForm::negative },
@@ -556,16 +556,16 @@ static bool isTerminalForm(WordForm wordForm)
}
static ConjugationInfo createDerivation(
- const ConjugationInfo &parent,
+ const ConjugationInfo &parent,
const Rule &rule)
{
QList<WordForm> childDerivations(parent.derivations);
if (childDerivations.isEmpty())
{
- childDerivations.insert(childDerivations.begin(), rule.conjugatedType);
+ childDerivations.prepend(rule.conjugatedType);
}
- childDerivations.insert(childDerivations.begin(), rule.baseType);
- int replacementStart = parent.base.size() - rule.conjugated.size();
+ childDerivations.prepend(rule.baseType);
+ qsizetype replacementStart = parent.base.size() - rule.conjugated.size();
QString childWord = QString(parent.base)
.replace(replacementStart, rule.conjugated.size(), rule.base);
ConjugationInfo childDetails = {
@@ -584,7 +584,7 @@ static void deconjugateRecursive(
const QString word = info.base;
for (const Rule &rule : rules)
{
- WordForm currentWordForm = info.derivations.size() == 0 ?
+ WordForm currentWordForm = info.derivations.empty() ?
WordForm::any :
info.derivations[0];
if (rule.conjugatedType != currentWordForm &&
@@ -599,19 +599,27 @@ static void deconjugateRecursive(
ConjugationInfo childDetails = createDerivation(info, rule);
if (isTerminalForm(rule.baseType))
{
- results.push_back(childDetails);
+ results.emplace_back(childDetails);
for (const Rule &silentRule : silentRules)
{
- if (silentRule.conjugatedType != rule.baseType) continue;
- if (!childDetails.base.endsWith(silentRule.base)) continue;
- Rule derivedRule = {
- rule.base,
- rule.conjugated,
- silentRule.baseType,
- rule.conjugatedType };
+ if (silentRule.conjugatedType != rule.baseType)
+ {
+ continue;
+ }
+ if (!childDetails.base.endsWith(silentRule.base))
+ {
+ continue;
+ }
+ Rule derivedRule = {
+ rule.base,
+ rule.conjugated,
+ silentRule.baseType,
+ rule.conjugatedType
+ };
ConjugationInfo derivedDetails = createDerivation(
- info,
- derivedRule);
+ info,
+ derivedRule
+ );
deconjugateRecursive(derivedDetails, results);
}
}
@@ -627,19 +635,25 @@ static QString formatDerivation(QList<WordForm> derivations)
QString result;
QList<WordForm> displayRules;
std::copy_if(
- derivations.begin(),
- derivations.end(),
- std::back_inserter(displayRules),
- [] (WordForm ruleType) {
- if (ruleType == WordForm::conjunctive) return false;
- if (isTerminalForm(ruleType)) return false;
+ derivations.begin(),
+ derivations.end(),
+ std::back_inserter(displayRules),
+ [] (WordForm ruleType)
+ {
+ if (ruleType == WordForm::conjunctive)
+ {
+ return false;
+ }
+ else if (isTerminalForm(ruleType))
+ {
+ return false;
+ }
return true;
}
);
- if (derivations.size() > 0 &&
- derivations[derivations.size() - 1] == WordForm::conjunctive)
+ if (derivations.size() > 0 && derivations.back() == WordForm::conjunctive)
{
- displayRules.push_back(WordForm::conjunctive);
+ displayRules.emplace_back(WordForm::conjunctive);
}
for (int i = 0; i < displayRules.size(); i++)
@@ -659,14 +673,14 @@ QList<ConjugationInfo> deconjugate(const QString query, bool sentenceMode)
if (sentenceMode)
{
QString word = query;
- while(!word.isEmpty())
+ while (!word.isEmpty())
{
ConjugationInfo detail = { word, word, QList<WordForm>(), "" };
deconjugateRecursive(detail, results);
word.chop(1);
- }
+ }
}
- else
+ else
{
ConjugationInfo detail = { query, query, QList<WordForm>(), ""};
deconjugateRecursive(detail, results);
@@ -678,4 +692,4 @@ QList<ConjugationInfo> deconjugate(const QString query, bool sentenceMode)
}
return results;
-}
\ No newline at end of file
+}
diff --git a/src/dict/deconjugator.h b/src/dict/deconjugator.h
index 335310d..4d3f4bd 100644
--- a/src/dict/deconjugator.h
+++ b/src/dict/deconjugator.h
@@ -24,7 +24,8 @@
#include <QString>
#include <QList>
-enum class WordForm {
+enum class WordForm
+{
godanVerb,
ichidanVerb,
suruVerb,
@@ -73,15 +74,18 @@ enum class WordForm {
/**
* A struct that contains the results of a deconjugation
*/
-struct ConjugationInfo {
+struct ConjugationInfo
+{
/* Plain form of a word. */
QString base;
+
/* The original conjugated form. */
QString conjugated;
+
/* A list of conjugations that describe the relationship
- * between base and conjugated.
- */
+ * between base and conjugated. */
QList<WordForm> derivations;
+
/* A human readable format of the derivations. */
QString derivationDisplay;
};
@@ -93,6 +97,7 @@ struct ConjugationInfo {
* find potential words by trimming the query
* @return A list of all the potential deconjugations found
*/
-QList<ConjugationInfo> deconjugate(const QString query, bool sentenceMode=true);
+QList<ConjugationInfo> deconjugate(
+ const QString query, bool sentenceMode = true);
-#endif // DECONJUGATOR_H
\ No newline at end of file
+#endif // DECONJUGATOR_H
diff --git a/src/dict/dictionary.cpp b/src/dict/dictionary.cpp
index 4f02d8c..07afb15 100644
--- a/src/dict/dictionary.cpp
+++ b/src/dict/dictionary.cpp
@@ -131,7 +131,7 @@ SharedTermList Dictionary::searchTerms(
}
sortQueries(queries);
- //filterDuplicates(queries);
+ filterDuplicates(queries);
if (index != *currentIndex)
{
return nullptr;
@@ -155,26 +155,22 @@ SharedTermList Dictionary::searchTerms(
}
if (query.ruleFilter.size() > 0)
{
- results.erase(std::remove_if(
- results.begin(),
- results.end(),
- [=](SharedTerm val) {
- for (const TermDefinition &def : val->definitions)
+ results.erase(
+ std::remove_if(
+ results.begin(),
+ results.end(),
+ [&] (const SharedTerm &val)
{
- for (const QString &rule : query.ruleFilter)
+ for (const TermDefinition &def : val->definitions)
{
- //Should technically call
- //def.rules.split(" ").contains(rule)
- //but it's not required because no rules are
- //substrings of other rules
- if(def.rules.contains(rule))
+ if (def.rules.intersects(query.ruleFilter))
{
return false;
}
}
+ return true;
}
- return true;
- }),
+ ),
results.end()
);
}
diff --git a/src/dict/expression.h b/src/dict/expression.h
index a0396ac..04e03f4 100644
--- a/src/dict/expression.h
+++ b/src/dict/expression.h
@@ -24,6 +24,7 @@
#include <QJsonArray>
#include <QList>
#include <QMetaType>
+#include <QSet>
#include <QSharedPointer>
#include <QString>
#include <QStringList>
@@ -101,7 +102,7 @@ struct TermDefinition
QList<Tag> tags;
/* A list of the rules associated with this entry. */
- QString rules;
+ QSet<QString> rules;
/* A list of glossary entries for this definition. */
QJsonArray glossary;
diff --git a/src/dict/searchquery.h b/src/dict/searchquery.h
index 5fe4f7d..32cd3bf 100644
--- a/src/dict/searchquery.h
+++ b/src/dict/searchquery.h
@@ -21,8 +21,8 @@
#ifndef SEARCHQUERY_H
#define SEARCHQUERY_H
+#include <QSet>
#include <QString>
-#include <QStringList>
/**
* A pair to search for. The deconjugated string is used for querying the
@@ -36,11 +36,11 @@ struct SearchQuery
/* The raw conjugated string */
QString surface;
- /* Filter results based on part of speach */
- QStringList ruleFilter;
+ /* Filter results based on part of speech */
+ QSet<QString> ruleFilter;
- /* The conjugation explaination of a term.
- Usually empty if the term was not conjugated. */
+ /* The conjugation explanation of a term. Usually empty if the term was not
+ * conjugated. */
QString conjugationExplanation;
};
|
Related to #109
The feature is still in early stages and I'm opening a pull request to discuss it further.
I've created a graph to give a very quick broad overview of how the deconjugator currently functions.
It works by trying to find a path from any node to a terminal node (a godan verb, ichidan verb, irregular verb, or adjective). The graph only shows the godan verb path for simplicity.
The rules were added ad-hoc as I encountered examples of them, so they aren't currently comprehensive.
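The recursive walk over those rules can be sketched roughly like this. This uses std containers instead of the Qt types, and a two-rule subset purely for illustration; the real rule table is much larger and the WordForm names here are a reduced subset.

```cpp
#include <cassert>
#include <string>
#include <vector>

// Reduced subset of the real enum, for illustration only.
enum class WordForm { any, godanVerb, negative, past };

struct Rule
{
    std::string base;        // suffix of the less conjugated form
    std::string conjugated;  // suffix of the more conjugated form
    WordForm baseType;       // node this edge leads to
    WordForm conjugatedType; // node this edge starts from
};

static const std::vector<Rule> kRules = {
    { "る", "らない", WordForm::godanVerb, WordForm::negative },
    { "ない", "なかった", WordForm::negative, WordForm::past },
};

static bool isTerminal(WordForm f)
{
    return f == WordForm::godanVerb; // only godan in this sketch
}

static bool endsWith(const std::string &s, const std::string &suffix)
{
    return s.size() >= suffix.size() &&
           s.compare(s.size() - suffix.size(), suffix.size(), suffix) == 0;
}

// Walk edges backwards from `word` until a terminal node is reached,
// collecting every candidate dictionary form found along the way.
void deconjugate(const std::string &word, WordForm current,
                 std::vector<std::string> &out)
{
    for (const Rule &rule : kRules)
    {
        if (current != WordForm::any && rule.conjugatedType != current)
        {
            continue;
        }
        if (!endsWith(word, rule.conjugated))
        {
            continue;
        }
        std::string next =
            word.substr(0, word.size() - rule.conjugated.size()) + rule.base;
        if (isTerminal(rule.baseType))
        {
            out.push_back(next); // reached a dictionary form
        }
        else
        {
            deconjugate(next, rule.baseType, out);
        }
    }
}
```

With just these two rules, 分からなかった (past < negative) resolves to the godan verb 分かる.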
One option moving forward: instead of specifying each transition one by one (such as { u8"せる", u8"せられる", WordForm::causative, WordForm::potentialPassive } for causative < passive), a hidden transition can be added ({ u8"せる", u8"せる", WordForm::causative, WordForm::ichidanVerb }) so that any conjugation that applies to ichidan verbs can also be applied to the causative form. The tradeoff is that this allows deconjugation of forms you'd never see in real life, such as 話されさせる (passive < causative), which is in line with how yomichan seems to work.

I didn't expect this to be difficult. I haven't given it much consideration because it's low on the priority list and it ties into a few other points mentioned later. It seems to be a matter of determining which part of the code is responsible for filtering duplicates: the deconjugator itself, the dictionary before it queries the database, or the dictionary after it receives terms from the database.
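For what it's worth, the pre-query option would amount to merging candidates that deconjugate to the same base form before they reach the database, accumulating their part-of-speech rules into one filter. A minimal sketch with hypothetical simplified types (std containers in place of QString/QSet):

```cpp
#include <cassert>
#include <set>
#include <string>
#include <vector>

// Simplified stand-in for SearchQuery.
struct Query
{
    std::string base;                 // deconjugated form to look up
    std::set<std::string> ruleFilter; // accepted part-of-speech rules
};

// Merge a new deconjugation candidate into the query list: if another
// path already produced the same base form, just widen its rule filter
// instead of issuing a duplicate database query.
void addCandidate(std::vector<Query> &queries,
                  const std::string &base, const std::string &rule)
{
    for (Query &q : queries)
    {
        if (q.base == base)
        {
            q.ruleFilter.insert(rule);
            return;
        }
    }
    queries.push_back({ base, { rule } });
}
```

Two derivation paths landing on the same base then produce a single query that accepts either part of speech.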
I picked this example because JMdict has two separate entries for 言う and 言える, so they should be presented as two separate search results. My concern was that the longer dictionary form should be shown first because it is more likely to be relevant. It might be clearer with the example 成り立ち in the sentence この学校の成り立ちをお話しましょう: the noun form 成り立ち is more relevant, but the verb form 成り立つ is shown first with the masu stem deconjugation. In this example the dictionary forms are the same length, so results of the same length from the ExactWorker should be prioritized over the DeconjugationWorker.

If you search しよう (to do, volitional), it will attempt to deconjugate しよう to しる using the volitional rule, and will then return 汁 (juice/sap). This would be a valid result if 汁 were a verb, but it's a noun. Another example is けれど deconjugating to く (imperative), which finds the noun 句. If part-of-speech tags aren't guaranteed, terms without okurigana can be assumed to be nouns and filtered that way.
Because inflection explanations are usually less relevant to the user than the English translation, I wanted to minimize the amount of vertical space they take up, but if this is difficult to do with Qt, I don't think it's worth much effort trying.