
Better Chinese support #10

Open
Madwonk opened this issue Apr 18, 2022 · 8 comments
Labels
enhancement New feature or request

Comments


Madwonk commented Apr 18, 2022

Since Chinese, unlike English, doesn't separate words or names with spaces, attempting to double-click a word simply selects the entire sentence.

The Firefox/Chrome extension Zhongwen and the Android app Pleco are examples of software that automatically detect words using a dictionary (Zhongwen bundles CC-CEDICT, which comes in at only 3.6 MB zipped).

It would be advantageous to integrate CC-CEDICT as a dictionary option for Chinese, and to leverage it to help select words in a Chinese sentence. I'm willing to contribute some code to help make this happen, but I'd like some input from the primary developer before doing so.
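As a rough illustration of the approach (not code from any of these projects): CC-CEDICT entries follow the line format `Traditional Simplified [pinyin] /definition/definition/`, so a minimal dictionary-based splitter could parse the headwords into a set and segment by greedy longest match. The file name and function names below are hypothetical:

```python
import re

# Each CC-CEDICT entry looks like: 傳統 传统 [chuan2 tong3] /tradition/traditional/
CEDICT_LINE = re.compile(r"^(\S+) (\S+) \[([^\]]*)\] /(.+)/$")

def load_cedict_words(path):
    """Collect the set of simplified headwords from a CC-CEDICT file."""
    words = set()
    with open(path, encoding="utf-8") as f:
        for line in f:
            if line.startswith("#"):
                continue  # skip license/header comments
            m = CEDICT_LINE.match(line.strip())
            if m:
                words.add(m.group(2))  # group(1) = traditional, group(2) = simplified
    return words

def segment(sentence, words, max_len=8):
    """Greedy longest-match segmentation against the headword set."""
    tokens, i = [], 0
    while i < len(sentence):
        for length in range(min(max_len, len(sentence) - i), 0, -1):
            candidate = sentence[i:i + length]
            if length == 1 or candidate in words:
                tokens.append(candidate)  # unknown single chars become their own token
                i += length
                break
    return tokens

words = load_cedict_words("cedict_ts.u8")  # hypothetical local copy of CC-CEDICT
print(segment("我想学习中文", words))  # expected: ['我', '想', '学习', '中文']
```

Greedy longest match is the simplest option; it can mis-split ambiguous sequences, which is where statistical tools like jieba (mentioned below) do better.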


1over137 commented Apr 18, 2022

I originally wanted to implement such a feature, but I couldn't quite afford the time and maintenance burden needed to implement it. At one point I even implemented a Japanese parser, but Yomichan does a much better job, and it added too much in the way of dependencies. I didn't know of a dictionary-based way of splitting words before.
However, for Chinese this can be pretty useful.
If you are willing to contribute code to make this happen, feel free to do so! I would be glad to accept a patch/PR for this.
Here are a few points to keep in mind:

  • What kind of dependencies would this introduce? Is it a static binary file plus some code to parse it? cxfreeze, which is used to create binary exes for Mac and Windows, is somewhat tricky to deal with when it comes to static files, but this should be solvable. However, I think it would still be preferable for the dictionary to be imported by the user.
  • The splitting function should be called in the preprocess_clipboard function in https://github.com/FreeLanguageTools/vocabsieve/blob/master/vocabsieve/dictionary.py (see the sketch after this list). If the dictionary loading and splitting are simple enough, they can be implemented as another function in that file; otherwise, you can create a new file for them.
  • If possible, I think the dictionary should simply be incorporated into the existing dictionary infrastructure (the database), and then the database can be queried for the splitting step.
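To make the second and third points concrete, here is a rough sketch of how the hook could look if the splitter queried the existing dictionary database. Everything here (the table and column names, lookup_exists, the exact preprocess_clipboard signature) is an assumption for illustration, not vocabsieve's actual API:

```python
import sqlite3

def make_lookup_exists(conn: sqlite3.Connection):
    """Build a headword-membership test backed by the dictionary database.
    The table/column names here are hypothetical."""
    def lookup_exists(word: str) -> bool:
        cur = conn.execute("SELECT 1 FROM dictionary WHERE word = ? LIMIT 1", (word,))
        return cur.fetchone() is not None
    return lookup_exists

def segment_with(text: str, lookup_exists, max_len: int = 8):
    """Greedy longest-match segmentation driven by database lookups."""
    i = 0
    while i < len(text):
        for length in range(min(max_len, len(text) - i), 0, -1):
            piece = text[i:i + length]
            if length == 1 or lookup_exists(piece):
                yield piece
                i += length
                break

def preprocess_clipboard(text: str, language: str, lookup_exists) -> str:
    # For Chinese, join dictionary words with spaces so a double-click selects
    # one word instead of the whole sentence; other languages pass through.
    if language == "zh":
        return " ".join(segment_with(text, lookup_exists))
    return text
```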

If you have any questions about the architecture of the program, feel free to ask!

@1over137

Also, I was actually considering parsing the sentences with something like jieba. It uses a more sophisticated algorithm to split the words and may work for words not covered by the dictionary (e.g. proper names).
In addition, I can also implement support for CEDICT as a dictionary format.
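For reference, jieba's basic API is very small, so the dependency cost is mostly the package itself and its bundled dictionary (a minimal sketch, assuming jieba is installed from PyPI):

```python
import jieba  # ships with its own dictionary and an HMM model for unseen words

sentence = "我来到北京清华大学"
print(jieba.lcut(sentence))                # accurate mode: ['我', '来到', '北京', '清华大学']
print(jieba.lcut(sentence, cut_all=True))  # full mode: every dictionary match, overlapping
```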


Ceynou commented Apr 21, 2022

I barely know a thing about programming, let alone coding, but you're talking about using spaCy, right? It supports 64 languages, so I guess it would work for all the other languages vocabsieve supports.

@1over137

> I barely know a thing about programming, let alone coding, but couldn't you use spaCy instead of jieba, it supports 64 languages

I am using simplemma for lemmatization, which is essentially a big lookup table. There is no current need for spaCy in this project: it's a fairly big and complicated framework for many tasks (natural language understanding, tagging, classification, etc.) and requires dynamically downloading resources.
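For comparison, simplemma's API surface is essentially a single lookup call with no model downloads, which is why it stays a light dependency (a minimal sketch; the outputs shown are the expected lemmas):

```python
import simplemma  # pure-Python, dictionary-based lemmatizer

print(simplemma.lemmatize("masks", lang="en"))  # expected: 'mask'
print(simplemma.lemmatize("Hunde", lang="de"))  # expected: 'Hund'
```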

@GrimPixel

In fact, it's not only Chinese: Japanese and Korean have this problem too. In Vietnamese, spaces separate syllables; in Thai and Lao, spaces separate sentences.
There is a list of such tools:
https://polyglotclub.com/wiki/Language/Multiple-languages/Culture/Text-Processing-Tools#Word_Segmentation

@1over137

@GrimPixel @BenMueller
So, are either of you actually going to implement this?

@GrimPixel

I just knew about these word-segmentation tools and saw that you needed them; I have no experience with them myself.

@1over137 1over137 added enhancement New feature or request good first issue Good for newcomers labels Apr 29, 2022
@1over137 1over137 assigned 1over137 and unassigned 1over137 May 2, 2022
@1over137 1over137 removed the good first issue Good for newcomers label Jan 9, 2024
@parthshahp

Is this something that still needs work? Has there been any progress in the last few years? I'm happy to take a look at it if it's needed.
