[Feature]: Split CJK names #2624
Comments
Hello there @ZnqbuZ, Hope you're doing well! Getting your debug log is a breeze and will save us both time. Trust me, it's way quicker than discussing why it's important. 😃 How to Share Your Debug Log:
Once you hit that submit button, you'll get a special red debug ID. Just share that with By sharing your debug log, you're giving We totally get that your time is valuable, and we appreciate your effort in helping Thanks a bunch! |
A debug log is not "not applicable" here. A debug log per point 1 gives me the entry we're discussing here -- I cannot enter Chinese names myself. |
Sorry. I've sent a log with ID YeGr1kqXOgnV-6U3RYALN |
A log with more examples ZAVVH2PE-apse/6.7.112-6 was sent. |
Thank you. |
Yes, for all modern names and almost all ancient names. I got the list from Wikipedia, and I'm fairly sure it contains every surname used in the last 150 years. In fact, only 81 of them are still in use today. However, Chinese history is so long (~5000 years) that I doubt a complete list exists. I guess you could add a filter and let users choose whether to use it, and store the name list in the configuration so users can modify it. After some investigation, I found that 3-character surnames seem to have existed 2000-3000 years ago. I don't think it's likely that their bearers are the authors of any document... |
Jieba puts spaces between the characters which makes each character a "word" for the citekey formatter. |
I think jieba's handling of names is expected. By the way, I observed some strange behaviour in the capitalization of Chinese titles, which is probably jieba's problem. Are you still using js-jieba? It seems to be outdated. What prevents you from using the C library? Have you considered using WASM? |
Still using js-jieba, indeed
Using C code in Zotero extensions is not trivial. It's not work I'm keen to pick up.
I've looked into it briefly but I'd only consider it if there was a clean javascript wrapper for an already-compiled wasm binary. I don't want to get into a whole new programming language for this. |
The wrappers that do exist either assume node as an environment, where they use node-specific libraries like |
Can you see whether https://www.npmjs.com/package/jieba-wasm offers different cutting modes for |
Also what the different cut functions and their parameters mean? |
Is there also a full list of single-character Chinese family names? |
this is incorrect. |
Can you export the items from |
I'm creating a testing environment.
Yes, there is, but do we really need it? Modern Chinese family names are either 1 or 2 characters long, so we just need a function that makes sure the content of the author field really is a Chinese name; a simple character range check would be enough. I can write it if needed.
It's hard for a segmentation library to deal with names. By analogy, it might cut "WallaceGoodman" into "Wall Ace Good Man".
I'm sorry. I'm creating a testing library. Soon it will be uploaded. |
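The range check suggested above could look roughly like the sketch below. The name `isHanName` is hypothetical, and the sketch assumes that covering only the CJK Unified Ideographs block (U+4E00 to U+9FFF) and Extension A (U+3400 to U+4DBF) is good enough; rarer characters in the other extension blocks would be rejected.

```javascript
// Hypothetical helper, not part of any existing library.
// Matches strings made up only of characters from:
//   U+3400-U+4DBF  CJK Unified Ideographs Extension A
//   U+4E00-U+9FFF  CJK Unified Ideographs
const HAN_ONLY = /^[\u3400-\u4DBF\u4E00-\u9FFF]+$/;

function isHanName(name) {
  // Surrounding whitespace is ignored; inner spaces or Latin letters fail the test.
  return HAN_ONLY.test(name.trim());
}

console.log(isHanName('东方不败'));        // true
console.log(isHanName('Wallace Goodman')); // false
console.log(isHanName('杨 Lianting'));     // false
```

A check like this only says that the field contains Han characters; it cannot tell a Chinese name from a Japanese name written in kanji.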
No need, that's already in my current tests.
Thanks. |
Is it possible to compile jieba-rs to stay within that spec (until Zotero 7 goes GA)? |
It should be |
Honestly, I'm not quite sure about how to do it now... Maybe I can manage it after several days of research. |
Yes, and those words after the last colons are correctly capitalized. |
How do I build the package? |
Sorry, what do you mean? |
But that's how they come out of |
Have you used |
I've cloned |
My build command is |
Besides, in |
I believe it must have to do with |
CompileError: at offset 679499: bad type |
Does work on Zotero 7. |
OK... no idea what this means, so maybe it's not related to |
What is the full contents of the |
Debug log ID
NA
What happened?
I believe many people would like to keep a Chinese name as a whole:
instead of splitting it:
Still, we want the first & last names to be correctly capitalized, e.g. "杨莲亭" has pinyin "yang lian ting", and should be capitalized like "YangLianting".
Usually, hacks like
auth.substring(1,1).clean.capitalize + auth.substring(2).clean.capitalize
do the trick, since most Chinese surnames are a single character. However, some surnames consist of 2 characters. For example, “东方不败” has pinyin "dong fang bu bai", and should be capitalized as "DongfangBubai" rather than "DongFangbubai", which is what the formula above gives.
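To make the desired capitalization concrete, here is a minimal sketch. It assumes the name has already been split into surname and given name and transliterated into pinyin syllables; `capitalizeWord` and `formatName` are hypothetical names used only for illustration.

```javascript
// Hypothetical helpers for illustration only.
// Joins pinyin syllables into one word and capitalizes its first letter.
function capitalizeWord(syllables) {
  const word = syllables.join('').toLowerCase();
  return word.charAt(0).toUpperCase() + word.slice(1);
}

// surname and given are arrays of pinyin syllables.
function formatName(surname, given) {
  return capitalizeWord(surname) + capitalizeWord(given);
}

console.log(formatName(['yang'], ['lian', 'ting']));      // YangLianting
console.log(formatName(['dong', 'fang'], ['bu', 'bai'])); // DongfangBubai
```

The point is that capitalization follows the surname/given-name boundary, not the character boundary, so the split has to be known before capitalizing.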
I tried using jieba, but it seems to think of a name as one word. Please correct me if that's not the case.
So, to achieve this, I wrote a simple JavaScript snippet to split Chinese names. It extracts the first 2 characters of a name, and looks them up in a dictionary, to see if the chars constitute a surname. Currently, it supports Simplified & Traditional Chinese. Korean / Japanese could also be supported once someone gives me a list of Korean / Japanese compound surnames.
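The lookup described above might be sketched as follows. This is not the snippet from the issue: the surname sets hold only a handful of well-known compound surnames as placeholders for the full dictionary, and the `'zh-hant'` key for Traditional Chinese is an assumption.

```javascript
// Illustrative subset only; a real dictionary would be much longer.
const COMPOUND_SURNAMES = {
  'zh-hans': new Set(['东方', '欧阳', '司马', '诸葛', '上官']),
  'zh-hant': new Set(['東方', '歐陽', '司馬', '諸葛', '上官']), // key name is an assumption
};

function splitName(name, locale) {
  const chars = Array.from(name); // split by code point, not UTF-16 unit
  const surnames = COMPOUND_SURNAMES[locale] || new Set();
  const head = chars.slice(0, 2).join('');
  // Treat the first 2 characters as the surname only if they are listed
  // and at least one character is left over for the given name.
  const cut = chars.length > 2 && surnames.has(head) ? 2 : 1;
  return [chars.slice(0, cut).join(''), chars.slice(cut).join('')];
}

console.log(splitName('东方不败', 'zh-hans')); // [ '东方', '不败' ]
console.log(splitName('杨莲亭', 'zh-hans'));   // [ '杨', '莲亭' ]
```

A 2-character name that happens to match a compound surname is ambiguous; this sketch treats it as a 1-character surname plus a 1-character given name.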
I hope you could consider adding this function.
splitName('东方不败', 'zh-hans')
will give ['东方', '不败'], which should eventually be capitalized to ['Dongfang', 'Bubai'].