last name + first name + optional patronymic, Russian name order #85

Open
evrial opened this issue Apr 24, 2019 · 8 comments · May be fixed by #154

Comments

evrial commented Apr 24, 2019

I know this could be very tricky to implement correctly, but this name order is very common in Slavic names; see, for example, https://www.kmu.gov.ua/en/team

```
In [64]:
HumanName('Ivanov Ivan Ivanovich')
Out[64]:
<HumanName : [
	title: ''
	first: 'Ivanov'
	middle: 'Ivan'
	last: 'Ivanovich'
	suffix: ''
	nickname: ''
]>

In [65]:
HumanName('Ivanov Ivan')
Out[65]:
<HumanName : [
	title: ''
	first: 'Ivanov'
	middle: ''
	last: 'Ivan'
	suffix: ''
	nickname: ''
]>

In [66]:
HumanName('Ivan Ivanov')
Out[66]:
<HumanName : [
	title: ''
	first: 'Ivan'
	middle: ''
	last: 'Ivanov'
	suffix: ''
	nickname: ''
]>
```

Owner

derek73 commented May 3, 2019

Can you clarify what the correct output should be for those names?

Author

evrial commented May 3, 2019

first: Ivan
middle: Ivanovich
last: Ivanov

Russian is very flexible here: the parts of a name can be written in almost any order (only the middle name is never written first), so without a name dataset it is going to be difficult to distinguish between names like Sergey Sergeev and Sergeev Sergey.
The middle name could be detected by its suffix: -ovna, -evna, -ovich, -evich.
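
A minimal sketch of that suffix check, using only the four suffixes listed above. The function name `looks_like_patronymic` is made up for illustration and is not part of nameparser:

```python
# Suffixes taken from this comment; the list is illustrative, not exhaustive.
PATRONYMIC_SUFFIXES = ("ovna", "evna", "ovich", "evich")


def looks_like_patronymic(part: str) -> bool:
    """Return True if the name part ends with one of the listed suffixes."""
    # str.endswith accepts a tuple and matches any of its members.
    return part.lower().endswith(PATRONYMIC_SUFFIXES)


print(looks_like_patronymic("Ivanovich"))  # True
print(looks_like_patronymic("Petrovna"))   # True
print(looks_like_patronymic("Ivanov"))     # False
```

A suffix match alone is only a hint: a surname can end the same way, so this check cannot separate a patronymic from a last name by itself.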

Owner

derek73 commented May 3, 2019

How do Russian speakers know which is first vs last name? Is there an order that would be most common if a person wanted to try to get a computer to understand which was their first vs last name? Or maybe when romanized to Latin alphabets?

We could look for suffixes, but if middle is always in the middle then the parser will probably get that right already. In your first example, is "Ivan" the middle name?

Author

evrial commented May 4, 2019

No, the middle name (patronymic) always ends with one of the suffixes -ovna, -evna, -ovich, -evich, -ich, -ichna, -inichna.

There is no consensus on the order of full formal names. Russian and Ukrainian speakers memorize first names in their full formal form, like Maria, and in short forms, like Masha, as well as last names like Petrova or Petrenko; the rules of Russian morphology may help to detect popular last names.
Last names can also mimic first names, as in Victor Pavlik, but that's a rare case.
A last name may also end with such a suffix and mimic a middle name, as in Roman Arkadyevich Abramovich.
Every male first name can be turned into a middle or last name form by applying the rules of morphology.

I can say for sure that a first name never ends with such a suffix and a middle name always does. The middle name is never written before the first name or after the last name, but it can take the place of the last name, as in Ivan Ivanovich. We can safely assume that popular last names end with -ov/-ev/-ova/-eva/-enko, but be careful with first names like Lev and Eva. Each part of the name can also be used on its own, which causes the most trouble.

Ivan Ivanovich - first, middle
Ivanovich Ivan - last, first (tricky, but resolvable using the suffix hint, knowing that Ivan is a real first name, and the rules of middle name ordering)
Ivanovich Ivanov - incorrect form; again, you just memorize that no such name exists
Roman Abramovich - first, last (just memorize him; otherwise it's impossible, but the mistake is tolerable for humans and machines)
Ivan Ivanov - first, last
Ivanov Ivan - last, first
Ivan Ivanovich Ivanov - first, middle, last
Ivanovich - middle, most likely
Yanukovich - last; just memorize that no such male name exists
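
The rules and examples above could be sketched roughly like this. It is only an illustration under loud assumptions: the two tiny sets stand in for the "memorized" names, and `classify`, `is_patronymic`, `KNOWN_FIRST_NAMES` and `KNOWN_SURNAMES` are hypothetical names, not nameparser APIs:

```python
# Suffixes as listed in this comment.
PATRONYMIC_SUFFIXES = ("ovna", "evna", "ovich", "evich", "ich", "ichna", "inichna")
# Hypothetical stand-ins for the names a speaker has memorized.
KNOWN_FIRST_NAMES = {"ivan", "roman", "sergey", "maria", "lev", "eva"}
KNOWN_SURNAMES = {"abramovich", "yanukovich"}


def is_patronymic(part: str) -> bool:
    """Suffix match, unless the part is a memorized surname."""
    key = part.lower()
    return key not in KNOWN_SURNAMES and key.endswith(PATRONYMIC_SUFFIXES)


def classify(full_name: str) -> dict:
    parts = full_name.split()
    result = {"first": "", "middle": "", "last": ""}
    # Prioritize the first name: take the first part found in the name list.
    first_idx = next(
        (i for i, p in enumerate(parts) if p.lower() in KNOWN_FIRST_NAMES), None
    )
    if first_idx is None:
        # No first name found: a lone patronymic is most likely a middle name.
        if len(parts) == 1 and is_patronymic(parts[0]):
            result["middle"] = parts[0]
        else:
            result["last"] = " ".join(parts)
        return result
    result["first"] = parts[first_idx]
    before, after = parts[:first_idx], parts[first_idx + 1:]
    # A middle name is only accepted directly after the first name.
    if after and is_patronymic(after[0]):
        result["middle"] = after[0]
        after = after[1:]
    result["last"] = " ".join(before + after)
    return result


for name in ("Ivanovich Ivan", "Roman Abramovich", "Ivanov Ivan Ivanovich"):
    print(name, "->", classify(name))
```

With these sets, "Ivanovich Ivan" comes out as last + first, and "Roman Abramovich" as first + last only because Abramovich is in the memorized surname set.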

https://en.wikipedia.org/wiki/Russian_given_name#Full_(formal)_and_short_forms
Patronymic

Owner

derek73 commented May 4, 2019

Thanks for all that info. The Wikipedia page was really interesting, I had no idea Russian names had all that history.

It does sound like looking for suffixes could help inform correct classification of name parts, and those suffixes don't seem to clash with names from other languages, so I imagine it could work without knowing beforehand whether it's parsing an English or Russian name. That's good.

Currently this parser is entirely deterministic, rule based, so there's no way to adjust probabilities. It's not a machine based classifier. It fails sometimes. For example, in English "Dean" is both a first name and occasionally a title, but the parser can only be configured to see it as one or the other. You can just choose which is more likely with your current dataset and it will be wrong for any name that uses "Dean" in the other way. Any first name that could also be a title is going to be a problem. Luckily in English there are not so many of them. The strategy of this parser is to try to choose the most common as a default and provide configuration to change it.

It seems like a machine based classifier would be a great way to handle Russian names, because that would more closely mimic how Russian speakers themselves have to do it: compare against a list of known names and suffixes and compute a probability for the most likely classification.

The Wikipedia article mentions that historically there were a limited number of Russian names, like 5000. A limited set of names is nice for a deterministic parser. That's a somewhat large list, but maybe not too many to include in a dictionary in the library. But it doesn't sound like those names are still used so it might be better training set data for a machine based classifier.

As an example, a deterministic thing we could do to names like "Ivanovich Ivan" is say if the name part ends in "ich" then it will always be a last name. Would that make it right more often? You probably know better than I. I can see though that it would still fail in the current parser because it would then assume "Ivan" is a suffix, because that's what comes after last names in English. But the parser can handle "Last, First Middle Middle", so we could potentially say if the first name part ends in "ich" then treat it as if it has a comma after it (e.g. "Last, First"). I'm not sure if that would parse the Russian names better or not though.
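
To make that last idea concrete, here is a rough sketch of the rewrite step as a plain string function. `add_comma_after_ich` is a hypothetical helper, not something the parser has today; its output is meant to be fed to the existing "Last, First" handling:

```python
def add_comma_after_ich(full_name: str) -> str:
    """If the first name part ends in "ich", rewrite to "Last, First ..."."""
    parts = full_name.split()
    if len(parts) > 1 and "," not in full_name and parts[0].lower().endswith("ich"):
        return parts[0] + ", " + " ".join(parts[1:])
    # Anything else is returned unchanged.
    return full_name


print(add_comma_after_ich("Ivanovich Ivan"))  # Ivanovich, Ivan
print(add_comma_after_ich("Ivan Ivanovich"))  # Ivan Ivanovich
print(add_comma_after_ich("Rich Smith"))      # Rich, Smith
```

The third call shows the cost of a bare "ich" test: an English first name like "Rich" would be rewritten too, so the rule would probably need a longer suffix or a name list before it could be enabled by default.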

Author

evrial commented May 5, 2019

Many last names and middle names end with the 'ich' suffix, so that's problematic, and mistakes are unavoidable even for humans, at least in written form. I'd say the parser could be 95-99% correct deterministically with a dataset of ~2000 real names and a patronymic dataset of around 1000 or even fewer, without the complexity of probabilities. From male names we can derive the middle or last name forms, but the first name reading should be prioritized.

Owner

derek73 commented May 10, 2019

Just to clarify for me, does patronymic always indicate last name?

The list of names strategy could be interesting, assuming we could somehow come up with that list. But now I'm wondering if the parsing algorithm for these Russian names would use any of the existing algorithm, or if it would be an entirely separate parsing algorithm.

Omitting the complicated parts, the main strategy of the parser algorithm is to take the first name it gets and stick it in the first name slot, then stick all the other names in the middle slot until it gets to the final name which goes in the last name slot. Would that still be a useful strategy for the remainder of the name parts when the parser encounters a name in this list of names you propose?
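
Stripped down to just that description, the strategy looks roughly like the sketch below. This is not the actual nameparser code (titles, suffixes, nicknames and commas are all ignored), and `naive_parse` is a made-up name:

```python
def naive_parse(full_name: str) -> dict:
    """First part -> first, final part -> last, everything between -> middle."""
    parts = full_name.split()
    if not parts:
        return {"first": "", "middle": "", "last": ""}
    return {
        "first": parts[0],
        # parts[1:-1] is empty for one- and two-part names.
        "middle": " ".join(parts[1:-1]),
        "last": parts[-1] if len(parts) > 1 else "",
    }


print(naive_parse("Ivanov Ivan Ivanovich"))
# {'first': 'Ivanov', 'middle': 'Ivan', 'last': 'Ivanovich'}
```

That reproduces the output reported at the top of this issue for 'Ivanov Ivan Ivanovich', which is why a "Last First Patronymic" name ends up with every part in the wrong slot.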

For example, if we had a list of patronymic names (and assuming those indicate last names), seems we might test the first name part (that is not a title, i.e. does not appear in the predefined set of titles) to test if it is in the patronymic set. If it is, then we put that name part in the last name bucket and parse the rest of the name as first name followed by a remainder of middle names. So expecting a format like "[Title] Lastovich First Middle [Middle] [Middle] [Suffix]" if "Lastovich" is in the patronymic set. Does that sound like it would be correct? If so, that's pretty much exactly how title parsing currently works so seems doable and reasonable to think that it would play nicely with everything.

Reading this over again, I think I need clarification on how we would know that the name order is supposed to be "Last First Patronymic". I guess that means Patronymic doesn't mean last name. So it's like a middle name but could let us assume a "Last First" name order? Would we test all the names, or do they only appear at the end?

derek73 changed the title from "last name + first name + optional patronymic" to "last name + first name + optional patronymic, Russian name order" on May 10, 2019

Author

evrial commented May 10, 2019

A patronymic is the father's name plus a suffix that depends on the person's sex: in my case it's Alexandrovich, and for a woman it would be Alexandrovna. It always follows the first name, is omitted, or is used on its own. The situation is similar with Icelandic and Azerbaijani names. So your strategy is correct. I think we only need a list of Slavic names; I don't know any other language that writes names in reverse order.
I guess only a TDD approach would work here :)
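
As a starting point for such tests, here is a small sketch of the "patronymic always follows the first name" rule used to detect the "Last First Patronymic" order without any name list. The helper names are hypothetical, and the suffix tuple is the one quoted earlier in this thread:

```python
PATRONYMIC_SUFFIXES = ("ovna", "evna", "ovich", "evich", "ich", "ichna", "inichna")


def looks_like_patronymic(part: str) -> bool:
    return part.lower().endswith(PATRONYMIC_SUFFIXES)


def to_first_middle_last(full_name: str) -> str:
    """Reorder "Last First Patronymic" to "First Patronymic Last"."""
    parts = full_name.split()
    # A trailing patronymic preceded by a part without the suffix implies that
    # the preceding part is the first name, so the leading part is the last name.
    if (
        len(parts) == 3
        and looks_like_patronymic(parts[2])
        and not looks_like_patronymic(parts[1])
    ):
        last, first, patronymic = parts
        return f"{first} {patronymic} {last}"
    return full_name


# Test-style checks, in the spirit of the TDD remark above.
assert to_first_middle_last("Ivanov Ivan Ivanovich") == "Ivan Ivanovich Ivanov"
assert to_first_middle_last("Ivan Ivanovich Ivanov") == "Ivan Ivanovich Ivanov"
assert to_first_middle_last("Roman Arkadyevich Abramovich") == "Roman Arkadyevich Abramovich"
```

The reordered string is already in the "First Middle Last" order the parser expects. Two-part names such as "Ivanov Ivan" are not covered by this rule and would still need the name list.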
