
MemoryError for a name with a lot of prefixes #108

Open
Ronserruya opened this issue Mar 22, 2020 · 1 comment
Comments

@Ronserruya

I don't really think this is a "bug"; it's more of an extreme edge case.

While using the library I had to parse millions of names, and I encountered this user input:

"<first_name> van der van der van der van der van der van der van der van der van der van der van der van der van der van der van der van der van der van der van der van der van der van der van der van der van der van der van der van der van der van der van der van der van der van der van der <last_name>"

This name quickly caused a MemoryError on a PC with 60+ GB of RAM. More specifically, this list: https://github.com/derek73/python-nameparser/blob/master/nameparser/parser.py#L799 grows exponentially in size very quickly.

Again, I'm not expecting you to fix this, since it is obviously a user-input error (which I worked around by setting a maximum size for the string), but I thought you might be interested to know about this edge case.
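For reference, a minimal sketch of that kind of pre-parse length cap (the threshold and helper name are invented here for illustration; they are not part of nameparser):

```python
# Hypothetical input guard, not part of nameparser: reject pathologically
# long name strings before parsing, as described in the workaround above.
MAX_NAME_LENGTH = 300  # illustrative limit; tune it for your data

def check_name_length(raw_name):
    """Raise a clear error instead of letting a degenerate input exhaust memory."""
    if len(raw_name) > MAX_NAME_LENGTH:
        raise ValueError(
            "name string is %d chars, over the %d-char limit"
            % (len(raw_name), MAX_NAME_LENGTH)
        )
    return raw_name
```

Failing fast with a ValueError here is much cheaper than discovering the problem as a MemoryError deep inside the parser.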

@derek73 derek73 added the bug label Mar 22, 2020
@derek73
Owner

derek73 commented Mar 22, 2020

Thanks for the bug report. I wondered if this would ever be an issue when I wrote it that way.

When the parser encounters a new combination of titles joined with a conjunction, it saves the complete string as a new title in the module's shared config (by default) and takes another pass. So each pass would result in a title with one additional conjunction or title added to the end. That somewhat explains the exponential nature, but it might also depend on how you're using the parser. I wonder if you would have the same problem with something like this:

parser = HumanName()
parser.full_name = name1
parsed_name1 = str(parser)

parser.full_name = name2
parsed_name2 = str(parser)

This should ensure that the module-level config is shared across all the instances. I guess I'm not clear on why that list would grow large enough to throw a memory error. In my understanding, it should just be storing fewer than 50 different versions of that very long title.
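A toy model (pure Python, not the library's actual code) of the accumulation described above, in which each pass saves the joined title with one more conjunction appended:

```python
# Toy model of the pass-by-pass title accumulation described above.
# This is NOT nameparser's code; it only mimics "each pass saves one
# new, longer joined title in the shared config".

def simulate_passes(prefix_repeats):
    saved_titles = set()        # stands in for the module-level config
    title = "van der"
    for _ in range(prefix_repeats - 1):
        title += " van der"     # one more prefix joined on each pass
        saved_titles.add(title)
    return saved_titles

titles = simulate_passes(30)    # 30 repeats, as in the reported input
print(len(titles))              # 29: one saved variant per pass
```

Under this simplified model the config would hold fewer than 50 entries, consistent with the expectation above, so the observed exponential growth presumably comes from somewhere else in the real re-parsing loop.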

Anyway, it would be nice if the library didn't throw ambiguous memory errors, so maybe we can give it a better exception. Here is where new titles with conjunctions are saved to the module-level config:

https://github.com/derek73/python-nameparser/blob/master/nameparser/parser.py#L721

We could test for some maximum around there, and have a default that can be overridden with the config object. I'm not sure exactly where it would need to go, though. I wonder if the problem is in that group_contiguous_integers(conj_index) call?
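A sketch of what that maximum might look like (the CONJ_TITLE_MAX knob, the exception class, and the helper are all invented here for illustration; the real check would live near parser.py#L721 and read its limit from the config object):

```python
# Hypothetical sketch of the proposed cap; not actual nameparser code.
# CONJ_TITLE_MAX is an invented knob, imagined as overridable via the
# config object rather than hard-coded.
CONJ_TITLE_MAX = 50  # illustrative default

class TooManyConjunctionTitles(Exception):
    """A more informative failure than a raw MemoryError."""

def save_joined_title(titles, new_title, limit=CONJ_TITLE_MAX):
    """Add a conjunction-joined title to the shared set, up to a maximum."""
    if len(titles) >= limit:
        raise TooManyConjunctionTitles(
            "refusing to save more than %d conjunction-joined titles; "
            "the input may contain a pathological run of prefixes" % limit
        )
    titles.add(new_title)
```

A warning via the warnings module instead of an exception would be the gentler variant mentioned below; either way the parse stops growing the config without bounds.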

If you are able to poke around or have any ideas, let me know. I haven't fired up my dev environment yet to try that name string, but when I do I'll try to find somewhere to put in a maximum, and then maybe throw a more informative exception, or maybe a warning.
