
Add infer_lang function (Issue number #3) #79

Draft
wants to merge 6 commits into master
Conversation

tmankita

No description provided.

@tmankita tmankita changed the title Add infer_lang function Add infer_lang function #3 Jul 14, 2020
@tmankita tmankita changed the title Add infer_lang function #3 Add infer_lang function (Issue number #3) Jul 14, 2020
Update detect_language function
Add padding_tuple function
@tmankita
Author

@jbesomi, why did Travis CI fail? What did I do wrong? On my local machine, all the tests succeed.

@jbesomi
Owner

jbesomi commented Jul 14, 2020

Thank you, this is a good start! 👍

If you click "details" and then select a job, you will find the log: example

The problem is that you use some dependencies that are not installed. You will need to update setup.cfg.
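
For reference, a minimal sketch of the kind of entry to add in setup.cfg (assuming the usual setuptools [options] section; the exact dependency list is whatever the PR actually imports):

[options]
install_requires =
    langdetect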

Review:

  1. If you use "private" helper functions, you should prefix them with a leading underscore, e.g. _example()
  2. By default, we probably want to return just a single language for each cell. Then we might have an argument that, when set to True, makes the function return a list of tuples (this is the improvement)
  3. What does this code do?
l = list(doc._.language["result"].items())
curr_size = len(l)
t = foldl(operator.add, (), l)
if max_list_size < curr_size:
   padding_list(infer_languages, curr_size)
   max_list_size = curr_size
elif curr_size < max_list_size:
   padding_tuple(t, max_list_size)

Why can't we simply do infer_languages.append(l)? (See the sketch after this list.) Also, if possible we should avoid for loops; as in the rest of the code (with the exception of the pipe for loop, which is multi-threaded and fast), we almost never use for/while loops since they are generally slow.

  4. We should test the function more extensively, for instance by passing a Pandas Series formed of X different languages
  5. The docstring should be more precise, stating exactly which languages can be detected
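
For reference, a minimal sketch of the "just append" alternative (the doc objects and the doc._.language["result"] extension are taken from the PR code quoted above; everything else is illustrative):

# Collect each document's language results as-is, without padding every
# row to the same tuple length.
infer_languages = []
for doc in docs:  # docs: the output of nlp.pipe(...)
    # doc._.language["result"] is assumed (as in the code above) to map
    # ISO language codes to scores.
    l = list(doc._.language["result"].items())
    infer_languages.append(l)  # each cell keeps its own list of (lang, score) tuples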

…ssary code -Update infer_lang documentation
@tmankita
Author

@jbesomi Thank you! Very helpful comment and review. I tried as much as I could to stick to your instructions in the review.
If there is anything else I need to adjust, I will be glad to know about it.

@jbesomi
Owner

jbesomi commented Jul 14, 2020

Hi Tomer,

thank you for your improvements! You are clearly going in the right direction; we still need to solve some issues, but we will do it together 👍

Review:

  1. Great that you mention the ISO code!
  2. Testing: we will need a test for each single ISO code (i.e. each language)...
  3. If you look carefully at spacy-langdetect, you will discover that it just calls langdetect ... we can therefore avoid spacy-langdetect and use langdetect directly (see the sketch after this list). Have a look at the LanguageDetector code; it's quite simple.
  4. There are two ways we can write infer_lang; we will need to test which one is faster on a large dataset:
    • Keep using spaCy (by adding a custom element to the pipeline). I'm not sure this makes much sense as we are not really making use of spaCy ... the only advantage I see is that the pipe is multi-threaded. But we might just want to write our own multi-threading code that computes the language for each cell of the Pandas Series.
    • Use Pandas apply. This might be slower than a multi-threading approach, but I'm not sure about that.
  5. _infer_lang_ret_list / _infer_lang: these two functions are about the same. What about keeping the for loop in the main function and having an if/else condition when setting the pipe? (only relevant if you keep spaCy, which is probably unnecessary)
  6. The default spaCy pipe might have other tools we would like to disable, doesn't it? (only relevant if you keep spaCy, which is probably unnecessary)
  7. ret_list: what about calling it probability? When set to True, it returns a list of tuples as you did; when set to False, returning the ISO code should be enough (no need to return a tuple)
  8. _Language_to_dict: return a tuple, not a dict
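
A minimal sketch of what points 3, 4 and 7 could look like together, calling langdetect directly through Pandas apply (the infer_lang name matches the PR; the exact signature and error handling here are assumptions):

import pandas as pd
from langdetect import detect, detect_langs
from langdetect.lang_detect_exception import LangDetectException

def infer_lang(s: pd.Series, probability: bool = False) -> pd.Series:
    # Sketch only: skip spacy-langdetect and call langdetect directly.
    def _detect(text):
        try:
            if probability:
                # list of (ISO 639-1 code, probability) tuples, e.g. [("en", 0.99)]
                return [(r.lang, r.prob) for r in detect_langs(text)]
            # single ISO 639-1 code, e.g. "en"
            return detect(text)
        except LangDetectException:
            # e.g. empty strings or strings with no detectable features
            return None

    return s.apply(_detect)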

@jbesomi jbesomi marked this pull request as draft July 15, 2020 08:23
… ISO code -Change name ret_list to probability - Change name _Language_to_dict to _Language_to_tuple
@tmankita
Author

tmankita commented Jul 17, 2020

Hi @jbesomi ,
I ran the test on 21,000 sentences.
The results are:

Using Pandas apply, infer_lang without probability: 1 min 58.46 sec
Using spaCy, infer_lang without probability: 3 min 6.33 sec
Using Pandas apply, infer_lang with probability: 2 min 0.98 sec
Using spaCy, infer_lang with probability: 3 min 7.99 sec

You are right, using Pandas apply definitely saves more time!

Thanks for the detailed comment, I read it very carefully and tried to apply everything you wrote.
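
For reference, a minimal sketch of how such numbers could be produced (infer_lang_apply and infer_lang_spacy are placeholder names for the two variants in this PR; the sample data is a placeholder for the real test set):

import time
import pandas as pd

sentences = ["This is an English sentence.", "Dies ist ein deutscher Satz."]  # placeholder sample
s = pd.Series(sentences)  # in the real test: the 21,000-sentence dataset

for name, fn in [("Pandas apply", infer_lang_apply), ("spaCy pipe", infer_lang_spacy)]:
    start = time.perf_counter()
    fn(s)
    elapsed = time.perf_counter() - start
    print(f"{name}: {elapsed // 60:.0f} min {elapsed % 60:.2f} sec")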

@jbesomi
Owner

jbesomi commented Jul 17, 2020

Hi Tomer,

wow, congrats; you did such a great job! that's super cool, I'm impressed 🎉 👍

For the stats, which dataset did you use? I'm impressed by how long it takes ...
The reason it takes that long is probably that langdetect is non-deterministic (it uses a model to find the language), as you said.
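
(For reference: langdetect's non-determinism can be removed by fixing its seed, though that only stabilizes the output and does not make it faster; a minimal sketch:)

from langdetect import DetectorFactory
DetectorFactory.seed = 0  # pin langdetect's internal RNG so repeated calls give identical results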

Review:

  1. Everything looks super!
  2. The only problem is that langdetect is probably too slow; 21k rows is quite a small dataset ... (but maybe the documents were super long?)
  3. One way to speed up the process is to cut the length of each cell (see the sketch after this list). We might want to test this (I can do it) and see if it speeds up the procedure. Can you please attach to this issue the notebook you used to do the comparison? Another approach would be to use a simpler deterministic algorithm (at the cost of a higher error rate).
  4. Some of your functions still use the Google docstring style (:param) instead of the NumPy docstring style. Also, why do some head titles start with "gured out"?
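
A minimal sketch of the cell-truncation idea from point 3 (the 500-character limit is an arbitrary assumption, just to illustrate the approach):

import pandas as pd

MAX_CHARS = 500  # assumption: enough text for reliable detection in most cells

def _truncate(text: str) -> str:
    # Bound the per-cell cost by cutting each document before detection.
    return text[:MAX_CHARS]

# s: the Pandas Series of documents; infer_lang: the function from this PR
# detected = infer_lang(s.apply(_truncate))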

Regards,

@tmankita
Author

tmankita commented Jul 17, 2020

@jbesomi
I used the same dataset I presented in the infer_lang issue
(link to the relevant comment: #3 (comment);
at the end of that comment there is a link to download the dataset).

Regarding points 1, 2, and 3 from the infer_lang issue:
| library | average time | accuracy score |
| --- | --- | --- |
| langdetect | 0.00606 sec | 86.308% |
The average time refers to processing one sentence (without considering the sentence length).
So in the average case we end up with (0.00606 × 21000) / 60 = 2.121 minutes. Therefore, the time it took to process all 21k sentences is pretty good in my opinion. Maybe by cutting the sentences we can improve it; take a look at the dataset and see whether doing so is relevant (maybe the sentences are too short). If that suggestion fails, I would recommend using the Cld3 library, with an average time of 0.00026 sec and an accuracy score of 84.973%.
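
If we go that route, a minimal sketch of the Cld3 option (assuming the pycld3 package, which exposes cld3.get_language; the wiring into texthero is hypothetical):

import cld3  # pycld3: a small, deterministic neural language detector

def _infer_lang_cld3(text: str):
    # Return the most likely ISO 639-1 code, or None when cld3 is not confident.
    prediction = cld3.get_language(text)
    if prediction is None or not prediction.is_reliable:
        return None
    return prediction.language

# usage sketch: s.apply(_infer_lang_cld3) on the Pandas Series of documents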

Regarding point 4, I will remove the docstrings from the helper functions; the "guard out" note refers to the try/except expression in the function.

I am waiting for you to examine your suggestion. Once you have some conclusions, share them with me and I will start working on the solution.

Thanks for your comments, I learned a lot from them.

@jbesomi
Owner

jbesomi commented Jul 17, 2020

🎉

Can you please attach here the Jupyter Notebook you used to test the apply vs spaCy implementations? I will start from there and test different approaches to see if we can reduce the execution time a bit.

@tmankita
Author

Sure!
The link:

ApplyVsPipe.ipynb.zip

@jbesomi
Owner

jbesomi commented Jul 18, 2020

Thanks, that's cool! :)

@tmankita
Author

Hi @jbesomi ,
Long time no hear; just wondering if I can help.

Thanks

@jbesomi
Owner

jbesomi commented Jul 27, 2020

Hey @tmankita, I'm very sorry, I haven't yet compared the different versions. I will get back to you as soon as possible. In the meantime, you can have a look at the other open issues 👍

@henrifroese henrifroese added the enhancement New feature or request label Aug 3, 2020