
Add initial Chinese support for hero.lang.zh.preprocessing #128

Draft
wants to merge 14 commits into master

Conversation

@ryangawei commented Jul 29, 2020

This is the first step toward Chinese support. All the docstrings are the same as the original.

  • Create a hero.lang.hero_zh module
  • Implement preprocessing.py for Chinese, removing some inapplicable functions from the original API
  • Add test cases and docstring tests in Chinese

See #68.

@henrifroese (Collaborator) left a comment


General feedback:

To me it seems like the best way to implement this is the following:

We have some config where we can set the language. If a user does hero.set_language("Chinese"), the library automatically starts calling the functions that are in /texthero/lang/chinese/... (or something like that) if they are implemented there; if they're not there, it just calls the standard texthero function.

This could probably be implemented by some other config file in texthero/lang/chinese/... that lists all functions that are overwritten for Chinese. So all in all, what I think should happen (but I'm of course open to discussion) is:

User calls texthero function xy -> texthero checks what language is set. If set to e.g. Chinese, texthero goes to texthero/lang/Chinese etc. and looks at a config file there. If xy is in the config file there, texthero calls the function from the Chinese module. If it's not, texthero calls the standard version.

This way, we don't add unnecessary overhead. In particular, developers working on standard functions don't have to worry about implementing everything in the language subfolders, and developers working on the language-specific stuff can focus on that.

It seems to me like your way introduces some unnecessary overhead down the line.
Interested in what you think, maybe I misunderstand what you're trying to do!
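
To make the idea concrete, here's a minimal sketch of such a dispatch mechanism (the _LANGUAGE config variable and the module layout are hypothetical, purely for illustration):

import importlib

_LANGUAGE = "default"  # hypothetical module-level config

def set_language(lang: str) -> None:
    global _LANGUAGE
    _LANGUAGE = lang

def _get_function(name: str):
    # Look for a language-specific override first, then fall back
    # to the standard texthero implementation.
    if _LANGUAGE != "default":
        lang_module = importlib.import_module(f"texthero.lang.{_LANGUAGE}.preprocessing")
        if hasattr(lang_module, name):
            return getattr(lang_module, name)
    from texthero import preprocessing
    return getattr(preprocessing, name)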


"""

@wrapt.decorator
Collaborator


Is the wrapt module really necessary here? Maybe have a look at the decorator implemented a few lines above that uses just the built-in functools from the standard library. Then the new dependency isn't needed.
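
For illustration, a functools-only version of such a forwarding decorator could look like this (a sketch; root_caller is the name used further down in this PR):

import functools

def root_caller(module):
    # Forward the decorated function's call to the function of the
    # same name in `module`, preserving the wrapper's metadata.
    def decorator(func):
        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            return getattr(module, func.__name__)(*args, **kwargs)
        return wrapper
    return decorator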

Author


I saw that wrapt is used in your PR #69 too. The module looks clean, so I just wanted to try it; it's not really necessary and I'm fine with removing it. 😃

return hero.preprocessing.clean(s, pipeline)


@root_caller(hero.preprocessing)
Collaborator


Is there no way to only "overwrite" those functions that really are different for Chinese, and not mention the others at all? I can see this getting really tedious when we introduce more and more languages: if we now want to add a new function, we have to add a function that does nothing (just pass) in every language module.

Collaborator


(see my general comment above)

@ryangawei (Author) commented Jul 30, 2020

Hi @henrifroese, you've made a good point.

My starting point is that a lot of the standard functions are not applicable to Chinese, for example remove_digits, remove_*_brackets (which should be done only after tokenization), stem, and remove_diacritics. Since the user chose to deal with Chinese, exposing these APIs may be confusing or misleading (refer to #84). So I basically put everything that supports Chinese into the hero_zh module. But, as you said, that also introduces a lot of redundant code and makes it hard to maintain as more languages join.

Your suggestion is great, and I was thinking the same. However, what can we do about the problem above? Should we hide these inapplicable APIs from the user?

@jbesomi (Owner) commented Jul 30, 2020

Dear @AlfredWGA and @henrifroese,

Thank you for the PR and the respective review; good job! 👍

Regarding the order of operations in the preprocessing phase, it's probably more natural to first tokenize the text and then apply all the different functions on a tokenized text.

The only drawback of this solution is that it might be slower for certain operations. For instance, remove_punctuation on a string with regex is faster than removing punctuation by iterating over a list of tokens. Nonetheless, the second method can be highly parallelized and we might overcome the problem.
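
For illustration, the two approaches side by side (a minimal sketch, not texthero's actual implementation):

import re
import string
import pandas as pd

s = pd.Series(["Hello, world!"])

# Regex on the raw string: one pass per string, fast in a single thread.
s_no_punct = s.str.replace(rf"[{re.escape(string.punctuation)}]", " ", regex=True)

# Filtering a token list: slower per element, but trivially parallelizable.
tokens = pd.Series([["Hello", ",", "world", "!"]])
tokens_no_punct = tokens.apply(lambda ts: [t for t in ts if t not in string.punctuation])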

Regarding the fact that some of the functions are not relevant for a specific language (e.g. stem), I don't really know what the best solution is. I'm open to any ideas. Many other programmers have probably had the same issue before us; we might want to look around and see how they solved it. Maybe there is even a design pattern for this?

Regarding the other comment(s): I agree with Henri.

P.S. @AlfredWGA, should I call you Alfred or Guoao?

@henrifroese (Collaborator)

My starting point is that a lot of the standard functions are not applicable to Chinese, for example remove_digits, remove_*_brackets (which should be done only after tokenization), stem, and remove_diacritics. Since the user chose to deal with Chinese, exposing these APIs may be confusing or misleading (refer to #84). So I basically put everything that supports Chinese into the hero_zh module. But, as you said, that also introduces a lot of redundant code and makes it hard to maintain as more languages join.

What do you think about this: When the language is set to e.g. Chinese and a user calls a function texthero.f, texthero goes to texthero/lang/Chinese/supported_functions.py. In that file, there are two lists (or rather dicts):

  1. functions overwritten / changed / added for Chinese -> if f is here, call the "Chinese version"
  2. functions from texthero that don't make sense for Chinese -> if f is here, throw an Exception (or a warning).

This also allows us to gradually add support. At the beginning, many functions might still be in the second list; after implementing solutions for Chinese, they are moved to the first list.
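
A rough sketch of what such a supported_functions.py lookup could contain (all names here are hypothetical):

# Hypothetical texthero/lang/chinese/supported_functions.py

# Functions overwritten / changed / added for Chinese.
OVERWRITTEN = {"tokenize", "clean"}

# Standard functions that don't make sense for Chinese (yet).
NOT_SUPPORTED = {"stem", "remove_diacritics"}

def resolve(name):
    # Return the Chinese version, raise for unsupported functions,
    # and fall back to the standard texthero implementation otherwise.
    if name in OVERWRITTEN:
        from texthero.lang.chinese import preprocessing as zh
        return getattr(zh, name)
    if name in NOT_SUPPORTED:
        raise NotImplementedError(f"'{name}' is not applicable to Chinese")
    from texthero import preprocessing
    return getattr(preprocessing, name)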

However, I also agree with this; maybe there's a well-proven path:

Many other programmers have probably had the same issue before us; we might want to look around and see how they solved it. Maybe there is even a design pattern for this?

One other thing: maybe convert this to a draft PR. On the right side, under "Reviewers", it should say "Still in progress? Convert to draft"; click that. This way, everyone can see this is a work in progress and not ready to be merged yet.

@ryangawei marked this pull request as draft July 30, 2020 15:30
@ryangawei (Author) commented Jul 30, 2020

@jbesomi, thanks for the suggestion, I'll look for solutions in other NLP libraries. Also, you can call me Guoao. 😃

functions overwritten / changed / added for Chinese -> if f is here, call the "Chinese version"
functions from texthero that don't make sense for Chinese -> if f is here, throw an Exception (or a warning).

@henrifroese thanks for your idea. But consider this situation: without documentation or docstrings, when a user tries to call a function, they won't know whether it supports Chinese until the function returns (a correct result or an Exception), which is kind of annoying, isn't it?

How about I do this under texthero/lang/zh/preprocessing.py:

# Original functions that support Chinese
from texthero.preprocessing import fillna, drop_no_content, remove_whitespace, ...


# Functions overwritten/added for Chinese
def tokenize(s: pd.Series) -> pd.Series:
    ...

This way we still maintain a list of supported/not-supported functions, and also prevent users from calling unexpected functions and getting Exceptions. What do you think?

@henrifroese (Collaborator)

How about I do this under texthero/lang/zh/preprocessing.py

That is extremely simple and does the job! Makes sense!

This way we still maintain a list of supported/not-supported functions, and also prevent users from calling unexpected functions and getting Exceptions. What do you think?

I only fail to see how we still maintain a list of not-supported functions. The way you suggest it, if someone calls a texthero function that isn't implemented for Chinese yet and doesn't work for Chinese, Python will just raise an error saying it doesn't know the function. Or would you keep a separate list of functions that need to be changed for Chinese but aren't implemented yet?

@jbesomi (Owner) commented Jul 30, 2020

Note that this PR introduces these major changes:

  • replace_tags / replace_hashtags:
    The regex pattern changed from r"@[a-zA-Z0-9]+" to r"@[a-zA-Z0-9\u4e00-\u9fa5]+"
    -> We might just add these changes to the "default" functions?

  • tokenizer
    The function makes use of the spaCy Chinese tokenizer.
    We probably want to have spaCy as the default tokenizer. The question is then: do we need separate tokenize functions, or might a universal one be enough? (See the sketch after this list.)
    See also tokenize with spaCy #131

  • clean
    Calls only fillna, remove_whitespace, and tokenize
    -> A solution is required (see above)
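
Regarding the universal-tokenizer question above, a blank spaCy pipeline per language might be enough (a sketch, assuming spaCy with Chinese support is installed):

import pandas as pd
import spacy

def tokenize(s: pd.Series, lang: str = "en") -> pd.Series:
    # spacy.blank("zh") loads spaCy's Chinese tokenizer;
    # the same call works for other supported language codes.
    nlp = spacy.blank(lang)
    return s.apply(lambda text: [token.text for token in nlp(text)])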

Review:

  1. We probably do not need an extra test_indexes.py file; rather, adapt the existing one to test functions from any lang
    --> we might just have a list of "valid" function names and then iterate over dir(texthero.lang) for lang in {'default', 'en'}, or something similar?

@ryangawei (Author) commented Jul 31, 2020

@henrifroese I think every standard function that isn't included in the language module hero.lang.* can be taken as not supported, and will raise an AttributeError when called via texthero.lang.zh.preprocessing.*
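
For example (illustrative only, assuming the module layout proposed in this PR):

import pandas as pd
from texthero.lang.zh import preprocessing as zh

s = pd.Series(["今天天气真好"])

zh.tokenize(s)  # overwritten for Chinese
zh.fillna(s)    # re-exported standard function
zh.stem(s)      # raises AttributeError: module has no attribute 'stem'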

@henrifroese (Collaborator)

@henrifroese I think every standard function that isn't included in the language module hero.lang.* can be taken as not supported, and will raise an AttributeError when called via texthero.lang.zh.preprocessing.*

Makes sense, I think that's good enough for users 👌

@ryangawei (Author)

We probably do not need an extra test_indexes.py file; rather, adapt the existing one to test functions from any lang
--> we might just have a list of "valid" function names and then iterate over dir(texthero.lang) for lang in {'default', 'en'}, or something similar?

@jbesomi Maybe put these functions in __all__ under each language package and call them with getattr():

import importlib
import texthero.lang

test_cases_preprocessing = []
for module_str in dir(texthero.lang):
    if not module_str.startswith("__"):
        # Import each language package and collect the functions it lists in __all__.
        module = importlib.import_module("texthero.lang." + module_str, package="texthero.lang")
        test_cases_preprocessing.extend([getattr(module, func) for func in module.__all__])

@henrifroese added the "enhancement" (New feature or request) label Aug 3, 2020
@ryangawei changed the title from "Add initial Chinese support as hero.lang.hero_zh" to "Add initial Chinese support as hero.lang.zh" Aug 5, 2020
@jbesomi (Owner) commented Aug 5, 2020

Exactly! What if we do that in a separate PR? Would you like to do it?

@jbesomi (Owner) commented Aug 5, 2020

Thank you!! :)

@ryangawei (Author) commented Aug 14, 2020

Should we avoid calling variables and functions across modules?

For example, in https://github.com/jbesomi/texthero/blob/master/texthero/preprocessing.py#L295, texthero.stopwords is used by remove_stopwords. Suppose there's a package for French, texthero.lang.fr: the same remove_stopwords logic applies properly, except that stopwords should point to texthero.lang.fr.stopwords instead of texthero.stopwords. The only way is to copy and paste the same code into texthero.lang.fr.remove_stopwords and add from texthero.lang.fr import stopwords as _stopwords in texthero.lang.fr.preprocessing, which is quite redundant.

I think in most cases it is unnecessary for modules to call each other. Regulating this might make supporting other languages easier; what do you all think?

@jbesomi (Owner) commented Aug 14, 2020

For example, in ... which is quite redundant.

Hey Guoao, I agree with you that this might be redundant.

As a general comment: since for now we are not offering multilingual support, any idea that provides flexibility, cleanness, or code robustness is very welcome.

I just don't quite understand your advice and your idea of how to solve this.

@ryangawei changed the title from "Add initial Chinese support as hero.lang.zh" to "Add initial Chinese support for hero.lang.zh.preprocessing" Aug 15, 2020
@ryangawei (Author) commented Aug 15, 2020

I've just added the custom Series types to align with the standard preprocessing.py.

I am thinking of simply not calling functions or variables from other modules unless that module is language-agnostic. For example, preprocessing and stopwords are obviously language-specific, but visualization, _helper, and _types may not be. import _helper in preprocessing is okay, but import preprocessing in visualization may lead to the problem I mentioned above.

Currently I've only found a few such cases in the code, like stopwords used in remove_stopwords and tokenize used in functions of representation. Removing these calls would not seriously affect the original functions; we can tell the user in the docstring to do these steps outside the function.
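
To illustrate removing such a cross-module call, remove_stopwords could take the stopword set as an explicit argument (a hypothetical sketch):

import pandas as pd

def remove_stopwords(s: pd.Series, stopwords: set) -> pd.Series:
    # The caller supplies the stopword set, so this function never has
    # to import a language-specific stopwords module itself.
    return s.apply(
        lambda text: " ".join(w for w in text.split() if w not in stopwords)
    )

# Each language package would pass its own set, e.g.:
# from texthero.lang.fr import stopwords as fr_stopwords  # hypothetical module
# remove_stopwords(s, fr_stopwords.DEFAULT)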

By the way, the Chinese word segmentation package jieba is optional for users but necessary for testing. Should I add it to the dev dependencies?

@jbesomi (Owner) commented Aug 20, 2020

By the way, the Chinese word segmentation package jieba is optional for users but necessary for testing. Should I add it to the dev dependencies?

Yes. And we will have to make it clear to users that if they want to use jieba they need to install it separately. That's similar to what we are doing in #146 with Flair.
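
A common way to make that explicit is an import guard with a helpful error message (a sketch; the exact wording is up for discussion):

def _require_jieba():
    try:
        import jieba
    except ImportError:
        raise ImportError(
            "Chinese tokenization requires the optional dependency jieba. "
            "Install it with `pip install jieba`."
        )
    return jieba

def tokenize(s):
    jieba = _require_jieba()
    # jieba.cut returns a generator of segmented tokens.
    return s.apply(lambda text: list(jieba.cut(text)))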

For the rest: I agree!

@ryangawei (Author)

For the rest: I agree!

Should I create another PR? That issue should be solved before this PR can continue.

@ryangawei (Author)

@henrifroese Do you think we should proceed with that?

@henrifroese (Collaborator)

@AlfredWGA I'm not sure what you mean. It's difficult for us not to import functions from other modules at the moment (e.g. I'm not sure how we would avoid using the tokenize function in representation right now). Or maybe I am misunderstanding you?

@ryangawei (Author) commented Sep 3, 2020

It's difficult for us not to import functions from other modules at the moment (e.g. I'm not sure how we would avoid using the tokenize function in representation right now).

Hi @henrifroese. Some modules need to deal with specific languages and therefore shouldn't be imported directly within other modules (because the user chooses what language they want to deal with). Currently preprocessing and stopwords seem to be such modules. Other modules are definitely okay to import.

For representation, if all functions receive a TokenSeries as input, there's no need to use not_tokenized_warning and preprocessing; the tokenization step is already done by the users themselves.

@henrifroese (Collaborator) commented Sep 3, 2020

Okay I get it now, thanks 👌.

So presumably the solution would be:

  • tokenize: as you said above, we can switch to using the decorator @InputSeries(TokenSeries) to force a TokenSeries input and can then leave out the tokenization (see the sketch below)

  • stopwords and other cases: I don't know, interested in how you solve this 💡

I think a separate PR for this definitely makes sense.
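
For the tokenize point, the change in representation might look like this (a sketch, assuming the InputSeries / TokenSeries types discussed here live in texthero._types):

import pandas as pd
from texthero._types import InputSeries, TokenSeries

@InputSeries(TokenSeries)
def term_frequency(s: pd.Series):
    # The decorator raises if `s` is not already tokenized, so the
    # function no longer needs to warn and tokenize internally.
    ...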

@ryangawei (Author)

Why does the error The command "black --check ." exited with 1. occur in CI? I ran this check locally and nothing was wrong.

@jbesomi (Owner) commented Sep 7, 2020

See #171

@ryangawei (Author)

Just fixed the problem, thank you @jbesomi!
