
Add initial Chinese support for hero.lang.zh.preprocessing #128

Draft
wants to merge 14 commits into master

Conversation

@ryangawei commented Jul 29, 2020

This is the first step toward Chinese support. All the docstrings are the same as the original.

  • Create a hero.lang.hero_zh module
  • Implement preprocessing.py for Chinese, removing some inapplicable functions from the original API
  • Add test cases and docstring tests in Chinese

See #68.

@henrifroese (Collaborator) left a comment


General feedback:

To me it seems like the best way to implement this is the following:

We have some config where we can set the language. If a user does hero.set_language("Chinese"), the library automatically starts calling the functions that are in /texthero/lang/chinese/... (or something like that) if they are implemented there; if they're not there, it just calls the standard texthero function.

This could probably be implemented by some other config file in texthero/lang/chinese/... that lists all functions that are overwritten for Chinese. So all in all, what I think should happen (but I'm of course open to discussion) is:

User calls texthero function xy -> texthero checks what language is set. If set to e.g. Chinese, texthero goes to texthero/lang/Chinese etc. and looks at a config file there. If xy is in the config file there, texthero calls the function from the Chinese module. If it's not, texthero calls the standard version.

This way, we don't add unnecessary overhead. In particular, developers working on standard functions don't have to worry about implementing everything in the language subfolders, and developers working on the language-specific stuff can focus on that.

It seems to me like your way introduces some unnecessary overhead down the line.
Interested in what you think, maybe I misunderstand what you're trying to do!
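
To make the idea concrete, here's a minimal sketch of such a dispatch mechanism (the _LANGUAGE config variable and the module layout are hypothetical, purely for illustration):

import importlib

_LANGUAGE = "default"  # hypothetical module-level config

def set_language(lang: str) -> None:
    global _LANGUAGE
    _LANGUAGE = lang

def _get_function(name: str):
    # Look for a language-specific override first, then fall back
    # to the standard texthero implementation.
    if _LANGUAGE != "default":
        lang_module = importlib.import_module(f"texthero.lang.{_LANGUAGE}.preprocessing")
        if hasattr(lang_module, name):
            return getattr(lang_module, name)
    from texthero import preprocessing
    return getattr(preprocessing, name)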


"""

@wrapt.decorator
Collaborator


Is the wrapt module really necessary here? Maybe have a look at the decorator implemented a few lines above that uses just the built-in functools from the standard library. Then the new dependency isn't needed.
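
For illustration, a functools-only version of such a forwarding decorator could look like this (a sketch; root_caller is the name used further down in this PR):

import functools

def root_caller(module):
    # Forward the decorated function's call to the function of the
    # same name in `module`, preserving the wrapper's metadata.
    def decorator(func):
        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            return getattr(module, func.__name__)(*args, **kwargs)
        return wrapper
    return decorator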

Author


I saw that wrapt is used in your PR #69 too. The module looks clean, so I just wanted to try it; it's not really necessary and I'm fine with removing it. 😃

return hero.preprocessing.clean(s, pipeline)


@root_caller(hero.preprocessing)
Collaborator


Is there no way to only "overwrite" those functions that really are different for Chinese, and not mention the others at all? I can see this getting really tedious when we introduce more and more languages: if we now want to add a new function, we have to add a function that does nothing (just pass) in every language module.

Collaborator


(see my general comment above)

@ryangawei (Author) commented Jul 30, 2020

Hi @henrifroese, you've made a good point.

My starting point is that a lot of the standard functions are not applicable to Chinese, for example remove_digits, remove_*_brackets (which should be done only after tokenization), stem, and remove_diacritics. Since the user chose to deal with Chinese, exposing these APIs may be confusing or misleading (refer to #84). So I basically put everything that supports Chinese into the hero_zh module. But, as you said, that also introduces a lot of redundant code and makes it hard to maintain as more languages join.

Your suggestion is great, and I was thinking the same. However, what can we do about the problem above? Should we hide these inapplicable APIs from the user?

@jbesomi (Owner) commented Jul 30, 2020

Dear @AlfredWGA and @henrifroese,

Thank you for the PR and the respective review; good job! 👍

Regarding the order of operations in the preprocessing phase, it's probably more natural to first tokenize the text and then apply all the different functions on a tokenized text.

The only drawback of this solution is that it might be slower for certain operations. For instance, remove_punctuation on a string with regex is faster than removing punctuation by iterating over a list of tokens. Nonetheless, the second method can be highly parallelized and we might overcome the problem.
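
For illustration, the two approaches side by side (a minimal sketch, not texthero's actual implementation):

import re
import string
import pandas as pd

s = pd.Series(["Hello, world!"])

# Regex on the raw string: one pass per string, fast in a single thread.
s_no_punct = s.str.replace(rf"[{re.escape(string.punctuation)}]", " ", regex=True)

# Filtering a token list: slower per element, but trivially parallelizable.
tokens = pd.Series([["Hello", ",", "world", "!"]])
tokens_no_punct = tokens.apply(lambda ts: [t for t in ts if t not in string.punctuation])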

Regarding the fact that some of the functions are not relevant for a specific language (e.g. stem), I don't really know what the best solution is. I'm open to any ideas. Many other programmers have probably had the same issue before us; we might want to look around and see how they solved it. Maybe there is even a design pattern for this?

Regarding the other comment(s): I agree with Henri.

P.S. @AlfredWGA, should I call you Alfred or Guoao?

@henrifroese (Collaborator)

My starting point is that a lot of the standard functions are not applicable to Chinese, for example remove_digits, remove_*_brackets (which should be done only after tokenization), stem, and remove_diacritics. Since the user chose to deal with Chinese, exposing these APIs may be confusing or misleading (refer to #84). So I basically put everything that supports Chinese into the hero_zh module. But, as you said, that also introduces a lot of redundant code and makes it hard to maintain as more languages join.

What do you think about this: When the language is set to e.g. Chinese and a user calls a function texthero.f, texthero goes to texthero/lang/Chinese/supported_functions.py. In that file, there are two lists (or rather dicts):

  1. functions overwritten / changed / added for Chinese -> if f is here, call the "Chinese version"
  2. functions from texthero that don't make sense for Chinese -> if f is here, throw an Exception (or a warning).

This also allows us to gradually add support. At the beginning, many functions might still be in the second list; after implementing solutions for Chinese, they are moved to the first list.
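
A rough sketch of what such a supported_functions.py lookup could contain (all names here are hypothetical):

# Hypothetical texthero/lang/chinese/supported_functions.py

# Functions overwritten / changed / added for Chinese.
OVERWRITTEN = {"tokenize", "clean"}

# Standard functions that don't make sense for Chinese (yet).
NOT_SUPPORTED = {"stem", "remove_diacritics"}

def resolve(name):
    # Return the Chinese version, raise for unsupported functions,
    # and fall back to the standard texthero implementation otherwise.
    if name in OVERWRITTEN:
        from texthero.lang.chinese import preprocessing as zh
        return getattr(zh, name)
    if name in NOT_SUPPORTED:
        raise NotImplementedError(f"'{name}' is not applicable to Chinese")
    from texthero import preprocessing
    return getattr(preprocessing, name)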

However, I also agree with this; maybe there's a well-proven path:

Many other programmers have probably had the same issue before us; we might want to look around and see how they solved it. Maybe there is even a design pattern for this?

One other thing: maybe convert this to a draft PR. On the right side, under "Reviewers", it should say "Still in progress? Convert to draft"; click that. This way, everyone can see this is a work in progress and not ready to be merged yet.

@ryangawei marked this pull request as draft July 30, 2020 15:30
@ryangawei (Author) commented Jul 30, 2020

@jbesomi, thanks for the suggestion, I'll look for solutions in other NLP libraries. Also, you can call me Guoao. 😃

functions overwritten / changed / added for Chinese -> if f is here, call the "Chinese version"
functions from texthero that don't make sense for Chinese -> if f is here, throw an Exception (or a warning).

@henrifroese thanks for your idea. But consider this situation: without documentation or docstrings, when a user tries to call a function, they won't know whether it supports Chinese until the function returns (a correct result or an Exception), which is kind of annoying, isn't it?

How about I do this under texthero/lang/zh/preprocessing.py:

# Original functions that support Chinese
from texthero.preprocessing import fillna, drop_no_content, remove_whitespace, ...


# Functions overwritten/added for Chinese
def tokenize(s: pd.Series) -> pd.Series:
    ...

This way we still maintain a list of supported/not-supported functions, and also prevent users from calling unexpected functions and getting Exceptions. What do you think?

@henrifroese (Collaborator)

How about I do this under texthero/lang/zh/preprocessing.py

That is extremely simple and does the job! Makes sense!

This way we still maintain a list of supported/not-supported functions, and also prevent users from calling unexpected functions and getting Exceptions. What do you think?

I only fail to see how we still maintain a list of not-supported functions. The way you suggest it, if someone calls a texthero function that isn't implemented for Chinese yet and doesn't work for Chinese, Python will just raise an error saying it doesn't know the function. Or would you keep a separate list of functions that need to be changed for Chinese but aren't implemented yet?

@jbesomi (Owner) commented Jul 30, 2020

Note that this PR introduces these major changes:

  • replace_tags / replace_hashtags:
    The regex pattern changed from r"@[a-zA-Z0-9]+" to r"@[a-zA-Z0-9\u4e00-\u9fa5]+"
    -> We might just add these changes to the "default" functions?

  • tokenizer
    The function makes use of the spaCy Chinese tokenizer.
    We probably want to have spaCy as the default tokenizer. The question is then: do we need separate tokenize functions, or might a universal one be enough? (See the sketch after this list.)
    See also tokenize with spaCy #131

  • clean
    Calls only fillna, remove_whitespace, and tokenize
    -> A solution is required (see above)
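
Regarding the universal-tokenizer question above, a blank spaCy pipeline per language might be enough (a sketch, assuming spaCy with Chinese support is installed):

import pandas as pd
import spacy

def tokenize(s: pd.Series, lang: str = "en") -> pd.Series:
    # spacy.blank("zh") loads spaCy's Chinese tokenizer;
    # the same call works for other supported language codes.
    nlp = spacy.blank(lang)
    return s.apply(lambda text: [token.text for token in nlp(text)])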

Review:

  1. We probably do not need an extra test_indexes.py file; rather, adapt the existing one to test functions from any lang
    --> we might just have a list of "valid" function names and then iterate over dir(texthero.lang) for lang in {'default', 'en'}, or something similar?

@ryangawei (Author) commented Jul 31, 2020

@henrifroese I think every standard function that isn't included in the language module hero.lang.* can be taken as not supported, and will raise an AttributeError when called via texthero.lang.zh.preprocessing.*
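
For example (illustrative only, assuming the module layout proposed in this PR):

import pandas as pd
from texthero.lang.zh import preprocessing as zh

s = pd.Series(["今天天气真好"])

zh.tokenize(s)  # overwritten for Chinese
zh.fillna(s)    # re-exported standard function
zh.stem(s)      # raises AttributeError: module has no attribute 'stem'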

@henrifroese (Collaborator)

@henrifroese I think every standard function that isn't included in the language module hero.lang.* can be taken as not supported, and will raise an AttributeError when called via texthero.lang.zh.preprocessing.*

Makes sense, I think that's good enough for users 👌

@ryangawei (Author)

We probably do not need an extra test_indexes.py file; rather, adapt the existing one to test functions from any lang
--> we might just have a list of "valid" function names and then iterate over dir(texthero.lang) for lang in {'default', 'en'}, or something similar?

@jbesomi Maybe put these functions in __all__ under each language package and call them with getattr():

import importlib
import texthero.lang

test_cases_preprocessing = []
for module_str in dir(texthero.lang):
    if not module_str.startswith("__"):
        # Import each language package and collect the functions it lists in __all__.
        module = importlib.import_module("texthero.lang." + module_str, package="texthero.lang")
        test_cases_preprocessing.extend([getattr(module, func) for func in module.__all__])

@henrifroese added the "enhancement" (New feature or request) label Aug 3, 2020
@ryangawei changed the title from "Add initial Chinese support as hero.lang.hero_zh" to "Add initial Chinese support as hero.lang.zh" Aug 5, 2020
@jbesomi (Owner) commented Aug 5, 2020

Exactly! What if we do that in a separate PR? Would you like to do it?

@jbesomi (Owner) commented Aug 5, 2020

Thank you!! :)

@ryangawei (Author) commented Aug 14, 2020

Should we avoid calling variables and functions across modules?

For example, in https://github.com/jbesomi/texthero/blob/master/texthero/preprocessing.py#L295, texthero.stopwords is used by remove_stopwords. Suppose there's a package for French, texthero.lang.fr: the same remove_stopwords logic applies properly, except that stopwords should point to texthero.lang.fr.stopwords instead of texthero.stopwords. The only way is to copy and paste the same code into texthero.lang.fr.remove_stopwords and add from texthero.lang.fr import stopwords as _stopwords in texthero.lang.fr.preprocessing, which is quite redundant.

I think in most cases it is unnecessary for modules to call each other. Regulating this might make supporting other languages easier; what do you all think?

@jbesomi (Owner) commented Aug 14, 2020

For example, in ... which is quite redundant.

Hey Guoao, I agree with you that this might be redundant.

As a general comment: since for now we are not offering multilingual support, any idea that provides flexibility, cleanness, or code robustness is very welcome.

I just don't quite understand your advice and your idea of how to solve this.

@ryangawei changed the title from "Add initial Chinese support as hero.lang.zh" to "Add initial Chinese support for hero.lang.zh.preprocessing" Aug 15, 2020
@ryangawei (Author) commented Aug 15, 2020

I've just added the custom Series types to align with the standard preprocessing.py.

I am thinking of simply not calling functions or variables from other modules unless that module is language-agnostic. For example, preprocessing and stopwords are obviously language-specific, but visualization, _helper, and _types may not be. import _helper in preprocessing is okay, but import preprocessing in visualization may lead to the problem I mentioned above.

Currently I've only found a few such cases in the code, like stopwords used in remove_stopwords and tokenize used in functions of representation. Removing these calls would not seriously affect the original functions; we can tell the user in the docstring to do these steps outside the function.
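
To illustrate removing such a cross-module call, remove_stopwords could take the stopword set as an explicit argument (a hypothetical sketch):

import pandas as pd

def remove_stopwords(s: pd.Series, stopwords: set) -> pd.Series:
    # The caller supplies the stopword set, so this function never has
    # to import a language-specific stopwords module itself.
    return s.apply(
        lambda text: " ".join(w for w in text.split() if w not in stopwords)
    )

# Each language package would pass its own set, e.g.:
# from texthero.lang.fr import stopwords as fr_stopwords  # hypothetical module
# remove_stopwords(s, fr_stopwords.DEFAULT)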

By the way, the Chinese word segmentation package jieba is optional for users but necessary for testing. Should I add it to the dev dependencies?

@jbesomi (Owner) commented Aug 20, 2020

By the way, the Chinese word segmentation package jieba is optional for users but necessary for testing. Should I add it to the dev dependencies?

Yes. And we will have to make it clear to users that if they want to use jieba they need to install it separately. That's similar to what we are doing in #146 with Flair.
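
A common way to make that explicit is an import guard with a helpful error message (a sketch; the exact wording is up for discussion):

def _require_jieba():
    try:
        import jieba
    except ImportError:
        raise ImportError(
            "Chinese tokenization requires the optional dependency jieba. "
            "Install it with `pip install jieba`."
        )
    return jieba

def tokenize(s):
    jieba = _require_jieba()
    # jieba.cut returns a generator of segmented tokens.
    return s.apply(lambda text: list(jieba.cut(text)))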

For the rest: I agree!

@ryangawei (Author)

For the rest: I agree!

Should I create another PR? That issue should be solved before this PR can continue.

@ryangawei (Author)

@henrifroese Do you think we should proceed with that?

@henrifroese (Collaborator)

@AlfredWGA I'm not sure what you mean. It's difficult for us not to import functions from other modules at the moment (e.g. I'm not sure how we would avoid using the tokenize function in representation right now). Or maybe I am misunderstanding you?

@ryangawei (Author) commented Sep 3, 2020

It's difficult for us not to import functions from other modules at the moment (e.g. I'm not sure how we would avoid using the tokenize function in representation right now).

Hi @henrifroese. Some modules need to deal with specific languages and therefore shouldn't be imported directly within other modules (because the user chooses what language they want to deal with). Currently preprocessing and stopwords seem to be such modules. Other modules are definitely okay to import.

For representation, if all functions receive a TokenSeries as input, there's no need to use not_tokenized_warning and preprocessing; the tokenization step is already done by the users themselves.

@henrifroese (Collaborator) commented Sep 3, 2020

Okay I get it now, thanks 👌.

So presumably the solution would be:

  • tokenize: as you said above, we can switch to using the decorator @InputSeries(TokenSeries) to force a TokenSeries input and can then leave out the tokenization (see the sketch below)

  • stopwords and other cases: I don't know, interested in how you solve this 💡

I think a separate PR for this definitely makes sense.
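
For the tokenize point, the change in representation might look like this (a sketch, assuming the InputSeries / TokenSeries types discussed here live in texthero._types):

import pandas as pd
from texthero._types import InputSeries, TokenSeries

@InputSeries(TokenSeries)
def term_frequency(s: pd.Series):
    # The decorator raises if `s` is not already tokenized, so the
    # function no longer needs to warn and tokenize internally.
    ...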

@ryangawei (Author)

Why does the error The command "black --check ." exited with 1. occur in CI? I ran this check locally and nothing was wrong.

@jbesomi (Owner) commented Sep 7, 2020

See #171

@ryangawei (Author)

Just fixed the problem, thank you @jbesomi!
