This repository has been archived by the owner on Jan 15, 2024. It is now read-only.

strip_accents should be None by default in WordPiece #1528

Open
sxjscience opened this issue Feb 22, 2021 · 3 comments
Labels
bug Something isn't working

Comments

@sxjscience
Member

Description

@leezu @szha @xinyual I noticed that we may need to change the default of strip_accents from False to None in

strip_accents: bool = False, lowercase: bool = False,
so that accent stripping is turned on automatically when lowercase is True.

The current default may impact the performance of models that rely on this tokenizer.
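To make the proposal concrete, here is a minimal sketch of how a None default could be resolved. The helper names `resolve_strip_accents` and `normalize` are hypothetical and are not taken from the GluonNLP code base; the accent stripping uses the standard library's `unicodedata` as a stand-in for whatever the tokenizer does internally.

```python
import unicodedata
from typing import Optional


def resolve_strip_accents(strip_accents: Optional[bool], lowercase: bool) -> bool:
    """Hypothetical helper: None means "follow lowercase"; an explicit bool wins."""
    return lowercase if strip_accents is None else strip_accents


def normalize(text: str, lowercase: bool = False,
              strip_accents: Optional[bool] = None) -> str:
    """Illustrative normalizer, not the actual WordPiece implementation."""
    if resolve_strip_accents(strip_accents, lowercase):
        # Decompose characters, then drop the combining marks (category Mn).
        decomposed = unicodedata.normalize("NFD", text)
        text = "".join(ch for ch in decomposed
                       if unicodedata.category(ch) != "Mn")
    if lowercase:
        text = text.lower()
    return text
```

With this sketch, `normalize("Möchte", lowercase=True)` strips the accent because strip_accents was left as None, while passing `strip_accents=False` explicitly keeps it even when lowercasing.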

@sxjscience sxjscience added the bug Something isn't working label Feb 22, 2021
@sxjscience
Member Author

However, accents carry meaning in many languages, e.g., mochte vs. möchte. Thus, we may want to turn accent stripping off in nlp_process.
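A small illustration of the concern, assuming accent stripping is done by Unicode decomposition and removal of combining marks (an assumption about the mechanism, not a quote of the tokenizer's code): the two distinct German words end up as the same string.

```python
import unicodedata


def strip_accents(text: str) -> str:
    # Decompose to NFD and drop combining marks (Unicode category Mn).
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(ch for ch in decomposed if unicodedata.category(ch) != "Mn")


words = ["mochte", "möchte"]
stripped = [strip_accents(w) for w in words]

# True when stripping maps distinct words onto the same string.
collide = len(set(stripped)) < len(set(words))
```

After stripping, both entries are "mochte", so a downstream model can no longer tell the two words apart.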

@leezu
Contributor

leezu commented Feb 22, 2021

Thus, we may try to turn it off in nlp_process.

Do you mean exposing an option in nlp_process or changing the defaults in nlp_process? As English is a special case that doesn't care much about accents, I suggest we keep the option to preserve accents in nlp_process.

@sxjscience
Member Author

sxjscience commented Feb 22, 2021 via email


2 participants