Tokenization for phonetic languages #1009

Closed
divyeshrajpura4114 opened this issue May 14, 2024 · 3 comments

divyeshrajpura4114 commented May 14, 2024

Hi,

Is there a way to define a set of sub-words that should never be split but are still considered during token generation? This is especially needed for phonetically rich languages like Hindi.

Example: मैं दिव्येश राजपुरा हूं (I am Divyesh Rajpura)
In the sentence above, sub-words such as मैं (me), दि (di), व्ये (vye), पु (pu), रा (ra), and हूं (hu) should never be split; each should be treated as a single unit when generating BPE tokens.
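
(Illustrative aside, not part of the original request: assuming the goal is to keep each akshara intact, one possible preprocessing direction is to segment the raw text into aksharas before tokenizer training, so that merges happen between these units rather than inside them. The regex below is a simplified, hypothetical Devanagari segmenter, not a SentencePiece feature, and it does not implement the full Unicode syllable rules.)

```python
# Simplified akshara (orthographic syllable) segmenter for Devanagari.
# This is an approximation: it chains consonant+virama pairs onto the base
# consonant and attaches dependent vowel signs and nasalization marks, but it
# does not cover every sign in the Devanagari block.
import re

AKSHARA = re.compile(
    r"(?:[\u0915-\u0939\u0958-\u095F]\u094D)*"    # consonant + virama chain (conjuncts)
    r"[\u0904-\u0914\u0915-\u0939\u0958-\u095F]"  # base: independent vowel or consonant
    r"[\u093C\u093E-\u094C\u0900-\u0903]*"        # nukta, vowel signs, candrabindu/anusvara/visarga
    r"|\S"                                        # fallback: any other non-space character
)

def aksharas(text: str) -> list[str]:
    return AKSHARA.findall(text)

print(aksharas("मैं दिव्येश राजपुरा हूं"))
# ['मैं', 'दि', 'व्ये', 'श', 'रा', 'ज', 'पु', 'रा', 'हूं']
```

The resulting units could then be handed to the tokenizer as pre-split pieces; whether that fits a given pipeline depends on how the downstream tokenizer consumes its input.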

Thanks & Regards,
Divyesh Rajpura

taku910 (Collaborator) commented May 15, 2024

In general, it is not possible to define a constraint that prevents a token from being split. For instance, we cannot merge all runs of numeric characters (0-9): with such a merge rule we would see an infinite number of tokens after training. Can this phonetic merge rule generate an infinite number of substring combinations?
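
(Illustrative aside: a toy sketch of why an open-ended "never split" rule is problematic. If every run of digits is protected as one unit, each new number in the corpus becomes a brand-new atomic token, so the inventory grows without bound; a closed set such as the aksharas of a single script would not have this problem.)

```python
# Toy sketch: protecting every run of digits as a single unit produces an
# open-ended set of atomic tokens -- one per distinct number in the data.
import re

corpus = [
    "order 12345 shipped on 2024-05-14",
    "invoice 9876543210 is overdue",
    "call 555-0199 for support",
]

protected = set()
for line in corpus:
    protected.update(re.findall(r"[0-9]+", line))

print(sorted(protected))
# ['0199', '05', '12345', '14', '2024', '555', '9876543210']
# The set keeps growing with the corpus, so the token inventory is unbounded.
```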

taku910 (Collaborator) commented May 30, 2024

Will close this issue on 5/31.

divyeshrajpura4114 (Author) commented

Sure. I have figured out another workaround and it seems to be working fine for now. Thanks!
