Tokenization for phonetic languages #1009

Closed
divyeshrajpura4114 opened this issue May 14, 2024 · 3 comments

divyeshrajpura4114 commented May 14, 2024

Hi,

Is there a way to define a set of sub-words that should never be split but are still considered during token generation? This is especially needed for phonetically rich languages like Hindi.

Example: मैं दिव्येश राजपुरा हूं (I am Divyesh Rajpura)
In the sentence above, sub-words such as मैं (me), दि (di), व्ये (vye), पु (pu), रा (ra), and हूं (hu) should never be split; each should be treated as a single unit when generating BPE tokens.
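
(Illustrative aside, not part of the original request: assuming the goal is to keep each akshara intact, one possible preprocessing direction is to segment the raw text into aksharas before tokenizer training, so that merges happen between these units rather than inside them. The regex below is a simplified, hypothetical Devanagari segmenter, not a SentencePiece feature, and it does not implement the full Unicode syllable rules.)

```python
# Simplified akshara (orthographic syllable) segmenter for Devanagari.
# This is an approximation: it chains consonant+virama pairs onto the base
# consonant and attaches dependent vowel signs and nasalization marks, but it
# does not cover every sign in the Devanagari block.
import re

AKSHARA = re.compile(
    r"(?:[\u0915-\u0939\u0958-\u095F]\u094D)*"    # consonant + virama chain (conjuncts)
    r"[\u0904-\u0914\u0915-\u0939\u0958-\u095F]"  # base: independent vowel or consonant
    r"[\u093C\u093E-\u094C\u0900-\u0903]*"        # nukta, vowel signs, candrabindu/anusvara/visarga
    r"|\S"                                        # fallback: any other non-space character
)

def aksharas(text: str) -> list[str]:
    return AKSHARA.findall(text)

print(aksharas("मैं दिव्येश राजपुरा हूं"))
# ['मैं', 'दि', 'व्ये', 'श', 'रा', 'ज', 'पु', 'रा', 'हूं']
```

The resulting units could then be handed to the tokenizer as pre-split pieces; whether that fits a given pipeline depends on how the downstream tokenizer consumes its input.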

Thanks & Regards,
Divyesh Rajpura

taku910 (Collaborator) commented May 15, 2024

In general, it is not possible to define a constraint that prevents a token from being split. For instance, we cannot merge all runs of numeric characters (0-9): with such a merge rule we would see an infinite number of tokens after training. Can this phonetic merge rule generate an infinite number of substring combinations?
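
(Illustrative aside: a toy sketch of why an open-ended "never split" rule is problematic. If every run of digits is protected as one unit, each new number in the corpus becomes a brand-new atomic token, so the inventory grows without bound; a closed set such as the aksharas of a single script would not have this problem.)

```python
# Toy sketch: protecting every run of digits as a single unit produces an
# open-ended set of atomic tokens -- one per distinct number in the data.
import re

corpus = [
    "order 12345 shipped on 2024-05-14",
    "invoice 9876543210 is overdue",
    "call 555-0199 for support",
]

protected = set()
for line in corpus:
    protected.update(re.findall(r"[0-9]+", line))

print(sorted(protected))
# ['0199', '05', '12345', '14', '2024', '555', '9876543210']
# The set keeps growing with the corpus, so the token inventory is unbounded.
```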

taku910 (Collaborator) commented May 30, 2024

Will close this issue on 5/31.

divyeshrajpura4114 (Author) commented

Sure. I have figured out another workaround and it seems to be working fine for now. Thanks!
