Is there any way to define a set of sub-words that are never split but are still considered during token generation? This is especially needed for phonetically rich languages like Hindi.
Ex: मैं दिव्येश राजपुरा हूं (I am Divyesh Rajpura)
In the above example, sub-words such as मैं (me), दि (di), व्ये (vye), पु (pu), रा (ra), and हूं (hu) should never be split, and each should be treated as a single unit when generating BPE tokens.
Thanks & Regards,
Divyesh Rajpura
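To make the request concrete, here is a minimal sketch of how the units in the example could be isolated before any BPE training. It assumes the intended units are Devanagari orthographic syllables (a consonant cluster joined by virama, plus an optional vowel sign and modifiers such as anusvara). The function name `split_aksharas` and the regular expression are illustrative assumptions, not part of any library's API, and the pattern covers only the common cases.

```python
import re

# Assumed building blocks of a Devanagari orthographic syllable.
CONS = "[\u0915-\u0939\u0958-\u095F]\u093C?"  # consonant, optional nukta
VIRAMA = "\u094D"                              # joins consonants into a cluster
MATRA = "[\u093E-\u094C]"                      # dependent vowel sign
MOD = "[\u0900-\u0903]"                        # candrabindu, anusvara, visarga
INDEP = "[\u0904-\u0914]"                      # independent vowel

AKSHARA = re.compile(
    f"(?:{CONS}(?:{VIRAMA}{CONS})*{VIRAMA}?{MATRA}?{MOD}*|{INDEP}{MOD}*)"
)


def split_aksharas(text):
    """Split whitespace-separated words into syllable-like units.

    Characters the pattern does not cover (Latin letters, digits,
    punctuation) are silently skipped in this sketch.
    """
    units = []
    for word in text.split():
        units.extend(AKSHARA.findall(word))
    return units


print(split_aksharas("मैं दिव्येश राजपुरा हूं"))
# ['मैं', 'दि', 'व्ये', 'श', 'रा', 'ज', 'पु', 'रा', 'हूं']
```

Each returned unit would then have to be treated as an atomic symbol by the tokenizer's trainer, which is exactly the constraint being asked about; the sketch only shows the segmentation step.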
In general, it is not possible to define a constraint that prevents a token from being split. For instance, we cannot force all numeric characters (0-9) to merge, because with such a merge rule we would see an infinite number of tokens after training. Can this phonetic merge rule also generate infinitely many combinations of substrings?