Transliterate #19

Open
christian-storm opened this issue Sep 29, 2019 · 4 comments
Labels: enhancement (New feature or request), question (Further information is requested)

Comments

@christian-storm

I was hoping you might advise me on how to incorporate transliteration into a text transformation pipeline.

Let's say I want to use a third-party library such as unidecode (from unidecode import unidecode).
I could create a new bistring with new_bistr = bistr(text.modified, unidecode(text.modified)),
but I would lose all the previous operations.

Is there a way to fold in a modified string that is calculated outside bistring's capabilities?

@tavianator
Collaborator

In general no. You could use something like bistr.infer(text, unidecode(text)) to have it guess.

In your case, you could do a little better since the transliteration process probably operates character-by-character. Something like

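from bistring import BistrBuilder, CharacterTokenizer
from unidecode import unidecode

tokenizer = CharacterTokenizer('und')  # or 'en-US', etc.
builder = BistrBuilder(text)
for token in tokenizer.tokenize(text):
    # Replace each character with its transliteration; the builder records the alignment
    builder.replace(token.end - token.start, unidecode(token.modified))
text = builder.build()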
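
To make it clearer what the per-character approach buys you, here is a rough, dependency-free sketch of the bookkeeping involved. It is not bistring's implementation: `TABLE`, `to_ascii`, `transliterate_with_alignment`, and `original_span` are hypothetical names, `to_ascii` is only a crude stand-in for unidecode, and the loop walks code points rather than the grapheme clusters a character tokenizer would produce.

```python
import unicodedata

# Hypothetical replacement table standing in for a real transliteration library.
TABLE = {"\u00df": "ss", "\u00e6": "ae"}

def to_ascii(ch):
    """Crude stand-in for unidecode(): table lookup, else strip combining marks."""
    if ch in TABLE:
        return TABLE[ch]
    decomposed = unicodedata.normalize("NFD", ch)
    return "".join(c for c in decomposed if not unicodedata.combining(c))

def transliterate_with_alignment(text, convert):
    """Convert each code point, recording (original, modified) offset pairs."""
    pieces = []
    alignment = [(0, 0)]
    o = m = 0
    for ch in text:
        repl = convert(ch)
        pieces.append(repl)
        o += 1
        m += len(repl)
        alignment.append((o, m))
    return "".join(pieces), alignment

def original_span(alignment, start, end):
    """Map a span of the modified string back to a span of the original."""
    lo = max(o for o, m in alignment if m <= start)
    hi = min(o for o, m in alignment if m >= end)
    return lo, hi

original = "Stra\u00dfe"
modified, alignment = transliterate_with_alignment(original, to_ascii)
lo, hi = original_span(alignment, 4, 6)  # the "ss" in the modified string
```

Because each replacement is recorded as it happens, the "ss" in the output maps back to exactly the one character it came from, which is something a whole-string guess cannot always recover.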

By the way, it's on my backlog to implement support for ICU's Transliterator API which is more powerful than unidecode and similar things.

@tavianator added the question label on Sep 30, 2019
@tavianator
Collaborator

So since ovalhub/pyicu#107 was implemented, I've tested out an implementation that wraps a bistr in a Replaceable for ICU. It works well for simple transliterations like Latin-ASCII, but for complicated ones like Greek-Latin ICU does some strange things that I'm not sure how to cope with nicely:

('Ὀδυσσεύς' ⇋ 'Ὀδυσσεύς')
('Ὀδυσσεύς' ⇋ 'Ὀδυσσεύς')
('Ὀδυσσεύς' ⇋ 'Ὀδυσσεύς\uffff')
('Ὀδυσσεύς' ⇋ 'Ὀδυσσεύς\uffffO')
('Ὀδυσσεύς' ⇋ 'OὈδυσσεύς\uffffO')
('Ὀδυσσεύς' ⇋ 'OὈδυσσεύς')
('Ὀδυσσεύς' ⇋ 'O̓δυσσεύς')
('Ὀδυσσεύς' ⇋ 'O̓δυσσεύςO')
('Ὀδυσσεύς' ⇋ 'O̓δυσσεύςO')
('Ὀδυσσεύς' ⇋ 'O̓δυσσεύς')
('Ὀδυσσεύς' ⇋ 'Oδυσσεύς')
('Ὀδυσσεύς' ⇋ 'OδυσσεύςO')
('Ὀδυσσεύς' ⇋ 'OδυσσεύςOd')
('Ὀδυσσεύς' ⇋ 'OdδυσσεύςOd')
('Ὀδυσσεύς' ⇋ 'Odδυσσεύς')
('Ὀδυσσεύς' ⇋ 'Odυσσεύς')
('Ὀδυσσεύς' ⇋ 'Odυσσεύςd')
('Ὀδυσσεύς' ⇋ 'Odυσσεύςdy')
('Ὀδυσσεύς' ⇋ 'Odyυσσεύςdy')
('Ὀδυσσεύς' ⇋ 'Odyυσσεύς')
('Ὀδυσσεύς' ⇋ 'Odyσσεύς')
('Ὀδυσσεύς' ⇋ 'Odyσσεύςy')
('Ὀδυσσεύς' ⇋ 'Odyσσεύςys')
('Ὀδυσσεύς' ⇋ 'Odysσσεύςys')
('Ὀδυσσεύς' ⇋ 'Odysσσεύς')
('Ὀδυσσεύς' ⇋ 'Odysσεύς')
('Ὀδυσσεύς' ⇋ 'Odyssεύς')
('Ὀδυσσεύς' ⇋ 'Odyssεύςs')
('Ὀδυσσεύς' ⇋ 'Odyssεύςse')
('Ὀδυσσεύς' ⇋ 'Odysseεύςse')
('Ὀδυσσεύς' ⇋ 'Odysseεύς')
('Ὀδυσσεύς' ⇋ 'Odysseύς')
('Ὀδυσσεύς' ⇋ 'Odysseύςe')
('Ὀδυσσεύς' ⇋ 'Odysseύςeu')
('Ὀδυσσεύς' ⇋ 'Odysseuύςeu')
('Ὀδυσσεύς' ⇋ 'Odysseuύς')
('Ὀδυσσεύς' ⇋ 'Odysseúς')
('Ὀδυσσεύς' ⇋ 'Odysseúς́')
('Ὀδυσσεύς' ⇋ 'Odysseúς́s')
('Ὀδυσσεύς' ⇋ 'Odysseúsς́s')
('Ὀδυσσεύς' ⇋ 'Odysseúsς')
('Ὀδυσσεύς' ⇋ 'Odysseús')
('Ὀδυσσεύς' ⇋ 'Odysseús')
('Ὀδυσσεύς' ⇋ 'Odysseús')

@tavianator added the enhancement label on Oct 4, 2019
@christian-storm
Author

Thank you for the great info and tips. Agreed that transliteration doesn't always make sense to do, e.g., your example.

I realize now why I didn't think to do it the way you mentioned. I had it in my mind that bistr keeps track of each operation's output instead of always overwriting modified, i.e., that modified is a list, so one could roll back to a certain state. I had built this into my own version of this. The use case was being able to see which operation caused the string transformation train to derail.

@tavianator
Collaborator

Ah I see, but that would be polystring, not bistring :). More seriously, I am considering adding a data type that would retain an entire history of transformations, rather than just the initial and final states. The Emacs region-specific undo buffer stuff seems to have that, for example, but I'm not sure what encoding they use. I imagine it's a persistent stack of ropes or something.
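
For illustration only, the simplest form of such a history-keeping type might look like the sketch below. Everything in it is hypothetical (`HistoryString`, `apply`, `rollback`, and `first_failure` are not part of bistring), and it stores a full snapshot per step with no alignment information, so it is nowhere near the persistent structure speculated about above.

```python
from typing import Callable, List, Optional, Tuple

class HistoryString:
    """Hypothetical type that keeps every intermediate state of a pipeline."""

    def __init__(self, original: str) -> None:
        # Each entry is (operation name, resulting string).
        self._states: List[Tuple[str, str]] = [("original", original)]

    @property
    def original(self) -> str:
        return self._states[0][1]

    @property
    def current(self) -> str:
        return self._states[-1][1]

    def apply(self, name: str, fn: Callable[[str], str]) -> "HistoryString":
        """Run fn on the current state and record the result under name."""
        self._states.append((name, fn(self.current)))
        return self

    def rollback(self, steps: int = 1) -> "HistoryString":
        """Discard the last `steps` operations."""
        if not 0 <= steps < len(self._states):
            raise ValueError("cannot roll back past the original string")
        if steps:
            del self._states[-steps:]
        return self

    def first_failure(self, ok: Callable[[str], bool]) -> Optional[str]:
        """Name of the first operation whose output fails ok, or None."""
        for name, state in self._states[1:]:
            if not ok(state):
                return name
        return None

h = HistoryString("  Hello World  ")
h.apply("strip", str.strip).apply("lower", str.lower).apply("truncate", lambda s: s[:3])
culprit = h.first_failure(lambda s: "hello" in s.lower())
h.rollback(1)
```

Here `first_failure` names the step that derailed the pipeline, and `rollback` returns to the state just before it. The cost is one full copy of the string per operation.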
