Transliterate #19

christian-storm · 2019-09-29T20:43:15Z

I was hoping you might advise me on how to incorporate transliteration into a text transformation pipeline.

Let's say I want to use a third-party library such as unidecode (from unidecode import unidecode).
I could create a bistring with new_bistr = bistr(text.modified, unidecode(text.modified)),
but I would lose all the previous operations.

Is there a way to fold in a modified string that is calculated outside bistring's capabilities?

tavianator · 2019-09-30T13:48:29Z

In general, no. You could use something like bistr.infer(text, unidecode(text)) to have it guess the alignment.
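To give a feel for what "guessing" an alignment involves, here is a minimal sketch that uses only the standard library. It swaps in difflib.SequenceMatcher as the matching heuristic, which is an assumption made for illustration and not necessarily how bistr.infer works internally. The helper name guess_changes is hypothetical.

```python
from difflib import SequenceMatcher


def guess_changes(original: str, modified: str):
    """Guess which spans of `original` turned into which spans of `modified`.

    Returns ((orig_start, orig_end), (mod_start, mod_end)) pairs for the
    regions that differ. This is a heuristic stand-in, not bistring's algorithm.
    """
    matcher = SequenceMatcher(None, original, modified, autojunk=False)
    return [
        ((i1, i2), (j1, j2))
        for tag, i1, i2, j1, j2 in matcher.get_opcodes()
        if tag != "equal"
    ]


original = "na\u00efve caf\u00e9"  # "naïve café"
modified = "naive cafe"            # what a transliterator might produce
changes = guess_changes(original, modified)
for (i1, i2), (j1, j2) in changes:
    print(repr(original[i1:i2]), "->", repr(modified[j1:j2]))
```

Any such guess can misalign text when the two strings differ a lot, which is why recording the replacements as they happen is preferable when it is possible.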

In your case, you could do a little better, since the transliteration process probably operates character-by-character. Something like:

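from bistring import BistrBuilder, CharacterTokenizer
from unidecode import unidecode

tokenizer = CharacterTokenizer('und')  # or 'en-US', etc.
builder = BistrBuilder(text)
for token in tokenizer.tokenize(text):
    # Replace each character with its transliteration, keeping the alignment
    builder.replace(token.end - token.start, unidecode(token.modified))
text = builder.build()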
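The reason the per-character approach aligns better than guessing is that every replacement is recorded at the moment it happens. Below is a standard-library-only sketch of that idea, with no bistring involved. The to_ascii helper is a hypothetical stand-in for unidecode (it only decomposes characters and drops combining marks), and the loop walks code points rather than the grapheme clusters that CharacterTokenizer would yield.

```python
import unicodedata


def to_ascii(ch: str) -> str:
    """Hypothetical stand-in for unidecode(): NFKD-decompose, drop combining marks."""
    decomposed = unicodedata.normalize("NFKD", ch)
    return "".join(c for c in decomposed if not unicodedata.combining(c))


def transliterate_with_alignment(text: str):
    """Transliterate one code point at a time, recording where each one went.

    Returns (modified, spans), where each span is
    (orig_start, orig_end, mod_start, mod_end).
    """
    pieces = []
    spans = []
    mod_pos = 0
    for i, ch in enumerate(text):
        repl = to_ascii(ch)
        pieces.append(repl)
        spans.append((i, i + 1, mod_pos, mod_pos + len(repl)))
        mod_pos += len(repl)
    return "".join(pieces), spans


# "café ﬁn", with a precomposed é and the single-character "ﬁ" ligature
modified, spans = transliterate_with_alignment("caf\u00e9 \ufb01n")
print(modified)
print(spans[5])  # the ligature: one original character, two modified ones
```

Because each span is produced next to the replacement it describes, a one-to-many change like the ligature stays exactly aligned, with no inference step needed.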

By the way, it's on my backlog to implement support for ICU's Transliterator API, which is more powerful than unidecode and similar libraries.

tavianator · 2019-10-04T15:46:45Z

So since ovalhub/pyicu#107 was implemented, I've tested out an implementation that wraps a bistr in a Replaceable for ICU. It works well for simple transliterations like Latin-ASCII, but for complicated ones like Greek-Latin ICU does some strange things that I'm not sure how to cope with nicely:

('Ὀδυσσεύς' ⇋ 'Ὀδυσσεύς')
('Ὀδυσσεύς' ⇋ 'Ὀδυσσεύς')
('Ὀδυσσεύς' ⇋ 'Ὀδυσσεύς\uffff')
('Ὀδυσσεύς' ⇋ 'Ὀδυσσεύς\uffffO')
('Ὀδυσσεύς' ⇋ 'OὈδυσσεύς\uffffO')
('Ὀδυσσεύς' ⇋ 'OὈδυσσεύς')
('Ὀδυσσεύς' ⇋ 'O̓δυσσεύς')
('Ὀδυσσεύς' ⇋ 'O̓δυσσεύςO')
('Ὀδυσσεύς' ⇋ 'O̓δυσσεύςO')
('Ὀδυσσεύς' ⇋ 'O̓δυσσεύς')
('Ὀδυσσεύς' ⇋ 'Oδυσσεύς')
('Ὀδυσσεύς' ⇋ 'OδυσσεύςO')
('Ὀδυσσεύς' ⇋ 'OδυσσεύςOd')
('Ὀδυσσεύς' ⇋ 'OdδυσσεύςOd')
('Ὀδυσσεύς' ⇋ 'Odδυσσεύς')
('Ὀδυσσεύς' ⇋ 'Odυσσεύς')
('Ὀδυσσεύς' ⇋ 'Odυσσεύςd')
('Ὀδυσσεύς' ⇋ 'Odυσσεύςdy')
('Ὀδυσσεύς' ⇋ 'Odyυσσεύςdy')
('Ὀδυσσεύς' ⇋ 'Odyυσσεύς')
('Ὀδυσσεύς' ⇋ 'Odyσσεύς')
('Ὀδυσσεύς' ⇋ 'Odyσσεύςy')
('Ὀδυσσεύς' ⇋ 'Odyσσεύςys')
('Ὀδυσσεύς' ⇋ 'Odysσσεύςys')
('Ὀδυσσεύς' ⇋ 'Odysσσεύς')
('Ὀδυσσεύς' ⇋ 'Odysσεύς')
('Ὀδυσσεύς' ⇋ 'Odyssεύς')
('Ὀδυσσεύς' ⇋ 'Odyssεύςs')
('Ὀδυσσεύς' ⇋ 'Odyssεύςse')
('Ὀδυσσεύς' ⇋ 'Odysseεύςse')
('Ὀδυσσεύς' ⇋ 'Odysseεύς')
('Ὀδυσσεύς' ⇋ 'Odysseύς')
('Ὀδυσσεύς' ⇋ 'Odysseύςe')
('Ὀδυσσεύς' ⇋ 'Odysseύςeu')
('Ὀδυσσεύς' ⇋ 'Odysseuύςeu')
('Ὀδυσσεύς' ⇋ 'Odysseuύς')
('Ὀδυσσεύς' ⇋ 'Odysseúς')
('Ὀδυσσεύς' ⇋ 'Odysseúς́')
('Ὀδυσσεύς' ⇋ 'Odysseúς́s')
('Ὀδυσσεύς' ⇋ 'Odysseúsς́s')
('Ὀδυσσεύς' ⇋ 'Odysseúsς')
('Ὀδυσσεύς' ⇋ 'Odysseús')
('Ὀδυσσεύς' ⇋ 'Odysseús')
('Ὀδυσσεύς' ⇋ 'Odysseús')

christian-storm · 2019-10-04T17:21:40Z

Thank you for the great info and tips. Agreed that transliteration doesn't always make sense, as your example shows.

I realize now why I didn't think to do it the way you mentioned. I had it in my mind that bistr keeps track of each operation's output instead of always overwriting modified, i.e., that modified is a list, so one could roll back to a certain state. I had built this into my own version of this. The use case is that I could see which operation caused the string transformation train to derail.
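For illustration, here is a minimal sketch of that kind of history-keeping pipeline in plain Python. It is not part of bistring; run_with_history and the step names are hypothetical, and it records only the intermediate strings, not alignments.

```python
from typing import Callable, List, Tuple

Step = Tuple[str, Callable[[str], str]]


def run_with_history(text: str, steps: List[Step]) -> List[Tuple[str, str]]:
    """Apply each step in order, keeping every intermediate result.

    Returns a list of (step_name, output) pairs, starting with the input.
    """
    history = [("input", text)]
    for name, fn in steps:
        text = fn(text)
        history.append((name, text))
    return history


steps: List[Step] = [
    ("strip", str.strip),
    ("lower", str.lower),
    ("collapse_spaces", lambda s: " ".join(s.split())),
]
history = run_with_history("  Hello   WORLD ", steps)
for name, output in history:
    print(f"{name:>16}: {output!r}")
```

Rolling back to a given state is then just indexing into history, and scanning it shows which step first produced an unexpected string.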

tavianator · 2019-10-07T14:40:48Z

Ah, I see, but that would be polystring, not bistring :). More seriously, I am considering adding a data type that would retain an entire history of transformations, rather than just the initial and final states. The Emacs region-specific undo buffer seems to have that, for example, but I'm not sure what encoding it uses. I imagine it's a persistent stack of ropes or something similar.

tavianator added the "question" label (Further information is requested) on Sep 30, 2019

tavianator added the "enhancement" label (New feature or request) on Oct 4, 2019
