Case changing for Cyrillic #675

josephwright · 2020-02-17T17:13:44Z

As noted in #671, at present

\documentclass{article}
\usepackage[T1,T2A]{fontenc}
\usepackage[utf8]{inputenc}
\usepackage{expl3}

\ExplSyntaxOn
\def\test{\text_lowercase:n}
\ExplSyntaxOff

\begin{document}
\test{\.I İ \CYRI И}
\end{document}

gives, at best, an 'odd' result.

It should be possible to carry out case-changing here as it is not dependent on \lccode changes but rather on expanding И to

\u8:И ->\IeC {\CYRI }

and then doing the work.


blefloch · 2020-02-18T11:43:24Z

\u8:И ->\IeC {\CYRI }

Couldn't it make more sense to extract И from \u8:И, and look up case information in some intarray?

car222222 · 2020-02-18T11:49:49Z

@blefloch
Yes!

What are these \u8:... commands anyway? Are they needed?

FrankMittelbach · 2020-02-18T11:56:53Z

@blefloch
Yes!

Or maybe not, Chris. One may have to deal with ^^ notation in that place instead of И, but on the whole I agree that this looks like the better starting point.

What are these \u8:... commands anyway? Are they needed?

You should know :-) Your name is on the file that contains that code. Yes, they are needed: in pdfTeX, LaTeX sees bytes, analyzes them and constructs a single csname \u8:... from them. That csname holds the LICR for the UTF-8 char, which in the above case is \IeC {\CYRI }; if \u8:... is not defined, LaTeX responds with "no Unicode representation for ...".
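
To make the lookup described above concrete, here is a minimal sketch for pdfLaTeX only, assuming the file is saved as UTF-8. It rebuilds the csname from the raw bytes of И and writes its meaning to the terminal and log. The exact replacement text depends on the LaTeX release; at the time of this thread it was \IeC {\CYRI }.

```latex
\documentclass{article}
\usepackage[T1,T2A]{fontenc}
\usepackage[utf8]{inputenc}
\begin{document}
% \detokenize turns the two active bytes of И into plain characters,
% so \csname builds the same name (u8:<bytes>) that inputenc looks up.
% If no loaded encoding declares the character, the csname is undefined
% and \meaning reports \relax instead of a macro.
\typeout{\expandafter\meaning\csname u8:\detokenize{И}\endcsname}
Test
\end{document}
```

Under XeLaTeX or LuaLaTeX there are no \u8:... macros, so this only makes sense with an 8-bit engine.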

car222222 · 2020-02-18T12:15:36Z

you should know :-) your name is on the file that contains that code.
But not everything I may be responsible for is needed :-).

I agree I should look at the original code! At least to find out where the : came from.

But I should stop now in case I anger a certain person by displaying my opinions in such a public place:-).

josephwright · 2020-02-18T12:48:17Z

@blefloch There are a couple of things needed. The first is to spot a UTF-8 pair/triplet/quartet and grab it whole rather than token-by-token. That's easy enough: check for active char tokens equal to the inputenc starting point. The second phase is to know how to case change them. The reason I mentioned taking the \IeC{...} approach is then we don't need new data: it's the same way that \MakeUppercase handles them and so uses the \@uclclist data we're already collecting.
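
As a rough illustration of the first step (knowing how many bytes belong together), the sketch below classifies a lead byte by its numeric value. This is not the approach described above, which compares active char tokens against the inputenc definitions; it only shows the underlying UTF-8 rule. The function name \demo_utf_length:n is hypothetical.

```latex
\documentclass{article}
\usepackage{expl3}
\ExplSyntaxOn
% Hypothetical helper: number of bytes in the UTF-8 sequence that starts
% with byte value #1. A result of 0 means "not a valid lead byte"
% (a continuation byte, or a value that never starts a sequence).
\cs_new:Npn \demo_utf_length:n #1
  {
    \int_compare:nNnTF {#1} < { "80 } { 1 }
      {
        \int_compare:nNnTF {#1} < { "C2 } { 0 }
          {
            \int_compare:nNnTF {#1} < { "E0 } { 2 }
              {
                \int_compare:nNnTF {#1} < { "F0 } { 3 }
                  { \int_compare:nNnTF {#1} < { "F5 } { 4 } { 0 } }
              }
          }
      }
  }
% "D0 is the first byte of the Cyrillic letter И (bytes D0 98).
\typeout { Lead~byte~D0~starts~a~sequence~of~\demo_utf_length:n { "D0 }~bytes }
\ExplSyntaxOff
\begin{document}
Test
\end{document}
```

For the lead byte D0 this reports a two-byte sequence, which matches the pair that inputenc turns into \u8:И.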

car222222 · 2020-02-18T13:05:08Z

The reason I mentioned taking the \IeC{...} approach is then we don't need new data:
Well, you may need a bit more if you want to cover absolutely every character that changes case (They may not all yet have LICRs.)

Using numbers and Unicode tables is aesthetically more appealing, of course. But if ‘tables of names’ works for now . . .

For Cyrillic, Greek, Armenian, etc etc, is it possible to use new LICRs of the form \cyr{}, a bit like accents?

josephwright · 2020-02-18T14:22:36Z

@car222222 The issue came up as there are places where the current \MakeUppercase will work and \text_uppercase:n won't, which come down to things that go via \u8:.... That's why I was starting with this. If we want the full Unicode range in pdfTeX (doable), we'll need to store the data manually in an integer array.

FrankMittelbach · 2020-02-18T14:33:01Z

If we want the full Unicode range in pdfTeX (doable), we'll need to store the data manually in an integer array.

Given that pdfTeX deliberately only provides utf8 chars if supported by the loaded font encodings it is questionable to first case change and then find that the result is an unsupported character. Of course, if the whole data is inside the format then there is no extra payload (other than the size taken up by it) and the initial preparation.

u-fischer · 2020-02-18T14:49:55Z

it is questionable to first case change and then find that the result is an unsupported character.

I don't find this very problematic. Lowercase and uppercase are in the same encoding, so you only would get an error on a capital alpha if you start with the unsupported lowercase alpha.

blefloch · 2020-02-18T14:55:35Z

On 2/18/20 3:49 PM, Ulrike Fischer wrote:

it is questionable to first case change and then find that the result is an unsupported character.

I don't find this very problematic. Lowercase and uppercase are in the same encoding, so you only would get an error on a capital alpha if you start with the unsupported lowercase alpha.

Even if there exists an encoding with lowercase alpha but not upper case alpha (this might plausibly be the case for some of the rarer accents), getting an error of Unicode char not set up seems better than accidentally getting the lowercase char.

car222222 · 2020-02-18T15:12:06Z

I agree with Ulrike and Bruno. But I am failing to imagine a realistic case (pun intended) where the upper and lower case characters are not both available/unavailable simultaneously.

car222222 · 2020-02-18T15:23:26Z

Given that pdfTeX deliberately only provides utf8 chars if supported by the loaded font encodings

Meaning what? pdfTeX does not ‘provide chars’ at all, does it? And ‘loaded font encodings’ is a LaTeX concept, not an engine one.

Maybe it means that in the way we originally set up the utf8 stuff for LaTeX, LICRs (and mappings) were provided only ‘for known encodings’ and then only loaded for loaded encodings.

True, but there is no need to keep such restrictions these days, is there?
We can certainly now easily provide them for any subset of Unicode we wish to, and in this context we only need to cover all ‘casable characters’.

Disclaimer: I was never very keen on that restriction to known encodings:-).

FrankMittelbach · 2020-02-18T15:33:53Z

Given that pdfTeX deliberately only provides utf8 chars if supported by the loaded font encodings

Meaning what? pdfTeX does not ‘provide chars’ at all, does it? And ‘loaded font encodings’ is a LaTeX concept, not an engine one.

I meant pdfLaTeX but wrote pdfTeX.

Maybe it means that in the way we originally set up the utf8 stuff for LaTeX, LICRs (and mappings) were provided only ‘for known encodings’ and then only loaded for loaded encodings.

Yes, which was a Good Thing (TM), because that kept the LaTeX world free of tofu and missing characters.

True, but there is no need to keep such restrictions these days, is there? We can certainly now easily provide them for any subset of Unicode we wish to, and in this context we only need to cover all ‘casable characters’.

Yes, there is. If you don't have the glyphs to typeset the characters, it is pointless to do so. That is why claiming that you can do Unicode as XeTeX or LuaTeX (LaTeX) does, and then just generating holes and "No char XXX" warnings in the log, is a step backwards compared to the pdfLaTeX solution, imho.

Disclaimer: I was never very keen on that restriction to known encodings:-).

Well, as long as you write English it usually doesn't matter; if you write in other languages and your document gets corrupted without warning you, it does.

car222222 · 2020-02-18T15:45:57Z

There may well be reasons for not loading LICRs for unrepresentable characters.

But here we are talking only about defining these LICRs and uppercasing characters, note ‘characters‘.
Nothing to do with typesetting them, so the encodings/fonts that are available are not relevant.
Use-case: the uppercased form is only for use in a PDF bookmark, never to be typeset (by TeX, at least!)


josephwright · 2020-02-24T09:27:15Z

After looking at the problem a little more, it seemed easier to handle it using a fixed list of mappings rather than trying to do things by looking inside active chars. I had a quick look at how many codepoints there are with case-changing data: about 2000. That's possibly a bit much to do all of them, so for the present I've picked up Greek and Cyrillic ones that are covered by T2/LGR. Thoughts welcome.

FrankMittelbach · 2020-02-24T09:43:46Z

what about the idea to store all of them in an intarray?

josephwright · 2020-02-24T09:46:45Z

The thing with using an intarray is that we can't make it sparse, so the size would depend on the codepoint of the final value to be stored. There's also a bit of a performance hit at point-of-use, as we'd have to extract, convert to bytes and construct the active chars then, rather than doing it once at load time.
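
A small sketch of the density issue, using the l3intarray interface. The array name, the helper \demo_lower:n and the choice to cover only U+0400..U+04FF are illustrative assumptions, not what the kernel does. An array indexed directly by codepoint would instead need as many entries as the highest cased codepoint, most of them unused zeros.

```latex
\documentclass{article}
\usepackage{expl3}
\ExplSyntaxOn
% Dense array for one block only: entry n holds the lowercase codepoint
% of U+0400 + (n - 1), or 0 if no mapping has been stored.
\intarray_new:Nn \g_demo_lower_intarray { 256 }
% Basic Cyrillic capitals U+0410..U+042F lowercase to codepoint + 32.
\int_step_inline:nnn { "0410 } { "042F }
  { \intarray_gset:Nnn \g_demo_lower_intarray { #1 - "03FF } { #1 + 32 } }
% Hypothetical lookup: #1 is a codepoint inside U+0400..U+04FF.
\cs_new:Npn \demo_lower:n #1
  { \intarray_item:Nn \g_demo_lower_intarray { #1 - "03FF } }
% U+0418 (И) should map to hex 438 (и).
\typeout { 0418~lowercases~to~\int_to_Hex:n { \demo_lower:n { "0418 } } }
\ExplSyntaxOff
\begin{document}
Test
\end{document}
```

The lookup only yields a codepoint; in pdfTeX that number would still have to be converted to bytes and active chars at point of use, which is the extra cost mentioned above.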

josephwright · 2020-02-24T09:48:27Z

Also, back with the 'what codepoints have glyphs' business, as far as I know, the Greek and Cyrillic ones plus the Latin ones already covered are by far the most useful.

car222222 · 2020-02-24T10:14:27Z

Well, to the Greeks and Cyrills they are the most useful, yes! But not to the rest of the world?
That is: how did you measure this utility?

I guess the total gets up so large due to the many latin-derivatives around, or not?
2000 is approx 30+ typical alphabets, I guess.

josephwright · 2020-02-24T10:40:43Z

'Utility' here was just starting with 'what works currently in pdfTeX', so 'what encodings are available'. I'm not sure what exactly all the mappings cover: it's possible there are false-positives. Presumably there are for a start all of the math variants (italic, sanserif, ...).

josephwright · 2020-02-24T11:01:12Z

A lot of it is accented Latin/Cyrillic/Greek, then there is Coptic, Armenian, Old Hungarian, Cherokee, etc. Certainly not 30 alphabets, but probably at least 10.

josephwright · 2020-02-24T11:07:59Z

Full list of scripts:

Latin (>700 codepoints!) incl. full-width versions
Greek
Coptic
Cyrillic
Armenian
Georgian
Cherokee
Glagolitic
Deseret
Osage
Old Hungarian
Warang Citi
Medefaidrin
Adlam

car222222 · 2020-02-24T12:49:13Z

!! Latin (>700 codepoints!) incl. full-width versions
Ah yes, not to mention ‘circled superscript’ versions,
and I am sure there must be lowercase emojis in Unicode by now:-).

josephwright · 2020-02-24T13:38:38Z

@car222222 Luckily no circled letters ;) It's mainly lots and lots of combining accent versions.

u-fischer · 2020-02-24T13:51:24Z

@josephwright but you really should implement \text_lowercase:n{\emoji{Man}} = \emoji{Boy} ;-)

josephwright · 2020-03-03T16:57:05Z

Thoughts on further coverage? Or do we go with what I've set up for the present?

moewew · 2020-03-03T20:55:21Z

The handling of \.I İ in the MWE above is different in pdfLaTeX (also compared to the Unicode engines), but I admit that İ is probably a tricky case in the generic case change code.

So I tried the Turkish case changer

\documentclass{article}
\usepackage{fontspec}
\usepackage{libertinus}
\usepackage{expl3}

\ExplSyntaxOn
\def\test{\text_lowercase:nn{tr}}
\ExplSyntaxOff

\begin{document}
\test{\.I İ \CYRI И}
\end{document}

(L3 programming layer <2020-02-25>) and LuaLaTeX and XeLaTeX are not happy

! Undefined control sequence.
<inserted text> ı

josephwright · 2020-03-03T21:19:50Z

@moewew Hmm, that's a bit odd: I'll get it sorted

josephwright · 2020-03-03T22:04:16Z

@moewew Specific issue with Turkish: now fixed

FrankMittelbach · 2020-03-03T22:16:05Z

Thoughts on further coverage? Or do we go with what I've set up for the present?

I would start with the present set and extend when the need arises.

josephwright · 2020-03-03T22:20:19Z

OK, I think that's the best position, and also means we can keep issues moving. I'll close here and specific additions can be addressed in new issues.

josephwright added expl3 enhancement New feature or request labels Feb 17, 2020

josephwright self-assigned this Feb 17, 2020

josephwright added a commit that referenced this issue Feb 24, 2020

Case-changing support for T2 encodings (issue #675)

c79f32a

It turns out to be safest to deal with these directly, rather than using an expand-and-check approach.

josephwright added a commit that referenced this issue Mar 3, 2020

Fix issue wth Turkish case-changing (see #675)

3de4c5f

josephwright closed this as completed Mar 3, 2020
