Case changing for Cyrillic #675

Closed
josephwright opened this issue Feb 17, 2020 · 31 comments

@josephwright
Member

As noted in #671, at present

\documentclass{article}
\usepackage[T1,T2A]{fontenc}
\usepackage[utf8]{inputenc}
\usepackage{expl3}

\ExplSyntaxOn
\def\test{\text_lowercase:n}
\ExplSyntaxOff

\begin{document}
\test{\.I İ \CYRI И}
\end{document}

gives at best an 'odd' result.

It should be possible to carry out case-changing here as it is not dependent on \lccode changes but rather on expanding И to

\u8:И ->\IeC {\CYRI }

and then doing the work.

josephwright added the expl3 and enhancement labels Feb 17, 2020
josephwright self-assigned this Feb 17, 2020

@blefloch
Member

blefloch commented Feb 18, 2020 via email

@car222222
Contributor

@blefloch
Yes!

What are these \u8:... commands anyway? Are they needed?

@FrankMittelbach
Member

> @blefloch
> Yes!

Or maybe not, Chris. One may have to deal with ^^ notation in that place instead of И, but on the whole I agree that looks like the better starting point.

> What are these \u8:... commands anyway? Are they needed?

You should know :-) Your name is on the file that contains that code. Yes, they are needed: in pdfTeX, LaTeX sees bytes, analyzes them, and constructs a single csname from them, \u8:..., which holds the LICR for that UTF-8 char (in the above case \IeC {\CYRI }) or, if \u8:... is not defined, responds that there is no Unicode representation for ...
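
A minimal sketch of how one might inspect such a \u8:... macro under pdfLaTeX. Using \detokenize to build the name is an assumption made for this illustration, not something taken from the kernel sources, and the replacement text shown in the log may differ between kernel releases (the \IeC wrapper quoted above is what the kernel of that time produced).

```latex
\documentclass{article}
\usepackage[T1,T2A]{fontenc}
\usepackage[utf8]{inputenc}

\begin{document}
% Save this file as UTF-8 and compile with pdfLaTeX.
% \detokenize turns the two UTF-8 bytes of И into plain character tokens,
% so \csname assembles the name "u8:" + bytes without running the active chars.
\expandafter\show\csname u8:\detokenize{И}\endcsname
\end{document}
```

If the T2A encoding were not loaded, the name would most likely be undefined and \csname would leave it equal to \relax.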

@car222222
Contributor

> you should know :-) your name is on the file that contains that code.

But not everything I may be responsible for is needed :-).

I agree I should look at the original code! At least to find out where the : came from.

But I should stop now in case I anger a certain person by displaying my opinions in such a public place:-).

@josephwright
Member Author

@blefloch There are a couple of things needed. The first is to spot a UTF-8 pair/triplet/quartet and grab it whole rather than token-by-token. That's easy enough: check for active char tokens equal to the inputenc starting point. The second phase is to know how to case change them. The reason I mentioned taking the \IeC{...} approach is then we don't need new data: it's the same way that \MakeUppercase handles them and so uses the \@uclclist data we're already collecting.
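
To illustrate why the pair/triplet/quartet has to be grabbed whole, here is a small sketch (not taken from the expl3 sources) that counts the tokens TeX sees for a single Cyrillic letter. It assumes the file is saved as UTF-8.

```latex
\documentclass{article}
\usepackage[utf8]{inputenc}
\usepackage{expl3}

\ExplSyntaxOn
% pdfTeX reads bytes, so И (U+0418, UTF-8 bytes D0 98) arrives as two tokens;
% XeTeX and LuaTeX read the same input as a single character token.
\int_show:n { \tl_count:n { И } }
\ExplSyntaxOff

\begin{document}
\end{document}
```

This should report 2 under pdfTeX and 1 under XeTeX or LuaTeX, which is why a token-by-token case changer never sees "И" as one unit in pdfTeX.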

@car222222
Contributor

> The reason I mentioned taking the \IeC{...} approach is then we don't need new data:

Well, you may need a bit more if you want to cover absolutely every character that changes case. (They may not all have LICRs yet.)

Using numbers and Unicode tables is aesthetically more appealing, of course. But if ‘tables of names’ works for now . . .

For Cyrillic, Greek, Armenian, etc etc, is it possible to use new LICRs of the form \cyr{}, a bit like accents?

@josephwright
Member Author

@car222222 The issue came up as there are places where the current \MakeUppercase will work but \text_uppercase:n won't, which come down to things that go via \u8:.... That's why I was starting with this. If we want the full Unicode range in pdfTeX (doable), we'll need to store the data manually in an integer array.

@FrankMittelbach
Member

> If we want the full Unicode range in pdfTeX (doable), we'll need to store the data manually in a integer array.

Given that pdfTeX deliberately only provides UTF-8 chars if they are supported by the loaded font encodings, it is questionable to first case change and then find that the result is an unsupported character. Of course, if all the data is inside the format, then there is no extra payload (other than the size taken up by it) and the initial preparation.

@u-fischer
Member

> it is questionable to first case change and then find that the result is an unsupported character.

I don't find this very problematic. Lowercase and uppercase are in the same encoding, so you would only get an error on a capital alpha if you start with an unsupported lowercase alpha.

@blefloch
Member

blefloch commented Feb 18, 2020 via email

@car222222
Contributor

I agree with Ulrike and Bruno. But I am failing to imagine a realistic case (pun intended) where the upper and lower case characters are not both available/unavailable simultaneously.

@car222222
Contributor

> Given that pdfTeX deliberately only provides utf8 chars if supported by the loaded font encodings

Meaning what? pdfTeX does not ‘provide chars’ at all, does it? And ‘loaded font encodings’ is a LaTeX concept, not an engine one.

Maybe it means that, in the way we originally set up the utf8 stuff for LaTeX, LICRs were only defined (and mappings were only provided) ‘for known encodings’, and then only loaded for loaded encodings.

True, but there is no need to keep such restrictions these days, is there?
We can certainly now easily provide them for any subset of Unicode we wish to, and in this context we only need to cover all ‘casable characters’.

Disclaimer: I was never very keen on that restriction to known encodings:-).

@FrankMittelbach
Member

FrankMittelbach commented Feb 18, 2020 via email

@car222222
Contributor

There may well be reasons for not loading LICRs for unrepresentable characters.

But here we are talking only about defining these LICRs and uppercasing characters; note ‘characters’.
This has nothing to do with typesetting them, so the encodings/fonts that are available are not relevant.
Use-case: the uppercased form is only for use in a PDF bookmark, never to be typeset (by TeX, at least!)

josephwright added a commit that referenced this issue Feb 24, 2020
It turns out to be safest to deal with these directly,
rather than using an expand-and-check approach.
@josephwright
Member Author

After looking at the problem a little more, it seemed easier to handle it using a fixed list of mappings rather than trying to do things by looking inside active chars. I had a quick look at how many codepoints there are with case-changing data: about 2000. That's possibly a bit much to do all of them, so for the present I've picked up Greek and Cyrillic ones that are covered by T2/LGR. Thoughts welcome.

@FrankMittelbach
Member

what about the idea to store all of them in an intarray?

@josephwright
Member Author

The thing with using an intarray is we can't make it sparse, so the size would depend on the codepoint of the final value to be stored. There's also a bit of a performance hit at point-of-use as we'd have to extract, convert to bytes and construct the active chars then, rather than doing it once at load time.
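
A rough sketch of the dense-storage point, using the documented l3intarray interface. The variable name \g_demo_lowercase_intarray and the choice to index directly by codepoint are assumptions made for illustration only, not the layout expl3 actually uses.

```latex
\documentclass{article}
\usepackage{expl3}

\ExplSyntaxOn
% Hypothetical array, indexed by codepoint and sized to reach U+044F,
% the last letter of the basic Cyrillic alphabet: 1103 slots in total.
\intarray_new:Nn \g_demo_lowercase_intarray { "44F }
% Store a single mapping: И (U+0418) -> и (U+0438). Every other slot stays 0.
\intarray_gset:Nnn \g_demo_lowercase_intarray { "418 } { "438 }
% Retrieve it again; "438 is 1080 in decimal.
\int_show:n { \intarray_item:Nn \g_demo_lowercase_intarray { "418 } }
\ExplSyntaxOff

\begin{document}
\end{document}
```

Even for this one mapping the array needs 1103 entries. Covering every script with case mappings in the same way would need an array reaching the highest such codepoint, and each retrieved number would still have to be turned into UTF-8 bytes at point-of-use.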

@josephwright
Member Author

Also, back with the 'what codepoints have glyphs' business: as far as I know, the Greek and Cyrillic ones plus the Latin ones already covered are by far the most useful.

@car222222
Contributor

Well, to the Greeks and Cyrills they are the most useful, yes! But not to the rest of the world?
That is: how did you measure this utility?

I guess the total gets so large due to the many Latin derivatives around, or not?
2000 is approx. 30+ typical alphabets, I guess.

@josephwright
Member Author

'Utility' here was just starting with 'what works currently in pdfTeX', so 'what encodings are available'. I'm not sure what exactly all the mappings cover: it's possible there are false-positives. Presumably there are for a start all of the math variants (italic, sanserif, ...).

@josephwright
Member Author

A lot of it is accented Latin/Cyrillic/Greek, then there is Coptic, Armenian, Old Hungarian, Cherokee, etc. Certainly not 30 alphabets, but probably at least 10.

@josephwright
Member Author

Full list of scripts:

  • Latin (>700 codepoints!) incl. full-width versions
  • Greek
  • Coptic
  • Cyrillic
  • Armenian
  • Georgian
  • Cherokee
  • Glagolitic
  • Deseret
  • Osage
  • Old Hungarian
  • Warang Citi
  • Medefaidrin
  • Adlam

@car222222
Contributor

car222222 commented Feb 24, 2020

!! Latin (>700 codepoints!) incl. full-width versions
Ah yes, not to mention ‘circled superscript’ versions,
and I am sure there must be lowercase emojis in Unicode by now:-).

@josephwright
Member Author

@car222222 Luckily no circled letters ;) It's mainly lots and lots of combining accent versions.

@u-fischer
Member

@josephwright but you really should implement \text_lowercase:n{\emoji{Man}} = \emoji{Boy} ;-)

@josephwright
Member Author

Thoughts on further coverage? Or do we go with what I've set up for the present?

@moewew
Contributor

moewew commented Mar 3, 2020

The handling of \.I İ in the MWE above is different in pdfLaTeX (also compared to the Unicode engines), but I admit that İ is probably a tricky case in the generic case change code.

So I tried the Turkish case changer

\documentclass{article}
\usepackage{fontspec}
\usepackage{libertinus}
\usepackage{expl3}

\ExplSyntaxOn
\def\test{\text_lowercase:nn{tr}}
\ExplSyntaxOff

\begin{document}
\test{\.I İ \CYRI И}
\end{document}

(L3 programming layer <2020-02-25>) and LuaLaTeX and XeLaTeX are not happy

! Undefined control sequence.
<inserted text> ı

@josephwright
Member Author

@moewew Hmm, that's a bit odd: I'll get it sorted

@josephwright
Member Author

@moewew Specific issue with Turkish: now fixed

@FrankMittelbach
Member

> Thoughts on further coverage? Or do we go with what I've set up for the present?

I would start with the present coverage and extend when the need arises.

@josephwright
Member Author

OK, I think that's the best position, and also means we can keep issues moving. I'll close here and specific additions can be addressed in new issues.
