Case changing for Cyrillic #675
\u8:И -> \IeC{\CYRI}
Couldn't it make more sense to extract И from \u8:И, and look up case
information in some intarray?
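To make the intarray idea concrete, here is a minimal sketch in expl3 (all names are hypothetical, not the actual kernel code): one array per Unicode block, holding the uppercase codepoint for each slot, with 0 meaning 'no mapping'. Since intarrays cannot be sparse, one array per block rather than one for all of Unicode keeps the size manageable.

```latex
\ExplSyntaxOn
% One array per block: Cyrillic U+0400-U+04FF needs 256 slots.
\intarray_new:Nn \g_case_upper_cyrillic_intarray { 256 }
% и (U+0438) uppercases to И (U+0418): store at offset "38 (1-based).
\intarray_gset:Nnn \g_case_upper_cyrillic_intarray { "38 + 1 } { "0418 }
% Point of use: fetch the uppercase codepoint for offset "38.
\int_new:N \l_case_upper_int
\int_set:Nn \l_case_upper_int
  { \intarray_item:Nn \g_case_upper_cyrillic_intarray { "38 + 1 } }
\ExplSyntaxOff
```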
@blefloch What are these \u8:... commands anyway? Are they needed?
Or maybe not, Chris. One may have to deal with …
You should know :-) your name is on the file that contains that code. Yes, they are needed: in pdfTeX, LaTeX sees bytes, analyzes them, and constructs a single csname from them.
I agree I should look at the original code! At least to find out where the : came from. But I should stop now in case I anger a certain person by displaying my opinions in such a public place :-).
@blefloch There are a couple of things needed. The first is to spot a UTF-8 pair/triplet/quartet and grab it whole rather than token-by-token. That's easy enough: check for active char tokens equal to the …
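The 'spot a UTF-8 pair/triplet/quartet' step can key off the lead byte, whose value determines how many tokens to grab. A sketch under assumed (non-kernel) names:

```latex
% UTF-8 lead bytes determine how many active-char tokens to grab:
%   0x00-0x7F  single byte (ASCII)
%   0xC2-0xDF  lead byte of a 2-byte sequence
%   0xE0-0xEF  lead byte of a 3-byte sequence
%   0xF0-0xF4  lead byte of a 4-byte sequence
%   0x80-0xBF  continuation bytes (never lead)
\ExplSyntaxOn
\cs_new:Npn \case_utfviii_length:n #1
  {
    \int_compare:nTF { #1 <= "7F } { 1 }
      {
        \int_compare:nTF { #1 <= "DF } { 2 }
          { \int_compare:nTF { #1 <= "EF } { 3 } { 4 } }
      }
  }
\ExplSyntaxOff
```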
Using numbers and Unicode tables is aesthetically more appealing, of course. But if 'tables of names' works for now … For Cyrillic, Greek, Armenian, etc., is it possible to use new LICRs of the form \cyr{}, a bit like accents?
@car222222 The issue came up as there are places that current …
Given that pdfTeX deliberately only provides utf8 chars if supported by the loaded font encodings it is questionable to first case change and then find that the result is an unsupported character. Of course, if the whole data is inside the format then there is no extra payload (other than the size taken up by it) and the initial preparation.
I don't find this very problematic. Lowercase and uppercase are in the same encoding, so you only would get an error on a capital alpha if you start with the unsupported lowercase alpha. |
On 2/18/20 3:49 PM, Ulrike Fischer wrote:
it is questionable to first case change and then find that the
result is an unsupported character.
I don't find this very problematic. Lowercase and uppercase are in the
same encoding, so you only would get an error on a capital alpha if you
start with the unsupported lowercase alpha.
Even if there exists an encoding with lowercase alpha but not upper case
alpha (this might plausibly be the case for some of the rarer accents),
getting an error of Unicode char not set up seems better than
accidentally getting the lowercase char.
I agree with Ulrike and Bruno. But I am failing to imagine a realistic case (pun intended) where the upper and lower case characters are not both available/unavailable simultaneously.
Meaning what? pdfTeX does not 'provide chars' at all, does it? And 'loaded font encodings' is a LaTeX concept, not an engine one. Maybe it means that, in the way we originally set up the utf8 stuff for LaTeX, mappings were provided only 'for known encodings' and then only loaded for loaded encodings. True, but there is no need to keep such restrictions these days, is there? Disclaimer: I was never very keen on that restriction to known encodings :-).
Given that pdfTeX deliberately only provides utf8 chars if
supported by the loaded font encodings
Meaning what? pdfTeX does not ‘provide chars’ at all, does it? And
‘loaded font encodings’ is a LaTeX concept, not an engine one.
Meaning pdflatex, and writing via pdftex.
Maybe it means that, in the way we originally set up the utf8 stuff for
LaTeX, mappings were provided only 'for known encodings' and then only
loaded for loaded encodings.
Yes, which was a Good Thing (TM), because it kept the LaTeX world free of
tofu and missing characters.
True, but there is no need to keep such restrictions these days, is there?
We can certainly now easily provide them for any subset of Unicode we
wish to, and in this context we only need to cover all ‘casable characters’.
Yes, there is: if you don't have the glyphs to typeset the characters it
is pointless to do so, which is why claiming that you can do Unicode as
XeTeX or LuaTeX (LaTeX) do, and then just generating holes and 'Missing
character' warnings in the log, is a step backwards compared to the
pdflatex solution, imho.
Disclaimer: I was never very keen on that restriction to known encodings:-).
Well, as long as you write English it usually doesn't matter; if you
write in other languages and your document gets corrupted without
warning you, it does.
There may well be reasons for not loading LICRs for unrepresentable characters. But here we are talking only about defining these LICRs and uppercasing characters, note 'characters'.
It turns out to be safest to deal with these directly, rather than using an expand-and-check approach.
After looking at the problem a little more, it seemed easier to handle it using a fixed list of mappings rather than trying to do things by looking inside active chars. I had a quick look at how many codepoints there are with case-changing data: about 2000. That's possibly a bit much to do all of them, so for the present I've picked out the Greek and Cyrillic ones that are covered by …
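A fixed list of mappings could look something like the following sketch (the storage choice and every name here are assumptions, not the actual implementation): pair each lowercase LICR name with its uppercase partner, so the case changer swaps one csname for another directly, with no codepoint arithmetic at point of use.

```latex
\ExplSyntaxOn
% Hypothetical name-based mapping table: lowercase LICR -> uppercase LICR.
\prop_new:N \g_case_upper_prop
\prop_gput:Nnn \g_case_upper_prop { cyri }      { CYRI }
\prop_gput:Nnn \g_case_upper_prop { cyra }      { CYRA }
\prop_gput:Nnn \g_case_upper_prop { textalpha } { textAlpha }
% Point of use: look up the partner, fall back to the input unchanged.
\cs_new:Npn \case_upper_licr:n #1
  { \prop_item:Nn \g_case_upper_prop {#1} }
\ExplSyntaxOff
```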
What about the idea of storing all of them in an intarray?
The thing with using an intarray is that we can't make it sparse, so the size would depend on the codepoint of the final value to be stored. There's also a bit of a performance hit at point-of-use, as we'd have to extract, convert to bytes and construct the active chars then, rather than doing it once at load time.
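For reference, the 'convert to bytes' step for a two-byte UTF-8 sequence is just a truncated division and a remainder; a worked sketch (names hypothetical) for И (U+0418):

```latex
% Codepoint -> UTF-8 bytes for a 2-byte sequence, worked for U+0418 (И):
%   byte 1 = 0xC0 + (0x0418 div 64) = 0xC0 + 0x10 = 0xD0
%   byte 2 = 0x80 + (0x0418 mod 64) = 0x80 + 0x18 = 0x98
\ExplSyntaxOn
\int_new:N \l_case_byte_one_int
\int_new:N \l_case_byte_two_int
\int_set:Nn \l_case_byte_one_int
  { "C0 + \int_div_truncate:nn { "0418 } { 64 } }  % 208 = 0xD0
\int_set:Nn \l_case_byte_two_int
  { "80 + \int_mod:nn { "0418 } { 64 } }           % 152 = 0x98
\ExplSyntaxOff
```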
Also, back with the 'what codepoints have glyphs' business: as far as I know, the Greek and Cyrillic ones, plus the Latin ones already covered, are by far the most useful.
Well, to the Greeks and Cyrills they are the most useful, yes! But not to the rest of the world? I guess the total gets so large due to the many Latin derivatives around, or not?
'Utility' here was just starting with 'what works currently in pdfTeX', so 'what encodings are available'. I'm not sure what exactly all the mappings cover: it's possible there are false-positives. Presumably there are for a start all of the math variants (italic, sanserif, ...). |
A lot of it is accented Latin/Cyrillic/Greek, then there is Coptic, Armenian, Old Hungarian, Cherokee, etc. Certainly not 30 alphabets, but probably at least 10.
Full list of scripts: …
!! Latin (>700 codepoints!) incl. full-width versions
@car222222 Luckily no circled letters ;) It's mainly lots and lots of combining accent versions.
@josephwright but you really should implement …
Thoughts on further coverage? Or do we go with what I've set up for the present?
The handling of …

So I tried the Turkish case changer:

\documentclass{article}
\usepackage{fontspec}
\usepackage{libertinus}
\usepackage{expl3}
\ExplSyntaxOn
\def\test{\text_lowercase:nn{tr}}
\ExplSyntaxOff
\begin{document}
\test{\.I İ \CYRI И}
\end{document}
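For reference, the Turkish tailoring that example exercises maps I (U+0049) to dotless ı (U+0131) and dotted İ (U+0130) to i (U+0069) when lowercasing, and the reverse when uppercasing; a minimal sketch (font setup is an assumption):

```latex
\documentclass{article}
\usepackage{fontspec}  % assumes a font containing ı and İ
\begin{document}
\ExplSyntaxOn
% ~ stands for a space while expl3 syntax is active.
\text_lowercase:nn { tr } { I~İ }  % expected: ı i
\text_uppercase:nn { tr } { i~ı }  % expected: İ I
\ExplSyntaxOff
\end{document}
```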
@moewew Hmm, that's a bit odd: I'll get it sorted.
@moewew Specific issue with Turkish: now fixed.
I would start with the present set and extend when the need arises.
OK, I think that's the best position, and also means we can keep issues moving. I'll close here and specific additions can be addressed in new issues.
As noted in #671, at present …
gives at best an 'odd' result.
It should be possible to carry out case-changing here, as it is not dependent on \lccode changes but rather on expanding И to \IeC{\CYRI} and then doing the work.
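A hypothetical minimal reproduction along these lines (the exact preamble is an assumption, not taken from the report): under pdfLaTeX, И arrives as two bytes turned into active characters, so \lccode/\uccode-based case changing cannot act on it directly.

```latex
% Hypothetical reproduction sketch; setup assumed, not from the report.
\documentclass{article}
\usepackage[T2A]{fontenc}
\usepackage[utf8]{inputenc}
\begin{document}
\MakeLowercase{И} % expected и; gave an 'odd' result at the time of this report
\end{document}
```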