issue with unicode #187

Open
johnynek opened this issue Sep 13, 2019 · 1 comment
Comments

@johnynek
Collaborator

We currently use String.length to count how many columns we have moved, e.g.:

https://github.com/typelevel/paiges/blob/master/core/src/main/scala/org/typelevel/paiges/Chunk.scala#L108

https://github.com/typelevel/paiges/blob/master/core/src/main/scala/org/typelevel/paiges/Doc.scala#L299

https://github.com/typelevel/paiges/blob/master/core/src/main/scala/org/typelevel/paiges/Doc.scala#L579

But String.length counts UTF-16 code units, so for characters that don't fit in 16 bits (those encoded as surrogate pairs) this will be incorrect. A possibly better, but more expensive, solution is to use codePointCount: https://docs.oracle.com/javase/7/docs/api/java/lang/String.html#codePointCount(int,%20int)
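
To make the difference concrete, here is a minimal sketch comparing the two counts on a character outside the 16-bit range. The class and method names are made up for illustration and are not part of paiges; only String.length and String.codePointCount are real JDK calls.

```java
final class CodeUnitsVsCodePoints {
    // Number of UTF-16 code units; this is what String.length() reports.
    static int codeUnits(String s) {
        return s.length();
    }

    // Number of Unicode code points; a surrogate pair counts once.
    static int codePoints(String s) {
        return s.codePointCount(0, s.length());
    }

    public static void main(String[] args) {
        // U+1D11E MUSICAL SYMBOL G CLEF, written as a surrogate pair.
        String clef = "\uD834\uDD1E";
        System.out.println(codeUnits(clef));   // 2
        System.out.println(codePoints(clef));  // 1
    }
}
```

For plain ASCII input the two counts agree, so the mismatch only shows up once supplementary characters appear in the text.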

But there is also the question of halfwidth/fullwidth forms:
https://en.wikipedia.org/wiki/Halfwidth_and_fullwidth_forms

So, to be really pro-style, we need a way to take a String and return the width on the screen when it is printed.
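
A rough sketch of what such a width function might look like is below. It is deliberately incomplete: the ranges in isWide are a small hand-picked assumption (CJK Unified Ideographs and the fullwidth part of the Halfwidth and Fullwidth Forms block), not the full Unicode East Asian Width data, and zero-width combining marks are not handled at all. DisplayWidth, isWide and width are hypothetical names.

```java
final class DisplayWidth {
    // Assumption: only these ranges are treated as two columns wide.
    // A real implementation would consult the Unicode East Asian Width property.
    static boolean isWide(int cp) {
        return (cp >= 0x4E00 && cp <= 0x9FFF)   // CJK Unified Ideographs
            || (cp >= 0xFF01 && cp <= 0xFF60)   // fullwidth ASCII variants and punctuation
            || (cp >= 0xFFE0 && cp <= 0xFFE6);  // fullwidth currency and other signs
    }

    // Sums an estimated column width over the code points of s.
    static int width(String s) {
        int columns = 0;
        for (int i = 0; i < s.length(); ) {
            int cp = s.codePointAt(i);
            columns += isWide(cp) ? 2 : 1;
            // Step over both code units when cp is a surrogate pair.
            i += Character.charCount(cp);
        }
        return columns;
    }
}
```

With these assumptions, a fullwidth "Ａ" (U+FF21) counts as two columns while a halfwidth katakana "ｱ" (U+FF71) counts as one.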

@johnynek
Collaborator Author

looks like this may work:

https://docs.oracle.com/javase/7/docs/api/java/text/BreakIterator.html

from: https://engineering.linecorp.com/en/blog/the-7-ways-of-counting-characters/

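import java.text.BreakIterator;

// Counts user-perceived characters (grapheme clusters) in value.
public static int getGraphemeLength(String value) {
    BreakIterator it = BreakIterator.getCharacterInstance();
    it.setText(value);
    int count = 0;
    // Each call to next() advances past one character boundary.
    while (it.next() != BreakIterator.DONE) {
        count++;
    }
    return count;
}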

We could have the render method take a trait StringLength { def apply(s: String): Int } and have three implementations, probably in increasing order of cost: s.length, s.codePointCount(0, s.length), and the BreakIterator approach above.
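
As a sketch of that idea, written in Java to match the snippet above rather than as the Scala trait, the three strategies could sit behind one interface. The constant names CODE_UNITS, CODE_POINTS and GRAPHEMES are invented here; nothing like this exists in paiges yet.

```java
import java.text.BreakIterator;

interface StringLength {
    int apply(String s);

    // Cheapest: UTF-16 code units. Correct for ASCII-only text.
    StringLength CODE_UNITS = String::length;

    // Counts each surrogate pair once.
    StringLength CODE_POINTS = s -> s.codePointCount(0, s.length());

    // Most expensive: counts character boundaries reported by BreakIterator.
    StringLength GRAPHEMES = s -> {
        BreakIterator it = BreakIterator.getCharacterInstance();
        it.setText(s);
        int count = 0;
        while (it.next() != BreakIterator.DONE) {
            count++;
        }
        return count;
    };
}
```

All three agree on ASCII input, which is what would make the cheap implementation a safe choice when the caller knows the text is ASCII. They diverge on input such as "e" followed by a combining acute accent (U+0301), where the first two report 2 and the BreakIterator version reports 1.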

If you know you have ascii, you could use the first algorithm...

Maybe we should just benchmark this and see what the cost is for ascii, and if doing the correct thing doesn't totally blow up costs, we can use it.

Since the text() constructor is already doing a traversal to look for \n, maybe the cost won't be significant...
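
One way that existing pass might be reused, purely as an assumption about how it could be structured: the same loop that looks for \n can also note whether every char is ASCII, so the cheap length is used whenever it is known to be correct. TextScan, scan and columns are hypothetical names and do not reflect how text() is written today.

```java
final class TextScan {
    final boolean hasNewline;
    final boolean asciiOnly;

    private TextScan(boolean hasNewline, boolean asciiOnly) {
        this.hasNewline = hasNewline;
        this.asciiOnly = asciiOnly;
    }

    // Single pass: records whether s contains '\n' and whether it is pure ASCII.
    static TextScan scan(String s) {
        boolean newline = false;
        boolean ascii = true;
        for (int i = 0; i < s.length(); i++) {
            char c = s.charAt(i);
            if (c == '\n') newline = true;
            if (c >= 0x80) ascii = false;
        }
        return new TextScan(newline, ascii);
    }

    // Uses length() on the ASCII fast path, code point count otherwise.
    static int columns(String s) {
        return scan(s).asciiOnly ? s.length() : s.codePointCount(0, s.length());
    }
}
```

Whether the extra comparison per char is measurable is exactly what the benchmark would need to show.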
