std/text/unicode graphemes does not return a list of grapheme clusters to iterate #458

Open
erf opened this issue Feb 1, 2024 · 1 comment · May be fixed by #461

erf commented Feb 1, 2024

I thought graphemes("hi❤️‍🔥") would return the list ["h", "i", "❤️‍🔥"], a list of grapheme clusters that I could iterate with:

  l.foreach fn(c)
    println(c)

which would print out single grapheme clusters like:

h
i
❤️‍🔥

Also, if I print l.length it currently returns 6. I wish there were a function that returned the number of grapheme clusters, which would be 3 in this case.

I'm new to Koka and these libraries, so sorry if I've misunderstood the usage.

The Dart Characters package might serve as inspiration.

TimWhiting (Collaborator) commented Feb 1, 2024

See the definition of

pub fun graphemes( s : string ) : list<grapheme> {

in std/text/unicode for the details on how Koka currently reports graphemes.

Note that a string is a sequence of Unicode code points (char), so doing string.list gives you characters at that granularity rather than grapheme clusters. That is why the length is 6 for this example.
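To see the granularity that string.list works at, here is a minimal sketch that prints each element of the example string with its numeric value. The helper `describe` is a hypothetical name for illustration and is not part of std/text/unicode. The sketch assumes `show-hex` from the core library, with the width as its second argument.

```koka
// Hypothetical helper (not part of the standard library): name the role
// that a character plays in the example string.
fun describe( c : char ) : string
  val i = c.int
  if i == 0x200D then "zero width joiner"
  elif i >= 0xFE00 && i <= 0xFE0F then "variation selector"
  else "base character"

fun main()
  // The literal is: h, i, U+2764 (heart), U+FE0F, U+200D, U+1F525 (fire)
  val cs = "hi❤️‍🔥".list
  println("elements: " ++ cs.length.show)
  cs.foreach fn(c)
    println(c.int.show-hex(4) ++ " " ++ c.describe)
```

For this string the list should have six elements, matching the l.length of 6 reported above. Only three of them (h, i and the heart) start a new visible symbol.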

Here are some adjustments to the current implementation that I think give you what you want:

import std/text/unicode

// Join combining characters with their base into a grapheme.
fun join-combining( cs : list<char>, comb : list<char> = [], acc : list<grapheme> = []) : list<grapheme> {
  match(cs) {
    Cons(zwj, cc) | zwj.int == 0x200D -> // Add zero width joiner
      match cc
        Cons(c, cc') -> cc'.join-combining(Cons(c, Cons(zwj,comb)), acc)
        Nil -> cc.join-combining(Cons(zwj, comb), acc)
    Cons(c,cc) -> if (c.is-combining2)
                   then cc.join-combining( Cons(c,comb), acc )
                   else cc.join-combining( [c], consrev(comb,acc) )
    Nil        -> consrev(comb,acc).reverse
  }
}
fun consrev(xs,xss) {
  if (xs.is-nil) then xss else Cons(xs.reverse.string,xss)
}

pub fun is-combining2( c : char ) : bool {
  val i = c.int
  ((i >= 0x0300 && i <= 0x036F) ||
   (i >= 0x1AB0 && i <= 0x1AFF) ||
   (i >= 0x1DC0 && i <= 0x1DFF) ||
   (i >= 0x20D0 && i <= 0x20FF) ||
   (i >= 0xFE20 && i <= 0xFE2F) ||
   (i >= 0xFE00 && i <= 0xFE0F)) // Added variation selectors
}

fun main()
  "Code points".println
  "hi❤️‍🔥".list.map(show).join(",").println
  "NFC".println // This is the normalization that graphemes gives you
  "hi❤️‍🔥".normalize(NFC).list.join-combining.join(",").println
  "NFD".println
  "hi❤️‍🔥".normalize(NFD).list.join-combining.join(",").println
  "NFKC".println
  "hi❤️‍🔥".normalize(NFKC).list.join-combining.join(",").println
  "NFKD".println
  "hi❤️‍🔥".normalize(NFKD).list.join-combining.join(",").println

All of the different normalization schemes give the same result in this case. I added the zero width joiner to the join-combining function and added variation selectors to the is-combining2 function. I'll have to talk to Daan to see if this is the intended operation of graphemes.

From the API description copied below, it is not clear whether "self-contained symbol" means that the heart, the fire, and the variation selector should be kept separate or not. It seems to me that, since the variation selectors and the zero width joiner have no visual representation of their own, the above changes should be incorporated. Either way, at minimum I think join-combining should be made a public function, with a variant that combines all non-representable (non-visual) code points.

// Grapheme's are an alias for `:string`.
// Each grapheme is a self-contained symbol consisting of
// a unicode character followed by combining characters and/or
// combining marks.
pub alias grapheme = string
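The count of grapheme clusters asked for at the top of the thread could be sketched on the same rule, without building the intermediate strings. This is only a sketch under stated assumptions: `count-clusters` and `is-extender` are hypothetical names, the ranges are copied from is-combining2 above, and only the combining marks, variation selectors, and zero width joiner discussed here are handled, not the full Unicode segmentation rules.

```koka
// Assumption: same ranges as `is-combining2` above
// (combining marks plus variation selectors).
fun is-extender( c : char ) : bool
  val i = c.int
  ((i >= 0x0300 && i <= 0x036F) ||
   (i >= 0x1AB0 && i <= 0x1AFF) ||
   (i >= 0x1DC0 && i <= 0x1DFF) ||
   (i >= 0x20D0 && i <= 0x20FF) ||
   (i >= 0xFE20 && i <= 0xFE2F) ||
   (i >= 0xFE00 && i <= 0xFE0F))

// Count the characters that start a new cluster. A character does not start
// a cluster if it is a zero width joiner, an extender, or directly follows
// a zero width joiner.
fun count-clusters( cs : list<char>, after-zwj : bool = False, n : int = 0 ) : int
  match cs
    Nil -> n
    Cons(c, rest) ->
      if c.int == 0x200D then rest.count-clusters(True, n)
      elif c.is-extender || after-zwj then rest.count-clusters(False, n)
      else rest.count-clusters(False, n + 1)

fun main()
  "hi❤️‍🔥".list.count-clusters.println
```

For the example string only h, i and the heart start a cluster, so this should print 3. Unlike join-combining, a leading joiner or combining mark is not counted as its own cluster here. A real implementation would need to decide how to treat that case.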

TimWhiting linked a pull request Feb 3, 2024 that will close this issue
TimWhiting added the bug label Feb 7, 2024