Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Error message pretty printing is unaware of width of a character #370

Open
suhdonghwi opened this issue Sep 15, 2019 · 17 comments
Open

Error message pretty printing is unaware of width of a character #370

suhdonghwi opened this issue Sep 15, 2019 · 17 comments
Labels

Comments

@suhdonghwi
Copy link

errorBundlePretty is not working properly if input contains full-width character.

image

Character pointer (^) should be pointing '이', but it is pointing the wrong position. Full-width character should be replaced with two spaces, not one.

@mrkkrp
Copy link
Owner

mrkkrp commented Sep 15, 2019

Possibly related to #362.

@mrkkrp mrkkrp added this to Backlog in megaparsec-8.0.0 Sep 15, 2019
@mrkkrp mrkkrp added the bug label Sep 15, 2019
@mrkkrp mrkkrp added this to the 8.0.0 milestone Sep 15, 2019
@mrkkrp
Copy link
Owner

mrkkrp commented Oct 26, 2019

@simonvandel Can you please provide the example as unicode text, not as an image?

@suhdonghwi
Copy link
Author

@mrkkrp Did you mean to refer me? Then the example text is "123 구구 이면".

@mrkkrp
Copy link
Owner

mrkkrp commented Oct 26, 2019

Yes, sorry. Thanks for the example!

@mrkkrp
Copy link
Owner

mrkkrp commented Oct 26, 2019

OK, let's see. You're saying that full width characters should be replaced with two spaces. What we have at this point is column position. Simply put, we insert as many spaces as necessary to reach the same column position. So are you of the opinion that we need to increment current column by 2 instead of 1 for full width characters? Is this something that software working with Korean characters usually does?

@mrkkrp mrkkrp moved this from Backlog to In progress in megaparsec-8.0.0 Oct 26, 2019
@suhdonghwi
Copy link
Author

Yes, it is usual practice because in monospaced font, two half-width spaces and one full-width character (ex. Korean syllable) are always the same width. So I think it is safe to put two spaces when it encounters full-width character.

@mrkkrp
Copy link
Owner

mrkkrp commented Oct 26, 2019

What I'm more concerned about, is that it may be confusing to see position like 1:5 when you only consumed two characters.

@mrkkrp
Copy link
Owner

mrkkrp commented Oct 26, 2019

Also need to find a good way to detect full-width characters.

@suhdonghwi
Copy link
Author

I don't know how it's working inside so I can't really know what is proper solution for that problem. But maybe it is possible to make use of full-width space character? It is a single space character but full-width.

Regarding the detection of full-width characters, there is a method in ICU library to check if a character is full-width or not. But ICU library itself is pretty heavy, so you can extract the method from the library.

@mrkkrp
Copy link
Owner

mrkkrp commented Oct 26, 2019

But maybe it is possible to make use of full-width space character?

Alas, we do not have information which characters were full-width and which ones were normal by the time we're printing parse errors. As I mentioned, we only have column position, so we have to work with that.

@mrkkrp
Copy link
Owner

mrkkrp commented Oct 27, 2019

I looked at this a bit today and I do not know how to check for full-width characters efficiently and without depending on a library. Even in text-icu you can get Unicode category and check if it's HalfwidthAndFullwidthForms but according to the Wikipedia's page about this Unicode block (and it sort of follows from the name of the block) it's not guaranteed that those characters will be full width, it's in fact a mix of the two.

@mrkkrp mrkkrp removed this from the 8.0.0 milestone Oct 27, 2019
@mrkkrp mrkkrp removed this from In progress in megaparsec-8.0.0 Oct 27, 2019
@ivan-timokhin
Copy link

The relevant property here is probably EastAsianWidth. The detailed description of individual property values is given in UAX #11, but the high-level summary seems to be

Attribute value Width
EANeutral narrow
EAAmbiguous ambiguous
EAHalf narrow
EAFull wide
EANarrow narrow 
EAWide wide
EACount Not a valid value; probably an artefact of translation from C

If you don't want to depend on text-icu, all the relevant information is contained in EastAsianWidth.txt, in a fairly straightforward format.

@obfusk
Copy link

obfusk commented May 10, 2020

Using text-icu, a fix for errorBundlePretty seems pretty easy:

Change

rpadding = replicate rpshift ' '

to

rpadding = [ if isWide c then ' ' else ' ' | c <- take rpshift sline ]

Using

import Data.Text.ICU.Char (property, EastAsianWidth(..), EastAsianWidth_(..))
isWide c = property EastAsianWidth c `elem` [EAFull, EAWide]

I'd be happy to make a PR for that.

@mrkkrp
Copy link
Owner

mrkkrp commented May 10, 2020

The question is only whether or not adding an extra dependency is worth it. In the past people were really upset about each and every extra dependency.

@recursion-ninja
Copy link

I'd prefer the extra dependency if it means better Unicode support in the error message rendering. One of the main attractions of megaparsec (for me and my team) is the high quality, effortless error messages.

@obfusk
Copy link

obfusk commented May 10, 2020

I get that. In that case I see 4 options:

  1. don't do anything
  2. add the text-icu dependency (easy but not ideal)
  3. add code for EAW directly (ugly)
  4. add an errorBundlePrettyThatHandlesCharacterWidth to the API that adds the isWide predicate as an extra parameter to errorBundlePretty (also not ideal but allows downstream users like me who don't mind depending on text-icu to handle these cases fairly easily)

I'd be happy to help with any of those :)

@obfusk
Copy link

obfusk commented Sep 7, 2020

I wrote a few lines of python to generate the relevant code point ranges:

#!/usr/bin/python3
import itertools as IT, unicodedata as UD

def grouper(iterable, n):
  args = [iter(iterable)] * n
  return IT.zip_longest(*args, fillvalue = None)

ranges, start = [], None
for i in range(0x10FFFF):
  if UD.category(chr(i)) != "Cn" and UD.east_asian_width(chr(i)) in "FW":
    if start is None: start = i
  else:
    if start is not None:
      ranges.append((start, i-1))
      start = None

print("(0, {}) [".format(len(ranges)-1))
print("    " + ",\n    ".join(
  ", ".join( "(0x{:06x}, 0x{:06x})".format(*r) for r in g if r )
  for g in grouper(ranges, 3)
))
print("  ]")

which then results in this bit of haskell:

#!/usr/bin/runhaskell

import Data.Char (ord)
import Data.Array (Array, Ix, (!), listArray, bounds)
import System.Environment (getArgs)

isWide :: Char -> Bool
isWide c = go $ bounds wideRanges
  where
    go (lo, hi)
        | hi < lo           = False
        | a <= n && n <= b  = True
        | n < a             = go (lo, pred mid)
        | otherwise         = go (succ mid, hi)
      where
        mid     = (lo + hi) `div` 2
        (a, b)  = wideRanges ! mid
    n = ord c

wideRanges :: Array Int (Int, Int)
wideRanges = listArray (0, 118) [
    (0x001100, 0x00115f), (0x00231a, 0x00231b), (0x002329, 0x00232a),
    (0x0023e9, 0x0023ec), (0x0023f0, 0x0023f0), (0x0023f3, 0x0023f3),
    (0x0025fd, 0x0025fe), (0x002614, 0x002615), (0x002648, 0x002653),
    (0x00267f, 0x00267f), (0x002693, 0x002693), (0x0026a1, 0x0026a1),
    (0x0026aa, 0x0026ab), (0x0026bd, 0x0026be), (0x0026c4, 0x0026c5),
    (0x0026ce, 0x0026ce), (0x0026d4, 0x0026d4), (0x0026ea, 0x0026ea),
    (0x0026f2, 0x0026f3), (0x0026f5, 0x0026f5), (0x0026fa, 0x0026fa),
    (0x0026fd, 0x0026fd), (0x002705, 0x002705), (0x00270a, 0x00270b),
    (0x002728, 0x002728), (0x00274c, 0x00274c), (0x00274e, 0x00274e),
    (0x002753, 0x002755), (0x002757, 0x002757), (0x002795, 0x002797),
    (0x0027b0, 0x0027b0), (0x0027bf, 0x0027bf), (0x002b1b, 0x002b1c),
    (0x002b50, 0x002b50), (0x002b55, 0x002b55), (0x002e80, 0x002e99),
    (0x002e9b, 0x002ef3), (0x002f00, 0x002fd5), (0x002ff0, 0x002ffb),
    (0x003000, 0x00303e), (0x003041, 0x003096), (0x003099, 0x0030ff),
    (0x003105, 0x00312f), (0x003131, 0x00318e), (0x003190, 0x0031ba),
    (0x0031c0, 0x0031e3), (0x0031f0, 0x00321e), (0x003220, 0x003247),
    (0x003250, 0x004db5), (0x004e00, 0x009fef), (0x00a000, 0x00a48c),
    (0x00a490, 0x00a4c6), (0x00a960, 0x00a97c), (0x00ac00, 0x00d7a3),
    (0x00f900, 0x00fa6d), (0x00fa70, 0x00fad9), (0x00fe10, 0x00fe19),
    (0x00fe30, 0x00fe52), (0x00fe54, 0x00fe66), (0x00fe68, 0x00fe6b),
    (0x00ff01, 0x00ff60), (0x00ffe0, 0x00ffe6), (0x016fe0, 0x016fe3),
    (0x017000, 0x0187f7), (0x018800, 0x018af2), (0x01b000, 0x01b11e),
    (0x01b150, 0x01b152), (0x01b164, 0x01b167), (0x01b170, 0x01b2fb),
    (0x01f004, 0x01f004), (0x01f0cf, 0x01f0cf), (0x01f18e, 0x01f18e),
    (0x01f191, 0x01f19a), (0x01f200, 0x01f202), (0x01f210, 0x01f23b),
    (0x01f240, 0x01f248), (0x01f250, 0x01f251), (0x01f260, 0x01f265),
    (0x01f300, 0x01f320), (0x01f32d, 0x01f335), (0x01f337, 0x01f37c),
    (0x01f37e, 0x01f393), (0x01f3a0, 0x01f3ca), (0x01f3cf, 0x01f3d3),
    (0x01f3e0, 0x01f3f0), (0x01f3f4, 0x01f3f4), (0x01f3f8, 0x01f43e),
    (0x01f440, 0x01f440), (0x01f442, 0x01f4fc), (0x01f4ff, 0x01f53d),
    (0x01f54b, 0x01f54e), (0x01f550, 0x01f567), (0x01f57a, 0x01f57a),
    (0x01f595, 0x01f596), (0x01f5a4, 0x01f5a4), (0x01f5fb, 0x01f64f),
    (0x01f680, 0x01f6c5), (0x01f6cc, 0x01f6cc), (0x01f6d0, 0x01f6d2),
    (0x01f6d5, 0x01f6d5), (0x01f6eb, 0x01f6ec), (0x01f6f4, 0x01f6fa),
    (0x01f7e0, 0x01f7eb), (0x01f90d, 0x01f971), (0x01f973, 0x01f976),
    (0x01f97a, 0x01f9a2), (0x01f9a5, 0x01f9aa), (0x01f9ae, 0x01f9ca),
    (0x01f9cd, 0x01f9ff), (0x01fa70, 0x01fa73), (0x01fa78, 0x01fa7a),
    (0x01fa80, 0x01fa82), (0x01fa90, 0x01fa95), (0x020000, 0x02a6d6),
    (0x02a700, 0x02b734), (0x02b740, 0x02b81d), (0x02b820, 0x02cea1),
    (0x02ceb0, 0x02ebe0), (0x02f800, 0x02fa1d)
  ]

main :: IO ()
main = mapM_ (print . isWide) . concat =<< getArgs

tomjaguarpaw pushed a commit to tomjaguarpaw/megaparsec that referenced this issue Sep 29, 2022
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

No branches or pull requests

5 participants