Typed dict throws KeyError when keys contain any UTF-8 character ends with \xb8\x80 #9542

Open
M0gician opened this issue Apr 24, 2024 · 13 comments
Labels
bug - incorrect behavior Bugs: incorrect behavior

Comments


M0gician commented Apr 24, 2024

Reporting a bug

A Numba typed dict whose key type is UnicodeCharSeq() of any length does not correctly handle characters whose UTF-8 encoding ends with \xb8\x80. Any such character appears to be cast to an empty string when __getitem__ is called, which results in a KeyError.

Minimum Reproduction Demo

import numba

a = numba.typed.typeddict.Dict.empty(numba.types.UnicodeCharSeq(1), numba.int64)
a['一'] = 10    # \xe4\xb8\x80
print(a)
  • The same failure occurs with other characters whose UTF-8 encoding ends with \xb8\x80, such as 㸀 (\xe3\xb8\x80), 渀 (\xe6\xb8\x80), 縀 (\xe7\xb8\x80), and 帀 (\xe5\xb8\x80).
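
A quick way to see what these characters have in common, using only the standard library. A UTF-8 encoding that ends in \xb8\x80 implies a code point whose low byte is 0x00 (U+4E00, U+3E00, and so on), so the fixed-width UTF-32 little-endian form of each character starts with a zero byte. The variable names below are illustrative only.

```python
# Characters from the report; each has a UTF-8 encoding ending in b"\xb8\x80".
chars = ["一", "㸀", "渀", "縀", "帀"]

rows = []
for ch in chars:
    utf8 = ch.encode("utf-8")
    utf32 = ch.encode("utf-32-le")  # fixed-width 4-byte form, little-endian
    rows.append((ch, utf8, ord(ch), utf32))
    print(f"{ch}  utf-8={utf8!r}  code point=U+{ord(ch):04X}  utf-32-le={utf32!r}")
```

Whether the leading zero byte is what triggers the KeyError is not established at this point in the report; the check only shows the shared property.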

Error Message

Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "...\.vscode\extensions\ms-python.python-2024.4.1\python_files\pythonrc.py", line 22, in my_displayhook
    self.original_displayhook(value)
  File "...\mamba\lib\site-packages\numba\typed\typeddict.py", line 217, in __repr__
    body = str(self)
  File "...\mamba\lib\site-packages\numba\typed\typeddict.py", line 212, in __str__
    for k, v in self.items():
  File "...\mamba\lib\_collections_abc.py", line 911, in __iter__
    yield (key, self._mapping[key])
  File "...\mamba\lib\site-packages\numba\typed\typeddict.py", line 180, in __getitem__
    return _getitem(self, key)
  File "...\mamba\lib\site-packages\numba\typed\dictobject.py", line 783, in impl
    raise KeyError()
KeyError

numba 0.59.1

Member

esc commented Apr 26, 2024

@M0gician thank you for the report, I can reproduce this. Maybe using #9530 will yield some more insight into why this is going wrong.

Author

M0gician commented Apr 26, 2024

Oh, I've done something similar on my end. The key here will be converted into an empty string.

KeyError: 'Key "" of type <object type:typeref[[unichr x 1]]> not found in dict'

BTW, here is the byte string for the typed dict before it gets unpickled.

b'\x80\x04\x95"\x00\x00\x00\x00\x00\x00\x00\x8c\x15numba.typed.typeddict\x94\x8c\x04Dict\x94\x93\x94.'

Also, if you change the typed dict's types to [numba.types.unicode_type, numba.int64], everything works and the KeyError no longer occurs.

Member

esc commented Apr 26, 2024

Yes, as a user of typed.Dict you are in control of the key type. I looked into outlines-dev/outlines#833 and probably the issue can be fixed there?

Author

M0gician commented Apr 26, 2024

Haven't heard from the outlines team yet but I did provide a solution to alleviate the issue for now

But I do have a question: what if I want to pass the keys by a numpy array? What numpy dtype is compatible with numba's unicode_type?

Member

esc commented Apr 26, 2024

OK, because I suspect that the type UnicodeCharSeq is either the wrong type or buggy. When I do a numba.typeof I get this:

In [1]: s = '一'

In [2]: import numba

In [3]: numba.typeof(s)
Out[3]: unicode_type

So, to narrow this down: it is possible that this isn't a typed.Dict issue.

Member

esc commented Apr 26, 2024

I think I understand now what is being attempted. You want the keys of the dictionary to also be values in a numpy.ndarray? I am not sure yet how to do that, perhaps you'd need to cast the value?

Author

M0gician commented Apr 26, 2024

It turns out numpy dtype "U" of any length is recognized as "unicode_type" in numba. I'll make a PR on outlines side to address this issue.

>>> import numba
>>> import numpy as np
>>> a = np.array(["一", "二", "三"])
>>> a
array(['一', '二', '三'], dtype='<U1')
>>> aa = numba.typed.typeddict.Dict.empty(numba.types.unicode_type, numba.int64)
>>> for i, c in enumerate(a):
...     aa[c] = i
...
>>> print(aa)
{一: 0, 二: 1, 三: 2}
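
One possible reason this loop works, stated as an assumption rather than a confirmed detail of Numba's type dispatch: iterating a "U" array in the interpreter yields np.str_ scalars, and np.str_ is a subclass of Python's str. A small check using only NumPy:

```python
import numpy as np

a = np.array(["一", "二", "三"])
first = a[0]  # indexing (or iterating) yields a NumPy scalar, not a plain str

print(a.dtype.kind, a.dtype.itemsize)  # dtype kind and bytes per element
print(type(first).__name__)            # name of the scalar type
print(isinstance(first, str))          # whether it is also a Python str
```

If that assumption holds, the keys reach the typed dict as ordinary strings, so the UnicodeCharSeq code path is never exercised in this session.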

Member

esc commented Apr 26, 2024

@M0gician I actually tried the following, and got stuck:

In [5]: @numba.njit
   ...: def function():
   ...:     s = '一'
   ...:     a = np.array([s])
   ...:     return a
   ...:

In [6]: function()
---------------------------------------------------------------------------
TypingError                               Traceback (most recent call last)
Cell In[6], line 1
----> 1 function()

File ~/git/numba/numba/core/dispatcher.py:423, in _DispatcherBase._compile_for_args(self, *args, **kws)
    419         msg = (f"{str(e).rstrip()} \n\nThis error may have been caused "
    420                f"by the following argument(s):\n{args_str}\n")
    421         e.patch_message(msg)
--> 423     error_rewrite(e, 'typing')
    424 except errors.UnsupportedError as e:
    425     # Something unsupported is present in the user code, add help info
    426     error_rewrite(e, 'unsupported_error')

File ~/git/numba/numba/core/dispatcher.py:364, in _DispatcherBase._compile_for_args.<locals>.error_rewrite(e, issue_type)
    362     raise e
    363 else:
--> 364     raise e.with_traceback(None)

TypingError: Failed in nopython mode pipeline (step: nopython frontend)
No implementation of function Function(<built-in function array>) found for signature:

 >>> array(list(unicode_type)<iv=['一']>)

There are 2 candidate implementations:
  - Of which 1 did not match due to:
  Overload in function 'impl_np_array': File: numba/np/arrayobj.py: Line 5432.
    With argument(s): '(list(unicode_type)<iv=None>)':
   Rejected as the implementation raised a specific error:
     TypingError: Failed in nopython mode pipeline (step: nopython frontend)
   No implementation of function Function(<intrinsic np_array>) found for signature:

    >>> np_array(list(unicode_type)<iv=None>, none)

   There are 2 candidate implementations:
         - Of which 2 did not match due to:
         Intrinsic in function 'np_array': File: numba/np/arrayobj.py: Line 5406.
           With argument(s): '(list(unicode_type)<iv=None>, none)':
          Rejected as the implementation raised a specific error:
            NumbaNotImplementedError: unicode_type cannot be represented as a NumPy dtype
     raised from /Users/vhaenel/git/numba/numba/np/numpy_support.py:159

   During: resolving callee type: Function(<intrinsic np_array>)
   During: typing of call at /Users/vhaenel/git/numba/numba/np/arrayobj.py (5443)


   File "numba/np/arrayobj.py", line 5443:
       def impl(object, dtype=None):
           return np_array(object, dtype)
           ^

  raised from /Users/vhaenel/git/numba/numba/core/typeinfer.py:1091
  - Of which 1 did not match due to:
  Overload in function 'impl_np_array': File: numba/np/arrayobj.py: Line 5432.
    With argument(s): '(list(unicode_type)<iv=['一']>)':
   Rejected as the implementation raised a specific error:
     TypingError: Failed in nopython mode pipeline (step: nopython frontend)
   No implementation of function Function(<intrinsic np_array>) found for signature:

    >>> np_array(list(unicode_type)<iv=['一']>, none)

   There are 2 candidate implementations:
         - Of which 2 did not match due to:
         Intrinsic in function 'np_array': File: numba/np/arrayobj.py: Line 5406.
           With argument(s): '(list(unicode_type)<iv=None>, none)':
          Rejected as the implementation raised a specific error:
            NumbaNotImplementedError: unicode_type cannot be represented as a NumPy dtype
     raised from /Users/vhaenel/git/numba/numba/np/numpy_support.py:159

   During: resolving callee type: Function(<intrinsic np_array>)
   During: typing of call at /Users/vhaenel/git/numba/numba/np/arrayobj.py (5443)


   File "numba/np/arrayobj.py", line 5443:
       def impl(object, dtype=None):
           return np_array(object, dtype)
           ^

  raised from /Users/vhaenel/git/numba/numba/core/typeinfer.py:1091

During: resolving callee type: Function(<built-in function array>)
During: typing of call at <ipython-input-5-726a1c6ac68f> (4)


File "<ipython-input-5-726a1c6ac68f>", line 4:
def function():
    <source elided>
    s = '一'
    a = np.array([s])

Author

M0gician commented Apr 26, 2024

The following code compiles and runs, but the original problem reappears: the affected character is converted to an empty string.

import numba
import numpy as np

@numba.njit
def function():
    s = np.empty(3, dtype="<U1")
    s[0] = "一"
    s[1] = "二"
    s[2] = "三"
    print(s)
    a = numba.typed.List(s)
    return a

print(function())
['' '' '']
[, 二, 三]

Member

esc commented Apr 30, 2024

@M0gician this was discussed in the developer meeting today and something fishy is going on here. This may be a Numba bug after all.

Author

M0gician commented Apr 30, 2024

I agree. Numba handles Python objects and NumPy objects inconsistently.

When string literals or Python objects such as List[str] are passed to a Numba typed list or dict, they are treated as unicode_type. However, when they are wrapped in a NumPy array, they are mostly treated as UnicodeCharSeq (unichr), which causes many casting problems.

I did find a way to solve the problem on the outlines side by tweaking the code to pass either Python objects or NumPy ones.

NumPy has only one unicode type while Numba has two, and it is unclear to me when and why the NumPy "U" type is cast to one Numba unicode type rather than the other. I think it would be better to make this consistent on the Numba side.

Member

esc commented Apr 30, 2024

Good that you have a workaround for now and yes this sounds like an interesting brain teaser.

Member

sklam commented Apr 30, 2024

Found the problem.

First, the '一' character is stored as the following bytes:

>>> np.array(['一']).tobytes()
b'\x00N\x00\x00'
>>> list(map(hex, np.array(['一']).tobytes()))
['0x0', '0x4e', '0x0', '0x0']

The boxer for UnicodeCharSeq has an invalid skip on a null byte, which causes the copy to end prematurely:

numba/numba/core/boxing.py

Lines 237 to 238 in e467ae6

# If the char is a non-null-byte, store the next index as count
with c.builder.if_then(cgutils.is_not_null(c.builder, ch)):
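
To illustrate why that check misfires, here is a pure-Python model of the loop. It rests on an assumption drawn from the byte dump above: that the null test sees only the first byte of each 4-byte character, which is 0x00 for '一' in little-endian layout. The function names are hypothetical and nothing here is Numba code.

```python
CHAR_SIZE = 4  # bytes per character in a NumPy "U" buffer (UTF-32)


def length_first_byte(buf: bytes) -> int:
    """Model of the reported check: test only the first byte of each character."""
    count = 0
    for idx in range(len(buf) // CHAR_SIZE):
        if buf[idx * CHAR_SIZE] != 0:
            count = idx + 1
    return count


def length_whole_char(buf: bytes) -> int:
    """Hypothetical alternative: treat a character as null only if all bytes are zero."""
    count = 0
    for idx in range(len(buf) // CHAR_SIZE):
        if any(buf[idx * CHAR_SIZE:(idx + 1) * CHAR_SIZE]):
            count = idx + 1
    return count


def decode(buf: bytes, length: int) -> str:
    """Decode the first `length` characters of a little-endian UTF-32 buffer."""
    return buf[:length * CHAR_SIZE].decode("utf-32-le")


one = "一".encode("utf-32-le")                    # b'\x00N\x00\x00'
padded = "a一".encode("utf-32-le") + bytes(4)     # 3-character slot, null padded

print(repr(decode(one, length_first_byte(one))))
print(repr(decode(one, length_whole_char(one))))
print(repr(decode(padded, length_first_byte(padded))))
print(repr(decode(padded, length_whole_char(padded))))
```

In this model the first-byte test drops '一' entirely and truncates "a一" to "a", which matches the empty keys reported earlier in the thread. The whole-character variant is only a point of comparison and is not the patch shown in this comment.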

The minimal patch is:

diff --git a/numba/core/boxing.py b/numba/core/boxing.py
index 39d2a6047..be6b8eb2b 100644
--- a/numba/core/boxing.py
+++ b/numba/core/boxing.py
@@ -234,9 +234,9 @@ def box_unicodecharseq(typ, val, c):
     with cgutils.loop_nest(c.builder, [fullsize], fullsize.type) as [idx]:
         # Get char at idx
         ch = c.builder.load(c.builder.gep(strptr, [c.builder.mul(idx, step)]))
-        # If the char is a non-null-byte, store the next index as count
-        with c.builder.if_then(cgutils.is_not_null(c.builder, ch)):
-            c.builder.store(c.builder.add(idx, one), count)
+        # # If the char is a non-null-byte, store the next index as count
+        # with c.builder.if_then(cgutils.is_not_null(c.builder, ch)):
+        c.builder.store(c.builder.add(idx, one), count)
     strlen = c.builder.load(count)
     return c.pyapi.string_from_kind_and_data(kind, strptr, strlen)

However, there's another problem: the boxer is returning a str instead of the unicode-charseq dtype:

return c.pyapi.string_from_kind_and_data(kind, strptr, strlen)

Alternative reproducer:

import numba
import numpy as np


@numba.njit
def foo(s):
    a = np.zeros(1, dtype="<U1")
    a[0] = s
    print(a)
    x = a[0]
    print(x)
    return (x,)

got = foo('一')
expect = foo.py_func('一')
print(repr(expect))
print(repr(got))

print(expect[0].tobytes())
print(got[0].tobytes())
