Skip to content
This repository has been archived by the owner on Jul 19, 2023. It is now read-only.

Unicode error #55

Open
jj0hns0n opened this issue Nov 1, 2016 · 5 comments
Open

Unicode error #55

jj0hns0n opened this issue Nov 1, 2016 · 5 comments

Comments

@jj0hns0n
Copy link
Member

jj0hns0n commented Nov 1, 2016

In django-osgeo-importer, visit /uploads/new and click Choose File. Select the file linked below. It should continue to the next step of the upload process but halts with a UnicodeDecodeError.

https://eventkit.s3.amazonaws.com/0e473d86-6953-47d0-8dda-721b94039e1d/osm-vector/portauprince.gpkg

@ghost
Copy link

ghost commented Nov 2, 2016

For supplementary info (supplemental info?) here is the traceback I got locally against Exchange.

Environment:


Request Method: POST
Request URL: http://exchange/uploads/new

Django Version: 1.8.16
Python Version: 2.7.11
Installed Applications:
('appearance',
 'flat',
 'exchange.core',
 'geonode',
 'geonode.contrib.geogig',
 'geonode.contrib.slack',
 'django_classification_banner',
 'maploom',
 'solo',
 'haystack',
 'corsheaders',
 'exchange-docs',
 'osgeo_importer',
 'modeltranslation',
 'django.contrib.auth',
 'django.contrib.contenttypes',
 'django.contrib.sessions',
 'django.contrib.sites',
 'django.contrib.admin',
 'django.contrib.sitemaps',
 'django.contrib.staticfiles',
 'django.contrib.messages',
 'django.contrib.humanize',
 'django.contrib.gis',
 'pagination',
 'taggit',
 'treebeard',
 'friendlytagloader',
 'geoexplorer',
 'leaflet',
 'django_extensions',
 'autocomplete_light',
 'mptt',
 'djcelery',
 'storages',
 'pinax_theme_bootstrap_account',
 'pinax_theme_bootstrap',
 'django_forms_bootstrap',
 'account',
 'avatar',
 'dialogos',
 'agon_ratings',
 'announcements',
 'actstream',
 'user_messages',
 'tastypie',
 'polymorphic',
 'guardian',
 'geonode.people',
 'geonode.base',
 'geonode.layers',
 'geonode.maps',
 'geonode.proxy',
 'geonode.security',
 'geonode.social',
 'geonode.catalogue',
 'geonode.documents',
 'geonode.api',
 'geonode.groups',
 'geonode.services',
 'geonode.geoserver',
 'geonode.upload',
 'geonode.tasks')
Installed Middleware:
('django.middleware.common.CommonMiddleware',
 'django.contrib.sessions.middleware.SessionMiddleware',
 'django.contrib.messages.middleware.MessageMiddleware',
 'django.middleware.locale.LocaleMiddleware',
 'pagination.middleware.PaginationMiddleware',
 'django.middleware.csrf.CsrfViewMiddleware',
 'django.contrib.auth.middleware.AuthenticationMiddleware',
 'django.middleware.clickjacking.XFrameOptionsMiddleware',
 'geonode.security.middleware.LoginRequiredMiddleware',
 'geonode.security.middleware.LoginRequiredMiddleware',
 'corsheaders.middleware.CorsMiddleware',
 'whitenoise.middleware.WhiteNoiseMiddleware')


Traceback:
File "/env/lib/python2.7/site-packages/django/core/handlers/base.py" in get_response
  132.                     response = wrapped_callback(request, *callback_args, **callback_kwargs)
File "/env/lib/python2.7/site-packages/django/contrib/auth/decorators.py" in _wrapped_view
  22.                 return view_func(request, *args, **kwargs)
File "/env/lib/python2.7/site-packages/django/views/generic/base.py" in view
  71.             return self.dispatch(request, *args, **kwargs)
File "/env/lib/python2.7/site-packages/django/views/generic/base.py" in dispatch
  89.         return handler(request, *args, **kwargs)
File "/env/lib/python2.7/site-packages/django/views/generic/edit.py" in post
  215.             return self.form_valid(form)
File "/mnt/importer/osgeo_importer/views.py" in form_valid
  212.                     upload.uploadlayer_set.add(upload_layer)
File "/env/lib/python2.7/site-packages/django/db/models/fields/related.py" in add
  750.                     obj.save()
File "/env/lib/python2.7/site-packages/django/db/models/base.py" in save
  734.                        force_update=force_update, update_fields=update_fields)
File "/env/lib/python2.7/site-packages/django/db/models/base.py" in save_base
  762.             updated = self._save_table(raw, cls, force_insert, force_update, using, update_fields)
File "/env/lib/python2.7/site-packages/django/db/models/base.py" in _save_table
  846.             result = self._do_insert(cls._base_manager, using, fields, update_pk, raw)
File "/env/lib/python2.7/site-packages/django/db/models/base.py" in _do_insert
  885.                                using=using, raw=raw)
File "/env/lib/python2.7/site-packages/django/db/models/manager.py" in manager_method
  127.                 return getattr(self.get_queryset(), name)(*args, **kwargs)
File "/env/lib/python2.7/site-packages/django/db/models/query.py" in _insert
  920.         return query.get_compiler(using=using).execute_sql(return_id)
File "/env/lib/python2.7/site-packages/django/db/models/sql/compiler.py" in execute_sql
  973.             for sql, params in self.as_sql():
File "/env/lib/python2.7/site-packages/django/db/models/sql/compiler.py" in as_sql
  931.                 for obj in self.query.objs
File "/env/lib/python2.7/site-packages/django/db/models/fields/__init__.py" in get_db_prep_save
  710.                                       prepared=False)
File "/env/lib/python2.7/site-packages/jsonfield/fields.py" in get_db_prep_value
  89.         return self.get_prep_value(value)
File "/env/lib/python2.7/site-packages/jsonfield/fields.py" in get_prep_value
  96.         return json.dumps(value, **self.encoder_kwargs)
File "/usr/local/lib/python2.7/json/__init__.py" in dumps
  244.         return _default_encoder.encode(obj)
File "/usr/local/lib/python2.7/json/encoder.py" in encode
  207.         chunks = self.iterencode(o, _one_shot=True)
File "/usr/local/lib/python2.7/json/encoder.py" in iterencode
  270.         return _iterencode(o, 0)

Exception Type: UnicodeDecodeError at /uploads/new
Exception Value: 'utf8' codec can't decode byte 0xa5 in position 16: invalid start byte

@ghost
Copy link

ghost commented Nov 2, 2016

Our line of interest is probably osgeo_importer/views.py in form_valid: upload.uploadlayer_set.add(upload_layer).

Django hints The string that could not be encoded/decoded was: umber�,z�.

This substring is in layer['fields'] list / description['fields'] list:

...
{'name': 'addr_housenumber\xa5,z\x7f', 'type': 'IntegerList'},
             {'name': 'addr_streetumber\xa5,z\x7f', 'type': 'Real'},
...

To be specific, the traceback says 'utf8' codec can't decode byte 0xa5 in position 16: invalid start byte - so we know that we can't just decode this blob of bytes as utf-8. What, then, is the intended encoding?

While it isn't a correct or robust way for us to handle byte blobs we receive, chardet can give us a clue what these particular bytes were intended to be:

Python 2.7.12 (default, Jul  1 2016, 15:12:24) 
[GCC 5.4.0 20160609] on linux2
Type "help", "copyright", "credits" or "license" for more information.
>>> import chardet
>>> a = 'addr_housenumber\xa5,z\x7f'
>>> chardet.detect(a)
{'confidence': 0.8406238024693018, 'encoding': 'ISO-8859-2'}

That's a little funny, because Wikipedia gives the following list of languages as likely to use this character set:


    Albanian
    Bosnian
    Croatian
    Czech
    German (fully compatible with ISO/IEC 8859-1 for German texts)
    Hungarian
    Polish
    Romanian
    Serbian Latin
    Slovak
    Slovene
    Upper Sorbian
    Lower Sorbian
    Turkmen

Why can't we just use chardet? (1) some bytestrings are just nonsensical, they don't really fit in any character set. (2) Sometimes the guess is wrong, and then we will be making nonsense characters (aka mojibake).

Can we just delete the wrong characters? Well, what if there are none which aren't wrong - then do we give the field a default name? In any case, this means destroying input data. But it does seem safer to replace the whole string than to unpredictably try to salvage something from what may be totally corrupted nonsense.

The right solution would be, if the gpkg provides a clue about what encoding field names will be in, and we get GDAL to use that clue, or at worst give that clue to us so we can fix things.

If on the other hand gpkg is being abused by certain files we can find in the wild, we can feel more justified in "losing data" because the data was already lost due to someone else's mistake.

If we really have no options, then the best thing to do would be to treat field names only as bytestrings and never as characters (because the moment they are treated as characters, this encoding issue will arise again).

To sum up, in order of preference

  1. The gpkg parser handles the encoding of field names for us
  2. The gpkg parser hands us the encoding so we can decode field names
  3. We can document that this file is abusing gpkg format as far as its encoding of field names and/or its identification of their encoding, and cite this as a reason to use some nondescript default field name instead, and spit out some descriptive warning that users will notice, and document that this may happen so at least users find something when they google the issue
  4. we adopt a strict discipline of only treating field names as mere arbitrary bytestrings, never as human-readable text which must be decoded for human consumption or passed to any python callables which interpret them as human-readable txt

@ghost
Copy link

ghost commented Nov 2, 2016

We could also just explicitly and verbosely barf on the whole file and reject it for the user to worry about, that's another option.

@JivanAmara
Copy link
Collaborator

This is looking like a gpkg creation error rather than encoding confusion. I vote for rejecting the file with some actionable details for the user.

@ingenieroariel
Copy link
Member

ingenieroariel commented Dec 2, 2016 via email

Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
None yet
Projects
None yet
Development

No branches or pull requests

3 participants