Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Decompress util #244

Open
wants to merge 7 commits into
base: develop
Choose a base branch
from
Open

Decompress util #244

wants to merge 7 commits into from

Conversation

cuducos
Copy link
Contributor

@cuducos cuducos commented Sep 14, 2017

This PR implements rows.utils.decompress as suggested in #230. The API is:

import rows
from rows.utils import decompress

table1 = rows.import_from_csv(decompress('filename.csv.gz'))
table2 = rows.import_from_csv(decompress('filename.csv.xz'))
table3 = rows.import_from_csv(decompress('filename.csv.bz2'))

A RunTime error is raised if:

  • rows can't properly guess the mimetype as a known mimetype of a lzma or gzip file
  • rows identifies a lzma mimetype but the lzma module is not available (a lzma lib has to be available in the OS when compiling Python it self as explained here)

@cuducos cuducos changed the title Cuducos decompress Decompress util Sep 14, 2017
Copy link
Owner

@turicas turicas left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for this PR! :)
I've requested some changes/questions and we can discuss about them. =)

rows/utils.py Outdated
lzma_mime_types = (
'application/x-xz',
'application/x-lzma'
)
Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think mimetype detection is not needed here - we can expect the user will call this function only if she knows the file is compressed and in one of the supported algorithms; we can do this detection automatically on the command-line interface using file-magic and then pass the correct arguments to decompress. Do you agree?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Totally : )

rows/utils.py Outdated
@@ -297,3 +310,53 @@ def export_to_uri(table, uri, *args, **kwargs):
raise ValueError('Plugin (export) "{}" not found'.format(plugin_name))

return export_function(table, uri, *args, **kwargs)


def decompress(path, **kwargs):
Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could you please add an algorithm parameter to this function? It should defaults to None (if None, use file extension to define it).

rows/utils.py Outdated
msg = "Couldn't identify file mimetype, or lzma module isn't available"
raise RuntimeError(msg)

with open_compressed(path, **kwargs) as handler:
Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do you think having kwargs on decompress is really needed? Could you give me an example use case?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Basically encoding. On UNIX for example I barely use it. But Windows user should always add utf-8 I was told.


def setUp(self):
self.contents = six.b('Ahoy')
self.temp = tempfile.TemporaryDirectory()
Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

All the tests are failing here. I've replaced self.tmp with self.temp (was receiving a NameError) but they still fail:

======================================================================
ERROR: test_decompress_with_bz2 (tests.tests_utils.UtilsDecompressTestCase)
----------------------------------------------------------------------
Traceback (most recent call last):
  File "/home/turicas/projects/rows/tests/tests_utils.py", line 99, in test_decompress_with_bz2
    decompressed = rows.utils.decompress(compressed)
  File "/home/turicas/projects/rows/rows/utils.py", line 359, in decompress
    raise RuntimeError(msg)
RuntimeError: Couldn't identify file mimetype, or lzma module isn't available

======================================================================
ERROR: test_decompress_with_gz (tests.tests_utils.UtilsDecompressTestCase)
----------------------------------------------------------------------
Traceback (most recent call last):
  File "/home/turicas/projects/rows/tests/tests_utils.py", line 107, in test_decompress_with_gz
    self.assertEqual(self.contents, decompressed.read())
  File "/home/turicas/software/pyenv/versions/3.6.2/lib/python3.6/gzip.py", line 272, in read
    self._check_not_closed()
  File "/home/turicas/software/pyenv/versions/3.6.2/lib/python3.6/_compression.py", line 14, in _check_not_closed
    raise ValueError("I/O operation on closed file")
ValueError: I/O operation on closed file

======================================================================
ERROR: test_decompress_with_incompatible_file (tests.tests_utils.UtilsDecompressTestCase)
----------------------------------------------------------------------
Traceback (most recent call last):
  File "/home/turicas/projects/rows/tests/tests_utils.py", line 126, in test_decompress_with_incompatible_file
    with self.assertRaises():
TypeError: assertRaises() missing 1 required positional argument: 'expected_exception'

======================================================================
ERROR: test_decompress_with_lzma (tests.tests_utils.UtilsDecompressTestCase)
----------------------------------------------------------------------
Traceback (most recent call last):
  File "/home/turicas/projects/rows/tests/tests_utils.py", line 112, in test_decompress_with_lzma
    with lzma.open(compressed) as compressed_handler:
  File "/home/turicas/software/pyenv/versions/3.6.2/lib/python3.6/lzma.py", line 302, in open
    preset=preset, filters=filters)
  File "/home/turicas/software/pyenv/versions/3.6.2/lib/python3.6/lzma.py", line 120, in __init__
    self._fp = builtins.open(filename, mode)
FileNotFoundError: [Errno 2] No such file or directory: '/tmp/tmpr5s7bse8/test.lzma'

======================================================================
ERROR: test_decompress_with_xz (tests.tests_utils.UtilsDecompressTestCase)
----------------------------------------------------------------------
Traceback (most recent call last):
  File "/home/turicas/projects/rows/tests/tests_utils.py", line 120, in test_decompress_with_xz
    with lzma.open(compressed) as compressed_handler:
  File "/home/turicas/software/pyenv/versions/3.6.2/lib/python3.6/lzma.py", line 302, in open
    preset=preset, filters=filters)
  File "/home/turicas/software/pyenv/versions/3.6.2/lib/python3.6/lzma.py", line 120, in __init__
    self._fp = builtins.open(filename, mode)
FileNotFoundError: [Errno 2] No such file or directory: '/tmp/tmpz5l7xm1q/test.gz'

----------------------------------------------------------------------
Ran 184 tests in 0.848s

Are the tests passing in your machine?

rows/utils.py Outdated
kwargs are passed to either `bz2.openn`, `gzip.open` or `lzma.open`.
:param path: (str) path to a bz2, gzip or lzma file
"""
filename = os.path.basename(path)
Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As the current API accepts filenames or file-objects,
this function should also do (this decision was inspired in the Python stdlib modules, such as csv). You can get some help using rows.plugins.utils.get_filename_and_fobj.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That was indeed pretty helpful, thanks ; )

rows/utils.py Outdated
raise RuntimeError(msg)

with open_compressed(path, **kwargs) as handler:
return handler
Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's very important to ensure the file object returned is open in binary mode (so the plugins will decode the data using the desired encoding). Could you please add a test for this case?

@cuducos
Copy link
Contributor Author

cuducos commented Oct 15, 2017

Ok, sorry I took too long to get back to this PR. Life's got hectic around here.

Anyway, I addressed many issues in this refactor:

  • made it work with file object
  • dumped that complexity of detecting mimetype
  • fixed tests
  • implemented the algorithm argument
  • ensures the output is always byte string

About tests: I've just ran python -m unittest tests/tests_utils.py so not sure if is tehre any Python 2 details I might have missed (ie wasn't able to test with tox).

Do you prefer to implement compress in this same PR or leave it to a separate one?

@turicas turicas force-pushed the develop branch 2 times, most recently from fa144f0 to bbb2c57 Compare November 30, 2018 01:34
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

2 participants