Retrieve a query as a NumPy structured array #1156

Open
wants to merge 39 commits into master

Conversation

ilanschnell

In this PR, we add a .fetchdictarray() method to the pyodbc.Cursor object. This adds numpy as an optional build and runtime dependency. Only when numpy is available at build time is the extension src/npcontainer.cpp compiled. In addition, WITH_NUMPY will be defined so that src/cursor.cpp can add the method and src/pyodbcmodule.cpp can initialize numpy on import.

Here is the docstring of the .fetchdictarray() method:

fetchdictarray(size=-1, return_nulls=False, null_suffix='_isnull')
                               --> a dictionary of column arrays.

Fetch as many rows as specified by size into a dictionary of NumPy
ndarrays (dictarray). The dictionary will contain a key for each column,
with its value being a NumPy ndarray holding that column's values for
the fetched rows. Optionally, extra columns will be added to signal
nulls in nullable columns.

Parameters
----------
size : int, optional
    The number of rows to fetch. Use -1 (the default) to fetch all
    remaining rows.
return_nulls : boolean, optional
    If True, information about null values will be included by adding a
    boolean array whose key is the column name concatenated with
    null_suffix.
null_suffix : string, optional
    A string used as a suffix when building the key for null values.
    Only used if return_nulls is True.

Returns
-------
out : dict
    A dictionary mapping each column name to an ndarray holding that
    column's values for the fetched rows. Optionally, null information
    for nullable columns will be provided in additional boolean arrays
    keyed by the column name concatenated with null_suffix.

Remarks
-------
Similar to fetchmany(size), but returns a dictionary of NumPy ndarrays
instead of a Python list of tuples of objects, reducing the memory
footprint as well as improving performance.
fetchdictarray is overall more efficient than fetchsarray.

Note: The code is based on https://github.com/ContinuumIO/TextAdapter (which was released in 2017 by Anaconda, Inc. under the BSD license). The original authors of the numpy container are Francesc Alted and Oscar Villellas.
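A minimal usage sketch based on the docstring above (the connection string, table, and column names are hypothetical):

```python
import pyodbc

# Hypothetical DSN and table, for illustration only.
cnxn = pyodbc.connect("DSN=mydsn")
crsr = cnxn.cursor()
crsr.execute("SELECT id, price FROM products")

# Fetch all remaining rows into a dictionary of NumPy arrays,
# including boolean null-flag arrays for nullable columns.
da = crsr.fetchdictarray(return_nulls=True)
ids = da["id"]                    # ndarray of id values
prices = da["price"]              # ndarray of price values
price_null = da["price_isnull"]   # boolean ndarray (default null_suffix)
```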

@ndmlny-qs
Copy link
Contributor

pinging the PR to see if there is anything I can help with to merge this feature

mkleehammer and others added 28 commits May 9, 2023 00:31
I ran the unit tests with a debug version of Python and it complained that the Unicode string I
was building was invalid.  The original code was a modified version of an older Python tuple
repr implementation, so I looked at doing that again.  However, cpython now uses an internal
_PyUnicode_Writer class we don't have access to, so I'm cheating by creating a tuple.

Since repr should not be in the critical path of most performance sensitive DB jobs, this will
do for now.
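In Python terms, the tuple-based approach amounts to something like this sketch (an illustration of the strategy described above, not the actual C code):

```python
class Row:
    def __init__(self, *values):
        self._values = values

    def __repr__(self):
        # Build a real tuple and reuse its repr rather than hand-rolling
        # the string with CPython's internal writer APIs.
        return repr(tuple(self._values))
```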
It is easier to build debug versions of Python now.
The "raw" encoding was Python 2.7 only.

I originally created ODBCCHAR to replace SQLWCHAR because unixODBC would define it as wchar_t
even when that was 4 bytes.  The data in the buffer of 4-byte wchar_t's was still 2-byte data.
Now I've just simplified to uint16_t.  I added this to HACKING.md.

Deleted Tuple wrapper.  Use the Object wrapper and the PyTuple_ functions.  This is to prepare
for possibly using the ABI, which would not allow me to access the internal item pointers directly,
so I could not use operator[] to set items.  (Python has __getitem__ and __setitem__, but to
overload __setitem__ in C++ you can only return a reference to the internal data.)
Used subprocess in setup.py to eliminate warnings about the process still running.

Removed connect() ansi parameter.

Updated SQLWChar to allow it to be declared on the stack and initialized later.  Turned into a
class with an operator to convert to SQLWCHAR*.
Somehow I lost some changes.
This is a fix for GitHub security advisory GHSA-pm6v-h62r-rwx8.  The old code had a hardcoded
buffer of 100 bytes (and a comment asking why it was hardcoded!) and fetching a decimal greater
than 100 digits would cause a buffer overflow.

Author arturxedex128 supplied a very simple snippet to reproduce the error, which was put into the
3 PostgreSQL unit tests as test_large_decimal.  (Thank you arturxedex128!)

Unfortunately the strategy is still that we have to parse decimals, but now Python strings /
Unicode objects are used so there is no arbitrary limit.
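A hedged sketch of that scenario (the SQL and the 120-digit value are assumptions, not the actual test code):

```python
from decimal import Decimal

# crsr is an open pyodbc cursor to PostgreSQL (assumed).
# A decimal wider than the old hardcoded 100-byte buffer.
big = "1" * 120
crsr.execute(f"SELECT CAST('{big}' AS numeric(150, 0))")
assert crsr.fetchone()[0] == Decimal(big)
```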
I have not ported this code path and I'm not as familiar with it as I need to be.  To allow me
to complete porting and testing the rest, I've temporarily commented it out.

I will look into consolidating the binding for the two code paths.  Also, I'd like to consider
renaming it to "array binding" or "row wise binding" instead of "fast executemany".  While the
latter does tell us the goal of it, it is too generic.  For one thing, what if we wanted to
supply both row- and column-wise binding -- they are both "fast".
I'm not sure where the minor fixes came from, like PyEval_ -> PyObject_.  I'll need to test with
older 3.x versions.

I am going to use the test file naming convention of xxx_tests.py to make it easier to use tab
completion in shells and editors.
I accidentally deleted it.  It is required for simple local pytest.  (See comment in the file.)
I'm porting the tests one at a time and want to ensure the ones ported are successful.
I've only tested on Linux so far.  Next step is to get the Windows tests working on a local
machine and/or AppVeyor.
I uncommented some sections and the indentation was off.  Perhaps it had tabs.
I missed a version.  While in there, I simplified this code and used the year as an int and
consolidated the "not SQL Server" code into the _get_sqlserver_year function.
I also added flake8 and pylint to the dev requirements.
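A hedged sketch of what such a consolidated helper could look like (the version parsing and the 0-for-other-databases convention are assumptions):

```python
import re

import pyodbc

def _get_sqlserver_year(cnxn):
    # Return the SQL Server release year as an int (e.g. 2017), or 0 when
    # the backend is not SQL Server, so callers can use plain integer
    # comparisons instead of scattered "not SQL Server" checks.
    try:
        version = cnxn.execute("SELECT @@VERSION").fetchone()[0]
    except pyodbc.Error:
        return 0
    match = re.search(r"Microsoft SQL Server\s+(\d{4})", version)
    return int(match.group(1)) if match else 0
```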
ilanschnell and others added 10 commits August 25, 2023 12:37
This commit modifies the `params.cpp` file to check that the given iterable actually has items
in it when a custom iterable object is passed. This way, when executing the code below

```python
import collections.abc

import pyodbc

class MySequence(collections.abc.Sequence):
    def __getitem__(self, index):
        raise Exception

    def __len__(self):
        return 1

# Placeholder connection; the actual DSN depends on your environment.
connection = pyodbc.connect("DSN=mydsn")
connection.execute("SELECT ?, ?", 123, MySequence()).fetchone()
```

a Python exception is raised instead of a segfault.
...which have never worked.  Maybe this will work with CIBUILDWHEEL eventually but until it does, drop them.
@ndmlny-qs
Contributor

@mkleehammer this has been rebased against the py3 branch

Merge conflict markers were kept in `.github/workflows/ubuntu_build.yml`
for some reason. This commit removes them.
@ndmlny-qs
Contributor

@mkleehammer you will need to close out this PR so I can make a different one against the py3 branch. I'll resolve any tests that fail in a new PR against that branch.

@ndmlny-qs
Contributor

@mkleehammer this PR can be closed in favor of #1270 where I am still working on updating tests

@ndmlny-qs
Contributor

@ilanschnell You can close this PR, see #1270 for a comment about why this can be closed.

@mkleehammer if Ilan does not close out this PR, you can close it out as you clean out stale PRs
