What is the use case of xnd over apache arrow? #6

Open
nmichaud opened this issue Mar 7, 2018 · 16 comments


nmichaud commented Mar 7, 2018

Hello, I was wondering what you plan on doing with xnd that isn't well supported by the Arrow format. A GitHub issue is probably not the best place for this discussion, but I couldn't find a mailing list. Thanks.


teoliphant commented Mar 7, 2018 via email


nmichaud commented Mar 7, 2018

Ah, OK. It seems most of the underlying types and structured types (lists, structs, ragged hierarchies) are already well supported in Arrow. In any case, looking at the docs, it should be pretty cheap to convert from one format to the other via memory copying.


skrah commented Mar 7, 2018

Are they? I thought Arrow was limited to int32_t in sizes. Xnd is for in-memory computations, so we definitely need int64_t.

Also, the types that xnd uses (ndt_t) form a standard algebraic datatype that is relatively easy to use for traversing memory, which is needed in gumath. There is no dependency on an external C++ library like FlatBuffers.
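
For illustration, here is a minimal sketch using the Python ndtypes bindings (assuming the ndtypes package is installed; attribute names as in its docs):

>>> from ndtypes import ndt
>>> t = ndt("2 * 3 * {a: int32, b: float64}")
>>> t.ndim
2
>>> t.shape
(2, 3)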


skrah commented Mar 7, 2018

That said, we might translate Arrow to ndt_t in the future, but it is not an immediate priority (gumath is).


wesm commented Mar 7, 2018

> Are they? I thought Arrow was limited to int32_t in sizes. Xnd is for in-memory computations, so we definitely need int64_t.

This isn't quite true -- see https://issues.apache.org/jira/browse/ARROW-750. Support for very large variable-length collections is something we will eventually add to the format when there is demand for it.

In general, datasets are not expected to be in one contiguous columnar memory block, but instead split across a collection of smaller chunks. We have discussed the 32- vs. 64-bit issue for encoding collection lengths, and the consensus has been that it is not worth the extra 4 bytes of overhead per value when the "large collection" case represents a very small percentage of use cases.
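
For example, in pyarrow a large column is typically represented as a ChunkedArray over several smaller buffers rather than one contiguous block (an illustrative sketch):

>>> import pyarrow as pa
>>> col = pa.chunked_array([[1, 2, 3], [4, 5]])  # one logical column, two chunks
>>> col.num_chunks
2
>>> len(col)
5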


wesm commented Mar 7, 2018

Additionally, we changed 1-dimensional array sizes to use int64 almost a year ago: apache/arrow@ced9d76#diff-520b20e87eb508faa3cc7aa9855030d7


skrah commented Mar 7, 2018 via email


wesm commented Mar 7, 2018

Makes sense. In our experience, tensors are a different beast and use case from structured columnar data, so we are handling ndarrays / tensors with metadata separately from 1D record batches: https://github.com/apache/arrow/blob/master/format/Tensor.fbs#L35. These use 64-bit shapes and strides. This is used actively by the Ray project.
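
For example, pyarrow can wrap a NumPy ndarray as such a Tensor without copying (a sketch, assuming pyarrow and numpy are installed):

>>> import numpy as np
>>> import pyarrow as pa
>>> t = pa.Tensor.from_numpy(np.arange(12, dtype=np.int64).reshape(3, 4))
>>> t.shape    # 64-bit shape
[3, 4]
>>> t.strides  # 64-bit strides, in bytes
[32, 8]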


teoliphant commented Mar 7, 2018 via email


wesm commented Mar 7, 2018

> There is overlap but the tradeoffs are quite different.

Agreed. We should look for opportunities to share code and infrastructure where possible. Note that the Arrow columnar format is but one type of data structure that we support -- it's a very important one for databases, Spark, pandas, etc. In order to implement zero-overhead memory sharing for structured datasets, many lower levels of platform tooling must be created. I want to make sure we don't miss out on collaboration opportunities by not having agreed on a "universal" data structure. The Arrow columnar format was never intended to be a universal data structure.


skrah commented Mar 7, 2018

[Repost because of broken markdown in email replies.]

I agree it would be nice to have a standard low-level data structure. For C, ndtypes is pretty standard: it describes all basic C types (including nested and pointer types) using a regular algebraic data type. One could use it, e.g., for the type part in "Modern Compiler Implementation in C" (Appel et al.) without changes.

The tagged union convention is also the same as in the quoted book, and incidentally also the same as in Python's own compiler (whose author probably also read Appel, given that he used ASDL to describe the AST :).

I think columnar data can be modeled in ndtypes as a record of arrays. The example from the Arrow home page:

>>> data = {'session_id': [1331247700, 1331247702, 1331247709, 1331247799],
...         'timestamp': [1515529735.4895875, 1515529746.2128427, 1515529756.4485607, 1515529766.2181058],
...         'source_ip': ['8.8.8.100', '100.2.0.11', '99.101.22.222', '12.100.111.200']}
>>> x = xnd(data)
>>> x.type
ndt("{session_id : 4 * int64, timestamp : 4 * float64, source_ip : 4 * string}")

There is categorical data, the representation of which is an array of indices into the categories:

>>> levels = ['January', 'August', 'December', None]
>>> x = xnd(['January', 'January', None, 'December', 'August', 'December', 'December'], levels=levels)
>>> x.value
['January', 'January', None, 'December', 'August', 'December', 'December']
>>> x.type
ndt("7 * categorical('January', 'August', 'December', NA)")

There are nested tuples, which are more general than ragged arrays:

>>> unbalanced_tree = (((1.0, 2.0), (3.0)), 4.0, ((5.0, 6.0, 7.0), ()))
>>> x = xnd(unbalanced_tree)
>>> x.value
(((1.0, 2.0), 3.0), 4.0, ((5.0, 6.0, 7.0), ()))
>>> x.type
ndt("(((float64, float64), float64), float64, ((float64, float64, float64), ()))")
>>>
>>> x[0]
xnd(((1.0, 2.0), 3.0), type="((float64, float64), float64)")
>>> x[0][0]
xnd((1.0, 2.0), type="(float64, float64)")

In general, xnd just takes any basic Python value -- nested or not -- and unpacks
it to typed memory.


wesm commented Mar 8, 2018

I am skeptical about the idea of an all-powerful / can-describe-anything data structure. With generalization comes added complexity for computational frameworks and producers/consumers.

@teoliphant stated "I have not seen that arrow is general enough." What does this mean? At this point, the Arrow columnar format is only one part of a much larger project. I think this means "the Arrow columnar format is not a universal data structure", which I agree with, but that was never the goal. I see the work here in libndtypes / xnd as complementary and not in conflict -- there are problems being solved (extending the notion of NumPy's structured dtypes to support things like variable-length cells and pointers) that were never in scope for Arrow's columnar format.

The columnar format was the focus of the project at the outset because that was the most immediate and high value problem to solve around data interoperability and in-memory analytics. The rapid uptake of the project and developer community growth suggests we made a good bet on this.

At this point Arrow is a multi-layered project: memory management, shared memory (Plasma), metadata serialization, IO, streaming messaging, memory formats (including the columnar format), file format interop, computation kernels, etc. The work being done here could even become an additional component of Apache Arrow if you wanted to work with a larger developer community. At minimum, it would be helpful to have a broader design/architecture discussion about problems and use cases in a public venue.


pearu commented Apr 4, 2019

> Ideally we can convert without memory copying.

As demonstrated in ArrayViews, one can wrap Arrow arrays with xnd, and vice versa, without memory copying. However, currently the wrapping does not support null buffers (Arrow) or bitmaps (xnd) because xnd does not expose bitmaps.
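
As a rough illustration of the no-copy direction (a sketch, not the ArrayViews API itself; it assumes pyarrow's zero-copy NumPy export and an xnd.from_buffer constructor, and covers only the null-free primitive case):

>>> import pyarrow as pa
>>> from xnd import xnd
>>> arr = pa.array([1, 2, 3], type=pa.int64())  # Arrow array without nulls
>>> view = arr.to_numpy(zero_copy_only=True)    # NumPy view of the same buffer
>>> x = xnd.from_buffer(view)                   # xnd wrapper, still no copy
>>> x.type
ndt("3 * int64")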


wesm commented Apr 4, 2019

@pearu, the cases where the memory is compatible IMHO represent a minority (and a small minority at that) of real-world uses of Arrow. To suggest "compatible, with some exceptions" will mislead people.


pearu commented Apr 4, 2019

@wesm, I am not sure I follow your comment's meaning. If you are referring to the fact that xnd does not expose bitmaps, that issue can easily be fixed, since the xnd bitmap is compatible with the Arrow null buffer. I guess the reason xnd bitmaps are not exposed is that they are considered an internal structure, whereas in Arrow the null buffer is not.


wesm commented Apr 4, 2019

You stated

> As demonstrated in ArrayViews, one can wrap Arrow arrays with xnd, and vice versa, without memory copying

I think it's worth making a list of different Arrow use cases:

  • Primitive (C-like) arrays with no nulls
  • Primitive arrays with nulls
  • Dictionary-encoded arrays
  • Varbinary / utf8 (variable-length) arrays with nulls
  • Nested type arrays (e.g. list<binary> or list<int32>)
  • Unions

Do you support them all and export all of their semantics in xnd? If the answer is "no", then I think you need to qualify the statement to say that "in certain limited cases, one can wrap Arrow arrays [and expose their semantics] without memory copying."
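
For concreteness, most of these can be constructed directly in pyarrow (an illustrative sketch; unions omitted):

>>> import pyarrow as pa
>>> ints = pa.array([1, None, 3], type=pa.int32())                    # primitive with nulls
>>> strs = pa.array(["a", "b", None])                                 # variable-length utf8 with nulls
>>> dicts = pa.array(["x", "y", "x"]).dictionary_encode()             # dictionary-encoded
>>> lists = pa.array([[1, 2], None, [3]], type=pa.list_(pa.int32()))  # nested list<int32>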
