GSoC_2018_project_arrow

Arrow Buffer as CFeatures memory backend

Now that more and more data science projects use Apache Arrow as a memory backend, or at least support exporting data into an Arrow Buffer (see, for example, SPARK-13534), it would be great if some of Shogun's CFeatures classes could use an Arrow Buffer as a memory backend.

Mentors

Viktor (github: vigsterkr, IRC: wiking)
Sergey (github: lisitsyn, IRC: lisitsyn)

Difficulty & Requirements

Medium.

You need to know

C++
basic software engineering

Description

Apache Arrow is a cross-language development platform for in-memory data. It specifies a standardized, language-independent columnar memory format for flat and hierarchical data. It also provides computational libraries and zero-copy streaming messaging and interprocess communication. Languages currently supported include C, C++, Java, JavaScript, Python, and Ruby.

Using Arrow as a CFeatures backend would, for example, let us work directly on a pandas DataFrame via pyarrow. In the long run, as the number of languages supported by Arrow keeps growing, we could also gradually get rid of some of the SWIG-based typemaps, which would significantly reduce the memory footprint and improve performance.
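To make the idea more concrete, here is a minimal sketch of a features class that reads its values from a shared buffer instead of copying them. Everything in it is hypothetical: `ColumnBuffer`, `VectorView` and `BufferBackedFeatures` are illustrative names, not Shogun's CFeatures API and not Arrow's `arrow::Buffer`. The buffer is typed as `double` for simplicity, and the sketch assumes the feature vectors are stored one after another in a single contiguous block.

```cpp
#include <cassert>
#include <cstddef>
#include <memory>
#include <utility>
#include <vector>

// Hypothetical stand-in for an Arrow-style buffer: one contiguous block of
// values that can be shared between owners. A real Arrow buffer holds
// untyped bytes; this one is typed to keep the sketch short.
struct ColumnBuffer
{
    std::vector<double> values;
};

// Hypothetical non-owning view of a single feature vector.
struct VectorView
{
    const double* data;
    std::size_t size;
};

// Hypothetical features class backed by a shared buffer. It keeps the
// buffer alive through shared ownership and never copies the values.
class BufferBackedFeatures
{
public:
    BufferBackedFeatures(std::shared_ptr<const ColumnBuffer> buffer,
                         std::size_t num_features,
                         std::size_t num_vectors)
        : m_buffer(std::move(buffer)),
          m_num_features(num_features),
          m_num_vectors(num_vectors)
    {
        // The buffer must hold exactly num_features * num_vectors values.
        assert(m_buffer != nullptr);
        assert(m_buffer->values.size() == num_features * num_vectors);
    }

    std::size_t get_num_vectors() const { return m_num_vectors; }
    std::size_t get_num_features() const { return m_num_features; }

    // Returns a pointer into the shared buffer; no data is copied.
    VectorView get_feature_vector(std::size_t index) const
    {
        assert(index < m_num_vectors);
        const double* base = m_buffer->values.data();
        return VectorView{base + index * m_num_features, m_num_features};
    }

private:
    std::shared_ptr<const ColumnBuffer> m_buffer;
    std::size_t m_num_features;
    std::size_t m_num_vectors;
};
```

In this sketch the producer of the data and the features object hold the same `std::shared_ptr`, so the memory stays valid for as long as either side needs it. A real backend would additionally have to deal with Arrow's untyped bytes, validity bitmaps and column-oriented layout, which are left out here.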

Useful resources

Start with checking out the prototype in the feature/arrow branch.
