GSoC_2018_project_arrow

Viktor Gal edited this page Jan 29, 2018 · 1 revision

Arrow Buffer as CFeatures memory backend

Now that more and more data science projects use Apache Arrow as a memory backend, or at least support exporting data into an Arrow Buffer (see for example SPARK-13534), it would be great if some of Shogun's CFeatures classes could use an Arrow Buffer as their memory backend.

Mentors

Difficulty & Requirements

Medium.

You need to know

  • C++
  • basic software engineering

Description

Apache Arrow is a cross-language development platform for in-memory data. It specifies a standardized, language-independent columnar memory format for flat and hierarchical data. It also provides computational libraries and zero-copy streaming messaging and interprocess communication. Languages currently supported include C, C++, Java, JavaScript, Python, and Ruby.

Using Arrow as the memory backend for CFeatures would, for example, allow us to work directly on a pandas DataFrame via pyarrow. In the long run, as the number of languages Arrow supports grows, we could also gradually get rid of some of the SWIG-based typemaps, which would significantly reduce the memory footprint and improve performance.
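To make the zero-copy idea concrete, here is a minimal sketch of a feature container that reads from an externally owned, shared buffer instead of allocating its own storage. It deliberately uses only the standard library: `SharedBuffer`, `make_buffer` and `DenseFeatureView` are hypothetical names invented for this illustration and are neither Arrow's nor Shogun's API. A real implementation would presumably hold an Arrow buffer where this sketch holds a `std::shared_ptr`.

```cpp
#include <cassert>
#include <cstddef>
#include <memory>
#include <utility>
#include <vector>

// Hypothetical stand-in for an Arrow-style buffer: contiguous memory with
// shared ownership, so several consumers can reference it without copying.
struct SharedBuffer {
    std::shared_ptr<const void> data;
    std::size_t size_bytes;
};

// Hypothetical helper: takes ownership of the values and exposes them as a buffer.
SharedBuffer make_buffer(std::vector<double> values) {
    auto owner = std::make_shared<std::vector<double>>(std::move(values));
    // Aliasing constructor: shares ownership of the vector, points at its elements.
    std::shared_ptr<const void> data(owner, owner->data());
    return SharedBuffer{data, owner->size() * sizeof(double)};
}

// Hypothetical dense feature view (not Shogun's CFeatures interface).
// Each feature vector is stored contiguously; the view never copies the data.
class DenseFeatureView {
public:
    DenseFeatureView(SharedBuffer buffer, std::size_t num_features,
                     std::size_t num_vectors)
        : buffer_(std::move(buffer)),
          num_features_(num_features),
          num_vectors_(num_vectors) {
        // The buffer must hold exactly num_features * num_vectors doubles.
        assert(buffer_.size_bytes == num_features_ * num_vectors_ * sizeof(double));
    }

    std::size_t num_features() const { return num_features_; }
    std::size_t num_vectors() const { return num_vectors_; }

    // Pointer to the i-th feature vector inside the shared buffer.
    const double* feature_vector(std::size_t i) const {
        assert(i < num_vectors_);
        return static_cast<const double*>(buffer_.data.get()) + i * num_features_;
    }

private:
    SharedBuffer buffer_;  // keeps the underlying memory alive
    std::size_t num_features_;
    std::size_t num_vectors_;
};
```

Because the view only shares ownership of the buffer, the same memory could in principle be handed over by another library or language binding without a copy, which is the property this project aims to obtain from Arrow.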

Useful resources

Start by checking out the prototype in the feature/arrow branch.
