
OpenRefine integration ideas with R lang and Jupyter

Thad Guidry edited this page Mar 29, 2019 · 4 revisions

FEATURES DESIRED (gathered from discussions)

BIKESHEDDING DISCUSSIONS

Hi Thad,

Longtime fan of Refine and now OpenRefine, and happy to collaborate!

Renjin should make it pretty easy to integrate within OpenRefine from a technical perspective; I think the biggest challenge would be defining what R integration would look like.

I think enabling R as a language would be a great first step. This should be quite straightforward to do in a first pass, and I can help wire up our just-in-time compiler if performance becomes an issue.

I can imagine that your users would also benefit from more "guided" tools that could be powered by R and R packages. For example, there are a number of good text mining and natural language processing packages that you could embed to provide a "Sentiment Analysis Wizard" or something similar.

But for starters, I've subscribed to the ticket on GitHub and will see what I can do to help!

Best, Alex


That sounds interesting - and makes me wonder: what would it mean if OpenRefine could act as a Jupyter client?

Off the top of my head:

  • it would be able to launch / connect to a Jupyter kernel (e.g. R or Python)
  • this would allow code-based transformations to be executed using those kernels
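
To make the two points above concrete, here is a rough sketch of the message a Jupyter client sends to a kernel to run a piece of code. It only builds the `execute_request` payload described in the Jupyter messaging protocol; actually delivering it needs ZMQ sockets and message signing, which a library such as `jupyter_client` normally handles. The helper name `make_execute_request`, the `openrefine` username, and the sample R snippet are assumptions made up for illustration.

```python
import json
import uuid
from datetime import datetime, timezone


def make_execute_request(code, session_id, username="openrefine"):
    """Build a Jupyter `execute_request` message as a plain dict.

    Hypothetical helper: it only assembles the payload, it does not send it.
    """
    return {
        "header": {
            "msg_id": uuid.uuid4().hex,        # unique per message
            "session": session_id,             # shared by all messages of one client session
            "username": username,              # assumed client name
            "date": datetime.now(timezone.utc).isoformat(),
            "msg_type": "execute_request",
            "version": "5.3",                  # messaging protocol version
        },
        "parent_header": {},                   # empty: this message starts the exchange
        "metadata": {},
        "content": {
            "code": code,                      # source code for the kernel to run
            "silent": False,
            "store_history": True,
            "user_expressions": {},
            "allow_stdin": False,
            "stop_on_error": True,
        },
    }


session = uuid.uuid4().hex
# The code string is whatever the target kernel understands, here a line of R.
msg = make_execute_request("toupper('openrefine')", session)
print(json.dumps(msg["content"], indent=2))
```

Under this model, an OpenRefine transformation would put the user's expression in `content.code` and read the result from the kernel's reply messages.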

By the by, I also note other integrations between OpenRefine and bits of the Jupyter ecosystem, such as launching from a notebook server menu: https://github.com/betatim/openrefineder

I guess it would also be possible to display OpenRefine in a panel in JupyterLab? But what if OpenRefine acted on a data structure that other components in the JupyterLab context could see and access...?

--tony


@ettore says: It would be very cool to switch more easily from OpenRefine to R or Python through an intermediate format like Feather (based on Apache Arrow). In general, anything that can facilitate the integration of OpenRefine in a data science workflow deserves to be encouraged. I feel that there are not enough data scientists in the user base. It's a shame.


Suppose I am working on a crappy dataset, viewing a fragment of a CSV file or dataframe in one panel, working with that data in another.

If OpenRefine provided a view onto the same dataframe, I could be cleaning it in OpenRefine as I work on the analysis of it in another panel.

As it currently stands, I have to import and export data from OpenRefine if I actually want to analyse it.

--tony


@JohnLittle says: ...As to the Why, my sense is that Jython is often used to extend OpenRefine when the natural constraints of OR limit advanced data transformations. I can certainly imagine that similar work could be done with R inside an OpenRefine expression window, particularly with the Tidyverse packages, which are more familiar to me than base R. I can also imagine how an R programmer, like a Python programmer, could write and share code snippets to be pasted into the expression window by non-R (or non-Python) OpenRefine users.
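
As an illustration of the kind of shareable snippet described above, here is a small sketch in the Python/Jython style. In OpenRefine's Python/Jython expression window only the function body would be pasted, with the cell content available as `value` and the result handed back with `return`. The wrapper function `clean_name` and the sample input are hypothetical and exist only so the snippet can run on its own.

```python
def clean_name(value):
    """Stand-in for the body of an OpenRefine Python/Jython expression."""
    if value is None:
        return None
    # Collapse runs of whitespace and trim the ends.
    parts = value.split()
    # Normalise capitalisation, e.g. "ada LOVELACE" -> "Ada Lovelace".
    return " ".join(parts).title()


print(clean_name("  ada   LOVELACE "))
```

An R equivalent would presumably follow the same pattern: a short expression over `value` that a non-programmer can paste in without needing to understand it.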


@thadguidry is still trying to wrap his brain around the many parts of the Jupyter ecosystem and R lang itself. Jupyter seems to be very much an OpenRefine kind of web application, but built differently, for the different purpose of sharing and interactive visualization.

Jupyter Parts:

  1. The Notebook Document Format: Jupyter Notebooks are an open document format based on JSON. They contain a complete record of the user's sessions and include code, narrative text, equations and rich output.

  2. Interactive Computing Protocol: The Notebook communicates with computational Kernels using the Interactive Computing Protocol, an open network protocol based on JSON data over ZMQ and WebSockets.

  3. Kernels: Kernels are processes that run interactive code in a particular programming language and return output to the user. Kernels also respond to tab completion and introspection requests.
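
To give a feel for part 1, here is a minimal sketch of a notebook document written as plain JSON and read back with the Python standard library. The field names follow the version 4 notebook format; the cell contents, the cell ids, and the `ir` kernel name are illustrative assumptions, not taken from a real notebook.

```python
import json

# A tiny notebook: one markdown cell and one code cell with a saved output.
notebook = {
    "nbformat": 4,
    "nbformat_minor": 5,
    "metadata": {
        "kernelspec": {"name": "ir", "display_name": "R", "language": "R"}
    },
    "cells": [
        {
            "cell_type": "markdown",
            "id": "intro",
            "metadata": {},
            "source": "# Cleaning notes",
        },
        {
            "cell_type": "code",
            "id": "count",
            "metadata": {},
            "execution_count": 1,
            "source": "cat('42 rows')",
            "outputs": [
                # Output recorded when the cell ran (made-up sample value).
                {"output_type": "stream", "name": "stdout", "text": "42 rows"}
            ],
        },
    ],
}

# Round-trip through JSON text, which is what an .ipynb file contains.
loaded = json.loads(json.dumps(notebook))

code_sources = [c["source"] for c in loaded["cells"] if c["cell_type"] == "code"]
stored_outputs = [
    out["text"]
    for c in loaded["cells"]
    for out in c.get("outputs", [])
    if out["output_type"] == "stream"
]
print(code_sources)
print(stored_outputs)
```

Note what is and is not in the file: the code and its recorded outputs travel with the notebook, while the dataset the code operated on lives wherever the kernel read it from.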

There are certainly similarities and overlap in 1, 2, and 3 with OpenRefine.

But we don't know where the data is actually stored in Jupyter. If notebooks can be shared, then it seems the data is stored not only on the JUPYTER_PATH but also in the notebook format itself?

A. Is there a need in OpenRefine to have an "Import Jupyter Notebook" option?

B. Or do we not even worry about that import need, and just allow a user to have a Jupyter Notebook and OpenRefine open at the same time, working seamlessly with the same data?

If B is more highly valued by users, then does anyone have any idea how that might be technically feasible? I am clueless about Jupyter and R for the most part, and don't want to waste hours reading just to frame up some architecture integration documentation. I'd rather just cut to the chase and let the community help me draft that architecture integration. I cannot do that alone.
