Sandra Fauconnier edited this page May 25, 2022 · 8 revisions

OpenRefine's roadmap presents the broad directions in which the project wants to go in the coming years, based on input from user surveys and interests in the developer community.

This roadmap is incomplete, is always a work in progress, and is a collaborative effort of the OpenRefine community (users, developers and other stakeholders).

(Update in 2022) We want to start a working group of community members interested in updating and maintaining this roadmap. Several people expressed interest in joining this working group in OpenRefine's 2022 user survey. If you would like to participate as well, please say so on OpenRefine's user mailing list or developer mailing list, or email OpenRefine's project director Sandra (sandra [at] openrefine.org).

See also:

If there are features you would like to see that are not currently listed here or in the above-mentioned places, please add them to our issue tracker.

Planned releases

3.7

Includes support for uploading media files to Wikimedia Commons.

4.0

New backend storage option to allow using much bigger datasets at the expense of real-time feedback.

Work in progress

Alongside the planned releases there are often smaller pieces of work in progress. Check for recently updated issues and pull requests to see what is currently in the works.

In 2022, we are working on the following funded projects:

Support for Structured Data on Wikimedia Commons

In 2021-22, we are adding features to OpenRefine that allow batch editing and uploading of files on Wikimedia Commons, the media repository of the Wikimedia movement.

Eliminating cultural and linguistic biases in OpenRefine

In 2021-23, we are working on the project OpenRefine for Everyone, funded by the Chan Zuckerberg Initiative (D&I funding cycle). On the technical side, this project focuses on making OpenRefine more internationally usable, eliminating cultural and linguistic biases.

Subtasks include:

  • Improvements to date parsing with better support for non-Western date formats
  • Improvements to number parsing with locale support
  • Generalization of the reconciliation API to expose service-defined features
  • Localization of parts of the UI which cannot be translated yet for architectural reasons
  • Implementation of clustering heuristics suitable for non-Latin alphabets
  • Better support for legacy encodings
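To illustrate the clustering subtask above, here is a minimal sketch of a script-agnostic key function for key-collision clustering. It is not OpenRefine's implementation: the names `fingerprint` and `cluster` are hypothetical, and the approach (Unicode NFKC normalization plus case folding, keeping letters of any script instead of folding to ASCII) is only one assumption about what a heuristic suitable for non-Latin alphabets could look like.

```python
import unicodedata
from collections import defaultdict


def fingerprint(value: str) -> str:
    """Hypothetical clustering key that avoids ASCII folding."""
    # NFKC unifies compatibility forms (e.g. full-width letters);
    # casefold() lowercases in a script-aware way.
    text = unicodedata.normalize("NFKC", value).casefold()
    # Keep letters, combining marks and digits of any script;
    # everything else (punctuation, symbols, spaces) becomes a separator.
    kept = [
        ch if unicodedata.category(ch)[0] in ("L", "M", "N") else " "
        for ch in text
    ]
    # Sort and de-duplicate tokens so word order does not matter.
    tokens = sorted(set("".join(kept).split()))
    return " ".join(tokens)


def cluster(values):
    """Group values that share the same key; return groups of size > 1."""
    groups = defaultdict(list)
    for value in values:
        groups[fingerprint(value)].append(value)
    return [group for group in groups.values() if len(group) > 1]
```

With this sketch, `cluster(["Москва, Россия", "россия МОСКВА", "Berlin"])` groups the two Cyrillic spellings together and leaves "Berlin" on its own. An ASCII-folding key would discard the Cyrillic letters entirely, which is the kind of bias this subtask targets.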

Some of the work in this project will be done through Outreachy and/or Google Summer of Code internships, introducing new developers from international backgrounds into OpenRefine's community. We will also leverage our community of translators to identify more aspects of the tool which could be better adapted to other cultures.

Often-requested major features

Many of OpenRefine's end users are requesting the following major features, updates and functionalities, but these are not being worked on in dedicated projects (yet).

Further integration for Wikibase and Wikimedia users

OpenRefine is a popular platform for data imports in Wikidata thanks to a dedicated integration in the tool.

The success of this initial integration is calling for more:

  • Generalization to other sites (third-party Wikibase installations) or other data formats (lexicographical data on Wikidata);
  • More powerful data import tools (including better quality assurance features);
  • Continuous improvements to match new features developed by the Wikimedia Foundation and Wikimedia Deutschland.

Reproducibility and automation

One of OpenRefine’s key features is the ability to replay a transformation workflow on a new project. For instance, this makes it possible to reuse a sequence of cleaning operations on a new version of the same dataset. This places OpenRefine at the intersection between fully interactive data manipulation tools (such as spreadsheet software) and scripted transformations (which require programming), therefore combining the ease of use of interactive tools with the reproducibility of scripts.

The manipulation of workflows can be improved in many ways, including:

  • Better representation and manipulation of workflows, beyond their JSON-based representation (graph-based representation, interactive editing of workflows);
  • More flexible ways to re-apply workflows (adapting to different project shapes, better error handling in case of mismatches);
  • Ability to execute workflows on large datasets or streams, as part of broader pipelines (via bindings for programming languages or connectors for execution engines).
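The idea of replaying a JSON-described workflow on a new dataset, with explicit handling of mismatches, can be sketched as follows. The operation format here (`op` and `column` keys, with `trim` and `uppercase` operations) is a simplified, hypothetical one chosen for illustration and is not OpenRefine's actual operation history format; `replay` is likewise an assumed name.

```python
import json

# Hypothetical, simplified workflow description (not OpenRefine's real schema).
WORKFLOW_JSON = """
[
  {"op": "trim", "column": "city"},
  {"op": "uppercase", "column": "country"},
  {"op": "trim", "column": "region"}
]
"""

# Mapping from operation names to plain string transformations.
TRANSFORMS = {"trim": str.strip, "uppercase": str.upper}


def replay(workflow, rows):
    """Apply each step to a copy of the rows.

    Steps that name an unknown operation or a column missing from the
    dataset are collected and returned instead of aborting the run.
    """
    columns = set(rows[0]) if rows else set()
    result = [dict(row) for row in rows]
    skipped = []
    for step in workflow:
        transform = TRANSFORMS.get(step["op"])
        column = step["column"]
        if transform is None or column not in columns:
            skipped.append(step)
            continue
        for row in result:
            row[column] = transform(row[column])
    return result, skipped


rows = [{"city": " Paris ", "country": "fr"}]
cleaned, skipped = replay(json.loads(WORKFLOW_JSON), rows)
```

Here the third step refers to a `region` column that the new dataset lacks, so it ends up in `skipped` rather than failing silently or stopping the replay. Reporting mismatches this way is one possible reading of "better error handling in case of mismatches".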

Online collaboration; collaborative editing support; hosted instances

Despite being a web-based application, OpenRefine is designed to be run locally and therefore requires installation. This assumption has pervasive implications throughout the tool, which is designed to be used by a single user at a time. We want to make it easier to use OpenRefine online, in a collaborative way. This can benefit the project in many ways:

  • The requirement to download and install the software before using it is a considerable hurdle for many users, in comparison to hosted services which can be used immediately. Being able to offer hosted instances would expand the user base significantly;
  • The ability to collaborate on a project would benefit large-scale cleaning projects where human review is critical (reconciliation, clustering, quality assessment) and would improve reproducibility (by making it possible to inspect the cleaning workflow interactively);
  • Hosted instances also open up avenues to financial sustainability for the project, by offering paid hosting solutions without compromising on the openness of the tool, which would remain entirely free software.

See also: documentation on the 'broker protocol'

Better reconciliation (ecosystem) support

OpenRefine comes with dedicated support for matching free-text data to unique identifiers from authoritative sources, a process called “reconciliation”. It is one of the flagship features of the tool, with many reconciliation services being maintained by a wide range of communities. This ecosystem could be fostered in many ways:

  • Developing libraries and frameworks to build reconciliation services more easily;
  • Improving scoring mechanisms by integrating machine learning algorithms to derive scores from user annotations;
  • Providing hosted instances for reconciliation services (in the same spirit as hosted OpenRefine instances for collaboration);
  • Supporting greater sustainability of key reconciliation services.
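As a rough illustration of what a reconciliation service does, the sketch below matches a free-text query against a tiny in-memory authority table and returns scored candidates. Everything here is an assumption made for the example: the `AUTHORITY` table, the `reconcile` function, the use of `difflib` string similarity as the score, and the 0.9 match threshold. Real services typically use far richer scoring and expose their results over HTTP.

```python
from difflib import SequenceMatcher

# Illustrative authority table mapping identifiers to preferred names.
AUTHORITY = {"Q90": "Paris", "Q64": "Berlin", "Q84": "London"}


def reconcile(query, limit=3, match_threshold=0.9):
    """Return up to `limit` candidates for `query`, best score first."""
    candidates = []
    for identifier, name in AUTHORITY.items():
        # Similarity ratio in [0, 1] between the case-folded strings.
        ratio = SequenceMatcher(None, query.casefold(), name.casefold()).ratio()
        candidates.append(
            {
                "id": identifier,
                "name": name,
                "score": round(ratio * 100, 1),
                # Flag candidates confident enough to accept automatically.
                "match": ratio >= match_threshold,
            }
        )
    candidates.sort(key=lambda candidate: candidate["score"], reverse=True)
    return candidates[:limit]
```

Calling `reconcile("paris")` ranks the "Paris" entry first and flags it as a match. A learned scoring function, as suggested in the list above, would replace the fixed string-similarity ratio.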

Data visualization

Cleaning data requires a good understanding of its flaws. Being able to visualize distributions, correlations and other data analytics is therefore critical for a wide range of use cases, not just scientific ones. The tool offers facets which can be used to that end, but much more could be done:

  • Improved facets with greater flexibility (configurable binning size for numeric facets, client-side display for scatter plot facet);
  • Support for new facets (maps for geographical coordinates, scatter plot for discrete values);
  • Curve-fitting inside facets, outlier detection.
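The "configurable binning size" idea for numeric facets can be sketched in a few lines. This is only an illustration with a hypothetical `bin_counts` helper, not a description of how OpenRefine's numeric facets are implemented.

```python
import math
from collections import Counter


def bin_counts(values, bin_size):
    """Count values per bin, keyed by each bin's lower bound.

    A value v falls into the bin [k * bin_size, (k + 1) * bin_size).
    """
    if bin_size <= 0:
        raise ValueError("bin_size must be positive")
    counts = Counter(math.floor(v / bin_size) * bin_size for v in values)
    return dict(sorted(counts.items()))
```

The same values can then be viewed at different resolutions, for example `bin_counts(data, 10)` for a coarse overview and `bin_counts(data, 5)` for a finer one, which is the flexibility a user-configurable bin size would provide.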

New UI (possibly Vue or React based)

Many of OpenRefine's users request a more modern user interface. As a first step, we have initiated a dedicated UI/UX project for OpenRefine on GitHub which monitors issues that are specifically related to improvements to OpenRefine's user interface.

On the back burner

Some aspects of OpenRefine have previously been targeted for release, but have not made it into a release and have not been worked on recently. If you would like to see features in these areas, please create an issue that describes what development you would like to see:

  • Streamlining traditional features
  • Views: map, timeline, protovis (D3.js) charts
  • Better machinery to guess and re-encode cell values (useful for fixing encoding issues)
  • Column groups