
GSoC Outreachy 2024 Ideas

Antonin Delpeuch edited this page Mar 20, 2024 · 15 revisions

Here is a list of projects which could be internship topics for OpenRefine's participation in Google Summer of Code in 2024. We have not applied to participate in Outreachy this time.

Potential mentors are encouraged to add their project descriptions here, following the template below. For examples, you can check out the previous years: GSoC Outreachy 2023 Ideas, GSoC Outreachy 2022 Ideas and GSoC 2020 Ideas. We are coordinating our participation in those programmes in this thread.

Note: applicants are also welcome to come up with their own project proposals!

For support, you can reach out on:

Support for the new reconciliation protocol

  • Duration: 350h
  • Description: The reconciliation feature in OpenRefine relies on a unified protocol, which is used to communicate with web services offering data matching functionality for various data sources. The W3C Entity Reconciliation Community Group has been working on a new version of this protocol, addressing a number of problems in the current one. For instance, it is currently very cumbersome to reconcile entities in OpenRefine when one does not have a column with the names of those entities, because the existing protocol assumes that those names must always be provided (#6044). We would like to add support for the new protocol to address this issue and various others (see the relevant issues below). We would also like to maintain support for the existing protocol, so that existing reconciliation services continue to work. As we do this work, we might be able to report issues to the W3C group, so that the new specifications can be improved accordingly. This might also be a good opportunity to clean up our current reconciliation code so that it relies on an external library instead.
  • Expected outcomes: OpenRefine can use reconciliation services which implement the new API, and this addresses at least some of the user-facing issues listed below.
  • Skills required/preferred: JavaScript & Java
  • Possible mentors: @wetneb, perhaps jointly with @ayushrai206?
  • Relevant issues: #6234, #6053, #6044, #4715, #3139, #2332, #2075
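
To illustrate the limitation mentioned in the description, here is a minimal Python sketch of how a batch of queries can be assembled in the shape used by the existing protocol. The field names (`query`, `limit`, `type`, `properties`, `pid`, `v`) follow the existing protocol; the helper name `build_queries`, the sample row and the property identifier are hypothetical and only serve the illustration. The sketch deliberately says nothing about the new protocol, whose details should be taken from the W3C group's specifications.

```python
import json


def build_queries(rows, name_column, property_columns, type_id=None, limit=3):
    """Build a batch of reconciliation queries (existing protocol shape).

    rows: list of dicts mapping column names to cell values.
    property_columns: dict mapping a column name to a property identifier
    (the identifiers used below are for illustration only).
    """
    queries = {}
    for index, row in enumerate(rows):
        name = row.get(name_column)
        if not name:
            # The existing protocol expects a name in "query", so a row
            # without one cannot be turned into a query here.
            raise ValueError(f"row {index} has no value in column {name_column!r}")
        query = {"query": name, "limit": limit}
        if type_id is not None:
            query["type"] = type_id
        properties = [
            {"pid": pid, "v": row[column]}
            for column, pid in property_columns.items()
            if row.get(column)
        ]
        if properties:
            query["properties"] = properties
        queries[f"q{index}"] = query
    return queries


rows = [{"name": "Douglas Adams", "born": "1952"}]
batch = build_queries(rows, "name", {"born": "P569"})
print(json.dumps(batch, indent=2))
```

A row that only carries an identifier column (an ISBN, say) and no name makes `build_queries` raise an error, which mirrors the situation described in #6044.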

New operation to add blank rows to a project

  • This project idea is withdrawn because it is nearly implemented already (by #6461)
  • Relevant issues: #1855

Better testing utilities for extension development

  • Duration: 350h
  • Description: OpenRefine's functionality can be augmented via extensions. However, developing extensions outside of OpenRefine's code base is not easy, in particular when it comes to debugging and testing. Our sample extension is included in OpenRefine's repository, making it of little use to demonstrate how to develop an extension as a third party. The goal of this project would be to improve this situation through the following steps:
    • develop a workflow to run OpenRefine with the Java debugger enabled, to debug an extension via commonly used IDEs (IntelliJ IDEA, Eclipse, VSCode). This would help developers interactively develop their extension in its actual execution environment.
    • develop utilities to run Cypress tests on an extension. Such a utility should download a specific version of OpenRefine and run it on a test workspace that contains the extension to test, so that Cypress tests can then be run against the extension.
    • provide a "model" extension, developed outside of OpenRefine's repository and following best practices (Java and Cypress tests, CI, only using clearly documented extension points), that people can take inspiration from. As an example extension, we could either revamp an existing extension (such as the Commons extension) or create a new one. We have a page with some ideas for new extensions.
    • ideally, clean up the existing documentation, making sure it is up to date and provides all the necessary information. We may take inspiration from external resources written on this subject, such as Giuliano Tortoreto's guide or Owen Stephens's blog post.
  • Expected outcomes: We have a model extension, developed outside of OpenRefine's repository, that people can take inspiration from to develop their own extensions
  • Skills required/preferred: proficiency in Java and JavaScript. Bash scripting would also be useful
  • Possible mentors: @wetneb
  • Relevant issues: (none yet)
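
As a small illustration of the debugging workflow mentioned above, the sketch below builds the standard JVM argument that enables the Java Debug Wire Protocol (JDWP) agent, which IDEs attach to through a "remote JVM debug" configuration. The `-agentlib:jdwp` syntax is the JVM's own; the helper name `jdwp_option` and the default port are arbitrary choices made for this example. How the argument gets handed to OpenRefine's launch script is left open here, since designing that workflow is part of the project.

```python
def jdwp_option(port=5005, suspend=False, host="localhost"):
    """Return the JVM argument that enables the JDWP debug agent.

    suspend=True makes the JVM wait for a debugger before starting,
    which helps when debugging code that runs at extension load time.
    """
    if not 0 < port < 65536:
        raise ValueError(f"invalid port: {port}")
    suspend_flag = "y" if suspend else "n"
    return (
        "-agentlib:jdwp=transport=dt_socket,server=y,"
        f"suspend={suspend_flag},address={host}:{port}"
    )


print(jdwp_option())
```

Binding to `localhost` keeps the debug port unreachable from other machines, which seems a sensible default for a developer workstation.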

Better test coverage for importers and exporters

  • Duration (90, 175 or 350 hours): 175 hours
  • Description: Importers are Java classes which enable the creation of OpenRefine projects from various file formats. Similarly, exporters let users download their cleaned data in various formats. Our importer and exporter classes currently come with relatively few unit tests, reaching a test coverage of about 74%. In this project we would write more tests to cover more use cases of those components. In the process, we expect to find bugs or possible enhancements, which we would also tackle.
  • Expected outcomes: the importer and exporter classes have a higher test coverage, for instance 90%
  • Skills required/preferred: familiarity with Java
  • Possible mentors: @wetneb
  • Relevant issues: (none yet)

Reconciliation server within OpenRefine

  • Duration: 350 hours
  • Description: OpenRefine could expose reconciliation services for the data stored in its own projects. This would make it possible to reconcile data from one project to another, providing a sort of "fuzzy join" between two projects. This requires implementing the reconciliation API as a server in the backend. Such an implementation would be useful even if not all of the features of the reconciliation API are implemented initially.
  • Expected outcomes: the OpenRefine backend can expose reconciliation services for projects in its workspace
  • Skills required/preferred: primarily backend-side, so familiarity with Java is important.
  • Possible mentors: @wetneb
  • Relevant issues: #2003, #941, #176
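
The idea of a "fuzzy join" can be sketched in a few lines of Python. The function below answers a batch of queries against rows held in memory and returns candidates with a subset of the fields used in the existing protocol's responses (`id`, `name`, `score`, `match`). Everything else is an assumption made for the illustration: the function name `reconcile`, the row layout, the use of `difflib` for scoring and the match threshold. An actual implementation would live in the Java backend and would likely use a different scoring method.

```python
from difflib import SequenceMatcher


def reconcile(queries, rows, name_column, id_column, match_threshold=0.95):
    """Answer a batch of queries against in-memory rows.

    Scores are similarity ratios scaled to 0-100. This scoring is an
    illustrative choice, not something the protocol prescribes.
    """
    responses = {}
    for key, query in queries.items():
        needle = query["query"].strip().lower()
        limit = query.get("limit", 3)
        candidates = []
        for row in rows:
            name = str(row[name_column])
            ratio = SequenceMatcher(None, needle, name.strip().lower()).ratio()
            candidates.append({
                "id": str(row[id_column]),
                "name": name,
                "score": round(ratio * 100, 1),
                "match": ratio >= match_threshold,
            })
        # Best candidates first, truncated to the requested limit.
        candidates.sort(key=lambda candidate: candidate["score"], reverse=True)
        responses[key] = {"result": candidates[:limit]}
    return responses


project_rows = [
    {"id": 1, "city": "Berlin"},
    {"id": 2, "city": "Bern"},
    {"id": 3, "city": "Paris"},
]
answers = reconcile({"q0": {"query": "berlin ", "limit": 2}}, project_rows, "city", "id")
print(answers["q0"]["result"])
```

Here the rows of one project play the role of the reference data, and the cells of another project would be sent as queries, which is what makes the result behave like a join on approximately equal values.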

User-defined clustering

  • Duration: 350 hours
  • Description: Our binning clusterers let the user choose between various methods to generate the bins into which the values are distributed. Extensions can define new binning methods, but writing an extension is still quite some work. It would be even better if users could simply provide an expression (GREL, Jython, Clojure…) which would compute the bin into which a given value falls. That would potentially let users better adapt the binning strategy to their own use cases. User-defined distances could also be used for kNN-based clustering.
  • Expected outcomes: A new clustering method which accepts a user-defined expression, either as a binning or kNN clusterer, potentially both.
  • Skills required/preferred: both the backend (Java) and frontend (HTML/CSS/JS) will need adapting
  • Possible mentors: @wetneb
  • Relevant issues: #4301
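
The binning approach described above can be sketched as follows: a keying function maps each value to a bin, and bins containing more than one distinct value become clusters. In this Python sketch the keying function is an ordinary callable standing in for a user-supplied expression. The names `fingerprint` and `cluster_by_key` are hypothetical, and the `fingerprint` function is only loosely modelled on the idea of a fingerprint keyer; it is not OpenRefine's implementation.

```python
import string
from collections import defaultdict


def fingerprint(value):
    """A simple keying function: normalise case, punctuation and token order."""
    table = str.maketrans("", "", string.punctuation)
    cleaned = value.strip().lower().translate(table)
    return " ".join(sorted(set(cleaned.split())))


def cluster_by_key(values, keyer):
    """Group values whose keys collide; keep bins with several distinct values."""
    bins = defaultdict(set)
    for value in values:
        bins[keyer(value)].add(value)
    return [sorted(members) for members in bins.values() if len(members) > 1]


values = ["Smith, John", "john smith", "John  Smith", "Jane Doe", "ACME-42", "acme 42"]

# Built-in style keyer: token order does not matter.
print(cluster_by_key(values, fingerprint))

# A "user-defined" keyer: keep only letters and digits, in their original order.
print(cluster_by_key(values, lambda v: "".join(ch for ch in v.lower() if ch.isalnum())))
```

The two keyers produce different clusters on the same values: the first one groups the three spellings of the name regardless of word order, while the second one ignores separators and therefore also groups the two product codes. This is the kind of adaptation a user-defined expression would make possible without writing an extension.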

Project template

  • Duration (90, 175 or 350 hours):
  • Description:
  • Expected outcomes:
  • Skills required/preferred:
  • Possible mentors:
  • Relevant issues: