Add a "delete_missing" option to CKAN harvester #542

Open
danielcoelhocgu opened this issue Nov 20, 2023 · 1 comment

Comments

@danielcoelhocgu

The Brazilian government has a very decentralized structure in which several entities run their own CKAN instances. We collect the data from all of these entities through the harvest extension.

We run into a lot of trouble when a dataset is deleted in one of those harvested CKAN portals: the CKAN harvester does not delete it in our CKAN, so our portal keeps showing many datasets with broken links or out-of-date information.

We propose to add an option to the CKAN harvester called delete_missing (boolean type), which will check for datasets that no longer exist in the harvested CKAN portal and delete them.

A nearly identical request was reported in issue #396 about 2 years ago. The author of that issue said he had written some custom code to solve it, but he never shared the code, so I am opening this new issue with the aim of submitting a pull request later.

My idea is to copy the same logic from the DCAT JSON harvester from ckanext-dcat:

  1. Inside gather_stage function:
    1.2. List all dataset UIDs that were imported through the current harvest source (by querying the harvest_object table).
    1.3. List all remote CKAN datasets, then check for local UIDs that are missing in the remote CKAN list.
    1.4. Create harvest objects with delete state for all of those missing datasets.
  2. Inside import_stage function:
    2.1. Effectively delete (but not purge) all those missing datasets.
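The comparison in steps 1.2 to 1.4 comes down to a set difference. Below is a minimal sketch of that logic only, not the code from ckanext-harvest or ckanext-dcat. The function names `find_missing_ids` and `plan_gather` and the `"harvest"`/`"delete"` action labels are hypothetical, and the two ID lists are assumed to have been fetched already (the local one from the harvest_object table, the remote one from the harvested CKAN).

```python
from typing import Iterable, List, Tuple


def find_missing_ids(local_ids: Iterable[str], remote_ids: Iterable[str]) -> List[str]:
    """Return the local dataset IDs that are absent from the remote listing."""
    remote = set(remote_ids)
    # Sorting keeps the result deterministic, which makes it easier to test.
    return sorted(i for i in set(local_ids) if i not in remote)


def plan_gather(
    local_ids: Iterable[str],
    remote_ids: Iterable[str],
    delete_missing: bool = False,
) -> List[Tuple[str, str]]:
    """Build (dataset_id, action) pairs for a hypothetical gather stage.

    Every remote dataset is harvested as usual. Deletions are only planned
    when the (proposed) delete_missing option is enabled.
    """
    remote = list(remote_ids)
    plan = [(i, "harvest") for i in sorted(set(remote))]
    if delete_missing:
        plan += [(i, "delete") for i in find_missing_ids(local_ids, remote)]
    return plan


if __name__ == "__main__":
    local = ["ds-a", "ds-b", "ds-c"]   # previously imported from this source
    remote = ["ds-b", "ds-c", "ds-d"]  # currently listed by the remote CKAN
    print(plan_gather(local, remote, delete_missing=True))
```

In a real harvester each "delete" entry would become a harvest object marked for deletion, and the import stage would then delete (not purge) the corresponding package.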

About step 1.2, I don't know whether it would be better to query the harvest_object table or to look for datasets whose harvest_source_id extra field matches the harvest source of the job. The extension seems to normally use the harvest_object table, but that approach won't work if the clear_history command has been run on the source.

I would appreciate any feedback on this implementation idea, since this is my first contribution to the project.

danielcoelhocgu added a commit to danielcoelhocgu/ckanext-harvest that referenced this issue May 8, 2024
danielcoelhocgu added a commit to danielcoelhocgu/ckanext-harvest that referenced this issue May 9, 2024
@danielcoelhocgu
Author

I wrote the proposed code in PR #548.

Regarding my question about step 1.2, I chose to fetch from the harvest_object table, to keep the same logic as the ckanext-dcat harvester.

I also changed one line in the base harvester to force a package update whenever the package already exists but is in the deleted state.

This is necessary to handle a situation where the remote CKAN instance has technical problems (with Solr) that cause the package_search API call to omit one or more datasets. In that case, with the delete_missing option enabled, the harvester would delete those datasets locally, which is correct. But once the remote CKAN fixes the problem, the datasets appear again in the package_search response, yet the harvest process would not update them, because the metadata_modified field does not change in this scenario.

It seems like a very unlikely situation, but it has already happened on the Brazilian government data portal.

The drawback is that this would also restore from the trash a dataset that was harvested and then manually deleted. Still, I think this is the right behaviour, since a purged harvested dataset is reimported on the next harvest run anyway.
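The update decision described above can be sketched as follows. This is an illustration under stated assumptions, not the actual line changed in the base harvester: the function name `should_update` and its parameters are hypothetical, and timestamps are assumed to be ISO 8601 strings in the same format, so that string comparison matches chronological order.

```python
def should_update(local_state: str, local_modified: str, remote_modified: str) -> bool:
    """Decide whether an already-harvested package should be updated.

    Hypothetical helper: a package in the "deleted" state is always updated,
    so a dataset that reappears remotely is restored even though its
    metadata_modified value has not changed.
    """
    if local_state == "deleted":
        return True
    # Otherwise fall back to the usual "remote copy is newer" check.
    return remote_modified > local_modified


if __name__ == "__main__":
    # Dataset dropped from package_search for a while, then came back unchanged.
    print(should_update("deleted", "2023-11-20T10:00:00", "2023-11-20T10:00:00"))
    # Active dataset with no remote change: nothing to do.
    print(should_update("active", "2023-11-20T10:00:00", "2023-11-20T10:00:00"))
```

The same check is what brings a manually deleted harvested dataset back, which is the trade-off noted above.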
