Allow full clear of completed jobs #503

bonnland · 2022-05-18T05:44:34Z

The source clear-history -k true command currently retains every job ID that at least one harvested dataset came from. This seems unnecessary, and for our CKAN instance, where many thousands of harvested datasets are added or updated in small batches, it means the job history list is always very long.

This PR will require a small 2-line change to the ckan/ckanext-spatial WAF harvesting code, to handle the case where a harvest job ID has been cleared. It is possible that other harvesters will need small adjustments like this; I've only tested the WAF harvesting code because this is my organization's particular use case.

I have also made a suggested change to the options for this command, so that "keep currently running jobs" is always done. It seems potentially confusing, possibly unhelpful, and unnecessary to clear the history of jobs that are still running.

I have also removed the option of clearing currently harvested objects as part of the "history clear" behavior. See discussion below about why this may be a good idea.

I know that this initial PR version has failing tests. I do not have testing set up on my current vagrant VM instance, and will try to fix the tests by looking at the GitHub test output.

Please feel free to suggest possible alternatives or edits. Thank you!

bonnland · 2022-05-18T05:59:31Z

Related PR: ckan/ckanext-spatial#284

bonnland · 2022-05-18T06:48:21Z

Note: I've verified on my vagrant VM that currently running harvest jobs are not cleared when the "job history clear" command is given. It is a source-based CKAN 2.9.5 install running uwsgi+nginx, with WAF harvesting enabled.

bonnland · 2022-05-18T08:26:32Z

Related to #293

Zharktas · 2022-05-18T08:48:44Z

ckanext/harvest/model/__init__.py

@@ -356,7 +357,7 @@ def define_harvester_tables():
        Column('state', types.UnicodeText, default=u'WAITING'),
        Column('metadata_modified_date', types.DateTime),
        Column('retry_times', types.Integer, default=0),
-        Column('harvest_job_id', types.UnicodeText, ForeignKey('harvest_job.id')),
+        Column('harvest_job_id', types.UnicodeText, ForeignKey('harvest_job.id', ondelete='SET NULL'), nullable=True),


Changing the table structure would require a migration script; otherwise the change would not be applied to any existing instance.

I agree that a schema change creates some challenges for existing databases. However, the ckanapi package gives very nice support for exporting and re-importing users, packages, and organizations, so setting up a fresh database instance with imported users, etc. is not very time-consuming. I could write up a set of instructions for the Wiki if this is useful.
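
For readers unfamiliar with what the proposed ondelete='SET NULL' change does at the database level, here is a minimal sketch using Python's built-in sqlite3 module. The two tables are heavily simplified stand-ins for harvest_job and harvest_object (an assumption for illustration, not the real schema), and SQLite is used only so the example is self-contained. On an existing PostgreSQL database the constraint would still have to be dropped and re-created by a migration, which this sketch does not cover.

```python
import sqlite3


def demo_set_null():
    """Show that deleting a job leaves its objects behind with a NULL job id."""
    conn = sqlite3.connect(":memory:")
    # SQLite only enforces foreign keys when this pragma is switched on.
    conn.execute("PRAGMA foreign_keys = ON")
    # Simplified stand-ins for the real tables (assumption for illustration).
    conn.execute("CREATE TABLE harvest_job (id TEXT PRIMARY KEY)")
    conn.execute(
        "CREATE TABLE harvest_object ("
        " id TEXT PRIMARY KEY,"
        " harvest_job_id TEXT REFERENCES harvest_job(id) ON DELETE SET NULL)"
    )
    conn.execute("INSERT INTO harvest_job VALUES ('job-1')")
    conn.execute("INSERT INTO harvest_object VALUES ('obj-1', 'job-1')")
    # With ON DELETE SET NULL the delete succeeds and the object row survives.
    conn.execute("DELETE FROM harvest_job WHERE id = 'job-1'")
    rows = conn.execute(
        "SELECT id, harvest_job_id FROM harvest_object"
    ).fetchall()
    conn.close()
    return rows


print(demo_set_null())
```

Any harvester code that reads the job of a harvest object would then have to tolerate a missing job, which is what the related ckanext-spatial change is about.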

seitenbau-govdata · 2022-05-18T08:53:45Z

Hi @bonnland. Thanks for the interesting pull request. Could you explain the advantage of deleting the harvest jobs and keeping the harvest objects?

The current implementation with the option -k true preserves the harvest jobs with current harvest objects as long as a harvest job has at least one current harvest object. And these harvest jobs with their reports are still available in the UI.

bonnland · 2022-05-18T16:29:21Z

I just realized that the user option --keep_current for keeping current harvest objects would no longer be needed with this pull request.

If there are organizations who want duplicate packages in the package table, this pull request should not be accepted. It's hard to think of a reason why duplicate records would be good, though, from my perspective.

bonnland · 2022-05-18T16:50:12Z

> Hi @bonnland. Thanks for the interesting pull request. Could you explain the advantage of deleting the harvest jobs and keeping the harvest objects?

Hi @seitenbau-govdata, thanks for the attention and interest. Keeping the harvest objects allows tracking of what has been harvested already. If all harvest objects are cleared as part of the history clear, then the harvester will re-harvest datasets that already appear in the package table, and duplicate rows in the package table will be created. For our organization, this is costly because our harvest sources have thousands of datasets. The package table fills up quickly with duplicate rows if harvest objects are not kept, and it can take hours to re-harvest everything.

So I would argue that keeping current harvested objects should be the default behavior, at the very least. Currently, it requires passing a flag. Users who do not realize they must pass a flag discover eventually that their package table gets large with repeated uses of the "history clear" command. They also discover that every harvested dataset URL has this strange integer at the end of it, which increments every time the "history clear" command is used.

Also, the advantage of clearing all completed harvest jobs is that it becomes much easier to track how often a harvest source is being updated. Without this pull request, even when using the -k flag for keeping harvest objects, I end up with a "cleared" job history list that is very large. It seems unnecessary and confusing to not have the option to fully clear all completed jobs.
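
To make the duplicate-row argument concrete, here is a toy model of a harvest run. It is not the ckanext-harvest implementation: the function names, the dictionary used as a stand-in for harvest objects, and the integer-suffix naming scheme are all assumptions based only on the behavior described in this comment.

```python
def harvest(remote_guids, harvest_objects, packages):
    """Toy harvest run (hypothetical, for illustration only).

    harvest_objects maps a remote GUID to the package name it produced;
    packages is the list of package names already in the catalogue.
    """
    for guid in remote_guids:
        if guid in harvest_objects:
            # Known record: nothing new is created.
            continue
        # Unknown record: create a package, avoiding a name collision
        # by appending an integer suffix.
        name, n = guid, 0
        while name in packages:
            n += 1
            name = "{0}{1}".format(guid, n)
        packages.append(name)
        harvest_objects[guid] = name
    return packages


def simulate(clear_objects):
    """Harvest the same record twice, optionally clearing objects in between."""
    objects, packages = {}, []
    harvest(["dataset-a"], objects, packages)
    if clear_objects:
        objects.clear()  # a history clear that also drops harvest objects
    harvest(["dataset-a"], objects, packages)
    return packages


print(simulate(clear_objects=False))
print(simulate(clear_objects=True))
```

In this model, keeping the harvest objects leaves a single package, while clearing them produces a second package whose name carries a numeric suffix.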

bonnland · 2022-05-18T18:50:00Z

> And these harvest jobs with their reports are still available in the UI.

For our purposes, the harvest job reports help in the short term, to fix harvesting errors. Their usefulness rarely goes past a few days.

Is it useful to have job reports for cases where the datasets were harvested successfully? Because it seems that harvest job ids are kept when datasets are harvested successfully. If you can explain how this is helpful, I would appreciate knowing the advantages.

bonnland · 2022-05-18T21:40:07Z

ckanext/harvest/utils.py

@@ -216,18 +216,12 @@ def clear_harvest_source_history(source_id, keep_current):
    if source_id is not None:
        tk.get_action("harvest_source_job_history_clear")(context, {
            "id": source_id,
-            "keep_current": keep_current
-        })
+            })
        return "Cleared job history of harvest source: {0}".format(source_id)


Note the language used in the return statement. This command is most useful if it clears the "job history", not the entire source history. Perhaps a change in the command name to "clear-job-history" would be better, as it more clearly states the eventual outcome of the command.

bonnland · 2022-05-18T23:39:42Z

I should probably not forget to add that our organization adds new records individually, at an average rate of 3-4 per week. That means I have a "cleared" job history list with over 100 entries for some of our WAFs. This might not be the usual case for others, so the urgency for this PR is probably not as relevant for others as it might be for us.

Zharktas · 2022-05-19T04:52:04Z

> > And these harvest jobs with their reports are still available in the UI.
>
> For our purposes, the harvest job reports help in the short term, to fix harvesting errors. Their usefulness rarely goes past a few days.
>
> Is it useful to have job reports for cases where the datasets were harvested successfully? Because it seems that harvest job ids are kept when datasets are harvested successfully. If you can explain how this is helpful, I would appreciate knowing the advantages.

Not all instances harvest daily. For you the history might be relevant for only a few days, but others might need to know what happened three months ago.

What I would do is add a separate command for what you are trying to achieve, or at least an option in the current one. It would be a nasty surprise for someone to run this command and find that its functionality has changed.

bonnland · 2022-05-19T13:23:22Z

> What I would do is add a separate command for what you are trying to achieve, or at least an option in the current one. It would be a nasty surprise for someone to run this command and find that its functionality has changed.

I am not sure I understand. The command is run when someone wants to clear the job history. If they don't want to clear the job history, then they do not run the command.

What this pull request does is make this command behave as it did before 2016 or 2017. I am not sure why it was changed; it worked very well before.

Zharktas · 2022-05-19T13:28:10Z

There's discussion in #484 and #397 about why the change was made originally.

bonnland · 2022-05-19T13:44:38Z

> There's discussion in #484 and #397 about why the change was made originally.

OK, perhaps it would be better if a new command option is made available. How does "harvester source clear-job-history" sound? It would still require a change to the foreign key constraint on the harvest_object table.

seitenbau-govdata · 2022-05-19T13:51:20Z

> What this pull request does is make this command behave as it did before 2016 or 2017. I am not sure why it was changed; it worked very well before.

We introduced the command clearsource_history (as click command: source clear-history) with #268 at the end of 2016. As far as I know, the only change since then was the fix in #484.

bonnland · 2022-05-19T14:00:24Z

> We introduced the command clearsource_history (as click command: source clear-history) with #268 at the end of 2016. As far as I know, the only change since then was the fix in #484.

There used to be a command that would clear all completed jobs. It was very helpful. I suppose it disappeared when the clearsource_history command was added.

seitenbau-govdata · 2022-05-19T14:11:22Z

> There used to be a command that would clear all completed jobs. It was very helpful. I suppose it disappeared when the clearsource_history command was added.

No, no command was removed when the clearsource_history command was added. The only command was, and still is, clearsource (as click command: clear source), which deletes the source with all datasets. That is why we introduced the clearsource_history command: there was no command for deleting only the harvest job history. If such a command had existed, we would not have introduced clearsource_history.

bonnland · 2022-05-19T14:19:55Z

> No, no command was removed when the clearsource_history command was added. The only command was, and still is, clearsource (as click command: clear source), which deletes the source with all datasets. That is why we introduced the clearsource_history command: there was no command for deleting only the harvest job history. If such a command had existed, we would not have introduced clearsource_history.

I remember a time when I could clear all completed jobs, and it would not cause duplicate rows in the package table after it was used. This issue of the package table filling up is potentially serious, and I am surprised it has not come up somewhere before.

seitenbau-govdata · 2022-05-19T14:55:14Z

> I remember a time when I could clear all completed jobs, and it would not cause duplicate rows in the package table after it was used. This issue of the package table filling up is potentially serious, and I am surprised it has not come up somewhere before.

Maybe somewhere else in a fork? But unfortunately not in ckanext-dcat. Yes, I agree. With many thousands of datasets it is really serious and a huge problem. After going into production in early 2016 with more than 20,000 datasets and a harvesting interval of 2 days, we noticed after a few months that the size of our database was increasing and the harvest UI was getting slower. That is why we started to implement the new command.

bonnland · 2022-05-19T15:05:22Z

> With many thousands of datasets it is really serious and a huge problem.

It is helpful to know that our organization is not the only one who has had this problem. The current interface does not provide any way of clearing past jobs without creating an entire new set of duplicate rows in the package table.

seitenbau-govdata · 2022-05-19T15:24:40Z

> It is helpful to know that our organization is not the only one who has had this problem. The current interface does not provide any way of clearing past jobs without creating an entire new set of duplicate rows in the package table.

Actually, this should be possible with the new option --keep-current (short form -k). We use harvester source clear-history -k true to delete the old harvest jobs and keep the latest harvest jobs with the current harvest objects. Does this not work for you?

bonnland · 2022-05-19T15:29:38Z

> Actually, this should be possible with the new option --keep-current (short form -k). We use harvester source clear-history -k true to delete the old harvest jobs and keep the latest harvest jobs with the current harvest objects. Does this not work for you?

It keeps hundreds of harvest jobs when I use this command. Any harvest job with a successfully harvested dataset is kept. For us, this is hundreds of harvest jobs. Many of the jobs involve adding a single new record to the harvest source. All of these jobs are kept. There is very little useful information in the jobs that are kept, because they all represent successful harvests. An always-growing, long list of successful harvests has very little useful information in it for our organization, and it becomes difficult to track changes to the harvesting behavior.
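
The retention rule described in this thread (a job survives if at least one current harvest object still points at it) can be sketched as follows. This is a simplified model with hypothetical names, not the extension's actual query. The second scenario, in which one later job re-harvested every record, is an assumption added for contrast.

```python
def jobs_kept_with_keep_current(jobs, harvest_objects):
    """Return ids of jobs that survive a history clear with keep-current.

    jobs: list of job ids, oldest first.
    harvest_objects: list of (job_id, is_current) tuples.
    """
    referenced = {job_id for job_id, is_current in harvest_objects if is_current}
    return [job_id for job_id in jobs if job_id in referenced]


# Scenario 1: 100 small jobs, each adding one record that is still current.
jobs = ["job-{0}".format(i) for i in range(100)]
objects = [(job_id, True) for job_id in jobs]
kept_small_batches = jobs_kept_with_keep_current(jobs, objects)

# Scenario 2 (assumed): a final job re-harvested every record, so all
# earlier objects are no longer current.
jobs_full = jobs + ["job-full"]
objects_full = [(job_id, False) for job_id in jobs] + [("job-full", True)] * 100
kept_full_reharvest = jobs_kept_with_keep_current(jobs_full, objects_full)

print(len(kept_small_batches), kept_full_reharvest)
```

Under this model, a source that grows one record at a time keeps every one of its jobs, which matches the long "cleared" list described above.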

After more thought, perhaps some organizations want to track the rate of successful harvests over time. Maybe this is useful to some, but that is not true for us. And it seems that some organizations would want to "reset" their tracking by clearing all jobs at some point, without creating a full set of duplicate rows in the package table.

bonnland · 2022-05-19T17:14:58Z

> We use harvester source clear-history -k true to delete the old harvest jobs and keep the latest harvest jobs with the current harvest objects.

@seitenbau-govdata Do you find the remaining job reports useful in any way? I am really trying to understand the benefits of keeping old job reports that have no errors in them. EDIT: I can see how tracking harvest history could be valuable. See below for possible ways to allow no changes to keep_current=true.

bonnland · 2022-05-19T19:06:16Z

If the keep_current=true behavior as it is now is useful to some organizations, there are a few ways we could go that would also work for our organization. Here are some possible choices for how to change the current interface behavior that could work:

1. When keep_current=false, also delete and purge harvested datasets in the package table. This would prevent duplicate entries from being created in the package table over the long term. But then it would be very similar to the behavior of the "clear source" command, and it would make datasets disappear from CKAN until they are harvested again.
2. When keep_current=false, do not delete all information about what has been harvested. Retain all harvest objects that are still current, but clear the job_id field. This would prevent duplicate entries in the package table by preserving the prior state of harvesting for the source.
3. Add a new command "clear-job-history" that removes past jobs without losing information about what has been harvested already. This would also prevent duplicate entries in the package table. Retain all harvest objects that are still current, but clear the job_id field in the harvest_object table.

At first, it seemed to me that the second and third choices would require setting the "job_id" field in the harvest_object table to NULL, but it might also work very well to set the job_ids to be the same (non-NULL) value, perhaps the value of a constant PRIOR_JOBS_ID. This would create a single "prior job" to replace (and summarize!) the potentially large number of "add new record" jobs that existed before. It might be an elegant and simple solution if there are not other foreign key constraints to prevent it.
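
The placeholder-job idea could look roughly like the sketch below. It runs against simplified SQLite tables rather than the real schema. The table layout, the status strings, the function name, and the value of PRIOR_JOBS_ID are all assumptions for illustration; only the idea of a single constant job id comes from the comment above.

```python
import sqlite3

PRIOR_JOBS_ID = "prior-jobs"  # hypothetical constant, as suggested above

# Finished jobs of one source, excluding the placeholder itself.
FINISHED = (
    "SELECT id FROM harvest_job "
    "WHERE source_id = ? AND status = 'Finished' AND id != ?"
)


def clear_job_history(conn, source_id):
    """Collapse all finished jobs of a source into one placeholder job."""
    with conn:  # one transaction: commit on success, roll back on error
        # Make sure the placeholder job exists.
        conn.execute(
            "INSERT OR IGNORE INTO harvest_job (id, source_id, status) "
            "VALUES (?, ?, 'Finished')",
            (PRIOR_JOBS_ID, source_id),
        )
        # Drop objects of finished jobs that are no longer current.
        conn.execute(
            "DELETE FROM harvest_object WHERE current = 0 "
            "AND harvest_job_id IN (" + FINISHED + ")",
            (source_id, PRIOR_JOBS_ID),
        )
        # Re-point the remaining (current) objects at the placeholder.
        conn.execute(
            "UPDATE harvest_object SET harvest_job_id = ? "
            "WHERE harvest_job_id IN (" + FINISHED + ")",
            (PRIOR_JOBS_ID, source_id, PRIOR_JOBS_ID),
        )
        # Nothing references the finished jobs any more, so delete them.
        conn.execute(
            "DELETE FROM harvest_job "
            "WHERE source_id = ? AND status = 'Finished' AND id != ?",
            (source_id, PRIOR_JOBS_ID),
        )


def demo():
    conn = sqlite3.connect(":memory:")
    conn.execute("PRAGMA foreign_keys = ON")
    conn.execute(
        "CREATE TABLE harvest_job "
        "(id TEXT PRIMARY KEY, source_id TEXT, status TEXT)"
    )
    conn.execute(
        "CREATE TABLE harvest_object (id TEXT PRIMARY KEY, current INTEGER, "
        "harvest_job_id TEXT NOT NULL REFERENCES harvest_job(id))"
    )
    conn.executemany(
        "INSERT INTO harvest_job VALUES (?, 'src', ?)",
        [("j1", "Finished"), ("j2", "Finished"), ("j3", "Running")],
    )
    conn.executemany(
        "INSERT INTO harvest_object VALUES (?, ?, ?)",
        [("o1", 0, "j1"), ("o2", 1, "j1"), ("o3", 1, "j2"), ("o4", 0, "j3")],
    )
    conn.commit()
    clear_job_history(conn, "src")
    jobs = [r[0] for r in conn.execute("SELECT id FROM harvest_job ORDER BY id")]
    objects = conn.execute(
        "SELECT id, harvest_job_id FROM harvest_object ORDER BY id"
    ).fetchall()
    conn.close()
    return jobs, objects


print(demo())
```

Because job_id stays non-NULL in this variant, the column definition and its foreign key would not need to change; the running job and its objects are left untouched.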

I would really like to know how the current behavior of keep_current=false is helpful to organizations. It does almost the same thing as the "clear source" command, except that it keeps harvested datasets in the package table and makes it very easy to create a huge package table over time. It also has the side-effect of changing the dataset URLs because of existing URL collisions in the package table. Every time the command is run with keep_current=false, all datasets are re-harvested and dataset URLs get a new integer ending. The end-users at our organization have sharp eyes, and in the past, before I knew I had to pass the -k flag, they asked why the URLs were always changing.

bonnland · 2022-05-21T22:29:20Z

> Maybe somewhere else in a fork?

You may be correct that clearsource_history without keep_current=true has always led to the package table growing over time. Perhaps it was harder to notice before because the harvested dataset URLs did not change on every re-harvest after the command was used.

If this pull request is too controversial for existing users, then I will create a second pull request with a new command called clear_job_history or something similar. I will aim for setting out-of-date job_id values to some predetermined constant, so that it avoids changing the database schema.

CKAN User added 2 commits May 16, 2022 21:43

Make job_id nullable, to allow job history clear

5042af9

Initial try at SQL logic

36a6f93

Update update.py

66daebf

bonnland mentioned this pull request May 18, 2022

Allow objects with cleared job ids ckan/ckanext-spatial#284

Open

CKAN User added 2 commits May 18, 2022 06:38

Remove flag that should always be true

843961f

Match tests to updated behavior

5981be3

CKAN User added 8 commits May 18, 2022 06:55

Match tests to updated behavior

39d3551

Match tests to updated behavior

60f7580

Match tests to updated behavior

1793585

Match tests to updated behavior

5d8a759

Match tests to updated behavior

9bfd600

Match tests to updated behavior

81c5305

Match tests to updated behavior

f1b6c68

Match tests to updated behavior

08252ea

Clarify documentation/description

80fb8f6

Zharktas reviewed May 18, 2022

CKAN User added 3 commits May 18, 2022 17:52

Remove the possibility of --keep-curent==False, to avoid duplicate harvested datasets

9f22820

Remove duplicate tests

95e5b48

Update comment

2d31e13

Update comment

fa17f7d
