Do not duplicate harvest_extras if exist in root schema #521

Merged
merged 3 commits into from Mar 14, 2023

@pdelboca pdelboca commented Mar 9, 2023

This PR refactors and documents the logic of our dataset_before_index method.

With the new changes, harvest metadata will not be added to extras if it already exists in the root schema of the dataset. This allows other extensions or implementations to add harvest metadata to the main package schema without getting duplicate-field errors when trying to update or patch the dataset.
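As a rough illustration of the behaviour described above, here is a minimal sketch. It is not the code from this PR: the helper name `merge_harvest_extras` and the dict layout (an `extras` list of `{"key": ..., "value": ...}` items) are assumptions made for the example.

```python
def merge_harvest_extras(dataset_dict, harvest_extras):
    """Return a copy of dataset_dict with harvest extras appended.

    Hypothetical helper for illustration only. A harvest key is skipped
    when it already exists at the root of the dataset or in its extras.
    """
    extras = list(dataset_dict.get("extras", []))
    existing_extra_keys = {extra["key"] for extra in extras}

    for key, value in harvest_extras:
        if key in dataset_dict or key in existing_extra_keys:
            # Already provided by the schema or by another extra: do not duplicate.
            continue
        extras.append({"key": key, "value": value})
        existing_extra_keys.add(key)

    result = dict(dataset_dict)
    result["extras"] = extras
    return result


# Example: harvest_source_id lives in the root schema, so only
# harvest_object_id ends up in extras.
dataset = {"name": "example", "harvest_source_id": "abc", "extras": []}
harvest_extras = [("harvest_source_id", "abc"), ("harvest_object_id", "123")]
merged = merge_harvest_extras(dataset, harvest_extras)
```

The sketch returns a new dict rather than mutating the input, which keeps the example easy to reason about; the real method may well update the dict in place.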

@pdelboca pdelboca added the WIP label Mar 9, 2023

# Add harvest extras to main indexed pkg_dict
for key, value in harvest_extras:
    pkg_dict[key] = value
Member Author

This code has been here but I don't see it working.

Indexed packages do not have these attributes when calling Solr directly.

Member

This was done to add the harvest field as the "catch-all" string field in Solr. These values are indexed but not stored, so that's why you don't see them when calling Solr directly.
With your changes, we should only add them if they are not already present in pkg_dict.
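The suggestion above could look roughly like the following. This is only a sketch: the function name `add_harvest_fields_to_index` is made up for the example, and `harvest_extras` is assumed to be a list of `(key, value)` tuples, as in the snippet under review.

```python
def add_harvest_fields_to_index(pkg_dict, harvest_extras):
    """Copy harvest fields into the index dict without overriding existing keys.

    Hypothetical helper for illustration; pkg_dict is the dict sent to Solr.
    """
    for key, value in harvest_extras:
        if key not in pkg_dict:
            # Only add the field when the package schema has not set it already.
            pkg_dict[key] = value
    return pkg_dict


# Example: the value coming from the schema wins over the harvest extra.
index_dict = {"name": "example", "harvest_source_title": "From schema"}
add_harvest_fields_to_index(
    index_dict,
    [("harvest_source_title", "From harvester"), ("harvest_object_id", "123")],
)
```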

@pdelboca pdelboca requested a review from amercader March 9, 2023 13:42

pdelboca commented Mar 9, 2023

@amercader As a continuation of this PR I want to add some documentation about pkg_dict, data_dict and validated_data_dict on CKAN Docs.

I still don't quite understand their uses and the differences between them.

@amercader
Member

@pdelboca This looks good!

re the different dicts:

  • pkg_dict means "the dataset dict that will be sent to Solr for indexing"
  • data_dict means the default dataset metadata dict without any validation applied (eg all custom fields are extras)
  • validated_data_dict means the dataset metadata dict after applying validation (eg what you get on package_show with the schema customizations applied)
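To make the distinction concrete, here is a small hand-written example. The field names and values are invented, and the assumption that the index dict carries the other two dicts as JSON strings under the `data_dict` and `validated_data_dict` keys should be checked against the CKAN version in use.

```python
import json

# Default dataset dict, no validation applied: the custom field is an extra.
data_dict = {
    "name": "example",
    "extras": [{"key": "theme", "value": "transport"}],
}

# After validation with a custom schema: the field sits at the root.
validated_data_dict = {"name": "example", "theme": "transport"}

# Dict sent to Solr (assumed layout): flat fields plus both dicts serialized.
pkg_dict = {
    "name": "example",
    "data_dict": json.dumps(data_dict),
    "validated_data_dict": json.dumps(validated_data_dict),
}


def custom_field(dataset, key):
    """Read a field from the root of a dataset dict, falling back to extras."""
    if key in dataset:
        return dataset[key]
    for extra in dataset.get("extras", []):
        if extra["key"] == key:
            return extra["value"]
    return None


theme_from_raw = custom_field(json.loads(pkg_dict["data_dict"]), "theme")
theme_from_validated = custom_field(
    json.loads(pkg_dict["validated_data_dict"]), "theme"
)
```

Both lookups return the same value, but from different places, which is why a key can end up duplicated if it is written to extras while the schema already defines it at the root.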

This is the responsibility of the package. We are skipping any override, since users will expect the behaviour of the custom logic added in the package schema and validators.
@amercader amercader merged commit 5451308 into master Mar 14, 2023
@pdelboca pdelboca removed the WIP label Mar 22, 2023