Put Harvest Records to DB after Harvest Runner Compare #4733

Closed
5 tasks
btylerburton opened this issue May 6, 2024 · 6 comments
Labels: H2.0/Harvest-Runner Harvest Source Processing for Harvesting 2.0

@btylerburton
Contributor

btylerburton commented May 6, 2024

User Story

In order to create an accurate baseline for our catalog records, datagovteam wants to put all harvested records that should be created, updated, and deleted into the harvest_records DB table.

Depends upon:

Acceptance Criteria

[ACs should be clearly demoable/verifiable whenever possible. Try specifying them using BDD.]

  • GIVEN we have a harvest source in our harvest runner
    WHEN we have completed the compare step
    AND we have a list of records that should be created, updated, or deleted
    THEN those records should be posted to the harvest_records db table in order to establish a baseline for comparison against incoming records in a future harvest.

Background

Our Harvest DB should be the source of truth.

Security Considerations (required)

[Any security concerns that might be implicated in the change. "None" is OK, just be explicit here!] None

Sketch

  • Wire the database interface into the harvest runner
  • Once compare process has run, create new DB records for all record objects that should be created/updated/deleted
    • For records to delete, a slim record containing the identifier and UUID should suffice
  • Save newly created records to DB
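
A rough illustration of the sketch above, not the harvester's actual code. The `HarvestRecord` fields, the `build_records` and `save_records` helpers, and the shape of the compare output (a dict with `create`/`update`/`delete` lists) are all assumptions made for this example, and `sqlite3` stands in for the real database interface.

```python
import sqlite3
import uuid
from dataclasses import dataclass
from typing import Optional


@dataclass
class HarvestRecord:
    # Hypothetical shape; the real harvest_records schema may differ.
    id: str
    identifier: str
    action: str  # "create", "update", or "delete"
    source_raw: Optional[str] = None  # left empty for slim delete records


def build_records(compare_result: dict) -> list:
    """Turn the (assumed) compare output into HarvestRecord objects."""
    records = []
    for action in ("create", "update", "delete"):
        for item in compare_result.get(action, []):
            records.append(
                HarvestRecord(
                    # The issue does not say which UUID a delete record carries;
                    # here every row simply gets a fresh one.
                    id=str(uuid.uuid4()),
                    identifier=item["identifier"],
                    action=action,
                    # Deletes keep only the identifier and UUID.
                    source_raw=None if action == "delete" else item.get("raw"),
                )
            )
    return records


def save_records(conn: sqlite3.Connection, records: list) -> int:
    """Insert all records in one transaction and return how many were written."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS harvest_records ("
        "id TEXT PRIMARY KEY, identifier TEXT NOT NULL, "
        "action TEXT NOT NULL, source_raw TEXT)"
    )
    with conn:  # commits on success, rolls back on error
        conn.executemany(
            "INSERT INTO harvest_records (id, identifier, action, source_raw) "
            "VALUES (?, ?, ?, ?)",
            [(r.id, r.identifier, r.action, r.source_raw) for r in records],
        )
    return len(records)
```

Writing everything in a single transaction means a failed insert leaves no partial baseline behind, which seems in keeping with treating the Harvest DB as the source of truth.
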
@btylerburton btylerburton added the H2.0/Harvest-Runner Harvest Source Processing for Harvesting 2.0 label May 6, 2024
@rshewitt
Contributor

rshewitt commented May 8, 2024

wanted to put this somewhere

reading harvest records

  • Option 1
    • Complicated query to derive the most recent successful harvest records from db
  • Option 2
    • Add unchanged records to the db
  • Option 3
    • Use db/solr query
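
For reference, Option 1 might look something like the query below. This is only a sketch under assumptions: the column names (`action`, `status`, `date_created`), the `'success'` and `'error'` status values, and the helper names are invented for illustration, and `sqlite3` stands in for the real database.

```python
import sqlite3

# For each identifier, pick the newest row whose status is 'success',
# then drop identifiers whose newest successful action was a delete.
LATEST_SUCCESS_SQL = """
SELECT hr.identifier, hr.id
FROM harvest_records AS hr
JOIN (
    SELECT identifier, MAX(date_created) AS latest
    FROM harvest_records
    WHERE status = 'success'
    GROUP BY identifier
) AS newest
  ON hr.identifier = newest.identifier
 AND hr.date_created = newest.latest
WHERE hr.status = 'success'
  AND hr.action != 'delete'
ORDER BY hr.identifier
"""


def latest_successful_records(conn: sqlite3.Connection) -> list:
    """Return (identifier, id) pairs for the current baseline."""
    return conn.execute(LATEST_SUCCESS_SQL).fetchall()


def demo() -> list:
    """Build a tiny in-memory table and run the query against it."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE harvest_records "
        "(id TEXT, identifier TEXT, action TEXT, status TEXT, date_created TEXT)"
    )
    conn.executemany(
        "INSERT INTO harvest_records VALUES (?, ?, ?, ?, ?)",
        [
            ("r1", "a", "create", "success", "2024-05-01"),
            ("r2", "a", "update", "success", "2024-05-08"),
            ("r3", "a", "update", "error", "2024-05-15"),  # failed, ignored
            ("r4", "b", "create", "success", "2024-05-01"),
            ("r5", "b", "delete", "success", "2024-05-08"),  # b was deleted
            ("r6", "c", "create", None, "2024-05-08"),  # never synced
        ],
    )
    return latest_successful_records(conn)
```

With the sample rows in `demo()`, only identifier `a` comes back, pointing at `r2`. If two successful rows for one identifier share a timestamp, both would be returned, so a real version would need a tiebreaker. Option 2 would avoid the self-join entirely, at the cost of writing a row for every unchanged record on every harvest.
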

@btylerburton
Contributor Author

open question on whether we should record unchanged records

@btylerburton
Contributor Author

updated ticket with dependency on #4744

@GSA/data-gov-dev-team I'm dropping this at the top of the Harvester 2.0 backlog, but please do review the AC/Sketch for completeness

@rshewitt rshewitt self-assigned this May 17, 2024
@rshewitt
Contributor

rshewitt commented May 17, 2024

status may need to be updated to include another value in the enum. something like "pending" because we write the compare results prior to writing them on ckan. a status of "success" doesn't really make sense if the record isn't on ckan yet right? and "error" doesn't make sense in the case where no error has occurred. my thinking is that status represents the sync status.

NVM it's a nullable field

@jbrown-xentity
Contributor

Correct. Status should be one of three things (I would think):

  • success
  • failure
  • pending

Eventually we'll have something to clean up any items that are stuck in pending indefinitely, but that's a cleanup job to be defined at a later date.
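
To make the three-value idea concrete, here is a small sketch. The `SyncStatus` enum, the `RecordState` class, the `stale_pending` helper, and the one-day threshold are all hypothetical and exist only for this example; the cleanup job itself is still undefined.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from enum import Enum


class SyncStatus(str, Enum):
    # The three values proposed above.
    SUCCESS = "success"
    FAILURE = "failure"
    PENDING = "pending"


@dataclass
class RecordState:
    # Hypothetical minimal view of a harvest record's sync state.
    identifier: str
    status: SyncStatus
    updated_at: datetime


def stale_pending(records, now, max_age=timedelta(days=1)):
    """Return records that have sat in 'pending' longer than max_age."""
    return [
        r
        for r in records
        if r.status is SyncStatus.PENDING and now - r.updated_at > max_age
    ]
```

A future cleanup job could call something like `stale_pending(records, datetime.now())` and decide whether to retry those records or mark them as failures.
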

@btylerburton
Contributor Author

Should we just make status a nullable value? So we wouldn't post any status until we get a success or failure.
