Identify approach for deriving latest set of records for a harvest source #4744
The way to accomplish this is with window functions. Essentially, you subselect everything, assign each row a row number, override the columns you want that differ, and use the grouping to make sure you get unique values. You'll want to exclude "failures" from the initial window-function subselect; that way, if there should have been an update but it failed, the update will be retried, and if the initial load failed it will be retried as well (as if the last job for that record never happened). You'll probably need to create some mock data and test to make sure this comes out correctly, but it should be able to pull the most recent hash for every identifier in the dataset, and then we exclude all deleted ones (identifiers marked for deletion). |
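A minimal sketch of the window-function idea, using Python's built-in sqlite3 (SQLite supports ROW_NUMBER() but not Postgres's DISTINCT ON). The table and data mirror the example below; this is illustrative, not the project's actual implementation:

```python
import sqlite3

# In-memory database with a simplified version of the record table.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE record(
    id INTEGER PRIMARY KEY,
    identifier TEXT,
    date_created TEXT,
    source_raw TEXT,
    state TEXT
);
INSERT INTO record VALUES
    (1, 'a', '2024-01-01 00:00:00.001', 'data',    'success'),
    (2, 'a', '2024-03-01 00:00:00.001', 'data_1',  'success'),
    (3, 'b', '2022-03-01 00:00:00.001', 'data_10', 'success'),
    (4, 'b', '2022-05-01 00:00:00.001', 'data_30', 'failed'),
    (5, 'c', '2024-05-01 00:00:00.001', 'data_12', 'success');
""")

# Window function: number the rows per identifier, newest first,
# excluding failures so a failed create/update gets retried next run.
rows = conn.execute("""
    SELECT identifier, date_created, source_raw FROM (
        SELECT *,
               ROW_NUMBER() OVER (
                   PARTITION BY identifier
                   ORDER BY date_created DESC
               ) AS rn
        FROM record
        WHERE state = 'success'
    )
    WHERE rn = 1
    ORDER BY identifier
""").fetchall()

for r in rows:
    print(r)
# ('a', '2024-03-01 00:00:00.001', 'data_1')
# ('b', '2022-03-01 00:00:00.001', 'data_10')
# ('c', '2024-05-01 00:00:00.001', 'data_12')
```

Note that identifier 'b' resolves to its older successful row because the newer row failed, which is exactly the retry behavior described above.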
CREATE TABLE record(
id int PRIMARY KEY,
identifier VARCHAR(50),
date_created TIMESTAMP,
source_raw VARCHAR(50),
state VARCHAR(15)
);
INSERT INTO record (id, identifier, date_created, source_raw, state) VALUES
(1, 'a', '2024-01-01 00:00:00.001', 'data', 'success'),
(2, 'a', '2024-03-01 00:00:00.001', 'data_1', 'success'),
(3, 'b', '2022-03-01 00:00:00.001', 'data_10', 'success'),
(4, 'b', '2022-05-01 00:00:00.001', 'data_30', 'failed' ),
(5, 'c', '2024-05-01 00:00:00.001', 'data_12', 'success');
SELECT * FROM record;
id | identifier | date_created | source_raw | state
----+------------+-------------------------+------------+---------
1 | a | 2024-01-01 00:00:00.001 | data | success
2 | a | 2024-03-01 00:00:00.001 | data_1 | success
3 | b | 2022-03-01 00:00:00.001 | data_10 | success
4 | b | 2022-05-01 00:00:00.001 | data_30 | failed
5 | c | 2024-05-01 00:00:00.001 | data_12 | success
(5 rows)
SELECT DISTINCT ON (identifier) identifier, state, date_created, source_raw
FROM record
WHERE state = 'success'
ORDER BY identifier, date_created DESC;
identifier | state | date_created | source_raw
------------+---------+-------------------------+------------
a | success | 2024-03-01 00:00:00.001 | data_1
b | success | 2022-03-01 00:00:00.001 | data_10
c | success | 2024-05-01 00:00:00.001 | data_12
(3 rows)
This is meant to prove out getting the latest unique record based on a condition. |
Good approach. Do we need an extra column
|
Implementation using SQLAlchemy (source) |
Need to update the query to properly handle deleted records. Using the query from my comment above could return a historic record with action = "create"; it shouldn't do that if the latest record is deleted. None of the historic records should be considered if the latest one has been deleted. |
Validate the DISTINCT ON approach with pure SQL, including the exclusion of deleted records, before implementing it in SQLAlchemy. After getting a working SQL command, run it on staging to benchmark performance. |
Postgres playground. I believe the test data should account for our use cases, which are...
|
Schema (PostgreSQL v15)
CREATE TABLE record(
id int PRIMARY KEY,
identifier VARCHAR(50),
date_created TIMESTAMP,
source_raw VARCHAR(50),
state VARCHAR(15),
action VARCHAR(15),
source_id VARCHAR(30)
);
INSERT INTO record (id, identifier, date_created, source_raw, state, action, source_id) VALUES
(1, 'a', '2024-01-01 00:00:00.001', 'data', 'success', 'create', 'example-source'),
(2, 'a', '2024-03-01 00:00:00.001', 'data_1', 'success', 'update', 'example-source'),
(3, 'b', '2022-03-01 00:00:00.001', 'data_10', 'success', 'create', 'example-source'),
(4, 'b', '2022-05-01 00:00:00.001', 'data_30', 'failed', 'update', 'example-source' ),
(5, 'c', '2024-05-01 00:00:00.001', 'data_12', 'success', 'create', 'example-source' ),
(6, 'd', '2024-05-01 00:00:00.001', 'data_2', 'success', 'delete', 'example-source' ),
(7, 'd', '2024-04-01 00:00:00.001', 'data_5', 'success', 'create', 'example-source' ),
(8, 'e', '2024-04-01 00:00:00.001', 'data_123', 'success', 'create','example-source' ),
(9, 'e', '2024-04-02 00:00:00.001', 'data_123', 'success', 'delete','example-source' ),
(10, 'e', '2024-04-03 00:00:00.001', 'data_123', 'success', 'create','example-source' ),
(11, 'f', '2024-04-01 00:00:00.001', 'data_87', 'success', 'create', 'example-source-other');
SELECT * FROM record;
Option 1 (using subquery)
SELECT * FROM (
SELECT DISTINCT ON (identifier) identifier, date_created, source_raw, state, action, source_id
FROM record
WHERE state = 'success' AND source_id = 'example-source'
ORDER BY identifier, date_created DESC ) sq
WHERE sq.action != 'delete';
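The Option 1 logic can be sanity-checked quickly outside Postgres; here is a sketch using Python's built-in sqlite3, substituting ROW_NUMBER() for DISTINCT ON (which SQLite lacks) against the same playground data:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE record(
    id INTEGER PRIMARY KEY, identifier TEXT, date_created TEXT,
    source_raw TEXT, state TEXT, action TEXT, source_id TEXT
);
INSERT INTO record VALUES
    (1,  'a', '2024-01-01 00:00:00.001', 'data',     'success', 'create', 'example-source'),
    (2,  'a', '2024-03-01 00:00:00.001', 'data_1',   'success', 'update', 'example-source'),
    (3,  'b', '2022-03-01 00:00:00.001', 'data_10',  'success', 'create', 'example-source'),
    (4,  'b', '2022-05-01 00:00:00.001', 'data_30',  'failed',  'update', 'example-source'),
    (5,  'c', '2024-05-01 00:00:00.001', 'data_12',  'success', 'create', 'example-source'),
    (6,  'd', '2024-05-01 00:00:00.001', 'data_2',   'success', 'delete', 'example-source'),
    (7,  'd', '2024-04-01 00:00:00.001', 'data_5',   'success', 'create', 'example-source'),
    (8,  'e', '2024-04-01 00:00:00.001', 'data_123', 'success', 'create', 'example-source'),
    (9,  'e', '2024-04-02 00:00:00.001', 'data_123', 'success', 'delete', 'example-source'),
    (10, 'e', '2024-04-03 00:00:00.001', 'data_123', 'success', 'create', 'example-source'),
    (11, 'f', '2024-04-01 00:00:00.001', 'data_87',  'success', 'create', 'example-source-other');
""")

# Pick the latest successful row per identifier for the source,
# then drop identifiers whose latest row is a delete.
latest = conn.execute("""
    SELECT identifier, source_raw, action FROM (
        SELECT *,
               ROW_NUMBER() OVER (
                   PARTITION BY identifier
                   ORDER BY date_created DESC
               ) AS rn
        FROM record
        WHERE state = 'success' AND source_id = 'example-source'
    ) sq
    WHERE sq.rn = 1 AND sq.action != 'delete'
    ORDER BY identifier
""").fetchall()

ids = [row[0] for row in latest]
print(ids)  # ['a', 'b', 'c', 'e']
```

Identifier 'd' drops out because its latest record is a delete, 'e' stays in because a later create follows its delete, and 'f' is filtered by source_id, which matches the deleted-record handling requested above.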
|
The subquery execution time against the staging catalog DB is in the 200-400 ms range.
|
Thanks @FuhuXia for benchmarking those queries! Looks like we can proceed with the subquery option in the sketch. |
User Story
In order to accurately run the comparison, datagov wants to read the latest complete set of records from the DB for a given harvest source
Acceptance Criteria
[ACs should be clearly demoable/verifiable whenever possible. Try specifying them using BDD.]
WHEN the db read happens
THEN the latest complete set of records should be returned
Background
Security Considerations (required)
[Any security concerns that might be implicated in the change. "None" is OK, just be explicit here!]
Sketch
given...
the result should be...
[Notes or a checklist reflecting our understanding of the selected approach]