
Identify approach for deriving latest set of records for a harvest source #4744

Closed
rshewitt opened this issue May 8, 2024 · 10 comments
Assignees
Labels
H2.0/Harvest-DB Postgres and related DB tickets

Comments

@rshewitt
Contributor

rshewitt commented May 8, 2024

User Story

In order to accurately run the compare, datagov wants to read the latest complete set of records from the db for a given harvest source

Acceptance Criteria

[ACs should be clearly demoable/verifiable whenever possible. Try specifying them using BDD.]

  • GIVEN a harvest source
    WHEN the db read happens
    THEN the latest complete set of records should be returned

Background

  • a "complete set" refers to all successful and non-deleted unique records of a harvest source according to an identifier ("identifier" in dcatus, or waf_url + dataset_name in waf)
  • the harvest record table contains current and historic versions of a harvest record, which prompts the question: how do we derive the latest complete set of successful records for a given harvest source?
  • a "historic" record, in the context of what needs to be used in the compare, is any record older than the most recent record with state = "success"
  • we expect records to be inserted according to the compare result

Security Considerations (required)

[Any security concerns that might be implicated in the change. "None" is OK, just be explicit here!]

Sketch

given...

 identifier | date_created | raw_source |   state
------------+--------------+------------+-----------
 a          | 2024-01-01   | "data"     | "success"
 a          | 2024-03-01   | "data_1"   | "success"
 b          | 2022-03-01   | "data_10"  | "success"
 b          | 2022-05-01   | "data_30"  | "failed"
 c          | 2020-03-01   | "data_12"  | "success"

the result should be...

 identifier | date_created | raw_source |   state
------------+--------------+------------+-----------
 a          | 2024-03-01   | "data_1"   | "success"
 b          | 2022-03-01   | "data_10"  | "success"
 c          | 2020-03-01   | "data_12"  | "success"

SELECT * FROM (
      SELECT DISTINCT ON (identifier) identifier, date_created, source_raw, state, action, source_id
      FROM record 
      WHERE state = 'success' AND source_id = 'example-source'
      ORDER BY identifier, date_created DESC ) sq
WHERE sq.action != 'delete';

[Notes or a checklist reflecting our understanding of the selected approach]

@rshewitt rshewitt added the H2.0/Harvest-DB Postgres and related DB tickets label May 8, 2024
@jbrown-xentity
Contributor

So the way to accomplish this is through window functions. Essentially it subselects everything, assigns a row number, overwrites the columns you want that are different, and lets you do the group by to make sure you get unique values. At this point you'll want to exclude "failures" from the initial window-function subselect; that way, if there should have been an update but it failed, the update will retry, and if the initial load failed it will retry as well (as if the last job for that record didn't happen).

You'll probably need to create some mock data and test to make sure this comes out correctly, but it should be able to pull the most recent hash for every ID record in the dataset, and then we exclude all deleted ones (IDs marked for deletion).
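A minimal sketch of the window-function approach described above, using Python's stdlib sqlite3 as a stand-in for Postgres (window functions need SQLite >= 3.25; the table and mock rows mirror the sketch data in this issue, simplified to date-only timestamps):

```python
import sqlite3

# In-memory stand-in for the harvest record table.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE record (
    id INTEGER PRIMARY KEY,
    identifier TEXT,
    date_created TEXT,
    source_raw TEXT,
    state TEXT
);
INSERT INTO record VALUES
    (1, 'a', '2024-01-01', 'data',    'success'),
    (2, 'a', '2024-03-01', 'data_1',  'success'),
    (3, 'b', '2022-03-01', 'data_10', 'success'),
    (4, 'b', '2022-05-01', 'data_30', 'failed'),
    (5, 'c', '2020-03-01', 'data_12', 'success');
""")

# ROW_NUMBER() partitions by identifier, newest-first; rn = 1 is the most
# recent successful row per identifier. Failures are filtered out before
# the window runs, so b falls back to its last success.
rows = conn.execute("""
    SELECT identifier, date_created, source_raw FROM (
        SELECT identifier, date_created, source_raw,
               ROW_NUMBER() OVER (
                   PARTITION BY identifier
                   ORDER BY date_created DESC) AS rn
        FROM record
        WHERE state = 'success'
    ) WHERE rn = 1
    ORDER BY identifier
""").fetchall()
print(rows)
```

Note that b's failed 2022-05-01 row never enters the window, so its 2022-03-01 success surfaces as the latest row, which is exactly the retry behavior described above.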

@rshewitt
Contributor Author

rshewitt commented May 9, 2024

CREATE TABLE record( 
	id int PRIMARY KEY, 
	identifier VARCHAR(50), 
	date_created TIMESTAMP, 
	source_raw VARCHAR(50), 
	state VARCHAR(15)
);

INSERT INTO record (id, identifier, date_created, source_raw, state ) VALUES 
(1, 'a', '2024-01-01 00:00:00.001', 'data', 'success'), 
(2, 'a', '2024-03-01 00:00:00.001', 'data_1', 'success'), 
(3, 'b', '2022-03-01 00:00:00.001', 'data_10', 'success'), 
(4, 'b', '2022-05-01 00:00:00.001', 'data_30', 'failed' ),
(5, 'c', '2024-05-01 00:00:00.001', 'data_12', 'success' );

SELECT * FROM record;

 id | identifier |      date_created       | source_raw |  state  
----+------------+-------------------------+------------+---------
  1 | a          | 2024-01-01 00:00:00.001 | data       | success
  2 | a          | 2024-03-01 00:00:00.001 | data_1     | success
  3 | b          | 2022-03-01 00:00:00.001 | data_10    | success
  4 | b          | 2022-05-01 00:00:00.001 | data_30    | failed
  5 | c          | 2024-05-01 00:00:00.001 | data_12    | success
(5 rows)

SELECT DISTINCT ON (identifier) identifier, state, date_created, source_raw
FROM record
WHERE state = 'success'
ORDER BY identifier, date_created DESC;
 
 identifier |  state  |      date_created       | source_raw 
------------+---------+-------------------------+------------
 a          | success | 2024-03-01 00:00:00.001 | data_1
 b          | success | 2022-03-01 00:00:00.001 | data_10
 c          | success | 2024-05-01 00:00:00.001 | data_12
(3 rows)

this is meant to prove out getting the latest unique record per identifier based on a condition

@FuhuXia
Member

FuhuXia commented May 10, 2024

Good approach. Do we need an extra action column? And a deleted record such as d should not be in the result:

  6 | d          | 2022-03-01 00:00:00.001 | data_3    | success | add
  7 | d          | 2024-05-01 00:00:00.001 |           | success | delete

@rshewitt
Contributor Author

implementation using sqlalchemy (source)

@rshewitt rshewitt self-assigned this May 10, 2024
@rshewitt
Contributor Author

need to update the query to properly handle deleted records. the query from my comment above could return a historic record with action = "create"; it shouldn't do that if the latest record is deleted. none of the historic records should be considered if the latest has been deleted.

@rshewitt
Contributor Author

rshewitt commented May 10, 2024

validate the DISTINCT ON approach with pure sql, including the exclusion of deleted records, before implementing it in sqlalchemy. after getting a working sql command, run it on staging to benchmark performance

@rshewitt
Contributor Author

rshewitt commented May 10, 2024

postgres playground. i believe the test data should account for our use cases, which are...

  • get records based on a source id
  • get unique records based on identifier
  • get the latest records based on date_created
  • exclude records that were not successful
  • exclude latest records that are deleted
  • include latest records that were previously deleted
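The use cases above can be smoke-tested with stdlib sqlite3 (SQLite has no DISTINCT ON, so this sketch emulates it with ROW_NUMBER(); the rows mirror the mock data in the next comment, with date-only timestamps for brevity):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE record (
    id INTEGER PRIMARY KEY, identifier TEXT, date_created TEXT,
    source_raw TEXT, state TEXT, action TEXT, source_id TEXT
);
INSERT INTO record VALUES
 (1,  'a', '2024-01-01', 'data',     'success', 'create', 'example-source'),
 (2,  'a', '2024-03-01', 'data_1',   'success', 'update', 'example-source'),
 (3,  'b', '2022-03-01', 'data_10',  'success', 'create', 'example-source'),
 (4,  'b', '2022-05-01', 'data_30',  'failed',  'update', 'example-source'),
 (5,  'c', '2024-05-01', 'data_12',  'success', 'create', 'example-source'),
 (6,  'd', '2024-05-01', 'data_2',   'success', 'delete', 'example-source'),
 (7,  'd', '2024-04-01', 'data_5',   'success', 'create', 'example-source'),
 (8,  'e', '2024-04-01', 'data_123', 'success', 'create', 'example-source'),
 (9,  'e', '2024-04-02', 'data_123', 'success', 'delete', 'example-source'),
 (10, 'e', '2024-04-03', 'data_123', 'success', 'create', 'example-source'),
 (11, 'f', '2024-04-01', 'data_87',  'success', 'create', 'example-source-other');
""")

# Latest successful row per identifier for one source, then drop
# identifiers whose latest action is a delete.
latest = conn.execute("""
    SELECT identifier, action FROM (
        SELECT identifier, action,
               ROW_NUMBER() OVER (
                   PARTITION BY identifier
                   ORDER BY date_created DESC) AS rn
        FROM record
        WHERE state = 'success' AND source_id = 'example-source'
    ) WHERE rn = 1 AND action != 'delete'
    ORDER BY identifier
""").fetchall()

print([row[0] for row in latest])  # ['a', 'b', 'c', 'e']
```

This covers all six cases: f is filtered by source_id, b's failed row is skipped, d's latest row is a delete so d is excluded, and e was deleted but later re-created so its latest create is included.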

@rshewitt
Contributor Author

rshewitt commented May 10, 2024

Schema (PostgreSQL v15)

    CREATE TABLE record( 
    	id int PRIMARY KEY, 
    	identifier VARCHAR(50), 
    	date_created TIMESTAMP, 
    	source_raw VARCHAR(50), 
    	state VARCHAR(15),
        action VARCHAR(15),
        source_id VARCHAR(30)
    );

    INSERT INTO record (id, identifier, date_created, source_raw, state, action, source_id ) VALUES 
    (1, 'a', '2024-01-01 00:00:00.001', 'data', 'success', 'create', 'example-source'), 
    (2, 'a', '2024-03-01 00:00:00.001', 'data_1', 'success', 'update', 'example-source'), 
    (3, 'b', '2022-03-01 00:00:00.001', 'data_10', 'success', 'create', 'example-source'), 
    (4, 'b', '2022-05-01 00:00:00.001', 'data_30', 'failed', 'update', 'example-source' ),
    (5, 'c', '2024-05-01 00:00:00.001', 'data_12', 'success', 'create', 'example-source' ),
    (6, 'd', '2024-05-01 00:00:00.001', 'data_2', 'success', 'delete', 'example-source' ),
    (7, 'd', '2024-04-01 00:00:00.001', 'data_5', 'success', 'create', 'example-source' ),
    (8, 'e', '2024-04-01 00:00:00.001', 'data_123', 'success', 'create','example-source' ),
    (9, 'e', '2024-04-02 00:00:00.001', 'data_123', 'success', 'delete','example-source' ),
    (10, 'e', '2024-04-03 00:00:00.001', 'data_123', 'success', 'create','example-source' ),
    (11, 'f', '2024-04-01 00:00:00.001', 'data_87', 'success', 'create','example-source-other' );

    SELECT * FROM record;
 id | identifier |       date_created       | source_raw |  state  | action |       source_id
----+------------+--------------------------+------------+---------+--------+----------------------
  1 | a          | 2024-01-01T00:00:00.001Z | data       | success | create | example-source
  2 | a          | 2024-03-01T00:00:00.001Z | data_1     | success | update | example-source
  3 | b          | 2022-03-01T00:00:00.001Z | data_10    | success | create | example-source
  4 | b          | 2022-05-01T00:00:00.001Z | data_30    | failed  | update | example-source
  5 | c          | 2024-05-01T00:00:00.001Z | data_12    | success | create | example-source
  6 | d          | 2024-05-01T00:00:00.001Z | data_2     | success | delete | example-source
  7 | d          | 2024-04-01T00:00:00.001Z | data_5     | success | create | example-source
  8 | e          | 2024-04-01T00:00:00.001Z | data_123   | success | create | example-source
  9 | e          | 2024-04-02T00:00:00.001Z | data_123   | success | delete | example-source
 10 | e          | 2024-04-03T00:00:00.001Z | data_123   | success | create | example-source
 11 | f          | 2024-04-01T00:00:00.001Z | data_87    | success | create | example-source-other

Option 1 ( using subquery )

    SELECT * FROM (
      SELECT DISTINCT ON (identifier) identifier, date_created, source_raw, state, action, source_id
      FROM record 
      WHERE state = 'success' AND source_id = 'example-source'
      ORDER BY identifier, date_created DESC ) sq
    WHERE sq.action != 'delete';
 identifier |       date_created       | source_raw |  state  | action |   source_id
------------+--------------------------+------------+---------+--------+----------------
 a          | 2024-03-01T00:00:00.001Z | data_1     | success | update | example-source
 b          | 2022-03-01T00:00:00.001Z | data_10    | success | create | example-source
 c          | 2024-05-01T00:00:00.001Z | data_12    | success | create | example-source
 e          | 2024-04-03T00:00:00.001Z | data_123   | success | create | example-source

View on DB Fiddle

@FuhuXia
Member

FuhuXia commented May 10, 2024

the subquery execution time against the staging catalog db is in the 200-400 ms range.

=> select count(*) from (
SELECT DISTINCT ON (package_id) package_id, gathered, state, report_status, current
FROM harvest_object 
WHERE state = 'COMPLETE' AND harvest_source_id='58f92550-7a01-4f00-b1b2-8dc953bd598f'
ORDER BY package_id, gathered DESC 
) t
 where report_status!='deleted';
 count 
-------
 37369
(1 row)

Time: 328.239 ms

=> select * from (
SELECT DISTINCT ON (package_id) package_id, gathered, state, report_status, current
FROM harvest_object 
WHERE state = 'COMPLETE' AND harvest_source_id='58f92550-7a01-4f00-b1b2-8dc953bd598f'
ORDER BY package_id, gathered DESC 
) t
 where report_status!='deleted' limit 3;
              package_id              |          gathered          |  state   | report_status | current 
--------------------------------------+----------------------------+----------+---------------+---------
 00025aef-24c7-4090-ba92-1fb962c8483b | 2022-11-30 17:21:03.005029 | COMPLETE | updated       | t
 0002b09c-f26d-4229-9ac6-7617eb0fc10f | 2020-11-11 17:02:35.063411 | COMPLETE | added         | t
 000336b7-85bc-48c7-bb01-5def319b7961 | 2023-02-03 20:27:01.00436  | COMPLETE | updated       | t
(3 rows)

Time: 284.748 ms

@rshewitt
Contributor Author

thanks @FuhuXia for benchmarking those queries! looks like we can proceed with using the subquery option in the sketch.
