
Identify approach for deriving latest set of records for a harvest source #4744

Closed
rshewitt opened this issue May 8, 2024 · 10 comments
Assignees
Labels
H2.0/Harvest-DB Postgres and related DB tickets

Comments

@rshewitt
Contributor

rshewitt commented May 8, 2024

User Story

In order to accurately run the compare, datagov wants to read the latest complete set of records from the db for a given harvest source

Acceptance Criteria

[ACs should be clearly demoable/verifiable whenever possible. Try specifying them using BDD.]

  • GIVEN a harvest source
    WHEN the db read happens
    THEN the latest complete set of records should be returned

Background

  • a "complete set" refers to all successful and non-deleted unique records of a harvest source according to an identifier ("identifier" in dcatus, or waf_url + dataset_name in waf)
  • the harvest record table contains current and historic versions of a harvest record, which prompts the question: how do we derive the latest complete set of successful records for a given harvest source?
  • a "historic" record, in the context of what needs to be used in the compare, is any record older than the most recent record with state = "success"
  • we expect records to be inserted according to the compare result

Security Considerations (required)

[Any security concerns that might be implicated in the change. "None" is OK, just be explicit here!]

Sketch

given...

 identifier | date_created | raw_source |   state
------------+--------------+------------+-----------
 a          | 2024-01-01   | "data"     | "success"
 a          | 2024-03-01   | "data_1"   | "success"
 b          | 2022-03-01   | "data_10"  | "success"
 b          | 2022-05-01   | "data_30"  | "failed"
 c          | 2020-03-01   | "data_12"  | "success"

the result should be...

 identifier | date_created | raw_source |   state
------------+--------------+------------+-----------
 a          | 2024-03-01   | "data_1"   | "success"
 b          | 2022-03-01   | "data_10"  | "success"
 c          | 2020-03-01   | "data_12"  | "success"

SELECT * FROM (
      SELECT DISTINCT ON (identifier) identifier, date_created, source_raw, state, action, source_id
      FROM record 
      WHERE state = 'success' AND source_id = 'example-source'
      ORDER BY identifier, date_created DESC ) sq
WHERE sq.action != 'delete';

[Notes or a checklist reflecting our understanding of the selected approach]

@rshewitt rshewitt added the H2.0/Harvest-DB Postgres and related DB tickets label May 8, 2024
@jbrown-xentity
Contributor

So the way to accomplish this is through window functions. Essentially it subselects everything, assigns a row number, overwrites the columns you want that are different, and lets you do the group by to make sure you get unique values. At this point you'll want to exclude "failures" from the initial window-function subselect; that way, if there should have been an update but it failed, the update will retry, and if the initial load failed it will retry as well (as if the last job for that record didn't happen).

You'll probably need to create some mock data and test to make sure this comes out correctly, but it should be able to pull the most recent hash for every ID record in the dataset, and then we exclude all deleted ones (IDs marked for deletion).
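A minimal sketch of the window-function approach described above, using Python's stdlib sqlite3 as a stand-in for Postgres (window functions need SQLite >= 3.25; the table and mock rows mirror the sketch data in this issue, simplified to date-only timestamps):

```python
import sqlite3

# In-memory stand-in for the harvest record table.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE record (
    id INTEGER PRIMARY KEY,
    identifier TEXT,
    date_created TEXT,
    source_raw TEXT,
    state TEXT
);
INSERT INTO record VALUES
    (1, 'a', '2024-01-01', 'data',    'success'),
    (2, 'a', '2024-03-01', 'data_1',  'success'),
    (3, 'b', '2022-03-01', 'data_10', 'success'),
    (4, 'b', '2022-05-01', 'data_30', 'failed'),
    (5, 'c', '2020-03-01', 'data_12', 'success');
""")

# ROW_NUMBER() partitions by identifier, newest-first; rn = 1 is the most
# recent successful row per identifier. Failures are filtered out before
# the window runs, so b falls back to its last success.
rows = conn.execute("""
    SELECT identifier, date_created, source_raw FROM (
        SELECT identifier, date_created, source_raw,
               ROW_NUMBER() OVER (
                   PARTITION BY identifier
                   ORDER BY date_created DESC) AS rn
        FROM record
        WHERE state = 'success'
    ) WHERE rn = 1
    ORDER BY identifier
""").fetchall()
print(rows)
```

Note that b's failed 2022-05-01 row never enters the window, so its 2022-03-01 success surfaces as the latest row, which is exactly the retry behavior described above.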

@rshewitt
Contributor Author

rshewitt commented May 9, 2024

CREATE TABLE record( 
	id int PRIMARY KEY, 
	identifier VARCHAR(50), 
	date_created TIMESTAMP, 
	source_raw VARCHAR(50), 
	state VARCHAR(15)
);

INSERT INTO record (id, identifier, date_created, source_raw, state ) VALUES 
(1, 'a', '2024-01-01 00:00:00.001', 'data', 'success'), 
(2, 'a', '2024-03-01 00:00:00.001', 'data_1', 'success'), 
(3, 'b', '2022-03-01 00:00:00.001', 'data_10', 'success'), 
(4, 'b', '2022-05-01 00:00:00.001', 'data_30', 'failed' ),
(5, 'c', '2024-05-01 00:00:00.001', 'data_12', 'success' );

SELECT * FROM record;

 id | identifier |      date_created       | source_raw |  state  
----+------------+-------------------------+------------+---------
  1 | a          | 2024-01-01 00:00:00.001 | data       | success
  2 | a          | 2024-03-01 00:00:00.001 | data_1     | success
  3 | b          | 2022-03-01 00:00:00.001 | data_10    | success
  4 | b          | 2022-05-01 00:00:00.001 | data_30    | failed
  5 | c          | 2024-05-01 00:00:00.001 | data_12    | success
(5 rows)

SELECT DISTINCT ON (identifier) identifier, state, date_created, source_raw
FROM record
WHERE state = 'success'
ORDER BY identifier, date_created DESC;
 
 identifier |  state  |      date_created       | source_raw 
------------+---------+-------------------------+------------
 a          | success | 2024-03-01 00:00:00.001 | data_1
 b          | success | 2022-03-01 00:00:00.001 | data_10
 c          | success | 2024-05-01 00:00:00.001 | data_12
(3 rows)

this is meant to prove out getting the latest unique record per identifier based on a condition

@FuhuXia
Member

FuhuXia commented May 10, 2024

Good approach. Do we need an extra action column? And a deleted record such as d should not be in the result:

  6 | d          | 2022-03-01 00:00:00.001 | data_3    | success | add
  7 | d          | 2024-05-01 00:00:00.001 |           | success | delete

@rshewitt
Contributor Author

implementation using sqlalchemy (source)

@rshewitt rshewitt self-assigned this May 10, 2024
@rshewitt
Contributor Author

need to update the query to properly handle deleted records. the query from my comment above could return a historic record with action = "create"; it shouldn't do that if the latest record is deleted. none of the historic records should be considered if the latest has been deleted.

@rshewitt
Contributor Author

rshewitt commented May 10, 2024

validate the DISTINCT ON approach with pure sql, including the exclusion of deleted records, before implementing it in sqlalchemy. after getting a working sql command, run it on staging to benchmark performance

@rshewitt
Contributor Author

rshewitt commented May 10, 2024

postgres playground. i believe the test data should account for our use cases, which are...

  • get records based on a source id
  • get unique records based on identifier
  • get the latest records based on date_created
  • exclude records that were not successful
  • exclude latest records that are deleted
  • include latest records that were previously deleted
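The use cases above can be smoke-tested with stdlib sqlite3 (SQLite has no DISTINCT ON, so this sketch emulates it with ROW_NUMBER(); the rows mirror the mock data in the next comment, with date-only timestamps for brevity):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE record (
    id INTEGER PRIMARY KEY, identifier TEXT, date_created TEXT,
    source_raw TEXT, state TEXT, action TEXT, source_id TEXT
);
INSERT INTO record VALUES
 (1,  'a', '2024-01-01', 'data',     'success', 'create', 'example-source'),
 (2,  'a', '2024-03-01', 'data_1',   'success', 'update', 'example-source'),
 (3,  'b', '2022-03-01', 'data_10',  'success', 'create', 'example-source'),
 (4,  'b', '2022-05-01', 'data_30',  'failed',  'update', 'example-source'),
 (5,  'c', '2024-05-01', 'data_12',  'success', 'create', 'example-source'),
 (6,  'd', '2024-05-01', 'data_2',   'success', 'delete', 'example-source'),
 (7,  'd', '2024-04-01', 'data_5',   'success', 'create', 'example-source'),
 (8,  'e', '2024-04-01', 'data_123', 'success', 'create', 'example-source'),
 (9,  'e', '2024-04-02', 'data_123', 'success', 'delete', 'example-source'),
 (10, 'e', '2024-04-03', 'data_123', 'success', 'create', 'example-source'),
 (11, 'f', '2024-04-01', 'data_87',  'success', 'create', 'example-source-other');
""")

# Latest successful row per identifier for one source, then drop
# identifiers whose latest action is a delete.
latest = conn.execute("""
    SELECT identifier, action FROM (
        SELECT identifier, action,
               ROW_NUMBER() OVER (
                   PARTITION BY identifier
                   ORDER BY date_created DESC) AS rn
        FROM record
        WHERE state = 'success' AND source_id = 'example-source'
    ) WHERE rn = 1 AND action != 'delete'
    ORDER BY identifier
""").fetchall()

print([row[0] for row in latest])  # ['a', 'b', 'c', 'e']
```

This covers all six cases: f is filtered by source_id, b's failed row is skipped, d's latest row is a delete so d is excluded, and e was deleted but later re-created so its latest create is included.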

@rshewitt
Contributor Author

rshewitt commented May 10, 2024

Schema (PostgreSQL v15)

    CREATE TABLE record( 
    	id int PRIMARY KEY, 
    	identifier VARCHAR(50), 
    	date_created TIMESTAMP, 
    	source_raw VARCHAR(50), 
    	state VARCHAR(15),
        action VARCHAR(15),
        source_id VARCHAR(30)
    );

    INSERT INTO record (id, identifier, date_created, source_raw, state, action, source_id ) VALUES 
    (1, 'a', '2024-01-01 00:00:00.001', 'data', 'success', 'create', 'example-source'), 
    (2, 'a', '2024-03-01 00:00:00.001', 'data_1', 'success', 'update', 'example-source'), 
    (3, 'b', '2022-03-01 00:00:00.001', 'data_10', 'success', 'create', 'example-source'), 
    (4, 'b', '2022-05-01 00:00:00.001', 'data_30', 'failed', 'update', 'example-source' ),
    (5, 'c', '2024-05-01 00:00:00.001', 'data_12', 'success', 'create', 'example-source' ),
    (6, 'd', '2024-05-01 00:00:00.001', 'data_2', 'success', 'delete', 'example-source' ),
    (7, 'd', '2024-04-01 00:00:00.001', 'data_5', 'success', 'create', 'example-source' ),
    (8, 'e', '2024-04-01 00:00:00.001', 'data_123', 'success', 'create','example-source' ),
    (9, 'e', '2024-04-02 00:00:00.001', 'data_123', 'success', 'delete','example-source' ),
    (10, 'e', '2024-04-03 00:00:00.001', 'data_123', 'success', 'create','example-source' ),
    (11, 'f', '2024-04-01 00:00:00.001', 'data_87', 'success', 'create','example-source-other' );

    SELECT * FROM record;
 id | identifier |       date_created       | source_raw |  state  | action |       source_id
----+------------+--------------------------+------------+---------+--------+----------------------
  1 | a          | 2024-01-01T00:00:00.001Z | data       | success | create | example-source
  2 | a          | 2024-03-01T00:00:00.001Z | data_1     | success | update | example-source
  3 | b          | 2022-03-01T00:00:00.001Z | data_10    | success | create | example-source
  4 | b          | 2022-05-01T00:00:00.001Z | data_30    | failed  | update | example-source
  5 | c          | 2024-05-01T00:00:00.001Z | data_12    | success | create | example-source
  6 | d          | 2024-05-01T00:00:00.001Z | data_2     | success | delete | example-source
  7 | d          | 2024-04-01T00:00:00.001Z | data_5     | success | create | example-source
  8 | e          | 2024-04-01T00:00:00.001Z | data_123   | success | create | example-source
  9 | e          | 2024-04-02T00:00:00.001Z | data_123   | success | delete | example-source
 10 | e          | 2024-04-03T00:00:00.001Z | data_123   | success | create | example-source
 11 | f          | 2024-04-01T00:00:00.001Z | data_87    | success | create | example-source-other

Option 1 ( using subquery )

    SELECT * FROM (
      SELECT DISTINCT ON (identifier) identifier, date_created, source_raw, state, action, source_id
      FROM record 
      WHERE state = 'success' AND source_id = 'example-source'
      ORDER BY identifier, date_created DESC ) sq
    WHERE sq.action != 'delete';
 identifier |       date_created       | source_raw |  state  | action |   source_id
------------+--------------------------+------------+---------+--------+----------------
 a          | 2024-03-01T00:00:00.001Z | data_1     | success | update | example-source
 b          | 2022-03-01T00:00:00.001Z | data_10    | success | create | example-source
 c          | 2024-05-01T00:00:00.001Z | data_12    | success | create | example-source
 e          | 2024-04-03T00:00:00.001Z | data_123   | success | create | example-source

View on DB Fiddle

@FuhuXia
Member

FuhuXia commented May 10, 2024

the subquery execution time against the staging catalog db is in the 200-400 ms range.

=> select count(*) from (
SELECT DISTINCT ON (package_id) package_id, gathered, state, report_status, current
FROM harvest_object 
WHERE state = 'COMPLETE' AND harvest_source_id='58f92550-7a01-4f00-b1b2-8dc953bd598f'
ORDER BY package_id, gathered DESC 
) t
 where report_status!='deleted';
 count 
-------
 37369
(1 row)

Time: 328.239 ms

=> select * from (
SELECT DISTINCT ON (package_id) package_id, gathered, state, report_status, current
FROM harvest_object 
WHERE state = 'COMPLETE' AND harvest_source_id='58f92550-7a01-4f00-b1b2-8dc953bd598f'
ORDER BY package_id, gathered DESC 
) t
 where report_status!='deleted' limit 3;
              package_id              |          gathered          |  state   | report_status | current 
--------------------------------------+----------------------------+----------+---------------+---------
 00025aef-24c7-4090-ba92-1fb962c8483b | 2022-11-30 17:21:03.005029 | COMPLETE | updated       | t
 0002b09c-f26d-4229-9ac6-7617eb0fc10f | 2020-11-11 17:02:35.063411 | COMPLETE | added         | t
 000336b7-85bc-48c7-bb01-5def319b7961 | 2023-02-03 20:27:01.00436  | COMPLETE | updated       | t
(3 rows)

Time: 284.748 ms

@rshewitt
Contributor Author

thanks @FuhuXia for benchmarking those queries! looks like we can proceed with using the subquery option in the sketch.
