The DataONE object store is queried to determine whether new pids are available for harvest and metadata scoring.
It appears that the entries in the object store are not sorted by date, so an entry with an earlier date
can appear after an entry with a later date. Since the metadig harvest task tracks which pids it has harvested by sysmeta modified date, it can miss a pid that is added later but has an older date than the previously recorded last entry.
For example, here are 4 entries returned from the object store, where the 2nd entry has a later date than the 1st entry. Metadig engine had stored the date of the 2nd entry on one run, then used that date on the next run, so it missed the 1st entry.
This is illustrated below with a listObjects listing, with annotations showing that the date recorded on run 1 caused an entry added later to be missed (as its date was older):
http://cn.dataone.org/cn/v2/object/?fromDate=2020-10-05T05:05:46.831Z&nodeId=urn:node:ARM
<?xml version="1.0" encoding="UTF-8" standalone="yes"?><?xml-stylesheet type="text/xsl" href="/cn/xslt/dataone.types.v2.xsl" ?>
<ns2:objectList xmlns:ns2="http://ns.dataone.org/service/types/v1" count="4" start="0" total="4">
<objectInfo>
<identifier>5fd20e807187f456d1ff3a4366faa14f</identifier>
<formatId>http://www.isotc211.org/2005/gmd</formatId>
<checksum algorithm="MD5">5fd20e807187f456d1ff3a4366faa14f</checksum>
[ on run 2, the previously recorded latest date of 2020-10-05T05:05:48.195+00:00 was used to retrieve new
entries, so this entry was missed. ]
<dateSysMetadataModified>2020-10-05T05:05:48.153+00:00</dateSysMetadataModified>
<size>16767</size>
</objectInfo>
<objectInfo>
<identifier>b138597621570059d2396462f070a81e</identifier>
<formatId>http://www.isotc211.org/2005/gmd</formatId>
<checksum algorithm="MD5">b138597621570059d2396462f070a81e</checksum>
[ on run 1, this date was stored as the latest harvest date ]
<dateSysMetadataModified>2020-10-05T05:05:48.195+00:00</dateSysMetadataModified>
<size>16767</size>
</objectInfo>
<objectInfo>
<identifier>cc5388fa712593353fb7a398b495c849</identifier>
<formatId>http://www.isotc211.org/2005/gmd</formatId>
<checksum algorithm="MD5">cc5388fa712593353fb7a398b495c849</checksum>
<dateSysMetadataModified>2020-10-05T05:05:46.874+00:00</dateSysMetadataModified>
<size>16700</size>
</objectInfo>
<objectInfo>
<identifier>d84368db2564a40922e76c9c71b1560e</identifier>
<formatId>http://www.isotc211.org/2005/gmd</formatId>
<checksum algorithm="MD5">d84368db2564a40922e76c9c71b1560e</checksum>
<dateSysMetadataModified>2020-10-05T05:05:46.831+00:00</dateSysMetadataModified>
<size>16700</size>
</objectInfo>
</ns2:objectList>
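The effect of the two annotated dates can be checked directly. The following is a minimal Python sketch, not metadig-engine code: it assumes the harvest query returns entries whose dateSysMetadataModified is at or after the supplied fromDate, and the helper name `returned_by_query` is hypothetical.

```python
from datetime import datetime

# Dates copied from the two annotated entries in the listing above.
run1_last_date = datetime.fromisoformat("2020-10-05T05:05:48.195+00:00")  # b138..., stored on run 1
late_arrival = datetime.fromisoformat("2020-10-05T05:05:48.153+00:00")    # 5fd2..., added afterwards


def returned_by_query(modified, from_date):
    """Assumed query semantics: keep entries modified at or after from_date."""
    return modified >= from_date


# Run 2 queries from the date stored on run 1, so the late arrival is filtered out.
missed = not returned_by_query(late_arrival, run1_last_date)
gap = run1_last_date - late_arrival
print(missed, gap.total_seconds())
```

The late entry is only 42 milliseconds older than the stored date, which is enough for it to fall outside the next query window.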
It's not apparent to me how CN sync handles this problem (so that pids are not missed), as it also uses the DataONE MN listObjects service and a lastHarvestDate to track which pids have been harvested. This is performed in d1_synchronization, in org.dataone.cn.batch.synchronization.tasks.ObjectListHarvestTask.
No special processing seems to happen there to resolve the problem described above.
Comments from this class:
* Harvest from the MemberNode and add all of the pids to the synchronization queue
* The strategy is to accumulate the objectInfos for the full time window and sort
* the items into ascending chronological order. Periodically the lastHarvestedDate
* will be updated to avoid a complete reharvest in the face of system failures.
* Exceptions thrown means some or all will be reharvested (depending on whether an
* intermediate update of lastHarvestedDate was able to take place).
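The strategy described in that comment can be sketched roughly as follows. This is an illustration in Python of "accumulate, sort ascending, checkpoint periodically", not the actual ObjectListHarvestTask code; the function name, the `(pid, modified)` tuple shape, and the `checkpoint_every` parameter are all assumptions made for the example.

```python
from datetime import datetime


def harvest_in_order(object_infos, submit, checkpoint_every=2):
    """Submit pids oldest-first and return the last checkpointed date.

    object_infos: iterable of (pid, modified) pairs (hypothetical shape).
    submit: callable that queues one pid.
    """
    ordered = sorted(object_infos, key=lambda info: info[1])
    last_harvested = None
    for count, (pid, modified) in enumerate(ordered, start=1):
        submit(pid)
        # Periodic checkpoint: after a failure, only entries newer than
        # last_harvested would need to be reharvested.
        if count % checkpoint_every == 0 or count == len(ordered):
            last_harvested = modified
    return last_harvested


# The four entries from the listing, in the order the object store returned them.
infos = [
    ("5fd20e807187f456d1ff3a4366faa14f", datetime.fromisoformat("2020-10-05T05:05:48.153+00:00")),
    ("b138597621570059d2396462f070a81e", datetime.fromisoformat("2020-10-05T05:05:48.195+00:00")),
    ("cc5388fa712593353fb7a398b495c849", datetime.fromisoformat("2020-10-05T05:05:46.874+00:00")),
    ("d84368db2564a40922e76c9c71b1560e", datetime.fromisoformat("2020-10-05T05:05:46.831+00:00")),
]
queue = []
checkpoint = harvest_in_order(infos, queue.append)
```

Sorting fixes the ordering within one harvest window, but on its own it does not help with an entry that shows up after the checkpoint has already moved past its date.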
For metadig-engine, I propose using an overlapping harvest window:
record the 'lastHarvestDatetime' of the pid with the latest dateSysMetadataModified for harvest X
on the next harvest, explicitly set the startDatetime of the harvest to a certain number of minutes before the previous lastHarvestDatetime
add bookkeeping to ensure that pids are not submitted multiple times,
e.g. if a pid has been submitted for quality assessment and is waiting in the queue, make sure it is not resubmitted
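The proposal above could look something like the following Python sketch. It is only an illustration under stated assumptions, not metadig-engine code: the class name `OverlappingHarvester`, the five-minute overlap, the `list_objects(start)` callable, and the in-memory `submitted` set are all hypothetical, and deduplicating on the (pid, modified date) pair rather than on the pid alone is a choice made here so that a later sysmeta update would still be picked up.

```python
from datetime import datetime, timedelta


class OverlappingHarvester:
    def __init__(self, overlap=timedelta(minutes=5)):
        self.overlap = overlap             # how far each window reaches back
        self.last_harvest_datetime = None  # latest modified date seen so far
        self.submitted = set()             # (pid, modified) pairs already queued

    def start_datetime(self):
        """Start of the next window: the last harvest date minus the overlap."""
        if self.last_harvest_datetime is None:
            return None  # first run: no lower bound
        return self.last_harvest_datetime - self.overlap

    def harvest(self, list_objects, submit):
        """Query from the overlapped start and submit only unseen entries."""
        newly_submitted = []
        for pid, modified in list_objects(self.start_datetime()):
            key = (pid, modified)
            if key in self.submitted:
                continue  # already queued on an earlier, overlapping run
            self.submitted.add(key)
            submit(pid)
            newly_submitted.append(pid)
            if self.last_harvest_datetime is None or modified > self.last_harvest_datetime:
                self.last_harvest_datetime = modified
        return newly_submitted


def ts(value):
    return datetime.fromisoformat(value)


# Simulated object store: three entries exist on run 1, a fourth with an
# older date is added before run 2 (dates taken from the listing in this issue).
store = [
    ("b138597621570059d2396462f070a81e", ts("2020-10-05T05:05:48.195+00:00")),
    ("cc5388fa712593353fb7a398b495c849", ts("2020-10-05T05:05:46.874+00:00")),
    ("d84368db2564a40922e76c9c71b1560e", ts("2020-10-05T05:05:46.831+00:00")),
]


def list_objects(start):
    return [(pid, m) for pid, m in store if start is None or m >= start]


queue = []
harvester = OverlappingHarvester()
run1 = harvester.harvest(list_objects, queue.append)
store.insert(0, ("5fd20e807187f456d1ff3a4366faa14f", ts("2020-10-05T05:05:48.153+00:00")))
run2 = harvester.harvest(list_objects, queue.append)
```

In this simulation the second run re-reads the three earlier entries because of the overlap, skips them, and submits only the late arrival. A real implementation would need to persist the bookkeeping across restarts and prune entries older than the overlap window.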