Quality harvest still missing pids #272

Open
gothub opened this issue Oct 7, 2020 · 1 comment
Labels: metadig (All issues related to metadig)

gothub commented Oct 7, 2020

The DataONE object store is queried to determine whether new pids are available for harvest and metadata scoring.
It appears that the entries in the object store are not sorted by date, so an entry with an earlier date
can appear after an entry with a later date. Since the metadig harvest task keeps track of which pids it has harvested based on the sysmeta modified date, it may miss a pid that is added later but has an older date than the previously recorded last entry.

For example, here are 4 entries returned from the object store, where the most recently added entry (the 1st in the listing) has an earlier date than the entry added before it (the 2nd in the listing). Metadig engine had stored the date of the 2nd entry on one run, then used that date on the next run, and so missed the 1st entry.

This is illustrated below with a listObjects listing, with annotations showing that the date recorded on run 1 caused an entry added later to be missed (as its date was older):

http://cn.dataone.org/cn/v2/object/?fromDate=2020-10-05T05:05:46.831Z&nodeId=urn:node:ARM

<?xml version="1.0" encoding="UTF-8" standalone="yes"?><?xml-stylesheet type="text/xsl" href="/cn/xslt/dataone.types.v2.xsl" ?>
<ns2:objectList xmlns:ns2="http://ns.dataone.org/service/types/v1" count="4" start="0" total="4">
    <objectInfo>
        <identifier>5fd20e807187f456d1ff3a4366faa14f</identifier>
        <formatId>http://www.isotc211.org/2005/gmd</formatId>
        <checksum algorithm="MD5">5fd20e807187f456d1ff3a4366faa14f</checksum>

        <!-- on run 2, the previously recorded latest date of 2020-10-05T05:05:48.195+00:00 was used to retrieve new
             entries, so this entry was missed. -->

        <dateSysMetadataModified>2020-10-05T05:05:48.153+00:00</dateSysMetadataModified>
        <size>16767</size>
    </objectInfo>
    <objectInfo>
        <identifier>b138597621570059d2396462f070a81e</identifier>
        <formatId>http://www.isotc211.org/2005/gmd</formatId>
        <checksum algorithm="MD5">b138597621570059d2396462f070a81e</checksum>

        <!-- on run 1, this date was stored as the latest harvested date -->

        <dateSysMetadataModified>2020-10-05T05:05:48.195+00:00</dateSysMetadataModified>
        <size>16767</size>
    </objectInfo>
    <objectInfo>
        <identifier>cc5388fa712593353fb7a398b495c849</identifier>
        <formatId>http://www.isotc211.org/2005/gmd</formatId>
        <checksum algorithm="MD5">cc5388fa712593353fb7a398b495c849</checksum>
        <dateSysMetadataModified>2020-10-05T05:05:46.874+00:00</dateSysMetadataModified>
        <size>16700</size>
    </objectInfo>
    <objectInfo>
        <identifier>d84368db2564a40922e76c9c71b1560e</identifier>
        <formatId>http://www.isotc211.org/2005/gmd</formatId>
        <checksum algorithm="MD5">d84368db2564a40922e76c9c71b1560e</checksum>
        <dateSysMetadataModified>2020-10-05T05:05:46.831+00:00</dateSysMetadataModified>
        <size>16700</size>
    </objectInfo>
</ns2:objectList>
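
The failure mode can be reproduced with the four timestamps from the listing above. This is a minimal Python sketch, not metadig-engine code: the `harvest` helper is hypothetical, and it assumes that the first entry only became visible after run 1 and that a run picks up only entries modified after the stored date.

```python
from datetime import datetime

# pid -> dateSysMetadataModified, copied from the listing above.
ENTRIES = {
    "5fd20e807187f456d1ff3a4366faa14f": "2020-10-05T05:05:48.153+00:00",
    "b138597621570059d2396462f070a81e": "2020-10-05T05:05:48.195+00:00",
    "cc5388fa712593353fb7a398b495c849": "2020-10-05T05:05:46.874+00:00",
    "d84368db2564a40922e76c9c71b1560e": "2020-10-05T05:05:46.831+00:00",
}


def modified(pid):
    """Parse the sysmeta modified date of a pid."""
    return datetime.fromisoformat(ENTRIES[pid])


def harvest(visible, last_harvest):
    """Hypothetical model of one harvest run.

    Picks up the visible pids modified after last_harvest and returns them
    together with the updated last harvest date.
    """
    new = [pid for pid in visible
           if last_harvest is None or modified(pid) > last_harvest]
    latest = max((modified(pid) for pid in new), default=last_harvest)
    return new, latest


# Assumption: on run 1 the 5fd2... entry was not yet in the object store.
run1_visible = [pid for pid in ENTRIES if not pid.startswith("5fd2")]
run1_new, last = harvest(run1_visible, None)

# On run 2 all four entries are visible, but the stored date is ...48.195.
run2_new, last = harvest(list(ENTRIES), last)

missed = set(ENTRIES) - set(run1_new) - set(run2_new)
print(missed)
```

Under these assumptions run 2 finds nothing new, because the late entry's date (…48.153) is earlier than the stored date (…48.195).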
@gothub gothub added this to the 2.4.0 milestone Oct 7, 2020
@gothub gothub self-assigned this Oct 7, 2020

gothub commented Dec 12, 2020

It's not apparent to me how the CN sync handles this problem (so that pids are not missed), as it also uses the DataONE MN listObjects service and a lastHarvestDate to track which pids have been harvested. This is performed in d1_synchronization, in the class org.dataone.cn.batch.synchronization.tasks.ObjectListHarvestTask.
No special processing seems to happen there to resolve the problem described above.
Comments from this class:

     * Harvest from the MemberNode and add all of the pids to the synchronization queue
     * The strategy is to accumulate the objectInfos for the full time window and sort
     * the items into ascending chronological order.  Periodically the lastHarvestedDate
     * will be updated to avoid a complete reharvest in the face of system failures.
     * Exceptions thrown means some or all will be reharvested (depending on whether an
     * intermediate update of lastHarvestedDate was able to take place).

For metadig-engine, I propose that an overlapping harvest window be used:

  • record the 'lastHarvestDatetime' of the pid with the latest dateSysMetadataModified for harvest X
  • on the next harvest, manually set the startDatetime of the harvest to a certain number of minutes before the previous lastHarvestDatetime
  • add additional bookkeeping to ensure that new pids are not submitted multiple times
    • e.g. if a pid has been submitted for quality assessment and is waiting in the queue, make sure it is not resubmitted
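
The proposal above could be sketched roughly as follows. This is a minimal Python sketch under stated assumptions, not metadig-engine code: `OVERLAP`, `next_start`, `select_new`, and the in-memory `submitted` set are hypothetical names, the 5 minute overlap is an arbitrary value, and a real implementation would presumably check the quality queue or a database rather than a set.

```python
from datetime import datetime, timedelta

# Assumed overlap; the proposal only says "a certain number of minutes".
OVERLAP = timedelta(minutes=5)


def next_start(last_harvest):
    """Start the next harvest window a little before the last recorded date."""
    return last_harvest - OVERLAP


def select_new(entries, start, submitted):
    """Return the pids to submit from (pid, modified) pairs.

    Entries older than start are ignored. Entries already recorded in
    submitted are skipped, so the overlap does not cause resubmission.
    The key is (pid, modified) so that a later sysmeta change to the same
    pid can still be picked up (an assumption, not stated in the issue).
    """
    to_submit = []
    for pid, modified in sorted(entries, key=lambda e: e[1]):
        key = (pid, modified)
        if modified >= start and key not in submitted:
            submitted.add(key)
            to_submit.append(pid)
    return to_submit


ts = datetime.fromisoformat
submitted = set()

# Run 1: three entries are visible (timestamps from the listing in this issue).
run1 = [
    ("b138597621570059d2396462f070a81e", ts("2020-10-05T05:05:48.195+00:00")),
    ("cc5388fa712593353fb7a398b495c849", ts("2020-10-05T05:05:46.874+00:00")),
    ("d84368db2564a40922e76c9c71b1560e", ts("2020-10-05T05:05:46.831+00:00")),
]
first = select_new(run1, ts("2020-10-05T05:05:46.831+00:00"), submitted)
last_harvest = max(modified for _, modified in run1)

# Run 2: the late entry with the older date is now visible as well.
run2 = run1 + [
    ("5fd20e807187f456d1ff3a4366faa14f", ts("2020-10-05T05:05:48.153+00:00")),
]
second = select_new(run2, next_start(last_harvest), submitted)
print(second)
```

With the overlap, run 2 sees all four entries again, but only the late entry is submitted, since the other three are already recorded in `submitted`. The overlap would need to be at least as long as the largest delay between an entry's modified date and its appearance in the listing.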

@mbjones mbjones added the metadig All issues related to metadig label Apr 29, 2021
@mbjones mbjones modified the milestones: 2.4.0, 2.5.0 Jul 14, 2022
@jeanetteclark jeanetteclark removed this from the 2.5.0 milestone Jul 7, 2023