Discover Granules workflow failures when Granule files missing from Cumulus Collections configuration #3193

Open
micahThor opened this issue Dec 15, 2022 · 0 comments

Hello from the ASDC developer team! I am creating this Issue to track a PR [1].

Issue

As Cumulus operators, we experienced ingest failures with the Discover Granules workflow. We found that the workflow's input did not always contain every file specified in the Cumulus Collection configuration.

The root cause is that data provider teams do not always supply all of the files a Granule needs for successful ingestion.

Proposed solution

To mitigate failures, especially when ingesting multiple granules, we have incorporated behavior to detect and remove Granules from the current workflow that do not satisfy their Cumulus Collection configuration.

This PR introduces an optional allFilesPresent flag, which checks that all necessary Granule files are present before the ingest workflow tasks continue. When set in the Cumulus Collection configuration, the flag acts as a filter: any Granule that is missing one or more of the files specified in the Cumulus Collection configuration is removed. The flag is added to the meta field of a Cumulus Collection, as seen in the Example section below.
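The filtering idea can be sketched roughly as follows. This is a minimal Python illustration, not the PR's implementation: the function name `filter_complete_granules` and the shape of the granule dicts are assumptions made for the example, and only `meta`, `allFilesPresent`, `files`, and `regex` come from the Collection configuration described here.

```python
import re


def filter_complete_granules(granules, collection):
    """Keep only granules that have a file matching every collection file regex.

    Hypothetical helper: `granules` is assumed to be a list of dicts shaped like
    {"granuleId": str, "files": [{"name": str}, ...]}.
    """
    # The flag is optional; when it is absent or false, nothing is filtered.
    if not collection.get("meta", {}).get("allFilesPresent", False):
        return list(granules)

    patterns = [re.compile(f["regex"]) for f in collection.get("files", [])]

    def is_complete(granule):
        names = [f["name"] for f in granule["files"]]
        # Every configured pattern must match at least one discovered file name.
        return all(any(p.search(n) for n in names) for p in patterns)

    return [g for g in granules if is_complete(g)]


# Toy data (assumed): g1 has both files, g2 is missing its .met file.
collection = {
    "meta": {"allFilesPresent": True},
    "files": [{"regex": r"^g\d+\.dat$"}, {"regex": r"^g\d+\.met$"}],
}
granules = [
    {"granuleId": "g1", "files": [{"name": "g1.dat"}, {"name": "g1.met"}]},
    {"granuleId": "g2", "files": [{"name": "g2.dat"}]},
]
kept = filter_complete_granules(granules, collection)
print([g["granuleId"] for g in kept])
```

With the flag set, only `g1` survives the filter; with the flag omitted, both granules would pass through unchanged.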

This change lets Discover Granules workflows complete without failing because of missing files. This is desirable when large numbers of Granules are ingested at once and we do not want incomplete Granules to fail otherwise good ones.

This change has been incorporated into our own repository [2], and we have been pulling the Discover Granules task from there, with the proposed changes already applied. We have been using this implementation successfully since October 2022.

Example

Consider this Cumulus Collection:

{
	"name": "CER_SYN1deg_1Hour_Terra_Aqua_MODIS",
	"version": "Edition4A",
	"url_path": "CERES/{cmrMetadata.CollectionReference.ShortName}_{cmrMetadata.CollectionReference.Version}/{dateFormat(cmrMetadata.TemporalExtent.RangeDateTime.BeginningDateTime, YYYY.MM.DD)}",
	"duplicateHandling": "skip",
	"granuleId": "^CER_SYN1deg-1Hour_Terra-Aqua-MODIS_Edition4A_.*$",
	"granuleIdExtraction": "^(CER_SYN1deg-1Hour_Terra-Aqua-MODIS_Edition4A_\\d{6}.\\d{8})(|\\.met|\\.cmr\\.json)$",
	"reportToEms": true,
	"sampleFileName": "CER_SYN1deg-1Hour_Terra-Aqua-MODIS_Edition4A_407406.20190301",
	"meta": {
		"allFilesPresent": true,
		"granuleRecoveryWorkflow": "OrcaRecoveryWorkflow"
	},
	"files": [
		{
			"bucket": "protected",
			"regex": "^CER_SYN1deg-1Hour_Terra-Aqua-MODIS_Edition4A_\\d{6}.\\d{8}$",
			"sampleFileName": "CER_SYN1deg-1Hour_Terra-Aqua-MODIS_Edition4A_407406.20190301",
			"type": "data",
			"reportToEms": true
		},
		{
			"bucket": "protected",
			"regex": "^CER_SYN1deg-1Hour_Terra-Aqua-MODIS_Edition4A_.*\\.met$",
			"sampleFileName": "CER_SYN1deg-1Hour_Terra-Aqua-MODIS_Edition4A_407406.20190301.met",
			"type": "metadata",
			"reportToEms": true
		},
		{
			"bucket": "private",
			"regex": "^CER_SYN1deg-1Hour_Terra-Aqua-MODIS_Edition4A_.*\\.cmr\\.json$",
			"sampleFileName": "CER_SYN1deg-1Hour_Terra-Aqua-MODIS_Edition4A_407406.20190301.cmr.json",
			"type": "metadata",
			"reportToEms": true
		}
	],
	"updatedAt": 1634826195492
}

This Cumulus Collection requires all files to be present, as set by "allFilesPresent": true in the meta field.
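To make the "all files" requirement concrete, here is a hedged sketch of how discovered file names could be grouped into Granules using the collection's granuleIdExtraction regex, whose first capture group is the Granule ID. The helper `group_by_granule_id` is hypothetical and written in Python for illustration; the regex is copied from the configuration above with the JSON escaping removed.

```python
import re
from collections import defaultdict

# Copied from the collection's "granuleIdExtraction" (JSON "\\" becomes "\").
GRANULE_ID_EXTRACTION = re.compile(
    r"^(CER_SYN1deg-1Hour_Terra-Aqua-MODIS_Edition4A_\d{6}.\d{8})(|\.met|\.cmr\.json)$"
)


def group_by_granule_id(file_names):
    """Group file names by the granule ID in the regex's first capture group."""
    groups = defaultdict(list)
    for name in file_names:
        match = GRANULE_ID_EXTRACTION.match(name)
        if match:  # names that do not match the collection are ignored
            groups[match.group(1)].append(name)
    return dict(groups)


BASE = "CER_SYN1deg-1Hour_Terra-Aqua-MODIS_Edition4A_407406.20190301"
groups = group_by_granule_id(
    [BASE, BASE + ".met", BASE + ".cmr.json", "README.txt"]
)
print(len(groups[BASE]))  # the data file, the .met file, and the .cmr.json file
```

In this example the data file, the .met file, and the .cmr.json file all map to the same Granule ID, so that Granule has the three files the collection lists.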

Data providers submitting these Granules will see the following behavior:

  1. All files present (files matching the regexes ending in \\d{8}$, \\.met$, and \\.cmr\\.json$ are all detected) -> Successful DiscoverGranules workflow
  2. Any file missing -> Successful DiscoverGranules workflow, but the Granule is removed from the workflow

Other noteworthy behavior when allFilesPresent is set to true:

  • If there is only one Granule in S3 and it is filtered out, Discover Granules discovers nothing, so no further workflow is started. A Granule that is missing a file is not discovered at all; it is completely filtered out.
  • The files of a filtered-out Granule can be removed from the S3 location or simply left there. The missing files may appear in S3 later. Every time Discover Granules is run, it checks all the files again.
  • Nothing else happens to a filtered-out Granule; it does not go on to the next stage. Only Granules with all files present go on to the next stage.
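The re-check behavior described in these notes can be illustrated with a small simulation. This is a sketch under stated assumptions, not the task's code: `discover` is a hypothetical stand-in for one Discover Granules run over a list of file names, and the patterns are copied from the example collection with the JSON escaping removed.

```python
import re

# Patterns copied from the example collection (JSON "\\" becomes "\").
GRANULE_ID_EXTRACTION = re.compile(
    r"^(CER_SYN1deg-1Hour_Terra-Aqua-MODIS_Edition4A_\d{6}.\d{8})(|\.met|\.cmr\.json)$"
)
FILE_PATTERNS = [
    re.compile(r"^CER_SYN1deg-1Hour_Terra-Aqua-MODIS_Edition4A_\d{6}.\d{8}$"),
    re.compile(r"^CER_SYN1deg-1Hour_Terra-Aqua-MODIS_Edition4A_.*\.met$"),
    re.compile(r"^CER_SYN1deg-1Hour_Terra-Aqua-MODIS_Edition4A_.*\.cmr\.json$"),
]


def discover(listing):
    """Hypothetical stand-in for one Discover Granules run over an S3 listing."""
    groups = {}
    for name in listing:
        match = GRANULE_ID_EXTRACTION.match(name)
        if match:
            groups.setdefault(match.group(1), []).append(name)
    # Keep a granule only if every file pattern matches one of its files.
    return sorted(
        granule_id
        for granule_id, names in groups.items()
        if all(any(p.match(n) for n in names) for p in FILE_PATTERNS)
    )


BASE = "CER_SYN1deg-1Hour_Terra-Aqua-MODIS_Edition4A_407406.20190301"

# First run: the .cmr.json file has not been delivered yet.
first_run = discover([BASE, BASE + ".met"])
# Second run: the missing file has since appeared in S3.
second_run = discover([BASE, BASE + ".met", BASE + ".cmr.json"])

print(first_run)   # []
print(second_run)  # one entry: the granule ID held in BASE
```

The first run discovers nothing, so nothing moves to the next stage; the second run sees the complete set of files and the Granule is discovered as usual.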

[1] #3200
[2] https://git.earthdata.nasa.gov/projects/ASDCCLOUD/repos/asdc-cumulus/pull-requests/71/overview
