MAPREDUCE-7474. Improve Manifest committer resilience (#6716) #6825

steveloughran

Improve task commit resilience everywhere
and add an option to reduce delete IO requests on
job cleanup (relevant for ABFS and HDFS).

Task Commit Resilience

Task manifest saving is re-attempted on failure; the number of attempts made is configurable with the option:

mapreduce.manifest.committer.manifest.save.attempts

  • The default is 5.
  • The minimum is 1; asking for less is ignored.
  • A retry policy adds 500ms of sleep per attempt.
  • The manifest is now renamed into place with commitFile() rather than classic rename(), after calling getFileStatus() to get its length and possibly its etag. This becomes a rename() on gcs/hdfs anyway, but on abfs it reaches the ResilientCommitByRename callbacks, which report the outcome to the caller, where it is logged at WARN.
  • New statistic task_stage_save_summary_file distinguishes manifest saving from other save operations (job success/report file). It is only saved to the manifest on task commit retries, and provides statistics on all previous unsuccessful attempts to save the manifest.
  • Test changes to match the codepath changes, including improvements in fault injection.
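
The retry behaviour in the list above can be sketched roughly as follows. This is an illustration only, not the committer's code: the class ManifestSaveRetrySketch, the SaveOperation interface and the reallySleep flag are hypothetical names, and reading "500ms of sleep per attempt" as a linearly growing delay (500ms after the first failure, 1000ms after the second, and so on) is an assumption.

```java
import java.io.IOException;

/** Hypothetical sketch of a bounded save-with-retry loop; not the Hadoop implementation. */
final class ManifestSaveRetrySketch {

  /** Default attempt count, as stated in the PR description. */
  static final int DEFAULT_ATTEMPTS = 5;

  /** Sleep added per attempt in milliseconds, as stated in the PR description. */
  static final long SLEEP_PER_ATTEMPT_MS = 500;

  /** Hypothetical stand-in for "write the manifest and rename it into place". */
  interface SaveOperation {
    void save() throws IOException;
  }

  /** Requests for fewer than one attempt are ignored: at least one attempt is made. */
  static int effectiveAttempts(int requested) {
    return Math.max(1, requested);
  }

  /** Assumed linear backoff: 500ms after attempt 1, 1000ms after attempt 2, ... */
  static long sleepMillisAfter(int failedAttempt) {
    return SLEEP_PER_ATTEMPT_MS * failedAttempt;
  }

  /**
   * Try the save up to the effective number of attempts.
   * @param reallySleep false skips the sleep, which is only useful in tests
   * @return the number of attempts actually made
   * @throws IOException the last failure, if every attempt failed
   */
  static int saveWithRetries(SaveOperation op, int requestedAttempts, boolean reallySleep)
      throws IOException, InterruptedException {
    int attempts = effectiveAttempts(requestedAttempts);
    IOException lastFailure = null;
    for (int attempt = 1; attempt <= attempts; attempt++) {
      try {
        op.save();
        return attempt;
      } catch (IOException e) {
        lastFailure = e;
        // Only sleep if another attempt is going to follow.
        if (attempt < attempts && reallySleep) {
          Thread.sleep(sleepMillisAfter(attempt));
        }
      }
    }
    // Reached only when every attempt failed, so lastFailure is set.
    throw lastFailure;
  }
}
```

Under these assumptions, a save which fails on all five default attempts would sleep 500 + 1000 + 1500 + 2000 ms between attempts, about 5s in total, before the last exception is rethrown.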

Directory size for deletion

New option

mapreduce.manifest.committer.cleanup.parallel.delete.base.first

This makes an initial attempt to delete the base dir, only falling back to parallel deletes if there's a timeout.

This option is disabled by default. Consider enabling it for abfs to reduce IO load. Consult the documentation for more details.
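
As a rough illustration of the fallback, here is a minimal sketch. It is not the committer's cleanup stage: BaseFirstCleanupSketch, DirDeleter and Outcome are hypothetical names, the timeout is modelled as a plain java.util.concurrent.TimeoutException, and the assumption is that the fallback deletes the task attempt directories in parallel and then retries the base directory.

```java
import java.util.List;
import java.util.concurrent.TimeoutException;

/** Hypothetical sketch of "delete the base dir first, fall back to parallel deletes". */
final class BaseFirstCleanupSketch {

  /** Hypothetical stand-in for a filesystem delete call which may time out. */
  interface DirDeleter {
    void delete(String dir) throws TimeoutException;
  }

  /** Which path the cleanup took. */
  enum Outcome { BASE_DELETE_ONLY, PARALLEL_THEN_BASE }

  /**
   * @param deleteBaseFirst models the new option; false goes straight to parallel deletes
   * @throws TimeoutException if the final delete of the base dir also times out
   */
  static Outcome cleanup(String baseDir, List<String> taskAttemptDirs,
      boolean deleteBaseFirst, DirDeleter deleter) throws TimeoutException {
    if (deleteBaseFirst) {
      try {
        // One request; on success no per-directory deletes are issued.
        deleter.delete(baseDir);
        return Outcome.BASE_DELETE_ONLY;
      } catch (TimeoutException e) {
        // Assumed fallback: shrink the tree with parallel deletes, then retry.
      }
    }
    taskAttemptDirs.parallelStream().forEach(dir -> {
      try {
        deleter.delete(dir);
      } catch (TimeoutException e) {
        // Lambdas cannot throw checked exceptions, so wrap it.
        throw new IllegalStateException("delete timed out: " + dir, e);
      }
    });
    // With the children gone, the base dir delete has far less work to do.
    deleter.delete(baseDir);
    return Outcome.PARALLEL_THEN_BASE;
  }
}
```

The tradeoff this is meant to show: when the single delete succeeds, cleanup costs one request instead of one per task attempt directory; the parallel deletes are only paid for when the store cannot remove the whole tree in time.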

Success file printing

The command to print a JSON _SUCCESS file from this committer and any S3A committer can now be invoked from the mapred command:

mapred successfile <path to file>

Contributed by Steve Loughran

How was this patch tested?

Left to Yetus; if it is happy, I will validate on abfs.

For code changes:

  • Does the title of this PR start with the corresponding JIRA issue id (e.g. 'HADOOP-17799. Your PR title ...')?
  • Object storage: have the integration tests been executed and the endpoint declared according to the connector-specific documentation?
  • If adding new dependencies to the code, are these dependencies licensed in a way that is compatible for inclusion under ASF 2.0?
  • If applicable, have you updated the LICENSE, LICENSE-binary, NOTICE-binary files?

@hadoop-yetus

🎊 +1 overall

| Vote | Subsystem | Runtime | Logfile | Comment |
|:----:|:----------|--------:|:--------|:--------|
| +0 🆗 | reexec | 14m 5s | | Docker mode activated. |
| | _ Prechecks _ | | | |
| +1 💚 | dupname | 0m 1s | | No case conflicting files found. |
| +0 🆗 | codespell | 0m 1s | | codespell was not available. |
| +0 🆗 | detsecrets | 0m 1s | | detect-secrets was not available. |
| +0 🆗 | shelldocs | 0m 1s | | Shelldocs was not available. |
| +0 🆗 | markdownlint | 0m 0s | | markdownlint was not available. |
| +0 🆗 | xmllint | 0m 0s | | xmllint was not available. |
| +1 💚 | @author | 0m 0s | | The patch does not contain any @author tags. |
| +1 💚 | test4tests | 0m 0s | | The patch appears to include 11 new or modified test files. |
| | _ branch-3.3 Compile Tests _ | | | |
| +0 🆗 | mvndep | 14m 53s | | Maven dependency ordering for branch |
| +1 💚 | mvninstall | 40m 34s | | branch-3.3 passed |
| +1 💚 | compile | 19m 32s | | branch-3.3 passed |
| +1 💚 | checkstyle | 3m 3s | | branch-3.3 passed |
| +1 💚 | mvnsite | 4m 9s | | branch-3.3 passed |
| +1 💚 | javadoc | 2m 35s | | branch-3.3 passed |
| +1 💚 | spotbugs | 7m 15s | | branch-3.3 passed |
| +1 💚 | shadedclient | 40m 21s | | branch has no errors when building and testing our client artifacts. |
| | _ Patch Compile Tests _ | | | |
| +0 🆗 | mvndep | 0m 32s | | Maven dependency ordering for patch |
| +1 💚 | mvninstall | 3m 13s | | the patch passed |
| +1 💚 | compile | 18m 42s | | the patch passed |
| +1 💚 | javac | 18m 42s | | the patch passed |
| +1 💚 | blanks | 0m 0s | | The patch has no blanks issues. |
| -0 ⚠️ | checkstyle | 2m 50s | /results-checkstyle-root.txt | root: The patch generated 1 new + 22 unchanged - 0 fixed = 23 total (was 22) |
| +1 💚 | mvnsite | 3m 59s | | the patch passed |
| +1 💚 | shellcheck | 0m 32s | | No new issues. |
| +1 💚 | javadoc | 2m 32s | | the patch passed |
| +1 💚 | spotbugs | 7m 58s | | the patch passed |
| +1 💚 | shadedclient | 41m 21s | | patch has no errors when building and testing our client artifacts. |
| | _ Other Tests _ | | | |
| +1 💚 | unit | 8m 49s | | hadoop-mapreduce-client-core in the patch passed. |
| +1 💚 | unit | 159m 20s | | hadoop-mapreduce-project in the patch passed. |
| +1 💚 | unit | 2m 33s | | hadoop-azure in the patch passed. |
| +1 💚 | asflicense | 1m 4s | | The patch does not generate ASF License warnings. |
| | | 407m 26s | | |

| Subsystem | Report/Notes |
|:----------|:-------------|
| Docker | ClientAPI=1.44 ServerAPI=1.44 base: https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-6825/1/artifact/out/Dockerfile |
| GITHUB PR | #6825 |
| Optional Tests | dupname asflicense codespell detsecrets shellcheck shelldocs compile javac javadoc mvninstall mvnsite unit shadedclient spotbugs checkstyle markdownlint xmllint |
| uname | Linux f6a1aab3dfb3 5.15.0-94-generic #104-Ubuntu SMP Tue Jan 9 15:25:40 UTC 2024 x86_64 x86_64 x86_64 GNU/Linux |
| Build tool | maven |
| Personality | dev-support/bin/hadoop.sh |
| git revision | branch-3.3 / d16eb0a |
| Default Java | Private Build-1.8.0_362-8u372-gaus1-0ubuntu118.04-b09 |
| Test Results | https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-6825/1/testReport/ |
| Max. process+thread count | 1238 (vs. ulimit of 5500) |
| modules | C: hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-core hadoop-mapreduce-project hadoop-tools/hadoop-azure U: . |
| Console output | https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-6825/1/console |
| versions | git=2.17.1 maven=3.6.0 spotbugs=4.2.2 shellcheck=0.4.6 |
| Powered by | Apache Yetus 0.14.0 https://yetus.apache.org |

This message was automatically generated.

@steveloughran steveloughran merged commit 1025c44 into apache:branch-3.3 May 15, 2024
3 checks passed