Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add property to skip empty files in Hive Connector #22727

Open
wants to merge 6 commits into
base: master
Choose a base branch
from

Conversation

cvarelad-denodo
Copy link

@cvarelad-denodo cvarelad-denodo commented May 13, 2024

Ignore empty files when hive.skip_empty_files is set to true (false by default). Otherwise, it produces an error when iterating through these files.

Description

This support allows to configure a hive connector property to decide if Presto should ignore these empty files or it should fail (default behavior). The new connector property is called hive.skip_empty_files. The following diagram shows the flow of the property to the HiveFileIterator:

skipEmptyFilesPropertyFlow

Motivation and Context

Currently, if Presto reads a Parquet folder where one of the files is an empty file the query fails. This is a problem in some environments.

Solves issue: 14759

Impact

No impact

Test Plan

TestHiveSkipEmptyFiles contains two tests. Both reads a directory that contains an empty file. One has the property set to true and the query execution finish with no errors skipping the empty file. The other one has the property set to false and the execution fails as before these changes

Contributor checklist

  • Please make sure your submission complies with our development, formatting, commit message, and attribution guidelines.
  • PR description addresses the issue accurately and concisely. If the change is non-trivial, a GitHub Issue is referenced.
  • Documented new properties (with its default value), SQL syntax, functions, or other functionality.
  • If release notes are required, they follow the release notes guidelines.
  • Adequate tests were added if applicable.
  • CI passed.

Release Notes

Please follow release notes guidelines and fill in the release notes below.

== RELEASE NOTES ==

Hive Connector Changes
* Add support to skip empty files using configuration property ``hive.skip_empty_files`` :pr:`22727`

Ignore empty files when hive.skip_empty_files is set to true. Otherwise, it produces an error when iterating through these files.
Copy link

linux-foundation-easycla bot commented May 13, 2024

CLA Signed

The committers listed above are authorized under a signed CLA.

@tdcmeehan tdcmeehan self-assigned this May 13, 2024
elharo
elharo previously requested changes May 13, 2024
@@ -195,6 +195,9 @@ Property Name Description
S3SelectPushdown.

``hive.metastore.load-balancing-enabled`` Enable load balancing between multiple Metastore instances

``hive.skip-empty-files`` Enable skipping empty files. Otherwise, it will produce an ``false``
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

a false
explicitly specify the default value of this property
What is a "false error"?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

screenshot of local docs build - the doc when built may answer all of these issues

Screenshot 2024-05-13 at 11 59 57 AM

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah, it's a table.

@@ -1816,4 +1818,17 @@ public HiveClientConfig setAffinitySchedulingFileSectionSize(DataSize affinitySc
this.affinitySchedulingFileSectionSize = affinitySchedulingFileSectionSize;
return this;
}

@Config("hive.skip-empty-files")
@ConfigDescription("Enables skip of parquet empty files avoiding output error")
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Skip empty files rather than reporting an error
What about ORC files?


@Config("hive.skip-empty-files")
@ConfigDescription("Enables skip of parquet empty files avoiding output error")
public HiveClientConfig setSkipEmptyFilesEnabled(boolean parallelParsingOfPartitionValuesEnabled)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

change argument name

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done

private void createQueryFailRunner()
throws Exception
{
logger.info("Creating 'QueryFailRunner'");
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

no logs in tests please.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done

protected QueryRunner createQueryRunner()
throws Exception
{
logger.info("Creating 'QueryRunner'");
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

no logs in tests please, only failure messages. Our logs are already too full of random messages to efficiently locate the actual failures.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done

*/
private static void deleteMetadata(File temporaryDirectory)
{
File[] data = temporaryDirectory.listFiles();
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

data --> files or metadataFiles or something like that

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done

boolean isDirectory = file.isDirectory();
if (file.delete()) {
if (isDirectory) {
logger.info(" deleted temporary directory: " + filePath);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

no logs

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done

deleteAndLog(temporaryDirectory);
}

private static void deleteAndLog(File file)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I know JUNit has easier ways to manage temp files these days. Does TestNG?

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I could use method FileUtils.deleteDirectory from commons-io but that module does not currently have that dependency and it's probably not worth it to add it just for this.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In JUnit you can just annotate a field with @TempFile and the framework manages creating and disposing it. Again, I don't know if TestNG has that yet.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I haven't found anything similar in TestNG

if (resourceUrl == null) {
throw new RuntimeException("Cannot find resource path for table name: " + tableName);
}
logger.info("resource url: " + resourceUrl);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

no logs

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done

*/
private void executeCreationTestAndDropCycle(DistributedQueryRunner queryRunner, String tableName, String externalLocation, boolean shouldFail, @Language("RegExp") String errorMessage)
{
logger.info("Executing Create - Test - Drop for: " + tableName);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

no logs

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done

Copy link
Contributor

@steveburnett steveburnett left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks, local build of doc looks good! One very minor nit of a suggestion to consider.

@@ -195,6 +195,9 @@ Property Name Description
S3SelectPushdown.

``hive.metastore.load-balancing-enabled`` Enable load balancing between multiple Metastore instances

``hive.skip-empty-files`` Enable skipping empty files. Otherwise, it will produce an ``false``
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

screenshot of local docs build - the doc when built may answer all of these issues

Screenshot 2024-05-13 at 11 59 57 AM

presto-docs/src/main/sphinx/connector/hive.rst Outdated Show resolved Hide resolved
steveburnett
steveburnett previously approved these changes May 15, 2024
Copy link
Contributor

@steveburnett steveburnett left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM! (docs)
Pull updated branch, local docs build, looks good. Thanks!

@steveburnett
Copy link
Contributor

Nit suggestion for the release note entry:

== RELEASE NOTES ==

Hive Connector Changes
* Add support to skip empty files using configuration property ``hive.skip_empty_files`` :pr:`22727`

Copy link
Contributor

@tdcmeehan tdcmeehan left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this looks good. Could you look into adding a test when bucketed execution is enabled? I just want to make sure this continues to work when there's a definitive order required of the underlying files.


private Iterator<HiveFileInfo> remoteIterator = Collections.emptyIterator();

public HiveFileIterator(
Path path,
ListDirectoryOperation listDirectoryOperation,
NamenodeStats namenodeStats,
NestedDirectoryPolicy nestedDirectoryPolicy)
NestedDirectoryPolicy nestedDirectoryPolicy,
boolean skipEmptyFiles)// Tipo Boolean
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
boolean skipEmptyFiles)// Tipo Boolean
boolean skipEmptyFiles)

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Bucketed execution is enabled by default. You mean add a test when it is disabled?

@@ -195,6 +195,9 @@ Property Name Description
S3SelectPushdown.

``hive.metastore.load-balancing-enabled`` Enable load balancing between multiple Metastore instances

``hive.skip-empty-files`` Enable skipping empty files. Otherwise, it will produce an ``false``
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah, it's a table.

* @param shouldFail {@code true} if the table creation should fail, {@code false} otherwise
* @param errorMessage a {@link String} containing the expected error message. Will be checked if {@code shouldFail} is {@code true}
*/
private void checkBucketedResult(DistributedQueryRunner queryRunner, String tableName, boolean shouldFail, @Language("RegExp") String errorMessage)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

boolean parameters to change behavior are an antipattern that increases cyclomatic complexity. Use two methods, one for things that should fail and one for things that should pass.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed

throws Exception
{
// obtains the root resource directory in order to create temporary tables
URL url = TestHiveSkipEmptyFiles.class.getClassLoader().getResource(".");
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Temporary files should be located with either the Java temp fiels API or TestNG APIs, not the classloader.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed

// obtains the root resource directory in order to create temporary tables
URL url = TestHiveSkipEmptyFiles.class.getClassLoader().getResource(".");
if (url == null) {
throw new RuntimeException("Could not obtain resource URL");
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This must be more specific. Never throw a raw RuntimeException.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done

if (!created) {
throw new RuntimeException("Could not create resource directory: " + temporaryDirectory.getPath());
}
File firstParquetFile = new File(temporaryDirectory, randomUUID().toString());
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You're reinventing a lot of temp file logic that already exists in the JDK, and probably in TestNG too.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed

* @param temporaryDirectory a {@link File} pointing to the directory to delete
* @param hasPattern a {@code true} if it is necessary to delete only one directory file, {@code false} otherwise
*/
private static void deleteMetadata(File temporaryDirectory, boolean hasPattern)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You've done far more work here than you needed, I'm afraid.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed

String tableName = "skip_empty_files_success";
File resourcesLocation = generateMetadata(tableName);
executeCreationTestAndDropCycle(queryRunner, tableName, getResourceUrl(tableName), false, null);
deleteMetadata(resourcesLocation, false);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

deletion in tests should be handled in an after test method to ensure cleanup in the face of exceptions

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

4 participants