FAQ

Q1: Why test data at rest rather than before it gets to Hadoop?

Answer: data should be tested in both places, but it can be most reliably & effectively tested within its target location since Hadoop has more ways of expressing a test against data at rest than ETL processes have, and have access to all data that's been loaded, not just a single batch. Specifically:

Testing of uniqueness & completeness doesn't work as well in the ETL process since a tested-file may get loaded twice, dropped before load, or in the case of uniqueness may require analysis of multiple batches.
Testing of data distributions can be problematic within ETL since the distribution problem may not be revealed until enough data has been processed that some has already been loaded.
Policy testing doesn't even apply to ETL at all. This includes directory, file & table naming conventions, ownership, file types, compression, security, completeness & freshness of stats, etc.
Certain types of tests are difficult to write and get good performance for within ETL - like foreign key constraints across two tables.
Consistency testing between two peer tables (ex: for failover in two separate clusters) with a single ETL process loaded data into one - isn't even possible.

So, for this reason HadoopInspector can be used either after final loading or prior in the ETL pipeline by targeting staging tables or interim files.

Q2: What types of tests should still be performed within the ETL process?

Answer: since the cost of repair is generally lower when invalid data is caught by the ETL, whatever can be effectively tested at that stage should be. This includes, for example:

testing of logic conditions completely contained within a single batch of data
testing of field formats, lengths, valid values, unknown values, min & max values, etc
testing of foreign key relationships when practical

Q3: Why not write all test results directly into Hadoop rather than SQLite?

Answer: The data volumes are generally small, SQLite is ubiquitous and there is some need for transactions - say in annotating bad data.

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

FAQ

Clone this wiki locally