-
Notifications
You must be signed in to change notification settings - Fork 0
Q1: Why test data at rest rather than before it gets to Hadoop?
Answer: data should be tested in both places, but it can be most reliably & effectively tested within its target location since Hadoop has more ways of expressing a test against data at rest than ETL processes have, and have access to all data that's been loaded, not just a single batch. Specifically:
- Testing of uniqueness & completeness doesn't work as well in the ETL process since a tested-file may get loaded twice, dropped before load, or in the case of uniqueness may require analysis of multiple batches.
- Testing of data distributions can be problematic within ETL since the distribution problem may not be revealed until enough data has been processed that some has already been loaded.
- Policy testing doesn't even apply to ETL at all. This includes directory, file & table naming conventions, ownership, file types, compression, security, completeness & freshness of stats, etc.
- Certain types of tests are difficult to write and get good performance for within ETL - like foreign key constraints across two tables.
- Consistency testing between two peer tables (ex: for failover in two separate clusters) with a single ETL process loaded data into one - isn't even possible.
So, for this reason HadoopInspector can be used either after final loading or prior in the ETL pipeline by targeting staging tables or interim files.
Q2: What types of tests should still be performed within the ETL process?
Answer: since the cost of repair is generally lower when invalid data is caught by the ETL, whatever can be effectively tested at that stage should be. This includes, for example:
- testing of logic conditions completely contained within a single batch of data
- testing of field formats, lengths, valid values, unknown values, min & max values, etc
- testing of foreign key relationships when practical
Q3: Why not write all test results directly into Hadoop rather than SQLite?
Answer: The data volumes are generally small, SQLite is ubiquitous and there is some need for transactions - say in annotating bad data.