ken farmer edited this page Jul 13, 2016 · 2 revisions

Administrator checks new data every day for problems

  • Checks for 9+ types of problems:
    • relationship (primary/foreign key)
    • uniqueness (primary key, unique constraint)
    • simple logic (end_date < start_date)
    • formatting (ip address of 300.300.300.300)
    • consistency between base & aggregate table
    • consistency between identical tables on two clusters
    • consistency between source & target tables on separate hosts
    • security policy (correct privs for table or hdfs folder)
    • data management policy (stats age, table names)
  • The user sets the mode in the registry to run the checks either against the full data volumes or incrementally against just the data added since the prior run.
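
As a rough illustration of two of the check types above (formatting and simple logic) and of the full versus incremental mode, here is a minimal Python sketch. The row layout, the `load_date` column, the mode strings, and all function names are assumptions made for this example, not part of the tool.

```python
import ipaddress
from datetime import date


def is_valid_ipv4(value):
    """Formatting check: True if value parses as an IPv4 address."""
    try:
        ipaddress.IPv4Address(value)
        return True
    except ValueError:
        # e.g. "300.300.300.300" has octets above 255
        return False


def has_valid_date_range(start_date, end_date):
    """Simple logic check: end_date must not precede start_date."""
    return end_date >= start_date


def select_rows(rows, mode, last_run_date):
    """Pick the rows to test (hypothetical mode names and load_date column)."""
    if mode == "full":
        return list(rows)
    if mode == "incremental":
        return [r for r in rows if r["load_date"] > last_run_date]
    raise ValueError("unknown mode: " + mode)


def count_violations(rows):
    """Return the number of failing rows per check type."""
    return {
        "formatting": sum(not is_valid_ipv4(r["ip"]) for r in rows),
        "simple_logic": sum(
            not has_valid_date_range(r["start_date"], r["end_date"]) for r in rows
        ),
    }


ROWS = [
    {"ip": "10.0.0.1", "start_date": date(2016, 7, 1),
     "end_date": date(2016, 7, 2), "load_date": date(2016, 7, 12)},
    {"ip": "300.300.300.300", "start_date": date(2016, 7, 5),
     "end_date": date(2016, 7, 1), "load_date": date(2016, 7, 13)},
]

# Incremental run: only rows loaded after the prior run are tested.
new_rows = select_rows(ROWS, "incremental", last_run_date=date(2016, 7, 12))
print(count_violations(new_rows))
```

In a real deployment the checks would run as queries against the tables rather than over in-memory rows; the sketch only shows the shape of the logic.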

Administrator easily adds new checks:

  • Adds existing, reusable checks:
    • This could be due to adding new data to test, or wanting to test existing data more comprehensively
    • reusable checks exist within checks directory
    • the user simply has to add entries to the registry JSON file that map the check name to a data field.
  • Checks can be written multiple ways:
    • as any executable program or script within the checks directory
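
The registry-driven flow described above could look roughly like the following Python sketch. The registry layout, the key names (`mode`, `checks`, `check_name`, `table`, `field`), the `CHECK_*` environment variables, and the convention that a check prints its violation count are all assumptions made for illustration; the actual registry format and check interface may differ.

```python
import json
import os
import subprocess
import sys
import tempfile
from pathlib import Path

# Hypothetical registry: maps a reusable check to a table and data field.
REGISTRY_JSON = """
{
  "mode": "incremental",
  "checks": [
    {"check_name": "check_not_null.py", "table": "customer", "field": "customer_id"}
  ]
}
"""

# Hypothetical check script. It reads its target from environment variables
# (names invented here) and prints the number of violations found.
CHECK_SOURCE = """\
import os
table = os.environ["CHECK_TABLE"]
field = os.environ["CHECK_FIELD"]
# A real check would query table.field here; this stub reports no violations.
print(0)
"""


def run_registry(registry, checks_dir):
    """Run every registered check and collect its violation count."""
    results = {}
    for entry in registry["checks"]:
        env = dict(
            os.environ,
            CHECK_TABLE=entry["table"],
            CHECK_FIELD=entry["field"],
            CHECK_MODE=registry["mode"],
        )
        script = Path(checks_dir) / entry["check_name"]
        # Launched through the Python interpreter for portability; an
        # executable check could be invoked directly instead.
        proc = subprocess.run(
            [sys.executable, str(script)],
            env=env, capture_output=True, text=True, check=True,
        )
        key = (entry["table"], entry["field"], entry["check_name"])
        results[key] = int(proc.stdout.strip())
    return results


def demo():
    """Create a temporary checks directory, add one check, and run it."""
    with tempfile.TemporaryDirectory() as checks_dir:
        (Path(checks_dir) / "check_not_null.py").write_text(CHECK_SOURCE)
        return run_registry(json.loads(REGISTRY_JSON), checks_dir)


if __name__ == "__main__":
    print(demo())
```

Under these assumptions, reusing an existing check on another field only requires one more entry in the `checks` list; no new code is written.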

End-Users see data annotations along with regular data on charts & graphs

  • TBD

End-Users investigate data anomalies using Inspector

  • TBD