DQ Launcher is a Python library that provides functions for performing data quality validations on Spark DataFrames. This library can help you ensure the integrity, consistency, and accuracy of your data before conducting analysis or training models with it.
You can install DQ Launcher using 'pip': pip install dqlauncher
First, import the DQ Launcher Session class and the Validator classes: from dqlauncher.session import DQLauncherSession
You need to initialize a DQLauncherSession object. It is a subclass of SparkSession that lets you create Validator objects within a Spark context.
You can initialize a Validator object as follows: dq_launcher = DQLauncherSession(name_of_your_app)
validator = dq_launcher.CreateValidator(data, columns)
Now you have a Validator object. It is a fully functional PySpark DataFrame with added functionality: you can use every method and feature a DataFrame supports, and you can run data quality validations on it, including:
- Check for null values
- Check for unique values
- Validate data types
- Validate fields against a reference table
- Validate data length
- Standardize names
- Count and get the percentage of null values
- Count and get the percentage of unique values
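To make the last two items concrete, here is a minimal sketch of what a null percentage and a unique percentage measure. It does not use the dqlauncher API, whose method names are not listed above. The helpers `null_stats` and `unique_stats` are hypothetical, they work on a plain list of row dictionaries instead of a Spark DataFrame, and the definition of "unique percentage" (distinct non-null values divided by total rows) is an assumption.

```python
from typing import Any, Dict, List, Tuple

Row = Dict[str, Any]


def null_stats(rows: List[Row], column: str) -> Tuple[int, float]:
    """Return (count, percentage) of rows where `column` is None or missing."""
    total = len(rows)
    nulls = sum(1 for row in rows if row.get(column) is None)
    pct = 100.0 * nulls / total if total else 0.0
    return nulls, pct


def unique_stats(rows: List[Row], column: str) -> Tuple[int, float]:
    """Return (count, percentage) of distinct non-null values in `column`.

    Assumption: the percentage is taken over the total number of rows.
    """
    total = len(rows)
    distinct = len({row[column] for row in rows if row.get(column) is not None})
    pct = 100.0 * distinct / total if total else 0.0
    return distinct, pct


# Hypothetical sample data standing in for a DataFrame.
rows = [
    {"id": 1, "country": "ES"},
    {"id": 2, "country": None},
    {"id": 3, "country": "ES"},
    {"id": 4, "country": "FR"},
]

null_count, null_pct = null_stats(rows, "country")
unique_count, unique_pct = unique_stats(rows, "country")
print(f"nulls: {null_count} ({null_pct:.1f}%)")
print(f"unique: {unique_count} ({unique_pct:.1f}%)")
```

On a real Validator object the same quantities would be computed by Spark across the cluster; the sketch only shows the arithmetic behind the reported numbers.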
If you want to contribute to this project, please refer to: CONTRIBUTING.md
This project is licensed under the MIT License. See the LICENSE file for more details.
Pablo Sierra Lorente
2023