Elevate the reliability of your datasets!
The Dataset Trust Score Framework is a powerful, user-friendly tool designed to evaluate and validate datasets for quality, consistency, and trustworthiness. By leveraging advanced metrics and automated validation techniques, this framework gives your data the credibility it deserves for analytical or machine learning purposes.
This was built during a Hackathon at BTU conducted by FSR AI.
Our framework doesn't just check boxes — it ensures your dataset is ready for action! Here’s what it offers:
- 🗋l Schema Validation: Guarantees your dataset meets required structural standards, including column names and data types.
- 🔉 Completeness Check: Flags missing data and evaluates the impact on your analysis.
- 📉 Outlier Detection: Hunts down rogue values that could skew your insights using methods like IQR and Z-Score.
- 🔒 Integrity Checks: Validates primary keys, foreign key relationships, and data consistency.
- 📊 Distribution Analysis: Ensures data aligns with expected patterns, helping you avoid unwelcome surprises.
- 🧺 Statistical Insights: Provides detailed summaries (mean, median, variance, etc.) for a snapshot of your data's health.
- 🗲 Text Quality Assessment: Reviews text columns for irregularities and evaluates their overall quality.
- 🕢 Temporal Validation: Checks dates for consistency and logical ordering.
- 🤝 Domain-Specific Rules: Adapts validations for unique datasets like geographic, financial, or healthcare data.
- 🏆 Trust Score: Combines all validation checks into a single, actionable trust score to guide decision-making.
web_ui/
: A sleek, interactive web interface built with Streamlit for hands-on dataset analysis.data_analyzer/
: Core logic for performing validation checks and generating the trust score.data_fetcher/
: Fetch datasets and metadata from files or APIs.trust_score_calculator/
: Combines all metrics into a comprehensive trust score.main_app.py
: Your starting point! Launches the Streamlit app to bring everything together.
-
Clone the repository:
git clone https://github.com/aloha1357/denkkraft.git
-
Install dependencies:
pip install -r requirements.txt
-
Run the application:
streamlit run main_app.py
- Upload a dataset (CSV, JSON, or API endpoint). Currently the csv is hardcoded.
- Get an easy-to-understand trust score, along with detailed reports on potential issues.
At the heart of this framework lies the Trust Score — a single metric that represents how reliable your dataset is. Think of it as the ultimate confidence boost (or warning sign!) for your data. Whether you’re preparing a report or training an ML model, the Trust Score gives you the green light (or prompts you to dive deeper).
We’re building this framework for the data community, and your input matters! Feel free to:
- Fork the repo and make improvements.
- Submit pull requests for new features or bug fixes.
- Open issues to share feedback or suggest enhancements.
Together, we can make datasets more reliable and analysis more impactful!
This project is licensed under the MIT License, so feel free to use it in your projects, big or small. Check the LICENSE
file for details.
Data is at the core of modern decision-making — let’s ensure it’s accurate, reliable, and ready to shine. 🚀
Created with passion by Denkkraft.
Get started with the Dataset Trust Score Framework today!