victorcouste/data-tools

A non-exhaustive, personal list of "modern" Data Tools and Projects

Suggest a Data Tool!

This list excludes (file system) storage and (traditional) databases and, for now, data science, virtualization, and streaming tools. It also excludes the embedded tools and services offered by the three main public Cloud providers (Google Cloud, Microsoft Azure, and AWS).

Data Architecture

Data Ingestion / Data Onboarding / ETL / ELT

  • Flatfile Data Onboarding platform
  • Fivetran Cloud data integration platform
  • Matillion Cloud data integration platform
  • Apache Gobblin Open Source distributed data integration framework
  • Singer "Open Source standard for writing scripts that move data"
  • Meltano Open Source ELT for DataOps
  • Airbyte Open Source data integration platform
  • Stitch Simple, extensible Cloud ETL platform (Talend)
  • Hevo No-code data pipeline as a service
  • Apache Hop Open Source data integration platform project
  • Meroxa Real-time data ingestion infrastructure
  • Portable Cloud Hosted ELT Platform
  • Talend, StreamSets, Alooma (Google), Xplenty, Striim, Panoply, Stambia, HVR

Reverse ETL

  • Census Operational analytics platform, move data from data warehouse to apps
  • Hightouch Sync customer data to SaaS business platforms
  • Grouparoo Open Source framework to move data between database and Cloud apps

Data Collection / Product Analytics / Customer Data

  • Segment Customer data platform (CDP) (Twilio)
  • RudderStack Customer data pipeline, event tracking
  • Snowplow Data collection platform
  • Freshpaint Collect, control, and deliver customer data
  • PostHog Open Source Product Analytics platform
  • Amplitude Product Analytics platform
  • Iteratively Product Analytics platform « Capture customer data you trust »
  • Avo Product Analytics platform
  • Mixpanel Product analytics platform
  • Indicative Product analytics platform 
  • Heap Product analytics platform
  • Supermetrics Get marketing data for reporting, analytics and storage

Transformation / Preparation / Cleaning / Wrangling

  • Trifacta Data Wrangling for Cloud (or Hadoop) platforms and storages
  • dbt Transform with SQL from command line (Open Source) or Cloud
  • Dataform Collaboration on SQL pipelines in Cloud data warehouses (Google)
  • Pano Open Source data preparation for Cloud data warehouses
  • Rasgo Data preparation for Data Scientists
  • Mito JupyterLab extension to generate pandas Python code from a spreadsheet
  • DataPrep Prepare data in Python
  • OpenRefine "A free, open source, powerful tool for working with messy data"

SQL Tools / Editors

  • Count "The BI notebook built for analysts"
  • PopSQL "Modern SQL editor"
  • DataGrip IDE for SQL (JetBrains)
  • DBeaver Free (or Enterprise and Cloud editions) universal database tool
  • sq "swiss-army knife for data", SQL in command line for relational data
  • SqlDBM Develop Database Models
  • Querybook Open Source SQL query and Big Data IDE via a notebook interface
  • Soda SQL Data testing, monitoring, and profiling for SQL-accessible data
  • SQLFluff SQL Linting and Auto-formatting for Humans

SQL Engines

  • Trino Open Source high-performance distributed SQL query engine (formerly PrestoSQL)
  • Starburst Cloud or On-premises SQL engine (based on Trino)
  • AWS Athena Interactive SQL query service for Amazon S3 (based on Presto)
  • DataFusion Query execution engine using Apache Arrow as its in-memory format

BI / Reporting / Data Visualization

  • Metabase Open Source business intelligence tool
  • Apache Superset Open Source modern data exploration and visualization platform
  • Apache ECharts Open Source JavaScript Visualization Library
  • Cube.js Open Source Analytical API platform
  • Grafana Open Source analytics & monitoring solution
  • Looker BI and Analytics Platform (Google)
  • Redash Data visualisation and Dashboarding with SQL (Databricks)
  • Mode Collaborative data platform that combines SQL, R, Python, and visual analytics
  • Sigma Cloud analytics solution
  • Hex Collaborative SQL + Python-based notebooks
  • Lux Python library and API for Intelligent Visual Discovery
  • y42 "No-Code Business Intelligence" platform
  • Knowage Open Source Business Analytics Suite
  • Rakam Data platform for building analytics interface (dbt integration)
  • Datawrapper Enrich stories and articles with data visualization
  • D3 JavaScript library for visualizing data with HTML, SVG, and CSS
  • Lightdash Open source BI tool fully integrated with dbt projects
  • Tableau, PowerBI, Sisense, Qlik, Spotfire, ThoughtSpot, Chartio (Atlassian), Domo, Toucan Toco

Data Quality / Profiling / Observability

  • Monte Carlo "Data Reliability Delivered"
  • Datafold Data Observability platform
  • Great Expectations Open Source data quality, profiling & validation
  • Bigeye Automatic data quality monitoring
  • Anomalo Validate and document your data warehouse
  • Trackplan "Schema Management for Behavioural Data Tracking"
  • lightup Cloud data quality indicators provider

Data Management / Lineage / Catalog / Governance

  • Datakin DataOps solution, Data Lineage
  • Marquez Open Source metadata and data governance project
  • DataHub Open Source metadata search & discovery tool
  • Amundsen Open Source data discovery and metadata engine
  • Data Galaxy Data Governance platform with Data Catalog and Data Lineage
  • Zeenea Cloud-native Data Catalog
  • Alation Data Governance and Data Catalog platform
  • Collibra Data Governance and Data Catalog platform
  • Secoda Data Discovery and Data Catalog
  • MANTA Data Lineage platform
  • data.world Cloud-native Data Catalog
  • Stemma SaaS managed version of Amundsen
  • Egeria Open Metadata and Governance

DataOps / Data Fabric

  • Atlan "the modern data workspace", Data Management & DataOps
  • Nessie DataOps for Data Lakes, a "Git-Like Experience for your Data Lake"
  • Nexla DataOps platform "to delivery data for Analytics, AI and Operations"
  • Keboola DataOps platform
  • Saagie DataOps platform
  • DataKitchen DataOps platform
  • DAGsHub GitHub for data
  • Unravel DataOps platform
  • Upsolver "Compute and pipeline layer between your data lake and the analytics tools"
  • Cinchy "Autonomous Data Fabric" and Data Management platform

Orchestration / Workflow

  • Apache Airflow Open Source workflow scheduler platform
  • Dagster Open Source "Data orchestrator for machine learning, analytics, and ETL"
  • Prefect Workflow management system and platform for dataflow automation
  • Apache DolphinScheduler Distributed and visual workflow scheduler system
  • Luigi Python package to build complex pipelines of batch jobs

Storage / Database

  • DuckDB In-process SQL OLAP database (SQLite-like, column-oriented)
  • ClickHouse Open-source OLAP database management system
  • DoltHub "the true Git for data experience in a SQL database"
  • DVC Data Version Control
  • Materialize Event Streaming Database
  • Warp 10 Advanced Time Series Platform
  • Snowflake, Firebolt, BigQuery, Redshift, Apache Cassandra, MongoDB, InfluxDB, QuestDB, Neo4j, SingleStore (MemSQL)

Data Privacy / Security / Identity

  • Immuta "Self-Service Data Access with Automated Privacy Control"
  • Okera Cloud data security, "Universal Data Authorization"
  • Privacera SaaS Access Governance Solution
  • Apache Ranger Framework to enable, monitor and manage comprehensive data security
  • Baffle Cloud security with a "transparent data security mesh"
  • Privitar Enterprise Data Privacy Software
  • ReachFive Identity & Access Management
  • Okta Trusted platform to secure identities, from customers to workforce

Others

  • Opendatasoft Data sharing platform
  • Streamlit Turns data scripts into shareable data web apps
  • Transform Data Shared data interface and metrics repository
  • White Label Data Platform for building and deploying custom data applications
  • Flat Data Bring working datasets into your GitHub repositories and version them

And finally don't hesitate to:

Victor