Open Source Data Quality Software
From Open Risk Manual
The table below lists open source software distributions that cover some aspect of data quality assessment.
Criteria for inclusion
- Any open source distribution that is publicly accessible in a code repository. For brevity, when a repository contains several distinct tools, only one link is provided.
- Libraries and frameworks need not be exclusively focused on data quality, as the functionality is frequently bundled with data cleansing or exploratory data analysis.
- Data quality assessment matters in widely different contexts and workflows (from validating an Excel sheet to big data pipelines, offline versus online, etc.), so the list includes a diverse set of tools.
- The star, issue, and fork counts are included as a rough measure of maturity. Use at your own risk.
Name | Description | Language | Online Docs | URL | Stars | Issues | Forks |
---|---|---|---|---|---|---|---|
pyeve/cerberus | cerberus is a lightweight, extensible data validation library for Python | Python | docs | github | 2246 | 33 | 202 |
datacleaner/DataCleaner | DataCleaner Community Edition | Java | docs | github | 371 | 172 | 136 |
pandas-profiling/pandas-profiling | pandas-profiling generates profile reports from a pandas DataFrame | Python | docs | github | 6338 | 44 | 962 |
OpenRefine/OpenRefine | OpenRefine is a tool for working with messy data | Java | docs | github | 7735 | 595 | 1376 |
data-cleaning/validate | validate: Data cleaning for statistical purposes | R | docs | github | 236 | 21 | 18 |
ResidentMario/missingno | missingno is a missing data visualization module for Python | Python | | github | 2540 | 15 | 334 |
great-expectations/great_expectations | Great Expectations helps data teams eliminate pipeline debt, through data testing, documentation, and profiling | Python | docs | github | 3127 | 147 | 348 |
daveoncode/pyvaru | pyvaru: Rule based data validation library for python | Python | docs | github | 14 | 1 | 3 |
awslabs/deequ | Deequ is a library built on top of Apache Spark for defining "unit tests for data", which measure data quality in large datasets | Scala | | github | 1328 | 90 | 256 |
WeBankFinTech/Qualitis | Qualitis is a data quality management platform that supports quality verification, notification, and management for various datasources | Java | docs | github | 208 | 16 | 107 |
whylabs/whylogs-python | whylogs-python is a Python implementation of whylogs | Python | docs | github | 191 | 10 | 7 |
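
To make concrete what "data quality assessment" can mean at the small end of the spectrum, the following is a minimal sketch of rule-based record validation and a completeness measure, written in plain Python with the standard library only. It does not use or reproduce the API of any tool in the table. The names `RULES`, `validate_record`, and `completeness`, the sample records, and the rules themselves are hypothetical and chosen only for illustration.

```python
from typing import Any, Callable, Dict, List

# Hypothetical rule set (an assumption for this sketch):
# field name -> predicate that returns True when the value is acceptable.
Rule = Callable[[Any], bool]

RULES: Dict[str, Rule] = {
    "name": lambda v: isinstance(v, str) and len(v) > 0,
    "age": lambda v: isinstance(v, int) and not isinstance(v, bool) and 0 <= v <= 130,
    "email": lambda v: isinstance(v, str) and "@" in v,
}


def validate_record(record: Dict[str, Any], rules: Dict[str, Rule]) -> List[str]:
    """Return one message per field that is missing or fails its rule."""
    errors = []
    for field, rule in rules.items():
        if field not in record or record[field] is None:
            errors.append(f"{field}: missing")
        elif not rule(record[field]):
            errors.append(f"{field}: invalid value {record[field]!r}")
    return errors


def completeness(records: List[Dict[str, Any]], field: str) -> float:
    """Share of records that carry a non-missing value for the given field."""
    if not records:
        return 0.0
    present = sum(1 for r in records if r.get(field) is not None)
    return present / len(records)


# Illustrative sample data, not taken from any real dataset.
records = [
    {"name": "Ada", "age": 36, "email": "ada@example.org"},
    {"name": "", "age": 200, "email": None},
]

for index, rec in enumerate(records):
    print(index, validate_record(rec, RULES))
print("email completeness:", completeness(records, "email"))
```

The listed libraries typically add much more on top of this idea, such as declarative schemas, reporting, and profiling, so the sketch should be read only as an illustration of the underlying concept.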