From Open Risk Manual
Latest revision as of 16:30, 27 November 2020
Open Source Data Quality Software
The table below lists open source software distributions that cover some aspect of data quality assessment.
Criteria for inclusion
- Any open source distribution that is publicly accessible in a code repository. For brevity, when a repository contains a number of distinct tools, only one link is provided.
- Libraries / frameworks need not be exclusively focused on data quality, as the functionality is frequently bundled with Data Cleansing or Exploratory Data Analysis.
- Data quality assessment is important in widely different contexts and workflows (from validating an Excel sheet to big data pipelines, offline versus online, etc.), so the list includes a diverse set of tools.
- The star / issue / fork count is included as a rough measure of maturity. Use at your own risk.
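Because the counts in the table are a snapshot taken at the revision date, a reader may want to refresh them. The following is a minimal sketch, not part of the original page, that assumes the public GitHub REST API endpoint `https://api.github.com/repos/{owner}/{repo}` and its `stargazers_count`, `open_issues_count` and `forks_count` fields. The helper names are hypothetical, and unauthenticated requests are rate limited.

```python
import json
import urllib.request

# Assumption: the public GitHub REST API base URL for repository metadata.
GITHUB_API = "https://api.github.com/repos/"


def repo_api_url(full_name: str) -> str:
    """Build the API URL for an 'owner/repo' name as written in the table."""
    return GITHUB_API + full_name


def extract_counts(payload: dict) -> dict:
    """Pick the three maturity indicators out of a repository payload.

    Note: GitHub's open_issues_count also includes open pull requests.
    """
    return {
        "stars": payload["stargazers_count"],
        "issues": payload["open_issues_count"],
        "forks": payload["forks_count"],
    }


def fetch_counts(full_name: str) -> dict:
    """Fetch current counts for one repository (performs a network call)."""
    with urllib.request.urlopen(repo_api_url(full_name), timeout=10) as resp:
        return extract_counts(json.load(resp))
```

For example, `fetch_counts("pyeve/cerberus")` would return a dictionary with the keys `stars`, `issues` and `forks`. The values will differ from the table, since the counts change over time.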
Name | Description | Language | Online Docs | URL | Stars | Issues | Forks |
---|---|---|---|---|---|---|---|
pyeve/cerberus | cerberus is a lightweight, extensible data validation library for Python | Python | docs | github | 2246 | 33 | 202 |
datacleaner/DataCleaner | DataCleaner Community Edition | Java | docs | github | 371 | 172 | 136 |
pandas-profiling/pandas-profiling | pandas-profiling generates profile reports from a pandas DataFrame | Python | docs | github | 6338 | 44 | 962 |
OpenRefine/OpenRefine | OpenRefine is a tool for working with messy data | Java | docs | github | 7735 | 595 | 1376 |
data-cleaning/validate | validate: Data cleaning for statistical purposes | R | docs | github | 236 | 21 | 18 |
ResidentMario/missingno | missingno is a missing data visualization module for Python | Python | | github | 2540 | 15 | 334 |
great-expectations/great_expectations | Great Expectations helps data teams eliminate pipeline debt, through data testing, documentation, and profiling | Python | docs | github | 3127 | 147 | 348 |
daveoncode/pyvaru | pyvaru: Rule based data validation library for Python | Python | docs | github | 14 | 1 | 3 |
awslabs/deequ | Deequ is a library built on top of Apache Spark for defining "unit tests for data", which measure data quality in large datasets | Scala | | github | 1328 | 90 | 256 |
WeBankFinTech/Qualitis | Qualitis is a data quality management platform that supports quality verification, notification, and management for various datasources | Java | docs | github | 208 | 16 | 107 |
whylabs/whylogs-python | whylogs-python is a Python implementation of whylogs | Python | docs | github | 191 | 10 | 7 |
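
Several entries above describe rule-based validation, missing-data analysis, or "unit tests for data". The sketch below illustrates those ideas in plain Python using only the standard library. It is an illustrative assumption of what such checks can look like and does not reproduce the API of any listed library; the function names, rule names and sample rows are hypothetical.

```python
from typing import Callable, Dict, List, Optional

Row = Dict[str, Optional[object]]
Rule = Callable[[Row], bool]


def missing_rate(rows: List[Row], column: str) -> float:
    """Fraction of rows where the column is absent, None, or an empty string."""
    if not rows:
        return 0.0
    missing = sum(1 for r in rows if r.get(column) in (None, ""))
    return missing / len(rows)


def failed_rows(rows: List[Row], rules: Dict[str, Rule]) -> Dict[str, List[int]]:
    """For each named rule, collect the indices of the rows that violate it."""
    failures: Dict[str, List[int]] = {name: [] for name in rules}
    for i, row in enumerate(rows):
        for name, rule in rules.items():
            if not rule(row):
                failures[name].append(i)
    return failures


# Hypothetical sample data and rules, for illustration only.
rows = [
    {"id": 1, "age": 34, "email": "a@example.org"},
    {"id": 2, "age": -5, "email": ""},
    {"id": 3, "age": None, "email": "c@example.org"},
]
rules = {
    # Missing ages are tolerated here; completeness is measured separately.
    "age_non_negative": lambda r: r["age"] is None or r["age"] >= 0,
    "email_present": lambda r: bool(r["email"]),
}

report = failed_rows(rows, rules)
print(report)
print(missing_rate(rows, "age"))
```

Keeping validity rules separate from the completeness measure mirrors the split between validation-oriented and profiling-oriented tools in the table, although each listed project defines its own concepts and interfaces.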