Open Source Data Quality Software

From Open Risk Manual

The table below lists open source software distributions that cover some aspect of Data Quality assessment.

Criteria for inclusion

  • Any open source distribution that is publicly accessible in one of the repositories. For brevity, when a repository contains a number of distinct tools, only one link is provided.
  • Libraries and frameworks need not be exclusively focused on data quality, as the functionality is frequently bundled with Data Cleansing or Exploratory Data Analysis.
  • Data quality assessment is important in widely different contexts and workflows (from validating an Excel sheet to big data pipelines, offline versus online, etc.), so the list includes a diverse set of tools.
  • The star, issue, and fork counts are included as a rough measure of maturity. Use at your own risk.
Open Source Data Quality Software
Name | Description | Language | Online Docs | URL | Stars | Issues | Forks
pyeve/cerberus | cerberus is a lightweight, extensible data validation library for Python | Python | docs | github | 2246 | 33 | 202
datacleaner/DataCleaner | DataCleaner Community Edition | Java | docs | github | 371 | 172 | 136
pandas-profiling/pandas-profiling | pandas-profiling generates profile reports from a pandas DataFrame | Python | docs | github | 6338 | 44 | 962
OpenRefine/OpenRefine | OpenRefine is a tool for working with messy data | Java | docs | github | 7735 | 595 | 1376
data-cleaning/validate | validate: Data cleaning for statistical purposes | R | docs | github | 236 | 21 | 18
ResidentMario/missingno | missingno is a missing data visualization module for Python | Python | - | github | 2540 | 15 | 334
great-expectations/great_expectations | Great Expectations helps data teams eliminate pipeline debt, through data testing, documentation, and profiling | Python | docs | github | 3127 | 147 | 348
daveoncode/pyvaru | pyvaru: Rule based data validation library for Python | Python | docs | github | 14 | 1 | 3
awslabs/deequ | Deequ is a library built on top of Apache Spark for defining "unit tests for data", which measure data quality in large datasets | Scala | - | github | 1328 | 90 | 256
WeBankFinTech/Qualitis | Qualitis is a data quality management platform that supports quality verification, notification, and management for various datasources | Java | docs | github | 208 | 16 | 107
whylabs/whylogs-python | whylogs-python is a Python implementation of whylogs | Python | docs | github | 191 | 10 | 7
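
For readers unfamiliar with what such tools automate, the following is a minimal sketch of two common checks, completeness (share of non-missing values) and rule-based validation, written in plain Python with the standard library only. It does not use or reproduce the API of any library listed above. The names RULES, completeness and validate, the example columns, and the sample rows are all hypothetical and chosen purely for illustration.

```python
from typing import Any, Callable

# Hypothetical rule set (an assumption for illustration):
# column name -> predicate that a valid value must satisfy.
Rule = Callable[[Any], bool]

RULES: dict[str, Rule] = {
    "name": lambda v: isinstance(v, str) and v.strip() != "",
    "age": lambda v: isinstance(v, int) and 0 <= v <= 130,
}


def completeness(rows: list[dict[str, Any]], column: str) -> float:
    """Fraction of rows in which `column` is present and not None."""
    if not rows:
        return 0.0
    filled = sum(1 for row in rows if row.get(column) is not None)
    return filled / len(rows)


def validate(rows: list[dict[str, Any]], rules: dict[str, Rule]) -> list[tuple[int, str]]:
    """Return (row_index, column) for every non-missing value that breaks its rule."""
    failures = []
    for i, row in enumerate(rows):
        for column, rule in rules.items():
            value = row.get(column)
            # Missing values are counted by completeness(), not flagged here.
            if value is not None and not rule(value):
                failures.append((i, column))
    return failures


# Made-up sample records, not real data.
rows = [
    {"name": "Ada", "age": 36},
    {"name": "", "age": 41},
    {"name": "Grace", "age": None},
    {"name": "Linus", "age": -5},
]

print(completeness(rows, "age"))  # 0.75
print(validate(rows, RULES))      # [(1, 'name'), (3, 'age')]
```

Keeping missing-value measurement separate from rule checking mirrors the split between profiling and validation that several of the listed tools describe, although each tool defines and reports these checks in its own way.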