IT project teams are increasingly sensitive to data quality but, unfortunately, this is not always the case. When it is not, no budget is allocated to manage it. Additionally, when you ask how confident people are in the quality of the data, you will probably get divergent and unquantified answers. So how do you quickly get visibility into data quality with a limited budget? One solution is to use DataCleaner.

Introduction

DataCleaner is a data quality toolkit that allows you to profile, correct and enrich your data. People use it for ad-hoc analysis and recurring cleansing, as well as a “jack of all trades” kind of tool. DataCleaner is open source. You can use it for one-shot data quality analysis, but it should not be used as a persistent block in your architecture.

Installation is also really easy: download it, copy the folder to the root directory of your hard disk, and run it.

DataCleaner Features

Data Sources

Database List

As you can see, the main databases are available as data sources. If your source database does not appear in the list, you can use CSV files as a data source instead.

For one job, you can select multiple data sources, which enables data comparison.
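
To make the idea of comparing two data sources concrete, here is a minimal Python sketch. It is not how DataCleaner does it internally; the function names (`load_by_key`, `compare_sources`) and the `id` and `email` columns are assumptions made up for the illustration.

```python
import csv
import io


def load_by_key(csv_text, key):
    """Parse CSV text and index its rows by the value of the `key` column."""
    reader = csv.DictReader(io.StringIO(csv_text))
    return {row[key]: row for row in reader}


def compare_sources(left, right):
    """Report keys missing on either side and keys whose rows differ."""
    shared = left.keys() & right.keys()
    return {
        "only_left": sorted(left.keys() - right.keys()),
        "only_right": sorted(right.keys() - left.keys()),
        "differing": sorted(k for k in shared if left[k] != right[k]),
    }


# Hypothetical extracts of the same customers from two systems.
crm = "id,email\n1,a@x.org\n2,b@x.org\n3,c@x.org\n"
billing = "id,email\n2,b@x.org\n3,c@y.org\n4,d@x.org\n"

report = compare_sources(load_by_key(crm, "id"), load_by_key(billing, "id"))
print(report)
```

The report lists the records present in only one source and those whose content disagrees, which is the kind of information a comparison job is meant to surface.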

Transformations

Data Transformation

DataCleaner offers ETL-like features that allow you to “draw” data transformations.

Basic transformations are available: table join, format conversion, date format management, filter, math formula, text format management.

You also have access to more specific features, such as a machine learning algorithm to classify data, and you can even script some transformations yourself using JavaScript.

Data Improvement

Data improvement

These components let you clean up data and improve its quality by using lookup (reference) data. For example, you will be able to standardize country names and codes, replace strings with their synonyms, or remove adjectives that make it difficult to compare terms.
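
The principle behind synonym-based standardization can be sketched in a few lines of Python. This is only an illustration of the lookup idea, not DataCleaner's implementation; the `COUNTRY_SYNONYMS` table and the `standardize_country` function are hypothetical.

```python
# Hypothetical lookup table: normalized spelling -> ISO 3166-1 alpha-2 code.
COUNTRY_SYNONYMS = {
    "united states": "US",
    "united states of america": "US",
    "usa": "US",
    "france": "FR",
    "fr": "FR",
    "germany": "DE",
    "deutschland": "DE",
}


def standardize_country(raw, lookup=COUNTRY_SYNONYMS):
    """Return the standard code for a free-text country value, or None if unknown."""
    # Normalize case, dots and repeated whitespace before the lookup.
    key = " ".join(raw.lower().replace(".", "").split())
    return lookup.get(key)


for value in ["  U.S.A. ", "Deutschland", "United  States", "Atlantis"]:
    print(repr(value), "->", standardize_country(value))
```

Returning `None` for unknown values, rather than guessing, keeps unmatched records visible so they can be reviewed or added to the lookup data.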

Analysis

DataCleaner Analysis

This block enables data quality analysis. For example, you can:

  • Add quality control by checking the integrity of a foreign key, inspecting your boolean values, and asserting completeness
  • Gain insight into your data format by getting the distribution of values that occur in a dataset, generating and matching string patterns, and collecting a variety of typical metrics on string values
  • Visualize your data with density plots, stacked area plots and scatter plots
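
To show what some of these checks compute, here is a minimal Python sketch of a completeness ratio, a foreign key integrity check and a string pattern distribution. It is an assumption-laden illustration: the function names are invented, and the pattern notation (`A`, `a`, `9`) is a simplified one chosen for this example, not necessarily the notation DataCleaner displays.

```python
from collections import Counter


def is_filled(value):
    """A value counts as filled when it is neither None nor blank."""
    return value is not None and str(value).strip() != ""


def completeness(rows, column):
    """Share of rows with a filled value in `column` (1.0 for an empty dataset)."""
    if not rows:
        return 1.0
    return sum(1 for row in rows if is_filled(row.get(column))) / len(rows)


def orphan_keys(child_rows, fk_column, parent_rows, pk_column):
    """Foreign key values in the child rows that match no parent primary key."""
    parent_keys = {row[pk_column] for row in parent_rows}
    return sorted(
        {
            row[fk_column]
            for row in child_rows
            if is_filled(row.get(fk_column)) and row[fk_column] not in parent_keys
        }
    )


def string_pattern(value):
    """Map uppercase letters to 'A', lowercase to 'a', digits to '9'; keep the rest."""
    out = []
    for ch in value:
        if ch.isdigit():
            out.append("9")
        elif ch.isalpha():
            out.append("A" if ch.isupper() else "a")
        else:
            out.append(ch)
    return "".join(out)


def pattern_distribution(values):
    """Count how many values share each string pattern."""
    return Counter(string_pattern(v) for v in values)


# Hypothetical sample data.
customers = [{"id": 10}, {"id": 11}]
orders = [
    {"id": 1, "customer_id": 10, "ref": "AB-123"},
    {"id": 2, "customer_id": 99, "ref": "CD-456"},
    {"id": 3, "customer_id": None, "ref": "x1"},
]

print(completeness(orders, "customer_id"))
print(orphan_keys(orders, "customer_id", customers, "id"))
print(pattern_distribution(row["ref"] for row in orders))
```

A rare pattern in the distribution, such as a single reference that does not follow the dominant format, is often the quickest way to spot malformed values.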

Export

DataCleaner Export

Finally, you can export data in several formats to share it, such as CSV and Excel. You can also insert the results into a database.

Conclusion

If you work on a data project, you should try this tool. It takes little time to get started and it will probably be useful one day. However, it does have its limits, so you should not use it as a persistent block in your architecture.
