IT project teams increasingly care about data quality but, unfortunately, this is not always the case. When it is not, no budget is allocated to manage it. Additionally, when you ask how confident people are in the quality of the data, you will probably get divergent and unquantified answers. So how do you quickly get visibility into data quality with a limited budget? One solution is to use DataCleaner.
Introduction
DataCleaner is a data quality toolkit that allows you to profile, correct and enrich your data. People use it for ad-hoc analysis and recurring cleansing, as well as a “jack of all trades” kind of tool. DataCleaner is open source. You can use it for one-shot data quality analysis, but it should not be used as a persistent block in your architecture.
Besides, installation is really easy: you can download it here. All you have to do is copy the folder to the root directory of your hard disk and then run it.
DataCleaner Features
Data Sources
The main databases are available as data sources. Moreover, if your source database does not appear in the list, you can use CSV files as a data source.
For one job, you can select multiple data sources, which enables data comparison.
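To make the idea of comparing two sources concrete, here is a minimal sketch in plain Python. It is not DataCleaner's API: the two CSV extracts, the column names and the helper functions `load_by_key` and `compare` are all hypothetical, chosen only to illustrate the kind of check a multi-source job makes possible.

```python
import csv
import io

# Hypothetical extracts standing in for two data sources.
CRM_CSV = "id,email\n1,a@example.com\n2,b@example.com\n3,c@example.com\n"
BILLING_CSV = "id,email\n2,b@example.com\n3,c@example.org\n4,d@example.com\n"


def load_by_key(text, key):
    """Index the rows of a CSV extract by the given key column."""
    return {row[key]: row for row in csv.DictReader(io.StringIO(text))}


def compare(left, right):
    """Report keys missing on either side and shared keys whose rows differ."""
    shared = left.keys() & right.keys()
    return {
        "only_left": sorted(left.keys() - right.keys()),
        "only_right": sorted(right.keys() - left.keys()),
        "different": sorted(k for k in shared if left[k] != right[k]),
    }


report = compare(load_by_key(CRM_CSV, "id"), load_by_key(BILLING_CSV, "id"))
print(report)
```

With these sample extracts, the report flags record 1 as missing from billing, record 4 as missing from the CRM, and record 3 as present in both with a different email.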
Transformations
DataCleaner offers ETL-like features that allow you to “draw” data transformations.
Basic transformations are available: table join, format conversion, date format management, filter, math formula, text format management.
You also have access to more specific features, such as a machine learning algorithm to classify data, and you can even script some transformations yourself using JavaScript.
Data Improvement
These components let you clean up your data and improve its quality by drawing on lookup data. For example, you will be able to standardize country names and codes, replace strings with their synonyms, or remove adjectives that make it difficult to compare terms.
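The synonym-based standardization described above can be sketched in a few lines of plain Python. This is only an illustration of the principle, not how DataCleaner implements it: the catalog `COUNTRY_SYNONYMS` and the functions `build_lookup` and `standardize` are assumptions made for the example.

```python
# Hypothetical synonym catalog: canonical code -> known variants.
COUNTRY_SYNONYMS = {
    "FR": ["france", "fr", "fra", "république française"],
    "US": ["united states", "usa", "us", "u.s.a."],
}


def build_lookup(catalog):
    """Invert the catalog into a variant -> canonical code mapping."""
    lookup = {}
    for canonical, variants in catalog.items():
        for variant in variants:
            lookup[variant.strip().lower()] = canonical
    return lookup


def standardize(value, lookup):
    """Return the canonical code, or None when the value is unknown."""
    if value is None:
        return None
    return lookup.get(value.strip().lower())


LOOKUP = build_lookup(COUNTRY_SYNONYMS)
raw = ["France", " usa ", "U.S.A.", "Atlantis"]
cleaned = [standardize(v, LOOKUP) for v in raw]
print(cleaned)
```

Returning `None` for unknown values, rather than passing them through unchanged, makes the unmatched entries easy to count and review afterwards.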
Analysis
This block enables data quality analysis. For example, you can:
- Add quality control by checking the integrity of a foreign key, inspecting your boolean values, and asserting completeness
- Gain insight into your data format by getting the distribution of values that occur in a dataset, generating and matching string patterns, and collecting a variety of typical metrics on string values
- Visualize your data with density plots, stacked area plots and scatter plots
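To give a feel for what these analyses compute, here is a minimal sketch of three of them in plain Python: a completeness ratio, a foreign-key integrity check, and a string pattern distribution. It does not use DataCleaner; the sample tables, the column names and the helpers `completeness`, `orphan_keys` and `to_pattern` are hypothetical.

```python
import re
from collections import Counter


def is_filled(value):
    """A value counts as filled when it is neither None nor blank."""
    return value is not None and str(value).strip() != ""


def completeness(rows, column):
    """Share of rows where the column is filled (0.0 for an empty table)."""
    if not rows:
        return 0.0
    return sum(1 for r in rows if is_filled(r.get(column))) / len(rows)


def orphan_keys(child_rows, fk_column, parent_rows, pk_column):
    """Foreign-key values in the child rows with no matching parent key."""
    parent_keys = {r[pk_column] for r in parent_rows}
    return sorted(
        {
            r[fk_column]
            for r in child_rows
            if is_filled(r.get(fk_column)) and r[fk_column] not in parent_keys
        }
    )


def to_pattern(value):
    """Map uppercase letters to 'A', lowercase letters to 'a', digits to '9'."""
    pattern = re.sub(r"[A-Z]", "A", value)
    pattern = re.sub(r"[a-z]", "a", pattern)
    return re.sub(r"[0-9]", "9", pattern)


# Hypothetical sample data.
customers = [
    {"id": 1, "zip": "75001"},
    {"id": 2, "zip": "SW1A 1AA"},
    {"id": 3, "zip": ""},
    {"id": 4, "zip": "69002"},
]
orders = [
    {"order_id": 10, "customer_id": 1},
    {"order_id": 11, "customer_id": 5},
    {"order_id": 12, "customer_id": 2},
]

zip_completeness = completeness(customers, "zip")
orphans = orphan_keys(orders, "customer_id", customers, "id")
zip_patterns = Counter(
    to_pattern(c["zip"]) for c in customers if is_filled(c["zip"])
)
print(zip_completeness, orphans, zip_patterns)
```

Numbers like these are exactly what replaces the "divergent and unquantified answers": a completeness ratio per column, a list of orphaned keys, and the share of values that follow each format.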
Export
Finally, you will be able to export the data as CSV or Excel files to share it. You can also insert the results into a database.
Conclusion
If you work on a data project, you should try this tool. You will not waste your time and it will probably be useful one day. However, it does have its limits, so you should not use it as a persistent block.