Role of AI and ML in Data Quality

Technology & Business Perspectives



    Every organization recognizes the importance of data and its contribution to success. Data becomes even more significant in this era of big data, cloud computing and AI.

    The relevance of data goes beyond its volume or how it is used. If a company has terrible data quality, all the actionable analytics in the world will make no difference, so every effort is made to ensure that the data generated is of the highest quality. As part of this effort, the use of Artificial Intelligence and Machine Learning principles is currently a hot topic in the Master Data Management (MDM) realm. MDM platforms are incorporating AI and Machine Learning capabilities to improve accuracy, consistency and manageability, among other things.

    As indicated in the figure below, People, Process and Technology play a big role in ensuring data quality.


    While we may have layers of checks and balances to ensure data quality, errors still creep in. Using AI/ML principles as a final filtering layer helps in achieving near-zero-defect data.

    How can machine learning improve data quality?

    The following example illustrates how machine learning can play an important role in data quality.

    Use Case

    Credit Risk (CR) Financial of a large bank.

    Problem statement

    CR Financial is referred to in several ways in official records, such as “CR,” “CR Financial,” or, rarely, “Credit Risk Financial.” Reconciling all of these entries manually is labor-intensive and tedious. A manual process is also prone to human error, which makes the findings unreliable.


    This is where Machine Learning comes into the picture. A computer program can scan all of the bank’s information in a matter of hours, and then deliver a report that shows how many times each variation of CR Financial shows up. With this information, the bank can get a sense of its exposure to CR Financial.
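
    As a starting point, the scan described above can be sketched as a simple count of the known spellings. This is a rule-based baseline rather than a learned model. The sample records and the names `VARIANT_PATTERN` and `count_variants` are assumptions made for illustration, not part of any bank system.

```python
import re
from collections import Counter

# Known spellings of the entity, taken from the problem statement above.
# Longer variants come first so "CR Financial" is not counted as a bare "CR".
VARIANT_PATTERN = re.compile(r"\b(Credit Risk Financial|CR Financial|CR)\b")


def count_variants(records):
    """Count how often each spelling of the entity appears in the records."""
    counts = Counter()
    for text in records:
        counts.update(VARIANT_PATTERN.findall(text))
    return counts


# Hypothetical sample records, invented for illustration only.
records = [
    "Loan facility extended to CR Financial in Q1.",
    "Guarantee issued on behalf of Credit Risk Financial.",
    "CR requested a revised repayment schedule.",
    "Collateral posted by CR Financial was revalued.",
]

counts = count_variants(records)
total_mentions = sum(counts.values())
```

    A fixed pattern like this only finds spellings that are already known. Catching new or misspelled variants is where a learned or fuzzy matcher would be needed.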

    Improving Quality of Data after every Run

    What happens if there are some mistakes in the scan - for example, the computer program brings back false positives? The machine learns from its errors; once it receives feedback, it incorporates the corrections into its memory. It will apply those rules to the next data set it reviews.
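
    One way such a feedback loop could work is sketched below with a very small online classifier (a perceptron over word tokens). It is a sketch under stated assumptions: the class name `FeedbackClassifier`, its methods and the sample records are hypothetical, and a production system would use a far richer model.

```python
class FeedbackClassifier:
    """Tiny online perceptron over word tokens, updated from reviewer feedback."""

    def __init__(self):
        self.weights = {}   # token -> learned weight
        self.bias = 0.0

    @staticmethod
    def _tokens(text):
        return set(text.lower().split())

    def predict(self, text):
        """Return True if the record is judged to mention the entity."""
        score = self.bias + sum(self.weights.get(t, 0.0) for t in self._tokens(text))
        return score > 0

    def feedback(self, text, is_match):
        """Incorporate a reviewer's verdict; weights change only on a mistake."""
        if self.predict(text) == is_match:
            return
        step = 1.0 if is_match else -1.0
        self.bias += step
        for t in self._tokens(text):
            self.weights[t] = self.weights.get(t, 0.0) + step


# Hypothetical records, invented for illustration only.
model = FeedbackClassifier()
model.feedback("exposure to cr financial", True)   # reviewer confirms a true match

false_positive = "cr meeting room booked"
before = model.predict(false_positive)             # flagged by mistake
model.feedback(false_positive, False)              # reviewer marks it as wrong
after = model.predict(false_positive)              # no longer flagged
```

    After the correction, the adjusted weights also apply to records the model has not seen before, which is the sense in which the corrections carry over to the next data set.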

    A notable property of machine learning is that a model can keep improving as it receives more feedback. A machine can also process data much faster than a human being, so computers can perform jobs at speeds that used to be considered impossible.

    Different ways of improving Data quality using AI

    Intelligent data capture

    The performance and effectiveness of any process and its output depend entirely on the input: garbage in, garbage out. According to Gartner, $14.2M is lost annually as a result of improper and erroneous data capture. Automation comes to the rescue in the form of intelligent data capture using AI. Together with data prediction, it helps improve data quality by minimizing data leaks and closing gaps in the system.

    Moreover, because this process is automated, it frees up staff to spend more time on core business activities.

    Identify duplicate records and errors

    Duplicate entries are one of the most common data quality issues. They can make an entire repository untrustworthy and can cause serious damage to a business, especially when the data pertains to customers. AI can again come to the rescue by eliminating duplicate records and keeping precise golden keys in the database. Because duplicates can creep in in several ways, detecting them goes beyond simple comparisons and requires intelligent systems that can detect and isolate duplicate data.
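
    To illustrate why detection goes beyond simple comparisons, the sketch below flags near-duplicate names using string similarity from Python's standard `difflib` module. It is a basic fuzzy-matching baseline, not an AI system. The customer names, the helper names and the 0.85 threshold are assumptions chosen for illustration.

```python
from difflib import SequenceMatcher
from itertools import combinations


def normalise(name):
    """Lower-case, replace punctuation with spaces and collapse whitespace."""
    cleaned = "".join(ch if ch.isalnum() or ch.isspace() else " " for ch in name.lower())
    return " ".join(cleaned.split())


def similarity(a, b):
    """Similarity ratio between 0.0 and 1.0 for two normalised names."""
    return SequenceMatcher(None, normalise(a), normalise(b)).ratio()


def find_duplicates(names, threshold=0.85):
    """Return pairs of names whose similarity meets the threshold."""
    return [
        (a, b)
        for a, b in combinations(names, 2)
        if similarity(a, b) >= threshold
    ]


# Hypothetical customer names, invented for illustration only.
customers = [
    "Acme Industries Ltd.",
    "ACME Industries Ltd",
    "Acme Industires Ltd",   # transposed letters
    "Globex Corporation",
]

pairs = find_duplicates(customers)
```

    An exact comparison would treat all four names as distinct, whereas the similarity check groups the three Acme spellings together. Comparing every pair grows quadratically with the number of records, so larger repositories would need a way to narrow down candidate pairs first.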

    One of the most effective implementations of an AI-based intelligent system for identifying duplicate entries can be found in Salesforce CRM, which flags a duplicate entry at the moment a record is created in its repository. This helps keep business-critical entities such as contacts, leads and business accounts clean and trustworthy.

    Likewise, an AI-enabled validation mechanism can go far beyond the routine data validations performed at source. Eliminating data errors also eliminates the system errors that occur because of bad data quality.

    Third-party data inclusion

    Many of the entities in our repositories have attributes that are also available in other repositories, most often held by third parties. For example, when customer data is captured, details about the same customers may also be available in repositories maintained by the government. Where these repositories can be accessed, they can not only significantly improve data quality through validation but also enrich the entity with additional attributes.

    AI can be used to fetch data from repositories maintained by such third-party organisations and governmental units that hold better and more complete data. By automatically establishing relationships between the data, AI suggests what to fetch from a particular repository. This can add significant value to MIS and MDM platforms and help in precise decision making.
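
    The validation and enrichment step itself can be sketched as follows. The sketch assumes the relationship between the two repositories is already known (here, a shared `tax_id` key) and uses an in-memory dictionary as a stand-in for a third-party registry. The function `enrich`, the field names and the sample data are all hypothetical.

```python
def enrich(record, registry, key="tax_id"):
    """Validate a record against an external registry and fill in missing fields.

    Returns the enriched copy and a list of fields whose values disagree.
    """
    external = registry.get(record.get(key))
    if external is None:
        return dict(record), []            # nothing to compare against

    enriched = dict(record)
    conflicts = []
    for field, value in external.items():
        if enriched.get(field) in (None, ""):
            enriched[field] = value        # enrichment: fill a gap
        elif enriched[field] != value:
            conflicts.append(field)        # validation: flag a mismatch for review
    return enriched, conflicts


# Hypothetical internal record and third-party registry entry.
record = {"tax_id": "T-001", "name": "Acme Industries", "city": "", "postcode": "10001"}
registry = {
    "T-001": {
        "name": "Acme Industries",
        "city": "Springfield",
        "postcode": "10002",
        "incorporated": "2009",
    }
}

enriched, conflicts = enrich(record, registry)
```

    Conflicting values are reported rather than overwritten, since the sketch makes no assumption about which source is more trustworthy.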

    Algorithms and Methods for improving data quality

    Using proven methods and algorithms for AI implementation is key to achieving a high degree of success in improving data quality. The following are some of the popular algorithms that can be used:

    • Random Forest

    • Support vector machine (SVM)

    Random Forest

    Random forest is a flexible supervised machine learning algorithm that produces reliable results. It is one of the most widely used algorithms because of its simplicity and the variety of problems it solves, and it can be used for both regression and classification. It builds multiple decision trees and merges their outputs to get a more accurate and stable prediction.

    How it works

    Random forest draws a number of random subsets from a given dataset and builds a decision tree on each subset. Combining the trees improves the accuracy of the prediction: for classification, the algorithm takes the prediction from each tree and outputs the class with the majority of votes; for regression, it averages the trees' predictions.
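
    The sampling and voting idea can be sketched in plain Python as below. To stay short, each "tree" is a single-split stump on one randomly chosen feature, whereas real random forests grow deeper trees and consider a random subset of features at every split. The feature meanings (share of missing fields, number of format violations), the labels and the data are assumptions invented for illustration.

```python
import random
from collections import Counter


def train_stump(rows, labels, feature):
    """Fit a one-split decision tree (a stump) on a single feature."""
    best = None
    values = sorted(set(row[feature] for row in rows))
    # Candidate thresholds are midpoints between consecutive distinct values.
    for lo, hi in zip(values, values[1:]):
        threshold = (lo + hi) / 2
        left = [y for row, y in zip(rows, labels) if row[feature] <= threshold]
        right = [y for row, y in zip(rows, labels) if row[feature] > threshold]
        left_label = Counter(left).most_common(1)[0][0]
        right_label = Counter(right).most_common(1)[0][0]
        errors = sum(y != left_label for y in left) + sum(y != right_label for y in right)
        if best is None or errors < best[0]:
            best = (errors, threshold, left_label, right_label)
    if best is None:
        # Only one distinct value in the sample: always predict the majority label.
        majority = Counter(labels).most_common(1)[0][0]
        return (feature, float("inf"), majority, majority)
    _, threshold, left_label, right_label = best
    return (feature, threshold, left_label, right_label)


def train_forest(rows, labels, n_trees=25, seed=0):
    """Train each stump on a bootstrap sample and a randomly chosen feature."""
    rng = random.Random(seed)
    forest = []
    for _ in range(n_trees):
        idx = [rng.randrange(len(rows)) for _ in rows]   # sample with replacement
        feature = rng.randrange(len(rows[0]))
        forest.append(
            train_stump([rows[i] for i in idx], [labels[i] for i in idx], feature)
        )
    return forest


def predict(forest, row):
    """Majority vote over the predictions of all trees."""
    votes = Counter(
        left if row[feature] <= threshold else right
        for feature, threshold, left, right in forest
    )
    return votes.most_common(1)[0][0]


# Hypothetical records: (share of missing fields, number of format violations).
rows = [(0.0, 0), (0.1, 1), (0.05, 0), (0.1, 0), (0.0, 1), (0.05, 1),
        (0.6, 5), (0.7, 6), (0.8, 4), (0.9, 7), (0.6, 6), (0.75, 5)]
labels = ["clean"] * 6 + ["suspect"] * 6

forest = train_forest(rows, labels)
```

    Calling `predict(forest, (0.85, 6))` then returns the label that most stumps vote for. In practice a library implementation would normally be used instead of hand-written trees.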

    Advantages of random forest

    • It can be used for both classification and regression tasks.

    • It typically takes less training time than many other algorithms.

    • It can maintain reasonable accuracy even when a substantial part of the data is missing.

    Support Vector Machine (SVM)

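    SVM is a supervised machine learning algorithm. Even though it can be used for both classification and regression, it is primarily used in classification tasks. It finds the boundary (hyperplane) that separates the classes in the training data with the widest margin and then uses that boundary to classify unseen data.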
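
    A minimal sketch of a linear SVM is shown below, trained by sub-gradient descent on the regularised hinge loss. It is one simple way to fit such a model, under stated assumptions: the function names, the learning rate `lr`, the regularisation strength `lam` and the two centred features per record are all hypothetical, with +1 standing for a suspect record and -1 for a clean one.

```python
def train_linear_svm(points, labels, lr=0.1, lam=0.01, epochs=100):
    """Fit weights w and bias b by sub-gradient descent on the hinge loss."""
    w = [0.0] * len(points[0])
    b = 0.0
    for _ in range(epochs):
        for x, y in zip(points, labels):
            margin = y * (sum(wi * xi for wi, xi in zip(w, x)) + b)
            # The regularisation term always shrinks w slightly.
            w = [wi * (1 - lr * lam) for wi in w]
            if margin < 1:
                # Point is inside the margin or misclassified: move the boundary.
                w = [wi + lr * y * xi for wi, xi in zip(w, x)]
                b += lr * y
    return w, b


def classify(w, b, x):
    """Return +1 or -1 depending on which side of the boundary x falls."""
    return 1 if sum(wi * xi for wi, xi in zip(w, x)) + b >= 0 else -1


# Hypothetical, already centred features for six records.
points = [(-2, -1), (-1, -2), (-1, -1), (2, 1), (1, 2), (1, 1)]
labels = [-1, -1, -1, 1, 1, 1]

w, b = train_linear_svm(points, labels)
```

    Only points that fall inside the margin trigger an update, so the final boundary is determined by the records closest to it. Unseen records are then labelled with `classify(w, b, x)`.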

    Highlights of SVM

    • SVMs can categorise text and hypertext in both inductive and transductive settings.

    • It uses training data to classify documents into different categories. A score is generated for each category, and the document is assigned to the category with the highest score.

    • SVMs are widely used to recognise handwritten characters, for example when validating signatures on important documents.

    • SVMs are used to classify genes and to address other biological problems. In recent years, they have also been used to detect remote protein homology.

    • For image classification, SVMs provide higher search accuracy than traditional query-based searching techniques.

    Advantages of SVM

    • Calculation simplification: the algorithm makes prediction and calculation simple since the results can be presented as a graphic image.

    • Efficient data generation.

    End Note

    Artificial intelligence and machine learning are expected to change the business world, now and in the future. Businesses using AI are getting better at predictive tasks such as determining the preferences of different customers, with prediction results based on the information fed to the system. These developments are likely to affect many sectors, including banking, the stock market, e-commerce, learning, health care and manufacturing. The overall effect of implementing AI in businesses should be increased productivity, better customer experience, improved decision making and timely planning.