Data profiling – what it is and why it is important

M. Subramaniam is from Tamil Nadu. Subramaniam is his first name, and the “M” stands for his father’s name, Marimuthu. All his government documents, including his driving license, Aadhaar card, diploma and graduation certificate, were therefore issued in the name of M. Subramaniam. A PAN card application, however, does not allow initials in the first name, and the last name (surname) is compulsory while the first and middle names are not. As a result, M. Subramaniam now has a dual identity: a PAN card in the name Subramaniam Marimuthu and an Aadhaar card in the name M. Subramaniam.

This duplication of identity records, brought to light by the mandatory linking of PAN with Aadhaar for filing income tax returns, has created hassles for many people, especially those in the south of the country. Unlike other parts of the country, where people use surnames, most people in the South use initials before or after their names rather than surnames. With the help of data profiling, however, companies can now trace and remove such duplicates to improve data quality and enhance business intelligence, thereby enabling a better customer experience and higher profitability.

Data profiling is the systematic analysis of the content of a data source. It discovers anomalies in data by reviewing the actual values recorded, rather than just filtering the data with traditional standard procedures.
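
To make "reviewing the values" concrete, here is a minimal Python sketch of the simplest kind of profile: counting rows, missing values and distinct values in one column. The function name `profile_column`, the rule that blank strings count as missing, and the sample city names are all assumptions made for illustration. They are not taken from any particular profiling tool.

```python
from collections import Counter

def profile_column(values):
    """Summarize one column: row count, missing values, distinct values, top value."""
    # Assumption for this sketch: None and blank strings both count as missing.
    present = [v for v in values if v is not None and str(v).strip() != ""]
    counts = Counter(present)
    return {
        "rows": len(values),
        "missing": len(values) - len(present),
        "distinct": len(counts),
        # (value, frequency) of the most frequent value, or None for an empty column
        "most_common": counts.most_common(1)[0] if counts else None,
    }

# Hypothetical sample data: note the blank entry, the None and the lowercase variant.
cities = ["Chennai", "Madurai", "", "Chennai", None, "chennai"]
summary = profile_column(cities)
print(summary)
```

Even this small summary surfaces two typical findings: values that are missing, and values such as "Chennai" and "chennai" that are counted as distinct although they probably mean the same thing.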

Traditional data profiling methods relied on skilled technical staff who queried the data source manually, using languages such as Structured Query Language (SQL). This often led to a disconnect between the analyst who knew how the data was to be presented and the technical programmer who ran the query.

However, organizations are abandoning manual methods in favor of automated data profiling tools that take much of the guesswork out of finding and identifying problems in data. With the help of user-friendly interfaces, business and technology professionals can now sit at the same table for a discussion and reduce project risk by quickly identifying and addressing potential data issues. According to industry estimates, data profiling tools have reduced the time required for data profiling from approximately 3-5 hours per attribute to 15-30 minutes per attribute.

Data profiling tools provide a common repository for storing data profile results and other key metadata, such as notes made during analysis. The information is centralized, so the entire team can share and leverage it. Data profiling itself can be broken down into the following categories:

  1. Column Profiling – The values of the data are analyzed within each column or attribute to discover the true metadata and uncover data content quality problems. This type of profiling may be used in profiling a column of phone numbers with different format patterns such as [919]999—00-0000, [919]999000000, or 91-99-990-00000.
  2. Custom Profiling – Data is analyzed in a fashion that is meaningful to an organization. For example, a credit card company can use data profiling to build customer profiles and design customized products for specific customers. Data profiling can also provide a single view of the customer and of the full range of that customer’s transactions and payment behavior, thereby enabling the company to monitor its overall risk portfolio and enhance the customer’s credit limit.
  3. Dependency Profiling – Each attribute is compared with every other attribute within a table to look for dependency relationships. For example, e-commerce retailers depend on third-party suppliers to transport their products to customers in different locations. Data profiling can be used to link supplier details with the items each supplier transports, which increases efficiency.
  4. Security Profiling – This determines who (or which roles) has access to a particular set of data and what they are authorized to do with it (add, update, delete, etc.). For example, in a company, the HR department may have access to add and update records, whereas the authority to add approval roles may rest with the IT department. Data profiling facilitates coordination between the two departments to allow for smooth delivery of the processes.
  5. Redundancy Profiling – Data in different tables is compared to determine which attributes contain overlapping or identical sets of values, thereby helping to protect the integrity of the data. A good example is the problem with the mandatory linking of PAN and Aadhaar cards discussed above.
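
Two of the categories above, column profiling and redundancy profiling, lend themselves to a short Python sketch. The first function reduces each value to its format pattern (digits become 9, letters become A) so that competing phone number formats in a column can be counted. The second measures how much the distinct values of two columns overlap. The function names, the overlap measure and all sample values are hypothetical and chosen for illustration only. Real profiling tools may define patterns and overlap differently.

```python
import re
from collections import Counter

def value_pattern(value):
    """Replace every digit with 9 and every ASCII letter with A; keep punctuation."""
    pattern = re.sub(r"[0-9]", "9", value)
    return re.sub(r"[A-Za-z]", "A", pattern)

def pattern_frequencies(values):
    """Column profiling: count how often each format pattern occurs in a column."""
    return Counter(value_pattern(v) for v in values)

def overlap_ratio(column_a, column_b):
    """Redundancy profiling: share of the smaller column's distinct values
    that also appear in the other column (0.0 if either column is empty)."""
    set_a, set_b = set(column_a), set(column_b)
    if not set_a or not set_b:
        return 0.0
    return len(set_a & set_b) / min(len(set_a), len(set_b))

# Hypothetical phone numbers written in three different formats.
phones = ["919-555-0100", "9195550101", "919-555-0102", "(919) 555 0103"]
freq = pattern_frequencies(phones)
for pattern, count in freq.most_common():
    print(pattern, count)

# Hypothetical customer IDs from two different tables.
crm_ids = ["C001", "C002", "C003", "C004"]
billing_ids = ["C002", "C003", "C005"]
ratio = overlap_ratio(crm_ids, billing_ids)
print(round(ratio, 2))
```

A column with several patterns usually needs a standard format. Two columns with a high overlap ratio are candidates for holding the same attribute under different names, which is the kind of redundancy a profiling exercise would flag for review.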

The data profiling process may seem less than glamorous at times, but it is an important step which adds value to any organization. The first step in adopting a comprehensive data profiling program is to realize the importance and value it ultimately provides. It should be of interest to all teams within the project – the technical team should have an understanding of the database system’s core data to make sound technical decisions and the management team should use the conclusions drawn from the data profiling to help steer the project in the right direction.

Understanding which data is available, which is missing and which is required helps an organization map out future strategies and data capture methods, which ultimately leads to further improvements in overall customer engagement and intelligence.