By Paul Burgess
The data source is a typical list of people: name, email address, post code and company. Within that list, the data is presented in non-standard formats, there are clearly duplicate rows for the same person, and some fields in those rows are missing. The task is to standardise the fields and leave one unique entry per person, with each field populated with the latest and most relevant data.
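To make the problem concrete, a hypothetical handful of rows of this kind (the values below are illustrative, not the actual source data) might look as follows in Python:

```python
# Hypothetical sample rows illustrating the kind of messy input described:
# non-standard formats, duplicate rows for the same person, missing fields.
rows = [
    {"Full_Name": "sue mcscott",   "Email": "Sue.McScott@ACME.COM", "Post_Code": "ab1 2cd", "Company": ""},
    {"Full_Name": "Sue McScott",   "Email": "sue.mcscott@acme.com", "Post_Code": "AB1 2CD", "Company": "Acme Ltd"},
    {"Full_Name": "Sue  Mcscott ", "Email": "",                     "Post_Code": "AB12CD",  "Company": "Acme Ltd"},
    {"Full_Name": "John Smith",    "Email": "j.smith@example.org",  "Post_Code": "ef3 4gh", "Company": "Example Co"},
]
print(len(rows))
```

The first three rows all describe the same person, but no two are byte-for-byte identical, which is exactly what the transforms below resolve.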
The first step is to load the data to SAS Cloud Analytic Services (CAS). This is done via SAS Data Explorer (opened via the Manage Data application), using a data connection or by loading a local file. Once in CAS, the data can be ‘prepared’ via SAS Data Studio – Prepare Data, where a new plan is created.
This application enables task development in a ‘no code’ environment: the task is built logically as a series of transforms, making use of ‘Suggestions’ offered by the application.
This particular task uses Data Quality transforms. At the heart of these is the Quality Knowledge Base (QKB): a collection of rules and reference data that allows data quality tasks to be performed.
1. Data Quality Transform – Standardize
Use QKB definitions to standardise Name, Email and Post Code fields.
The resulting output has these fields with consistent capitalisation, spacing, etc.:
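The QKB standardize definitions themselves are proprietary, but a simplified Python analogue of the kind of capitalisation and spacing fixes they apply might look like this (the post code rule below is an assumption based on common UK layouts, not the actual QKB rule):

```python
def standardise_name(name: str) -> str:
    # Collapse repeated whitespace and apply title case. (A real QKB
    # definition also handles cases like "McScott"; this sketch does not.)
    return " ".join(name.split()).title()

def standardise_email(email: str) -> str:
    # Email addresses are conventionally compared lower-case.
    return email.strip().lower()

def standardise_post_code(code: str) -> str:
    # Upper-case, then insert a single space before the final three
    # characters -- a common UK post code layout (an assumption here).
    compact = code.replace(" ", "").upper()
    return compact[:-3] + " " + compact[-3:] if len(compact) > 3 else compact

print(standardise_name("sue  mcscott"))            # Sue Mcscott
print(standardise_email(" Sue.McScott@ACME.COM ")) # sue.mcscott@acme.com
print(standardise_post_code("ab12cd"))             # AB1 2CD
```

The point of the Standardize transform is that none of this logic needs to be written by hand; the QKB definition is simply selected per field.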
2. Data Quality Transform – Matchcodes
We have duplicate rows in the data, with similar but not identical names. The first step in entity resolution is to create matchcodes for the Full_Name field. Matchcodes are generated via the QKB using ‘fuzzy logic’ and enable entities to be matched despite minor variations in the data. Generate matchcodes for the Full_Name field using the Name matchcode definition, with a mid-range sensitivity of 85.
Notice, for example, that the name Sue McScott now appears in several rows with slight variations, but each of those variations has the same matchcode.
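The actual matchcode algorithm and its sensitivity settings live inside the QKB, but the idea can be sketched with a crude stand-in: reduce each name to a key that is insensitive to case, spacing and punctuation, so variant spellings collapse together:

```python
import re

def matchcode(name: str) -> str:
    # Crude stand-in for a QKB matchcode (NOT the real algorithm):
    # upper-case, strip punctuation, then drop vowels after the first
    # character of each word, so minor variations yield the same key.
    words = re.sub(r"[^A-Za-z ]", "", name).upper().split()
    return "".join(w[0] + re.sub(r"[AEIOU]", "", w[1:]) for w in words)

# Slight variations of the same name produce the same code.
print(matchcode("Sue McScott"))   # SMCSCTT
print(matchcode("sue  mcscott"))  # SMCSCTT
print(matchcode("Sue Mc-Scott"))  # SMCSCTT
```

A real QKB definition is far more sophisticated (handling phonetics, nicknames and locale), and the sensitivity setting controls how aggressively variations are collapsed.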
3. Data Quality Transform – Match and Cluster
The second step in entity resolution is to match and cluster rows.
The rows are clustered using two rules, each combining the matchcode with another field. Note that within a rule the AND operator is applied, and between rules the OR operator.
A Cluster ID field is added, and based on the defined rules the 16 rows of data can now be clustered into just 5.
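The rule logic — AND within a rule, OR between rules — amounts to a connected-components problem: two rows join the same cluster if any rule links them, directly or transitively. A minimal union-find sketch of this (field names and sample values are illustrative):

```python
def cluster(rows):
    # Union-find over row indices.
    parent = list(range(len(rows)))
    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i
    def union(i, j):
        parent[find(i)] = find(j)

    # Each rule ANDs the matchcode with another field; rows merge if ANY
    # rule matches (the OR between rules). Empty fields never match.
    rules = [("mc", "email"), ("mc", "post_code")]
    for mc_field, other in rules:
        seen = {}
        for i, r in enumerate(rows):
            if not r[mc_field] or not r[other]:
                continue
            key = (r[mc_field], r[other])
            if key in seen:
                union(i, seen[key])
            else:
                seen[key] = i

    # Assign a sequential Cluster ID per connected component.
    ids = {}
    return [ids.setdefault(find(i), len(ids) + 1) for i in range(len(rows))]

rows = [
    {"mc": "SMCSCTT", "email": "sue@acme.com", "post_code": "AB1 2CD"},
    {"mc": "SMCSCTT", "email": "sue@acme.com", "post_code": ""},
    {"mc": "SMCSCTT", "email": "",             "post_code": "AB1 2CD"},
    {"mc": "JSMTH",   "email": "j@x.org",      "post_code": "EF3 4GH"},
]
print(cluster(rows))  # [1, 1, 1, 2]
```

Note that the second and third rows never match each other directly, yet both match the first row (one by email, one by post code), so all three land in the same cluster.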
4. Custom transform – Code
The next step is to select the surviving record from each cluster. This is done using a simple data step, via a custom code transform.
The data and set statements are built automatically to reflect that the data is distributed. The data step then uses by-group processing and a last.Cluster_ID statement to select just one row from each cluster. Note that in CAS, even though we are using by-group processing, there is no requirement to sort the data first. The ID field has been added to the by statement to ensure that the highest value of ID in each cluster is output. A temporary variable, ‘Company_New’, and a retain statement ensure that each output row is populated with a non-missing ‘Company’ value. The output table looks like:
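The survivorship logic of that data step — carry the last non-missing Company forward through each by-group and output only the last row (highest ID) of each cluster — can be sketched in Python (this is an illustrative equivalent, not the generated CAS code):

```python
def survive(rows):
    # Process rows by Cluster_ID in ascending ID order; "retain" the last
    # non-missing Company seen, and output only the last row per cluster.
    rows = sorted(rows, key=lambda r: (r["Cluster_ID"], r["ID"]))
    out = []
    company_new = ""  # plays the role of the retained Company_New variable
    for i, r in enumerate(rows):
        if r["Company"]:
            company_new = r["Company"]
        last_in_cluster = (i == len(rows) - 1
                           or rows[i + 1]["Cluster_ID"] != r["Cluster_ID"])
        if last_in_cluster:          # the last.Cluster_ID condition
            out.append({**r, "Company": company_new})
            company_new = ""         # reset for the next cluster
    return out

rows = [
    {"ID": 1, "Cluster_ID": 1, "Company": "Acme Ltd"},
    {"ID": 2, "Cluster_ID": 1, "Company": ""},
    {"ID": 3, "Cluster_ID": 2, "Company": "Example Co"},
]
print(survive(rows))
```

The surviving row for cluster 1 is ID 2 (the highest ID), but its missing Company has been back-filled from ID 1 — exactly the effect of the retain statement in the data step.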
The data has been entity resolved to 5 rows. The remaining transforms are to tidy it up.
5. Manage Columns
Use the Manage Columns transform to select which columns to keep.
6. Rename Columns
Use the Rename columns transform to rename fields as required; simply rename column ‘Company_New’ to ‘Company’.
7. Manage Columns
Repeat the Manage Columns transform to set the order and format for the columns.
(Note: steps 5, 6 and 7 could be combined, but they have been separated here because combining them resulted in an error.)
8. Add a Unique Identifier
The final step is to use the Unique Identifier transform to add a UID.
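The effect of the transform is simply to append an identifier column; a minimal equivalent with a sequential running number (used here purely for illustration) would be:

```python
rows = [{"Full_Name": "Sue Mcscott"}, {"Full_Name": "John Smith"}]

# Append a sequential UID column to each surviving row.
for uid, row in enumerate(rows, start=1):
    row["UID"] = uid

print([r["UID"] for r in rows])  # [1, 2]
```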
The final table is:
Cleaned and entity resolved from 16 to 5 rows.
The steps created are saved into a data plan:
The plan and the prepared data can be saved. A ‘Job’ can be created from the plan and scheduled to be run via Environment Manager, enabling the data to be loaded and cleaned on a periodic basis.
This tip has illustrated the use of Data Quality tools within SAS Viya to prepare data. A simple, intuitive method is used to build a plan using existing transforms and a custom data step. No knowledge of Data Management Studio/Server is required, and very little coding knowledge.
If you want to learn more about SAS Viya our expert tips and tricks can be found here or for a SAS training course that meets your exact skills gap click here.