By Paul Burgess
With the proliferation of data within organisations, the need for data governance becomes increasingly important. SAS Information Catalog is an application at the heart of a data governance program. It helps to ensure that data is:
This blog describes its features and capabilities, helping you to get the most out of it.
SAS data assets are searched for and catalogued via Discovery Agents. These are typically scheduled by an Administrator to run overnight, to ensure that users see the latest catalogued data. Once the data has been catalogued, searching for data assets within SAS Information Catalog is both simple and sophisticated. This blog assumes that the desired data has been catalogued, found and opened within SAS Information Catalog, and is ready for discovery. The starting point is the screen below:
Opening the DEMOG2 data presents an Overview screen. The Overview tab first highlights whether the data contains private data.
Drill down via the information icon to see the fields potentially containing PI data. The fields are classified into sensitive, private and candidate.
The next boxes summarise the data. What’s displayed depends on the columns within the data; in this case we have:
The header gives basic statistics for the data.
The overview also includes a further key field: the Status. The Status would typically be set by a Data Manager, after reviewing the data in this application. For this data, it is set to ‘Warning’ and a comment has been added to say why. Other dropdown values are shown.
So, if the Status is set to ‘Approved’ we can be confident that the data has been reviewed and no issues found. If not set to Approved, then be wary!
On the right-hand side of the screen, alongside the properties of the data, Contacts and Tags can be set. Contacts can be added, for example, to show who owns or reviews the data. Tags allow keywords to be set, which can then be used to search for data assets.
The Overview Window has given us a great initial insight into our data. It has presented top level metrics, highlighted any PI issues and given a top-level description of the data. It has also allowed us to set the status of data, allocate contacts and set tags. Now to drill down further.
The Column Analysis tab is further split into Descriptive Measures, Metadata Measures and Data Quality Measures screens. Let’s look at these in turn.
Two very useful features here show outliers and the correlation between fields. The field ‘age’ contains outliers, as indicated by the symbol encircled in blue in the image above. The scale of the box plot on the right of the screen indicates that the outliers are at the high end of the age range. (Later we will see how to drill down and see these outliers clearly.)
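To make the idea of a box-plot outlier concrete, here is a minimal Python sketch. It assumes the common 1.5 × IQR (Tukey) rule; this blog does not describe how SAS Information Catalog itself flags outliers, so treat the rule, the function name `iqr_outliers` and the made-up ages as illustrative assumptions only.

```python
import statistics


def iqr_outliers(values, k=1.5):
    """Return values lying more than k * IQR outside the quartiles.

    Assumption: the conventional Tukey box-plot rule, not necessarily
    the rule used by SAS Information Catalog.
    """
    q1, _median, q3 = statistics.quantiles(values, n=4)
    iqr = q3 - q1
    low, high = q1 - k * iqr, q3 + k * iqr
    return [v for v in values if v < low or v > high]


# Hypothetical 'age' values with two implausibly high entries.
ages = [23, 31, 35, 38, 41, 44, 47, 52, 58, 119, 123]
print(iqr_outliers(ages))
```

With these sample values the two ages above 100 fall beyond the upper fence, mirroring the high-end outliers visible in the box plot.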
If we highlight the ‘height’ field, the correlation symbol (encircled in the image below) next to the ‘height2’ and ‘height3’ fields indicates that ‘height’ is correlated with them. Hovering over the symbol shows the degree of correlation. In this case ‘height’ and ‘height3’ are perfectly correlated (correlation coefficient equals 1), and the fact that their means are also the same suggests that the fields are identical. The Data Manager could highlight this, potentially leading to the removal of one of these fields.
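The reasoning about correlated fields can be sketched in a few lines of Python. This assumes the coefficient shown is the Pearson correlation, which the blog does not state explicitly; the function names and the sample heights are hypothetical. The sketch also shows why a coefficient of 1 with equal means is a strong hint rather than proof of duplication, so a direct value comparison is worth doing before a field is dropped.

```python
from math import isclose, sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (spread_x * spread_y)


def is_duplicate(xs, ys):
    """Direct check: every value matches, row by row."""
    return len(xs) == len(ys) and all(x == y for x, y in zip(xs, ys))


# Hypothetical sample values.
height = [150.0, 160.0, 170.0, 180.0, 190.0]
height3 = [150.0, 160.0, 170.0, 180.0, 190.0]    # a true copy of height
stretched = [130.0, 150.0, 170.0, 190.0, 210.0]  # same mean, twice the spread

for name, other in [("height3", height3), ("stretched", stretched)]:
    r = pearson(height, other)
    print(name, isclose(r, 1.0), is_duplicate(height, other))
```

Both comparison fields correlate perfectly with ‘height’ and share its mean, but only the true copy passes the row-by-row check.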
Drilling down to individual columns reveals further in-depth statistics, for example for the ‘age’ field. Here, the frequency histogram more clearly shows the outliers. Also note, the field has been classified as:
For non-numeric fields, different statistics and plots are presented. For the ‘town’ field, we now see histograms of both value and pattern frequency (the pattern is a simple representation of a string’s character pattern). Note that this field has a semantic type of CITY, and is a CANDIDATE for Information Privacy (i.e. this may be deemed PI data). A further metric here is ‘Mismatched’. This compares the values to the field’s actual type; in this case all values are strings, hence 100% matching.
This screen summarises the metadata values for each field. It is very useful for identifying which fields are Primary Key candidates or contain PI data.
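As a rough illustration of what makes a field a Primary Key candidate, the Python sketch below treats a column as a candidate when it has no missing values and no repeats. This is an assumed working definition; the blog does not say which criteria SAS Information Catalog applies, and the function name and sample columns are hypothetical.

```python
def is_primary_key_candidate(values):
    """True if no value is missing and no value repeats.

    Assumption: a simple 'complete and unique' test, not the
    catalog's own criteria.
    """
    if any(v is None or v == "" for v in values):
        return False
    return len(set(values)) == len(values)


# Hypothetical columns from a small table.
columns = {
    "id": [101, 102, 103, 104],
    "town": ["Leeds", "York", "Leeds", "Bath"],            # repeated value
    "email": ["a@example.com", None, "c@example.com", "d@example.com"],  # missing value
}

candidates = [name for name, vals in columns.items() if is_primary_key_candidate(vals)]
print(candidates)
```

Only the ‘id’ column qualifies here: ‘town’ repeats a value and ‘email’ has a gap.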
This screen presents the data quality metrics for each field. Included are the Pattern Count (the number of unique word or character patterns for the field) and the Semantic Type. The Pattern Count is useful if, for example, the field should have a single fixed pattern, such as an NI number. A count greater than 1 highlights potential inconsistencies in the data. The Semantic Type is a classification of the data derived from both the field name and its values. Typical semantic types are illustrated in the example below.
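The Pattern Count idea can be sketched in Python as follows. The pattern notation used here (‘A’ for an upper-case letter, ‘a’ for lower-case, ‘9’ for a digit, other characters kept as they are) is an assumption for illustration and may differ from the notation SAS Information Catalog displays; the NI-style values are made up.

```python
from collections import Counter


def char_pattern(value):
    """Reduce a string to its character pattern.

    Assumed notation: 'A' upper-case letter, 'a' lower-case letter,
    '9' digit; anything else (spaces, punctuation) is kept unchanged.
    """
    out = []
    for ch in value:
        if ch.isdigit():
            out.append("9")
        elif ch.isalpha():
            out.append("A" if ch.isupper() else "a")
        else:
            out.append(ch)
    return "".join(out)


# Made-up NI-style values; one is entered in a different format.
ni_numbers = ["QQ123456C", "AB654321D", "qq 12 34 56 c", "ZX987654A"]

pattern_freq = Counter(char_pattern(v) for v in ni_numbers)
pattern_count = len(pattern_freq)

print(pattern_freq)
print(pattern_count)
```

Three of the four values share the pattern AA999999A, while the differently formatted entry produces a second pattern, so the count rises above 1 and flags the inconsistency.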
The final tab provides a listing of the first few rows of data (100 by default).
Here we’ve given an overview of some of the key features of the SAS Information Catalog application. This tool is extremely easy to use and provides a wealth of information. Data Managers can explore data and highlight issues to be fixed and PI data to be hidden, setting the Status of the data appropriately. Data users can then explore data, and ensure they are using the correct data for their needs and have the confidence that the data is of high quality and trustworthy.
Want to explore your data in your SAS Viya environment using a GUI? Need to know what sensitive data you have? Try SAS Information Catalog in SAS Viya. Want to know more or have a question? Connect with Katalyze Data and we will be glad to help.