By Paul Burgess
The first paper in this series looked at using SAS Information Catalog to discover data, describing how we use the application to learn about our data. However, it skipped over how to make data discoverable, how to run the Discovery Agents that catalog the data and how to search for data assets. This paper fills that gap and illustrates the techniques. Making data discoverable, and running and scheduling Discovery Agents, are likely to be tasks for SAS Administrators. Searching for data is something all users will need to do.
To catalog data we need to run a Discovery Agent. This is run against a particular library. By default, just the ‘out of the box’ libraries are listed for the Discovery Agent:
This includes both CAS libraries, e.g., ‘Samples’, and compute server libraries, e.g., ‘SASHELP’. Custom libraries used within your organization can be added to this list as detailed below.
Two ways to make compute server libraries visible to the Discovery Agent are illustrated. The first is via the ‘New Library Connection’ wizard from SAS Studio. Open the wizard using the encircled icon below, on the Libraries page. This allows you to create a library to connect to data from a wide variety of sources via SAS/ACCESS, plus SAS Base Engine libraries. Configure your library as required and ensure that the ‘Assign and connect to data sources at startup’ box is ticked. Doing so will ensure the library is available to the Discovery Agent.
The second method is to edit the compute server autoexec, via Environment Manager:
2. Enter ‘compute’ in the filter window, and select ‘Compute Service’
3. In the Compute Service window, search for autoexec:
4. Click the edit button and paste in your libname statements, and save:
Illustrated here are libname statements to SAS datasets. Other types of libraries can be made discoverable by SAS Information Catalog by editing the autoexec. See the SAS Information Catalog: Administrator's Guide for details.
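As a minimal sketch of the kind of libname statements that might be pasted into the autoexec, assuming Base engine libraries that point at folders of SAS datasets. The library names and paths below are hypothetical placeholders, not the ones used in this environment:

```sas
/* Hypothetical library names and paths: replace with your own.     */
/* Each statement assigns a Base engine library to a folder of SAS  */
/* datasets, so the library is available when a compute session     */
/* starts.                                                           */
libname raw     "/data/raw";
libname staging "/data/staging";
```

The folders need to be readable by the account that runs the compute server, otherwise the library will not be assigned.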
Having carried out either of these methods, the data is now available to the Discovery Agent:
Only global scope CAS data can be discovered. An example of making your data global scope via SAS Studio is illustrated here:
4. Set your caslib (casraw) to be the active caslib.
2. Load your data to CAS:
Note that both the CAS library and the data itself need to be global scope to be picked up by the Discovery Agent. In this case the cas_demog data has been loaded to CAS and promoted to global scope, so it can be discovered. However, cas_customer has only been loaded and so has session scope by default; it cannot be seen by the Discovery Agent.
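The steps above could be sketched in SAS Studio roughly as follows. This is an illustration under stated assumptions rather than the exact code used here: the session name, the caslib path and the source tables work.demog and work.customer are hypothetical, and creating a global caslib requires the appropriate permissions.

```sas
/* Start a CAS session (the session name is hypothetical). */
cas mysess;

/* Create a global scope caslib; the path is a placeholder. */
caslib casraw datasource=(srctype="path") path="/data/casraw" global;

/* Make casraw the active caslib for this session. */
cas mysess sessopts=(caslib="casraw");

proc casutil;
   /* Load and promote: the table gets global scope and can be discovered. */
   load data=work.demog outcaslib="casraw" casout="cas_demog" promote;

   /* Load only: the table has session scope by default and is not         */
   /* visible to the Discovery Agent.                                       */
   load data=work.customer outcaslib="casraw" casout="cas_customer";
quit;
```

The only difference between the two load statements is the promote option, which is what moves a table from session scope to global scope.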
The CASRAW library can now be selected for discovery.
Once the Discovery Agent has been run for the CASRAW library, the CAS_DEMOG data is available in SAS Information Catalog, but the CAS_CUSTOMER data is not.
A note on licensing: if only SAS Information Catalog is licensed, then only CAS libraries can be cataloged. To see compute server libraries as well, a SAS Information Governance licence is required. Depending on your SAS Viya software bundle, the latter may already be included.
Discovery Agents crawl through the data in a library and ingest, analyse and enhance metadata from the data assets. You can see existing Discovery Agents and set up new ones by highlighting the Discovery Agent icon. (Note, only assigned Administrator users will see this icon).
When selecting ‘New Discovery Agent’ the list of available libraries is displayed. If your library is not listed then it is either because a Discovery Agent already exists for that library, or the library is not visible to the Discovery Agent (see Making Data Discoverable).
Select the library required and click ‘New Discovery Agent’. Now set the properties of the Discovery Agent:
The Discovery Locale specifies the language and country. The agent uses it to profile the data, so it can be important. For example, selecting United Kingdom will influence how a UK address or postcode field is profiled.
Other options available are ‘Configure’, to filter the data from the library to be cataloged (by default it is all the tables), and ‘History’, which shows the run history of this agent.
Selecting ‘Run Now’ displays the following window and runs the agent. This may take minutes or sometimes hours, depending on the size of the data in the library.
Once complete the data within the library is available to be discovered.
Having created a Discovery Agent, it can be scheduled to run at a given time. So, for example, if the data changes during the day, the Discovery Agent can be run overnight, so that the latest cataloged data is available in the morning. Open SAS Environment Manager, select ‘Jobs and Flows’ and select the Scheduling tab. (Note only Administrators will see this option):
The STAGING AGENT created above is listed and can be scheduled by right clicking and creating a trigger. For example:
This schedules the agent to run daily at 2AM, starting on 12th June, with no end date.
On the Jobs and Flows tab, the agent is now indicated as scheduled:
The Discovery Agent is now scheduled to run at 2AM each day, presenting users with up to date cataloged data as they start work.
Searching for data assets is both simple and sophisticated. There are two methods: free text search and faceted search.
Free text search is as simple as using an internet search engine. Some examples:
Searching for ‘customer’ returns the customer table and lists information about that table.
Searching for ‘cust’ also returns the customer table, because there is an implicit wildcard in the search.
Searching for ‘kustomar’ also returns the customer table, because free text search uses fuzzy matching.
Searching for ‘raw’ returns all the tables in the raw library.
Searching for ‘height’ returns all tables that contain a column with ‘height’ in the name. The tables are listed in order of relevance. Free text searches look at the library, table and column levels.
Faceted searches allow you to search for specific attributes of your data. You can search for multiple values and searches can be nested, which makes this a more sophisticated way of searching your data.
There are five default facets available: Name, DateModified, DateCreated, Status, Library. Some examples:
The first example lists all assets with name “DEMOG2”. Note the implicit wildcard used in Free Text searches is not present here.
The second example returns all assets modified between defined dates. A wizard helps to create the date range. Note that not only table assets are found; in this case a Studio Flow is also returned.
The third example searches for data with a particular status.
The fourth example searches for data in the library named ‘RAW’. Note the syntax for this search; library.name:"RAW" also works.
Some further examples of facet searches.
Data ‘DEMOG2’ was assigned a contact of ‘user_7’ and a tag of ‘Demographic’ via SAS Information Catalog. Both the contact name and the tag value can be searched for and the DEMOG2 data returned.
Tables with particular column attributes can be searched for. In this example columns assigned as ‘private’ and with semantic type ‘family name’ are searched:
This is just a flavour of the search facets available. There are numerous others, which are detailed in the SAS Information Catalog User Guide.
This paper has illustrated how to make data available to be cataloged, how to set up Discovery Agents to catalog the data and how to schedule these agents. These are all tasks typically performed by a SAS Administrator, but knowing the basics of how they are done is useful background for all users.
In addition, search methods for cataloged data have been illustrated: both the simple free text search and the more sophisticated faceted search. These are key to finding your data assets.
Want to explore further options? Stuck with Discovery Agents? Feel free to connect with Katalyze Data; we are glad to help.