Assessing and Optimising Dataset Efficiency

  • Data Science
  • Platform Admin
  • SAS
  • Performance
  • ETL
  • Data processing

By Paul Shannon, Senior Managed Service Consultant.

Background

Many analysts will be aware of limits within a dataset, such as the length of a character column and the maximum values stored in numeric columns. Determining the correct size for dataset columns can be an inexact science, and a “safe” approach is often taken to prevent data from various sources being truncated or ETL jobs failing. However, any inefficiency is multiplied by millions of rows once a dataset enters production use, which can carry a significant performance cost.

Introducing the Dataset Efficiency Utility

Katalyze Data have developed a utility for reviewing mature environments and processes. It can analyse any number of SAS datasets and the data stored within them, generating a report that highlights potential inefficiencies in those datasets.

In particular, the analysis focuses on the following areas:

  • Application-native compression setting.
  • Length of character and numeric fields, in comparison to the actual data stored within each column.
  • Size of date/time fields, versus best practice length recommendations.
  • Potential categorical variables stored as text – these could alternatively be stored in a coded form with the use of lookup tables or formats.
  • Columns with only one distinct value, which may be redundant.
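The logic behind several of these checks can be sketched in plain Python. This is illustrative only — the utility itself operates on SAS datasets and these function names are assumptions — but the numeric length limits used below match SAS's documented largest-exact-integer values for IEEE platforms:

```python
# Illustrative sketch of per-column efficiency checks (not the utility's
# actual code). Column values are assumed to be supplied as Python lists.

# Largest integer a SAS numeric of each byte length stores exactly
# (IEEE platforms): length L holds integers up to 2 ** (8 * L - 11),
# e.g. 8,192 for length 3 and 9,007,199,254,740,992 for length 8.
SAS_NUMERIC_MAX = {L: 2 ** (8 * L - 11) for L in range(3, 9)}

def min_char_length(values):
    """Smallest character length that holds every observed value."""
    return max((len(v) for v in values), default=1)

def min_numeric_length(values):
    """Smallest SAS numeric length that stores every observed integer exactly."""
    biggest = max((abs(v) for v in values), default=0)
    for length in range(3, 9):
        if biggest <= SAS_NUMERIC_MAX[length]:
            return length
    return 8

def is_constant(values):
    """Flag columns with a single distinct value as potentially redundant."""
    return len(set(values)) <= 1
```

For example, a character column defined as length 200 whose longest stored value is 12 characters, or a length-8 numeric holding small integer codes, would both be flagged for review.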


Results

The output of the analysis is a simple report designed to guide an analyst to make efficiency edits in two phases:

  • Phase 1: Optimise column lengths. This can usually be implemented without breaking changes to any other systems or processes.
  • Phase 2: Re-work column definitions, recoding categorical data and deleting redundant fields where suitable. These changes are far more likely to break downstream systems or processes.
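The two phases can be sketched as follows (hypothetical Python for illustration — in SAS itself, Phase 1 would typically be applied with a LENGTH statement and Phase 2 with a format or lookup table):

```python
def safe_to_shrink(values, new_length):
    """Phase 1 guard: shrinking a character column is non-breaking only
    if no stored value would be truncated at the new length."""
    return all(len(v) <= new_length for v in values)

def recode_categorical(values):
    """Phase 2 sketch: replace repeated text categories with small integer
    codes plus a lookup table (in SAS, a format or a join to a code table)."""
    lookup = {}
    codes = [lookup.setdefault(v, len(lookup) + 1) for v in values]
    return codes, {code: text for text, code in lookup.items()}
```

The Phase 1 guard is why length optimisation rarely breaks anything: the data itself is unchanged, only the unused padding is removed. Phase 2 changes the values consumers see, which is why it needs coordination with downstream systems.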

A high-level summary allows analysts to prioritise resources towards the libraries with the biggest potential saving, while a detailed breakdown shows exactly which changes are required.

Benefits

The immediate benefit is measured in the disk space consumed by a dataset. However, this leads naturally to other performance improvements.

  • Shorter column lengths reduce the memory used each time the dataset is processed. This is particularly beneficial when using the CAS in-memory analytics engine, or LASR tables for Visual Analytics.
  • A smaller dataset also requires less disk throughput each time it is read from or written to disk. Disk throughput is a common performance bottleneck, so any reduction in size can significantly reduce the runtimes of data processing jobs.
  • Smaller datasets can also lead to cost optimisations. For environments hosted in cloud tenancies, minimising the size of data can reduce storage and data transfer costs, depending on the cloud services being used. On-premises environments can also benefit, as data optimisation can defer the need for hardware expansion.
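A back-of-envelope estimate makes the scale concrete (illustrative Python with hypothetical figures): trimming even a few dozen bytes per row adds up quickly at production volumes.

```python
def estimated_saving_bytes(n_rows, bytes_saved_per_row):
    """Uncompressed saving from trimming each row, before any dataset
    or filesystem compression is applied."""
    return n_rows * bytes_saved_per_row

# e.g. shrinking columns by 50 bytes per row on a 10-million-row dataset
saving = estimated_saving_bytes(10_000_000, 50)
print(f"{saving / 1024 ** 2:.0f} MiB saved per copy of the dataset")
```

And that saving is realised on every copy, backup, and transfer of the dataset, not just the original.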

Learn More

Please contact us to book a consultation:

info@katalyzedata.com
https://katalyzedata.com
+44 (0)1993 848010
