By Michael Walshe
In this blog, we explore the process of mapping SAS formats to R for efficient data manipulation. While there is no single function or package in R that replicates SAS formats comprehensively, we can leverage different methods in R to achieve similar results.
SAS formats and informats are a swiss-army knife of data manipulation, providing in one package methods to:
Because of the sheer number of different use-cases for SAS formats, there is no single function or package in R which can provide the same results. However, this can be seen as a benefit, logically separating the different functionalities and making R easier to understand and use. In this article we’ll assess a few different methods for the first two use-cases.
A SAS format, typically a user-defined one, can be used as a look-up table – for example from a region code to the full name of that region. In R this could be achieved with:
Joins are a widespread method for performing some form of look-up, we will demonstrate using the base R merge function.
my_data <- data.frame( region_code = c("SA", "SA", "E", "AS", "AN", "NA"), measure = runif(6) ) lookup_df <- data.frame( region_code = c("E", "NA", "SA", "AS", "AF", "AU", "AN"), region_full = c( "Europe", "North America", "South America", "Asia", "Africa", "Australia", "Antartica" ) ) merge(my_data, lookup_df) #> region_code measure region_full #> 1 AN 0.9404673 Antartica #> 2 AS 0.8830174 Asia #> 3 E 0.4089769 Europe #> 4 NA 0.0455565 North America #> 5 SA 0.2875775 South America #> 6 SA 0.7883051 South America
Where you have a simple mapping of keys to values, a named vector can be a good method in R to move from one to the other. Note here that the keys should be the names of the vector, and the values are the elements. Then indexing using a key produces the value.
lookup_vector <- c( "E" = "Europe", "NA" = "North America", "SA" = "South America", "AS" = "Asia", "AF" = "Africa", "AU" = "Australia", "AN" = "Antartica" ) my_data$region_full <- lookup_vector[my_data$region_code]
Note that if the keys were numbers, we would have to create the named vector in two steps, first creating a vector of values then assigning to the names function. However this is not recommended, as it can be unclear what e.g. lookup_vector[3] should return.
dplyr::case_when() is a very useful way to create a new vector based on several conditions. It is similar to a case when expression in SQL. For a simple look-up, the conditions will be equality with our input vector. This also lets us map values not in our look-up table to some default.
my_data$region_full <- dplyr::case_when( my_data$region_code == "E" ~ "Europe", my_data$region_code == "NA" ~ "North America", my_data$region_code == "SA" ~ "South America", my_data$region_code == "AS" ~ "Asia", my_data$region_code == "AF" ~ "Africa", my_data$region_code == "AU" ~ "Australia", my_data$region_code == "AN" ~ "Antartica", .default = "Unknown" )
This is a little clunky and includes some repetition, but we can use the recently added case_match function to improve it, and at the same time move this into a function to let us easily re-use this mapping.
region_lookup <- function(x) { dplyr::case_match( x, "E" ~ "Europe", "NA" ~ "North America", "SA" ~ "South America", "AS" ~ "Asia", "AF" ~ "Africa", "AU" ~ "Australia", "AN" ~ "Antartica", .default = "Unknown" ) } region_lookup(my_data$region_code) #> [1] "South America" "South America" "Europe" "Asia" #> [5] "Antartica" "North America"
Binning is the process of grouping data in ranges of values. SAS user defined formats can be created to group numeric values into these bins. It’s used in a variety of situations, from analysis and visualisation to statistics. As such, there is a base R function cut that can provide most of the functionality we need.
The simplest behaviour is to provide the number of bins, R will create a factor with labels indicating the intervals.
ages <- c(59, 22, 20, 62, 30) cut(ages, breaks=3) #> [1] (48,62] (20,34] (20,34] (48,62] (20,34] #> Levels: (20,34] (34,48] (48,62]
We can also define custom intervals and labels, note that labels needs to be one element shorter than breaks.
cut( ages, breaks=c(0, 40, 60, 100), labels=c("Young", "Middle-Aged", "Old") ) #> [1] Middle-Aged Young Young Old Young #> Levels: Young Middle-Aged Old
There are many packages in R that provide a function to bin data for different use-cases, as one example we can again use dplyr::case_when(), this time to group data. This produces a character vector rather than a factor.
dplyr::case_when( ages <= 40 ~ "Young", ages <= 60 ~ "Middle-Aged", .default = "Old" ) #> [1] "Middle-Aged" "Young" "Young" "Old" "Young"
A useful package to mention is {fmtr}, which is designed to try and replicate the experience of SAS user-defined formats in R, including conditions and format catalogues. However, it also has limited popularity, and it is best practice to try to write idiomatic R rather than to replicate SAS exactly.
In this blog, we explored various methods for mapping SAS formats to R, enabling effective data manipulation. By leveraging techniques such as look-ups and binning, you can achieve similar functionality in R.