By Michael Walshe, Senior Consultant
Are you a SAS programmer picking up R? Have you struggled with missing values, out-of-memory errors, or how to switch your brain from SAS to R? Then this blog post is for you!
If you’re familiar with SAS, you know it’s an excellent language for high-performance analytics. However, many industries are exploring R alongside SAS for its open-source flexibility and expansive community, building on their existing statistical know-how. So for SAS veterans looking to add another weapon to their analytical arsenal, today we’re going to teach you how to navigate the world of R – highlighting the similarities, smoothing out the bumps, and turning you into a practised “useR” before you know it!
One thing to note here is what we will not cover – this will not be a complete introduction to R! I’m going to give an overview of R for the SAS programmer. For a detailed introduction to R from the fundamentals up, take a look at our R courses. If you’re looking for a course aimed at SAS programmers, then make an enquiry here.
R first appeared on the scene in 1993, although its syntax and style are heavily inspired by a predecessor, S, developed at Bell Labs in 1976. R was started by two professors at the University of Auckland to aid in using and teaching statistics, which has led to a common phrase: that R is “for statisticians, by statisticians”. Since then, there have been several major milestones, including the launch of CRAN in 1997, the release of R 1.0.0 in 2000, and the release of R 4.0.0 in 2020.
One of the major differences between R and SAS is that in R a lot of the functionality and fun features are provided by the user community in the form of packages. These are portable bundles of code, documentation, and data, which are distributed via CRAN. An excellent feature of CRAN compared to other open-source package repositories is that every package is checked and curated by a human alongside a suite of automated checks – there are currently more than 20,000 curated packages available!
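Installing a package is a one-off call per machine, after which you attach it in each session with library(). As a minimal sketch, using the {dplyr} package that we’ll meet again later:

# Install once per machine (downloads the package from CRAN)
install.packages("dplyr")

# Attach the package for the current session
library(dplyr)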
Let’s get started by looking at some of the basic operations in R, as compared to SAS. In SAS, the essential unit of data is the dataset, composed of observations and variables. In R, the essential atomic structure is the vector, a homogeneous sequence of elements (e.g. c(1, 2, 3, 4) or c("A", "C", "A")). The next level up is the data.frame, which, analogously to SAS’s dataset, is a tabular data format with rows and columns.
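To make that concrete, here’s a small sketch (the column names are purely illustrative) of building a vector and a data.frame by hand:

# A vector: a homogeneous sequence of values, built with c()
heights <- c(64, 87, 55, 84)

# A data.frame: a table whose columns are equal-length vectors
demog_mini <- data.frame(
  gender = c("M", "F", "M", "M"),
  height = heights
)

str(demog_mini)  # inspect the structure: dimensions and column types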
Before we get bogged down in minutiae, let’s look at an example of some basic processing of a dataset in SAS, and then in R. We’ll use the following demog data set for both:
head(demog)
#> # A tibble: 6 × 15
#>     age gender height weight status children  cars salary grade hours dept 
#>   <dbl> <chr>   <dbl>  <dbl> <chr>     <dbl> <dbl>  <dbl> <chr> <dbl> <chr>
#> 1    22 M          64     60 S             0     1 13592. low      35 A    
#> 2    26 F          87     74 M             1     1  8870. low      35 B    
#> 3    29 M          55     62 D             2     1 12672. low      35 C    
#> 4    28 M          84     74 M             3     1 23761. high     35 C    
#> 5    23 F          84     88 P             1     1 10513. low      35 C    
#> 6    30 M          87     76 S             0     1  7521. low      25 D    
#> # ℹ 4 more variables: overtime <dbl>, cc <dbl>, staffno <chr>, budget <dbl>
proc import file="data/demog.csv" out=work.demog replace;
run;

data demog2;
    set demog(keep=height weight salary gender);
    where gender = "M";
    height_m = (height * 2.54) / 100;
    bmi = weight / (height_m ** 2);
run;

proc corr data=demog2;
    var bmi salary;
run;
This produces the familiar PROC CORR correlation output. The equivalent in R is:
demog <- read.csv("data/demog.csv")

demog2 <- demog[
  demog$gender == "M",
  c("height", "weight", "salary")
]

demog2$height_m <- (demog2$height * 2.54) / 100
demog2$bmi <- demog2$weight / (demog2$height_m ** 2)

result <- cor.test(
  ~ bmi + salary,
  data = demog2,
  na.action = "na.omit"
)
print(result)
#> 
#>  Pearson's product-moment correlation
#> 
#> data:  bmi and salary
#> t = 0.71916, df = 69, p-value = 0.4745
#> alternative hypothesis: true correlation is not equal to 0
#> 95 percent confidence interval:
#>  -0.1500696  0.3132541
#> sample estimates:
#>        cor 
#> 0.08625406
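One nice consequence of the result being an ordinary R object (a list, in this case) is that we can pull out individual components directly, rather than post-processing printed output:

result$estimate   # the sample correlation
result$p.value    # the p-value of the test
result$conf.int   # the 95% confidence interval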
We can already see a few of the key differences, but let’s enumerate the most important ones: assignment uses the <- operator; data lives in in-memory objects rather than datasets on disk; everything happens by calling functions (read.csv(), cor.test()) rather than running DATA steps and procedures; and results like the correlation test are stored as objects to be printed or reused, rather than being written straight to output.
What we’ve seen so far uses just the functionality available in so-called “Base R”, without any extra packages installed. This isn’t typical; most users will install a variety of packages to aid in their work. One of the most popular sets of packages is the {tidyverse}, a collection of packages that share a common philosophy and style. Let’s load the {tidyverse} and see how it changes our workflow.
# Loading just two packages from the tidyverse
library(readr)  # For better file reading
library(dplyr)  # For better data wrangling

result <- read_csv("data/demog.csv") |>
  filter(gender == "M") |>
  select(height, weight, salary) |>
  mutate(
    height_m = (height * 2.54) / 100,
    bmi = weight / (height_m ** 2)
  ) |>
  cor.test(~ bmi + salary, data = _, na.action = "na.omit")
I won’t go through this line-by-line, but we’ve achieved the same result as before, this time writing {tidyverse}-style code. This style is characterised by a few key features: the pipe operator (|>) chaining each step into the next, readable “verb” functions such as filter(), select(), and mutate(), and referring to columns by their bare names rather than quoting them.
These functions may seem familiar to those of you used to SQL (available through PROC SQL in SAS). This is no coincidence: the {dplyr} package was partially inspired by SQL, and is a great way to do data wrangling in R. Our output is the same as before, just produced in a (in my opinion) more readable way.
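As a rough illustration of that parallel, here’s a hypothetical summary of the demog data, with the matching PROC SQL clause noted alongside each {dplyr} verb:

demog |>
  filter(gender == "M") |>                                # WHERE gender = "M"
  group_by(dept) |>                                       # GROUP BY dept
  summarise(avg_salary = mean(salary, na.rm = TRUE)) |>   # SELECT dept, AVG(salary)
  arrange(desc(avg_salary))                               # ORDER BY avg_salary DESC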
Now that we’ve seen a little more of the basics of data manipulation in R, let’s take a look at plotting. Here, we’ll create a simple plot of the demog dataset from before using the immensely popular {ggplot2} package. This package helps us create plots by layering different components together, a similar approach to PROC SGPLOT in SAS.
library(ggplot2)

ggplot(demog, aes(height, weight, colour = gender, fill = gender)) +
  geom_point() +
  geom_smooth(method = "lm")
If we wanted to create a similar output in SAS, we could do:
proc sgplot data=demog;
    scatter x=height y=weight / group=gender;
    reg x=height y=weight / group=gender clm;
run;
Here we can see that the {ggplot2} code is a little less verbose, and as we build up more complex plots it becomes easier in R to add extra layers and components. R also has a robust ecosystem of tools for producing reports and dashboards – for example this blog post was written in Quarto!
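For example, extending the earlier plot is just a case of adding further layers and components with +. This is only a sketch: the facetting variable and labels are purely illustrative.

ggplot(demog, aes(height, weight, colour = gender, fill = gender)) +
  geom_point() +
  geom_smooth(method = "lm") +
  facet_wrap(~ dept) +                          # one panel per department
  labs(title = "Height vs weight by gender") +  # add a title
  theme_minimal()                               # swap in a cleaner theme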
Integration between SAS and R is a topic deserving of an entire blog post (let us know if you’d like to see one!), but here we’ll just cover the basics. If you’re looking to talk to SAS from R, there are a few options: the {haven} package can read SAS datasets (sas7bdat and transport files) directly into R, and if you’re on SAS Viya then the {swat} package lets you connect to it from R.
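As a quick taster, reading an existing SAS dataset into R with {haven} looks like this (the file paths are just placeholders):

library(haven)

# Read a sas7bdat file straight into an R data frame (tibble)
demog_sas <- read_sas("data/demog.sas7bdat")

# SAS transport (xpt) files are supported too
demog_xpt <- read_xpt("data/demog.xpt")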
If you’re looking to talk to R from SAS, then the main way is via PROC IML. This is a SAS procedure for performing matrix and vector operations, and includes functionality to call R functions or scripts.
One of the most common pitfalls for SAS programmers learning R is the different mode of thinking about data. In SAS, we’re used to looping over observations and performing a calculation on each one. In R, the best approach is generally a vectorised calculation over an entire column at once. Explicit loops can be slow if you don’t prepare them correctly (for example, by pre-allocating the output), and they will often be much slower than the vectorised equivalent, since many of R’s vectorised functions are implemented in faster compiled languages such as C or C++.
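To illustrate with the BMI calculation from earlier, here’s a loop-style version next to the vectorised one. Both give the same result, but the vectorised form is shorter and usually far faster:

# Loop style: closer to the DATA step mental model
bmi <- numeric(nrow(demog2))             # pre-allocate the output vector
for (i in seq_len(nrow(demog2))) {
  bmi[i] <- demog2$weight[i] / demog2$height_m[i]^2
}

# Vectorised style: one expression over whole columns
demog2$bmi <- demog2$weight / demog2$height_m^2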
Another common source of mistakes, which we’ve already seen, is the difference in missing values. Beyond their representation and their treatment by functions, the behaviour in comparisons also differs. In SAS, a missing value is treated as negative infinity in comparisons (e.g. . < anything is always true). In R, a missing value propagates through a comparison, so NA < anything is NA. This can be a source of bugs if you’re not careful.
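A few lines in the console make the behaviour clear:

NA < 100                        # NA, not TRUE or FALSE
NA == NA                        # also NA!
is.na(NA)                       # TRUE - use is.na() to test for missingness
sum(c(1, 2, NA))                # NA, because the missing value propagates
sum(c(1, 2, NA), na.rm = TRUE)  # 3, once missings are explicitly removed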
The final pitfall that I’ll mention here is data size. In SAS, datasets are processed from disk, so you can work with tables of more or less any size provided you have enough disk space. In R, however, data is ordinarily loaded entirely into memory, so you’re limited by available RAM. This means that if you’re working with very large datasets, you may need to be careful about how you load and manipulate them. There are a variety of packages that can help with this; two that I’ll demonstrate here are {arrow} alongside {duckdb}.
We’ll use a large dataset of taxi trips in New York City. I’ve downloaded the first 4 months of 2016, which is approximately 8GB as a CSV! This is just about manageable for a modern laptop, but it would quickly get painful to analyse any more data, let alone the full history.
If you want to take a look at the code to download it and benchmark some basic manipulation using the standard tidyverse, see the section below. For now, we’ll just note the time taken to do some calculations and aggregations.
# Download and unpack the data
library(arrow)
library(aws.s3)  # n.b. need to set environment variable with AWS account details
library(dplyr)
library(fs)
library(lubridate)
library(purrr)
library(readr)
library(stringr)
library(tibble)

# Download each file if it doesn't already exist
downloaded_csvs <- get_bucket_df("s3://nyc-tlc/", region = "us-east-1") |>
  as_tibble() |>
  filter(str_detect(Key, r"(^csv_backup/yellow_tripdata_2016-0[1-4]\.csv$)")) |>
  pull(Key) |>
  map_chr(
    \(x) {
      fpath <- file.path("data", "nyctaxi", x)
      if (!file_exists(fpath)) {
        save_object(
          object = x,
          bucket = "s3://nyc-tlc/",
          file = fpath
        )
      }
      fpath
    }
  )

# Convert all the CSVs to parquet, for later duckdb manipulation
open_dataset("data/nyctaxi/csv_backup", format = "csv") |>
  mutate(month = month(tpep_pickup_datetime)) |>
  group_by(month) |>
  write_dataset("data/nyctaxi/nycp", format = "parquet")

# Load the data in R lazily
# fread is typically faster than readr::read_csv, but lazy representations
# can make up the difference!
nyc_taxi <- read_csv(
  downloaded_csvs,
  id = "file_name",
  lazy = TRUE,
  col_types = cols(
    tpep_pickup_datetime = col_datetime(format = "%Y-%m-%d %H:%M:%S"),
    tpep_dropoff_datetime = col_datetime(format = "%Y-%m-%d %H:%M:%S"),
    store_and_fwd_flag = col_character(),
    .default = col_double()
  )
)

tictoc::tic("Using standard methods")

agg_taxi <- nyc_taxi |>
  filter(VendorID == 2) |>
  mutate(
    # Compute the "Manhattan" distance between the two points... :P
    trip_manhattan_distance = (
      abs(dropoff_longitude - pickup_longitude) +
        abs(dropoff_latitude - pickup_latitude)
    ),
    # Calculate trip duration, in seconds
    trip_duration = as.duration(tpep_dropoff_datetime - tpep_pickup_datetime)
  ) |>
  group_by(passenger_count) |>
  summarise(
    avg_trip_distance = mean(trip_manhattan_distance, na.rm = TRUE),
    avg_trip_duration = mean(trip_duration, na.rm = TRUE),
    avg_fare_amount = mean(fare_amount, na.rm = TRUE)
  )

tictoc::toc()
rm(agg_taxi)
#> Using standard methods: 776.47 sec elapsed
Now let’s demonstrate doing the same manipulation with {duckdb}:
library(arrow)
library(duckdb)
library(tictoc)

tic("Using duckdb & arrow")  # Basic timing

agg_taxi2 <- open_dataset("data/nyctaxi/nycp") |>
  to_duckdb() |>
  filter(VendorID == 2) |>
  mutate(
    trip_manhattan_distance = (
      abs(dropoff_longitude - pickup_longitude) +
        abs(dropoff_latitude - pickup_latitude)
    ),
    trip_duration = datediff("s", tpep_pickup_datetime, tpep_dropoff_datetime)
  ) |>
  group_by(passenger_count) |>
  summarise(
    avg_trip_distance = mean(trip_manhattan_distance, na.rm = TRUE),
    avg_trip_duration = mean(trip_duration, na.rm = TRUE),
    avg_fare_amount = mean(fare_amount, na.rm = TRUE)
  ) |>
  collect()

res <- toc()
#> Using duckdb & arrow: 7.61 sec elapsed
I won’t walk through the code – just notice that it was roughly 100 times faster (776 seconds down to under 8), all without running out of memory! This is another topic that deserves a blog post of its own, and it’s worth saying that this isn’t the only method for working with large datasets in R.
This post has introduced you to some common functionality in R and how it compares to SAS. If you’re looking for a fully featured introduction to R, either from first principles or focused on advanced concepts, then take a look at our R courses. If you’d be interested in an R course aimed at SAS programmers, with more tips and tricks for those transitioning, then make an enquiry here.