By Barbara Mikulášová
In the world of data analysis, the debate between R and Python is fading. Instead, the focus is on combining their strengths. The ‘reticulate’ package helps connect these languages smoothly.
The reticulate package is a bridge between the two most popular open-source languages in the data science world, R and Python. Its purpose is to support multilingual data science teams by facilitating seamless interoperability between Python and R. It also eliminates the need to translate any legacy scripts from one programming language to another.
Now the question is no longer about which programming language is the best but which one offers the most useful tools for the different stages of projects.
This tip is aimed at programmers and data scientists who already have some familiarity with the reticulate package but sometimes find themselves wondering why their objects and imported functions are not behaving the way they expected.
This is a common experience since reticulate was built to blur the lines between R and Python. But there is no need to feel discouraged. After finishing this tip, you should feel ready to embrace the versatility this package offers you and your team.
At the time of writing this tip, the requirements for using the reticulate package are:
This article will not go into detail on how to set up the correct Python version or conda environment for use with reticulate. A detailed outline of these steps, together with relevant examples, can be found on the official reticulate website.
One of the advantages of the reticulate package is the automatic data type conversion. In practice, when we need Python to return values to R, these Python data types are automatically converted into their R data type equivalents, unless explicitly specified otherwise. The same happens when we are calling into Python while working with R variables.
Of course, you can always take the matter into your own hands and convert the R types into the Python equivalents, especially if the objects are required for specific API or function calls. More discussion on this topic later.
If you are working with custom Python classes you do not need to worry. R will simply create a reference to your custom Python object. Read on for more details on how to work with custom Python classes and functions while using reticulate.
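As a rough guide, the default Python-to-R mapping described above can be sketched in Python itself. The function `r_equivalent` below is hypothetical (it is not part of reticulate) and follows a simplified reading of the type conversion table in the reticulate documentation: it ignores NumPy arrays and pandas data frames, and it assumes a list holds values of a single type.

```python
def r_equivalent(value):
    """Name the R type a Python value would typically become (simplified)."""
    if value is None:
        return "NULL"
    if isinstance(value, (bool, int, float, str)):
        return "single-element vector"
    if isinstance(value, dict):
        return "named list"
    if isinstance(value, tuple):
        return "list"
    if isinstance(value, list):
        return "multi-element vector"
    # Anything else, such as an instance of a custom class, is not converted;
    # R only holds a reference to the Python object.
    return "reference to the Python object"


class Customer:
    """Hypothetical custom class used only for this illustration."""


examples = [42, "Vienna", [1, 2, 3], {"A": 12}, (1, "a"), None, Customer()]
for example in examples:
    print(type(example).__name__, "->", r_equivalent(example))
```

The last branch mirrors the point about custom classes: there is no R equivalent to convert to, so the object stays in Python and R works with a reference to it.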
Once you start incorporating R and Python scripts into your workflow, it can be easy to lose track of your different R and Python variables and objects. In the following section, we will go through a variety of scenarios and how they influence the way your R variables and Python objects behave in the code script.
Two objects, py and r, are the main bridges between our Python and R sessions. The py object allows us to interact with the Python session, and any objects stored there, from the R code.
# Python session: create a dictionary object
py_dic = {'London': '8.80 million', 'Vienna': '1.97 million'}
print(py_dic)
#> {'London': '8.80 million', 'Vienna': '1.97 million'}
Naturally, when you are working from RStudio, any Python objects created during the Python session will not be available in your R global environment pane. This means that we cannot access them simply by calling their names.
# R session
py_dic[1]
#> Error in eval(expr, envir, enclos): object 'py_dic' not found
Instead, the dictionary object needs to be referenced via the py object using $.
# R session: reference the Python dictionary via py
py$py_dic
#> $London
#> [1] "8.80 million"
#>
#> $Vienna
#> [1] "1.97 million"
Additionally, when we call Python objects into R, they are automatically converted into the corresponding R types. In this case, the dictionary object became an R list object. Now we can use R syntax to index the list’s elements.
# R session
py$py_dic[1]
#> $London
#> [1] "8.80 million"
Notice the way the first element of the converted list object is extracted. Since we are working with the R list data type now and not Python dictionary object, we need to follow R programming rules. Indexing of elements is one of the examples where the rules of Python and R deviate. Python uses 0-based indexing while R uses a 1-based system.
This example shows why it is important to remember which programming language we are working with and where each object comes from.
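For comparison, here is a minimal sketch of how the same data is accessed on the Python side. It reuses the dictionary from the Python session above; the variable names `cities`, `first_city` and `second_city` are introduced only for this illustration.

```python
# The dictionary from the Python session.
py_dic = {"London": "8.80 million", "Vienna": "1.97 million"}

# A Python dictionary is looked up by key, not by position.
london = py_dic["London"]

# Positional access needs a sequence, and positions start at 0.
cities = list(py_dic)      # keys in insertion order (Python 3.7+)
first_city = cities[0]     # position 0 is the first element in Python
second_city = cities[1]    # in R, index 1 would be the first element

print(london, first_city, second_city)
```

The same index, 1, therefore points at different elements depending on which language evaluates it.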
To reference R variables in a Python session, use the r object followed by a dot (.) and the name of the R object you need to call. The referenced object will be automatically converted into its Python equivalent.
# R session: create a vector
my_vec <- c(1:4)

# Python session: reference the vector via r
print(r.my_vec)
#> [1, 2, 3, 4]
print(type(r.my_vec))
#> <class 'list'>
One of the ways to run a string of Python code from within the R code is to use the py_run_string function. This short call can also be used to create new Python objects. Although this is an R function, the dictionary object my_dict is created in the Python session and therefore is not available in the R global environment. It must be referenced via the py object.
# R session
py_run_string("my_dict = {'A':12, 'B':2, 'C':7}")
print(py$my_dict)
#> $A
#> [1] 12
#>
#> $B
#> [1] 2
#>
#> $C
#> [1] 7
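One way to picture why my_dict ends up on the Python side is to compare py_run_string with Python's built-in exec(). The sketch below is only an analogy and not how reticulate is implemented: the `main_namespace` dictionary is an assumed stand-in for the Python session's main module.

```python
# Assumed stand-in for the namespace of the Python session.
main_namespace = {}

# Running a string of code creates the object inside that namespace,
# not in the scope of the caller.
exec("my_dict = {'A': 12, 'B': 2, 'C': 7}", main_namespace)

# The caller has to go through the namespace to reach the object,
# much as the R code goes through py$my_dict.
my_dict = main_namespace["my_dict"]
print(my_dict)
```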
The package itself offers useful functions to create Python objects from R. For example, to create a Python dictionary object during the R session we can use the dict() or py_dict() functions. The main difference between the two is how the key-value pairs are defined: py_dict() takes separate keys and values arguments, while dict() takes the pairs directly, either as named arguments or as a single named list.
# R session
# py_dict(): separate keys and values
cities_dic <- py_dict(keys = c("Germany", "Portugal"), values = c('Berlin', 'Lisbon'))

# dict(): a single named list
cities_dic_2 <- dict(list(Germany = 'Berlin', Portugal = 'Lisbon'))

# dict(): key-value pairs as named arguments
rivers_dic <- dict('Amazon' = 6400, 'Nile' = 6650)
When you create Python objects using these R functions, they exist in your global R environment and by default are not converted into the corresponding R types. We can call their methods and attributes using $ .
# R session
print(cities_dic)
#> {'Germany': 'Berlin', 'Portugal': 'Lisbon'}
cities_dic$keys()
#> dict_keys(['Germany', 'Portugal'])
What you need to remember here, however, is that these Python objects are available to your R session. In order to reference them in the Python code, you need to call them using the r inference object. This is one of the instances where it is easy to misplace your objects.
# Python session
r.cities_dic.keys()
#> dict_keys(['Germany', 'Portugal'])
Any Python object created with these reticulate functions during the R session can be manually converted into its R type using the py_to_r() function.
# R session
cities_list <- py_to_r(cities_dic)
print(cities_list)
#> $Germany
#> [1] "Berlin"
#>
#> $Portugal
#> [1] "Lisbon"
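This explicit conversion is not possible for the Python objects stored in the Python session, because py$py_dic has already been converted into an R list by the time py_to_r() receives it.
# R session
r_list <- py_to_r(py$py_dic)
#> Error in py_to_r.default(py$py_dic): Object to convert is not a Python object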
The good news is that most of the time explicit conversion is unnecessary, since Python objects can be used in the R code by referencing them via py. That is the beauty of reticulate.
Another instance where the automatic type conversion between Python and R can be prevented is when we import Python libraries.
By setting the argument convert = FALSE in the import() function, any objects created using the functions from the imported library will not be automatically converted. This means we have to be careful of the syntax we use if we wish to manipulate these objects further.
# R session
pd <- reticulate::import("pandas", convert = FALSE)

# Create a pandas Series from an R vector
my_series <- pd$Series(c(2, 4, 6, 8), index = c('blue', 'red', 'green', 'yellow'))

# Call a pandas method on the Series object
series_mean <- my_series$mean()
print(series_mean)
#> 5.0
Without setting the convert argument to FALSE, the automatic data conversion will take place. If you then try to call a method such as mean() on the converted object, the code will fail, because the $ operator is applied to an R vector and is therefore not recognised as a method call.
# R session: convert = TRUE is the default
np <- import("numpy")

# The NumPy array object is converted into an R vector
num_vec <- np$array(c(7:11))

# $ is not a valid operator for atomic vectors, so this fails
vec_mean <- num_vec$mean()
#> Error in num_vec$mean: $ operator is invalid for atomic vectors
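The same effect can be reproduced in plain Python, which may make the error easier to reason about. In the sketch below, `TinyArray` is a hypothetical stand-in for a NumPy array (it is not a real library class), and turning it into a plain list plays the role of the automatic conversion into an R vector.

```python
from statistics import mean


class TinyArray:
    """Hypothetical stand-in for a NumPy array: data plus a mean() method."""

    def __init__(self, values):
        self.values = list(values)

    def mean(self):
        return mean(self.values)


# While the object keeps its original type, its methods are available.
wrapped = TinyArray(range(7, 12))
wrapped_mean = wrapped.mean()

# After conversion to a native type, the data survives but the method does not.
converted = list(wrapped.values)
has_method = hasattr(converted, "mean")

# The converted object has to be handled with the tools of its new type.
plain_mean = mean(converted)

print(wrapped_mean, has_method, plain_mean)
```

In R terms, the converted vector would be passed to R's own mean() function rather than called with $mean().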
In summary, the reticulate package helps us to combine the functionalities of R and Python. It also forces us to pay close attention to the nature of the objects we create in our code. Rather than seeing this as a negative, we should embrace the exciting possibilities which this hybridised code brings to our data science projects.
If you want to explore the benefits this will bring, reach out to us.