R programming is overwhelming to new data scientists. It was for me. I came from a SQL background. In SQL, a simple SELECT statement with a WHERE clause works in most situations. I didn't have to change things up based on the object type. Not true for R.
That meant I spent a lot of time trying to figure out how to wrangle data in R that would've been better spent on analysis.
In hindsight, a lot of the mistakes I made could've been prevented if someone had given me a "list": something to tell me the key principles, such as object types, that would've made the R programming language easier to learn and understand.
Well, I can't change the past, but I can hopefully save you (the new R user) some of those headaches.
Here's the list of R programming concepts you should learn ahead of time to make your R programming journey easier.
#1: The Different Object Types and How to Work With Them
R is an object-oriented programming language. That might sound like semantics, but it's a super important concept.
Every command you execute in R will do at least one of the following:
Read an object
Modify an object
Produce an object
Call upon an existing object
If you're wondering what an object is, it's basically the way that R stores data, as well as various functions.
There are several common object types you should remember, including vectors, matrices, arrays, and data frames.
The reason it's important to learn the difference between these object types is that common R commands change depending on the object type.
For example, the command object_name[n] will select a single data entry from a vector, matrix, or array. But that same command will select a whole column from a data frame.
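Here's a minimal sketch of that difference, using small made-up objects (the names `v`, `m`, and `df` are just for illustration):

```r
# Single-bracket indexing behaves differently depending on the object type
v <- c(10, 20, 30)         # a numeric vector
v[2]                       # selects the single element 20

m <- matrix(1:6, nrow = 2) # a 2x3 matrix, filled column by column
m[3]                       # selects the single element 3 (column-major order)

df <- data.frame(a = 1:3, b = c("x", "y", "z"))
df[1]                      # selects the ENTIRE first column, as a one-column data frame
df[2, 1]                   # row-and-column indexing selects the single entry 2
```

Notice that getting a single entry out of a data frame requires the two-index form `df[row, column]`, while a vector only needs one index.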
You can read more on the differences between these object types here and how to interact with them here.
#2: The Difference Between R and RStudio
When most people think of R, they think of RStudio, but the two are actually quite different.
R (or "base" R) is the original version of R. It has a simple user interface, which you can see in the screenshot below:
RStudio is an integrated development environment. That's a fancy way of saying it's a user-friendly interface. It allows you to perform a lot of common R tasks, like data and package import, with a simple click of the mouse. It also cleanly presents saved code, saved objects, and R documentation.
I strongly suggest downloading RStudio. It will make learning the R programming language a lot easier.
Base R is actually quite functional, though. As you get better at programming in RStudio, you'll find it easy to "go backwards" to the simpler base R interface. Once you're at a pro level, you may even find its minimalist approach easier to work with than the "busy" RStudio interface.
You can learn more about R, RStudio, and how to use them here.
#3: R Packages and How to Install and Load Them
R packages store functions and functionality that expand R beyond its humble beginnings as a statistical programming language. That may include new statistical methods, like those found in survival analysis or machine learning, or better data visualizations, like those found in ggplot2.
You can download new R packages for free! This gets back to the fact that R is an open-source programming language, which means it was developed so that any individual or organization can use it and add to it.
What adds to the confusion though is that there's a difference between installing and loading a package. Installing means installing it on your computer. Loading means making it accessible to your current R session.
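In code, the distinction looks like this (using dplyr as the example package):

```r
# Installing downloads the package to your computer -- you only do this once
install.packages("dplyr")

# Loading attaches it to your current R session -- you do this every session
library(dplyr)
```

If you only install but never load, the package's functions will throw "could not find function" errors, which is exactly the trap described below.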
I remember when I first started programming in R, I kept trying to use various R functions I had found on the internet. I would then complain to my co-worker, who had more experience in R than I did, about how the functions kept throwing errors.
He looked at the error code and laughed. He then told me: “you have to actually install the package first, ding dong!”
My R programming experience radically improved after learning that simple concept.
#4: The dplyr Package
Speaking of packages, there's one non-base R package you should learn: dplyr. It will make data wrangling, transformation, and filtering far easier.
The base R syntax that's commonly used for filtering and data transformation is not very readable. It's great for developing your own functions and packages, but it's not good for writing code that other developers will use.
The dplyr package solves that issue. It's a cleaner, more intuitive syntax that allows you to transform and filter data. And it makes it far easier for your colleagues to read and interpret your code.
Here's an example of dplyr code, which filters and summarizes data in a data frame:
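A sketch along those lines, using a small made-up data frame (the `movies` data and its columns are hypothetical, not the author's original example):

```r
library(dplyr)

# Hypothetical data: two actors and the gross revenue of their films
movies <- data.frame(
  actor = c("Connery", "Connery", "Moore", "Moore"),
  gross = c(100, 120, 90, 110)
)

movies %>%
  filter(gross > 95) %>%               # keep only the higher-grossing films
  group_by(actor) %>%                  # then summarize within each actor
  summarize(mean_gross = mean(gross))
```

The pipe (`%>%`) reads top to bottom as a series of steps, which is what makes dplyr code easy for colleagues to follow.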
And here's how to achieve that same goal in base R:
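One base R version of the same filter-then-summarize task (again with the hypothetical `movies` data, and using `aggregate` as one of several possible approaches):

```r
movies <- data.frame(
  actor = c("Connery", "Connery", "Moore", "Moore"),
  gross = c(100, 120, 90, 110)
)

# Filter with bracket indexing, then aggregate the mean by group
high_grossing <- movies[movies$gross > 95, ]
aggregate(gross ~ actor, data = high_grossing, FUN = mean)
```

The bracket-and-formula syntax works, but the intent is harder to read at a glance than the dplyr pipeline.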
Quite a difference, isn't it?
While I suggest learning the base R methods for data management first, since they're better for writing functions and developing your own packages, I also recommend learning dplyr to improve your code's readability for when you share it with other data scientists.
#5: How to Write Your Own Functions
Whether you're a data analyst, data scientist, statistician, or a programmer – I strongly recommend writing your own functions.
The big reason is the open-source nature of R programming. Many of the packages out there were developed by individual contributors. That means we don't always know the level of quality checking they went through.
The other major issue is documentation. I've sadly come to realize that not all R documentation, especially for lesser-known packages and functions, is created equal. While the base R documentation is well written and reliable, other documentation lacks clarity.
They will say "this function runs this test," but they won't clearly explain the arguments and how they work. Or the output is a random assortment of unlabeled numbers that's hard to interpret.
For that reason, it's often faster to write your own functions for the specific output you need. In the time it would take you to read the documentation for an existing function, practice with it, interpret the results, and verify their accuracy, you could have written your own function that serves your needs.
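As a minimal sketch of what that looks like, here's a small custom function with clearly labeled output (the name `summarize_gross` is hypothetical):

```r
# Return the mean and standard deviation of a numeric vector, clearly labeled
summarize_gross <- function(x) {
  c(mean = mean(x, na.rm = TRUE),
    sd   = sd(x, na.rm = TRUE))
}

summarize_gross(c(100, 120, 110))
#  mean    sd
#   110    10
```

Because you wrote it, you know exactly what the arguments do and what the numbers in the output mean.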
To learn how to write a function, download my book or read this article here.
#6: How to Write Control Flows
Control flows are a handy little trick for shortening your R code. I find them most useful when writing my own functions, which rely more on base R syntax, rather than when writing code with dplyr.
A control flow such as a loop prompts R to repeatedly perform the same action on each individual subset of the data.
Confusing? It's really not that complicated when you see a real example.
In my book, I use James Bond data to illustrate this. Let's say we want a function to calculate the mean and standard deviation of the gross revenue each Bond actor made.
Unlike before, though, we don't want to use dplyr, because we want this function to scale without relying on the dplyr package being installed.
So we write a custom function to calculate this, and we want to avoid repetitive code that looks like this:
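Something like the following, where the same two calculations are copied and pasted for every actor (the `bond` data frame and its columns are hypothetical stand-ins for the book's data):

```r
# Hypothetical Bond data: one row per film
bond <- data.frame(
  Actor = c("Connery", "Connery", "Moore", "Moore"),
  Gross = c(100, 120, 90, 110)
)

# Repetitive version: one pair of lines per actor
mean(bond$Gross[bond$Actor == "Connery"])
sd(bond$Gross[bond$Actor == "Connery"])
mean(bond$Gross[bond$Actor == "Moore"])
sd(bond$Gross[bond$Actor == "Moore"])
# ...and so on, copied again for every remaining actor
```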
Well, we can use a control flow to shorten that script. It will repeat the calculation of the mean and standard deviation, one grouping at a time, until we reach the last group.
That shortens the code above to something like the code below:
Here are the common control flows:
if(cond) cons.expr else alt.expr
for(var in seq) expr
while(cond) expr
repeat expr
They all work differently and are handy in different contexts. Run "?Control" in your R console to see the documentation for control flows. I'll be writing an article about them in the next few weeks.