How to Filter and Transform Objects (Data) in R

8/30/2020

I adapted this blog post from a chapter in my upcoming book, R Programming in Plain English. You may download a PDF of all completed material for this book here. The PDF also includes practice problems and solutions over these concepts.

In a previous post, I explained the various object types in R. Now we want to learn how to filter and transform those objects. Notice how I didn't say filter and transform the data? That's because filtering and transforming data in R heavily depend on the object type.

That's what we'll cover in this post.

Before I explain those methods though, we need to quickly cover operators.

What Are Operators?

If you're new to programming, you're probably not familiar with the term operator. Operators, in plain English, modify or evaluate data. That's important to data transformation and filtering.

This post uses two types of operators in R: arithmetic and logical.

Arithmetic operators cover tasks like addition, subtraction, etc. You know? The basic math stuff. They're useful for data transformation and appear in several examples later.

Here are the common arithmetic R operators:

addition: " + "
subtraction: " - "
multiplication: " * "
division: " / "
exponent: " ^ "
matrix multiplication: " %*% "
integer division: " %/% "

Logical operators take the data and generate a TRUE or FALSE output, based on whether the data meets your requirement. They're more helpful for filtering data than for transforming it.

Here are the common logical R operators:

less than: " < "
greater than: " > "
less than or equal: " <= "
greater than or equal: " >= "
equal: " == "
does not equal: " != "
and: " & "
or: " | "

Don't worry if you're unsure of how to use these just yet. You'll see examples for these in the next few sections. This is just for your easy reference.
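As a quick preview, here's a minimal sketch of a few of these operators on single numbers. The values 7 and 2 are arbitrary ones I picked for illustration. Note that %/% is integer division, while %*% is reserved for matrix multiplication.

```r
# Arithmetic operators return numbers
7 + 2     # 9
7 / 2     # 3.5
7 %/% 2   # 3 (integer division drops the remainder)
2 ^ 3     # 8

# Logical operators return TRUE or FALSE
7 > 2             # TRUE
7 == 2            # FALSE
7 > 2 & 7 < 5     # FALSE (both sides must be TRUE)
7 > 2 | 7 < 5     # TRUE  (only one side needs to be TRUE)
```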

How to Filter and Transform Data From a Vector

Vectors are the easiest objects to filter. Same with transforming the data within them.

If you want to reference or view the entire vector, you simply enter the name you assigned the object:

Run this code:

  v5 <- c(1,5,5,2,1,4)

  v5



See output below:

  [1] 1 5 5 2 1 4

You also can select a single entry from a vector using the [n] notation:

Run this code:

  v5[3]



See output below:

  [1] 5

As you can see, the script above selected the third value from the vector.

You can select a range of entries by using the [n:n] notation:

Run this code:

  v5[3:4]



See output below:

  [1] 5 2

And you can also create a new vector by referencing old vectors!

Run this code:

  v2 <- c("Hola","Howdy","Hello")

  v7 <- c(2:4)

  v8 <- c(v2,v7)

  v8



See output below:

  [1] "Hola" "Howdy" "Hello" "2" "3" "4"
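Notice that the numbers in that output are wrapped in quotes. A vector can hold only one data type, so R converted the numbers to text when it combined them with the greetings. Here's a short sketch of how you might check this with class(). The names greetings, nums, and mixed are placeholders I made up so the block doesn't overwrite the vectors above:

```r
greetings <- c("Hola", "Howdy", "Hello")
nums <- 2:4
mixed <- c(greetings, nums)

class(nums)    # "integer"
class(mixed)   # "character"

# Convert the text back to numbers before doing math on it
as.numeric(mixed[4:6]) + 1   # 3 4 5
```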

You can use other base R functions to filter data as well.

For example, you may want to see the minimum or maximum value in a vector. You can use the max() and min() functions for that:

Run this code to show max value:

  max(v5)



See output below:

  [1] 5



Run this code to show min value:

  min(v5)



See output below:

  [1] 1

And you can use logical operators as well. In the examples below, I use the >= and & operators to filter values:

Run this code to view entries greater than or equal to 2:

  v5 >= 2



See output below:

  [1] FALSE TRUE TRUE TRUE FALSE TRUE



Run this code to view entries between 3 and 5:

  v5 >= 3 & v5 <= 5 # Values between 3 and 5



See output below:

  [1] FALSE TRUE TRUE FALSE FALSE TRUE

We can also use the | operator to find values that meet either of two criteria. For example, I filter the vector below to "Hola" and "Howdy":

Run this code to view all entries:

  v2



See output below:

  [1] "Hola" "Howdy" "Hello"



Run this code to view entries that equal "Hola" or "Howdy":

  v2 == "Hola" | v2 == "Howdy"



See output below:

  [1] TRUE TRUE FALSE

You probably noticed that these logical operators only return TRUE or FALSE statements. That makes sense since it is a logical argument that's evaluated. However, we may want to see the actual values that meet our filter criteria. This isn't important in an example like this, but it's important when we start to filter data frames.

To show the actual values where the logical argument is true, you use the object_name[argument] notation. In the next few examples, I filter the vectors down to values that meet the arguments used in the last few examples:

Run this code to view all entries that are greater than or equal to 2:

  v5[v5>=2]



See output below:

  [1] 5 5 2 4



Run this code to view all entries that are between 3 and 5:

  v5[v5 >= 3 & v5 <= 5]



See output below:

  [1] 5 5 4



Run this code to view all entries that equal "Hola" or "Howdy":

  v2[v2 == "Hola" | v2 == "Howdy"]



See output below:

  [1] "Hola" "Howdy"

In the examples above, I simply copied the logical arguments used previously and pasted them between the brackets.
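Two base R helpers pair well with this bracket notation: which() returns the positions of the TRUE values, and %in% is a shorter way to write several == checks joined by |. Here's a minimal sketch that re-creates the two vectors with the same values so the block runs on its own:

```r
v5 <- c(1, 5, 5, 2, 1, 4)
v2 <- c("Hola", "Howdy", "Hello")

# Positions (not values) of the entries that are 2 or more
which(v5 >= 2)                     # 2 3 4 6

# Same result as v2[v2 == "Hola" | v2 == "Howdy"]
v2[v2 %in% c("Hola", "Howdy")]     # "Hola" "Howdy"
```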

You can also change data easily when it comes to numeric vectors. For example, down below is a vector of box office revenue for James Bond films. Copy and paste this script into your R console and execute:

Run this code:

  gross <-

   c(1108561008,1014941117,912257512,880669186,

    825110761,756544419,692713752,669789482,

    655872400,648514469,622246378,576277964,

    543639638,529548711,505899782,491617153,

    486468881,478946402,448249281,440759072,

    426244352,381088866,321172633,285157191)

  gross

As you can see, the values are very large. To make our analysis easier, we can use one of the arithmetic operators I showed earlier. In this scenario, I want to make the values smaller. So I'm going to divide it using the / operator.

Run this code:

  gross <- gross/1000000

  round(gross)



See output below:

  [1] 1109 1015 912 881 825 757 693

  [8] 670 656 649 622 576 544 530

  [15] 506 492 486 479 448 441 426

  [22] 381 321 285

Finally, we can re-assign the value to a particular part of a vector using a combination of methods we covered earlier and the <- notation. For example, we can see below how we re-assign values based on the location:

Run this code:

  v8 <- c(1,5,5,2,1,4) # Creates the original vector

  v8[6] <- 8 # Replaces the sixth value with an 8

  v8[1:3] <- c(4,3,1) # Replaces the first three values

  v8



See output below:

  [1] 4 3 1 2 1 8
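You can also combine a logical filter with the <- notation to replace only the values that meet a condition. Here's a small sketch, assuming we want to cap every value above 4 at 4. The name v9 is hypothetical and chosen so it doesn't overwrite the vectors above:

```r
v9 <- c(1, 5, 5, 2, 1, 4)

v9[v9 > 4] <- 4   # Replaces only the entries greater than 4

v9                # 1 4 4 2 1 4
```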

How to Filter and Transform Data From a Matrix

Filtering the data within a matrix is both similar to and different from filtering a vector. It's similar because we can use the [n] notation to select a single entry. We did this before with a vector.

Run this code:

  v5[3]



See output below:

  [1] 5

You can do the same for a matrix. If you run the code below, you’ll re-create and view the matrix we used in the previous post:

Run this code:

  matrix1 <- matrix(c(2,0,1,3),nrow=2,ncol=2)

  matrix1



See output below:

      [,1] [,2]

  [1,]  2    1

  [2,]  0    3

And here you’ll select the fourth value from that matrix using the [4] command:

Run this code:

  matrix1[4]



See output below:

  [1] 3

Now that isn’t very practical for a matrix. You may need to select a value from a specific row or column instead. This is where matrices behave differently than vectors. You’ll want to use the [r,c] notation to determine which values you want.

In the example below, I select the second row and first column of the matrix:

Run this code:

  matrix1[2,1]



See output below:

  [1] 0

We can make this easier on ourselves. Instead of specifying row or column numbers, we can give them names. That way, we can use the [row_name,column_name] notation to select data from a matrix.

Down below, I give our previously created matrix row and column names:

Run this code:

  colnames(matrix1) <- c("Col1","Col2")

  rownames(matrix1) <- c("Row1","Row2")

  matrix1["Row2","Col1"]



See output below:

  [1] 0

We can also use the filtering methods used on vectors and apply them to matrices.

For example, I may want to see what values are greater than 0...

Run this code:

  matrix1 > 0



See output below:

        Col1    Col2

  Row1  TRUE    TRUE

  Row2  FALSE   TRUE

Funny enough though, you can't return the actual values that meet these criteria in a matrix form. The result turns into a vector. That's because the output may not have the same number of columns and rows as the original matrix, so R returns a one-dimensional object instead.

Run this code:

  matrix1[matrix1 > 0]



See output below:

  [1] 2 1 3
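If you need to know where those values sit in the matrix, one option is which() with arr.ind = TRUE, which reports the row and column of each TRUE entry. You can also filter whole rows, which keeps the matrix shape as long as you add drop = FALSE. A minimal sketch (the name m is a placeholder for illustration, holding the same values as matrix1):

```r
m <- matrix(c(2, 0, 1, 3), nrow = 2, ncol = 2)

# Row and column positions of the values greater than 0
which(m > 0, arr.ind = TRUE)

# Keep only the rows whose first column is greater than 0.
# drop = FALSE stops R from collapsing the single row into a vector.
m[m[, 1] > 0, , drop = FALSE]   # a 1 x 2 matrix holding 2 and 1
```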

You can use the same techniques we outlined before with the vectors to transform the data within a matrix. Copy and paste the codes below to your R console and see the results. Feel free to play around with the inputs to see what happens.

Run this code:

  matrix1 <- matrix(c(2,0,1,3),nrow=2,ncol=2)

  matrix1

  matrix1[3] <- 5 #Changes 3rd value to a 5

  matrix1

  matrix1[,2] <- 2 #Changes the entire 2nd column to 2

  matrix1

  matrix1[2,2] <- 0 #Changes the 2nd row, 2nd column to 0

  matrix1

Like the vectors, you can transform the data within the matrix using the arithmetic operators we discussed earlier. Run the code below in your own R console and see what happens:

Run this code:

  matrix1

  matrix1+2

  matrix1-4

  matrix1^3

  matrix1*5

You can also use these operators to combine matrices. We'll need a few matrices to illustrate these examples though. Take the code below and execute it in your console, if you want to follow along with my examples:

Run this code:

  matrix1 <- matrix(c(2,0,1,3),nrow=2,ncol=2)

  matrix1

  matrix2 <- matrix(c(5,7),nrow=2)

  matrix2

  matrix6 <- matrix(c(4,3,1,3),nrow=2,ncol=2)

  matrix6

It's important to remember the dimensions of your matrices. Attempting to use addition on two matrices without the same dimensions won't work.

Matrix 1 and 2 do not have the same dimensions, so it will return an error:

Run this code:

  matrix1 + matrix2



See output below:

  Error in matrix1 + matrix2 : non-conformable arrays

However, Matrix 1 and Matrix 6 do have the same dimensions and will execute:

Run this code:

  matrix1 + matrix6



See output below:

       [,1] [,2]

  [1,]   6    2

  [2,]   3    6

Multiplying two matrices together can be misleading. For example, using the simple * operator will only multiply the corresponding values in two matrices with the same dimensions. Confused? Look at the two matrices below and then look at the output:

Run this code:

  matrix1

  matrix6

  matrix1 * matrix6



Matrix 1:

       [,1] [,2]

  [1,]   2    1

  [2,]   0    3



Matrix 6:

       [,1] [,2]

  [1,]   4    1

  [2,]   3    3



Output:

       [,1] [,2]

  [1,]   8    1

  [2,]   0    9

Entry [1,1] of the first matrix is 2. Entry [1,1] of the second matrix is 4. 2 x 4 = 8. True matrix multiplication would instead give 2 x 4 + 1 x 3 = 11 for that entry. That shows us that the multiplication used here is not true matrix multiplication.

If you attempt to use the same * operator for Matrix 1 and Matrix 2, you will also get an error:

Run this code:

  matrix1 * matrix2



See output below:

  Error in matrix1 * matrix2 : non-conformable arrays

That's because these two matrices do not share the same dimensions.

However, we can generate a single matrix from these two matrices using matrix algebra. How do we do this? We use the %*% operator!

Run this code:

  matrix7 <- matrix1 %*% matrix2

  matrix7



See output below:

       [,1]

  [1,]  17

  [2,]  21

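You can also divide one matrix by another, entry by entry, using the %/% operator. This is integer division, so any remainder is dropped. It is not the reverse of matrix multiplication: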

Run this code:

  matrix7 %/% matrix2



See output below:

       [,1]

  [1,]  3

  [2,]  3

Just remember the difference in how a matrix will interact with the *, %*%, /, and %/% operators.
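To make the difference concrete, here's a short sketch that compares these operators on the same matrix. Keep in mind that %/% is element-wise integer division, not an inverse of %*%. If you need to undo a matrix multiplication, base R's solve() function is one common tool. The names A, b, and product are placeholders I chose for illustration:

```r
A <- matrix(c(2, 0, 1, 3), nrow = 2, ncol = 2)   # same values as matrix1
b <- matrix(c(5, 7), nrow = 2)                   # same values as matrix2

A * A        # element-wise: rows (4, 1) and (0, 9)
A %*% A      # matrix multiplication: rows (4, 5) and (0, 9)
A / 2        # divides every entry by 2
A %/% 2      # integer division of every entry: rows (1, 0) and (0, 1)

# "Undoing" a matrix multiplication with solve()
product <- A %*% b      # 17 and 21
solve(A, product)       # recovers 5 and 7
```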

How to Filter and Transform Data from Arrays

Selecting data from arrays is similar to what we did before. You can select an individual entry using the [n] notation. If you use the script below, you can re-create the array we used in the previous blog post on object types:

Run this code:

  matrix3 <- matrix(c(2,0,1,4,5,2,3,4),nrow=4,ncol=2)

  matrix4 <- matrix(c(4,3,5,2,1,6,4,5),nrow=4,ncol=2)

  matrix5 <- matrix(c(1,3,1,2,3,5,6,2),nrow=4,ncol=2)

  array1 <- array(c(matrix3,matrix4,matrix5),dim=c(4,2,3))

  array1

This script creates three separate matrices and then stacks them into an array.

Now we'll select the first and twenty-second entry from that array using the [n] notation:

Run this code:

  array1[1]



See output below:

  [1] 2



Run this code:

  array1[22]



See output below:

  [1] 5

To help you visualize this, I highlighted the first and twenty-second values from the array in the illustration below:

With arrays, selecting specific columns or rows gets confusing because arrays can have multiple dimensions. That introduces the [r,c,d] notation.

Down below, we select the entire second row of every matrix in our array:

Run this code:

  array1[2,,]



See output below:

       [,1] [,2] [,3]

  [1,]   0    3    3

  [2,]   2    6    5

You may have noticed that this "flipped" the direction. R isn't trying to confuse you. Instead, it's displaying the previous, individual matrices as columns. So column 1 shows the results from matrix 1 in the previous array. Keep this in mind as you interact with arrays. The output may not always be intuitive.

Down below is an illustration of how R processed this command. First, R will select the second row from each matrix (or level) of the array...

It will then pivot those rows and output them into a new matrix, with each column representing a level of the original array...

Just like rows, we can also select the second column of every matrix within the array:

Run this code:

  array1[,2,]



See output below:

       [,1] [,2] [,3]

  [1,]   5    1    3

  [2,]   2    6    5

  [3,]   3    4    6

  [4,]   4    5    2

Down below is an illustration of how R processes this command. R takes the second column from each level of the array and outputs it. Each column of the output represents the level of the original array.

We can also select every entry in the third matrix of our array...

Run this code:

  array1[,,3]



See output below:

       [,1] [,2]

  [1,]   1    3

  [2,]   3    5

  [3,]   1    6

  [4,]   2    2

We can even get a specific entry by selecting the second row, second column of the third matrix in our array:

Run this code:

  array1[2,2,3]



See output below:

  [1] 5
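The [n] and [r,c,d] notations are two ways of reaching the same entries. R stores an array column by column, then matrix by matrix, so you can work out the single position from the row, column, and matrix numbers. Here's a minimal sketch of that arithmetic for a 4 x 2 x 3 array with the same values as ours. The names arr, r, k, d, and n are my own, picked to avoid overwriting array1:

```r
arr <- array(c(2, 0, 1, 4, 5, 2, 3, 4,    # first matrix
               4, 3, 5, 2, 1, 6, 4, 5,    # second matrix
               1, 3, 1, 2, 3, 5, 6, 2),   # third matrix
             dim = c(4, 2, 3))

r <- 2   # row
k <- 2   # column
d <- 3   # matrix (third dimension)

rows <- dim(arr)[1]   # 4 rows per matrix
cols <- dim(arr)[2]   # 2 columns per matrix

# Position = row + full columns skipped + full matrices skipped
n <- r + (k - 1) * rows + (d - 1) * rows * cols
n                       # 22

arr[n] == arr[r, k, d]  # TRUE, both select the value 5
```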

You can transform arrays in the same way as the other objects. Copy and paste the code below to your own computer and play around with the array values:

Run this code:

  array1

  array1[2] <- 2 #Changes 2nd value to a 2

  array1[,2,] <- 2 #Changes entire 2nd column of each matrix to a 2

  array1[2,1,3] <- 0 #Changes the 2nd row, 1st column, 3rd matrix to 0

  array1[,,3] <- array1[,,3] / 3 #Divides entire 3rd matrix by 3

How to Filter and Transform Data from a Data Frame

There are two approaches you can take to selecting data from a data frame. There's the "classic" approach, which I'll show you in this post, and then there's the dplyr approach.

The classic approach uses base R to interact with data frames. The dplyr approach uses a package called dplyr to transform the data. The dplyr syntax is far more readable, which is handy for really long scripts. I personally recommend the dplyr approach and will detail it in a later post.

This post will cover the classic approach.

To follow my examples, you'll need to re-create the James Bond data frame used in previous posts:

Run the following code to build the individual vectors:

  filmname <- c("Skyfall","Thunderball","Goldfinger","Spectre",

    "Live and Let Die","You Only Live Twice",

    "The Spy Who Loved Me","Casino Royale","Moonraker",

    "Diamonds Are Forever","Quantum of Solace",

    "From Russia with Love","Die Another Day",

    "Goldeneye","On Her Majesty's Secret Service",

    "The World is Not Enough","For Your Eyes Only",

    "Tomorrow Never Dies","The Man with the Golden Gun",

    "Dr. No","Octopussy","The Living Daylights",

    "A View to a Kill","Licence to Kill")

  year <- c("2012","1965","1964","2015","1973","1967","1977",

    "2006","1979","1971","2008","1963","2002","1995",

    "1969","1999","1981","1997","1974","1962","1983",

    "1987","1985","1989")

  actor <- c("Daniel Craig","Sean Connery","Sean Connery",

    "Daniel Craig","Roger Moore","Sean Connery",

    "Roger Moore","Daniel Craig","Roger Moore",

    "Sean Connery","Daniel Craig","Sean Connery",

    "Pierce Brosnan","Pierce Brosnan","George Lazenby",

    "Pierce Brosnan","Roger Moore","Pierce Brosnan",

    "Roger Moore","Sean Connery","Roger Moore",

    "Timothy Dalton","Roger Moore","Timothy Dalton")

  gross <- c(1108561008,1014941117,912257512,880669186,

    825110761,756544419,692713752,669789482,655872400,

    648514469,622246378,576277964,543639638,529548711,505899782,

    491617153,486468881,478946402,448249281,440759072,426244352,

    381088866,321172633,285157191)/1000000



The following will build a data frame from the individual vectors:

  bond <- data.frame(filmname=filmname,

            year=year,

            actor=actor,

            gross=gross)

As you can see, a data frame is built using smaller vectors. This gives you a clue to how to select data from a data frame.

For example, you can re-select individual vectors back out using the $ notation:

Run this code:

  bond$filmname



See first five results below:

  [1] "Skyfall"

  [2] "Thunderball"

  [3] "Goldfinger"

  [4] "Spectre"

  [5] "Live and Let Die"

You can also produce a vector using the [,c] notation. What this means is you're leaving the row blank (which keeps every row) and selecting a column number. We generate the same result as our last example with this method.

Run this code:

  bond[,1]



See first five results below:

  [1] "Skyfall"

  [2] "Thunderball"

  [3] "Goldfinger"

  [4] "Spectre"

  [5] "Live and Let Die"

If you use the [n] notation, you'll select the same column as before, but you're keeping it in a data frame structure.

Run this code:

  bond[1]



See screenshot of results below:

As you may have noticed, the [n] notation in this context works differently than other object types. For data frames, [n] selects the nth column and not the nth data entry.
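If you're ever unsure which structure a selection returns, class() will tell you. Here's a minimal sketch using a small stand-in data frame. The name films and its two rows are hypothetical and only there to keep the block short. I also set stringsAsFactors = FALSE so the text column stays as text on older versions of R:

```r
films <- data.frame(filmname = c("Skyfall", "Spectre"),
                    year = c(2012, 2015),
                    stringsAsFactors = FALSE)

class(films[1])            # "data.frame" (still a table with one column)
class(films[, 1])          # "character"  (a plain vector)
class(films$filmname)      # "character"  (same vector as films[, 1])

# Add drop = FALSE if you want [,c] to keep the data frame structure
class(films[, 1, drop = FALSE])   # "data.frame"
```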

You can use the [n:n] or the [,n:n] notation to select multiple columns. Both will be presented as a data frame.

Run this code:

  bond[1:3]

  bond[,1:3]



See screenshot of results below:

To select an individual row or a range of rows, you use the same [r,c] notation as before:

Run this code:

  bond[1:3,]



See screenshot of results below:

You can also exclude rows or columns using the negative "-" sign before the row or column numbers:

Run this code to exclude rows 1 through 20:

  bond[-1:-20,]



See screenshots of results below:

Run this code to select row 3 and exclude the 4th column:

  bond[3,-4]



See screenshots of results below:

Just like the other object types, you can use logical and arithmetic operators, which makes it easy to filter to what you need. (Note: this is where things start to get complicated with data frame filtering and why I suggest the dplyr package.)

Let's say we want to filter by year. We want only Bond films made in 1990 or later. First, we'll generate our TRUE / FALSE output.

Go ahead and take the script below and run it in your own console:

Run this code:

  bond["year"]>=1990



See first five results below:

  [1] TRUE

  [2] FALSE

  [3] FALSE

  [4] TRUE

  [5] FALSE

Now, you'll notice that I selected the column with the bond["year"] notation, which keeps the result in a one-column table structure. Had I used the bond$year notation, the result would've been a plain vector of TRUE and FALSE values. Either form works for the row filters later in this section.

Next, we need to plug this into another script:

Run this code:

  bond[bond["year"]>=1990]



See screenshots of results below:

If you notice though, this doesn't cleanly give us the information we need.

We need to make sure we preserve the columns. We can do this by simply adding a comma:

Run this code to include all columns:

  bond[bond["year"]>=1990,]





See screenshots of results below:

And we can also limit the number of columns we pull by specifying them within the brackets:

Run this code to include only columns 1 through 3:

  bond[bond["year"]>=1990,1:3]



See screenshots of results below:
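One thing to watch: we built the year column from quoted values, so R stores it as text. The comparison above happens to work because every year has four digits, but converting the column to numbers is safer. Here's a minimal sketch on a small stand-in data frame (films is hypothetical, not the full bond data). It also shows a vector-style filter, two conditions joined with &, and base R's subset() function:

```r
films <- data.frame(
  filmname = c("Skyfall", "Goldfinger", "Goldeneye"),
  year = c("2012", "1964", "1995"),
  gross = c(1109, 912, 530),
  stringsAsFactors = FALSE
)

# Convert the text years to numbers
films$year <- as.numeric(films$year)

# Row filter using a plain logical vector
recent <- films[films$year >= 1990, ]
recent$filmname        # "Skyfall" "Goldeneye"

# Two conditions at once, returning a single column
films[films$year >= 1990 & films$gross > 1000, "filmname"]   # "Skyfall"

# subset() expresses the same kind of filter with less bracket work
subset(films, year >= 1990, select = c(filmname, year))
```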

How to Filter a List

Unlike the other object types, I won't go into detail about transforming a list. The reason is that lists are usually reserved as an output of various functions. Or they're a handy way of bunching other objects together. If you wanted to transform an object within a list, you'd probably transform that object directly.

Filtering a list is a useful skill to have though.

The script below creates a model using our James Bond data. That creates a list of the various calculations in a regression analysis. The names() function then shows you all the objects contained within the list:

Run this code:

  bondmodel <- lm(gross ~ actor,data=bond)

  names(bondmodel)



See screenshots of results below:

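We can select any of these objects within the list with the $, ["object_name"], or [n] notations. The $ notation returns the object itself, while the two bracket notations return a smaller list that contains the object. Copy and paste the following script into your console and see the results: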

Run this code:

  bondmodel$coefficients

  bondmodel["coefficients"]

  bondmodel[1]



See screenshots of results below:

Now here's where things get tricky. Let's say we want to filter down to a smaller value within the objects of the list. That changes depending on those object types. Confused?

It's better if we go with a simpler example than the list generated by the lm() function.

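Down below, I create a list using some of the other objects we made in this lesson. (The vectors v1 and v3 aren't created in this post, so they need to exist in your session already, for example from the previous post on object types.)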

Run this code:

  list1 <- list(vector1=v1,

                vector2=v2,

                vector3=v3,

                matrix1=matrix1,

                array1=array1,

                bond=bond)

  list1

I can select any one of those objects from the list using the $ notation:

Run this code to select vector 1 from the list:

  list1$vector1



See output below:

  [1] TRUE FALSE TRUE



Run this code to select matrix 1 from the list:

  list1$matrix1



See output below:

       [,1] [,2]

  [1,]   2    1

  [2,]   0    3

Now let's say I want to select a specific data point from the list. For example, let's say I want to know the Bond actors. I know that information was stored within a data frame within the list. To pull that data, I use a combination of filtering techniques.

First, I have to pull the data frame from the list. I do that with the $ notation. Then, I treat the object as a normal data frame.

Here's what I mean:

Run this code to select the actor column in the data frame structure:

  list1$bond[3]



See screenshots of output below:

Run this code to select the actor column and convert it to a vector:

  list1$bond$actor



See screenshots of output below:

I can do the same with the other object types. Copy and paste the following code to your own console and see how it interacts with the objects within the list:

Run this code:

  list1$vector1[3]

  list1$vector3[list1$vector3>=2]

  list1$array1[,2,]
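The difference between pulling an object out of a list and keeping it wrapped in a list is easy to miss, so here's a minimal sketch on a tiny list. The name demo and its contents are hypothetical. Besides $ and single brackets, it uses double brackets, [[ ]], which return the object itself just like $ does:

```r
demo <- list(numbers = c(1, 5, 5), words = c("Hola", "Howdy"))

class(demo["numbers"])     # "list"    (single brackets keep the list wrapper)
class(demo[["numbers"]])   # "numeric" (double brackets return the object itself)
class(demo$numbers)        # "numeric" (same as the double brackets)

# Once the object is pulled out, filter it like any other vector
demo[["numbers"]][2]                  # 5
demo$words[demo$words == "Howdy"]     # "Howdy"
```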

Things to Remember

Filtering and transforming data in R depend on the object type. Pay close attention to how the data is structured before deciding the best way to interact with it.

Download the free beta version of my book, R Programming in Plain English, to get practice problems and solutions over these concepts.
