Outline

Data types in R

Data frame

Data manipulation

Typical tasks:

Base R can do all of those things, but it can be pretty low-level (focus on coding instead of analyzing data) and awkward, so a number of alternatives have been put forward over time.

Currently, the most popular such alternative solution for data manipulation is the package dplyr. It’s so good at what it does, and integrates so well with other popular tools like ggplot2, that it has rapidly become the de-facto standard and it is what we will focus on today.

Dplyr has a set of functions, or verbs in its terminology, that each deal with one of the above tasks:

Each verb works similarly:

Select

You can use the select function to focus on a subset of variables.

library(dplyr)
select(mtcars,mpg,wt)
##                      mpg    wt
## Mazda RX4           21.0 2.620
## Mazda RX4 Wag       21.0 2.875
## Datsun 710          22.8 2.320
## Hornet 4 Drive      21.4 3.215
## Hornet Sportabout   18.7 3.440
## Valiant             18.1 3.460
## Duster 360          14.3 3.570
## Merc 240D           24.4 3.190
## Merc 230            22.8 3.150
## Merc 280            19.2 3.440
## Merc 280C           17.8 3.440
## Merc 450SE          16.4 4.070
## Merc 450SL          17.3 3.730
## Merc 450SLC         15.2 3.780
## Cadillac Fleetwood  10.4 5.250
## Lincoln Continental 10.4 5.424
## Chrysler Imperial   14.7 5.345
## Fiat 128            32.4 2.200
## Honda Civic         30.4 1.615
## Toyota Corolla      33.9 1.835
## Toyota Corona       21.5 2.465
## Dodge Challenger    15.5 3.520
## AMC Javelin         15.2 3.435
## Camaro Z28          13.3 3.840
## Pontiac Firebird    19.2 3.845
## Fiat X1-9           27.3 1.935
## Porsche 914-2       26.0 2.140
## Lotus Europa        30.4 1.513
## Ford Pantera L      15.8 3.170
## Ferrari Dino        19.7 2.770
## Maserati Bora       15.0 3.570
## Volvo 142E          21.4 2.780

There are many helpful functions that can be used with select to describe which variables to keep:

  • starts_with(x, ignore.case = TRUE): names starts with x
  • ends_with(x, ignore.case = TRUE): names ends in x
  • contains(x, ignore.case = TRUE): selects all variables whose name contains x
  • matches(x, ignore.case = TRUE): selects all variables whose name matches the regular expression x
  • num_range(“x”, 1:5, width = 2): selects all variables (numerically) from x01 to x05.
  • one_of(“x”, “y”, “z”): selects variables provided in a character vector.
  • everything(): selects all variables.
select(mtcars,starts_with("d"))
##                      disp drat
## Mazda RX4           160.0 3.90
## Mazda RX4 Wag       160.0 3.90
## Datsun 710          108.0 3.85
## Hornet 4 Drive      258.0 3.08
## Hornet Sportabout   360.0 3.15
## Valiant             225.0 2.76
## Duster 360          360.0 3.21
## Merc 240D           146.7 3.69
## Merc 230            140.8 3.92
## Merc 280            167.6 3.92
## Merc 280C           167.6 3.92
## Merc 450SE          275.8 3.07
## Merc 450SL          275.8 3.07
## Merc 450SLC         275.8 3.07
## Cadillac Fleetwood  472.0 2.93
## Lincoln Continental 460.0 3.00
## Chrysler Imperial   440.0 3.23
## Fiat 128             78.7 4.08
## Honda Civic          75.7 4.93
## Toyota Corolla       71.1 4.22
## Toyota Corona       120.1 3.70
## Dodge Challenger    318.0 2.76
## AMC Javelin         304.0 3.15
## Camaro Z28          350.0 3.73
## Pontiac Firebird    400.0 3.08
## Fiat X1-9            79.0 4.08
## Porsche 914-2       120.3 4.43
## Lotus Europa         95.1 3.77
## Ford Pantera L      351.0 4.22
## Ferrari Dino        145.0 3.62
## Maserati Bora       301.0 3.54
## Volvo 142E          121.0 4.11

This trick is handy to reorder the variables so that the ones you’re most interested in are at the front, without dropping any:

select(mtcars, cyl, everything())
##                     cyl  mpg  disp  hp drat    wt  qsec vs am gear carb
## Mazda RX4             6 21.0 160.0 110 3.90 2.620 16.46  0  1    4    4
## Mazda RX4 Wag         6 21.0 160.0 110 3.90 2.875 17.02  0  1    4    4
## Datsun 710            4 22.8 108.0  93 3.85 2.320 18.61  1  1    4    1
## Hornet 4 Drive        6 21.4 258.0 110 3.08 3.215 19.44  1  0    3    1
## Hornet Sportabout     8 18.7 360.0 175 3.15 3.440 17.02  0  0    3    2
## Valiant               6 18.1 225.0 105 2.76 3.460 20.22  1  0    3    1
## Duster 360            8 14.3 360.0 245 3.21 3.570 15.84  0  0    3    4
## Merc 240D             4 24.4 146.7  62 3.69 3.190 20.00  1  0    4    2
## Merc 230              4 22.8 140.8  95 3.92 3.150 22.90  1  0    4    2
## Merc 280              6 19.2 167.6 123 3.92 3.440 18.30  1  0    4    4
## Merc 280C             6 17.8 167.6 123 3.92 3.440 18.90  1  0    4    4
## Merc 450SE            8 16.4 275.8 180 3.07 4.070 17.40  0  0    3    3
## Merc 450SL            8 17.3 275.8 180 3.07 3.730 17.60  0  0    3    3
## Merc 450SLC           8 15.2 275.8 180 3.07 3.780 18.00  0  0    3    3
## Cadillac Fleetwood    8 10.4 472.0 205 2.93 5.250 17.98  0  0    3    4
## Lincoln Continental   8 10.4 460.0 215 3.00 5.424 17.82  0  0    3    4
## Chrysler Imperial     8 14.7 440.0 230 3.23 5.345 17.42  0  0    3    4
## Fiat 128              4 32.4  78.7  66 4.08 2.200 19.47  1  1    4    1
## Honda Civic           4 30.4  75.7  52 4.93 1.615 18.52  1  1    4    2
## Toyota Corolla        4 33.9  71.1  65 4.22 1.835 19.90  1  1    4    1
## Toyota Corona         4 21.5 120.1  97 3.70 2.465 20.01  1  0    3    1
## Dodge Challenger      8 15.5 318.0 150 2.76 3.520 16.87  0  0    3    2
## AMC Javelin           8 15.2 304.0 150 3.15 3.435 17.30  0  0    3    2
## Camaro Z28            8 13.3 350.0 245 3.73 3.840 15.41  0  0    3    4
## Pontiac Firebird      8 19.2 400.0 175 3.08 3.845 17.05  0  0    3    2
## Fiat X1-9             4 27.3  79.0  66 4.08 1.935 18.90  1  1    4    1
## Porsche 914-2         4 26.0 120.3  91 4.43 2.140 16.70  0  1    5    2
## Lotus Europa          4 30.4  95.1 113 3.77 1.513 16.90  1  1    5    2
## Ford Pantera L        8 15.8 351.0 264 4.22 3.170 14.50  0  1    5    4
## Ferrari Dino          6 19.7 145.0 175 3.62 2.770 15.50  0  1    5    6
## Maserati Bora         8 15.0 301.0 335 3.54 3.570 14.60  0  1    5    8
## Volvo 142E            4 21.4 121.0 109 4.11 2.780 18.60  1  1    4    2

Using a named argument will rename a variable:

select(mtcars, mpg, weight=wt)
##                      mpg weight
## Mazda RX4           21.0  2.620
## Mazda RX4 Wag       21.0  2.875
## Datsun 710          22.8  2.320
## Hornet 4 Drive      21.4  3.215
## Hornet Sportabout   18.7  3.440
## Valiant             18.1  3.460
## Duster 360          14.3  3.570
## Merc 240D           24.4  3.190
## Merc 230            22.8  3.150
## Merc 280            19.2  3.440
## Merc 280C           17.8  3.440
## Merc 450SE          16.4  4.070
## Merc 450SL          17.3  3.730
## Merc 450SLC         15.2  3.780
## Cadillac Fleetwood  10.4  5.250
## Lincoln Continental 10.4  5.424
## Chrysler Imperial   14.7  5.345
## Fiat 128            32.4  2.200
## Honda Civic         30.4  1.615
## Toyota Corolla      33.9  1.835
## Toyota Corona       21.5  2.465
## Dodge Challenger    15.5  3.520
## AMC Javelin         15.2  3.435
## Camaro Z28          13.3  3.840
## Pontiac Firebird    19.2  3.845
## Fiat X1-9           27.3  1.935
## Porsche 914-2       26.0  2.140
## Lotus Europa        30.4  1.513
## Ford Pantera L      15.8  3.170
## Ferrari Dino        19.7  2.770
## Maserati Bora       15.0  3.570
## Volvo 142E          21.4  2.780

(You can also use rename() to change variable names while keeping all columns as they were.)

Filter

You can use filter to select specific rows based on a logical condition of a variable. To specify more than one condition, just give them as additional arguments. The conditions are joined together as a logical and:

filter(mtcars, cyl==8)
##     mpg cyl  disp  hp drat    wt  qsec vs am gear carb
## 1  18.7   8 360.0 175 3.15 3.440 17.02  0  0    3    2
## 2  14.3   8 360.0 245 3.21 3.570 15.84  0  0    3    4
## 3  16.4   8 275.8 180 3.07 4.070 17.40  0  0    3    3
## 4  17.3   8 275.8 180 3.07 3.730 17.60  0  0    3    3
## 5  15.2   8 275.8 180 3.07 3.780 18.00  0  0    3    3
## 6  10.4   8 472.0 205 2.93 5.250 17.98  0  0    3    4
## 7  10.4   8 460.0 215 3.00 5.424 17.82  0  0    3    4
## 8  14.7   8 440.0 230 3.23 5.345 17.42  0  0    3    4
## 9  15.5   8 318.0 150 2.76 3.520 16.87  0  0    3    2
## 10 15.2   8 304.0 150 3.15 3.435 17.30  0  0    3    2
## 11 13.3   8 350.0 245 3.73 3.840 15.41  0  0    3    4
## 12 19.2   8 400.0 175 3.08 3.845 17.05  0  0    3    2
## 13 15.8   8 351.0 264 4.22 3.170 14.50  0  1    5    4
## 14 15.0   8 301.0 335 3.54 3.570 14.60  0  1    5    8
filter(mtcars, cyl==8, carb==3)
##    mpg cyl  disp  hp drat   wt qsec vs am gear carb
## 1 16.4   8 275.8 180 3.07 4.07 17.4  0  0    3    3
## 2 17.3   8 275.8 180 3.07 3.73 17.6  0  0    3    3
## 3 15.2   8 275.8 180 3.07 3.78 18.0  0  0    3    3

To use the logical or to join conditions, you must use the | operator explicitly:

filter(mtcars, cyl==4 | carb==8)
##     mpg cyl  disp  hp drat    wt  qsec vs am gear carb
## 1  22.8   4 108.0  93 3.85 2.320 18.61  1  1    4    1
## 2  24.4   4 146.7  62 3.69 3.190 20.00  1  0    4    2
## 3  22.8   4 140.8  95 3.92 3.150 22.90  1  0    4    2
## 4  32.4   4  78.7  66 4.08 2.200 19.47  1  1    4    1
## 5  30.4   4  75.7  52 4.93 1.615 18.52  1  1    4    2
## 6  33.9   4  71.1  65 4.22 1.835 19.90  1  1    4    1
## 7  21.5   4 120.1  97 3.70 2.465 20.01  1  0    3    1
## 8  27.3   4  79.0  66 4.08 1.935 18.90  1  1    4    1
## 9  26.0   4 120.3  91 4.43 2.140 16.70  0  1    5    2
## 10 30.4   4  95.1 113 3.77 1.513 16.90  1  1    5    2
## 11 15.0   8 301.0 335 3.54 3.570 14.60  0  1    5    8
## 12 21.4   4 121.0 109 4.11 2.780 18.60  1  1    4    2
filter(mtcars, cyl==8 & carb==8)
##   mpg cyl disp  hp drat   wt qsec vs am gear carb
## 1  15   8  301 335 3.54 3.57 14.6  0  1    5    8
mtcars[mtcars$cyl==4 | mtcars$carb==8, ]
##                 mpg cyl  disp  hp drat    wt  qsec vs am gear carb
## Datsun 710     22.8   4 108.0  93 3.85 2.320 18.61  1  1    4    1
## Merc 240D      24.4   4 146.7  62 3.69 3.190 20.00  1  0    4    2
## Merc 230       22.8   4 140.8  95 3.92 3.150 22.90  1  0    4    2
## Fiat 128       32.4   4  78.7  66 4.08 2.200 19.47  1  1    4    1
## Honda Civic    30.4   4  75.7  52 4.93 1.615 18.52  1  1    4    2
## Toyota Corolla 33.9   4  71.1  65 4.22 1.835 19.90  1  1    4    1
## Toyota Corona  21.5   4 120.1  97 3.70 2.465 20.01  1  0    3    1
## Fiat X1-9      27.3   4  79.0  66 4.08 1.935 18.90  1  1    4    1
## Porsche 914-2  26.0   4 120.3  91 4.43 2.140 16.70  0  1    5    2
## Lotus Europa   30.4   4  95.1 113 3.77 1.513 16.90  1  1    5    2
## Maserati Bora  15.0   8 301.0 335 3.54 3.570 14.60  0  1    5    8
## Volvo 142E     21.4   4 121.0 109 4.11 2.780 18.60  1  1    4    2

If you need to select several conditions on the same variable you can use %in%:

filter(mtcars,carb==3 | carb==6 | carb==8)
##    mpg cyl  disp  hp drat   wt qsec vs am gear carb
## 1 16.4   8 275.8 180 3.07 4.07 17.4  0  0    3    3
## 2 17.3   8 275.8 180 3.07 3.73 17.6  0  0    3    3
## 3 15.2   8 275.8 180 3.07 3.78 18.0  0  0    3    3
## 4 19.7   6 145.0 175 3.62 2.77 15.5  0  1    5    6
## 5 15.0   8 301.0 335 3.54 3.57 14.6  0  1    5    8
filter(mtcars,carb %in% c(3,6,8))
##    mpg cyl  disp  hp drat   wt qsec vs am gear carb
## 1 16.4   8 275.8 180 3.07 4.07 17.4  0  0    3    3
## 2 17.3   8 275.8 180 3.07 3.73 17.6  0  0    3    3
## 3 15.2   8 275.8 180 3.07 3.78 18.0  0  0    3    3
## 4 19.7   6 145.0 175 3.62 2.77 15.5  0  1    5    6
## 5 15.0   8 301.0 335 3.54 3.57 14.6  0  1    5    8

To use numeric indices the dplyr function is slice.

slice(mtcars,c(1,3,5))
## # A tibble: 3 x 11
##     mpg   cyl  disp    hp  drat    wt  qsec    vs    am  gear  carb
##   <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1  21.0    6.  160.  110.  3.90  2.62  16.5    0.    1.    4.    4.
## 2  22.8    4.  108.   93.  3.85  2.32  18.6    1.    1.    4.    1.
## 3  18.7    8.  360.  175.  3.15  3.44  17.0    0.    0.    3.    2.

Note: slice and filter do not carry the row names with the subset of rows.

If you wish to include the row names you need to add them to the data frame as a variable

filter(add_rownames(mtcars), cyl==8, carb==3)
## Warning: Deprecated, use tibble::rownames_to_column() instead.
## # A tibble: 3 x 12
##   rowname       mpg   cyl  disp    hp  drat    wt  qsec    vs    am  gear
##   <chr>       <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 Merc 450SE   16.4    8.  276.  180.  3.07  4.07  17.4    0.    0.    3.
## 2 Merc 450SL   17.3    8.  276.  180.  3.07  3.73  17.6    0.    0.    3.
## 3 Merc 450SLC  15.2    8.  276.  180.  3.07  3.78  18.0    0.    0.    3.
## # ... with 1 more variable: carb <dbl>

Chaining with %>%

when combining several function call together the command can be very hard to read

as.data.frame(select(filter(add_rownames(mtcars),mpg>=30),rowname,mpg,cyl,hp))
## Warning: Deprecated, use tibble::rownames_to_column() instead.
##          rowname  mpg cyl  hp
## 1       Fiat 128 32.4   4  66
## 2    Honda Civic 30.4   4  52
## 3 Toyota Corolla 33.9   4  65
## 4   Lotus Europa 30.4   4 113

You can chain commands together using the %>% operator.

f(x) %>% g(y) is the same as g(f(x),y)
add_rownames(mtcars) %>%
  filter(mpg>=30) %>%
  select(rowname,mpg,cyl,hp) %>%
  as.data.frame()
## Warning: Deprecated, use tibble::rownames_to_column() instead.
##          rowname  mpg cyl  hp
## 1       Fiat 128 32.4   4  66
## 2    Honda Civic 30.4   4  52
## 3 Toyota Corolla 33.9   4  65
## 4   Lotus Europa 30.4   4 113

The default is to put the left hand side as the first argument of the right hand side. You can use . as a placeholder to change this behaviour

filter(mtcars, cyl<8) %>%
  boxplot(mpg~cyl, data=.)
boxplot(mpg~cyl, data=mtcars, subset= cyl<8)