Typical tasks:
Base R can do all of those things, but it can be pretty low-level (focus on coding instead of analyzing data) and awkward, so a number of alternatives have been put forward over time.
Currently, the most popular such alternative solution for data manipulation is the package dplyr. It’s so good at what it does, and integrates so well with other popular tools like ggplot2, that it has rapidly become the de-facto standard and it is what we will focus on today.
Dplyr has a set of functions, or verbs in its terminology, that each deal with one of the above tasks:
select
: view only some variablesfilter
: choose observations by their valuesarrange
: order observations (rows)mutate
: create new variablessummarise
: calculate a summary of many variable valuesEach verb works similarly:
You can use the select
function to focus on a subset of variables.
library(dplyr)
select(mtcars,mpg,wt)
## mpg wt
## Mazda RX4 21.0 2.620
## Mazda RX4 Wag 21.0 2.875
## Datsun 710 22.8 2.320
## Hornet 4 Drive 21.4 3.215
## Hornet Sportabout 18.7 3.440
## Valiant 18.1 3.460
## Duster 360 14.3 3.570
## Merc 240D 24.4 3.190
## Merc 230 22.8 3.150
## Merc 280 19.2 3.440
## Merc 280C 17.8 3.440
## Merc 450SE 16.4 4.070
## Merc 450SL 17.3 3.730
## Merc 450SLC 15.2 3.780
## Cadillac Fleetwood 10.4 5.250
## Lincoln Continental 10.4 5.424
## Chrysler Imperial 14.7 5.345
## Fiat 128 32.4 2.200
## Honda Civic 30.4 1.615
## Toyota Corolla 33.9 1.835
## Toyota Corona 21.5 2.465
## Dodge Challenger 15.5 3.520
## AMC Javelin 15.2 3.435
## Camaro Z28 13.3 3.840
## Pontiac Firebird 19.2 3.845
## Fiat X1-9 27.3 1.935
## Porsche 914-2 26.0 2.140
## Lotus Europa 30.4 1.513
## Ford Pantera L 15.8 3.170
## Ferrari Dino 19.7 2.770
## Maserati Bora 15.0 3.570
## Volvo 142E 21.4 2.780
There are many helpful functions that can be used with select
to describe which variables to keep:
select(mtcars,starts_with("d"))
## disp drat
## Mazda RX4 160.0 3.90
## Mazda RX4 Wag 160.0 3.90
## Datsun 710 108.0 3.85
## Hornet 4 Drive 258.0 3.08
## Hornet Sportabout 360.0 3.15
## Valiant 225.0 2.76
## Duster 360 360.0 3.21
## Merc 240D 146.7 3.69
## Merc 230 140.8 3.92
## Merc 280 167.6 3.92
## Merc 280C 167.6 3.92
## Merc 450SE 275.8 3.07
## Merc 450SL 275.8 3.07
## Merc 450SLC 275.8 3.07
## Cadillac Fleetwood 472.0 2.93
## Lincoln Continental 460.0 3.00
## Chrysler Imperial 440.0 3.23
## Fiat 128 78.7 4.08
## Honda Civic 75.7 4.93
## Toyota Corolla 71.1 4.22
## Toyota Corona 120.1 3.70
## Dodge Challenger 318.0 2.76
## AMC Javelin 304.0 3.15
## Camaro Z28 350.0 3.73
## Pontiac Firebird 400.0 3.08
## Fiat X1-9 79.0 4.08
## Porsche 914-2 120.3 4.43
## Lotus Europa 95.1 3.77
## Ford Pantera L 351.0 4.22
## Ferrari Dino 145.0 3.62
## Maserati Bora 301.0 3.54
## Volvo 142E 121.0 4.11
This trick is handy to reorder the variables so that the ones you’re most interested in are at the front, without dropping any:
select(mtcars, cyl, everything())
## cyl mpg disp hp drat wt qsec vs am gear carb
## Mazda RX4 6 21.0 160.0 110 3.90 2.620 16.46 0 1 4 4
## Mazda RX4 Wag 6 21.0 160.0 110 3.90 2.875 17.02 0 1 4 4
## Datsun 710 4 22.8 108.0 93 3.85 2.320 18.61 1 1 4 1
## Hornet 4 Drive 6 21.4 258.0 110 3.08 3.215 19.44 1 0 3 1
## Hornet Sportabout 8 18.7 360.0 175 3.15 3.440 17.02 0 0 3 2
## Valiant 6 18.1 225.0 105 2.76 3.460 20.22 1 0 3 1
## Duster 360 8 14.3 360.0 245 3.21 3.570 15.84 0 0 3 4
## Merc 240D 4 24.4 146.7 62 3.69 3.190 20.00 1 0 4 2
## Merc 230 4 22.8 140.8 95 3.92 3.150 22.90 1 0 4 2
## Merc 280 6 19.2 167.6 123 3.92 3.440 18.30 1 0 4 4
## Merc 280C 6 17.8 167.6 123 3.92 3.440 18.90 1 0 4 4
## Merc 450SE 8 16.4 275.8 180 3.07 4.070 17.40 0 0 3 3
## Merc 450SL 8 17.3 275.8 180 3.07 3.730 17.60 0 0 3 3
## Merc 450SLC 8 15.2 275.8 180 3.07 3.780 18.00 0 0 3 3
## Cadillac Fleetwood 8 10.4 472.0 205 2.93 5.250 17.98 0 0 3 4
## Lincoln Continental 8 10.4 460.0 215 3.00 5.424 17.82 0 0 3 4
## Chrysler Imperial 8 14.7 440.0 230 3.23 5.345 17.42 0 0 3 4
## Fiat 128 4 32.4 78.7 66 4.08 2.200 19.47 1 1 4 1
## Honda Civic 4 30.4 75.7 52 4.93 1.615 18.52 1 1 4 2
## Toyota Corolla 4 33.9 71.1 65 4.22 1.835 19.90 1 1 4 1
## Toyota Corona 4 21.5 120.1 97 3.70 2.465 20.01 1 0 3 1
## Dodge Challenger 8 15.5 318.0 150 2.76 3.520 16.87 0 0 3 2
## AMC Javelin 8 15.2 304.0 150 3.15 3.435 17.30 0 0 3 2
## Camaro Z28 8 13.3 350.0 245 3.73 3.840 15.41 0 0 3 4
## Pontiac Firebird 8 19.2 400.0 175 3.08 3.845 17.05 0 0 3 2
## Fiat X1-9 4 27.3 79.0 66 4.08 1.935 18.90 1 1 4 1
## Porsche 914-2 4 26.0 120.3 91 4.43 2.140 16.70 0 1 5 2
## Lotus Europa 4 30.4 95.1 113 3.77 1.513 16.90 1 1 5 2
## Ford Pantera L 8 15.8 351.0 264 4.22 3.170 14.50 0 1 5 4
## Ferrari Dino 6 19.7 145.0 175 3.62 2.770 15.50 0 1 5 6
## Maserati Bora 8 15.0 301.0 335 3.54 3.570 14.60 0 1 5 8
## Volvo 142E 4 21.4 121.0 109 4.11 2.780 18.60 1 1 4 2
Using a named argument will rename a variable:
select(mtcars, mpg, weight=wt)
## mpg weight
## Mazda RX4 21.0 2.620
## Mazda RX4 Wag 21.0 2.875
## Datsun 710 22.8 2.320
## Hornet 4 Drive 21.4 3.215
## Hornet Sportabout 18.7 3.440
## Valiant 18.1 3.460
## Duster 360 14.3 3.570
## Merc 240D 24.4 3.190
## Merc 230 22.8 3.150
## Merc 280 19.2 3.440
## Merc 280C 17.8 3.440
## Merc 450SE 16.4 4.070
## Merc 450SL 17.3 3.730
## Merc 450SLC 15.2 3.780
## Cadillac Fleetwood 10.4 5.250
## Lincoln Continental 10.4 5.424
## Chrysler Imperial 14.7 5.345
## Fiat 128 32.4 2.200
## Honda Civic 30.4 1.615
## Toyota Corolla 33.9 1.835
## Toyota Corona 21.5 2.465
## Dodge Challenger 15.5 3.520
## AMC Javelin 15.2 3.435
## Camaro Z28 13.3 3.840
## Pontiac Firebird 19.2 3.845
## Fiat X1-9 27.3 1.935
## Porsche 914-2 26.0 2.140
## Lotus Europa 30.4 1.513
## Ford Pantera L 15.8 3.170
## Ferrari Dino 19.7 2.770
## Maserati Bora 15.0 3.570
## Volvo 142E 21.4 2.780
(You can also use rename()
to change variable names while keeping all columns as they were.)
You can use filter
to select specific rows based on a logical condition of a variable. To specify more than one condition, just give them as additional arguments. The conditions are joined together as a logical and:
filter(mtcars, cyl==8)
## mpg cyl disp hp drat wt qsec vs am gear carb
## 1 18.7 8 360.0 175 3.15 3.440 17.02 0 0 3 2
## 2 14.3 8 360.0 245 3.21 3.570 15.84 0 0 3 4
## 3 16.4 8 275.8 180 3.07 4.070 17.40 0 0 3 3
## 4 17.3 8 275.8 180 3.07 3.730 17.60 0 0 3 3
## 5 15.2 8 275.8 180 3.07 3.780 18.00 0 0 3 3
## 6 10.4 8 472.0 205 2.93 5.250 17.98 0 0 3 4
## 7 10.4 8 460.0 215 3.00 5.424 17.82 0 0 3 4
## 8 14.7 8 440.0 230 3.23 5.345 17.42 0 0 3 4
## 9 15.5 8 318.0 150 2.76 3.520 16.87 0 0 3 2
## 10 15.2 8 304.0 150 3.15 3.435 17.30 0 0 3 2
## 11 13.3 8 350.0 245 3.73 3.840 15.41 0 0 3 4
## 12 19.2 8 400.0 175 3.08 3.845 17.05 0 0 3 2
## 13 15.8 8 351.0 264 4.22 3.170 14.50 0 1 5 4
## 14 15.0 8 301.0 335 3.54 3.570 14.60 0 1 5 8
filter(mtcars, cyl==8, carb==3)
## mpg cyl disp hp drat wt qsec vs am gear carb
## 1 16.4 8 275.8 180 3.07 4.07 17.4 0 0 3 3
## 2 17.3 8 275.8 180 3.07 3.73 17.6 0 0 3 3
## 3 15.2 8 275.8 180 3.07 3.78 18.0 0 0 3 3
To use the logical or to join conditions, you must use the |
operator explicitly:
filter(mtcars, cyl==4 | carb==8)
## mpg cyl disp hp drat wt qsec vs am gear carb
## 1 22.8 4 108.0 93 3.85 2.320 18.61 1 1 4 1
## 2 24.4 4 146.7 62 3.69 3.190 20.00 1 0 4 2
## 3 22.8 4 140.8 95 3.92 3.150 22.90 1 0 4 2
## 4 32.4 4 78.7 66 4.08 2.200 19.47 1 1 4 1
## 5 30.4 4 75.7 52 4.93 1.615 18.52 1 1 4 2
## 6 33.9 4 71.1 65 4.22 1.835 19.90 1 1 4 1
## 7 21.5 4 120.1 97 3.70 2.465 20.01 1 0 3 1
## 8 27.3 4 79.0 66 4.08 1.935 18.90 1 1 4 1
## 9 26.0 4 120.3 91 4.43 2.140 16.70 0 1 5 2
## 10 30.4 4 95.1 113 3.77 1.513 16.90 1 1 5 2
## 11 15.0 8 301.0 335 3.54 3.570 14.60 0 1 5 8
## 12 21.4 4 121.0 109 4.11 2.780 18.60 1 1 4 2
filter(mtcars, cyl==8 & carb==8)
## mpg cyl disp hp drat wt qsec vs am gear carb
## 1 15 8 301 335 3.54 3.57 14.6 0 1 5 8
mtcars[mtcars$cyl==4 | mtcars$carb==8, ]
## mpg cyl disp hp drat wt qsec vs am gear carb
## Datsun 710 22.8 4 108.0 93 3.85 2.320 18.61 1 1 4 1
## Merc 240D 24.4 4 146.7 62 3.69 3.190 20.00 1 0 4 2
## Merc 230 22.8 4 140.8 95 3.92 3.150 22.90 1 0 4 2
## Fiat 128 32.4 4 78.7 66 4.08 2.200 19.47 1 1 4 1
## Honda Civic 30.4 4 75.7 52 4.93 1.615 18.52 1 1 4 2
## Toyota Corolla 33.9 4 71.1 65 4.22 1.835 19.90 1 1 4 1
## Toyota Corona 21.5 4 120.1 97 3.70 2.465 20.01 1 0 3 1
## Fiat X1-9 27.3 4 79.0 66 4.08 1.935 18.90 1 1 4 1
## Porsche 914-2 26.0 4 120.3 91 4.43 2.140 16.70 0 1 5 2
## Lotus Europa 30.4 4 95.1 113 3.77 1.513 16.90 1 1 5 2
## Maserati Bora 15.0 8 301.0 335 3.54 3.570 14.60 0 1 5 8
## Volvo 142E 21.4 4 121.0 109 4.11 2.780 18.60 1 1 4 2
If you need to select several conditions on the same variable you can use %in%
:
filter(mtcars,carb==3 | carb==6 | carb==8)
## mpg cyl disp hp drat wt qsec vs am gear carb
## 1 16.4 8 275.8 180 3.07 4.07 17.4 0 0 3 3
## 2 17.3 8 275.8 180 3.07 3.73 17.6 0 0 3 3
## 3 15.2 8 275.8 180 3.07 3.78 18.0 0 0 3 3
## 4 19.7 6 145.0 175 3.62 2.77 15.5 0 1 5 6
## 5 15.0 8 301.0 335 3.54 3.57 14.6 0 1 5 8
filter(mtcars,carb %in% c(3,6,8))
## mpg cyl disp hp drat wt qsec vs am gear carb
## 1 16.4 8 275.8 180 3.07 4.07 17.4 0 0 3 3
## 2 17.3 8 275.8 180 3.07 3.73 17.6 0 0 3 3
## 3 15.2 8 275.8 180 3.07 3.78 18.0 0 0 3 3
## 4 19.7 6 145.0 175 3.62 2.77 15.5 0 1 5 6
## 5 15.0 8 301.0 335 3.54 3.57 14.6 0 1 5 8
To use numeric indices the dplyr function is slice
.
slice(mtcars,c(1,3,5))
## # A tibble: 3 x 11
## mpg cyl disp hp drat wt qsec vs am gear carb
## <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 21.0 6. 160. 110. 3.90 2.62 16.5 0. 1. 4. 4.
## 2 22.8 4. 108. 93. 3.85 2.32 18.6 1. 1. 4. 1.
## 3 18.7 8. 360. 175. 3.15 3.44 17.0 0. 0. 3. 2.
Note: slice
and filter
do not carry the row names with the subset of rows.
If you wish to include the row names you need to add them to the data frame as a variable
filter(add_rownames(mtcars), cyl==8, carb==3)
## Warning: Deprecated, use tibble::rownames_to_column() instead.
## # A tibble: 3 x 12
## rowname mpg cyl disp hp drat wt qsec vs am gear
## <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 Merc 450SE 16.4 8. 276. 180. 3.07 4.07 17.4 0. 0. 3.
## 2 Merc 450SL 17.3 8. 276. 180. 3.07 3.73 17.6 0. 0. 3.
## 3 Merc 450SLC 15.2 8. 276. 180. 3.07 3.78 18.0 0. 0. 3.
## # ... with 1 more variable: carb <dbl>
%>%
when combining several function call together the command can be very hard to read
as.data.frame(select(filter(add_rownames(mtcars),mpg>=30),rowname,mpg,cyl,hp))
## Warning: Deprecated, use tibble::rownames_to_column() instead.
## rowname mpg cyl hp
## 1 Fiat 128 32.4 4 66
## 2 Honda Civic 30.4 4 52
## 3 Toyota Corolla 33.9 4 65
## 4 Lotus Europa 30.4 4 113
You can chain commands together using the %>%
operator.
f(x) %>% g(y) is the same as g(f(x),y)
add_rownames(mtcars) %>%
filter(mpg>=30) %>%
select(rowname,mpg,cyl,hp) %>%
as.data.frame()
## Warning: Deprecated, use tibble::rownames_to_column() instead.
## rowname mpg cyl hp
## 1 Fiat 128 32.4 4 66
## 2 Honda Civic 30.4 4 52
## 3 Toyota Corolla 33.9 4 65
## 4 Lotus Europa 30.4 4 113
The default is to put the left hand side as the first argument of the right hand side. You can use .
as a placeholder to change this behaviour
filter(mtcars, cyl<8) %>%
boxplot(mpg~cyl, data=.)
boxplot(mpg~cyl, data=mtcars, subset= cyl<8)