R/dplyr

dplyr is my tool of choice for all things related to data wrangling and getting the information gems out of the raw material. Below I documented some of the most useful and reoccurring task flows to support my forgetful brain.

Create short example data set (with NA)

iris_short % 
        group_by(Species) %>% 
        summarise_all(mean) %>% 
        modify_at(c("Sepal.Length"), ~ifelse(.x<5.1, NA, .))
iris_short
## # A tibble: 3 x 5
##   Species    Sepal.Length Sepal.Width Petal.Length Petal.Width
##                                      
## 1 setosa            NA           3.43         1.46       0.246
## 2 versicolor         5.94        2.77         4.26       1.33 
## 3 virginica          6.59        2.97         5.55       2.03

Compute new variable

iris_short %>% 
        mutate(Petal.Sum = Petal.Length + Petal.Width) 
## # A tibble: 3 x 6
##   Species    Sepal.Length Sepal.Width Petal.Length Petal.Width Petal.Sum
##                                           
## 1 setosa            NA           3.43         1.46       0.246      1.71
## 2 versicolor         5.94        2.77         4.26       1.33       5.59
## 3 virginica          6.59        2.97         5.55       2.03       7.58

Compute and only show new variables

iris_short %>% 
        transmute(Sepal.Sum = Sepal.Length + Sepal.Width, 
                  Petal.Sum = Petal.Length + Petal.Width)
## # A tibble: 3 x 2
##   Sepal.Sum Petal.Sum
##            
## 1     NA         1.71
## 2      8.71      5.59
## 3      9.56      7.58

Replace/recode values in several variables at once

iris_short %>% 
        mutate_at(c("Petal.Length", "Petal.Width"), ~ case_when(.<=1 ~ "Small",
                                                                .4  ~ "Big",
                                                                TRUE ~ NA_character_)) 
## # A tibble: 3 x 5
##   Species    Sepal.Length Sepal.Width Petal.Length Petal.Width
##                                      
## 1 setosa            NA           3.43 Medium       Small      
## 2 versicolor         5.94        2.77 Big          Medium     
## 3 virginica          6.59        2.97 Big          Medium

Compute and overwrite result into same columns

iris_short %>% 
        mutate_at(c("Petal.Length", "Petal.Width"), ~.x*100)
## # A tibble: 3 x 5
##   Species    Sepal.Length Sepal.Width Petal.Length Petal.Width
##                                      
## 1 setosa            NA           3.43         146.        24.6
## 2 versicolor         5.94        2.77         426        133. 
## 3 virginica          6.59        2.97         555.       203.

Take subset of variables, transform it and add to dataframe

iris_short %>% 
        select(4:5) %>% mutate_all(~.*100) %>% bind_cols(iris_short,.)
## # A tibble: 3 x 7
##   Species  Sepal.Length Sepal.Width Petal.Length Petal.Width Petal.Length1
##                                             
## 1 setosa          NA           3.43         1.46       0.246          146.
## 2 versico~         5.94        2.77         4.26       1.33           426 
## 3 virgini~         6.59        2.97         5.55       2.03           555.
## # ... with 1 more variable: Petal.Width1 

Compute mean of variables by row

rowwise() is same as group_by(rowid), were rowid represents row numbers (e.g. created by rowid_to_column())

iris_short %>% 
        rowwise() %>% mutate(Petal.Mean = mean(c(Sepal.Length,Sepal.Width))) 
## Source: local data frame [3 x 6]
## Groups: 
## 
## # A tibble: 3 x 6
##   Species    Sepal.Length Sepal.Width Petal.Length Petal.Width Petal.Mean
##                                            
## 1 setosa            NA           3.43         1.46       0.246      NA   
## 2 versicolor         5.94        2.77         4.26       1.33        4.35
## 3 virginica          6.59        2.97         5.55       2.03        4.78

Replace NAs in several variables at once

iris_short %>% 
        modify_at(c("Sepal.Length", "Sepal.Width"), ~ ifelse(is.na(.),9999,.))
## # A tibble: 3 x 5
##   Species    Sepal.Length Sepal.Width Petal.Length Petal.Width
##                                      
## 1 setosa          9999           3.43         1.46       0.246
## 2 versicolor         5.94        2.77         4.26       1.33 
## 3 virginica          6.59        2.97         5.55       2.03

Change type of multiple variables in one go

iris_short %>% 
        modify_at(c("Sepal.Length", "Sepal.Width"),  as.character)
## # A tibble: 3 x 5
##   Species    Sepal.Length Sepal.Width Petal.Length Petal.Width
##                                      
## 1 setosa              3.428               1.46       0.246
## 2 versicolor 5.936        2.77                4.26       1.33 
## 3 virginica  6.588        2.974               5.55       2.03

Replacing values based on lookup table

2 step process: first left_join(), then merge two variables using if_else()

lookup % 
                  select(-replacement)
lookup
## # A tibble: 2 x 2
##   value     replacement         
##                       
## 1 setosa    Borsten-Schwertlilie
## 2 virginica Virginia Iris
iris_short_n 
## # A tibble: 3 x 5
##   Species              Sepal.Length Sepal.Width Petal.Length Petal.Width
##                                                
## 1 Borsten-Schwertlilie        NA           3.43         1.46       0.246
## 2 versicolor                   5.94        2.77         4.26       1.33 
## 3 Virginia Iris                6.59        2.97         5.55       2.03

Notes for working with factors in R

Factors can be a little confusing in R and some default behaviour is questionable, two especially dangerous pitfalls are:

  • silently introducing NAs! when values and levels do not match up, e.g. in case of typos
  • numbers as factor levels and internal numeric representation might not match; e.g. as.numeric() applied to a factor can provide surpising results

General comments

  • a factor has defined levels, and an internal integer representation
  • labels are not part of the data structure of factors
  • in factor(inputvector, levels, labels): labels is only used as argument to relevel the factor at creation
  • if data values are not present in levels, values are silently replaced by NAs
  • use parse_factor(inputvector, levels) to avoid silently introducing NAs
  • factors are internally ordered according to order in levels argument (if provided)
  • select factor values similar to strings: which(factor==“level1”)

Factors with numeric level

  • make sure numeric level and internal representation are the same (e.g. for Likert-scales), to be able to calculate means correctly using as.numeric()
  • use numeric and factor representation in parallel
  • retrieve correct numerics:
    • f <- factor(c(3.4, 1.2, 5))
    • as.numeric(levels(f))[f]

Order of factor levels

  • change order of factor (doesn’t change data, only order in which levels are shown in plots etc.)

    • parse factor again, with levels in modified order: x <- parse_factor(x, levels=c(“level1”, “level2”, …))
    • fct_reorder(f, x, fun = median, …, .desc = FALSE): The levels of f are reordered so that the values of fun(x) are in ascending order; interesting for plots.
  • Change the levels

    • fct_recode: fct_recode(factorvar, new1 = “old1”, new2 = “old2”); levels not provided return old level
    • not recommended (easy to introduce errors): levels(var) <- c(new levels)