dplyr is my tool of choice for all things related to data wrangling and getting the information gems out of the raw material. Below I documented some of the most useful and reoccurring task flows to support my forgetful brain.
Create short example data set (with NA)
iris_short %
group_by(Species) %>%
summarise_all(mean) %>%
modify_at(c("Sepal.Length"), ~ifelse(.x<5.1, NA, .))
iris_short
## # A tibble: 3 x 5 ## Species Sepal.Length Sepal.Width Petal.Length Petal.Width ## ## 1 setosa NA 3.43 1.46 0.246 ## 2 versicolor 5.94 2.77 4.26 1.33 ## 3 virginica 6.59 2.97 5.55 2.03
Compute new variable
iris_short %>%
mutate(Petal.Sum = Petal.Length + Petal.Width)
## # A tibble: 3 x 6 ## Species Sepal.Length Sepal.Width Petal.Length Petal.Width Petal.Sum ## ## 1 setosa NA 3.43 1.46 0.246 1.71 ## 2 versicolor 5.94 2.77 4.26 1.33 5.59 ## 3 virginica 6.59 2.97 5.55 2.03 7.58
Compute and only show new variables
iris_short %>%
transmute(Sepal.Sum = Sepal.Length + Sepal.Width,
Petal.Sum = Petal.Length + Petal.Width)
## # A tibble: 3 x 2 ## Sepal.Sum Petal.Sum ## ## 1 NA 1.71 ## 2 8.71 5.59 ## 3 9.56 7.58
Replace/recode values in several variables at once
iris_short %>%
mutate_at(c("Petal.Length", "Petal.Width"), ~ case_when(.<=1 ~ "Small",
.4 ~ "Big",
TRUE ~ NA_character_))
## # A tibble: 3 x 5 ## Species Sepal.Length Sepal.Width Petal.Length Petal.Width ## ## 1 setosa NA 3.43 Medium Small ## 2 versicolor 5.94 2.77 Big Medium ## 3 virginica 6.59 2.97 Big Medium
Compute and overwrite result into same columns
iris_short %>%
mutate_at(c("Petal.Length", "Petal.Width"), ~.x*100)
## # A tibble: 3 x 5 ## Species Sepal.Length Sepal.Width Petal.Length Petal.Width ## ## 1 setosa NA 3.43 146. 24.6 ## 2 versicolor 5.94 2.77 426 133. ## 3 virginica 6.59 2.97 555. 203.
Take subset of variables, transform it and add to dataframe
iris_short %>%
select(4:5) %>% mutate_all(~.*100) %>% bind_cols(iris_short,.)
## # A tibble: 3 x 7 ## Species Sepal.Length Sepal.Width Petal.Length Petal.Width Petal.Length1 ## ## 1 setosa NA 3.43 1.46 0.246 146. ## 2 versico~ 5.94 2.77 4.26 1.33 426 ## 3 virgini~ 6.59 2.97 5.55 2.03 555. ## # ... with 1 more variable: Petal.Width1
Compute mean of variables by row
rowwise() is same as group_by(rowid), were rowid represents row numbers (e.g. created by rowid_to_column())
iris_short %>%
rowwise() %>% mutate(Petal.Mean = mean(c(Sepal.Length,Sepal.Width)))
## Source: local data frame [3 x 6] ## Groups: ## ## # A tibble: 3 x 6 ## Species Sepal.Length Sepal.Width Petal.Length Petal.Width Petal.Mean ## ## 1 setosa NA 3.43 1.46 0.246 NA ## 2 versicolor 5.94 2.77 4.26 1.33 4.35 ## 3 virginica 6.59 2.97 5.55 2.03 4.78
Replace NAs in several variables at once
iris_short %>%
modify_at(c("Sepal.Length", "Sepal.Width"), ~ ifelse(is.na(.),9999,.))
## # A tibble: 3 x 5 ## Species Sepal.Length Sepal.Width Petal.Length Petal.Width ## ## 1 setosa 9999 3.43 1.46 0.246 ## 2 versicolor 5.94 2.77 4.26 1.33 ## 3 virginica 6.59 2.97 5.55 2.03
Change type of multiple variables in one go
iris_short %>%
modify_at(c("Sepal.Length", "Sepal.Width"), as.character)
## # A tibble: 3 x 5 ## Species Sepal.Length Sepal.Width Petal.Length Petal.Width ## ## 1 setosa 3.428 1.46 0.246 ## 2 versicolor 5.936 2.77 4.26 1.33 ## 3 virginica 6.588 2.974 5.55 2.03
Replacing values based on lookup table
2 step process: first left_join(), then merge two variables using if_else()
lookup %
select(-replacement)
lookup
## # A tibble: 2 x 2 ## value replacement ## ## 1 setosa Borsten-Schwertlilie ## 2 virginica Virginia Iris
iris_short_n
## # A tibble: 3 x 5 ## Species Sepal.Length Sepal.Width Petal.Length Petal.Width ## ## 1 Borsten-Schwertlilie NA 3.43 1.46 0.246 ## 2 versicolor 5.94 2.77 4.26 1.33 ## 3 Virginia Iris 6.59 2.97 5.55 2.03
Notes for working with factors in R
Factors can be a little confusing in R and some default behaviour is questionable, two especially dangerous pitfalls are:
- silently introducing NAs! when values and levels do not match up, e.g. in case of typos
- numbers as factor levels and internal numeric representation might not match; e.g. as.numeric() applied to a factor can provide surpising results
General comments
- a factor has defined levels, and an internal integer representation
- labels are not part of the data structure of factors
- in factor(inputvector, levels, labels): labels is only used as argument to relevel the factor at creation
- if data values are not present in levels, values are silently replaced by NAs
- use parse_factor(inputvector, levels) to avoid silently introducing NAs
- factors are internally ordered according to order in levels argument (if provided)
- select factor values similar to strings: which(factor==“level1”)
Factors with numeric level
- make sure numeric level and internal representation are the same (e.g. for Likert-scales), to be able to calculate means correctly using as.numeric()
- use numeric and factor representation in parallel
- retrieve correct numerics:
- f <- factor(c(3.4, 1.2, 5))
- as.numeric(levels(f))[f]
Order of factor levels
-
change order of factor (doesn’t change data, only order in which levels are shown in plots etc.)
- parse factor again, with levels in modified order: x <- parse_factor(x, levels=c(“level1”, “level2”, …))
- fct_reorder(f, x, fun = median, …, .desc = FALSE): The levels of f are reordered so that the values of fun(x) are in ascending order; interesting for plots.
-
Change the levels
- fct_recode: fct_recode(factorvar, new1 = “old1”, new2 = “old2”); levels not provided return old level
- not recommended (easy to introduce errors): levels(var) <- c(new levels)