5 Exploring dataframes
Make an example data frame to explore
library(tidyverse)
N <- 10
d <- tibble(x1 = rnorm(N, 10, 2),
            x2 = rnorm(N, 10, 2),
            y1 = sample(letters, N, replace = TRUE),
            y2 = sample(letters, N, replace = TRUE),
            z = sample(c("dog", "cat", "fish"), N, replace = TRUE)) |>
  mutate(z = fct(z, levels = c("dog", "cat", "fish")))
5.1 glimpse()
glimpse()
- Provides info about nrows, ncols, column names, column types and a “glimpse” of some of the data in each column
- Returns the tibble (invisibly), so it can be added to the end of a pipe when you first read the data with read_csv() or similar
d |> glimpse()
Rows: 10
Columns: 5
$ x1 <dbl> 9.864656, 10.648627, 9.370226, 8.208932, 7.798212, 10.212464, 11.69…
$ x2 <dbl> 9.760840, 12.671562, 10.331021, 4.643704, 11.381113, 13.237035, 9.0…
$ y1 <chr> "b", "o", "q", "j", "e", "c", "l", "j", "a", "l"
$ y2 <chr> "h", "e", "a", "v", "x", "v", "y", "r", "v", "w"
$ z <fct> fish, cat, dog, fish, fish, fish, cat, cat, cat, fish
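Because glimpse() hands back its input, it can sit at the end of the pipe that reads the data without changing what gets assigned. A minimal sketch, assuming a hypothetical temporary CSV file written only for illustration (the file, its columns, and the name d_in are not part of the original example):

```r
library(tidyverse)

# Hypothetical example file, written only so the sketch is self-contained
path <- tempfile(fileext = ".csv")
write_csv(tibble(id = 1:3, score = c(2.5, 3.1, 4.8)), path)

# glimpse() prints its summary and then returns the tibble invisibly,
# so d_in still holds the full data
d_in <- read_csv(path, show_col_types = FALSE) |>
  glimpse()
```

This gives you a quick look at column names and types at the moment the data are read, with no separate inspection step.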
5.2 Rows, Columns, and Names
If you don’t glimpse the dataframe, you should at least check the number of rows and columns, and review the column names
d |> nrow()
[1] 10
d |> ncol()
[1] 5
d |> names()
[1] "x1" "x2" "y1" "y2" "z"
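If you already know what shape the data should have, these checks can be turned into assertions that stop the script when something is off. A small sketch, using a hypothetical stand-in tibble (d_demo) so that it runs on its own:

```r
library(tidyverse)

# Small stand-in tibble so the sketch runs on its own (hypothetical data)
d_demo <- tibble(x1 = c(9.9, 10.6, 9.4),
                 y1 = c("b", "o", "q"))

# dim() returns the number of rows, then the number of columns, in one call
d_demo |> dim()

# Fail early if the data do not have the shape and names you expect
stopifnot(nrow(d_demo) == 3,
          identical(names(d_demo), c("x1", "y1")))
```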
5.3 Viewing the dataframe directly
I often use view() interactively with the data, but view() does not work when rendering Quarto documents. You should use head(), tail(), or slice_sample() if you want the output saved in your Quarto report.
head(), tail() or slice_sample()
- head() and tail() return the first or last 6 rows of the tibble. This can be changed using the n argument
- slice_sample() returns one random row from the tibble. This can be changed to more rows using the n argument
- I find that showing someone some rows with these functions makes the data more real to them. However, this works best when there are only a few columns so that they can all be displayed in the console
d |> head()
# A tibble: 6 × 5
x1 x2 y1 y2 z
<dbl> <dbl> <chr> <chr> <fct>
1 9.86 9.76 b h fish
2 10.6 12.7 o e cat
3 9.37 10.3 q a dog
4 8.21 4.64 j v fish
5 7.80 11.4 e x fish
6 10.2 13.2 c v fish
d |> tail(n = 3)
# A tibble: 3 × 5
x1 x2 y1 y2 z
<dbl> <dbl> <chr> <chr> <fct>
1 8.88 11.0 j r cat
2 9.12 9.21 a v cat
3 12.4 9.38 l w fish
d |> slice_sample(n = 5)
# A tibble: 5 × 5
x1 x2 y1 y2 z
<dbl> <dbl> <chr> <chr> <fct>
1 11.7 9.03 l y cat
2 7.80 11.4 e x fish
3 8.88 11.0 j r cat
4 9.12 9.21 a v cat
5 10.6 12.7 o e cat
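Since slice_sample() draws rows at random, the rows shown will change every time the report is rendered. If you want the same rows each time, one option is to set the random seed first. A minimal sketch, assuming a hypothetical stand-in tibble (d_demo) and an arbitrary seed value:

```r
library(tidyverse)

# Small stand-in tibble (hypothetical data)
d_demo <- tibble(id = 1:10,
                 grp = rep(c("a", "b"), times = 5))

# Setting the same seed before each call reproduces the same random rows
set.seed(20240101)
s1 <- d_demo |> slice_sample(n = 3)
set.seed(20240101)
s2 <- d_demo |> slice_sample(n = 3)
identical(s1, s2)

# prop samples a fraction of the rows instead of a fixed count
d_demo |> slice_sample(prop = 0.5)
```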
5.4 skim()
skim()
- Included in the skimr package
- Provides a detailed summary of the dataframe
- But the output takes up too much space
- Can use yank() to select only the data types you want to see and you can limit to only some columns if needed
- Can customize skim to return only the descriptives you want
5.4.1 Make your own skimmer
Let’s start with a custom skim that returns only the descriptives I generally want
- Easiest to start with the base skim()
- Then remove statistics you don’t want by setting to NULL
- Then add any statistics you do want (see example below for syntax)
- Do this for each data type
- However, for base (the statistics reported for all data types), you can’t remove and add; you just set what you want
library(skimr)
# base statistics are reported for every data type, so set them outright
my_skim <- skim_with(base = sfl(n_complete = ~ sum(!is.na(.), na.rm = TRUE),
                                n_missing = ~ sum(is.na(.), na.rm = TRUE)),
                     # setting a default statistic to NULL removes it
                     numeric = sfl(p25 = NULL,
                                   p75 = NULL,
                                   hist = NULL),
                     character = sfl(min = NULL, max = NULL),
                     factor = sfl(ordered = NULL))
5.4.2 Use with all variables
First with all output at once. Does provide summary tables with nrows, ncols, and counts of columns of each datatype. Maybe fine to start (though a bit verbose)
d |> my_skim()
| Name | d |
| Number of rows | 10 |
| Number of columns | 5 |
| _______________________ | |
| Column type frequency: | |
| character | 2 |
| factor | 1 |
| numeric | 2 |
| ________________________ | |
| Group variables | None |
Variable type: character
| skim_variable | n_complete | n_missing | empty | n_unique | whitespace |
|---|---|---|---|---|---|
| y1 | 10 | 0 | 0 | 8 | 0 |
| y2 | 10 | 0 | 0 | 8 | 0 |
Variable type: factor
| skim_variable | n_complete | n_missing | n_unique | top_counts |
|---|---|---|---|---|
| z | 10 | 0 | 3 | fis: 5, cat: 4, dog: 1 |
Variable type: numeric
| skim_variable | n_complete | n_missing | mean | sd | p0 | p50 | p100 |
|---|---|---|---|---|---|---|---|
| x1 | 10 | 0 | 9.82 | 1.47 | 7.80 | 9.62 | 12.44 |
| x2 | 10 | 0 | 10.06 | 2.39 | 4.64 | 10.05 | 13.24 |
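The custom skimmer defined in 5.4.1 only removes statistics within each data type. As a sketch of the "add" step mentioned there, the hypothetical skimmer below (my_skim_iqr, along with the stand-in tibble d_demo, is not part of the original example) appends an interquartile range to the numeric summary while dropping the histogram:

```r
library(tidyverse)
library(skimr)

# Small stand-in tibble (hypothetical data)
d_demo <- tibble(x = c(1, 2, 3, 4, NA),
                 g = c("a", "b", "a", "b", "a"))

# Add a statistic by naming it inside sfl(); remove one by setting it to NULL
my_skim_iqr <- skim_with(
  numeric = sfl(iqr = ~ IQR(., na.rm = TRUE),  # added statistic
                hist = NULL))                  # removed statistic

res <- d_demo |> my_skim_iqr()
res
```

The formula syntax (~ with . as the column) matches the style used for the base statistics in my_skim.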
5.4.3 Use with specific data types
I prefer to yank() one class/type at a time, but then we don’t see the number of rows and columns or which classes are present. Could combine with nrow() and ncol()
d |> my_skim() |>
  yank("numeric")
Variable type: numeric
| skim_variable | n_complete | n_missing | mean | sd | p0 | p50 | p100 |
|---|---|---|---|---|---|---|---|
| x1 | 10 | 0 | 9.82 | 1.47 | 7.80 | 9.62 | 12.44 |
| x2 | 10 | 0 | 10.06 | 2.39 | 4.64 | 10.05 | 13.24 |
d |> my_skim() |>
  yank("character")
Variable type: character
| skim_variable | n_complete | n_missing | empty | n_unique | whitespace |
|---|---|---|---|---|---|
| y1 | 10 | 0 | 0 | 8 | 0 |
| y2 | 10 | 0 | 0 | 8 | 0 |
d |> my_skim() |>
  yank("factor")
Variable type: factor
| skim_variable | n_complete | n_missing | n_unique | top_counts |
|---|---|---|---|---|
| z | 10 | 0 | 3 | fis: 5, cat: 4, dog: 1 |
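The suggestion to pair yank() with nrow() and ncol() might look something like the sketch below. It uses the default skim() and a hypothetical stand-in tibble (d_demo) so that it runs on its own; a custom skimmer could be swapped in the same way.

```r
library(tidyverse)
library(skimr)

# Small stand-in tibble (hypothetical data)
d_demo <- tibble(x1 = c(9.9, 10.6, 9.4, NA),
                 y1 = c("b", "o", "q", "j"))

# Report the overall shape first, since yank() drops the summary header
d_demo |> dim()

# Then show one data type at a time
d_demo |> skim() |> yank("numeric")
d_demo |> skim() |> yank("character")
```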
5.4.4 Limit output to specific columns
We can limit the dataframes returned by skimr to a subset of the variables/columns in the original data
- This can be done across or within a data type
- Columns can be selected using tidy select functions
- Can be combined with yank() to limit the output to specific data types if your selected columns are all the same type
d |> my_skim(x1, y2)
| Name | d |
| Number of rows | 10 |
| Number of columns | 5 |
| _______________________ | |
| Column type frequency: | |
| character | 1 |
| numeric | 1 |
| ________________________ | |
| Group variables | None |
Variable type: character
| skim_variable | n_complete | n_missing | empty | n_unique | whitespace |
|---|---|---|---|---|---|
| y2 | 10 | 0 | 0 | 8 | 0 |
Variable type: numeric
| skim_variable | n_complete | n_missing | mean | sd | p0 | p50 | p100 |
|---|---|---|---|---|---|---|---|
| x1 | 10 | 0 | 9.82 | 1.47 | 7.8 | 9.62 | 12.44 |
d |> my_skim(contains("x")) |>
  yank("numeric")
Variable type: numeric
| skim_variable | n_complete | n_missing | mean | sd | p0 | p50 | p100 |
|---|---|---|---|---|---|---|---|
| x1 | 10 | 0 | 9.82 | 1.47 | 7.80 | 9.62 | 12.44 |
| x2 | 10 | 0 | 10.06 | 2.39 | 4.64 | 10.05 | 13.24 |
5.5 Limit output to specific descriptive statistics
We can limit the dataframes returned by skimr to a subset of the statistics using focus()
- This is a variant of dplyr::select() but safer to use with skimmer dataframes
- This can be done across or within a data type
- Must prepend the statistic name with its data type and a . (e.g., numeric.mean); base statistics such as n_missing take no prefix
- Columns can be selected using tidy select functions
- Can be combined with yank() to limit the output to specific data types
d |> my_skim() |>
  focus(n_missing, numeric.mean) |>
  yank("numeric")
Variable type: numeric
| skim_variable | n_missing | mean |
|---|---|---|
| x1 | 0 | 9.82 |
| x2 | 0 | 10.06 |
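To find the exact names that focus() expects, it can help to look at the column names of the skimmed dataframe before yanking. A minimal sketch, using the default skim() and a hypothetical stand-in tibble (d_demo and skimmed are illustrative names):

```r
library(tidyverse)
library(skimr)

# Small stand-in tibble (hypothetical data)
d_demo <- tibble(x1 = c(9.9, 10.6, 9.4, NA),
                 y1 = c("b", "o", "q", "j"))

skimmed <- d_demo |> skim()

# Type-specific statistics carry a prefix (e.g., numeric.mean,
# character.n_unique); base statistics such as n_missing do not
skimmed |> names()

# Pass those names to focus() to keep only the statistics you want
skimmed |> focus(n_missing, numeric.mean)
```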