5 Exploring dataframes
Make an example data frame to explore
library(tidyverse)
N <- 10
d <- tibble(x1 = rnorm(N, 10, 2),
            x2 = rnorm(N, 10, 2),
            y1 = sample(letters, N, replace = TRUE),
            y2 = sample(letters, N, replace = TRUE),
            z = sample(c("dog", "cat", "fish"), N, replace = TRUE)) |>
  mutate(z = fct(z, levels = c("dog", "cat", "fish")))
5.1 glimpse()
glimpse()
- Provides info about nrows, ncols, column names, column types, and a “glimpse” of some of the data in each column
- Returns the tibble, so it can be used at the end of a pipe when you first read the data with read_csv() or similar
d |> glimpse()
Rows: 10
Columns: 5
$ x1 <dbl> 9.864656, 10.648627, 9.370226, 8.208932, 7.798212, 10.212464, 11.69…
$ x2 <dbl> 9.760840, 12.671562, 10.331021, 4.643704, 11.381113, 13.237035, 9.0…
$ y1 <chr> "b", "o", "q", "j", "e", "c", "l", "j", "a", "l"
$ y2 <chr> "h", "e", "a", "v", "x", "v", "y", "r", "v", "w"
$ z <fct> fish, cat, dog, fish, fish, fish, cat, cat, cat, fish
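As a rough sketch of the end-of-pipe pattern described above, the example below reads a small inline CSV and glimpses it in the same pipe. The CSV text and the names csv_text and dat are made up for illustration and stand in for a real file. Because glimpse() returns its input, the assignment still captures the full tibble.

```r
library(readr)
library(dplyr)

# Inline CSV text is a stand-in for a real file path (hypothetical data)
csv_text <- "id,score,group\n1,9.5,a\n2,10.2,b\n3,8.7,a"

# glimpse() prints the overview and passes the tibble through unchanged
dat <- read_csv(I(csv_text), show_col_types = FALSE) |>
  glimpse()
```

After this runs, dat holds the data as read, so the glimpse costs nothing beyond the printed overview.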
5.2 Rows, Columns, and Names
If you don’t glimpse the dataframe, you should at least check the number of rows and columns, and review the column names.
d |> nrow()
[1] 10
d |> ncol()
[1] 5
d |> names()
[1] "x1" "x2" "y1" "y2" "z"
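If you want these checks in one place, a minimal sketch is shown below. It uses a small toy tibble rather than the d above, and expected_cols is a hypothetical list of the columns you expect to find. dim() returns the row and column counts together, and setdiff() flags any expected name that is missing.

```r
library(tibble)

# Toy stand-in for the example data frame (hypothetical values)
d <- tibble(x1 = c(1.5, 2.5), y1 = c("a", "b"))

# dim() returns c(number of rows, number of columns)
dims <- dim(d)

# Hypothetical expectation: stop early if an expected column is missing
expected_cols <- c("x1", "y1")
missing_cols <- setdiff(expected_cols, names(d))
stopifnot(length(missing_cols) == 0)
```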
5.3 Viewing the dataframe directly
I often use view() interactively with the data, but view() does not work when rendering Quarto documents. You should use head(), tail(), or slice_sample() if you want the output saved in your Quarto report.
head(), tail(), or slice_sample()
- head() and tail() return the first or last 6 rows of the tibble. This can be changed using the n = argument
- slice_sample() returns a random row from the tibble. This can be changed to more rows using the n = argument
- I find that showing someone some rows with these functions makes the data more real to them. However, this works best when there are only a few columns so that they can all be displayed in the console
d |> head()
# A tibble: 6 × 5
x1 x2 y1 y2 z
<dbl> <dbl> <chr> <chr> <fct>
1 9.86 9.76 b h fish
2 10.6 12.7 o e cat
3 9.37 10.3 q a dog
4 8.21 4.64 j v fish
5 7.80 11.4 e x fish
6 10.2 13.2 c v fish
d |> tail(n = 3)
# A tibble: 3 × 5
x1 x2 y1 y2 z
<dbl> <dbl> <chr> <chr> <fct>
1 8.88 11.0 j r cat
2 9.12 9.21 a v cat
3 12.4 9.38 l w fish
d |> slice_sample(n = 5)
# A tibble: 5 × 5
x1 x2 y1 y2 z
<dbl> <dbl> <chr> <chr> <fct>
1 11.7 9.03 l y cat
2 7.80 11.4 e x fish
3 8.88 11.0 j r cat
4 9.12 9.21 a v cat
5 10.6 12.7 o e cat
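Because slice_sample() draws rows at random, the rows shown will change each time the report is rendered. One way to keep them stable, sketched below on toy data, is to call set.seed() first. The seed value 123 is arbitrary.

```r
library(dplyr)
library(tibble)

# Toy data (hypothetical values)
d <- tibble(id = 1:10, x = letters[1:10])

set.seed(123)  # arbitrary seed; any fixed value works
s1 <- d |> slice_sample(n = 3)

set.seed(123)  # resetting the same seed reproduces the same draw
s2 <- d |> slice_sample(n = 3)

identical(s1, s2)
```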
5.4 skim()
skim()
- Included in the skimr package
- Provides a detailed summary of the dataframe
- But the output takes up too much space
- Can use yank() to select only the data types you want to see, and you can limit to only some columns if needed
- Can customize skim to return only the descriptives you want
5.4.1 Make your own skimmer
Let’s start with a custom skim that returns only the descriptives I generally want
- Easiest to start with the base skim()
- Then remove statistics you don’t want by setting them to NULL
- Then add any statistics you do want (see example below for syntax)
- Do this for each data type
- However, for base (the statistics reported for all data types), you can’t remove and add; you just set the ones you want
library(skimr)
my_skim <- skim_with(base = sfl(n_complete = ~ sum(!is.na(.), na.rm = TRUE),
                                n_missing = ~ sum(is.na(.), na.rm = TRUE)),
                     numeric = sfl(p25 = NULL,
                                   p75 = NULL,
                                   hist = NULL),
                     character = sfl(min = NULL, max = NULL),
                     factor = sfl(ordered = NULL))
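The skimmer above only removes statistics from the individual data types. As a sketch of adding one, the example below defines a separate, hypothetical skimmer (iqr_skim, not part of the notes above) that adds the interquartile range to numeric columns and drops the histogram.

```r
library(skimr)

# Hypothetical skimmer: add the interquartile range, drop the histogram
iqr_skim <- skim_with(numeric = sfl(iqr = ~ IQR(., na.rm = TRUE),
                                    hist = NULL))

# Toy data with one missing value
out <- iqr_skim(data.frame(x = c(1, 2, 3, 4, NA)))

# Type-specific statistics appear as columns prefixed with the data type
names(out)
```

The new statistic is stored in a column named numeric.iqr, alongside the default numeric statistics that were left in place.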
5.4.2 Use with all variables
First with all output at once. This does provide summary tables with nrows, ncols, and counts of columns of each data type. It may be fine to start with (though a bit verbose)
d |> my_skim()
Name | d |
Number of rows | 10 |
Number of columns | 5 |
_______________________ | |
Column type frequency: | |
character | 2 |
factor | 1 |
numeric | 2 |
________________________ | |
Group variables | None |
Variable type: character
skim_variable | n_complete | n_missing | empty | n_unique | whitespace |
---|---|---|---|---|---|
y1 | 10 | 0 | 0 | 8 | 0 |
y2 | 10 | 0 | 0 | 8 | 0 |
Variable type: factor
skim_variable | n_complete | n_missing | n_unique | top_counts |
---|---|---|---|---|
z | 10 | 0 | 3 | fis: 5, cat: 4, dog: 1 |
Variable type: numeric
skim_variable | n_complete | n_missing | mean | sd | p0 | p50 | p100 |
---|---|---|---|---|---|---|---|
x1 | 10 | 0 | 9.82 | 1.47 | 7.80 | 9.62 | 12.44 |
x2 | 10 | 0 | 10.06 | 2.39 | 4.64 | 10.05 | 13.24 |
5.4.3 Use with specific data types
I prefer to yank() a class/type at a time, but then we don’t see the rows and columns and all the classes present. Could combine with nrow() and ncol()
d |> my_skim() |>
  yank("numeric")
Variable type: numeric
skim_variable | n_complete | n_missing | mean | sd | p0 | p50 | p100 |
---|---|---|---|---|---|---|---|
x1 | 10 | 0 | 9.82 | 1.47 | 7.80 | 9.62 | 12.44 |
x2 | 10 | 0 | 10.06 | 2.39 | 4.64 | 10.05 | 13.24 |
d |> my_skim() |>
  yank("character")
Variable type: character
skim_variable | n_complete | n_missing | empty | n_unique | whitespace |
---|---|---|---|---|---|
y1 | 10 | 0 | 0 | 8 | 0 |
y2 | 10 | 0 | 0 | 8 | 0 |
d |> my_skim() |>
  yank("factor")
Variable type: factor
skim_variable | n_complete | n_missing | n_unique | top_counts |
---|---|---|---|---|
z | 10 | 0 | 3 | fis: 5, cat: 4, dog: 1 |
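One possible way to combine yank() with nrow() and ncol(), as suggested at the start of this section, is sketched below. It uses toy data and the default skim() rather than my_skim so that it runs on its own. The name num_tbl is made up for illustration.

```r
library(skimr)

# Toy data (hypothetical values)
d <- data.frame(x1 = c(9.8, 10.6, 9.4), y1 = c("b", "o", "q"))

# Report dimensions first, since yank() drops the summary header
cat("Rows:", nrow(d), " Columns:", ncol(d), "\n")

# Then show only the numeric summary
num_tbl <- skim(d) |> yank("numeric")
num_tbl
```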
5.4.4 Limit output to specific columns
We can limit the dataframes returned by skimr to a subset of the variables/columns in the original data
- This can be done across or within a data type
- Columns can be selected using tidy select functions
- Can be combined with yank() to limit the output to specific data types if your selected columns are all the same type
d |> my_skim(x1, y2)
Name | d |
Number of rows | 10 |
Number of columns | 5 |
_______________________ | |
Column type frequency: | |
character | 1 |
numeric | 1 |
________________________ | |
Group variables | None |
Variable type: character
skim_variable | n_complete | n_missing | empty | n_unique | whitespace |
---|---|---|---|---|---|
y2 | 10 | 0 | 0 | 8 | 0 |
Variable type: numeric
skim_variable | n_complete | n_missing | mean | sd | p0 | p50 | p100 |
---|---|---|---|---|---|---|---|
x1 | 10 | 0 | 9.82 | 1.47 | 7.8 | 9.62 | 12.44 |
d |> my_skim(contains("x")) |>
  yank("numeric")
Variable type: numeric
skim_variable | n_complete | n_missing | mean | sd | p0 | p50 | p100 |
---|---|---|---|---|---|---|---|
x1 | 10 | 0 | 9.82 | 1.47 | 7.80 | 9.62 | 12.44 |
x2 | 10 | 0 | 10.06 | 2.39 | 4.64 | 10.05 | 13.24 |
5.5 Limit output to specific descriptive statistics
We can limit the dataframes returned by skimr to a subset of the statistics using focus()
- This is a variant of dplyr::select() but safer to use with skimmer dataframes
- This can be done across or within a data type
- Must prepend the data type and a . to the column name for type-specific statistics (e.g., numeric.mean)
- Columns can be selected using tidy select functions
- Can be combined with yank() to limit the output to specific data types
d |> my_skim() |>
  focus(n_missing, numeric.mean) |>
  yank("numeric")
Variable type: numeric
skim_variable | n_missing | mean |
---|---|---|
x1 | 0 | 9.82 |
x2 | 0 | 10.06 |
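To see which names focus() expects, it can help to look at the column names of the skim result before yanking. The sketch below uses toy data and the default skim() so it is self-contained. The name sk is made up for illustration.

```r
library(skimr)

# Toy data (hypothetical values)
d <- data.frame(x1 = c(9.8, 10.6, 9.4), y1 = c("b", "o", "q"))

sk <- skim(d)

# Type-specific statistics carry a "<type>." prefix; base ones do not
names(sk)

# focus() across data types, without yank()
sk |> focus(n_missing, character.n_unique, numeric.mean)
```

Statistics that do not apply to a column's type are reported as NA in the combined result, which is one reason to follow focus() with yank() when all the columns of interest share a type.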