5 Exploring dataframes

Make an example data frame to explore

library(tidyverse)
# Simulate a small tibble with numeric, character, and factor columns
N <- 10
d <- tibble(x1 = rnorm(N, 10, 2),
            x2 = rnorm(N, 10, 2),
            y1 = sample(letters, N, replace = TRUE),
            y2 = sample(letters, N, replace = TRUE),
            z = sample(c("dog", "cat", "fish"), N, replace = TRUE)) |> 
  # convert z to a factor with an explicit level order
  mutate(z = fct(z, levels = c("dog", "cat", "fish")))

5.1 glimpse()

glimpse()

Provides the number of rows and columns, the column names and types, and a “glimpse” of the first few values in each column
Returns the tibble invisibly, so it can be used at the end of a pipe when you first read the data with read_csv() or similar

d |> glimpse()

Rows: 10
Columns: 5
$ x1 <dbl> 9.864656, 10.648627, 9.370226, 8.208932, 7.798212, 10.212464, 11.69…
$ x2 <dbl> 9.760840, 12.671562, 10.331021, 4.643704, 11.381113, 13.237035, 9.0…
$ y1 <chr> "b", "o", "q", "j", "e", "c", "l", "j", "a", "l"
$ y2 <chr> "h", "e", "a", "v", "x", "v", "y", "r", "v", "w"
$ z  <fct> fish, cat, dog, fish, fish, fish, cat, cat, cat, fish
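
Because glimpse() passes its input through, it can sit at the end of the pipe that reads the data and the result can still be assigned. A minimal sketch, assuming a throwaway CSV written to a temporary file (the file contents and the name d_read are illustrative, not part of the example data above):

```r
library(tidyverse)

# Write a small illustrative CSV to a temporary file (hypothetical data)
path <- tempfile(fileext = ".csv")
tibble(id = 1:3, grp = c("a", "b", "a")) |> 
  write_csv(path)

# glimpse() prints its summary and returns the tibble,
# so d_read still holds the full data
d_read <- read_csv(path, show_col_types = FALSE) |> 
  glimpse()
```

This keeps the first look at the data in the same place as the code that reads it.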

5.2 Rows, Columns, and Names

If you don’t glimpse the dataframe, you should at least check the number of rows and columns, and review the column names

d |> nrow()

[1] 10

d |> ncol()

[1] 5

d |> names()

[1] "x1" "x2" "y1" "y2" "z"
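
If you want both counts from a single call, base R’s dim() returns the number of rows followed by the number of columns. A small sketch using a stand-in tibble (d_small is an illustrative name):

```r
library(tidyverse)

# Stand-in tibble for illustration
d_small <- tibble(x = 1:4, y = letters[1:4])

d_small |> dim()  # number of rows, then number of columns
```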

5.3 Viewing the dataframe directly

I often use view() interactively to look at the data, but view() does not work when rendering Quarto documents. Use head(), tail(), or slice_sample() if you want the output saved in your Quarto report.

head(), tail() or slice_sample()

head() and tail() return the first or last 6 rows of the tibble. The number of rows can be changed with the n = argument
slice_sample() returns one random row from the tibble. Use the n = argument to return more rows
I find that showing someone some rows with these functions makes the data more real to them. However, this works best when there are only a few columns so that they can all be displayed in the console

d |> head()

# A tibble: 6 × 5
     x1    x2 y1    y2    z    
  <dbl> <dbl> <chr> <chr> <fct>
1  9.86  9.76 b     h     fish 
2 10.6  12.7  o     e     cat  
3  9.37 10.3  q     a     dog  
4  8.21  4.64 j     v     fish 
5  7.80 11.4  e     x     fish 
6 10.2  13.2  c     v     fish

d |> tail(n = 3)

# A tibble: 3 × 5
     x1    x2 y1    y2    z    
  <dbl> <dbl> <chr> <chr> <fct>
1  8.88 11.0  j     r     cat  
2  9.12  9.21 a     v     cat  
3 12.4   9.38 l     w     fish

d |> slice_sample(n = 5)

# A tibble: 5 × 5
     x1    x2 y1    y2    z    
  <dbl> <dbl> <chr> <chr> <fct>
1 11.7   9.03 l     y     cat  
2  7.80 11.4  e     x     fish 
3  8.88 11.0  j     r     cat  
4  9.12  9.21 a     v     cat  
5 10.6  12.7  o     e     cat
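
When a tibble has too many columns to display in the console, one option is to select() a few columns of interest before showing rows. A sketch, assuming a hypothetical wider tibble (d_wide and its column names are made up for illustration):

```r
library(tidyverse)

# Hypothetical tibble with more columns than we want to display
d_wide <- tibble(id = 1:10,
                 a = rnorm(10), b = rnorm(10), c = rnorm(10),
                 e = rnorm(10), f = rnorm(10),
                 grp = sample(c("a", "b"), 10, replace = TRUE))

# Keep only a few columns, then show three random rows
shown <- d_wide |> 
  select(id, a, grp) |> 
  slice_sample(n = 3)

shown
```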

5.4 skim()

skim()

Included in the skimr package
Provides a detailed summary of the dataframe
But the default output takes up a lot of space
Can use yank() to show only the data types you want to see, and you can limit the output to only some columns if needed
Can customize skim() to return only the descriptives you want

5.4.1 Make your own skimmer

Let’s start with a custom skimmer that returns only the descriptives I generally want

Easiest to start with the base skim()
Then remove statistics you don’t want by setting to NULL
Then add any statistics you do want (see example below for syntax)
Do this for each data type
However, for base (the statistics reported for all data types), you can’t remove and add individual statistics. You just set the full list of statistics you want

library(skimr)

# base statistics are reported for every data type, so set them explicitly
my_skim <- skim_with(base = sfl(n_complete = ~ sum(!is.na(.)),
                                n_missing = ~ sum(is.na(.))),
                     # drop default statistics by setting them to NULL
                     numeric = sfl(p25 = NULL,
                                   p75 = NULL,
                                   hist = NULL),
                     character = sfl(min = NULL, max = NULL),
                     factor = sfl(ordered = NULL))
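
The skimmer above only removes type-specific statistics. As a sketch of the syntax for adding one, assume you also wanted the interquartile range for numeric columns (the names my_skim_iqr and iqr are illustrative choices, not part of the skimmer above). A new named entry in sfl() takes a function or a formula:

```r
library(skimr)

# Add a statistic with a name and a formula; drop a default one with NULL
my_skim_iqr <- skim_with(numeric = sfl(iqr = ~ IQR(., na.rm = TRUE),
                                       hist = NULL))

# Small illustrative data frame
out <- data.frame(x = c(1, 2, 3, 4, 100)) |> 
  my_skim_iqr()

names(out)  # the added statistic appears with its data type as a prefix
```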

5.4.2 Use with all variables

First, skim all variables at once. This also provides summary tables with the number of rows and columns and counts of columns of each data type. That may be fine to start (though a bit verbose)

d |> my_skim()

Data summary
Name	d
Number of rows	10
Number of columns	5
_______________________
Column type frequency:
character	2
factor	1
numeric	2
________________________
Group variables	None

Variable type: character

skim_variable	n_complete	n_missing	empty	n_unique	whitespace
y1	10	0	0	8	0
y2	10	0	0	8	0

Variable type: factor

skim_variable	n_complete	n_missing	n_unique	top_counts
z	10	0	3	fis: 5, cat: 4, dog: 1

Variable type: numeric

skim_variable	n_complete	n_missing	mean	sd	p0	p50	p100
x1	10	0	9.82	1.47	7.80	9.62	12.44
x2	10	0	10.06	2.39	4.64	10.05	13.24

5.4.3 Use with specific data types

I prefer to yank() one class/type at a time, but then we don’t see the number of rows and columns or which classes are present. You could combine this with nrow() and ncol()

d |> my_skim() |> 
  yank("numeric")

Variable type: numeric

skim_variable	n_complete	n_missing	mean	sd	p0	p50	p100
x1	10	0	9.82	1.47	7.80	9.62	12.44
x2	10	0	10.06	2.39	4.64	10.05	13.24

d |> my_skim() |> 
  yank("character")

Variable type: character

skim_variable	n_complete	n_missing	empty	n_unique	whitespace
y1	10	0	0	8	0
y2	10	0	0	8	0

d |> my_skim() |> 
  yank("factor")

Variable type: factor

skim_variable	n_complete	n_missing	n_unique	top_counts
z	10	0	3	fis: 5, cat: 4, dog: 1
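
Since yank() drops the data summary, one way to keep the row and column counts in the report is to print them just before the yanked table. A minimal sketch, assuming a small stand-in tibble (d_demo is illustrative) and the default skim() rather than the custom skimmer:

```r
library(tidyverse)
library(skimr)

# Stand-in tibble for illustration
d_demo <- tibble(x = c(1.5, 2.5, NA), y = c("a", "b", "b"))

# Rows and columns, which the yanked output does not report
d_demo |> dim()

# Then the statistics for one data type
num_only <- d_demo |> 
  skim() |> 
  yank("numeric")

num_only
```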

5.4.4 Limit output to specific columns

We can limit the dataframes returned by skimr to a subset of the variables/columns in the original data

This can be done across or within a data type
Columns can be selected using tidy select functions
Can be combined with yank() to limit the output to specific data types if your selected columns are all the same type

d |> my_skim(x1, y2)

Data summary
Name	d
Number of rows	10
Number of columns	5
_______________________
Column type frequency:
character	1
numeric	1
________________________
Group variables	None

Variable type: character

skim_variable	n_complete	n_missing	empty	n_unique	whitespace
y2	10	0	0	8	0

Variable type: numeric

skim_variable	n_complete	n_missing	mean	sd	p0	p50	p100
x1	10	0	9.82	1.47	7.8	9.62	12.44

d |> my_skim(contains("x")) |> 
  yank("numeric")

Variable type: numeric

skim_variable	n_complete	n_missing	mean	sd	p0	p50	p100
x1	10	0	9.82	1.47	7.80	9.62	12.44
x2	10	0	10.06	2.39	4.64	10.05	13.24

5.4.5 Limit output to specific descriptive statistics

We can limit the dataframes returned by skimr to a subset of the statistics using focus()

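This is a variant of dplyr::select() but safer to use with skim dataframes because it keeps the metadata columns skimr needs for printing
This can be done across or within a data type
Must prefix the statistic name with its data type and a period (e.g., numeric.mean). Base statistics such as n_missing take no prefix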
Columns can be selected using tidy select functions
Can be combined with yank() to limit the output to specific data types

d |> my_skim() |>
  focus(n_missing, numeric.mean) |> 
  yank("numeric")

Variable type: numeric

skim_variable	n_missing	mean
x1	0	9.82
x2	0	10.06
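
If you are unsure which names focus() expects, calling names() on the skim result lists them, including the data type prefixes. A sketch using the default skim() on a small illustrative data frame (sk and focused are made-up names):

```r
library(skimr)

# Small illustrative data frame with one numeric and one character column
sk <- data.frame(x = c(1, 2, 3),
                 g = c("a", "b", "b"),
                 stringsAsFactors = FALSE) |> 
  skim()

# Type-specific statistics carry a prefix; base statistics do not
names(sk)

# Use those names directly in focus()
focused <- sk |> 
  focus(n_missing, numeric.mean)

focused
```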