5  Exploring dataframes

Make an example data frame to explore

library(tidyverse)
N <- 10
d <- tibble(x1 = rnorm(N, 10, 2), 
            x2 = rnorm(N, 10, 2),
            y1 = sample(letters, N, replace = TRUE), 
            y2 = sample(letters, N, replace = TRUE), 
            z = sample(c("dog", "cat", "fish"), N, replace = TRUE)) |> 
  mutate(z = fct(z, levels = c("dog", "cat", "fish")))

5.1 glimpse()

glimpse()

  • Reports the number of rows and columns, the column names and types, and a “glimpse” of the first few values in each column
  • Returns the tibble (invisibly), so it can be added to the end of a pipe when you first read the data with read_csv() or similar
d |> glimpse()
Rows: 10
Columns: 5
$ x1 <dbl> 9.864656, 10.648627, 9.370226, 8.208932, 7.798212, 10.212464, 11.69…
$ x2 <dbl> 9.760840, 12.671562, 10.331021, 4.643704, 11.381113, 13.237035, 9.0…
$ y1 <chr> "b", "o", "q", "j", "e", "c", "l", "j", "a", "l"
$ y2 <chr> "h", "e", "a", "v", "x", "v", "y", "r", "v", "w"
$ z  <fct> fish, cat, dog, fish, fish, fish, cat, cat, cat, fish
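
Because glimpse() returns its input invisibly, it can sit at the end of the pipe that reads the data without getting in the way of the assignment. A minimal sketch, assuming a small made-up CSV file (the path, the column names, and d_new are hypothetical and only for illustration):

```r
library(tidyverse)

# Write a tiny illustrative file to a temporary path (contents are made up)
path <- tempfile(fileext = ".csv")
write_csv(tibble(id = 1:3, score = c(2.5, 3.1, 4.8)), path)

# glimpse() prints its summary and then passes the tibble through,
# so d_new still holds the data that was read
d_new <- read_csv(path, show_col_types = FALSE) |>
  glimpse()
```

This way the check happens every time the data are read, not only when you remember to run it.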

5.2 Rows, Columns, and Names

If you don’t glimpse the dataframe, you should at least check the number of rows and columns and review the column names.

d |> nrow()
[1] 10
d |> ncol()
[1] 5
d |> names()
[1] "x1" "x2" "y1" "y2" "z" 
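
If you already know what the data should look like, these checks can be turned into assertions that stop the script when something is off. A small sketch using base R's dim() and stopifnot(); d_check, its columns, and the expected values are made up for illustration:

```r
library(tidyverse)

# Hypothetical data with a known shape
d_check <- tibble(x1 = c(9.8, 10.6, 9.4),
                  z = c("dog", "cat", "fish"))

# dim() returns the number of rows and then the number of columns
shape <- dim(d_check)
shape

# Stop with an error if the shape or the column names are not what we expect
stopifnot(nrow(d_check) == 3,
          ncol(d_check) == 2,
          all(c("x1", "z") %in% names(d_check)))
```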

5.3 Viewing the dataframe directly

I often use view() interactively to look at the data, but view() does not work when rendering Quarto documents. Use head(), tail(), or slice_sample() instead if you want the output saved in your Quarto report.

head(), tail() or slice_sample()

  • head() and tail() return the first or last 6 rows of the tibble. This can be changed using the n = argument
  • slice_sample() returns a random row from the tibble. This can be changed to more rows using the n = argument
  • I find that showing someone some rows with these functions makes the data more real to them. However, this works best when there are only a few columns so that they can all be displayed in the console
d |> head()
# A tibble: 6 × 5
     x1    x2 y1    y2    z    
  <dbl> <dbl> <chr> <chr> <fct>
1  9.86  9.76 b     h     fish 
2 10.6  12.7  o     e     cat  
3  9.37 10.3  q     a     dog  
4  8.21  4.64 j     v     fish 
5  7.80 11.4  e     x     fish 
6 10.2  13.2  c     v     fish 
d |> tail(n = 3)
# A tibble: 3 × 5
     x1    x2 y1    y2    z    
  <dbl> <dbl> <chr> <chr> <fct>
1  8.88 11.0  j     r     cat  
2  9.12  9.21 a     v     cat  
3 12.4   9.38 l     w     fish 
d |> slice_sample(n = 5)
# A tibble: 5 × 5
     x1    x2 y1    y2    z    
  <dbl> <dbl> <chr> <chr> <fct>
1 11.7   9.03 l     y     cat  
2  7.80 11.4  e     x     fish 
3  8.88 11.0  j     r     cat  
4  9.12  9.21 a     v     cat  
5 10.6  12.7  o     e     cat  
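
Two small additions can help here. Setting a seed makes the rows chosen by slice_sample() reproducible across renders, and selecting a few columns first keeps a wide tibble readable in the console. A sketch under those assumptions; d_wide, its columns, and the seed values are made up for illustration:

```r
library(tidyverse)

set.seed(123)
# A hypothetical wider tibble
d_wide <- tibble(id = 1:20,
                 a = rnorm(20),
                 b = rnorm(20),
                 grp = sample(c("dog", "cat"), 20, replace = TRUE))

# Keep only the columns worth showing before printing the first rows
d_wide |>
  select(id, grp) |>
  head(n = 3)

# Fix the seed right before sampling so the same rows appear each time
set.seed(456)
rows_a <- d_wide |> slice_sample(n = 4)
set.seed(456)
rows_b <- d_wide |> slice_sample(n = 4)
identical(rows_a, rows_b)
```

slice_sample() also accepts prop = if you would rather sample a fraction of the rows than a fixed number.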

5.4 skim()

skim()

  • Included in the skimr package
  • Provides a detailed summary of the dataframe
  • But the default output takes up a lot of space
  • Can use yank() to show only the data types you want to see, and you can limit the output to only some columns if needed
  • Can customize skim() to return only the descriptives you want

5.4.1 Make your own skimmer

Let’s start with a custom skim that returns only the descriptives I generally want

  • Easiest to start with the base skim()
  • Then remove statistics you don’t want by setting them to NULL
  • Then add any statistics you do want (see example below for syntax)
  • Do this for each data type
  • However, for base (the statistics that are reported for all data types), you can’t remove and add. Instead, you set the full list of statistics that you want
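library(skimr)

# Start from the default skim() and adjust the statistics for each data type
my_skim <- skim_with(
  # base statistics are reported for every data type; this list replaces the defaults
  base = sfl(n_complete = ~ sum(!is.na(.)),
             n_missing = ~ sum(is.na(.))),
  # setting a statistic to NULL removes it from the output
  numeric = sfl(p25 = NULL,
                p75 = NULL,
                hist = NULL),
  character = sfl(min = NULL, max = NULL),
  factor = sfl(ordered = NULL))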
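
The skimmer above only sets base statistics and removes type-specific ones. Adding a type-specific statistic uses the same sfl() syntax: a name plus a function or formula. A minimal sketch, assuming we want the interquartile range for numeric columns (skim_iqr, the iqr statistic, and the toy data are my own illustration, not part of the skimmer above):

```r
library(skimr)

# Add an interquartile range to the numeric statistics and drop the histogram;
# the other default numeric statistics are kept
skim_iqr <- skim_with(numeric = sfl(iqr = ~ IQR(., na.rm = TRUE),
                                    hist = NULL))

# Toy data with one missing value
res <- skim_iqr(data.frame(x = c(1, 2, 3, 4, NA)))

# In the underlying data frame, the new statistic is stored as numeric.iqr
names(res)
res$numeric.iqr
```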

5.4.2 Use with all variables

First, show all of the output at once. This includes a summary table with the number of rows, the number of columns, and a count of the columns of each data type. This may be fine to start (though it is a bit verbose).

d |> my_skim()
Data summary
Name                 d
Number of rows       10
Number of columns    5
_______________________
Column type frequency:
  character          2
  factor             1
  numeric            2
________________________
Group variables      None

Variable type: character

skim_variable  n_complete  n_missing  empty  n_unique  whitespace
y1             10          0          0      8         0
y2             10          0          0      8         0

Variable type: factor

skim_variable  n_complete  n_missing  n_unique  top_counts
z              10          0          3         fis: 5, cat: 4, dog: 1

Variable type: numeric

skim_variable  n_complete  n_missing  mean   sd    p0    p50    p100
x1             10          0          9.82   1.47  7.80  9.62   12.44
x2             10          0          10.06  2.39  4.64  10.05  13.24

5.4.3 Use with specific data types

I prefer to yank() one class/type at a time, but then we don’t see the number of rows and columns or which classes are present. You could combine this with nrow() and ncol().

d |> my_skim() |> 
  yank("numeric")

Variable type: numeric

skim_variable  n_complete  n_missing  mean   sd    p0    p50    p100
x1             10          0          9.82   1.47  7.80  9.62   12.44
x2             10          0          10.06  2.39  4.64  10.05  13.24
d |> my_skim() |> 
  yank("character")

Variable type: character

skim_variable  n_complete  n_missing  empty  n_unique  whitespace
y1             10          0          0      8         0
y2             10          0          0      8         0
d |> my_skim() |> 
  yank("factor")

Variable type: factor

skim_variable  n_complete  n_missing  n_unique  top_counts
z              10          0          3         fis: 5, cat: 4, dog: 1
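
One way to recover what yank() leaves out is to print the dimensions and the column classes yourself before yanking a single type. A rough sketch using the default skim() and a made-up tibble (d_small and its columns are hypothetical):

```r
library(tidyverse)
library(skimr)

# Hypothetical data with one numeric and one character column
d_small <- tibble(x = c(1.5, 2.5, NA),
                  y = c("a", "b", "c"))

# Report the size of the data and how many columns there are of each class
cat("Rows:", nrow(d_small), " Columns:", ncol(d_small), "\n")
table(sapply(d_small, \(col) class(col)[1]))

# Then show the detailed statistics for one data type only
num_stats <- d_small |>
  skim() |>
  yank("numeric")
num_stats
```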

5.4.4 Limit output to specific columns

We can limit the dataframes returned by skimr to a subset of the variables/columns in the original data

  • This can be done across or within a data type
  • Columns can be selected using tidy select functions
  • Can be combined with yank() to limit the output to specific data types if your selected columns are all the same type
d |> my_skim(x1, y2) 
Data summary
Name                 d
Number of rows       10
Number of columns    5
_______________________
Column type frequency:
  character          1
  numeric            1
________________________
Group variables      None

Variable type: character

skim_variable  n_complete  n_missing  empty  n_unique  whitespace
y2             10          0          0      8         0

Variable type: numeric

skim_variable  n_complete  n_missing  mean  sd    p0   p50   p100
x1             10          0          9.82  1.47  7.8  9.62  12.44
d |> my_skim(contains("x")) |> 
  yank("numeric")

Variable type: numeric

skim_variable  n_complete  n_missing  mean   sd    p0    p50    p100
x1             10          0          9.82   1.47  7.80  9.62   12.44
x2             10          0          10.06  2.39  4.64  10.05  13.24
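
Tidy select helpers that work on column type, not only on column name, can also be used here. A short sketch with where(is.numeric) and the default skim(); d_mixed and its columns are made up for illustration:

```r
library(tidyverse)
library(skimr)

# Hypothetical data with two numeric columns and one character column
d_mixed <- tibble(x1 = c(1, 2, 3),
                  x2 = c(4.5, 5.5, 6.5),
                  y1 = c("a", "b", "c"))

# Select every numeric column by type instead of by name
num_only <- d_mixed |>
  skim(where(is.numeric))

# Only x1 and x2 were skimmed
num_only$skim_variable
```

Because every selected column is numeric, this combines cleanly with yank("numeric").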

5.5 Limit output to specific descriptive statistics

We can limit the dataframes returned by skimr to a subset of the statistics using focus()

  • This is a variant of dplyr::select() that is safer to use with skimr dataframes because it keeps the skimr metadata columns, so the result still prints as a skim
  • This can be done across or within a data type
  • Type-specific statistics must be prefixed with the data type and a period (e.g., numeric.mean). Base statistics such as n_missing have no prefix
  • Columns can be selected using tidy select functions
  • Can be combined with yank() to limit the output to specific data types
d |> my_skim() |>
  focus(n_missing, numeric.mean) |> 
  yank("numeric")

Variable type: numeric

skim_variable  n_missing  mean
x1             0          9.82
x2             0          10.06
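
If you are not sure which names focus() expects, it can help to look at the column names of the skim result itself, since that is where the type prefixes live. A minimal sketch with the default skim() and made-up data (d_tiny, res, and picked are hypothetical names):

```r
library(tidyverse)
library(skimr)

# Hypothetical data with one numeric and one character column
d_tiny <- tibble(x = c(1, 2, 3),
                 y = c("a", "b", NA))

res <- skim(d_tiny)

# Base statistics appear without a prefix; type-specific ones are
# stored as <type>.<statistic>, e.g. numeric.mean
names(res)

# Use those names in focus(); the skimr metadata columns are kept automatically
picked <- res |>
  focus(n_missing, numeric.mean)
names(picked)
```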