6  Factors

ADD demo of cut_number(), cut_interval(), and related functions for making factors from numeric data

read

https://r4ds.had.co.nz/factors.html

https://www.kaggle.com/datasets/dillonmyrick/high-school-student-performance-and-demographics

https://www.kaggle.com/datasets/dillonmyrick/high-school-student-performance-and-demographics

https://www.kaggle.com/datasets/uom190346a/sleep-health-and-lifestyle-dataset

6.1 Classing as factor

x1 <- c(“Dec”, “Apr”, “Jan”, “Mar”) x2 <- c(“Dec”, “Apr”, “Jam”, “Mar”)

month_levels <- c( “Jan”, “Feb”, “Mar”, “Apr”, “May”, “Jun”, “Jul”, “Aug”, “Sep”, “Oct”, “Nov”, “Dec” )

y1 <- fct(x1, levels = month_levels)

y2 <- fct(x2, levels = month_levels)

fct will produce error if value noy in levels when levels supplied

levels(y2)

You can also create a factor when reading your data with readr with col_factor():

csv <- ” month,value Jan,12 Feb,56 Mar,12”

df <- read_csv(csv, col_types = cols(month = col_factor(month_levels))) df$month

6.2 EDA

gss_cat |> count(race)

bar graph

6.3 Modifying order

6.3.1 fct_reorder

relig_summary <- gss_cat |> group_by(relig) |> summarize( tvhours = mean(tvhours, na.rm = TRUE), n = n() )

relig_summary |> mutate( relig = fct_reorder(relig, tvhours) ) |> ggplot(aes(x = tvhours, y = relig)) + geom_point()

### fct_relevel

However, it does make sense to pull “Not applicable” to the front with the other special levels. You can use fct_relevel(). It takes a factor, f, and then any number of levels that you want to move to the front of the line.

ggplot(rincome_summary, aes(x = age, y = fct_relevel(rincome, “Not applicable”))) + geom_point()

6.3.2 fct_infreq & fct_rev

Finally, for bar plots, you can use fct_infreq() to order levels in decreasing frequency: this is the simplest type of reordering because it doesn’t need any extra variables. Combine it with fct_rev() if you want them in increasing frequency so that in the bar plot largest values are on the right, not the left.

gss_cat |> mutate(marital = marital |> fct_infreq() |> fct_rev()) |> ggplot(aes(x = marital)) + geom_bar()

fct_inseq(): by numeric value of level.

6.4 Modifying factor levels

6.4.1 fct_recode

gss_cat |> mutate( partyid = fct_recode(partyid, “Republican, strong” = “Strong republican”, “Republican, weak” = “Not str republican”, “Independent, near rep” = “Ind,near rep”, “Independent, near dem” = “Ind,near dem”, “Democrat, weak” = “Not str democrat”, “Democrat, strong” = “Strong democrat” ) )

fct_recode() will leave the levels that aren’t explicitly mentioned as is, and will warn you if you accidentally refer to a level that doesn’t exist

To combine groups, you can assign multiple old levels to the same new level:

gss_cat |> mutate( partyid = fct_recode(partyid, “Republican, strong” = “Strong republican”, “Republican, weak” = “Not str republican”, “Independent, near rep” = “Ind,near rep”, “Independent, near dem” = “Ind,near dem”, “Democrat, weak” = “Not str democrat”, “Democrat, strong” = “Strong democrat”, “Other” = “No answer”, “Other” = “Don’t know”, “Other” = “Other party” ) )

If we want to manipulate a numeric vector, first coerce it to a character, and then recode it. We need to be sure to quote the right half of each of our recoding pairs, since survey’s values are now character (e.g., “1”) rather than numeric (1).

survey <- fct_recode(as.character(survey), “Strongly agree” = “1”, “Agree” = “2”, “Neither agree nor disagree” = “3”, “Disagree” = “4”, “Strongly disagree” = “5”)

6.4.2 fct_collapse

If you want to collapse a lot of levels, fct_collapse() is a useful variant of fct_recode(). For each new variable, you can provide a vector of old levels:

gss_cat |> mutate( partyid = fct_collapse(partyid, “other” = c(“No answer”, “Don’t know”, “Other party”), “rep” = c(“Strong republican”, “Not str republican”), “ind” = c(“Ind,near rep”, “Independent”, “Ind,near dem”), “dem” = c(“Not str democrat”, “Strong democrat”) ) )

6.5 Setting Factor Contrasts

NOTE: THIS CAN BE UPDATED TO SIMPLIFY TO JUST SET THE COEFFFICIENTS DIRECTLY. ITS EASIER

  • see: https://marissabarlaz.github.io/portfolio/contrastcoding/
  • Default for unordered factors is treatment/dummy
  • We typically want centered orthogonal, and unit weighted.
  • We will use a running example of a factor, \(x\), with three levels. We put \(x\) in a dataframe to match our typical workflows
  • We then follow with several examples of setting and applying contrasts to \(x\)

6.5.1 Sample data

library(tidyverse)
d <- tibble(x = factor(rep(c("A","B", "C"), 4)))
d
# A tibble: 12 × 1
   x    
   <fct>
 1 A    
 2 B    
 3 C    
 4 A    
 5 B    
 6 C    
 7 A    
 8 B    
 9 C    
10 A    
11 B    
12 C    

6.5.2 Default contrasts

The default contrasts for factors in R as set and viewed using options()

  • They are contr.treatment (i.e., dummy codes) for unordered factors
  • They are contr.poly for ordered factors
  • We do not tend to change these defaults but instead explicitly set other contrasts when we need them
options("contrasts")
$contrasts
        unordered           ordered 
"contr.treatment"      "contr.poly" 

We can confirm the contrasts set by default for \(x\) as follows

  • They are dummy codes
  • They are named for the level being contrasted with refererence
contrasts(d$x)
  B C
A 0 0
B 1 0
C 0 1

6.5.3 Base R approach for other contrasts

We set contrasts using the contrasts() function as well. We could use base R functions that define contrast matrices for classic contrasts as well

  • But the coefficients aren’t united weighted (my preference)
  • And the contrasts aren’t given descriptive labels
contrasts(d$x) <- contr.helmert(levels(d$x))
contrasts(d$x)
  [,1] [,2]
A   -1   -1
B    1   -1
C    0    2

6.5.4 An improved approach

We tweak this approach to get unit weights (only for Helmert) and meaningful contrast labels

  • First we get the contrast matrix (same as above)
c3 <- contr.helmert(levels(d$x))
c3
  [,1] [,2]
A   -1   -1
B    1   -1
C    0    2
  • Then we adjust to make coefficients unit-weighted. This simply requires dividing each column by its range. You could put this in a loop if you had more columns. We generally only do this for helmert or other orthogonal contrasts.
c3[, 1] <- c3[, 1] / (max(c3[, 1]) - min(c3[, 1]))
c3[, 2] <- c3[, 2] / (max(c3[, 2]) - min(c3[, 2]))
  • Now assign names to the columns to label the contrasts
colnames(c3) <- c("BvA", "CvBA")
c3
   BvA       CvBA
A -0.5 -0.3333333
B  0.5 -0.3333333
C  0.0  0.6666667
  • And now assign these contrasts to \(x\)
contrasts(d$x) <- c3
contrasts(d$x)
   BvA       CvBA
A -0.5 -0.3333333
B  0.5 -0.3333333
C  0.0  0.6666667

6.5.5 Available contrasts

The contrasts we use most regularly are

  • Dummy contrasts (contr.treatment()). These are set by default to unordered factors
  • Helmert contrasts (contrl.helmert()).
  • Effects constrasts (contr.sum()). We do not generally unit weight these