7 Factors

ADD demo of cut_number(), cut_interval(), and related functions for making factors from numeric data

read

https://r4ds.had.co.nz/factors.html

https://www.kaggle.com/datasets/dillonmyrick/high-school-student-performance-and-demographics

https://www.kaggle.com/datasets/uom190346a/sleep-health-and-lifestyle-dataset

7.1 Classing as factor

x1 <- c(“Dec”, “Apr”, “Jan”, “Mar”) x2 <- c(“Dec”, “Apr”, “Jam”, “Mar”)

month_levels <- c( “Jan”, “Feb”, “Mar”, “Apr”, “May”, “Jun”, “Jul”, “Aug”, “Sep”, “Oct”, “Nov”, “Dec” )

y1 <- fct(x1, levels = month_levels)

y2 <- fct(x2, levels = month_levels)

fct will produce error if value noy in levels when levels supplied

levels(y2)

You can also create a factor when reading your data with readr with col_factor():

csv <- ” month,value Jan,12 Feb,56 Mar,12”

df <- read_csv(csv, col_types = cols(month = col_factor(month_levels))) df$month

7.2 EDA

gss_cat |> count(race)

bar graph

7.3 Modifying order

7.3.1 fct_reorder

relig_summary <- gss_cat |> group_by(relig) |> summarize( tvhours = mean(tvhours, na.rm = TRUE), n = n() )

relig_summary |> mutate( relig = fct_reorder(relig, tvhours) ) |> ggplot(aes(x = tvhours, y = relig)) + geom_point()

### fct_relevel

However, it does make sense to pull “Not applicable” to the front with the other special levels. You can use fct_relevel(). It takes a factor, f, and then any number of levels that you want to move to the front of the line.

ggplot(rincome_summary, aes(x = age, y = fct_relevel(rincome, “Not applicable”))) + geom_point()

7.3.2 fct_infreq & fct_rev

Finally, for bar plots, you can use fct_infreq() to order levels in decreasing frequency: this is the simplest type of reordering because it doesn’t need any extra variables. Combine it with fct_rev() if you want them in increasing frequency so that in the bar plot largest values are on the right, not the left.

gss_cat |> mutate(marital = marital |> fct_infreq() |> fct_rev()) |> ggplot(aes(x = marital)) + geom_bar()

fct_inseq(): by numeric value of level.

7.4 Modifying factor levels

7.4.1 fct_recode

gss_cat |> mutate( partyid = fct_recode(partyid, “Republican, strong” = “Strong republican”, “Republican, weak” = “Not str republican”, “Independent, near rep” = “Ind,near rep”, “Independent, near dem” = “Ind,near dem”, “Democrat, weak” = “Not str democrat”, “Democrat, strong” = “Strong democrat” ) )

fct_recode() will leave the levels that aren’t explicitly mentioned as is, and will warn you if you accidentally refer to a level that doesn’t exist

To combine groups, you can assign multiple old levels to the same new level:

If we want to manipulate a numeric vector, first coerce it to a character, and then recode it. We need to be sure to quote the right half of each of our recoding pairs, since survey’s values are now character (e.g., “1”) rather than numeric (1).

survey <- fct_recode(as.character(survey), “Strongly agree” = “1”, “Agree” = “2”, “Neither agree nor disagree” = “3”, “Disagree” = “4”, “Strongly disagree” = “5”)

7.4.2 fct_collapse

If you want to collapse a lot of levels, fct_collapse() is a useful variant of fct_recode(). For each new variable, you can provide a vector of old levels:

gss_cat |> mutate( partyid = fct_collapse(partyid, “other” = c(“No answer”, “Don’t know”, “Other party”), “rep” = c(“Strong republican”, “Not str republican”), “ind” = c(“Ind,near rep”, “Independent”, “Ind,near dem”), “dem” = c(“Not str democrat”, “Strong democrat”) ) )

7.5 Setting Factor Contrasts

NOTE: THIS CAN BE UPDATED TO SIMPLIFY TO JUST SET THE COEFFFICIENTS DIRECTLY. ITS EASIER

see: https://marissabarlaz.github.io/portfolio/contrastcoding/
Default for unordered factors is treatment/dummy
We typically want centered orthogonal, and unit weighted.
We will use a running example of a factor, $x$, with three levels. We put $x$ in a dataframe to match our typical workflows
We then follow with several examples of setting and applying contrasts to $x$

7.5.1 Sample data

library(tidyverse)
d <- tibble(x = factor(rep(c("A","B", "C"), 4)))
d

# A tibble: 12 × 1
   x    
   <fct>
 1 A    
 2 B    
 3 C    
 4 A    
 5 B    
 6 C    
 7 A    
 8 B    
 9 C    
10 A    
11 B    
12 C

7.5.2 Default contrasts

The default contrasts for factors in R as set and viewed using options()

They are contr.treatment (i.e., dummy codes) for unordered factors
They are contr.poly for ordered factors
We do not tend to change these defaults but instead explicitly set other contrasts when we need them

options("contrasts")

$contrasts
        unordered           ordered 
"contr.treatment"      "contr.poly"

We can confirm the contrasts set by default for $x$ as follows

They are dummy codes
They are named for the level being contrasted with refererence

contrasts(d$x)

  B C
A 0 0
B 1 0
C 0 1

7.5.3 Base R approach for other contrasts

We set contrasts using the contrasts() function as well. We could use base R functions that define contrast matrices for classic contrasts as well

But the coefficients aren’t united weighted (my preference)
And the contrasts aren’t given descriptive labels

contrasts(d$x) <- contr.helmert(levels(d$x))
contrasts(d$x)

  [,1] [,2]
A   -1   -1
B    1   -1
C    0    2

7.5.4 An improved approach

We tweak this approach to get unit weights (only for Helmert) and meaningful contrast labels

First we get the contrast matrix (same as above)

c3 <- contr.helmert(levels(d$x))
c3

  [,1] [,2]
A   -1   -1
B    1   -1
C    0    2

Then we adjust to make coefficients unit-weighted. This simply requires dividing each column by its range. You could put this in a loop if you had more columns. We generally only do this for helmert or other orthogonal contrasts.

c3[, 1] <- c3[, 1] / (max(c3[, 1]) - min(c3[, 1]))
c3[, 2] <- c3[, 2] / (max(c3[, 2]) - min(c3[, 2]))

Now assign names to the columns to label the contrasts

colnames(c3) <- c("BvA", "CvBA")
c3

   BvA       CvBA
A -0.5 -0.3333333
B  0.5 -0.3333333
C  0.0  0.6666667

And now assign these contrasts to $x$

contrasts(d$x) <- c3
contrasts(d$x)

   BvA       CvBA
A -0.5 -0.3333333
B  0.5 -0.3333333
C  0.0  0.6666667

7.5.5 Available contrasts

The contrasts we use most regularly are

Dummy contrasts (contr.treatment()). These are set by default to unordered factors
Helmert contrasts (contrl.helmert()).
Effects constrasts (contr.sum()). We do not generally unit weight these