12  Natural Language Processing: Text Processing and Feature Engineering

12.1 Overview of Unit

12.1.1 Learning Objectives

  • Objective 1
  • Objective 2

12.1.2 Readings

NOTES: Please read the above chapters with an eye more toward concepts and issues than code. I will demonstrate a minimal set of functions to accomplish the NLP modeling tasks for this unit.

Also know that the entire Hvitfeldt and Silge (2022) book really is mandatory reading. I would also strongly recommend the entire Silge and Robinson (2017) book. Both will be important references at a minimum.

Post questions to the readings channel in Slack


12.1.3 Lecture Videos

Post questions to the video-lectures channel in Slack


12.1.4 Application Assignment

Submit the application assignment here and complete the unit quiz by 8 pm on Wednesday, April 17th


12.2 General Text (Pre-) Processing

In order to work with text, we need to be able to manipulate text. We have two sets of tools to master:

  • The stringr package
  • Regular expressions (regex)

12.2.1 The stringr package

There are many functions in the stringr package that are very useful for searching and manipulating text.

  • stringr is included in tidyverse
  • I recommend keeping the stringr cheatsheet open whenever you are working with text until you learn these functions well.

All functions in stringr start with str_ and take a vector of strings as the first argument.

Here is a simple vector of strings to use as an example

library(tidyverse)
── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ dplyr     1.1.2     ✔ readr     2.1.4
✔ forcats   1.0.0     ✔ stringr   1.5.0
✔ ggplot2   3.4.2     ✔ tibble    3.2.1
✔ lubridate 1.9.3     ✔ tidyr     1.3.0
✔ purrr     1.0.1     
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag()    masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
x <- c("why", "video", "cross", "extra", "deal", "authority")
  • Length of each string
str_length(x)
[1] 3 5 5 5 4 9
  • Collapse all strings in vector into one long string that is comma separated
str_c(x, collapse = ", ")
[1] "why, video, cross, extra, deal, authority"
  • Get substring based on position (start and end position)
str_sub(x, 1, 2)
[1] "wh" "vi" "cr" "ex" "de" "au"

Most stringr functions work with regular expressions, a concise language for describing patterns of text.

For example, the regular expression “[aeiou]” matches any single character that is a vowel.

  • Here we use str_subset() to return the strings that contain vowels (doesn't include “why”)
str_subset(x, "[aeiou]")
[1] "video"     "cross"     "extra"     "deal"      "authority"
  • Here we count the vowels in each string
str_count(x, "[aeiou]") 
[1] 0 3 1 2 2 4

There are eight main verbs that work with patterns:

1 str_detect(x, pattern) tells you if there is any match to the pattern in each string

# x <- c("why", "video", "cross", "extra", "deal", "authority")
str_detect(x, "[aeiou]")
[1] FALSE  TRUE  TRUE  TRUE  TRUE  TRUE

2 str_count(x, pattern) counts the number of patterns

# x <- c("why", "video", "cross", "extra", "deal", "authority")
str_count(x, "[aeiou]")
[1] 0 3 1 2 2 4

3 str_subset(x, pattern) extracts the matching components

# x <- c("why", "video", "cross", "extra", "deal", "authority")
str_subset(x, "[aeiou]")
[1] "video"     "cross"     "extra"     "deal"      "authority"

4 str_locate(x, pattern) gives the position of the match

# x <- c("why", "video", "cross", "extra", "deal", "authority")
str_locate(x, "[aeiou]")
     start end
[1,]    NA  NA
[2,]     2   2
[3,]     3   3
[4,]     1   1
[5,]     2   2
[6,]     1   1

5 str_extract(x, pattern) extracts the text of the match

# x <- c("why", "video", "cross", "extra", "deal", "authority")
str_extract(x, "[aeiou]")
[1] NA  "i" "o" "e" "e" "a"

6 str_match(x, pattern) extracts parts of the match defined by parentheses. In this case, the characters on either side of the vowel

# x <- c("why", "video", "cross", "extra", "deal", "authority")
str_match(x, "(.)[aeiou](.)")
     [,1]  [,2] [,3]
[1,] NA    NA   NA  
[2,] "vid" "v"  "d" 
[3,] "ros" "r"  "s" 
[4,] NA    NA   NA  
[5,] "dea" "d"  "a" 
[6,] "aut" "a"  "t" 

7 str_replace(x, pattern, replacement) replaces the matches with new text

# x <- c("why", "video", "cross", "extra", "deal", "authority")
str_replace(x, "[aeiou]", "?")
[1] "why"       "v?deo"     "cr?ss"     "?xtra"     "d?al"      "?uthority"

8 str_split(x, pattern) splits up a string into multiple pieces.

str_split(c("a,b", "c,d,e"), ",")
[[1]]
[1] "a" "b"

[[2]]
[1] "c" "d" "e"

12.2.2 Regular Expressions

Regular expressions are a way to specify or search for patterns of strings using a sequence of characters. By combining a selection of simple patterns, we can capture quite complicated strings.

The stringr package uses regular expressions extensively

The regular expressions are passed as the pattern = argument. Regular expressions can be used to detect, locate, or extract parts of a string.
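
For example, here is a minimal sketch of a few common regex building blocks used with stringr functions (patterns chosen purely for illustration; output not shown):

x <- c("why", "video", "cross", "extra", "deal", "authority")

str_detect(x, "^a")         # anchor: does the string start with "a"?
str_detect(x, "y$")         # anchor: does the string end with "y"?
str_extract(x, "[aeiou]+")  # character class + quantifier: first run of vowels
str_detect(x, "cr|de")      # alternation: contains "cr" or "de"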


Julia Silge has put together a wonderful tutorial/primer on the use of regular expressions. After reading it, I finally had a solid grasp on them. Rather than grab sections, I will direct you to it (and review it live in our filmed lectures). She does it much better than I could!

You might consider installing the RegExplain package using devtools if you want more support working with regular expressions. Regular expressions are powerful, but they can be complicated to learn at first.
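
As a rough sketch, installation might look like the following (the GitHub repository name is my assumption; verify it against the package's documentation before installing):

devtools::install_github("gadenbuie/regexplain")  # assumed repo location; check before installing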

There is also a very helpful cheatsheet for regular expressions

And finally, there is a great Wickham, Çetinkaya-Rundel, and Grolemund (2023) chapter on strings more generally, which covers both stringr and regex.


12.3 The IMDB Dataset

Now that we have a basic understanding of how to manipulate raw text, we can get set up for NLP and introduce a guiding example for this unit

We can start with our normal cast of characters for packages, source files, and settings (not displayed here)

However, we will also load a few new ones that are specific to working with text.

library(tidytext)     # tidy text mining tools (e.g., unnest_tokens())
library(textrecipes)  # step_*() functions for NLP feature engineering in recipes
library(SnowballC)    # Snowball stemmers
library(stopwords)    # stopword lexicons

The IMDB Reviews dataset is a classic NLP dataset that is used for sentiment analysis

It contains:

  • 25,000 movie reviews in train and test
  • Balanced on positive and negative sentiment (labeled outcome)
  • For more info, see the website

Let’s start by loading the dataset and adding an identifier for each review (i.e., document, doc_num)

data_trn <- read_csv(here::here(path_data, "imdb_trn.csv"), 
                     show_col_types = FALSE) |>
  rowid_to_column(var = "doc_num") |> 
  mutate(sentiment = fct(sentiment, levels = c("neg", "pos"))) 

data_trn  |>
  skim_some()
Data summary

  Name                     data_trn
  Number of rows           25000
  Number of columns        3
  Column type frequency:
    character              1
    factor                 1
    numeric                1
  Group variables          None

Variable type: character

  skim_variable  n_missing  complete_rate  min  max    empty  n_unique  whitespace
  text                   0              1   52  13704      0     24904           0

Variable type: factor

  skim_variable  n_missing  complete_rate  ordered  n_unique  top_counts
  sentiment              0              1  FALSE           2  neg: 12500, pos: 12500

Variable type: numeric

  skim_variable  n_missing  complete_rate  p0   p100
  doc_num                0              1   1  25000

Let’s look at our outcome

data_trn |> tab(sentiment)
# A tibble: 2 × 3
  sentiment     n  prop
  <fct>     <int> <dbl>
1 neg       12500   0.5
2 pos       12500   0.5

To get a better sense of the dataset, we can view the first five negative reviews from the training set

data_trn |> 
  filter(sentiment == "neg") |> 
  slice(1:5) |> 
  pull(text) |> 
  print_kbl() 
x
Story of a man who has unnatural feelings for a pig. Starts out with a opening scene that is a terrific example of absurd comedy. A formal orchestra audience is turned into an insane, violent mob by the crazy chantings of it's singers. Unfortunately it stays absurd the WHOLE time with no general narrative eventually making it just too off putting. Even those from the era should be turned off. The cryptic dialogue would make Shakespeare seem easy to a third grader. On a technical level it's better than you might think with some good cinematography by future great Vilmos Zsigmond. Future stars Sally Kirkland and Frederic Forrest can be seen briefly.
Airport '77 starts as a brand new luxury 747 plane is loaded up with valuable paintings & such belonging to rich businessman Philip Stevens (James Stewart) who is flying them & a bunch of VIP's to his estate in preparation of it being opened to the public as a museum, also on board is Stevens daughter Julie (Kathleen Quinlan) & her son. The luxury jetliner takes off as planned but mid-air the plane is hi-jacked by the co-pilot Chambers (Robert Foxworth) & his two accomplice's Banker (Monte Markham) & Wilson (Michael Pataki) who knock the passengers & crew out with sleeping gas, they plan to steal the valuable cargo & land on a disused plane strip on an isolated island but while making his descent Chambers almost hits an oil rig in the Ocean & loses control of the plane sending it crashing into the sea where it sinks to the bottom right bang in the middle of the Bermuda Triangle. With air in short supply, water leaking in & having flown over 200 miles off course the problems mount for the survivor's as they await help with time fast running out...<br /><br />Also known under the slightly different tile Airport 1977 this second sequel to the smash-hit disaster thriller Airport (1970) was directed by Jerry Jameson & while once again like it's predecessors I can't say Airport '77 is any sort of forgotten classic it is entertaining although not necessarily for the right reasons. Out of the three Airport films I have seen so far I actually liked this one the best, just. It has my favourite plot of the three with a nice mid-air hi-jacking & then the crashing (didn't he see the oil rig?) & sinking of the 747 (maybe the makers were trying to cross the original Airport with another popular disaster flick of the period The Poseidon Adventure (1972)) & submerged is where it stays until the end with a stark dilemma facing those trapped inside, either suffocate when the air runs out or drown as the 747 floods or if any of the doors are opened & it's a decent idea that could have made for a great little disaster flick but bad unsympathetic character's, dull dialogue, lethargic set-pieces & a real lack of danger or suspense or tension means this is a missed opportunity. While the rather sluggish plot keeps one entertained for 108 odd minutes not that much happens after the plane sinks & there's not as much urgency as I thought there should have been. Even when the Navy become involved things don't pick up that much with a few shots of huge ships & helicopters flying about but there's just something lacking here. George Kennedy as the jinxed airline worker Joe Patroni is back but only gets a couple of scenes & barely even says anything preferring to just look worried in the background.<br /><br />The home video & theatrical version of Airport '77 run 108 minutes while the US TV versions add an extra hour of footage including a new opening credits sequence, many more scenes with George Kennedy as Patroni, flashbacks to flesh out character's, longer rescue scenes & the discovery or another couple of dead bodies including the navigator. While I would like to see this extra footage I am not sure I could sit through a near three hour cut of Airport '77. As expected the film has dated badly with horrible fashions & interior design choices, I will say no more other than the toy plane model effects aren't great either. Along with the other two Airport sequels this takes pride of place in the Razzie Award's Hall of Shame although I can think of lots of worse films than this so I reckon that's a little harsh. 
The action scenes are a little dull unfortunately, the pace is slow & not much excitement or tension is generated which is a shame as I reckon this could have been a pretty good film if made properly.<br /><br />The production values are alright if nothing spectacular. The acting isn't great, two time Oscar winner Jack Lemmon has said since it was a mistake to star in this, one time Oscar winner James Stewart looks old & frail, also one time Oscar winner Lee Grant looks drunk while Sir Christopher Lee is given little to do & there are plenty of other familiar faces to look out for too.<br /><br />Airport '77 is the most disaster orientated of the three Airport films so far & I liked the ideas behind it even if they were a bit silly, the production & bland direction doesn't help though & a film about a sunken plane just shouldn't be this boring or lethargic. Followed by The Concorde ... Airport '79 (1979).
This film lacked something I couldn't put my finger on at first: charisma on the part of the leading actress. This inevitably translated to lack of chemistry when she shared the screen with her leading man. Even the romantic scenes came across as being merely the actors at play. It could very well have been the director who miscalculated what he needed from the actors. I just don't know.<br /><br />But could it have been the screenplay? Just exactly who was the chef in love with? He seemed more enamored of his culinary skills and restaurant, and ultimately of himself and his youthful exploits, than of anybody or anything else. He never convinced me he was in love with the princess.<br /><br />I was disappointed in this movie. But, don't forget it was nominated for an Oscar, so judge for yourself.
Sorry everyone,,, I know this is supposed to be an "art" film,, but wow, they should have handed out guns at the screening so people could blow their brains out and not watch. Although the scene design and photographic direction was excellent, this story is too painful to watch. The absence of a sound track was brutal. The loooonnnnng shots were too long. How long can you watch two people just sitting there and talking? Especially when the dialogue is two people complaining. I really had a hard time just getting through this film. The performances were excellent, but how much of that dark, sombre, uninspired, stuff can you take? The only thing i liked was Maureen Stapleton and her red dress and dancing scene. Otherwise this was a ripoff of Bergman. And i'm no fan f his either. I think anyone who says they enjoyed 1 1/2 hours of this is,, well, lying.
When I was little my parents took me along to the theater to see Interiors. It was one of many movies I watched with my parents, but this was the only one we walked out of. Since then I had never seen Interiors until just recently, and I could have lived out the rest of my life without it. What a pretentious, ponderous, and painfully boring piece of 70's wine and cheese tripe. Woody Allen is one of my favorite directors but Interiors is by far the worst piece of crap of his career. In the unmistakable style of Ingmar Berman, Allen gives us a dark, angular, muted, insight in to the lives of a family wrought by the psychological damage caused by divorce, estrangement, career, love, non-love, halitosis, whatever. The film, intentionally, has no comic relief, no music, and is drenched in shadowy pathos. This film style can be best defined as expressionist in nature, using an improvisational method of dialogue to illicit a "more pronounced depth of meaning and truth". But Woody Allen is no Ingmar Bergman. The film is painfully slow and dull. But beyond that, I simply had no connection with or sympathy for any of the characters. Instead I felt only contempt for this parade of shuffling, whining, nicotine stained, martyrs in a perpetual quest for identity. Amid a backdrop of cosmopolitan affluence and baked Brie intelligentsia the story looms like a fart in the room. Everyone speaks in affected platitudes and elevated language between cigarettes. Everyone is "lost" and "struggling", desperate to find direction or understanding or whatever and it just goes on and on to the point where you just want to slap all of them. It's never about resolution, it's only about interminable introspective babble. It is nothing more than a psychological drama taken to an extreme beyond the audience's ability to connect. Woody Allen chose to make characters so immersed in themselves we feel left out. And for that reason I found this movie painfully self indulgent and spiritually draining. I see what he was going for but his insistence on promoting his message through Prozac prose and distorted film techniques jettisons it past the point of relevance. I highly recommend this one if you're feeling a little too happy and need something to remind you of death. Otherwise, let's just pretend this film never happened.

and the first five positive reviews from the training set

data_trn |> 
  filter(sentiment == "pos") |> 
  slice(1:5) |> 
  pull(text) |> 
  print_kbl()
x
Bromwell High is a cartoon comedy. It ran at the same time as some other programs about school life, such as "Teachers". My 35 years in the teaching profession lead me to believe that Bromwell High's satire is much closer to reality than is "Teachers". The scramble to survive financially, the insightful students who can see right through their pathetic teachers' pomp, the pettiness of the whole situation, all remind me of the schools I knew and their students. When I saw the episode in which a student repeatedly tried to burn down the school, I immediately recalled ......... at .......... High. A classic line: INSPECTOR: I'm here to sack one of your teachers. STUDENT: Welcome to Bromwell High. I expect that many adults of my age think that Bromwell High is far fetched. What a pity that it isn't!
Homelessness (or Houselessness as George Carlin stated) has been an issue for years but never a plan to help those on the street that were once considered human who did everything from going to school, work, or vote for the matter. Most people think of the homeless as just a lost cause while worrying about things such as racism, the war on Iraq, pressuring kids to succeed, technology, the elections, inflation, or worrying if they'll be next to end up on the streets.<br /><br />But what if you were given a bet to live on the streets for a month without the luxuries you once had from a home, the entertainment sets, a bathroom, pictures on the wall, a computer, and everything you once treasure to see what it's like to be homeless? That is Goddard Bolt's lesson.<br /><br />Mel Brooks (who directs) who stars as Bolt plays a rich man who has everything in the world until deciding to make a bet with a sissy rival (Jeffery Tambor) to see if he can live in the streets for thirty days without the luxuries; if Bolt succeeds, he can do what he wants with a future project of making more buildings. The bet's on where Bolt is thrown on the street with a bracelet on his leg to monitor his every move where he can't step off the sidewalk. He's given the nickname Pepto by a vagrant after it's written on his forehead where Bolt meets other characters including a woman by the name of Molly (Lesley Ann Warren) an ex-dancer who got divorce before losing her home, and her pals Sailor (Howard Morris) and Fumes (Teddy Wilson) who are already used to the streets. They're survivors. Bolt isn't. He's not used to reaching mutual agreements like he once did when being rich where it's fight or flight, kill or be killed.<br /><br />While the love connection between Molly and Bolt wasn't necessary to plot, I found "Life Stinks" to be one of Mel Brooks' observant films where prior to being a comedy, it shows a tender side compared to his slapstick work such as Blazing Saddles, Young Frankenstein, or Spaceballs for the matter, to show what it's like having something valuable before losing it the next day or on the other hand making a stupid bet like all rich people do when they don't know what to do with their money. Maybe they should give it to the homeless instead of using it like Monopoly money.<br /><br />Or maybe this film will inspire you to help others.
Brilliant over-acting by Lesley Ann Warren. Best dramatic hobo lady I have ever seen, and love scenes in clothes warehouse are second to none. The corn on face is a classic, as good as anything in Blazing Saddles. The take on lawyers is also superb. After being accused of being a turncoat, selling out his boss, and being dishonest the lawyer of Pepto Bolt shrugs indifferently "I'm a lawyer" he says. Three funny words. Jeffrey Tambor, a favorite from the later Larry Sanders show, is fantastic here too as a mad millionaire who wants to crush the ghetto. His character is more malevolent than usual. The hospital scene, and the scene where the homeless invade a demolition site, are all-time classics. Look for the legs scene and the two big diggers fighting (one bleeds). This movie gets better each time I see it (which is quite often).
This is easily the most underrated film inn the Brooks cannon. Sure, its flawed. It does not give a realistic view of homelessness (unlike, say, how Citizen Kane gave a realistic view of lounge singers, or Titanic gave a realistic view of Italians YOU IDIOTS). Many of the jokes fall flat. But still, this film is very lovable in a way many comedies are not, and to pull that off in a story about some of the most traditionally reviled members of society is truly impressive. Its not The Fisher King, but its not crap, either. My only complaint is that Brooks should have cast someone else in the lead (I love Mel as a Director and Writer, not so much as a lead).
This is not the typical Mel Brooks film. It was much less slapstick than most of his movies and actually had a plot that was followable. Leslie Ann Warren made the movie, she is such a fantastic, under-rated actress. There were some moments that could have been fleshed out a bit more, and some scenes that could probably have been cut to make the room to do so, but all in all, this is worth the price to rent and see it. The acting was good overall, Brooks himself did a good job without his characteristic speaking to directly to the audience. Again, Warren was the best actor in the movie, but "Fume" and "Sailor" both played their parts well.

You need to spend a LOT of time reviewing the text before you begin to process it.

I have NOT done this yet!

My models will be sub-optimal!


12.4 Tokens

Machine learning algorithms cannot work with raw text (documents) directly

We must feature engineer these documents to allow them to serve as input to statistical algorithms

The first step for most NLP feature engineering methods is to represent text (documents) as tokens (words, ngrams)

Given that tokenization is often one of our first steps for extracting features from text, it is important to consider carefully what happens during this step and its implications for your subsequent modeling


In tokenization, we take input documents (text strings) and a token type (a meaningful unit of text, such as a word) and split the document into pieces (tokens) that correspond to the type

We can tokenize text into a variety of token types:

  • characters
  • words (most common - our focus; unigrams)
  • sentences
  • lines
  • paragraphs
  • n-grams (bigrams, trigrams)
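
As a quick sketch, several of these token types can be produced directly with functions from the tokenizers package (introduced more formally below; output not shown):

doc <- "I am not happy. The movie was far too long."

tokenizers::tokenize_characters(doc)  # character tokens
tokenizers::tokenize_words(doc)       # word tokens (unigrams)
tokenizers::tokenize_sentences(doc)   # sentence tokens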

An n-gram consists of a sequence of n items from a given sequence of text. Most often, it is a group of n words (bigrams, trigrams)


n-grams retain word order which would otherwise be lost if we were just using words as the token type

“I am not happy”

Tokenized by word, yields:

  • I
  • am
  • not
  • happy



Tokenized by 2-gram words:

  • I am
  • am not
  • not happy
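
We could produce these same bigrams with tokenize_ngrams() from the tokenizers package (a minimal sketch; output not shown):

tokenizers::tokenize_ngrams("I am not happy", n = 2, n_min = 2)  # should yield lowercased bigrams: "i am", "am not", "not happy"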

We will be using tokenizer functions from the tokenizers package. Three in particular are:

  • tokenize_words(x, lowercase = TRUE, stopwords = NULL, strip_punct = TRUE, strip_numeric = FALSE, simplify = FALSE)
  • tokenize_ngrams(x, lowercase = TRUE, n = 3L, n_min = n, stopwords = character(), ngram_delim = " ", simplify = FALSE)
  • tokenize_regex(x, pattern = "\\s+", simplify = FALSE)

However, we will be accessing these functions through wrappers:

  • tidytext::unnest_tokens(tbl, output, input, token = "words", format = c("text", "man", "latex", "html", "xml"), to_lower = TRUE, drop = TRUE, collapse = NULL) for tidyverse data exploration of tokens within tibbles
  • textrecipes::step_tokenize() for tokenization in our recipes
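
For example, here is a minimal sketch of how tokenization might appear in a recipe (assuming the recipes package from our standard setup is loaded; later steps that convert tokens into features are not shown):

rec <- recipe(sentiment ~ text, data = data_trn) |> 
  step_tokenize(text, token = "words")  # tokenize the text column into word tokens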

Word-level tokenization by tokenize_words() is done by finding word boundaries as follows:

  • Break at the start and end of text, unless the text is empty
  • Do not break within CRLF (new line characters)
  • Otherwise, break before and after new lines (including CR and LF)
  • Do not break between most letters
  • Do not break letters across certain punctuation
  • Do not break within sequences of digits, or digits adjacent to letters (“3a” or “A3”)
  • Do not break within sequences, such as “3.2” or “3,456.789.”
  • Do not break between Katakana
  • Do not break from extenders
  • Do not break within emoji zwj sequences
  • Do not break within emoji flag sequences
  • Ignore Format and Extend characters, except after sot, CR, LF, and new line
  • Keep horizontal whitespace together
  • Otherwise, break everywhere (including around ideographs and other symbols, e.g., @, %, >)

Let’s start with using tokenize_words() to get a sense of how it works by default

  • Splits on spaces
  • Converts to lowercase by default (does it matter? MAYBE!!!)
  • Retains apostrophes but drops other punctuation (e.g., commas, periods, exclamation points) and some symbols (e.g., -, \\, @) by default. Does not drop _ (Do you need punctuation? !!!!)
  • Retains numbers (by default) and decimals but drops the + appended to 4+
  • Has trouble with URLs and email addresses (do you need these?)
  • Often, these issues may NOT matter
"Here is a sample document to tokenize.  How EXCITING (I _love_ it).  Sarah has spent 4 or 4.1 or 4P or 4+ or >4 years developing her pre-processing and NLP skills.  You can learn more about tokenization here: https://smltar.com/tokenization.html or by emailing me at jjcurtin@wisc.edu" |> 
  tokenizers::tokenize_words()
[[1]]
 [1] "here"              "is"                "a"                
 [4] "sample"            "document"          "to"               
 [7] "tokenize"          "how"               "exciting"         
[10] "i"                 "_love_"            "it"               
[13] "sarah"             "has"               "spent"            
[16] "4"                 "or"                "4.1"              
[19] "or"                "4p"                "or"               
[22] "4"                 "or"                "4"                
[25] "years"             "developing"        "her"              
[28] "pre"               "processing"        "and"              
[31] "nlp"               "skills"            "you"              
[34] "can"               "learn"             "more"             
[37] "about"             "tokenization"      "here"             
[40] "https"             "smltar.com"        "tokenization.html"
[43] "or"                "by"                "emailing"         
[46] "me"                "at"                "jjcurtin"         
[49] "wisc.edu"         

Some of these behaviors can be altered from their defaults

  • lowercase = TRUE
  • strip_punct = TRUE
  • strip_numeric = FALSE
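
For example, a minimal sketch of calling tokenize_words() with non-default settings (output not shown):

"Sarah has spent 4 or 4.1 years developing her NLP skills!" |> 
  tokenizers::tokenize_words(lowercase = FALSE,     # keep the original case
                             strip_punct = FALSE,   # retain punctuation
                             strip_numeric = TRUE)  # drop numeric tokens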

Some of these issues can be corrected by pre-processing the text

str_replace("jjcurtin@wisc.edu", "@", "_at_")
[1] "jjcurtin_at_wisc.edu"

If you need finer control, you can use tokenize_regex() and then do further processing with stringr functions and regex

It may then be easier to build up what you need from there, e.g.:

  • str_to_lower(word)
  • str_replace(word, "[[:punct:]]+$", "") to strip trailing punctuation
"Here is a sample document to tokenize.  How EXCITING (I _love_ it).  Sarah has spent 4 or 4.1 or 4P or 4+ years developing her pre-processing and NLP skills.  You can learn more about tokenization here: https://smltar.com/tokenization.html or by emailing me at jjcurtin@wisc.edu" |> 
  tokenizers::tokenize_regex(pattern = "\\s+")
[[1]]
 [1] "Here"                                
 [2] "is"                                  
 [3] "a"                                   
 [4] "sample"                              
 [5] "document"                            
 [6] "to"                                  
 [7] "tokenize."                           
 [8] "How"                                 
 [9] "EXCITING"                            
[10] "(I"                                  
[11] "_love_"                              
[12] "it)."                                
[13] "Sarah"                               
[14] "has"                                 
[15] "spent"                               
[16] "4"                                   
[17] "or"                                  
[18] "4.1"                                 
[19] "or"                                  
[20] "4P"                                  
[21] "or"                                  
[22] "4+"                                  
[23] "years"                               
[24] "developing"                          
[25] "her"                                 
[26] "pre-processing"                      
[27] "and"                                 
[28] "NLP"                                 
[29] "skills."                             
[30] "You"                                 
[31] "can"                                 
[32] "learn"                               
[33] "more"                                
[34] "about"                               
[35] "tokenization"                        
[36] "here:"                               
[37] "https://smltar.com/tokenization.html"
[38] "or"                                  
[39] "by"                                  
[40] "emailing"                            
[41] "me"                                  
[42] "at"                                  
[43] "jjcurtin@wisc.edu"                   
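
For example, here is a minimal sketch of building up from the regex-split tokens using the stringr calls suggested above (illustration only, not our final pre-processing; output not shown):

"Here is a sample document to tokenize.  How EXCITING (I _love_ it)." |> 
  tokenizers::tokenize_regex(pattern = "\\s+") |> 
  purrr::map(\(word) word |> 
               str_to_lower() |>                    # lowercase each token
               str_replace("[[:punct:]]+$", "") |>  # strip trailing punctuation (e.g., "it).")
               str_replace("^[[:punct:]]+", ""))    # strip leading punctuation (e.g., "(i")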

As part of your EDA, you can explore the tokens that will be formed by using unnest_tokens() and basic tidyverse data wrangling on a tidied format of your documents

  • We unnest to 1 token (word) per row (tidy format)
  • We keep track of doc_num (added earlier)



Here, we tokenize the IMDB training set.

  • Using defaults
  • Can change other defaults for tokenize_*() by passing arguments to the function via ...
  • Can set drop = TRUE (default) to discard the original document column (text)
  • It's pretty fast!
tokens <- data_trn |> 
  unnest_tokens(word, text, token = "words", to_lower = TRUE, drop = FALSE) |> 
  glimpse()
Rows: 5,935,548
Columns: 4
$ doc_num   <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, …
$ sentiment <fct> neg, neg, neg, neg, neg, neg, neg, neg, neg, neg, neg, neg, …
$ text      <chr> "Story of a man who has unnatural feelings for a pig. Starts…
$ word      <chr> "story", "of", "a", "man", "who", "has", "unnatural", "feeli…



Coding sidebar: You can take a much deeper dive into tidyverse text processing in chapter 1 of Silge and Robinson (2017).


Let’s get oriented by reviewing the tokens from the first document

  • Raw form
data_trn$text[1]
[1] "Story of a man who has unnatural feelings for a pig. Starts out with a opening scene that is a terrific example of absurd comedy. A formal orchestra audience is turned into an insane, violent mob by the crazy chantings of it's singers. Unfortunately it stays absurd the WHOLE time with no general narrative eventually making it just too off putting. Even those from the era should be turned off. The cryptic dialogue would make Shakespeare seem easy to a third grader. On a technical level it's better than you might think with some good cinematography by future great Vilmos Zsigmond. Future stars Sally Kirkland and Frederic Forrest can be seen briefly."
  • Tokenized by word and tidied
tokens |> 
  filter(doc_num == 1) |> 
  select(word) |> 
  print(n = Inf)
# A tibble: 112 × 1
    word          
    <chr>         
  1 story         
  2 of            
  3 a             
  4 man           
  5 who           
  6 has           
  7 unnatural     
  8 feelings      
  9 for           
 10 a             
 11 pig           
 12 starts        
 13 out           
 14 with          
 15 a             
 16 opening       
 17 scene         
 18 that          
 19 is            
 20 a             
 21 terrific      
 22 example       
 23 of            
 24 absurd        
 25 comedy        
 26 a             
 27 formal        
 28 orchestra     
 29 audience      
 30 is            
 31 turned        
 32 into          
 33 an            
 34 insane        
 35 violent       
 36 mob           
 37 by            
 38 the           
 39 crazy         
 40 chantings     
 41 of            
 42 it's          
 43 singers       
 44 unfortunately 
 45 it            
 46 stays         
 47 absurd        
 48 the           
 49 whole         
 50 time          
 51 with          
 52 no            
 53 general       
 54 narrative     
 55 eventually    
 56 making        
 57 it            
 58 just          
 59 too           
 60 off           
 61 putting       
 62 even          
 63 those         
 64 from          
 65 the           
 66 era           
 67 should        
 68 be            
 69 turned        
 70 off           
 71 the           
 72 cryptic       
 73 dialogue      
 74 would         
 75 make          
 76 shakespeare   
 77 seem          
 78 easy          
 79 to            
 80 a             
 81 third         
 82 grader        
 83 on            
 84 a             
 85 technical     
 86 level         
 87 it's          
 88 better        
 89 than          
 90 you           
 91 might         
 92 think         
 93 with          
 94 some          
 95 good          
 96 cinematography
 97 by            
 98 future        
 99 great         
100 vilmos        
101 zsigmond      
102 future        
103 stars         
104 sally         
105 kirkland      
106 and           
107 frederic      
108 forrest       
109 can           
110 be            
111 seen          
112 briefly       

Considering all the tokens across all documents

  • There are almost 6 million words
length(tokens$word)
[1] 5935548
  • The total unique vocabulary is around 85 thousand words
length(unique(tokens$word))
[1] 85574

Word frequency is VERY skewed

  • These are the counts for the most frequent 750 words
  • There are 84,000+ additional infrequent words in the right tail not shown here!
tokens |>
  count(word, sort = TRUE) |> 
  slice(1:750) |> 
  mutate(word = reorder(word, -n)) |>
  ggplot(aes(word, n)) +
    geom_col() +
    xlab("Words") +
    ylab("Raw Count") +
    theme(axis.text.x=element_blank(), axis.ticks.x=element_blank())


Now let’s review the 100 most common words

  • We SHOULD review MUCH deeper than this
  • Some of our feature engineering approaches (e.g., BoW) will use only the most frequent 5K - 20K tokens
  • Some of our feature engineering approaches (e.g., word embeddings) may use ALL tokens
  • Some of these are likely not very informative (the, a, of, to, is). We will return to those words in a bit when we consider stopwords
  • Notice br. Why is it so common?
tokens |>
  count(word, sort = TRUE) |>
  print(n = 100)
# A tibble: 85,574 × 2
    word         n
    <chr>    <int>
  1 the     336179
  2 and     164061
  3 a       162738
  4 of      145848
  5 to      135695
  6 is      107320
  7 br      101871
  8 in       93920
  9 it       78874
 10 i        76508
 11 this     75814
 12 that     69794
 13 was      48189
 14 as       46903
 15 for      44321
 16 with     44115
 17 movie    43509
 18 but      42531
 19 film     39058
 20 on       34185
 21 not      30608
 22 you      29886
 23 are      29431
 24 his      29352
 25 have     27725
 26 be       26947
 27 he       26894
 28 one      26502
 29 all      23927
 30 at       23500
 31 by       22538
 32 an       21550
 33 they     21096
 34 who      20604
 35 so       20573
 36 from     20488
 37 like     20268
 38 her      18399
 39 or       17997
 40 just     17764
 41 about    17368
 42 out      17099
 43 it's     17094
 44 has      16789
 45 if       16746
 46 some     15734
 47 there    15671
 48 what     15374
 49 good     15110
 50 more     14242
 51 when     14161
 52 very     14059
 53 up       13283
 54 no       12698
 55 time     12691
 56 even     12638
 57 she      12624
 58 my       12485
 59 would    12236
 60 which    12047
 61 story    11918
 62 only     11910
 63 really   11734
 64 see      11465
 65 their    11376
 66 had      11289
 67 can      11144
 68 were     10782
 69 me       10745
 70 well     10637
 71 than      9920
 72 we        9858
 73 much      9750
 74 bad       9292
 75 been      9287
 76 get       9279
 77 will      9195
 78 do        9159
 79 also      9130
 80 into      9109
 81 people    9107
 82 other     9083
 83 first     9054
 84 because   9045
 85 great     9033
 86 how       8870
 87 him       8865
 88 most      8775
 89 don't     8445
 90 made      8351
 91 its       8156
 92 then      8097
 93 make      8018
 94 way       8005
 95 them      7954
 96 too       7820
 97 could     7745
 98 any       7653
 99 movies    7648
100 after     7617
# ℹ 85,474 more rows

Here is the first document that has the br token in it.

It is the HTML code for a line break.

tokens |> 
  filter(word == "br") |> 
  slice(1) |> 
  pull(text)
[1] "Airport '77 starts as a brand new luxury 747 plane is loaded up with valuable paintings & such belonging to rich businessman Philip Stevens (James Stewart) who is flying them & a bunch of VIP's to his estate in preparation of it being opened to the public as a museum, also on board is Stevens daughter Julie (Kathleen Quinlan) & her son. The luxury jetliner takes off as planned but mid-air the plane is hi-jacked by the co-pilot Chambers (Robert Foxworth) & his two accomplice's Banker (Monte Markham) & Wilson (Michael Pataki) who knock the passengers & crew out with sleeping gas, they plan to steal the valuable cargo & land on a disused plane strip on an isolated island but while making his descent Chambers almost hits an oil rig in the Ocean & loses control of the plane sending it crashing into the sea where it sinks to the bottom right bang in the middle of the Bermuda Triangle. With air in short supply, water leaking in & having flown over 200 miles off course the problems mount for the survivor's as they await help with time fast running out...<br /><br />Also known under the slightly different tile Airport 1977 this second sequel to the smash-hit disaster thriller Airport (1970) was directed by Jerry Jameson & while once again like it's predecessors I can't say Airport '77 is any sort of forgotten classic it is entertaining although not necessarily for the right reasons. Out of the three Airport films I have seen so far I actually liked this one the best, just. It has my favourite plot of the three with a nice mid-air hi-jacking & then the crashing (didn't he see the oil rig?) & sinking of the 747 (maybe the makers were trying to cross the original Airport with another popular disaster flick of the period The Poseidon Adventure (1972)) & submerged is where it stays until the end with a stark dilemma facing those trapped inside, either suffocate when the air runs out or drown as the 747 floods or if any of the doors are opened & it's a decent idea that could have made for a great little disaster flick but bad unsympathetic character's, dull dialogue, lethargic set-pieces & a real lack of danger or suspense or tension means this is a missed opportunity. While the rather sluggish plot keeps one entertained for 108 odd minutes not that much happens after the plane sinks & there's not as much urgency as I thought there should have been. Even when the Navy become involved things don't pick up that much with a few shots of huge ships & helicopters flying about but there's just something lacking here. George Kennedy as the jinxed airline worker Joe Patroni is back but only gets a couple of scenes & barely even says anything preferring to just look worried in the background.<br /><br />The home video & theatrical version of Airport '77 run 108 minutes while the US TV versions add an extra hour of footage including a new opening credits sequence, many more scenes with George Kennedy as Patroni, flashbacks to flesh out character's, longer rescue scenes & the discovery or another couple of dead bodies including the navigator. While I would like to see this extra footage I am not sure I could sit through a near three hour cut of Airport '77. As expected the film has dated badly with horrible fashions & interior design choices, I will say no more other than the toy plane model effects aren't great either. Along with the other two Airport sequels this takes pride of place in the Razzie Award's Hall of Shame although I can think of lots of worse films than this so I reckon that's a little harsh. 
The action scenes are a little dull unfortunately, the pace is slow & not much excitement or tension is generated which is a shame as I reckon this could have been a pretty good film if made properly.<br /><br />The production values are alright if nothing spectacular. The acting isn't great, two time Oscar winner Jack Lemmon has said since it was a mistake to star in this, one time Oscar winner James Stewart looks old & frail, also one time Oscar winner Lee Grant looks drunk while Sir Christopher Lee is given little to do & there are plenty of other familiar faces to look out for too.<br /><br />Airport '77 is the most disaster orientated of the three Airport films so far & I liked the ideas behind it even if they were a bit silly, the production & bland direction doesn't help though & a film about a sunken plane just shouldn't be this boring or lethargic. Followed by The Concorde ... Airport '79 (1979)."

Let’s clean it in the raw documents and re-tokenize

You should always check your replacements CAREFULLY for unexpected matches and side effects before applying them
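
For example, a minimal sketch of one quick check, previewing a little context around each match before replacing it (output not shown):

data_trn |> 
  filter(str_detect(text, "<br /><br />")) |> 
  slice(1:3) |> 
  pull(text) |> 
  str_extract(".{40}<br /><br />.{40}")  # 40 characters of context on each side of the match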

data_trn <- data_trn |> 
  mutate(text = str_replace_all(text, "<br /><br />", " "))

tokens <- data_trn |> 
  unnest_tokens(word, text, drop = FALSE)



We should continue to review MUCH deeper into the common tokens to detect other tokenization errors. I will not demonstrate that here.


We should also review the least common tokens

  • Worth searching to make sure they haven’t resulted from tokenization errors
  • Can use this to tune our pre-processing and tokenization
  • BUT, not as important b/c these will be dropped by some of our feature engineering approaches (BoW) and may not present too much of a problem for others (embeddings)
tokens |>
  count(word) |> 
  arrange(n) |> 
  print(n = 200)
# A tibble: 85,574 × 2
    word                            n
    <chr>                       <int>
  1 0.10                            1
  2 0.48                            1
  3 0.7                             1
  4 0.79                            1
  5 0.89                            1
  6 00.01                           1
  7 000                             1
  8 0000000000001                   1
  9 00015                           1
 10 003830                          1
 11 006                             1
 12 0079                            1
 13 0093638                         1
 14 01pm                            1
 15 020410                          1
 16 029                             1
 17 041                             1
 18 050                             1
 19 06th                            1
 20 087                             1
 21 089                             1
 22 08th                            1
 23 0f                              1
 24 0ne                             1
 25 0r                              1
 26 0s                              1
 27 1'40                            1
 28 1,000,000,000,000               1
 29 1,000.00                        1
 30 1,000s                          1
 31 1,2,3                           1
 32 1,2,3,4,5                       1
 33 1,2,3,5                         1
 34 1,300                           1
 35 1,400                           1
 36 1,430                           1
 37 1,500,000                       1
 38 1,600                           1
 39 1,65m                           1
 40 1,700                           1
 41 1,999,999                       1
 42 1.000                           1
 43 1.19                            1
 44 1.30                            1
 45 1.30am                          1
 46 1.3516                          1
 47 1.37                            1
 48 1.47                            1
 49 1.49                            1
 50 1.4x                            1
 51 1.60                            1
 52 1.66                            1
 53 1.78                            1
 54 1.9                             1
 55 1.95                            1
 56 10,000,000                      1
 57 10.000                          1
 58 10.75                           1
 59 10.95                           1
 60 100.00                          1
 61 1000000                         1
 62 1000lb                          1
 63 100b                            1
 64 100k                            1
 65 100m                            1
 66 100mph                          1
 67 100yards                        1
 68 102nd                           1
 69 1040                            1
 70 1040a                           1
 71 1040s                           1
 72 1050                            1
 73 105lbs                          1
 74 106min                          1
 75 10am                            1
 76 10lines                         1
 77 10mil                           1
 78 10min                           1
 79 10minutes                       1
 80 10p.m                           1
 81 10star                          1
 82 10x's                           1
 83 10yr                            1
 84 11,2001                         1
 85 11.00                           1
 86 11001001                        1
 87 1100ad                          1
 88 1146                            1
 89 11f                             1
 90 11m                             1
 91 12.000.000                      1
 92 120,000.00                      1
 93 1200f                           1
 94 1201                            1
 95 1202                            1
 96 123,000,000                     1
 97 12383499143743701               1
 98 125,000                         1
 99 125m                            1
100 127                             1
101 12hr                            1
102 12mm                            1
103 12s                             1
104 13,15,16                        1
105 13.00                           1
106 1300                            1
107 1318                            1
108 135m                            1
109 137                             1
110 139                             1
111 13k                             1
112 14.00                           1
113 14.99                           1
114 140hp                           1
115 1415                            1
116 142                             1
117 1454                            1
118 1473                            1
119 1492                            1
120 14ieme                          1
121 14yr                            1
122 15,000,000                      1
123 15.00                           1
124 150,000                         1
125 1500.00                         1
126 150_worst_cases_of_nepotism     1
127 150k                            1
128 150m                            1
129 151                             1
130 152                             1
131 153                             1
132 1547                            1
133 155                             1
134 156                             1
135 1561                            1
136 1594                            1
137 15mins                          1
138 15minutes                       1
139 16,000                          1
140 16.9                            1
141 16.97                           1
142 1600s                           1
143 160lbs                          1
144 1610                            1
145 163,000                         1
146 164                             1
147 165                             1
148 166                             1
149 1660s                           1
150 16ieme                          1
151 16k                             1
152 16x9                            1
153 16éme                           1
154 17,000                          1
155 17,2003                         1
156 17.75                           1
157 1700s                           1
158 1701                            1
159 171                             1
160 175                             1
161 177                             1
162 1775                            1
163 1790s                           1
164 1794                            1
165 17million                       1
166 18,000,000                      1
167 1800mph                         1
168 1801                            1
169 1805                            1
170 1809                            1
171 180d                            1
172 1812                            1
173 18137                           1
174 1814                            1
175 1832                            1
176 1838                            1
177 1844                            1
178 1850ies                         1
179 1852                            1
180 1860s                           1
181 1870                            1
182 1871                            1
183 1874                            1
184 1875                            1
185 188                             1
186 1887                            1
187 1889                            1
188 188o                            1
189 1893                            1
190 18year                          1
191 19,000,000                      1
192 190                             1
193 1904                            1
194 1908                            1
195 192                             1
196 1920ies                         1
197 1923                            1
198 1930ies                         1
199 193o's                          1
200 194                             1
# ℹ 85,374 more rows

What is the deal with _*_?

tokens |> 
  filter(word == "_anything_") |> 
  slice(1) |> 
  pull(text)
[1] "I am shocked. Shocked and dismayed that the 428 of you IMDB users who voted before me have not given this film a rating of higher than 7. 7?!?? - that's a C!. If I could give FOBH a 20, I'd gladly do it. This film ranks high atop the pantheon of modern comedy, alongside Half Baked and Mallrats, as one of the most hilarious films of all time. If you know _anything_ about rap music - YOU MUST SEE THIS!! If you know nothing about rap music - learn something!, and then see this! Comparisons to 'Spinal Tap' fail to appreciate the inspired genius of this unique film. If you liked Bob Roberts, you'll love this. Watch it and vote it a 10!"
tokens |> 
  filter(word == "_love_") |> 
  slice(1) |> 
  pull(text)
[1] "Before I start, I _love_ Eddie Izzard. I think he's one of the funniest stand-ups around today. Possibly that means I'm going into this with too high expectations, but I just didn't find Eddie funny in this outing. I think the main problem is Eddie is trying too hard to be Eddie. Everyone knows him as a completely irrelevant comic, and we all love him for it. But in Circle, he appears to be going more for irrelevant than funny, and completely lost me in places. Many of the topics he covers he has covered before - I even think I recognised a few recycled jokes in there. If you buy the DVD you'll find a behind-the-scenes look at Eddie's tour (interesting in places, but not very funny), and a French language version of one of his shows. Die-hards will enjoy seeing Eddie in a different language, but subtitled comedy isn't very funny. If you're a fan of Eddie you've either got this already or you're going to buy it whatever I say. If you're just passing through, buy Glorious or Dressed to Kill - you won't be disappointed. With Circle, you probably will."

Let’s find all the tokens that start or end with _

A few of these are even repeatedly used

tokens |> 
  filter(str_detect(word, "^_") | str_detect(word, "_$")) |> 
  count(word, sort = TRUE) |>
  print(n = Inf)
# A tibble: 130 × 2
    word                                                                   n
    <chr>                                                              <int>
  1 _the                                                                   6
  2 _a                                                                     5
  3 thing_                                                                 4
  4 ____                                                                   3
  5 _atlantis_                                                             3
  6 _is_                                                                   3
  7 ______                                                                 2
  8 _____________________________________                                  2
  9 _bounce_                                                               2
 10 _night                                                                 2
 11 _not_                                                                  2
 12 _plan                                                                  2
 13 _real_                                                                 2
 14 _waterdance_                                                           2
 15 story_                                                                 2
 16 9_                                                                     1
 17 _____                                                                  1
 18 _________                                                              1
 19 ____________________________________                                   1
 20 __________________________________________________________________     1
 21 _absolute                                                              1
 22 _am_                                                                   1
 23 _and_                                                                  1
 24 _angel_                                                                1
 25 _annie_                                                                1
 26 _any_                                                                  1
 27 _anything_                                                             1
 28 _apocalyptically                                                       1
 29 _as                                                                    1
 30 _atlantis                                                              1
 31 _attack                                                                1
 32 _before_                                                               1
 33 _blair                                                                 1
 34 _both_                                                                 1
 35 _by                                                                    1
 36 _can't_                                                                1
 37 _cannon_                                                               1
 38 _certainly_                                                            1
 39 _could                                                                 1
 40 _cruel                                                                 1
 41 _dirty                                                                 1
 42 _discuss_                                                              1
 43 _discussing_                                                           1
 44 _do_                                                                   1
 45 _dr                                                                    1
 46 _dying                                                                 1
 47 _earned_                                                               1
 48 _everything_                                                           1
 49 _ex_executives                                                         1
 50 _extremeley_                                                           1
 51 _extremely_                                                            1
 52 _film_                                                                 1
 53 _get_                                                                  1
 54 _have_                                                                 1
 55 _i_                                                                    1
 56 _innerly                                                               1
 57 _inside_                                                               1
 58 _inspire_                                                              1
 59 _les                                                                   1
 60 _love_                                                                 1
 61 _magic_                                                                1
 62 _much_                                                                 1
 63 _mystery                                                               1
 64 _napolean                                                              1
 65 _new                                                                   1
 66 _obviously_                                                            1
 67 _other_                                                                1
 68 _penetrate_                                                            1
 69 _possible_                                                             1
 70 _really_is_                                                            1
 71 _shall                                                                 1
 72 _shock                                                                 1
 73 _so_                                                                   1
 74 _so_much_                                                              1
 75 _somewhere_                                                            1
 76 _spiritited                                                            1
 77 _starstruck_                                                           1
 78 _strictly                                                              1
 79 _sung_                                                                 1
 80 _the_                                                                  1
 81 _the_lost_empire_                                                      1
 82 _there's_                                                              1
 83 _they_                                                                 1
 84 _think_                                                                1
 85 _told_                                                                 1
 86 _toy                                                                   1
 87 _tried                                                                 1
 88 _twice                                                                 1
 89 _undertow_                                                             1
 90 _very_                                                                 1
 91 _voice_                                                                1
 92 _want_                                                                 1
 93 _we've                                                                 1
 94 _well_                                                                 1
 95 _whale_                                                                1
 96 _wrong_                                                                1
 97 _x                                                                     1
 98 acteurs_                                                               1
 99 apple_                                                                 1
100 away_                                                                  1
101 ballroom_                                                              1
102 been_                                                                  1
103 beginners_                                                             1
104 brail_                                                                 1
105 casablanca_                                                            1
106 composer_                                                              1
107 dancing_                                                               1
108 dougray_scott_                                                         1
109 dozen_                                                                 1
110 dynamite_                                                              1
111 eaters_                                                                1
112 eyre_                                                                  1
113 f___                                                                   1
114 film_                                                                  1
115 hard_                                                                  1
116 men_                                                                   1
117 night_                                                                 1
118 opera_                                                                 1
119 rehearsals_                                                            1
120 shrews_                                                                1
121 space_                                                                 1
122 starts__                                                               1
123 that_                                                                  1
124 treatment_                                                             1
125 watchmen_                                                              1
126 what_the_bleep_                                                        1
127 witch_                                                                 1
128 words_                                                                 1
129 you_                                                                   1
130 zhivago_                                                               1

Now we can clean the raw documents again. The chunk below works, but there is probably a better regex (e.g., one anchored with ^_ and _$).

data_trn <- data_trn |> 
  mutate(text = str_replace_all(text, " _", " "),      # strip an underscore that follows a space
         text = str_replace_all(text, " _", " "),      # repeated to catch a second underscore (e.g., " __")
         text = str_replace_all(text, "^_", ""),       # strip an underscore at the start of the document
         text = str_replace_all(text, "_\\.", "\\."),  # strip an underscore before a period
         text = str_replace_all(text, "\\(_", "\\("),  # strip an underscore after an open parenthesis
         text = str_replace_all(text, ":_", ": "),     # strip an underscore after a colon
         text = str_replace_all(text, "_{3,}", " "))   # collapse runs of 3+ underscores into a space

Let’s take another look at uncommon tokens

data_trn |> 
  unnest_tokens(word, text, drop = FALSE) |> 
  count(word) |> 
  arrange(n) |> 
  print(n = 200)
# A tibble: 85,535 × 2
    word                            n
    <chr>                       <int>
  1 0.10                            1
  2 0.48                            1
  3 0.7                             1
  4 0.79                            1
  5 0.89                            1
  6 00.01                           1
  7 000                             1
  8 0000000000001                   1
  9 00015                           1
 10 003830                          1
 11 006                             1
 12 0079                            1
 13 0093638                         1
 14 01pm                            1
 15 020410                          1
 16 029                             1
 17 041                             1
 18 050                             1
 19 06th                            1
 20 087                             1
 21 089                             1
 22 08th                            1
 23 0f                              1
 24 0ne                             1
 25 0r                              1
 26 0s                              1
 27 1'40                            1
 28 1,000,000,000,000               1
 29 1,000.00                        1
 30 1,000s                          1
 31 1,2,3                           1
 32 1,2,3,4,5                       1
 33 1,2,3,5                         1
 34 1,300                           1
 35 1,400                           1
 36 1,430                           1
 37 1,500,000                       1
 38 1,600                           1
 39 1,65m                           1
 40 1,700                           1
 41 1,999,999                       1
 42 1.000                           1
 43 1.19                            1
 44 1.30                            1
 45 1.30am                          1
 46 1.3516                          1
 47 1.37                            1
 48 1.47                            1
 49 1.49                            1
 50 1.4x                            1
 51 1.60                            1
 52 1.66                            1
 53 1.78                            1
 54 1.9                             1
 55 1.95                            1
 56 10,000,000                      1
 57 10.000                          1
 58 10.75                           1
 59 10.95                           1
 60 100.00                          1
 61 1000000                         1
 62 1000lb                          1
 63 100b                            1
 64 100k                            1
 65 100m                            1
 66 100mph                          1
 67 100yards                        1
 68 102nd                           1
 69 1040                            1
 70 1040a                           1
 71 1040s                           1
 72 1050                            1
 73 105lbs                          1
 74 106min                          1
 75 10am                            1
 76 10lines                         1
 77 10mil                           1
 78 10min                           1
 79 10minutes                       1
 80 10p.m                           1
 81 10star                          1
 82 10x's                           1
 83 10yr                            1
 84 11,2001                         1
 85 11.00                           1
 86 11001001                        1
 87 1100ad                          1
 88 1146                            1
 89 11f                             1
 90 11m                             1
 91 12.000.000                      1
 92 120,000.00                      1
 93 1200f                           1
 94 1201                            1
 95 1202                            1
 96 123,000,000                     1
 97 12383499143743701               1
 98 125,000                         1
 99 125m                            1
100 127                             1
101 12hr                            1
102 12mm                            1
103 12s                             1
104 13,15,16                        1
105 13.00                           1
106 1300                            1
107 1318                            1
108 135m                            1
109 137                             1
110 139                             1
111 13k                             1
112 14.00                           1
113 14.99                           1
114 140hp                           1
115 1415                            1
116 142                             1
117 1454                            1
118 1473                            1
119 1492                            1
120 14ieme                          1
121 14yr                            1
122 15,000,000                      1
123 15.00                           1
124 150,000                         1
125 1500.00                         1
126 150_worst_cases_of_nepotism     1
127 150k                            1
128 150m                            1
129 151                             1
130 152                             1
131 153                             1
132 1547                            1
133 155                             1
134 156                             1
135 1561                            1
136 1594                            1
137 15mins                          1
138 15minutes                       1
139 16,000                          1
140 16.9                            1
141 16.97                           1
142 1600s                           1
143 160lbs                          1
144 1610                            1
145 163,000                         1
146 164                             1
147 165                             1
148 166                             1
149 1660s                           1
150 16ieme                          1
151 16k                             1
152 16x9                            1
153 16éme                           1
154 17,000                          1
155 17,2003                         1
156 17.75                           1
157 1700s                           1
158 1701                            1
159 171                             1
160 175                             1
161 177                             1
162 1775                            1
163 1790s                           1
164 1794                            1
165 17million                       1
166 18,000,000                      1
167 1800mph                         1
168 1801                            1
169 1805                            1
170 1809                            1
171 180d                            1
172 1812                            1
173 18137                           1
174 1814                            1
175 1832                            1
176 1838                            1
177 1844                            1
178 1850ies                         1
179 1852                            1
180 1860s                           1
181 1870                            1
182 1871                            1
183 1874                            1
184 1875                            1
185 188                             1
186 1887                            1
187 1889                            1
188 188o                            1
189 1893                            1
190 18year                          1
191 19,000,000                      1
192 190                             1
193 1904                            1
194 1908                            1
195 192                             1
196 1920ies                         1
197 1923                            1
198 1930ies                         1
199 193o's                          1
200 194                             1
# ℹ 85,335 more rows

Lots of numbers. They are probably (though not certainly) not that important for our classification problem.


Let’s strip them, at least for demonstration purposes, using strip_numeric = TRUE

This is likely fine for unigrams but wouldn’t be good (or even possible) for bigrams because it would break the word sequence

data_trn |> 
  unnest_tokens(word, text, 
                drop = FALSE,
                strip_numeric = TRUE) |>   
  count(word) |> 
  arrange(n) |> 
  print(n = 200)
# A tibble: 84,392 × 2
    word                            n
    <chr>                       <int>
  1 01pm                            1
  2 06th                            1
  3 08th                            1
  4 0f                              1
  5 0ne                             1
  6 0r                              1
  7 0s                              1
  8 1,000s                          1
  9 1,65m                           1
 10 1.30am                          1
 11 1.4x                            1
 12 1000lb                          1
 13 100b                            1
 14 100k                            1
 15 100m                            1
 16 100mph                          1
 17 100yards                        1
 18 102nd                           1
 19 1040a                           1
 20 1040s                           1
 21 105lbs                          1
 22 106min                          1
 23 10am                            1
 24 10lines                         1
 25 10mil                           1
 26 10min                           1
 27 10minutes                       1
 28 10p.m                           1
 29 10star                          1
 30 10x's                           1
 31 10yr                            1
 32 1100ad                          1
 33 11f                             1
 34 11m                             1
 35 1200f                           1
 36 125m                            1
 37 12hr                            1
 38 12mm                            1
 39 12s                             1
 40 135m                            1
 41 13k                             1
 42 140hp                           1
 43 14ieme                          1
 44 14yr                            1
 45 150_worst_cases_of_nepotism     1
 46 150k                            1
 47 150m                            1
 48 15mins                          1
 49 15minutes                       1
 50 1600s                           1
 51 160lbs                          1
 52 1660s                           1
 53 16ieme                          1
 54 16k                             1
 55 16éme                           1
 56 1700s                           1
 57 1790s                           1
 58 17million                       1
 59 1800mph                         1
 60 180d                            1
 61 1850ies                         1
 62 1860s                           1
 63 188o                            1
 64 18year                          1
 65 1920ies                         1
 66 1930ies                         1
 67 193o's                          1
 68 1949er                          1
 69 1961s                           1
 70 1970ies                         1
 71 1970s.i                         1
 72 197o                            1
 73 1980ies                         1
 74 1982s                           1
 75 1983s                           1
 76 19k                             1
 77 19thc                           1
 78 1and                            1
 79 1h40m                           1
 80 1million                        1
 81 1min                            1
 82 1mln                            1
 83 1o                              1
 84 1ton                            1
 85 1tv.ru                          1
 86 1ç                              1
 87 2.00am                          1
 88 2.5hrs                          1
 89 2000ad                          1
 90 2004s                           1
 91 200ft                           1
 92 200th                           1
 93 20c                             1
 94 20ft                            1
 95 20k                             1
 96 20m                             1
 97 20mins                          1
 98 20minutes                       1
 99 20mn                            1
100 20p                             1
101 20perr                          1
102 20s.what                        1
103 20ties                          1
104 20widow                         1
105 20x                             1
106 20year                          1
107 20yrs                           1
108 225mins                         1
109 22d                             1
110 230lbs                          1
111 230mph                          1
112 23d                             1
113 24m30s                          1
114 24years                         1
115 25million                       1
116 25mins                          1
117 25s                             1
118 25yrs                           1
119 261k                            1
120 2fast                           1
121 2furious                        1
122 2h                              1
123 2hour                           1
124 2hr                             1
125 2in                             1
126 2inch                           1
127 2more                           1
128 300ad                           1
129 300c                            1
130 300lbs                          1
131 300mln                          1
132 30am                            1
133 30ish                           1
134 30k                             1
135 30lbs                           1
136 30s.like                        1
137 30something                     1
138 30ties                          1
139 32lb                            1
140 32nd                            1
141 330am                           1
142 330mins                         1
143 336th                           1
144 33m                             1
145 35c                             1
146 35mins                          1
147 35pm                            1
148 39th                            1
149 3bs                             1
150 3dvd                            1
151 3lbs                            1
152 3m                              1
153 3mins                           1
154 3pm                             1
155 3po's                           1
156 3th                             1
157 3who                            1
158 4.5hrs                          1
159 401k                            1
160 40am                            1
161 40min                           1
162 40mph                           1
163 442nd                           1
164 44c                             1
165 44yrs                           1
166 45am                            1
167 45min                           1
168 45s                             1
169 480m                            1
170 480p                            1
171 4cylinder                       1
172 4d                              1
173 4eva                            1
174 4f                              1
175 4h                              1
176 4hrs                            1
177 4o                              1
178 4pm                             1
179 4w                              1
180 4ward                           1
181 4x                              1
182 5.50usd                         1
183 500db                           1
184 500lbs                          1
185 50c                             1
186 50ft                            1
187 50ies                           1
188 50ish                           1
189 50k                             1
190 50mins                          1
191 51b                             1
192 51st                            1
193 52s                             1
194 53m                             1
195 540i                            1
196 54th                            1
197 57d                             1
198 58th                            1
199 5kph                            1
200 5min                            1
# ℹ 84,192 more rows

The tokenizer didn’t get rid of numbers connected to text

  • How do we want to handle these?
  • Could add a space to make them two words?
  • We will leave them as is

Other issues?

  • Misspellings
  • Repeated letters for emphasis or effect (see the sketch after this list)?
  • Did caps matter (default tokenization was to convert to lowercase)?
  • Domain knowledge is useful here, though multiple model configurations can be considered
  • Strings of words that aren’t meaningful (“Welcome to Facebook”)
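One quick way to explore the repeated-letters issue is sketched below (illustration only): flag tokens in which the same letter appears three or more times in a row, which usually signals emphasis rather than a distinct word.

# Sketch: find tokens with a letter repeated 3+ times in a row (e.g., "sooo")
data_trn |> 
  unnest_tokens(word, text) |> 
  filter(str_detect(word, "([a-z])\\1{2,}")) |> 
  count(word, sort = TRUE)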

In the above workflow, we:

  1. tokenize
  2. review tokens for issues
  3. clean raw text documents
  4. re-tokenize



In some instances, it may be easier to clean the tokens and then put them back together

  1. tokenize
  2. review tokens for issues
  3. clean tokens
  4. recreate document
  5. tokenize

If this latter workflow feels easier (i.e., it is easier to write a regex against a token than against a whole document), we will need code to put the tokens back together into a document

Here is an example using the first three documents (slice(1:3)) and no cleaning

  1. Tokenize
tokens <- 
  data_trn |>
  slice(1:3) |> 
  unnest_tokens(word, text, 
                drop = FALSE,
                strip_numeric = TRUE)
  2. Now we can do further cleaning with tokens

INSERT CLEANING CHUNKS HERE
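For illustration only (the notes leave this step as a placeholder), a hypothetical cleaning step at the token level might look like the chunk below, which simply strips stray leading/trailing underscores from each token:

# Hypothetical token-level cleaning (illustration only): strip leading/trailing underscores
tokens <- tokens |> 
  mutate(word = str_replace_all(word, "^_+|_+$", ""))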

  3. Then put back together. We could also name the new column text_cln to retain the original text column
data_trn_cln <-
  tokens |>  
  group_by(doc_num, sentiment) |> 
  summarize(text = str_c(word, collapse = " "),
            .groups = "drop") 

Let’s see what we have

  • Now all lower case
  • No numbers, punctuation, etc.
  • Plus any other cleaning we could have done with the tokens along the way
data_trn_cln |> 
  pull(text) |>
  print_kbl() 
x
story of a man who has unnatural feelings for a pig starts out with a opening scene that is a terrific example of absurd comedy a formal orchestra audience is turned into an insane violent mob by the crazy chantings of it's singers unfortunately it stays absurd the whole time with no general narrative eventually making it just too off putting even those from the era should be turned off the cryptic dialogue would make shakespeare seem easy to a third grader on a technical level it's better than you might think with some good cinematography by future great vilmos zsigmond future stars sally kirkland and frederic forrest can be seen briefly
airport starts as a brand new luxury plane is loaded up with valuable paintings such belonging to rich businessman philip stevens james stewart who is flying them a bunch of vip's to his estate in preparation of it being opened to the public as a museum also on board is stevens daughter julie kathleen quinlan her son the luxury jetliner takes off as planned but mid air the plane is hi jacked by the co pilot chambers robert foxworth his two accomplice's banker monte markham wilson michael pataki who knock the passengers crew out with sleeping gas they plan to steal the valuable cargo land on a disused plane strip on an isolated island but while making his descent chambers almost hits an oil rig in the ocean loses control of the plane sending it crashing into the sea where it sinks to the bottom right bang in the middle of the bermuda triangle with air in short supply water leaking in having flown over miles off course the problems mount for the survivor's as they await help with time fast running out also known under the slightly different tile airport this second sequel to the smash hit disaster thriller airport was directed by jerry jameson while once again like it's predecessors i can't say airport is any sort of forgotten classic it is entertaining although not necessarily for the right reasons out of the three airport films i have seen so far i actually liked this one the best just it has my favourite plot of the three with a nice mid air hi jacking then the crashing didn't he see the oil rig sinking of the maybe the makers were trying to cross the original airport with another popular disaster flick of the period the poseidon adventure submerged is where it stays until the end with a stark dilemma facing those trapped inside either suffocate when the air runs out or drown as the floods or if any of the doors are opened it's a decent idea that could have made for a great little disaster flick but bad unsympathetic character's dull dialogue lethargic set pieces a real lack of danger or suspense or tension means this is a missed opportunity while the rather sluggish plot keeps one entertained for odd minutes not that much happens after the plane sinks there's not as much urgency as i thought there should have been even when the navy become involved things don't pick up that much with a few shots of huge ships helicopters flying about but there's just something lacking here george kennedy as the jinxed airline worker joe patroni is back but only gets a couple of scenes barely even says anything preferring to just look worried in the background the home video theatrical version of airport run minutes while the us tv versions add an extra hour of footage including a new opening credits sequence many more scenes with george kennedy as patroni flashbacks to flesh out character's longer rescue scenes the discovery or another couple of dead bodies including the navigator while i would like to see this extra footage i am not sure i could sit through a near three hour cut of airport as expected the film has dated badly with horrible fashions interior design choices i will say no more other than the toy plane model effects aren't great either along with the other two airport sequels this takes pride of place in the razzie award's hall of shame although i can think of lots of worse films than this so i reckon that's a little harsh the action scenes are a little dull unfortunately the pace is slow not much excitement or tension is generated which is a shame as i reckon this could have been a pretty 
good film if made properly the production values are alright if nothing spectacular the acting isn't great two time oscar winner jack lemmon has said since it was a mistake to star in this one time oscar winner james stewart looks old frail also one time oscar winner lee grant looks drunk while sir christopher lee is given little to do there are plenty of other familiar faces to look out for too airport is the most disaster orientated of the three airport films so far i liked the ideas behind it even if they were a bit silly the production bland direction doesn't help though a film about a sunken plane just shouldn't be this boring or lethargic followed by the concorde airport
this film lacked something i couldn't put my finger on at first charisma on the part of the leading actress this inevitably translated to lack of chemistry when she shared the screen with her leading man even the romantic scenes came across as being merely the actors at play it could very well have been the director who miscalculated what he needed from the actors i just don't know but could it have been the screenplay just exactly who was the chef in love with he seemed more enamored of his culinary skills and restaurant and ultimately of himself and his youthful exploits than of anybody or anything else he never convinced me he was in love with the princess i was disappointed in this movie but don't forget it was nominated for an oscar so judge for yourself

12.5 Stop words

Not all words are equally informative or useful to our model depending on the nature of our problem

Very common words often carry little or no meaningful information

These words are called stop words

It is common advice and practice to remove stop words for various NLP tasks

Notice some of the most frequent words among our tokens from the IMDB reviews

data_trn |> 
  unnest_tokens(word, text, 
                drop = FALSE,
                strip_numeric = TRUE) |> 
  count(word, sort = TRUE) |> 
  print(n = 100)
# A tibble: 84,392 × 2
    word         n
    <chr>    <int>
  1 the     336185
  2 and     164061
  3 a       162743
  4 of      145848
  5 to      135695
  6 is      107320
  7 in       93920
  8 it       78874
  9 i        76508
 10 this     75814
 11 that     69794
 12 was      48189
 13 as       46904
 14 for      44321
 15 with     44115
 16 movie    43509
 17 but      42531
 18 film     39058
 19 on       34185
 20 not      30608
 21 you      29886
 22 are      29431
 23 his      29352
 24 have     27725
 25 be       26947
 26 he       26894
 27 one      26502
 28 all      23927
 29 at       23500
 30 by       22539
 31 an       21550
 32 they     21096
 33 who      20604
 34 so       20573
 35 from     20488
 36 like     20268
 37 her      18399
 38 or       17997
 39 just     17764
 40 about    17368
 41 out      17099
 42 it's     17094
 43 has      16789
 44 if       16746
 45 some     15734
 46 there    15671
 47 what     15374
 48 good     15110
 49 more     14242
 50 when     14161
 51 very     14059
 52 up       13283
 53 no       12698
 54 time     12691
 55 even     12638
 56 she      12624
 57 my       12485
 58 would    12236
 59 which    12047
 60 story    11919
 61 only     11910
 62 really   11734
 63 see      11465
 64 their    11376
 65 had      11289
 66 can      11144
 67 were     10782
 68 me       10745
 69 well     10637
 70 than      9920
 71 we        9858
 72 much      9750
 73 bad       9292
 74 been      9287
 75 get       9279
 76 will      9195
 77 do        9159
 78 also      9130
 79 into      9109
 80 people    9107
 81 other     9083
 82 first     9054
 83 because   9045
 84 great     9033
 85 how       8870
 86 him       8865
 87 most      8775
 88 don't     8445
 89 made      8351
 90 its       8156
 91 then      8097
 92 make      8018
 93 way       8005
 94 them      7954
 95 too       7820
 96 could     7746
 97 any       7653
 98 movies    7648
 99 after     7617
100 think     7293
# ℹ 84,292 more rows

Stop words can have different roles in a corpus (a set of documents)

For our purposes, we generally care about two different types of stop words:

  • Global
  • Subject-specific



Global stop words almost always have very little value for our modeling goals

These are frequent words like “the”, “of” and “and” in English.

It is typically pretty safe to remove these and you can find them in pre-made lists of stop words (see below)


Subject-specific stop words are words that are common and uninformative given the subject or context within which your text/documents were collected and your modeling goals.


For example, given our goal to classify movie reviews as positive or negative, subject-specific stop words might include:

  • movie
  • film
  • movies

We would likely see others if we expanded our review of common words a bit more (which we should!)

  • character
  • actor
  • actress
  • director
  • cast
  • scene

These are not general stop words, but they will be common in this dataset and they may be uninformative with respect to our classification goal


Subject-specific stop words may improve performance if you have the domain expertise to create a good list

HOWEVER, you should think carefully about your goals and method. For example, if you are using bigrams rather than single word (unigram) tokens, you might retain words like actor or director because they may be informative in bigrams

  • bad actor
  • great director



Though it might be sufficient to just retain bad and great


The stopwords package contains many lists of stopwords.

  • We can access these lists directly through that package
  • Those lists are also available with get_stopwords() in the tidytext package (my preference)
  • get_stopwords() returns a tibble with two columns (see below)

Two commonly used stop word lists are:

  • snowball (175 words)
stop_snowball <- 
  get_stopwords(source = "snowball") |> 
  print(n = 50)  # review the first 50 words
# A tibble: 175 × 2
   word       lexicon 
   <chr>      <chr>   
 1 i          snowball
 2 me         snowball
 3 my         snowball
 4 myself     snowball
 5 we         snowball
 6 our        snowball
 7 ours       snowball
 8 ourselves  snowball
 9 you        snowball
10 your       snowball
11 yours      snowball
12 yourself   snowball
13 yourselves snowball
14 he         snowball
15 him        snowball
16 his        snowball
17 himself    snowball
18 she        snowball
19 her        snowball
20 hers       snowball
21 herself    snowball
22 it         snowball
23 its        snowball
24 itself     snowball
25 they       snowball
26 them       snowball
27 their      snowball
28 theirs     snowball
29 themselves snowball
30 what       snowball
31 which      snowball
32 who        snowball
33 whom       snowball
34 this       snowball
35 that       snowball
36 these      snowball
37 those      snowball
38 am         snowball
39 is         snowball
40 are        snowball
41 was        snowball
42 were       snowball
43 be         snowball
44 been       snowball
45 being      snowball
46 have       snowball
47 has        snowball
48 had        snowball
49 having     snowball
50 do         snowball
# ℹ 125 more rows

  • smart (571 words)
stop_smart <-
  get_stopwords(source = "smart") |> 
  print(n = 50)  # review the first 50 words
# A tibble: 571 × 2
   word        lexicon
   <chr>       <chr>  
 1 a           smart  
 2 a's         smart  
 3 able        smart  
 4 about       smart  
 5 above       smart  
 6 according   smart  
 7 accordingly smart  
 8 across      smart  
 9 actually    smart  
10 after       smart  
11 afterwards  smart  
12 again       smart  
13 against     smart  
14 ain't       smart  
15 all         smart  
16 allow       smart  
17 allows      smart  
18 almost      smart  
19 alone       smart  
20 along       smart  
21 already     smart  
22 also        smart  
23 although    smart  
24 always      smart  
25 am          smart  
26 among       smart  
27 amongst     smart  
28 an          smart  
29 and         smart  
30 another     smart  
31 any         smart  
32 anybody     smart  
33 anyhow      smart  
34 anyone      smart  
35 anything    smart  
36 anyway      smart  
37 anyways     smart  
38 anywhere    smart  
39 apart       smart  
40 appear      smart  
41 appreciate  smart  
42 appropriate smart  
43 are         smart  
44 aren't      smart  
45 around      smart  
46 as          smart  
47 aside       smart  
48 ask         smart  
49 asking      smart  
50 associated  smart  
# ℹ 521 more rows

smart is mostly a super-set of snowball except for these words which are only in snowball

Stop word lists aren’t perfect. Why does smart contain he's but not she's?

setdiff(pull(stop_snowball, word),
        pull(stop_smart, word))
 [1] "she's"   "he'd"    "she'd"   "he'll"   "she'll"  "shan't"  "mustn't"
 [8] "when's"  "why's"   "how's"  

It is common and appropriate to start with a pre-made word list or set of lists and combine, add, and/or remove words based on your specific needs

  • You can add global words that you feel are missed
  • You can add subject-specific words (see the sketch below)
  • You can remove global words that might be relevant to your problem
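For example, here is a sketch of such customization (the specific additions and removals are hypothetical, not a recommended list):

# Sketch: customize a pre-made stop word list (the specific choices are hypothetical)
custom_stops <- get_stopwords(source = "snowball") |> 
  pull(word) |> 
  union(c("movie", "movies", "film", "films")) |>  # add subject-specific words
  setdiff(c("not", "no"))                          # keep negations that may carry sentiment signal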

In the service of simplicity, we will use the union of the two previous pre-made global lists without any additional subject specific lists

all_stops <- union(pull(stop_snowball, word), pull(stop_smart, word))

We can remove stop words as part of tokenization using stopwords = all_stops

Let’s see our top 100 tokens now

data_trn |> 
  unnest_tokens(word, text, 
                drop = FALSE,
                strip_numeric = TRUE,
                stopwords = all_stops) |> 
  count(word) |> 
  arrange(desc(n)) |> 
  print(n = 100)
# A tibble: 83,822 × 2
    word            n
    <chr>       <int>
  1 movie       43509
  2 film        39058
  3 good        15110
  4 time        12691
  5 story       11919
  6 bad          9292
  7 people       9107
  8 great        9033
  9 made         8351
 10 make         8018
 11 movies       7648
 12 characters   7142
 13 watch        6959
 14 films        6881
 15 character    6701
 16 plot         6563
 17 life         6560
 18 acting       6482
 19 love         6421
 20 show         6171
 21 end          5640
 22 man          5630
 23 scene        5356
 24 scenes       5206
 25 back         4965
 26 real         4734
 27 watching     4597
 28 years        4508
 29 thing        4498
 30 actors       4476
 31 work         4368
 32 funny        4278
 33 makes        4204
 34 director     4184
 35 find         4129
 36 part         4020
 37 lot          3965
 38 cast         3816
 39 world        3698
 40 things       3685
 41 pretty       3663
 42 young        3634
 43 horror       3578
 44 fact         3521
 45 big          3471
 46 long         3441
 47 thought      3434
 48 series       3410
 49 give         3374
 50 original     3358
 51 action       3351
 52 comedy       3230
 53 times        3223
 54 point        3218
 55 role         3175
 56 interesting  3125
 57 family       3109
 58 bit          3052
 59 music        3045
 60 script       3007
 61 guy          2962
 62 making       2960
 63 feel         2947
 64 minutes      2944
 65 performance  2887
 66 kind         2780
 67 girl         2739
 68 tv           2732
 69 worst        2730
 70 day          2711
 71 fun          2690
 72 hard         2666
 73 woman        2651
 74 played       2586
 75 found        2571
 76 screen       2474
 77 set          2452
 78 place        2403
 79 book         2394
 80 put          2379
 81 ending       2351
 82 money        2351
 83 true         2329
 84 sense        2320
 85 reason       2316
 86 actor        2312
 87 shows        2304
 88 dvd          2282
 89 worth        2274
 90 job          2270
 91 year         2268
 92 main         2264
 93 watched      2235
 94 play         2222
 95 american     2217
 96 plays        2214
 97 effects      2196
 98 takes        2192
 99 beautiful    2176
100 house        2171
# ℹ 83,722 more rows

What if we were doing bigrams instead?

  • token = “ngrams”
  • n = 2,
  • n_min = 2
  • NOTE: can’t strip numeric (would break the sequence of words)
  • NOTE: There are more bigrams than unigrams (more features for BoW!)
data_trn |> 
  unnest_tokens(word, text, 
                drop = FALSE,
                token = "ngrams",
                stopwords = all_stops, 
                n = 2,
                n_min = 2) |> 
  count(word) |> 
  arrange(desc(n)) |> 
  print(n = 100)
# A tibble: 1,616,477 × 2
    word                      n
    <chr>                 <int>
  1 special effects        1110
  2 low budget              881
  3 waste time              793
  4 good movie              785
  5 watch movie             695
  6 movie made              693
  7 sci fi                  647
  8 years ago               631
  9 real life               617
 10 film made               588
 11 movie good              588
 12 movie movie             561
 13 pretty good             557
 14 bad movie               552
 15 high school             545
 16 watching movie          534
 17 movie bad               512
 18 main character          509
 19 good film               487
 20 great movie             473
 21 horror movie            468
 22 horror film             454
 23 long time               448
 24 make movie              445
 25 film making             417
 26 film good               412
 27 worth watching          406
 28 10 10                   404
 29 movie great             388
 30 bad acting              387
 31 worst movie             386
 32 black white             385
 33 main characters         381
 34 end movie               380
 35 film film               368
 36 takes place             360
 37 great film              358
 38 camera work             356
 39 make sense              348
 40 good job                347
 41 story line              347
 42 watch film              344
 43 movie watch             343
 44 character development   341
 45 supporting cast         338
 46 1 2                     334
 47 love story              334
 48 read book               332
 49 bad guys                327
 50 end film                320
 51 8 10                    318
 52 horror movies           318
 53 make film               317
 54 made movie              315
 55 good thing              307
 56 7 10                    305
 57 world war               289
 58 bad film                288
 59 horror films            285
 60 watched movie           285
 61 thing movie             283
 62 1 10                    280
 63 part movie              278
 64 watching film           277
 65 bad guy                 274
 66 4 10                    273
 67 made film               272
 68 rest cast               268
 69 tv series               268
 70 writer director         267
 71 time movie              266
 72 half hour               265
 73 production values       265
 74 film great              262
 75 highly recommend        261
 76 makes sense             256
 77 martial arts            256
 78 love movie              255
 79 science fiction         255
 80 acting bad              254
 81 tv movie                253
 82 recommend movie         251
 83 3 10                    246
 84 entire movie            245
 85 film makers             245
 86 9 10                    242
 87 movie time              242
 88 fun watch               238
 89 kung fu                 238
 90 film bad                235
 91 good acting             235
 92 true story              233
 93 movie make              232
 94 movies made             232
 95 point view              232
 96 film festival           231
 97 great job               227
 98 young woman             227
 99 good story              225
100 star wars               224
# ℹ 1,616,377 more rows



Looks like we are starting to get some signal


12.6 Stemming

Documents often contain different versions of one base word

We refer to the common base as the stem

Often, we may want to treat the different versions of a base word as the same token. This can reduce the total number of tokens that we need to use as features later, which can lead to a better-performing model

For example, do we need to distinguish between movie vs. movies or actor vs. actors or should we collapse those pairs into a single token?


There are many different algorithms that can stem words for us (i.e., collapse multiple versions into the same base). However, we will focus on only one here as an introduction to the concept and approach for stemming

This is the Porter method; a current implementation is available via wordStem() in the SnowballC package


The goal of stemming is to reduce the dimensionality (size) of our vocabulary

Whenever we can combine words that “belong” together with respect to our goal, we may improve the performance of our model

However, stemming is hard and it will also invariably combine words that shouldn’t be combined

Stemming is useful when it succeeds more than it fails, or when it succeeds more with important words/tokens


Here are examples of when it helps to reduce our vocabulary given our task

wordStem(c("movie", "movies"))
[1] "movi" "movi"
wordStem(c("actor", "actors", "actress", "actresses"))
[1] "actor"   "actor"   "actress" "actress"
wordStem(c("wait", "waits", "waiting", "waited"))
[1] "wait" "wait" "wait" "wait"
wordStem(c("play", "plays", "played", "playing"))
[1] "plai" "plai" "plai" "plai"

Sometimes it works partially, likely with still some benefit

wordStem(c("go", "gone", "going"))
[1] "go"   "gone" "go"  
wordStem(c("send", "sending", "sent"))
[1] "send" "send" "sent"
wordStem(c("fishing", "fished", "fisher"))
[1] "fish"   "fish"   "fisher"

But it clearly makes salient errors too

wordStem(c("university", "universal", "universe"))
[1] "univers" "univers" "univers"
wordStem(c("is", "are", "was"))
[1] "i"  "ar" "wa"
wordStem(c("he", "his", "him"))
[1] "he"  "hi"  "him"
wordStem(c("like", "liking", "likely"))
[1] "like" "like" "like"
wordStem(c("mean", "meaning"))
[1] "mean" "mean"

Of course, the errors are more important if they are with words that contain predictive signal

Therefore, we should look at how it works with our text

To stem our tokens if we only care about unigrams:

  • We first tokenize as before (including removal of stop words)
  • Then we stem, for now using wordStem()
  • We are mutating the stemmed words into a new column, stem, so we can compare its effect
tokens <- 
  data_trn |> 
  unnest_tokens(word, text, 
                drop = FALSE,
                strip_numeric = TRUE,
                stopwords = all_stops) |> 
  mutate(stem = wordStem(word)) |> 
  select(word, stem)

Let’s compare vocabulary size  
Stemming produced a sizable reduction in vocabulary size

length(unique(tokens$word))
[1] 83822
length(unique(tokens$stem))
[1] 60813

Let’s compare frequencies of the top 100 words vs. stems

  • Here are the top 100 normal (not stemmed) words alongside the top 100 stems, with their frequencies
word_tokens <-
  tokens |> 
  tab(word) |> 
  arrange(desc(n)) |> 
  slice(1:100) |> 
  select(word, n_word = n)

stem_tokens <-
  tokens |> 
  tab(stem) |> 
  arrange(desc(n)) |> 
  slice(1:100) |> 
  select(stem, n_stem = n)

word_tokens |> 
  bind_cols(stem_tokens)
# A tibble: 100 × 4
   word   n_word stem    n_stem
   <chr>   <int> <chr>    <int>
 1 movie   43509 movi     51159
 2 film    39058 film     47096
 3 good    15110 time     16146
 4 time    12691 good     15327
 5 story   11919 make     15203
 6 bad      9292 watch    13920
 7 people   9107 charact  13844
 8 great    9033 stori    13104
 9 made     8351 scene    10563
10 make     8018 show      9750
# ℹ 90 more rows

Stemming is often routinely used as part of an NLP pipeline.

  • Often without thought
  • We should consider carefully if it will help or hurt
  • We should likely formally evaluate it (via model configurations); stemming is sometimes presented without much comment about when it is helpful or not, but it is a pre-processing step that must be thought through and chosen (or not) with good judgment

In this example, we focused on unigrams.


If we had wanted bigrams, we would have needed a different order of steps (see the sketch after this list)

  • Extract words
  • Stem
  • Put back together
  • Extract bigrams
  • Remove stop words
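A minimal sketch of that order of operations is below (illustration only). One assumption worth flagging: once words are stemmed they no longer match an unstemmed stop word list (e.g., “this” becomes “thi”), so this sketch also stems the stop words before removing them.

# Sketch: stem words first, rebuild the documents, then extract bigrams
stemmed_docs <- data_trn |> 
  unnest_tokens(word, text, drop = FALSE) |>   # 1. extract words
  mutate(stem = wordStem(word)) |>             # 2. stem
  group_by(doc_num, sentiment) |>              # 3. put back together
  summarize(text = str_c(stem, collapse = " "), .groups = "drop")

stemmed_docs |> 
  unnest_tokens(bigram, text,                  # 4. extract bigrams
                token = "ngrams", n = 2, n_min = 2,
                stopwords = wordStem(all_stops))  # 5. remove (stemmed) stop words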



Think carefully about what you are doing and what your goals are!

You can read more about stemming and a related (more complicated but possibly more precise) procedure called lemmatization in a chapter from Hvitfeldt and Silge (2022)


12.7 Bag of Words

Now that we understand how to tokenize our documents, we can begin to consider how to feature engineer using these tokens

The Bag-of-words (BoW) method

  • Provides one way of representing tokens within text data when modeling text with machine learning algorithms.
  • Is simple to understand and implement
  • Works well for problems such as document classification

BoW is a representation of text that describes the occurrence of words within a document. It involves two things:

  • A vocabulary of known “words” (I put words in quotes because our tokens will sometimes be something other than a word)
  • A measure of the occurrence or frequency of these known words.



It is called a “bag” of words because information about the order or structure of words in the document is discarded. BoW is only concerned with occurrence or frequency of known words in the document, not where in the document they occur.

BoW assumes that documents that contain the same content are similar and that we can learn something about the document by its content alone.


BoW approaches vary on two primary characteristics:

  • What the token type is
    • Word is most common
    • Also common to use bigrams, or combinations of unigrams (words) and bigrams
    • Other options exist (e.g., trigrams)
  • How the occurrence or frequency of the word/token is measured
    • Binary (presence or absence)
    • Raw count
    • Term frequency
    • Term frequency - inverse document frequency (tf-idf)
    • and other less common options

Let’s start with a very simple example

  • Two documents
  • Tokenization to words
  • Lowercase, strip punctuation, did not remove any stopwords, no stemming or lemmatization
  • Binary measurement (1 = yes, 0 = no)
Document                                    i   loved   that   movie   am   so   happy   was   not   good
I loved that movie! I am so so so happy.    1   1       1      1       1    1    1       0     0     0
That movie was not good.                    0   0       1      1       0    0    0       1     1     1

This matrix is referred to as a Document-Term Matrix (DTM)

  • Rows are documents
  • Columns are terms (tokens)
  • Combination of all terms/tokens is called the vocabulary
  • The term columns (or a subset) will be used as features in our statistical algorithm to predict some outcome (not displayed); a hand-built sketch of such a matrix follows
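As a sketch, a binary document-term matrix like the one above could be built by hand from tokens (the toy tibble below is purely for illustration; later we will let a recipe do this work):

# Sketch: build a binary document-term matrix by hand for two toy documents
toy <- tibble(doc = 1:2,
              text = c("I loved that movie! I am so so so happy.",
                       "That movie was not good."))

toy |> 
  unnest_tokens(word, text) |> 
  distinct(doc, word) |>   # presence/absence (binary)
  mutate(present = 1) |> 
  pivot_wider(names_from = word, values_from = present, values_fill = 0)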

You will also see the use of raw counts for measurement of the cell value for each term

Document                                    i   loved   that   movie   am   so   happy   was   not   good
I loved that movie! I am so so so happy.    2   1       1      1       1    3    1       0     0     0
That movie was not good.                    0   0       1      1       0    0    0       1     1     1

Both binary and raw count measures are biased (increased) for longer documents


The bias based on document length motivates the use of term frequency

  • Term frequency is NOT a simple frequency count (that is raw count)
  • Instead it is the raw count divided by the document length
  • This removes the bias based on document length

tibble::tribble(
  ~Document, ~i, ~loved, ~that, ~movie, ~am, ~so, ~happy, ~was, ~not, ~good,
  "I loved that movie! I am so so so happy.", .2, .1, .1, .1, .1, .3, .1, 0, 0, 0,
  "That movie was not good.", 0, 0, .2, .2, 0, 0, 0, .2, .2, .2) |>
  print_kbl()
Document                                    i     loved   that   movie   am    so    happy   was   not   good
I loved that movie! I am so so so happy.    0.2   0.1     0.1    0.1     0.1   0.3   0.1     0.0   0.0   0.0
That movie was not good.                    0.0   0.0     0.2    0.2     0.0   0.0   0.0     0.2   0.2   0.2

Term frequency can be dominated by frequently occurring words that may not be as important to understand the document as are rarer but more domain specific words

This was the motivation for removing stopwords but stopword removal may not be sufficient

Term Frequency - Inverse Document Frequency (tf-idf) was developed to address this issue

TF-IDF scales the term frequency by the inverse document frequency

  • Term frequency tells us how frequently a word is used in the current document
  • IDF indexes how rare the word is across documents (higher IDF == more rare)



This emphasizes words used in specific documents that are not commonly used otherwise


\(IDF = log(\frac{total\:number\:of\:documents}{documents\:containing\:the\:word})\)

This results in larger values for words that aren’t used in many documents, e.g. (using a base-10 log here):

  • for a word that appears in only 1 of 1000 documents, idf = log(1000/1) = 3
  • for a word that appears in 1000 of 1000 documents, idf = log(1000/1000) = 0



Note that a word that appears in no documents would result in division by zero. Therefore, it is common to add 1 to the denominator of the IDF.
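As a sketch, the tidytext package can compute tf, idf, and tf-idf directly from a token count table. The chunk below runs it on our training data purely for illustration (note that bind_tf_idf() computes idf with a natural log rather than the base-10 log used in the example above):

# Sketch: per-document tf, idf, and tf-idf with tidytext
data_trn |> 
  unnest_tokens(word, text) |> 
  count(doc_num, word) |> 
  bind_tf_idf(word, doc_num, n) |> 
  arrange(desc(tf_idf))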


We should be aware of some of the limitations of BoW:

  • Vocabulary: The vocabulary requires careful thought
    • Choice of stop words (global and subject specific) can matter
    • Choice about size (number of tokens) can matter
  • Sparsity:
    • Sparse representations (features that contain mostly zeros) are hard to model for computational reasons (space and time)
    • Sparse representations present a challenge to extract signal in a large representational space
  • Meaning: Discarding word order ignores the context and, in turn, the meaning of words in the document (semantics); see the sketch after this list
    • Context and meaning can offer a lot to the model
    • “this is interesting” vs “is this interesting”
    • “old bike” vs “used bike”

12.8 Bringing it all together

12.8.1 General

We will now explore a series of model configurations to predict the sentiment (positive vs negative) of IMDB.com reviews

  • To keep things simple, all model configurations will use only one statistical algorithm - glmnet
    • Within the algorithm, configurations will differ by penalty (dials::penalty) and mixture (dials::mixture) - see grid below
  • We will consider BoW features derived by TF-IDF only
  • We will remove global stop words in some configurations
  • We will stem words in some configurations
  • These BoW configurations will also differ by token type
    • Word/unigram
    • A combination of uni- and bigrams
    • Prior to casting the tf-idf document-term matrix, we will filter tokens to various sizes - see grid below
  • We will also train a word embedding model to demonstrate how to implement that feature engineering technique in tidymodels

We are applying these feature engineering steps blindly. YOU should not.

You will want to explore the impact of your feature engineering either

  • During EDA with unnest_tokens()
  • After making your recipe by using it to make a feature matrix
  • Likely some of both

Let’s start fresh with our training data

data_trn <- read_csv(here::here(path_data, "imdb_trn.csv"),
                     show_col_types = FALSE) |> 
  rowid_to_column(var = "doc_num") |> 
  mutate(sentiment = factor(sentiment, levels = c("neg", "pos"))) |> 
  glimpse()
Rows: 25,000
Columns: 3
$ doc_num   <int> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 1…
$ sentiment <fct> neg, neg, neg, neg, neg, neg, neg, neg, neg, neg, neg, neg, …
$ text      <chr> "Story of a man who has unnatural feelings for a pig. Starts…

And do our (very minimal) cleaning

data_trn <- data_trn |> 
  mutate(text = str_replace_all(text, "<br /><br />", " "),
         text = str_replace_all(text, " _", " "),
         text = str_replace_all(text, " _", " "),
         text = str_replace_all(text, "^_", ""),
         text = str_replace_all(text, "_\\.", "\\."),
         text = str_replace_all(text, "\\(_", "\\("),
         text = str_replace_all(text, ":_", ": "),
         text = str_replace_all(text, "_{3,}", " "))

We will select among model configurations using accuracy from a single validation split resample to ease computational costs

set.seed(123456)
splits <- data_trn |> 
  validation_split(strata = sentiment)

We will use a simple union of stop words for some configurations:

  • It could help to edit the global list (or use a smaller list like snowball only)
  • It could help to have subject specific stop words too AND they might need to be different for words vs. bigrams
all_stops <- union(pull(get_stopwords(source = "smart"), word), pull(get_stopwords(source = "snowball"), word))

All of our model configurations will be tuned on penalty, mixture and max_tokens

grid_tokens <- grid_regular(penalty(range = c(-7, 3)), 
                            mixture(), 
                            max_tokens(range = c(5000, 10000)), 
                            levels = c(20, 6, 2))
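
A quick sanity check on the grid: grid_regular() crosses all values, so we get 20 penalty values × 6 mixture values × 2 max_tokens values:

nrow(grid_tokens)                  # 240 candidate configurations
distinct(grid_tokens, max_tokens)  # 5000 and 10000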

12.8.2 Words (unigrams)

We will start by fitting a BoW model configuration for word tokens

Recipe for Word Tokens. NOTE:

  • token = "words" (default)
  • max_tokens = tune() - We are now set to tune recipes!
rec_word <-  recipe(sentiment ~ text, data = data_trn) |>
  step_tokenize(text, 
                engine = "tokenizers", token = "words") |>    
  step_tokenfilter(text, max_tokens = tune::tune()) |>
  step_tfidf(text)  |> 
  step_normalize(all_predictors())

Tuning hyperparameters for Word Tokens

fits_word <- cache_rds(
  expr = {
    logistic_reg(penalty = tune::tune(), 
                 mixture = tune::tune()) |> 
      set_engine("glmnet") |> 
      tune_grid(preprocessor = rec_word, 
                resamples = splits, 
                grid = grid_tokens, 
                metrics = metric_set(accuracy))

},
  rerun = rerun_setting,
  dir = "cache/012/",
  file = "fits_word")

Confirm that the range of hyperparameters we considered was sufficient

autoplot(fits_word)


Display performance of best configuration for Word Tokens.


Wow, pretty good!

show_best(fits_word)
# A tibble: 5 × 9
  penalty mixture max_tokens .metric  .estimator  mean     n std_err
    <dbl>   <dbl>      <int> <chr>    <chr>      <dbl> <int>   <dbl>
1 0.207       0         5000 accuracy binary     0.896     1      NA
2 0.0183      0.2      10000 accuracy binary     0.896     1      NA
3 0.00546     0.4      10000 accuracy binary     0.895     1      NA
4 0.00546     0.6      10000 accuracy binary     0.895     1      NA
5 0.207       0        10000 accuracy binary     0.893     1      NA
  .config               
  <chr>                 
1 Preprocessor1_Model013
2 Preprocessor2_Model031
3 Preprocessor2_Model050
4 Preprocessor2_Model070
5 Preprocessor2_Model013

12.8.3 Words (unigrams) Excluding Stop Words

Lets try a configuration that removes stop words

Recipe for Word Tokens excluding Stop Words. NOTE:

  • options = list(stopwords = all_stops) - Passing in options to tokenizers::tokenize_words()
rec_word_nsw <-  recipe(sentiment ~ text, data = data_trn) |>
  step_tokenize(text, 
                engine = "tokenizers", token = "words",   # included for clarity
                options = list(stopwords = all_stops)) |>
  step_tokenfilter(text, max_tokens = tune::tune()) |>
  step_tfidf(text)  |> 
  step_normalize(all_predictors())

Tuning hyperparameters for Word Tokens excluding Stop Words

fits_word_nsw <- cache_rds(
  expr = {
    logistic_reg(penalty = tune::tune(), 
                 mixture = tune::tune()) |> 
      set_engine("glmnet") |> 
      tune_grid(preprocessor = rec_word_nsw, 
                resamples = splits, 
                grid = grid_tokens, 
                metrics = metric_set(accuracy))

},
  rerun = rerun_setting,
  dir = "cache/012/",
  file = "fits_word_nws")

Confirm that the range of hyperparameters we considered was sufficient

autoplot(fits_word_nsw)


Display performance of best configuration for Word Tokens excluding Stop Words.

  • Looks like our mindless use of a large list of global stop words didn’t help.
  • We might consider the more focused snowball list
  • We might consider subject-specific stop words
  • For this demonstration, we will just move forward without removing stop words (a brief sketch of a more targeted list follows the output below)
show_best(fits_word_nsw)
# A tibble: 5 × 9
  penalty mixture max_tokens .metric  .estimator  mean     n std_err
    <dbl>   <dbl>      <int> <chr>    <chr>      <dbl> <int>   <dbl>
1 0.00546     0.6      10000 accuracy binary     0.880     1      NA
2 0.0183      0.2      10000 accuracy binary     0.879     1      NA
3 0.00546     0.4       5000 accuracy binary     0.879     1      NA
4 0.0183      0.2       5000 accuracy binary     0.878     1      NA
5 0.00162     1         5000 accuracy binary     0.877     1      NA
  .config               
  <chr>                 
1 Preprocessor2_Model070
2 Preprocessor2_Model031
3 Preprocessor1_Model050
4 Preprocessor1_Model031
5 Preprocessor1_Model109
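
If you did want to pursue stop word removal further, a more targeted list might work better. Here is a minimal sketch combining the smaller snowball list with a few subject-specific terms (the movie-related words are hypothetical examples, not a vetted list):

snowball_stops <- pull(get_stopwords(source = "snowball"), word)
movie_stops <- c("movie", "movies", "film", "films")  # hypothetical domain-specific terms
custom_stops <- union(snowball_stops, movie_stops)

# custom_stops could then be passed to step_tokenize() via
# options = list(stopwords = custom_stops)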

12.8.4 Stemmed Words (unigrams)

Now we will try using stemmed words

Recipe for Stemmed Word Tokens. NOTE: step_stem()

rec_stemmed_word <-  recipe(sentiment ~ text, data = data_trn) |>
  step_tokenize(text, 
                engine = "tokenizers", token = "words") |>    # included for clarity
  step_stem(text) |> 
  step_tokenfilter(text, max_tokens = tune()) |>
  step_tfidf(text)  |> 
  step_normalize(all_predictors())

Tuning hyperparameters for Stemmed Word Tokens

fits_stemmed_word <- cache_rds(
  expr = {
    logistic_reg(penalty = tune(), mixture = tune()) |> 
    set_engine("glmnet") |> 
    tune_grid(preprocessor = rec_stemmed_word, 
              resamples = splits, 
              grid = grid_tokens, 
              metrics = metric_set(accuracy))
},
  rerun = rerun_setting, 
  dir = "cache/012/",
  file = "fits_stemmed_word")

Confirm that the range of hyperparameters we considered was sufficient

autoplot(fits_stemmed_word)


Display performance of best configuration for Stemmed Word Tokens

Not much change

show_best(fits_stemmed_word)
# A tibble: 5 × 9
  penalty mixture max_tokens .metric  .estimator  mean     n std_err
    <dbl>   <dbl>      <int> <chr>    <chr>      <dbl> <int>   <dbl>
1 0.207       0        10000 accuracy binary     0.896     1      NA
2 0.695       0        10000 accuracy binary     0.896     1      NA
3 0.0183      0.2      10000 accuracy binary     0.896     1      NA
4 0.00546     0.4      10000 accuracy binary     0.895     1      NA
5 0.00546     0.6      10000 accuracy binary     0.894     1      NA
  .config               
  <chr>                 
1 Preprocessor2_Model013
2 Preprocessor2_Model014
3 Preprocessor2_Model031
4 Preprocessor2_Model050
5 Preprocessor2_Model070

12.8.5 ngrams (unigrams and bigrams)

Now we try both unigrams and bigrams

Recipe for unigrams and bigrams. NOTES:

  • token = "ngrams"
  • options = list(n = 2, n_min = 1) - includes unigrams (n_min = 1) and bigrams (n = 2)
  • not stemmed (stemming doesn’t work by default for bigrams)
rec_ngrams <-  recipe(sentiment ~ text, data = data_trn) |>
  step_tokenize(text, 
                engine = "tokenizers", token = "ngrams",
                options = list(n = 2, n_min = 1)) |>
  step_tokenfilter(text, max_tokens =  tune::tune()) |>
  step_tfidf(text) |> 
  step_normalize(all_predictors())

Tuning hyperparameters for ngrams

fits_ngrams <- cache_rds(
  expr = {
    logistic_reg(penalty = tune::tune(), mixture = tune::tune()) |> 
      set_engine("glmnet") |> 
      tune_grid(preprocessor = rec_ngrams, 
                resamples = splits, 
                grid = grid_tokens, 
                metrics = metric_set(yardstick::accuracy))

},
  rerun = rerun_setting,
  dir = "cache/012/",
  file = "fits_ngrams")

Confirm that the range of hyperparameters we considered was sufficient

autoplot(fits_ngrams)


Display performance of best configuration

Our best model yet!

show_best(fits_ngrams)
# A tibble: 5 × 9
  penalty mixture max_tokens .metric  .estimator  mean     n std_err
    <dbl>   <dbl>      <int> <chr>    <chr>      <dbl> <int>   <dbl>
1 0.207       0        10000 accuracy binary     0.901     1      NA
2 0.695       0        10000 accuracy binary     0.900     1      NA
3 0.0183      0.2      10000 accuracy binary     0.900     1      NA
4 0.00546     0.4      10000 accuracy binary     0.898     1      NA
5 0.00546     0.6      10000 accuracy binary     0.897     1      NA
  .config               
  <chr>                 
1 Preprocessor2_Model013
2 Preprocessor2_Model014
3 Preprocessor2_Model031
4 Preprocessor2_Model050
5 Preprocessor2_Model070

12.8.6 Word Embeddings

BoW is an introductory approach for feature engineering.


As you have read, word embeddings are a common alternative that addresses some of the limitations of BoW. Word embeddings are also well supported in the textrecipes package.

Let’s switch gears away from document term matrices and BoW to word embeddings


You can find pre-trained word embeddings on the web

Below, we download and open pre-trained GloVe embeddings

  • I chose a smaller set of embeddings to ease computational cost
    • Wikipedia 2014 + Gigaword 5
    • 6B tokens, 400K vocab, uncased, 50d
temp <- tempfile()
options(timeout = max(300, getOption("timeout")))  # need more time to download big file
download.file("https://nlp.stanford.edu/data/glove.6B.zip", temp)
unzip(temp, files = "glove.6B.50d.txt")
glove_embeddings <- read_delim(here::here("glove.6B.50d.txt"),
                               delim = " ",
                               col_names = FALSE) 
Rows: 8 Columns: 51
── Column specification ────────────────────────────────────────────────────────
Delimiter: " "
chr  (1): X1
dbl (50): X2, X3, X4, X5, X6, X7, X8, X9, X10, X11, X12, X13, X14, X15, X16,...

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

Recipe for GloVe embedding. NOTES:

  • token = "words"
  • No need to filter tokens
  • New step is step_word_embeddings(text, embeddings = glove_embeddings)
# rec_glove <-  recipe(sentiment ~ text, data = data_trn) |>
#   step_tokenize(text, 
#                 engine = "tokenizers", token = "words",   # included for clarity
#                 options = list(stopwords = all_stops)) |>
#   step_word_embeddings(text, embeddings = glove_embeddings)

Hyperparameter grid for GloVe embedding (no need for max_tokens)

# grid_glove <- grid_regular(penalty(range = c(-7, 3)), 
#                             dials::mixture(), 
#                             levels = c(20, 6))

Tuning hyperparameters for GloVe embedding

# fits_glove <-
#   logistic_reg(penalty = tune(), 
#              mixture = tune()) |> 
#   set_engine("glmnet") |> 
#   tune_grid(preprocessor = rec_glove, 
#             resamples = splits, 
#             grid = grid_glove, 
#             metrics = metric_set(accuracy))

Confirm that the range of hyperparameters we considered was sufficient

# autoplot(fits_glove)

Display performance of best configuration for GloVe embedding

# show_best(fits_glove)

12.8.7 Best Configuration

  • Fit best model configuration to training set
rec_final <-
  recipe(sentiment ~ text, data = data_trn) |>
    step_tokenize(text, 
                  engine = "tokenizers", token = "ngrams",
                  options = list(n = 2, n_min = 1)) |>
    step_tokenfilter(text, max_tokens = select_best(fits_ngrams)$max_tokens) |>
    step_tfidf(text) |> 
    step_normalize(all_predictors()) 
rec_final_prep <- rec_final |>
  prep(data_trn)
Warning in asMethod(object): sparse->dense coercion: allocating vector of size
1.9 GiB
feat_trn <- rec_final_prep |> 
  bake(NULL) 

fit_final <-
  logistic_reg(penalty = select_best(fits_ngrams)$penalty, 
             mixture = select_best(fits_ngrams)$mixture) |> 
  set_engine("glmnet") |> 
  fit(sentiment ~ ., data = feat_trn)


  • Open and clean test to make features
data_test <- read_csv(here::here(path_data, "imdb_test.csv"),
                      show_col_types = FALSE) |> 
  rowid_to_column(var = "doc_num") |> 
  mutate(sentiment = factor(sentiment, levels = c("neg", "pos"))) |> 
  glimpse()
Rows: 25,000
Columns: 3
$ doc_num   <int> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 1…
$ sentiment <fct> neg, neg, neg, neg, neg, neg, neg, neg, neg, neg, neg, neg, …
$ text      <chr> "Once again Mr. Costner has dragged out a movie for far long…
data_test <- data_test |> 
  mutate(text = str_replace_all(text, "<br /><br />", " "),
         text = str_replace_all(text, " _", " "),
         text = str_replace_all(text, " _", " "),
         text = str_replace_all(text, "^_", ""),
         text = str_replace_all(text, "_\\.", "\\."),
         text = str_replace_all(text, "\\(_", "\\("),
         text = str_replace_all(text, ":_", ": "),
         text = str_replace_all(text, "_{3,}", " "))

feat_test <- rec_final_prep |> 
  bake(data_test)
Warning in asMethod(object): sparse->dense coercion: allocating vector of size
1.9 GiB

  • Predict into test
accuracy_vec(feat_test$sentiment, predict(fit_final, feat_test)$.pred_class)
[1] 0.89268
  • Confusion matrix
cm <- tibble(truth = feat_test$sentiment,
             estimate = predict(fit_final, feat_test)$.pred_class) |> 
  conf_mat(truth, estimate)

autoplot(cm, type = "heatmap")


And let's end by calculating permutation feature importance scores in the test set using DALEX

library(DALEX, exclude= "explain")
Welcome to DALEX (version: 2.4.3).
Find examples and detailed introduction at: http://ema.drwhy.ai/
library(DALEXtra)

We are going to sample only a subset of the test set to keep the computational costs lower for this example.

set.seed(12345)
feat_subtest <- feat_test |> 
  slice_sample(prop = .05) # 5% of data 

Now we can get a df for the features (without the outcome) and a separate vector for the outcome.

x <- feat_subtest |> select(-sentiment)



For the outcome, we need to convert it to 0/1 (for classification) and then pull the vector out of the data frame

y <- feat_subtest |> 
  mutate(sentiment = if_else(sentiment == "pos", 1, 0)) |> 
  pull(sentiment)

We also need a specific predictor function that will work with the DALEX package

predict_wrapper <- function(model, newdata) {
  predict(model, newdata, type = "prob") |> 
    pull(.pred_pos)
}

We will also need an explainer object based on our model and data

explain_test <- explain_tidymodels(fit_final, # our model object 
                                   data = x, # df with features without outcome
                                   y = y, # outcome vector
                                   # our custom predictor function
                                   predict_function = predict_wrapper)
Preparation of a new explainer is initiated
  -> model label       :  model_fit  (  default  )
  -> data              :  1250  rows  10000  cols 
  -> data              :  tibble converted into a data.frame 
  -> target variable   :  1250  values 
  -> predict function  :  predict_function 
  -> predicted values  :  No value for predict function target column. (  default  )
  -> model_info        :  package parsnip , ver. 1.1.0.9004 , task classification (  default  ) 
  -> predicted values  :  numerical, min =  0.0003611459 , mean =  0.5143743 , max =  0.9999001  
  -> residual function :  residual_function 
  -> residuals         :  numerical, min =  0 , mean =  0 , max =  0  
  A new explainer has been created!  

Finally, we need to define a custom function for our performance metric as well

accuracy_wrapper <- function(observed, predicted) {
  observed <- fct(if_else(observed == 1, "pos", "neg"),
                  levels = c("pos", "neg"))
  predicted <- fct(if_else(predicted > .5, "pos", "neg"), 
                   levels  = c("pos", "neg"))
  accuracy_vec(observed, predicted)
}



We are now ready to calculate feature importance metrics


We are only doing 1 permutation for each feature to keep computational costs lower for this demonstration. In real life, do more!

set.seed(123456)
imp_permute <- cache_rds(
  expr = {model_parts(explain_test, 
                      type = "raw", 
                      loss_function = accuracy_wrapper,
                      B = 1)
  },
  rerun = rerun_setting,
  dir = "cache/012/",
  file = "imp_permute"
)

Display the 30 most important features (with accuracy as the loss function, a lower dropout_loss after permutation means a more important feature). A plotting version is sketched (commented out) at the end of this section.

imp_permute |> 
  filter(variable != "_full_model_",
         variable != "_baseline_") |> 
  mutate(variable = fct_reorder(variable, dropout_loss)) |> 
  slice_head(n = 30) |> 
  print()
                   variable permutation dropout_loss     label
1              tfidf_text_2           0        0.902 model_fit
2        tfidf_text_a great           0        0.902 model_fit
3         tfidf_text_at all           0        0.902 model_fit
4            tfidf_text_and           0        0.903 model_fit
5          tfidf_text_below           0        0.903 model_fit
6         tfidf_text_decent           0        0.903 model_fit
7     tfidf_text_especially           0        0.903 model_fit
8           tfidf_text_true           0        0.903 model_fit
9      tfidf_text_very good           0        0.903 model_fit
10        tfidf_text_actors           0        0.904 model_fit
11           tfidf_text_age           0        0.904 model_fit
12          tfidf_text_aged           0        0.904 model_fit
13        tfidf_text_always           0        0.904 model_fit
14    tfidf_text_an amazing           0        0.904 model_fit
15     tfidf_text_and while           0        0.904 model_fit
16       tfidf_text_appears           0        0.904 model_fit
17         tfidf_text_as he           0        0.904 model_fit
18         tfidf_text_avoid           0        0.904 model_fit
19        tfidf_text_beauty           0        0.904 model_fit
20           tfidf_text_bit           0        0.904 model_fit
21         tfidf_text_bored           0        0.904 model_fit
22        tfidf_text_boring           0        0.904 model_fit
23    tfidf_text_boring and           0        0.904 model_fit
24       tfidf_text_did not           0        0.904 model_fit
25 tfidf_text_disappointing           0        0.904 model_fit
26          tfidf_text_doll           0        0.904 model_fit
27          tfidf_text_door           0        0.904 model_fit
28           tfidf_text_dvd           0        0.904 model_fit
29        tfidf_text_end up           0        0.904 model_fit
30          tfidf_text_good           0        0.904 model_fit
And here are the 30 least important features (permuting them barely changes accuracy)

imp_permute |> 
  filter(variable != "_full_model_",
         variable != "_baseline_") |> 
  mutate(variable = fct_reorder(variable, dropout_loss, 
                                .desc = TRUE)) |> 
  slice_tail(n = 30) |> 
  print()
                  variable permutation dropout_loss     label
1      tfidf_text_anywhere           0        0.908 model_fit
2         tfidf_text_as if           0        0.908 model_fit
3        tfidf_text_as the           0        0.908 model_fit
4          tfidf_text_best           0        0.908 model_fit
5       tfidf_text_blatant           0        0.908 model_fit
6  tfidf_text_disappointed           0        0.908 model_fit
7       tfidf_text_episode           0        0.908 model_fit
8    tfidf_text_first half           0        0.908 model_fit
9         tfidf_text_giant           0        0.908 model_fit
10       tfidf_text_got to           0        0.908 model_fit
11     tfidf_text_help the           0        0.908 model_fit
12     tfidf_text_is still           0        0.908 model_fit
13      tfidf_text_it when           0        0.908 model_fit
14         tfidf_text_jack           0        0.908 model_fit
15         tfidf_text_main           0        0.908 model_fit
16         tfidf_text_meat           0        0.908 model_fit
17           tfidf_text_ok           0        0.908 model_fit
18         tfidf_text_once           0        0.908 model_fit
19       tfidf_text_one of           0        0.908 model_fit
20        tfidf_text_other           0        0.908 model_fit
21      tfidf_text_perfect           0        0.908 model_fit
22         tfidf_text_ship           0        0.908 model_fit
23        tfidf_text_shoot           0        0.908 model_fit
24      tfidf_text_sisters           0        0.908 model_fit
25 tfidf_text_something to           0        0.908 model_fit
26    tfidf_text_the lives           0        0.908 model_fit
27        tfidf_text_tries           0        0.908 model_fit
28     tfidf_text_tries to           0        0.908 model_fit
29    tfidf_text_very well           0        0.908 model_fit
30       tfidf_text_effort           0        0.909 model_fit
# full_model <- imp_permute |>  
#    filter(variable == "_full_model_")
  
# imp_permute |> 
#   filter(variable != "_full_model_",
#          variable != "_baseline_") |> 
#   mutate(variable = fct_reorder(variable, dropout_loss)) |> 
#   arrange(dropout_loss) |>       # lowest accuracy after permutation = most important
#   slice_head(n = 30) |> 
#   ggplot(aes(dropout_loss, variable)) +
#     geom_vline(data = full_model, aes(xintercept = dropout_loss),
#                linewidth = 1.4, lty = 2, alpha = 0.7) +
#     geom_boxplot(fill = "#91CBD765", alpha = 0.4) +
#     theme(legend.position = "none") +
#     labs(x = "accuracy", y = NULL)

13 NLP Discussion

13.1 Announcements

  • Last application assignment done!
  • Remaining units
    • Bring it together for applications
    • Ethics and Algorithmic Fairness
    • Reading and discussion only for both units. Discussion requires you!
  • Two weeks left!
    • next Tuesday (lab on NLP)
    • next Thursday (discussion on applications)
    • following Tuesday (discussion on ethics/fairness)
    • following Thursday (concepts final review; 50 minutes only)
  • Early start to the final application assignment (assigned next Thursday, April 25th; due Wednesday, May 8th, at 8 pm)
  • Concepts exam at lab time during finals period (Tuesday, May 7th, 11-12:15 in this room)

13.2 Single nominal variable

  • Our algorithms need features coded with numbers
    • How did we generally do this?
    • What about algorithms like random forest?

13.3 Single nominal variable Example

  • Current emotional state
    • angry, afraid, sad, excited, happy, calm
  • How represented with one-hot? (see the sketch below)
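
As a concrete sketch of the mechanics (the data frame here is hypothetical, just the six states above):

library(tidymodels)

d <- tibble(state = factor(c("angry", "afraid", "sad", "excited", "happy", "calm")))

recipe(~ state, data = d) |> 
  step_dummy(state, one_hot = TRUE) |>  # one 0/1 column per level; no reference level dropped
  prep() |> 
  bake(new_data = NULL)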

13.4 What about a document of words (e.g., a sentence rather than a single word)?

  • I watched a scary movie last night and couldn’t get to sleep because I was so afraid
  • I went out with my friends to a bar and slept poorly because I drank too much



  • How represented with bag of words?
  • binary, count, tf-idf



  • What are the problems with these approaches
    • relationships between features or similarity between full vector across observations
    • context/order
    • dimensionality
    • sparsity

13.5 N-grams vs. simple (1-gram) BoW

  • How different?



  • some context/order
  • but still no relationship/meaning or similarity
  • even higher dimensions

13.6 Linguistic Inquiry and Word Count (LIWC)

  • Tausczik, Y. R., & Pennebaker, J. W. (2010). The psychological meaning of words: LIWC and computerized text analysis methods. Journal of Language and Social Psychology, 29(1), 24–54. https://doi.org/



  • Meaningful features? (based on domain expertise)
  • Lower dimensional
  • Very limited breadth



  • Can add domain-specific dictionaries


13.7 Word Embeddings

  • Encodes word meaning
  • Words that are similar get similar vectors (see the sketch after this list)
  • Meaning derived from context in various text corpora
  • Lower dimensional
  • Less sparse
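
A minimal sketch of the "similar words get similar vectors" idea, assuming the GloVe vectors have been read into a tibble like glove_embeddings above (token in the first column, dimensions in the remaining columns); the specific words are illustrative:

# pull the embedding vector for a single word (hypothetical helper)
get_vec <- function(word, embeddings = glove_embeddings) {
  embeddings |> 
    filter(X1 == word) |> 
    select(-X1) |> 
    unlist()
}

# cosine similarity between two word vectors
cosine_sim <- function(a, b) sum(a * b) / (sqrt(sum(a^2)) * sqrt(sum(b^2)))

cosine_sim(get_vec("good"), get_vec("great"))  # related words -> relatively high similarity
cosine_sim(get_vec("good"), get_vec("door"))   # unrelated words -> lower similarity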

13.8 Examples

  • Affect Words



  • Two more general-purpose approaches

  • word2vec (Google)

    • CBOW (continuous bag of words)
    • skip-gram




13.9 fastText (Facebook AI team)

  • character n-grams (e.g., 3-6 character subsequences of each word)
  • a word’s vector is the sum of its n-gram vectors
  • can handle low-frequency or even novel words (see the short sketch below)
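
A tiny sketch of the character n-gram idea: the 3-grams fastText would extract from the word "where" (fastText pads words with < and > boundary markers):

word <- "<where>"  # word with fastText's boundary markers
n <- 3
substring(word, 1:(nchar(word) - n + 1), n:nchar(word))
# [1] "<wh" "whe" "her" "ere" "re>"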

13.10 Other Methods

  • Other approaches
    • GloVe
    • ELMo
    • BERT
Hvitfeldt, Emil, and Julia Silge. 2022. Supervised Machine Learning for Text Analysis in R. https://smltar.com/.
Silge, Julia, and David Robinson. 2017. Text Mining with R: A Tidy Approach. 1st ed. Beijing; Boston: O’Reilly Media.
Wickham, Hadley, Mine Çetinkaya-Rundel, and Garrett Grolemund. 2023. R for Data Science: Visualize, Model, Transform, and Import Data. 2nd ed. https://r4ds.hadley.nz/.