Introduction
This is a text mining analysis of the 180 “Seinfeld” television show scripts. “Seinfeld” is the famous “show about nothing” that ran from 1989 to 1998. The show still has millions of fans who watch the re-runs that continue to air.
## Clear workspace
rm(list = ls())
## load libraries
library(stringr) ## for strings functions
library(XML) ## for parsing html files
library(rvest) ## for parsing html files
library(RCurl) ## downloading Web pages.
library(tm) ## for text mining
library(dplyr) ## for dataframe manipulation
library(tidytext) ## for text mining. *Not* a Hadley Wickham package.
library(tidyverse) ## has numerous Hadley Wickham libraries providing data manipulation functions, example tidyr
library(ggplot2) ## for plotting functions
library(gsubfn) ## for enhanced "sub" function
## path variables
codePath = '/Users/mitchellfawcett/Documents/Data Science/Seinfeld/Code'
dataPath = '/Users/mitchellfawcett/Documents/Data Science/Seinfeld/Data/RawScriptPage'
outputPath = '/Users/mitchellfawcett/Documents/Data Science/Seinfeld/Data/ScriptOutput'
All 180 Scripts?
Of course, there is a Website that has them. Here you go: http://www.seinfeldscripts.com/seinfeld-scripts.html
The scripts are embedded in Web pages that are filled with assorted links, advertisements, images and other extraneous “stuff”. These all needed to be removed. All I wanted was the name of the script, the date of the show, and the lines of dialog that each character speaks. I didn’t care about stage directions, notes about visual reactions of the characters or any side comments the transcriber may have inserted.
The scripts were transcribed by assorted fans of the show on a volunteer basis. The standards for formatting the transcriptions were fairly loose. Some transcribers used paragraph tags to define lines of dialog, others used line breaks. One used neither. The cast list is not always formatted the same way and even the start and end of the script is not always tagged the same way. Fortunately there were enough similarities between scripts to make a decent attempt at extraction.
Here are some of the variations in transcription techniques. The three-digit number in the show name indicates the order in which the show aired, from 001 to 180:
- 071_TheNonfatYogurt does not have colons after the character’s name at the start of each line of dialog.
- 069_TheBris, 090_TheChineseWoman and 076_TheStall put character names (with no colons) and dialog on separate lines.
- 018_TheNote does not use <p> or <br> html tags to separate lines of dialog.
- 100_Highlights-of-100-1 and 101_Highlights-of-100-2 do not have any dialog documented. These are just snippets of previous shows.
- 177_The-Clip-Show-1 and 178_The-Clip-Show-2 do not have any dialog documented. These are just snippets of previous shows.
- 098_TheLabelMakerl and 027_TheStranded have ‘ ’ in almost every line.
- 171_TheWizard, 165_TheApology and 161_TheJunkMail have ‘"’ in almost every line.
- 133_TheWaitOut is unfinished and ends with this line: KRAMER: (Bracing himself on a bar stool) Yeah, I already did it. It won’t come off. The zipper’s stuck.
- 149_TheSuzie has all sorts of notes about shots, camera angles etc. that need to be removed.
- 112_ThePostponement.html has a layout that intermixes paragraph and break tags.
FIX : The last line of dialog is getting dropped from most scripts.
Why do this?
I can’t think of any truly practical reason for doing this other than that I thought it would be fun. I work as a data analyst, so I intrinsically enjoy this sort of thing. Mostly I justified the time I spent as a good excuse to practice text mining skills. Once the data was organized a certain way I figured I could answer questions like:
- Which character had the most lines?
- What were the most frequent words uttered by each character?
- Which two characters spent the most time conversing with each other?
- How many times did the phrase “no soup for you!” occur?
- What was the longest script; the shortest?
- Which character asked the most questions?
…a computer program about nothing.
Download a script
The next block of code defines a function that takes a script name, builds a URL to locate it and then downloads the page. The downloaded HTML is saved into a text file with the same name as the script.
## Define the function that downloads and saves entire Web page that contains a script. The function will save all the HTML that is found at the
## script's URL page. No processing of the HTML takes place here.
downloadScriptPageFunc <- function(webPageName, sequenceNumber) {
## The function is passed a Web page name and a sequence number.
## The sequence number is the order in which the episode aired.
## Get rid of any spaces in the Web page name.
webPageName <- gsub(" ", "", webPageName, fixed = TRUE)
## Convert the sequence number to a left zero-padded, three-character string. It keeps the downloaded files in numerical order.
sequenceNumber <- paste("00", sequenceNumber, sep = "")
sequenceNumber <- str_sub(sequenceNumber, start = -3) ## keep only the rightmost 3 characters of the sequence number
## Load a variable with the path to the script source web page.
## This is the URL where we will find a transcription for a single episode.
## Example: "http://www.seinfeldscripts.com/TheDoorman.html"
webPageUrl <- paste("http://www.seinfeldscripts.com/", webPageName, sep='')
## Load a variable with the path and file location to save the raw script.
## This is the folder and file name where we will save the downloaded script html file to our computer.
## Example: "/Users/mitchellfawcett/Documents/Data Science/Seinfeld/Data/RawScriptPage/104_TheDoorman.html"
## (The doorman was the 104th episode.)
webPageSaveLocation <- paste(dataPath, "/", sequenceNumber, "_", webPageName, sep = '')
## Read the contents of the script web page into a variable using the base function readLines.
webPageHtml <- readLines(webPageUrl, warn = FALSE) ## The "warn" argument is set to FALSE to suppress the warning for files that are missing a final EOL.
fileConn <- file(webPageSaveLocation)
writeLines(webPageHtml, fileConn)
close(fileConn)
return()
}
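The name handling inside the download function can be tried out without touching the network. The following is only a minimal sketch of that part, not the code I ran: buildScriptPathsSketch and its saveDir argument are hypothetical names, and sprintf is swapped in for the paste/str_sub padding used above.

```r
## Hypothetical helper: builds the URL and the local file name for one script.
## Nothing is downloaded here.
buildScriptPathsSketch <- function(webPageName, sequenceNumber, saveDir = "RawScriptPage") {
  ## Remove any spaces from the Web page name
  webPageName <- gsub(" ", "", webPageName, fixed = TRUE)
  ## Left zero pad the aired order to three characters, e.g. 7 -> "007"
  paddedNumber <- sprintf("%03d", as.integer(sequenceNumber))
  list(
    url  = paste0("http://www.seinfeldscripts.com/", webPageName),
    file = file.path(saveDir, paste0(paddedNumber, "_", webPageName))
  )
}

paths <- buildScriptPathsSketch(" TheDoorman.html", 104)
print(paths$url)
print(paths$file)
```

Keeping the zero padding in one sprintf call makes it easy to see that the file names will sort in broadcast order.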
Main Loop for Downloading Scripts from Web
## Apply the download function to each file name to download and save each script's html to a local file on disk.
## On the main page that had all the links to seinfeld scripts, only the 11th through 190th links point to show episodes. I figured
## this out by manually inspecting the page's html.
for (i in 11:190) {
downloadScriptPageFunc(htmlFileNames[i], i - 10) ## Subtracting 10 because the actual script file names don't start until element "11".
## The sequence number (i - 10) will be embedded in the name of the final cleaned up output file to provide a chronological sequence for the scripts.
}
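The loop above expects a character vector, htmlFileNames, holding the link targets found on the main scripts page. As a hedged illustration of how such a vector could be pulled out of a page with base R alone, here is a sketch that works on a few made-up lines of HTML. The variable names and the sample markup are assumptions, and on the real page the subset would be the 11th through 190th links rather than the two shown here.

```r
## Made-up stand-in for the lines of the index page
indexHtmlSketch <- c(
  '<a href="index.html">Home</a> <a href="about.html">About</a>',
  '<a href="TheSeinfeldChronicles.htm">The Seinfeld Chronicles</a>',
  '<a href="TheStakeout.htm">The Stakeout</a>'
)

## Find every href="..." attribute, allowing several per line
hrefAttributes <- unlist(regmatches(indexHtmlSketch,
                                    gregexpr('href="[^"]*"', indexHtmlSketch)))

## Keep only the part between the double quotes
allLinksSketch <- sub('^href="([^"]*)"$', "\\1", hrefAttributes)

## In this toy page the first two links are navigation, the rest are episodes
episodeLinksSketch <- allLinksSketch[3:4]
print(episodeLinksSketch)
```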
Define Some Text Processing Functions
A function is needed to insert missing colons after character names at the start of each line of dialog. The colons are needed for a later step.
insertColonsFunc <- function(somewebtext) {
## Look for these names at the start of a line and insert a colon after them.
## Elaine, Jerry, George, Lady, Kramer, Dad, Mom, Lloyd, Newman, Owner, Labbie, Labbette, Son, Doctor, News, Giuliani, Boy
## Replace a name at the start of a line that is followed by whitespace with the back-referenced name, a colon and a space ("\\1: ").
textWithColons <- sub("^(Elaine|Jerry|George|Lady|Kramer|Dad|Mom|Lloyd|Newman|Owner|Labbie|Labbette|Son|Doctor|News|Giuliani|Boy)\\s", "\\1: ", somewebtext)
return(textWithColons)
}
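To see what the substitution does and does not touch, here is a small self-contained sketch. insertColonsSketch is a hypothetical cut-down copy with a shorter, configurable name list, and the sample lines are invented.

```r
## Hypothetical cut-down version of the colon inserter with a short name list
insertColonsSketch <- function(lines, speakers = c("Jerry", "George", "Elaine", "Kramer")) {
  ## Build "^(Jerry|George|...)\\s" from the speaker names
  pattern <- paste0("^(", paste(speakers, collapse = "|"), ")\\s")
  ## sub() changes only the first match, which is anchored to the line start
  sub(pattern, "\\1: ", lines)
}

sampleLines <- c("Jerry I like yogurt.",
                 "He told Jerry about it.",
                 "Kramer: Giddy up.")
withColons <- insertColonsSketch(sampleLines)
print(withColons)
```

Only the first line changes. The second has the name in the middle of the line, and the third already has a colon, so the name is not followed by whitespace. A line of dialog that happens to begin with a listed name followed by a space would also get a colon, which is a limitation of this approach.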
Define a Dialog Line Realignment Function
A few scripts put the character’s name and the line they speak on separate lines. This function puts a colon after the character’s name so that the dialog can later be joined onto the same line as the name, which is the standard format I use for later processing. It uses gsubfn to perform some of the text manipulation.
realignFunc <- function(sometext, filename) {
## sometext <- webTextNoNewline ## test code. remove.
## Define a function to capitalize each word in a string
simpleCap <- function(this, x) {
s <- tolower(x)
s <- strsplit(s, " ")[[1]]
s2 <- paste(toupper(substring(s, 1,1)), substring(s, 2), sep="", collapse=" ")
return(paste0(s2, ': '))
}
## Find character names and put colon-space after them
## Here is the list of character names to look for. The list depends on the episode.
if (filename == "069_TheBris.htm") {
characterList <- "(^JERRY$|^GEORGE$|^KRAMER$|^ELAINE$|^MYRA$|^STAN$|^PATIENT$|^MRS. SWEEDLER$|^RESIDENT$|^MAN$|^WOMAN$|^MOHEL$|^ALL$|^FRANK and ESTELLE$)"
}
if (filename == "090_TheChineseWoman.htm") {
characterList <- "(^JERRY$|^GEORGE$|^KRAMER$|^ELAINE$|^DONNA$|^ESTELLE$|^FRANK$|^CAPE$|^DOCTOR$|^MAN$|^NOREEN$|^HOSTESS$|^GUY$)"
}
if (filename == "076_TheStall.htm") {
characterList <- "(^JERRY$|^GEORGE$|^KRAMER$|^ELAINE$|^JANE$|^TONY$)"
}
if (filename == "137_TheBizarroJerry.htm") {
characterList <- "(^JERRY$|^GEORGE$|^KRAMER$|^ELAINE$|^KEVIN$|^MAN #2$|^MAN$|^MAN3$|^GILLIAN$|^AMANDA$|^MODEL #2$|^MODEL #3$|^GENE$|^FELDMAN$|^LELAND$|^MODEL #4$|^MODEL #5$|^BOUNCER$|^VARGUS$)"
}
if (filename == "149_TheSuzie.htm") {
characterList <- "(^JERRY$|^GEORGE$|^KRAMER$|^ELAINE$|^MIKE$|^PEGGY$|^PETERMAN$|^WILHELM$|^ALLISON$)"
}
## See http://ftp.auckland.ac.nz/software/CRAN/doc/packages/gsubfn.pdf for a wonderful enhancement to the gsub function that allows the
## calling of an external function to modify strings found by the regex pattern. Here I use it to capitalize just the first letter of the character names.
namesWithColons <- gsubfn(pattern = characterList, replacement = simpleCap, x = sometext, backref = 1, ignore.case = TRUE)
## Remove lines that are all uppercase. More transcriber notes.
uppercaseRemoved <- namesWithColons[namesWithColons != toupper(namesWithColons)]
return(uppercaseRemoved)
}
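The effect of this step can be illustrated with base R only. The sketch below is a simplification under stated assumptions: it matches names exactly with %in% instead of using gsubfn with a case-insensitive regex, and toSpeakerTagSketch and the sample lines are hypothetical.

```r
## Hypothetical helper: "FRANK and ESTELLE" -> "Frank And Estelle: "
toSpeakerTagSketch <- function(name) {
  words <- strsplit(tolower(name), " ", fixed = TRUE)[[1]]
  capitalized <- paste0(toupper(substring(words, 1, 1)), substring(words, 2),
                        collapse = " ")
  paste0(capitalized, ": ")
}

rawLines <- c("JERRY", "So what happened?", "INT. COFFEE SHOP", "GEORGE", "Nothing.")
knownSpeakers <- c("JERRY", "GEORGE")

## Turn lines that consist only of a speaker name into "Name: " tags
isSpeaker <- rawLines %in% knownSpeakers
tagged <- rawLines
tagged[isSpeaker] <- vapply(rawLines[isSpeaker], toSpeakerTagSketch,
                            character(1), USE.NAMES = FALSE)

## Drop whatever is still entirely uppercase (transcriber notes)
realigned <- tagged[tagged != toupper(tagged)]
print(realigned)
```

The name tags and the dialog are still on separate elements at this point. They get joined later, when consecutive physical lines are concatenated into one line of dialog.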
Define a Function to Parse Based on Paragraph Tags
ptagParseFunc <- function(filename) {
## Parsing the html assuming paragraph tags are the dialog delimiters
## filename <- "149_TheSuzie.htm" ## test code comment out.
## Use read_html (from the xml2 package, re-exported by rvest) to read a Web page file into an XML document object
xmlObject <- read_html(paste("/Users/mitchellfawcett/Documents/Data Science/Seinfeld/Data/RawScriptPage/", filename, sep = ''))
## Extract the paragraph node set from the XML object and get the text of each node.
## Store the resulting vector of character strings in a variable.
webText <- xmlObject %>% html_nodes("p") %>% html_text()
## Do some work to clean up the character strings contained in the text.
## Get rid of extra whitespace.
webTextNoSpace <- gsub("\\s+"," ", webText)
## Remove newline and carriage return
webTextNoNewline <- gsub("[\r\n]", "", webTextNoSpace)
## Some special processing if this is 069_TheBris, 090_TheChineseWoman, 076_TheStall, 137_TheBizarroJerry or 149_TheSuzie
if (filename == "069_TheBris.htm" | filename == "090_TheChineseWoman.htm" | filename == "076_TheStall.htm" | filename == "137_TheBizarroJerry.htm" | filename == "149_TheSuzie.htm") {
webTextNoNewline <- realignFunc(webTextNoNewline, filename)
}
## A % at the beginning of a line indicates a comment. Remove these lines.
commentLines <- grep("^%", webTextNoNewline)
## If there are no comments, grep returns an empty vector, and negative indexing with an empty vector selects no lines at all. So check that there is at least one comment first.
if (length(commentLines) > 0) {
webTextNoComment <- webTextNoNewline[-grep("^%", webTextNoNewline)]
} else {
webTextNoComment <- webTextNoNewline
}
## What is left is mostly dialog. A character's name followed by a colon (Jerry:) indicates the start of a line of dialog. But the dialog can
## be split across more than one physical line.
## Create a vector of the line numbers that have a colon indicating the start of a line of dialog.
lineStarts <- grep(":", webTextNoComment)
## Since a single dialog line can wrap onto more than one physical line, the pieces need to be concatenated into a single line.
## Loop through the physical lines in webTextNoComment and concatenate all the physical lines belonging to one dialog line.
concatLines <- character()
for (i in 1:length(webTextNoComment)) {
newStart <- lineStarts[i] ## This is the physical line number for the start of a line of dialog.
nextStart <- lineStarts[i+1] ## This is the physical line number of the next characters line of dialog.
nextStartMinusOne <- lineStarts[i+1] - 1 ## This is the physical line number just before the next person's dialog start.
## We want to string together all the physical lines between newStart and nextStartMinusOne, inclusive.
if (is.finite(nextStartMinusOne)) { ## For the last dialog line nextStartMinusOne is NA, so is.finite() is FALSE: no error is generated, but that last line is never appended (the dropped-last-line problem noted earlier).
concatLines <- c(concatLines, str_c(c(webTextNoComment[newStart:nextStartMinusOne]), collapse = ' ')) ## using the c() function is okay in this case
## because of the small number of inserts needed.
}
}
## Remove stuff that is within parentheses. These are transcriber notes, not dialog.
parentheseRemoved <- gsub("\\s*\\([^\\)]+\\)", replacement = "", x = concatLines)
## Get rid of stuff between square brackets. More transcriber notes.
squarebracketRemoved <- gsub("\\[.+?]", replacement = "", x = parentheseRemoved)
script_p <- squarebracketRemoved
## Remove any curly braces that are left hanging around
script_p <- gsub("\\{|\\}", '', script_p)
## Remove embedded double quotes
script_p <- gsub('"', '', script_p)
## TheSuzie contains notes about camera angles, which camera is on which actor, kind of shot, etc. These all need to come out but there is no distinctive
## delimiters that contain them. Look for unique strings and remove the entire line of text that has it.
script_p <- grep("MS|full shot|cameras|pans|a:|E;|#2|2:", script_p, value = TRUE, invert = TRUE)
## Remove lines with long runs of repeated punctuation
script_p <- grep("=========", script_p, value = TRUE, invert = TRUE)
## Save the script
write(script_p, paste(outputPath, filename, sep = '/'))
}
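The core of this function (drop comment lines, join wrapped lines, strip transcriber notes) can be sketched in a few lines of base R. This is an illustration under stated assumptions rather than the code I ran: cleanDialogSketch is a hypothetical name, the sample lines are invented, and unlike the loop above it also keeps the final line of dialog by treating the end of the text as the end of the last speech.

```r
## Hypothetical condensed version of the clean-up steps
cleanDialogSketch <- function(physicalLines) {
  ## Drop "%" comment lines. invert = TRUE is safe even when nothing matches,
  ## so no special case for zero comments is needed.
  kept <- grep("^%", physicalLines, value = TRUE, invert = TRUE)

  ## Lines containing a colon start a new line of dialog
  starts <- grep(":", kept)
  if (length(starts) == 0) return(character())

  ## Each speech ends just before the next start; the last one ends at the
  ## final physical line, so it is not dropped.
  ends <- c(starts[-1] - 1, length(kept))
  joined <- vapply(seq_along(starts),
                   function(i) paste(kept[starts[i]:ends[i]], collapse = " "),
                   character(1))

  ## Strip parenthesized transcriber notes
  gsub("\\s*\\([^)]*\\)", "", joined)
}

cleaned <- cleanDialogSketch(c("% transcribed by a fan",
                               "JERRY: Hello",
                               "(waves) there.",
                               "GEORGE: Hi.",
                               "Again."))
print(cleaned)
```

As in the original, any line containing a colon counts as the start of a speech, so a time such as 10:30 at the start of a wrapped line would split the dialog in the wrong place.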
Define a Function to Determine if Paragraph Tag or Break Tag Parsing Should Be Used for a Script
The function uses the count of the two tag types to determine which method to use. There is special forcing of method depending on the script in some cases.
## First, a test to determine if *paragraph* or *break* tags delimit lines of dialog. There are different parsing methods used for each.
## Examples:
## TheOldMan.htm uses <p>...</p> to delineate lines of dialog
## TheDeal.htm uses <br>
parsePageFunc <- function(filename) {
## filename <- '112_ThePostponement.html' ## put a file name here to test the code for this function. Comment it out for production.
## Read in the raw html code and count the two different delimiters.
rawhtml <- readLines(paste("/Users/mitchellfawcett/Documents/Data Science/Seinfeld/Data/RawScriptPage/", filename, sep = ''))
## Do a test to see which of two ways a raw script file should be processed.
if (sum(str_count(rawhtml, "<p>")) >= sum(str_count(rawhtml, "<br>")) && filename != '018_TheNote.html' && filename != '112_ThePostponement.html') {
## '018_TheNote.html' and '112_ThePostponement.html' must be forced to be processed by the <br> tag method because of the unique way they were transcribed.
## If the count of <p> is greater than or equal to the count of <br>, paragraph tags were used.
## Otherwise, break tags were used.
## Parse the html assuming paragraph tags '<p>' are the dialog delimiters
ptagParseFunc(filename)
} else {
## Parse the file using <br> tags as dialog delimiters
brtagParseFunc(filename)
}
}
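The decision rule itself is easy to test in isolation. The following is a minimal base R sketch, assuming made-up HTML lines. countTagSketch and chooseParserSketch are hypothetical names, and gregexpr stands in for str_count.

```r
## Count literal occurrences of a tag across all lines
countTagSketch <- function(htmlLines, tag) {
  matches <- gregexpr(tag, htmlLines, fixed = TRUE)
  ## gregexpr returns -1 for a line with no match, so count positive positions
  sum(vapply(matches, function(m) sum(m > 0), integer(1)))
}

## Return the name of the parsing method instead of calling a parser
chooseParserSketch <- function(htmlLines, forceBreak = FALSE) {
  if (!forceBreak &&
      countTagSketch(htmlLines, "<p>") >= countTagSketch(htmlLines, "<br>")) {
    "paragraph"
  } else {
    "break"
  }
}

print(chooseParserSketch(c("<p>a</p><p>b</p>", "c<br>")))
print(chooseParserSketch(c("a<br>b<br>", "<p>c</p>")))
print(chooseParserSketch(c("<p>a</p>"), forceBreak = TRUE))
```

The match is literal and case sensitive, so tags written as <P> or carrying attributes would not be counted. That is also true of the str_count calls above.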
Loop for Parsing Each Downloaded Script File
The following code loops through the list of raw source files calling the “parsePageFunc” function for each one. parsePageFunc extracts the lines of dialog from the page and saves it to a text file.
Loop to Get Broadcast Dates
fileNames <- dir(dataPath, pattern = "^([0-1][0-9][0-9])") ## Pattern finds files that start with three digit numbers.
unlist(lapply(fileNames, FUN = broadcastDateFunc))
About a third of the script transcriptions do not have the original air date specified, so I won’t try to do any analysis using that feature.
Start the Dialog Analysis
I now have the dialog (where available) extracted for each of the 180 episodes and saved into separate text files, named according to the episode number and show title.
The analysis I’ll be doing will be statistical in nature and will treat the dialog as a neutral “bag of words”. There will be no natural language processing, no parts of speech analysis, no contextual analysis. Instead I will use simple word counts, word similarity and word associations, that is, distances between particular words, to draw conclusions about the text.
Create a Tidy Text Script Structure
The tidytext package has functions for reshaping the script text into a format that will allow efficient text mining activity. See https://cran.r-project.org/web/packages/tidytext/tidytext.pdf.
Create a tidytext data structure with separate columns for the character’s name and what they said.
Define the Tokenize Function
This function takes a cleaned-up script file and reshapes it into a data frame in which each row represents a single line of dialog. Each row has the script line number, who said it, the text of the line, the episode number and the episode title. Splitting the text into one word per row happens afterwards, in the main loop.
## Each line in a script file consists of a character's name followed by a colon, a space, then a line of dialog text.
## I want to split lines on the first occurrence of a colon and create a new dataframe with two columns.
tokenizeFunc <- function(filename) {
## filename <- "004_MaleUnbonding.htm" ## test code
## Get the episode number and title from the passed file name
f <- sub("\\..*", "", filename) ## chop off after the dot in filename
episodeNumber <- (substr(f, 1, 3))
episodeTitle <- substr(f, 5, 50)
scriptLines <- as.data.frame(readLines(paste("/Users/mitchellfawcett/Documents/Data Science/Seinfeld/Data/ScriptOutput/", filename, sep = '')))
## This uses the tidyr package. Does the split and dataframe building all in one step.
## Creates a dataframe with two columns: (1) Character (2) Text
## col = 1 means split the one and only column in scriptLines.
dialogDF <- tidyr::separate(data = scriptLines, col = 1, into = c("Character", "Text"), sep = ":", extra = "merge")
## extra = "merge" ignores colons after the first
## Another interesting way to split lines on the first colon.
## Citation: https://stackoverflow.com/questions/26246095/r-strsplit-on-first-instance
## myresults <- unlist(regmatches(scriptLines, regexpr(": ", scriptLines), invert = TRUE))
## The tidyverse method for adding row numbers to a data frame.
dialogDF <- tibble::rowid_to_column(dialogDF, "LineNumber")
## Add the episode number and show title to the episode's dialog data frame
dialogDF$EpisodeNumber <- episodeNumber
dialogDF$EpisodeTitle <- episodeTitle
return(dialogDF)
#
#
# ## The following builds one big dataset containing the dialog from every script
# ## by appending each script's dialog to the previous ones.
# if (exists('allDialogDF')) {
# allDialogDF <<- rbind(allDialogDF, dialogDF) ## use <<- to make allDialogDF a global object between function calls
# } else {
# allDialogDF <<- dialogDF
# }
#
# ## Tidytext function that puts each word into a separate row in a dataframe.
# oneScriptDF <- dialogDF %>% unnest_tokens(Word, Text) ## words for one script
#
# ## Append the script words to all the previous
# if (exists('allWordsDF')) {
# allWordsDF <<- rbind(allWordsDF, oneScriptDF) ## use <<- to make allWordsDF a global object between function calls
# } else {
# allWordsDF <<- oneScriptDF
# }
## allWordsDF will have all the words for all the scripts when we are done the looping through the scripts.
}
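The two string manipulations in this function, splitting on the first colon and pulling the episode number and title out of the file name, can be sketched with base R. This follows the regmatches idea cited in the comments rather than tidyr::separate. splitOnFirstColonSketch and the sample lines are hypothetical, and the sketch also trims the space that follows the colon.

```r
## Hypothetical helper: split "NAME: text" on the first colon only
splitOnFirstColonSketch <- function(scriptLines) {
  ## With invert = TRUE, regmatches returns the text before and after the match
  parts <- regmatches(scriptLines,
                      regexpr(":", scriptLines, fixed = TRUE),
                      invert = TRUE)
  data.frame(
    Character = vapply(parts, function(p) p[1], character(1)),
    Text = trimws(vapply(parts, function(p) p[2], character(1))),
    stringsAsFactors = FALSE
  )
}

dialogSketch <- splitOnFirstColonSketch(c("JERRY: It's 10:30.", "GEORGE: What?"))
print(dialogSketch)

## Episode number and title from a file name such as "004_MaleUnbonding.htm"
baseName <- sub("\\..*", "", "004_MaleUnbonding.htm")
episodeNumberSketch <- substr(baseName, 1, 3)
episodeTitleSketch <- substr(baseName, 5, nchar(baseName))
print(c(episodeNumberSketch, episodeTitleSketch))
```

Later colons, such as the one in 10:30, stay in the text because regexpr finds only the first match.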
Main Loop for Tokenizing
This loops through all the saved script files and tokenizes each in turn. The result will be one big dataframe containing the words for all the scripts.
## fileNames <- dir(outputPath, pattern = "^003|004|005") ## Pattern finds files that start with 003, 004 or 005. Test code
fileNames <- dir(outputPath, pattern = "^([0-1][0-9][0-9])") ## Pattern finds files that start with three digit numbers.
## Loop through each file name, tokenize the script text and build a single dataframe with all the scripts dialog.
for (filename in fileNames) {
df <- tokenizeFunc(filename)
## The following builds one big dataset containing the dialog from every script
## by appending each script's dialog to the previous ones.
if (exists('allDialogDF')) {
allDialogDF <- rbind(allDialogDF, df) ## plain <- is enough here because the loop runs at the top level, not inside a function
} else {
allDialogDF <- df
}
}
## Tidytext function that puts each word into a separate row in a dataframe.
allWordsDF <- allDialogDF %>% unnest_tokens(Word, Text) ## words for all scripts
allWordsTB <- as_tibble(allWordsDF) ## Convert the data frame result to a "tibble" (as_tibble replaces the deprecated as.tibble). See https://cran.r-project.org/web/packages/tibble/tibble.pdf
## Load the stop words lexicons from tidytext package
data(stop_words)
## Remove the stop words from script data using a tidytext function.
allWordsTB <- allWordsTB %>% anti_join(stop_words, by=c("Word"="word")) ##
nrow(allWordsTB) ## Removing stop words reduced the word count from roughly 606,000 to the figure printed below.
## [1] 188165
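For readers who want to see the idea of tokenizing and dropping stop words without installing tidytext, here is a base R illustration on two invented lines. It is only a rough stand-in for unnest_tokens and anti_join: the splitting rule is simpler, and tinyStopWords is a made-up six-word list, not the tidytext lexicon.

```r
## Two invented lines of dialog
dialogSketch <- data.frame(
  Character = c("JERRY", "GEORGE"),
  Text = c("The sea was angry that day, my friends.", "I was in the pool!"),
  stringsAsFactors = FALSE
)

## Lowercase, then split on anything that is not a letter or an apostrophe
tokens <- strsplit(tolower(dialogSketch$Text), "[^a-z']+")

## One row per word, repeating the speaker for each of their words
wordsSketch <- data.frame(
  Character = rep(dialogSketch$Character, lengths(tokens)),
  Word = unlist(tokens),
  stringsAsFactors = FALSE
)
wordsSketch <- wordsSketch[nzchar(wordsSketch$Word), ]

## Made-up miniature stop word list
tinyStopWords <- c("the", "was", "that", "my", "i", "in")
contentWords <- wordsSketch[!wordsSketch$Word %in% tinyStopWords, ]
print(contentWords)
```

Of the thirteen words in the two lines, five survive. That is the same kind of shrinkage seen above on the full set of scripts.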
Where Are We Now?
There are two main datasets at this point.
allDialogDF is a dataframe that contains the dialog from every script. Each row represents one line of dialog.
allWordsTB is a tibble where each row is a single word.
Explore Script Line Counts
Count the number of lines in each script
## Total number of lines across all scripts
dim(allDialogDF)
## [1] 55073 5
## Number of lines in each script using dplyr, in ascending order.
scripts <- group_by(allDialogDF, EpisodeNumber, EpisodeTitle)
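linecounts <- summarize(scripts, count = n(), na.rm = T) ## Note: na.rm is not an argument of summarize(); here it only adds a constant column named na.rm, which shows up in the output below.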
linecounts[order(linecounts$count),]
## # A tibble: 180 x 4
## # Groups: EpisodeNumber [180]
## EpisodeNumber EpisodeTitle count na.rm
## <chr> <chr> <int> <lgl>
## 1 100 Highlights-of-100-1 1 TRUE
## 2 101 Highlights-of-100-2 1 TRUE
## 3 177 The-Clip-Show-1 2 TRUE
## 4 178 The-Clip-Show-2 2 TRUE
## 5 023 TheParkingGarage 20 TRUE
## 6 133 TheWaitOut 105 TRUE
## 7 030 TheSubway 180 TRUE
## 8 001 TheSeinfeldChronicles 211 TRUE
## 9 057 TheOuting 211 TRUE
## 10 035 TheBoyfriend2 212 TRUE
## # ... with 170 more rows
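The same kind of count can be reproduced in base R, which also makes it easy to pick out the shortest and longest scripts. This is a sketch on invented data: dialogSketch stands in for allDialogDF and has only the EpisodeNumber column.

```r
## Invented stand-in for allDialogDF: one row per line of dialog
dialogSketch <- data.frame(
  EpisodeNumber = c("001", "001", "002", "003", "003", "003"),
  stringsAsFactors = FALSE
)

## Count rows per episode and sort in ascending order of line count
lineCountsSketch <- as.data.frame(
  table(EpisodeNumber = dialogSketch$EpisodeNumber),
  responseName = "count",
  stringsAsFactors = FALSE
)
lineCountsSketch <- lineCountsSketch[order(lineCountsSketch$count), ]
print(lineCountsSketch)

shortestEpisode <- lineCountsSketch$EpisodeNumber[1]
longestEpisode <- lineCountsSketch$EpisodeNumber[nrow(lineCountsSketch)]
```

On the real data the shortest entries are the highlight and clip shows, which have no transcribed dialog, so they would need to be excluded before naming a shortest script.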
Explore Word Counts Across All Scripts
## Show a plot of word counts. Words that appear more than 600 times.
allWordsTB %>%
count(Word, sort = TRUE) %>%
filter(n > 600) %>%
mutate(Word = reorder(Word, n)) %>%
ggplot(aes(Word, n)) +
geom_col() +
xlab(NULL) +
coord_flip()
Where Does the Word Jerry Appear?
head(allWordsTB)
## # A tibble: 6 x 5
## LineNumber Character EpisodeNumber EpisodeTitle Word
## <int> <chr> <chr> <chr> <chr>
## 1 1 JERRY 001 TheSeinfeldChronicles single
## 2 1 JERRY 001 TheSeinfeldChronicles enjoyable
## 3 1 JERRY 001 TheSeinfeldChronicles experiences
## 4 1 JERRY 001 TheSeinfeldChronicles life
## 5 1 JERRY 001 TheSeinfeldChronicles people
## 6 1 JERRY 001 TheSeinfeldChronicles hear
allWordsTB[ which(allWordsTB$Word=='jerry'), ]
## # A tibble: 3,676 x 5
## LineNumber Character EpisodeNumber EpisodeTitle Word
## <int> <chr> <chr> <chr> <chr>
## 1 61 GEORGE 001 TheSeinfeldChronicles jerry
## 2 63 GEORGE 001 TheSeinfeldChronicles jerry
## 3 69 GEORGE 001 TheSeinfeldChronicles jerry
## 4 73 GEORGE 001 TheSeinfeldChronicles jerry
## 5 122 GEORGE 001 TheSeinfeldChronicles jerry
## 6 191 LAURA 001 TheSeinfeldChronicles jerry
## 7 194 LAURA 001 TheSeinfeldChronicles jerry
## 8 1 Cast 002 TheStakeout jerry
## 9 1 Cast 002 TheStakeout jerry
## 10 50 Elaine 002 TheStakeout jerry
## # ... with 3,666 more rows
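The output above shows speaker names in mixed case (GEORGE, Elaine) and a "Cast" pseudo-speaker, so any per-character tally should normalize the names first. Here is a small base R sketch on invented rows. wordsSketch stands in for allWordsTB, and the counts are not real results.

```r
## Invented stand-in for allWordsTB with inconsistent speaker casing
wordsSketch <- data.frame(
  Character = c("GEORGE", "George", "LAURA", "Elaine", "ELAINE", "Jerry"),
  Word = c("jerry", "jerry", "jerry", "jerry", "soup", "jerry"),
  stringsAsFactors = FALSE
)

## Keep the rows where the word is "jerry"
saysJerry <- wordsSketch[wordsSketch$Word == "jerry", ]

## Uppercase the speaker names so "George" and "GEORGE" are counted together
mentionsBySpeaker <- sort(table(toupper(saysJerry$Character)), decreasing = TRUE)
print(mentionsBySpeaker)
```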
Build a Term Document Matrix
Another fundamental structure sometimes used for analyzing documents using the tm package is the so-called Corpus, representing a collection of text documents. In this case the text documents are the dialog files that I created.
src <- DirSource(outputPath)
scriptCorpus <- VCorpus(src)
inspect(scriptCorpus[1])
tdm <- TermDocumentMatrix(scriptCorpus)
findFreqTerms(tdm, 1000)
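For intuition about what a term document matrix holds, here is a tiny base R sketch that builds one by hand from three invented "documents". It does not use the tm package, and the frequency cut-off of 2 is chosen only to suit the toy data.

```r
## Three invented documents
docsSketch <- c(doc1 = "yada yada yada",
                doc2 = "no soup for you",
                doc3 = "soup yada")

## Split each document into words
tokens <- strsplit(docsSketch, " ", fixed = TRUE)

## Rows are terms, columns are documents, cells are counts
tdmSketch <- table(term = unlist(tokens),
                   doc = rep(names(docsSketch), lengths(tokens)))
print(tdmSketch)

## Terms that occur at least twice across all documents
termTotals <- rowSums(tdmSketch)
frequentTerms <- names(which(termTotals >= 2))
print(frequentTerms)
```

The last step plays the role that findFreqTerms plays above: it returns the terms whose total count meets a threshold.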
Other Notes
There is no dialog transcribed for the clip shows (1 & 2).