M. Fawcett - 09/24/2016

Published at http://rpubs.com/mrfawcettjr/209121

Introduction

In this report I use statistics to (almost seriously) answer the question “Is Donald Trump really a Republican?” It’s a question many have asked because of Trump’s sometimes liberal-sounding rhetoric, and because he was a registered Democrat until 2012 1. As a secondary question, I address whether Hillary Clinton is really a Democrat: coming from a prominent conservative family, she campaigned for Barry Goldwater as a Young Republican and did not switch to the Democratic Party until her time at Yale.

My approach is to use presidential speeches from 1960 to the present to create a statistical model that can accurately classify speech as either “conservative” or “liberal”. In doing this I will make the naive assumption that Republican = Conservative and Democratic = Liberal.

Nearly 400 speeches from the University of Virginia Miller Center (http://millercenter.org) were used in the training of a predictive model. There are representative speeches from every president from Richard Nixon to Barack Obama. I also used a sprinkling of writings from non-presidential conservative and liberal thinkers throughout history 2.

Two modeling algorithms were explored, Naive Bayes and C5.0. A boosted version of C5.0 was found to have the best predictive results. I chose these two algorithms because they perform well and their intuition is easy to explain.

Both modeling techniques treat text as a “bag of words”. Frequency of word occurrence was the only attribute used in model training and prediction. Syntax, semantics and meaning were completely ignored.
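
To make the “bag of words” idea concrete, here is a toy illustration (not part of the actual analysis) using the tm package: two made-up sentences are reduced to a table of word counts, with word order and grammar discarded.

require("tm")

## Two toy "documents"
toyCorpus <- VCorpus(VectorSource(c("cut taxes and cut spending",
                                    "invest in schools and jobs")))

## Bag-of-words representation: one row per document, one column per word
toyDTM <- DocumentTermMatrix(toyCorpus)
as.matrix(toyDTM)
## Each cell holds a word count, e.g. "cut" appears twice in the first document;
## by default, words shorter than three characters (like "in") are dropped.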

In the final result, Donald Trump was found to be statistically on the conservative side. Clinton came out on the liberal side, albeit less strongly than Trump’s conservative leaning. To jump to the prediction results, click here.

Methodology

A big part of this project was finding and preparing text from presidential speeches. These speeches can be found all over the Internet, but fortunately the Miller Center at the University of Virginia provides a centralized list of links to transcripts of every important presidential address, going all the way back to George Washington.

The “XML” package in R provided all the necessary tools to download and parse the HTML from the Miller Center website and filter out just the text of the speeches. The text of the couple dozen non-presidential political “thinkers” was prepared manually by copying and pasting.

To see how the speeches were downloaded and processed go to http://rpubs.com/mrfawcettjr/209115.
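
For flavor only, here is a rough sketch of the kind of scraping the XML package makes possible. The URL and the XPath expression below are placeholders, not the selectors actually used; the real download-and-parse code is in the document linked above.

require("XML")

## Placeholder URL and XPath -- the real ones are in the linked document
speechUrl <- "http://millercenter.org/president/speeches/example-speech"

## Download the page and parse it into an HTML document tree
pageSource <- paste(readLines(speechUrl, warn = FALSE), collapse = "\n")
pageTree <- htmlParse(pageSource, asText = TRUE, encoding = "UTF-8")

## Pull out the transcript paragraphs and collapse them into one string
paragraphs <- xpathSApply(pageTree, "//div[@id = 'transcript']//p", xmlValue)
speechText <- paste(paragraphs, collapse = " ")

## Save as a plain text file, one file per speech
writeLines(speechText, "example-speech.txt")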

The text mining package “tm” was used to prepare the speeches and writings for statistical analysis. It has R functions for loading text into “corpora” and turning them into matrices for training models. The “e1071” and “C50” packages were used to build the Naive Bayes and C5.0 models.

An outline of the model building process is as follows:

  1. Find and download writings and speeches associated with liberal or conservative stances.
  2. Organize the text files into a Conservative directory and a Liberal directory.
  3. Load all the text files into a single corpus.
  4. Process the corpus to remove punctuation, extra whitespace, numbers, profanity, non-ASCII characters, stopwords, etc.
  5. Build a Document Term Matrix from the corpus.
  6. Reduce the number of features (words) in the DTM by setting a minimum frequency threshold.
  7. Convert word counts to yes/no indicator features.
  8. Divide data into training, validation and testing datasets.
  9. Train model.
  10. Evaluate model performance.
  11. Improve model performance.
  12. Select model.
  13. Make predictions to answer the questions of interest.

A Tricky Thing About Naive Bayes Prediction Using R

To build and test models with Naive Bayes using R, the independent variables must have the same set of factor levels in the training, validation and testing datasets. For example, if dealing with Yes/No values, you must ensure that each variable carries both levels in all three sets of data. If a training variable has only ‘Yes’ responses but either the validation or testing data has both ‘Yes’ and ‘No’, you will get an error when you test the model (subscript out of bounds).

One way to avoid this is to convert all the variables to factors (step 5 below) before splitting the data into training, validation and testing sets; a minimal sketch of the idea follows the list.

  1. Load corpus.
  2. Convert corpus to a Document Term Matrix.
  3. Convert DTM to a matrix.
  4. Convert matrix to a data frame.
  5. Convert data frame variables into factors.
  6. Split data frame into training, validation and testing datasets.
  7. Complete model building process.
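
Here is a minimal sketch of that idea with toy data (the variable name is hypothetical, not one from this project): because the column is converted to a factor while both outcomes are still present in the full data, every split keeps both levels, even a split in which one of the values never occurs.

## Toy feature: both values occur somewhere in the full dataset
word_present <- c("Yes", "Yes", "Yes", "No")
toyDF <- data.frame(word = factor(word_present, levels = c("No", "Yes")))

## Split AFTER converting to a factor: the training rows happen to be all "Yes"...
trainPart <- toyDF[1:3, , drop = FALSE]

## ...but the factor still carries both levels, so data containing "No" is handled
levels(trainPart$word)
## [1] "No"  "Yes"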

Load Libraries

Load R libraries needed by the program.

## Clean up
rm(list = ls())


## Load libraries
require("tm")         ## for text mining
require("R.utils")    ## for countLines
require("reader")     ## for n.readLines
require("SnowballC")  ## for word stemming
require("e1071")      ## for naiveBayes
## require("caret")   ## for naiveBayes
require("gmodels")    ## for prediction evaluation
require("C50")        ## for the C5.0 model

Prepare Training, Validation and Testing Data

The experimental data consists of presidential speeches since 1960 plus a few classic essays by conservative and liberal thinkers. The source data consists of plain text files, one file per speech or essay, stored in a directory called SourceData with Liberal and Conservative sub-directories (/Users/mitchellfawcett/Documents/Data Science/LeftRight/Data/SourceData/).

Of the 404 source documents, 168 are Republican/Conservative and 236 are Democratic/Liberal.

The data files were divided randomly as follows: 60% training, 20% validation, 20% testing.

The next block of R code copies random files from the SourceData Liberal and Conservative directories to respective directories under the /Training, /Validation and /Testing directories.

## set the seed to make the results reproducible
set.seed(123)

## Set some path variables
sourceDir <- "/Users/mitchellfawcett/Documents/Data Science/LeftRight/Data/SourceData"
trainingDir <- "/Users/mitchellfawcett/Documents/Data Science/LeftRight/Data/TrainingData"
validationDir <- "/Users/mitchellfawcett/Documents/Data Science/LeftRight/Data/ValidationData"
testingDir <- "/Users/mitchellfawcett/Documents/Data Science/LeftRight/Data/TestingData"

## Create lists containing all the liberal and conservative file names.
liberalSourceFiles <- list.files(paste(sourceDir, 'Liberal', sep = '/'))
conservativeSourceFiles <- list.files(paste(sourceDir, 'Conservative', sep = '/'))

#### Divvy up the liberal files, 60%, 20%, 20%
## 60% of the sample size will be training
smp_size_train <- floor(0.60 * length(liberalSourceFiles))

## Index of random element numbers from the list of liberal files.
train_index <- sample(seq_len(length(liberalSourceFiles)), size = smp_size_train, replace = FALSE)

## This is the training data. The 40% not-training files will be divided 50/50 into validation and testing.
liberalSourceFiles_train <- liberalSourceFiles[train_index]

## This remainder will get split 50/50 below
liberalSourceFiles_not_train <- liberalSourceFiles[-train_index]

## Get the 50/50 index split of the not-training files to make the validation and testing sets.
smp_size_validation <- floor(0.50 * length(liberalSourceFiles_not_train))
validation_index <- sample(seq_len(length(liberalSourceFiles_not_train)), size = smp_size_validation, replace = FALSE)

## Validation data
liberalSourceFiles_validation <- liberalSourceFiles_not_train[validation_index]
## Test data
liberalSourceFiles_test <- liberalSourceFiles_not_train[-validation_index]

#### Divvy up the conservative files, 60%, 20%, 20%
## 60% of the sample size will be training
smp_size_train <- floor(0.60 * length(conservativeSourceFiles))

## Index of random element numbers from the list of conservative files.
train_index <- sample(seq_len(length(conservativeSourceFiles)), size = smp_size_train, replace = FALSE)

## This gets the training set. The 40% not-training files will be divided 50/50 into validation and testing.
conservativeSourceFiles_train <- conservativeSourceFiles[train_index]
conservativeSourceFiles_not_train <- conservativeSourceFiles[-train_index]

## Get the 50/50 split of the not-training files to make the validation and testing sets.
smp_size_validation <- floor(0.50 * length(conservativeSourceFiles_not_train))
validation_index <- sample(seq_len(length(conservativeSourceFiles_not_train)), size = smp_size_validation, replace = FALSE)

conservativeSourceFiles_validation <- conservativeSourceFiles_not_train[validation_index]
conservativeSourceFiles_test <- conservativeSourceFiles_not_train[-validation_index]

## Copy data files to where they need to go....
## Clear directory before copying to it.
#### Copy liberal files:
## Training
result0 <- file.remove(dir(paste(trainingDir, 'Liberal', sep = '/'), full.names=TRUE))  ## clear out directory
result1 <- file.copy(from = paste(sourceDir, 'Liberal', liberalSourceFiles_train, sep = '/'),
          to = paste(trainingDir, 'Liberal', sep = '/'))

## Validation
result2 <- file.remove(dir(paste(validationDir, 'Liberal', sep = '/'), full.names=TRUE))
result3 <- file.copy(from = paste(sourceDir, 'Liberal', liberalSourceFiles_validation, sep = '/'),
          to = paste(validationDir, 'Liberal', sep = '/'))

## Test
result4 <- file.remove(dir(paste(testingDir, 'Liberal', sep = '/'), full.names=TRUE))
result5 <- file.copy(from = paste(sourceDir, 'Liberal', liberalSourceFiles_test, sep = '/'),
          to = paste(testingDir, 'Liberal', sep = '/'))

#### Copy conservative files:
## Training
result6 <- file.remove(dir(paste(trainingDir, 'Conservative', sep = '/'), full.names=TRUE))
result7 <- file.copy(from = paste(sourceDir, 'Conservative', conservativeSourceFiles_train, sep = '/'),
          to = paste(trainingDir, 'Conservative', sep = '/'))

## Validation
result8 <- file.remove(dir(paste(validationDir, 'Conservative', sep = '/'), full.names=TRUE))
result9 <- file.copy(from = paste(sourceDir, 'Conservative', conservativeSourceFiles_validation, sep = '/'),
          to = paste(validationDir, 'Conservative', sep = '/'))

## Test
result10 <- file.remove(dir(paste(testingDir, 'Conservative', sep = '/'), full.names=TRUE))
result11 <- file.copy(from = paste(sourceDir, 'Conservative', conservativeSourceFiles_test, sep = '/'),
          to = paste(testingDir, 'Conservative', sep = '/'))

Build Corpus

Each data file contains the text of one speech or one writing. The classification of each item as Conservative or Liberal is based simply on who the author is and their political and philosophical association. Classification of the training data was not based on “expert” analysis of ideas, words or semantics.

The corpus will serve as a single container for all the texts that can be manipulated using functions found in the tm package.

## Load the conservative documents into corpora
myCorpus_conservative_train <- VCorpus(DirSource("/Users/mitchellfawcett/Documents/Data Science/LeftRight/Data/TrainingData/Conservative"))
myCorpus_conservative_validate <- VCorpus(DirSource("/Users/mitchellfawcett/Documents/Data Science/LeftRight/Data/ValidationData/Conservative"))
myCorpus_conservative_test <- VCorpus(DirSource("/Users/mitchellfawcett/Documents/Data Science/LeftRight/Data/TestingData/Conservative"))

## Get the number of documents in the conservative folders
numberConservDocs_train <- nrow(as.matrix(summary(myCorpus_conservative_train)))
numberConservDocs_validate <- nrow(as.matrix(summary(myCorpus_conservative_validate)))
numberConservDocs_test <- nrow(as.matrix(summary(myCorpus_conservative_test)))

## Create a vector of labels for the conservative documents
conserveLabels_train <- replicate(numberConservDocs_train, 'conservative', simplify = "vector")
conserveLabels_validate <- replicate(numberConservDocs_validate, 'conservative', simplify = "vector")
conserveLabels_test <- replicate(numberConservDocs_test, 'conservative', simplify = "vector")

## Load the liberal documents into corpora
myCorpus_liberal_train <- VCorpus(DirSource("/Users/mitchellfawcett/Documents/Data Science/LeftRight/Data/TrainingData/Liberal"))
myCorpus_liberal_validate <- VCorpus(DirSource("/Users/mitchellfawcett/Documents/Data Science/LeftRight/Data/ValidationData/Liberal"))
myCorpus_liberal_test <- VCorpus(DirSource("/Users/mitchellfawcett/Documents/Data Science/LeftRight/Data/TestingData/Liberal"))

## Get the number of documents in the liberal folders
numberLiberalDocs_train <- nrow(as.matrix(summary(myCorpus_liberal_train)))
numberLiberalDocs_validate <- nrow(as.matrix(summary(myCorpus_liberal_validate)))
numberLiberalDocs_test <- nrow(as.matrix(summary(myCorpus_liberal_test)))

## Create a vector of labels for the liberal documents
liberalLabels_train <- replicate(numberLiberalDocs_train, 'liberal', simplify = "vector")
liberalLabels_validate <- replicate(numberLiberalDocs_validate, 'liberal', simplify = "vector")
liberalLabels_test <- replicate(numberLiberalDocs_test, 'liberal', simplify = "vector")

## Combine the corpora. The order is important so we can later identify which rows in the document term matrix belong to which dataset when we build the model and test it.
myCorpus <- c(myCorpus_conservative_train, myCorpus_liberal_train, myCorpus_conservative_validate, myCorpus_liberal_validate, myCorpus_conservative_test, myCorpus_liberal_test)

myCorpus
## <<VCorpus>>
## Metadata:  corpus specific: 0, document level (indexed): 0
## Content:  documents: 404
## Combine the two label vectors. The order is important.
targetLabels <- c(conserveLabels_train, liberalLabels_train, conserveLabels_validate, liberalLabels_validate, conserveLabels_test, liberalLabels_test)

Prepare the Corpus

The next steps use various tm functions to clean the corpus for further use. This is where we convert all words to lower case, remove numbers, remove stopwords, remove punctuation, remove extra whitespace, and perform stemming.

## See: http://stackoverflow.com/questions/26834576/big-text-corpus-breaks-tm-map
##  for why the next statement is important.
myCorpus <- tm_map(myCorpus,
                     content_transformer(function(x) iconv(x, to='UTF-8-MAC', sub='byte')),
                     mc.cores=1)

## Copy & change name of the corpus
myCorpus_clean <- myCorpus

## Convert all words to lowercase
myCorpus_clean <- tm_map(myCorpus_clean, content_transformer(tolower))

## Remove numbers
myCorpus_clean <- tm_map(myCorpus_clean, removeNumbers)

## Remove stop words
myCorpus_clean <- tm_map(myCorpus_clean, removeWords, stopwords())
## myCorpus_clean <- tm_map(myCorpus_clean, removeWords, c("laughter", "applause"))  ## e.g. remove audience cues

## Remove punctuation - first create a function to replace punctuation with spaces
replacePunctuation <- function(x) {
    gsub("[[:punct:]]+", " ", x)
}

myCorpus_clean <- tm_map(myCorpus_clean, content_transformer(replacePunctuation))

## Perform stemming
myCorpus_clean <- tm_map(myCorpus_clean, stemDocument)

## Remove extra whitespace
myCorpus_clean <- tm_map(myCorpus_clean, stripWhitespace)

## Examine purified contents
print(myCorpus_clean)
## <<VCorpus>>
## Metadata:  corpus specific: 0, document level (indexed): 0
## Content:  documents: 404

Split Documents Into Words - Tokenize

Create a Document Term Matrix (DTM) from the corpus. Each row in the matrix contains the data from a single speech or writing. The columns correspond to the words found in the text.

## Build a Document Term Matrix.
## Rows are documents, columns are words.
dataDTM <- DocumentTermMatrix(myCorpus_clean)

Here is the size of the Document Term matrix.
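
It can be obtained with a call to dim() on the DTM:

dim(dataDTM)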

## [1]   404 18251

Only use frequent words

There are over 18,000 columns in the matrix, each column representing a different token (word). The next step reduces the matrix to words that occur 25 times or more across all of the documents.

## minimum number of times a word must appear in the data for it to be included in the model
n <- 25

train_freq_words <- findFreqTerms(dataDTM, n)  ## words that appear at least n times
## str(train_freq_words)

frequentDTM <- dataDTM[ , train_freq_words]

Here is the size of the Document Term matrix and a sample of the data after limiting it to frequent terms.

## [1]  404 3151
## <<DocumentTermMatrix (documents: 5, terms: 6)>>
## Non-/sparse entries: 4/26
## Sparsity           : 87%
## Maximal term length: 7
## Weighting          : term frequency (tf)
## 
##                                       Terms
## Docs                                   abandon abid abil abl abolish abort
##   _president_bush_speeches_speech-3419       0    0    0   0       0     0
##   _president_bush_speeches_speech-3420       0    0    1   0       0     0
##   _president_bush_speeches_speech-3422       0    0    0   0       0     0
##   _president_bush_speeches_speech-3423       0    0    0   1       0     0
##   _president_bush_speeches_speech-3424       0    0    1   1       0     0

Convert Frequency to Yes/No

## The following convert_counts() function converts counts to Yes/No strings:
convert_counts <- function(x) {
    x <- ifelse(x > 0, "Yes", "No")
}

yesnoDTM <- apply(frequentDTM, MARGIN = 2, convert_counts)

## convert the DTM to a matrix
yesnoMatrix <- as.matrix(yesnoDTM)

## convert the matrix to a data frame
yesnoDF <- as.data.frame(yesnoMatrix)

## make the variables into factors. Need to cast back to a data frame after using lapply (list apply).
dataDF <- as.data.frame(lapply(yesnoDF, factor))

This is what the training matrix looks like after converting the term frequencies to yes/no.

##                                      abandon abid abil abl abolish abort
## _president_bush_speeches_speech-3419      No   No   No  No      No    No
## _president_bush_speeches_speech-3420      No   No  Yes  No      No    No
## _president_bush_speeches_speech-3422      No   No   No  No      No    No
## _president_bush_speeches_speech-3423      No   No   No Yes      No    No
## _president_bush_speeches_speech-3424      No   No  Yes Yes      No    No

Split Data into Training and Validation Datasets

Doing the split depends on knowing the order in which source documents were added to the corpus when it was built.

We use the counts of conservative and liberal documents in the training, validation and testing splits, as determined in an earlier code block.

## Identify the start and end rows to be used for training, validation and testing 
start_train <- 1
end_train <- numberConservDocs_train + numberLiberalDocs_train

start_validate <- end_train + 1
end_validate <- end_train + numberConservDocs_validate + numberLiberalDocs_validate

start_test <- end_validate + 1
end_test <- end_validate + numberConservDocs_test + numberLiberalDocs_test

## Split the data into three partitions
trainDF <- dataDF[start_train:end_train, ]
validateDF <- dataDF[start_validate:end_validate, ]
testDF <- dataDF[start_test:end_test, ]


## Create vectors for the outcome values
trainLabels <- targetLabels[start_train:end_train]
validateLabels <- targetLabels[start_validate:end_validate]
testLabels <- targetLabels[start_test:end_test]

## proportion of liberal to conservative messages & labels
prop.table(table(trainLabels))
## trainLabels
## conservative      liberal 
##    0.4132231    0.5867769
prop.table(table(validateLabels))
## validateLabels
## conservative      liberal 
##       0.4125       0.5875
prop.table(table(testLabels))
## testLabels
## conservative      liberal 
##    0.4146341    0.5853659

Train the Naive Bayes Model

## Be sure the outcomes vector trainLabels is a factor.
nbClassifier <- naiveBayes(as.factor(trainLabels) ~ ., data = trainDF, laplace = 1)  ## e1071 package

## nbClassifier

Assess Naive Bayes Model Using Validation Data

validate_pred <- predict(nbClassifier, validateDF)  

## Using gmodels package
CrossTable(validate_pred, validateLabels,
    prop.chisq = FALSE, prop.t = FALSE,
    dnn = c('predicted', 'actual'))
## 
##  
##    Cell Contents
## |-------------------------|
## |                       N |
## |           N / Row Total |
## |           N / Col Total |
## |-------------------------|
## 
##  
## Total Observations in Table:  80 
## 
##  
##              | actual 
##    predicted | conservative |      liberal |    Row Total | 
## -------------|--------------|--------------|--------------|
## conservative |           21 |           14 |           35 | 
##              |        0.600 |        0.400 |        0.438 | 
##              |        0.636 |        0.298 |              | 
## -------------|--------------|--------------|--------------|
##      liberal |           12 |           33 |           45 | 
##              |        0.267 |        0.733 |        0.562 | 
##              |        0.364 |        0.702 |              | 
## -------------|--------------|--------------|--------------|
## Column Total |           33 |           47 |           80 | 
##              |        0.412 |        0.588 |              | 
## -------------|--------------|--------------|--------------|
## 
## 

Train the C5.0 Model

## train C5.0 model
C5_model <- C5.0(trainDF, as.factor(trainLabels))
C5_model
## 
## Call:
## C5.0.default(x = trainDF, y = as.factor(trainLabels))
## 
## Classification Tree
## Number of samples: 242 
## Number of predictors: 3151 
## 
## Tree size: 29 
## 
## Non-standard options: attempt to group attributes
## summary(C5_model)

Assess the C5.0 Model Using Validation Data.

C5_pred <- predict(C5_model, validateDF)

## Assess result
CrossTable(validateLabels, C5_pred,
    prop.chisq = FALSE, prop.c = FALSE, prop.r = FALSE,
    dnn = c('actual default', 'predicted default'))
## 
##  
##    Cell Contents
## |-------------------------|
## |                       N |
## |         N / Table Total |
## |-------------------------|
## 
##  
## Total Observations in Table:  80 
## 
##  
##                | predicted default 
## actual default | conservative |      liberal |    Row Total | 
## ---------------|--------------|--------------|--------------|
##   conservative |           25 |            8 |           33 | 
##                |        0.312 |        0.100 |              | 
## ---------------|--------------|--------------|--------------|
##        liberal |           25 |           22 |           47 | 
##                |        0.312 |        0.275 |              | 
## ---------------|--------------|--------------|--------------|
##   Column Total |           50 |           30 |           80 | 
## ---------------|--------------|--------------|--------------|
## 
## 

Train C5.0 with Boosting

C5_model_boost <- C5.0(trainDF, as.factor(trainLabels), trials = 10)
C5_model_boost
## 
## Call:
## C5.0.default(x = trainDF, y = as.factor(trainLabels), trials = 10)
## 
## Classification Tree
## Number of samples: 242 
## Number of predictors: 3151 
## 
## Number of boosting iterations: 10 
## Average tree size: 19.5 
## 
## Non-standard options: attempt to group attributes
## summary(C5_model_boost)

Assess the C5.0 Boosted Model Using Validation Data.

## Predict with the C5.0 boosted model using validation data.
C5_pred_boost <- predict(C5_model_boost, validateDF)

CrossTable(validateLabels, C5_pred_boost,
    prop.chisq = FALSE, prop.c = FALSE, prop.r = FALSE,
    dnn = c('actual default', 'predicted default'))
## 
##  
##    Cell Contents
## |-------------------------|
## |                       N |
## |         N / Table Total |
## |-------------------------|
## 
##  
## Total Observations in Table:  80 
## 
##  
##                | predicted default 
## actual default | conservative |      liberal |    Row Total | 
## ---------------|--------------|--------------|--------------|
##   conservative |           24 |            9 |           33 | 
##                |        0.300 |        0.113 |              | 
## ---------------|--------------|--------------|--------------|
##        liberal |           11 |           36 |           47 | 
##                |        0.138 |        0.450 |              | 
## ---------------|--------------|--------------|--------------|
##   Column Total |           35 |           45 |           80 | 
## ---------------|--------------|--------------|--------------|
## 
## 

The C5.0 boosted model offers the best performance of the three models.
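
For a quick numeric comparison, validation accuracy can be computed directly from the prediction vectors above (a sketch that assumes those objects are still in the workspace; the values follow from the confusion matrices shown).

## Proportion of validation documents classified correctly
mean(validate_pred == validateLabels)   ## Naive Bayes
## [1] 0.675
mean(C5_pred == validateLabels)         ## single C5.0 tree
## [1] 0.5875
mean(C5_pred_boost == validateLabels)   ## boosted C5.0
## [1] 0.75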

Use the Boosted C5.0 Model on Testing Data

Assess the C5.0 boosted model using testing data. The model is also saved for future use.

## Save the C5 model for later use.
save(C5_model_boost, file = "/Users/mitchellfawcett/Documents/Data Science/LeftRight/Models/C5_model_boost")

C5_pred_boost_test <- predict(C5_model_boost, testDF)

## Assess boost
CrossTable(testLabels, C5_pred_boost_test,
    prop.chisq = FALSE, prop.c = FALSE, prop.r = FALSE,
    dnn = c('actual default', 'predicted default'))
## 
##  
##    Cell Contents
## |-------------------------|
## |                       N |
## |         N / Table Total |
## |-------------------------|
## 
##  
## Total Observations in Table:  82 
## 
##  
##                | predicted default 
## actual default | conservative |      liberal |    Row Total | 
## ---------------|--------------|--------------|--------------|
##   conservative |           24 |           10 |           34 | 
##                |        0.293 |        0.122 |              | 
## ---------------|--------------|--------------|--------------|
##        liberal |            8 |           40 |           48 | 
##                |        0.098 |        0.488 |              | 
## ---------------|--------------|--------------|--------------|
##   Column Total |           32 |           50 |           82 | 
## ---------------|--------------|--------------|--------------|
## 
## 

Define a Function to Make Predictions

Set up a function that can be called to run the C5.0 model against a selected speech. The results will say whether the speech is conservative or liberal.

The function will create a corpus from the target speech, cleanse it using the same methods used for the training data (lower case, remove numbers, etc.), build a Document Term Matrix, and then run the prediction.

require(C50)
set.seed(1234)

makePrediction <- function(x) {

  ## The target speech is located here:
  targetSpeechPath <- x
  ## targetSpeechPath <- "/Users/mitchellfawcett/Documents/Data Science/LeftRight/TargetData"
  
  ## load model: "C5_model_boost"
  load("/Users/mitchellfawcett/Documents/Data Science/LeftRight/Models/C5_model_boost")
  
  ## load corpus
  targetCorpus <- VCorpus(DirSource(targetSpeechPath))
  ## inspect(targetCorpus)
  
  ## Prepare the target corpus
  ## See: http://stackoverflow.com/questions/26834576/big-text-corpus-breaks-tm-map
  ##  for why the next statement is important.
  targetCorpus <- tm_map(targetCorpus,
                         content_transformer(function(x) iconv(x, to='UTF-8-MAC', sub='byte')),
                         mc.cores=1)
  
  ## Copy & change name of the corpus
  targetCorpus_clean <- targetCorpus
  
  ## Convert all words to lowercase
  targetCorpus_clean <- tm_map(targetCorpus_clean, content_transformer(tolower))
  
  ## Remove numbers
  targetCorpus_clean <- tm_map(targetCorpus_clean, removeNumbers)
  
  ## Remove stop words
  targetCorpus_clean <- tm_map(targetCorpus_clean, removeWords, stopwords())
  
  ## Remove punctuation - first create a function to replace punctuation with spaces
  replacePunctuation <- function(x) {
    gsub("[[:punct:]]+", " ", x)
  }
  
  targetCorpus_clean <- tm_map(targetCorpus_clean, content_transformer(replacePunctuation))
  
  ## Perform stemming
  targetCorpus_clean <- tm_map(targetCorpus_clean, stemDocument)
  
  ## Remove extra whitespace
  targetCorpus_clean <- tm_map(targetCorpus_clean, stripWhitespace)
  
  ## Examine purified contents
  ## print(targetCorpus_clean)
  ## inspect(targetCorpus_clean)
  
  
  ## Build a Document Term Matrix.
  ## Rows are documents, columns are words.
  targetdataDTM <- DocumentTermMatrix(targetCorpus_clean)
  
  ## dim(targetdataDTM)
  
  
  ## convert frequency to yes/no
  ## The following convert_counts() function converts counts to Yes/No strings:
  convert_counts <- function(x) {
    x <- ifelse(x > 0, "Yes", "No")
  }
  
  targetyesnoDTM <- apply(targetdataDTM, MARGIN = 2,
                          convert_counts)
  
  ## Convert the result to a matrix and transpose so columns are words. With a single target document,
  ## apply() returns a named vector rather than a matrix, so as.matrix() yields a one-column matrix
  ## with terms as row names; transposing restores the terms-as-columns layout used for the training data.
  targetyesnoMatrix <- t(as.matrix(targetyesnoDTM))
  
  ## convert the matrix to a data frame
  targetyesnoDF <- as.data.frame(targetyesnoMatrix)
  
  ## make the variables into factors. Need to cast back to a data frame after using lapply (list apply).
  targetdataDF <- as.data.frame(lapply(targetyesnoDF, factor))
  
  
  ###### Map the target speech words to the words used in the model so words in the target that are not in the model are excluded.
  ###### We need a data frame for the target speech that matches the columns in the training data.
  
  ## Get the list of words in the training data used to build the model
  trainWords <- colnames(trainDF)
  
  ## Get the list of words in the target data
  targetWords <- colnames(targetdataDF)
  
  ## Get a list of the words in the target data that are in the training data
  intersectWords <- intersect(trainWords, targetWords)
  
  ## Get a list of words in the training data that are not found in the target data
  diffWords <- setdiff(trainWords, targetWords)
  
  ## Make a data frame of Yeses for the words in target speech that are part of the training words
  intersectDF <- data.frame(matrix(ncol = length(intersectWords), nrow = 1 ))
  colnames(intersectDF) <- intersectWords
  intersectDF[1, ] <- 'Yes'
  
  ## Make a data frame of Nos for the words in training data that are not part of the target speech
  diffDF <- data.frame(matrix(ncol = length(diffWords), nrow = 1 ))
  colnames(diffDF) <- diffWords
  diffDF[1, ] <- 'No'
  
  ## Create the data frame that will be used for the prediction
  targetDF <- cbind(intersectDF, diffDF)
  ## Sort its columns alphabetically
  targetDF <- targetDF[, order(names(targetDF))]
  
  ## Make the prediction
  C5_pred_boost_target <- predict(C5_model_boost, targetDF, type = 'prob')
  
  ## examine model
  ## summary(C5_model_boost)
  ## importance <- C5imp(C5_model_boost, metric = 'splits')
  
  ## display result of prediction
  C5_pred_boost_target

}

Making Predictions

Use the C5.0 boosted model to evaluate a target speech and determine if it is conservative or liberal.

The following three predictions were based on presidential nomination acceptance speeches.
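
The calls that produced the results below are not echoed, but they would look something like the following; the sub-directory names are illustrative, with each directory holding the single plain-text speech to be classified.

## Paths are illustrative; each directory contains one nomination acceptance speech
makePrediction("/Users/mitchellfawcett/Documents/Data Science/LeftRight/TargetData/Trump")
makePrediction("/Users/mitchellfawcett/Documents/Data Science/LeftRight/TargetData/Clinton")
makePrediction("/Users/mitchellfawcett/Documents/Data Science/LeftRight/TargetData/Humphrey")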

Here is Donald Trump’s prediction:

##   conservative  liberal
## 1     0.784227 0.215773

Here is Hillary Clinton’s prediction:

##   conservative   liberal
## 1    0.3657169 0.6342831

Here is Hubert Humphrey’s prediction:

##   conservative  liberal
## 1     0.892277 0.107723

To return to the top click here.

Sources

  1. Machine Learning with R, 2nd Edition, B. Lantz
  2. XPath and XPointer, J. E. Simpson
  3. The Elements of Statistical Learning, 2nd Edition, T. Hastie
  4. Package XML, D. T. Lang
  5. Package tm, I. Feinerer

Footnotes


  1. Actually Trump has changed parties more than once. See http://www.washingtontimes.com/news/2015/jun/16/donald-trump-changed-political-parties-at-least-fi/

  2. Conservative: Confucius, Cato the Elder, Edmund Burke, Goethe, Alexander Hamilton, Irving Babbitt, Eric Hoffer, Russell Kirk, Barry Goldwater, William F. Buckley Jr.
    Liberal: Lao Tsu, Charles de Montesquieu, John Stuart Mill, Frederick Douglass, Mary Wollstonecraft, Harriet Martineau, Aristotle, Desiderius Erasmus, Thomas Paine, John Maynard Keynes