M. Fawcett - 09/24/2016
Published at http://rpubs.com/mrfawcettjr/209121
In this report I use statistics to (almost seriously) answer the question “Is Donald Trump really a Republican?” It’s a question many have asked because of Trump’s sometimes liberal-sounding rhetoric, and because he was a registered Democrat until 2012 1. As a secondary question, I address whether Hillary Clinton is really a Democrat: coming from a prominent conservative family, she campaigned for Barry Goldwater as a Young Republican and did not switch to the Democratic Party until her time at Yale.
My approach is to use presidential speeches from 1960 to the present to create a statistical model that can accurately classify speech as either “conservative” or “liberal”. In doing this I make the naive assumption that Republican = Conservative and Democratic = Liberal.
Nearly 400 speeches from the University of Virginia Miller Center (http://millercenter.org) were used in the training of a predictive model. There are representative speeches from every president from Richard Nixon to Barack Obama. I also used a sprinkling of writings from non-presidential conservative and liberal thinkers throughout history 2.
Two modeling algorithms were explored, Naive Bayes and C5.0. A boosted version of C5.0 was found to have the best predictive results. I chose these two algorithms because they perform well and their intuitions are easy to explain.
Both modeling techniques treat text as a “bag of words”. Frequency of word occurrence was the only attribute used in model training and prediction; syntax, semantics and meaning were completely ignored.
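As a tiny illustration of the bag-of-words idea (a sketch, not part of the report’s pipeline), two one-line “documents” reduce to nothing more than term counts:
require("tm")
toy <- VCorpus(VectorSource(c("lower taxes and smaller government",
                              "invest in education and health care")))
inspect(DocumentTermMatrix(toy))  ## rows are documents, columns are term counts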
In the final result, Donald Trump was found to be statistically on the conservative side. Clinton came out on the liberal side, albeit less strongly than Trump’s conservative leaning. To jump to the prediction results, click here.
A big part of this project was finding and preparing text from presidential speeches. Their speeches can be found all over the Internet, but fortunately the Miller Center at the University of Virginia provides a centralized list of links to transcripts of every important presidential address going all the way back to George Washington.
The “XML” package in R provided all the tools needed to download and parse the HTML from the Miller Center website and filter out just the text of the speeches. The text of the couple dozen non-presidential political “thinkers” was prepared manually by copy and paste.
To see how the speeches were downloaded and processed go to http://rpubs.com/mrfawcettjr/209115.
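The scraping code lives in the report linked above; as a rough sketch only (the page and XPath below are placeholders, not the actual Miller Center page structure), the general approach with the XML package looks like this:
require("XML")
## Placeholder page; substitute a real transcript URL and the XPath that matches its layout.
speechURL <- "http://millercenter.org"
doc <- htmlParse(speechURL)
speechText <- xpathSApply(doc, "//p", xmlValue)  ## pull the text of every <p> element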
The text mining package “tm” was used to prepare the speeches and writings for statistical analysis. It has R functions for loading text into “corpora” and turning them into matrices for training models. The “e1071” and “C50” packages were used to build the Naive Bayes and C5.0 models.
An outline of the model building process is as follows:
1. Divide the source text files randomly into training (60%), validation (20%) and testing (20%) directories, keeping the Liberal/Conservative split.
2. Load the documents into corpora and label each one as conservative or liberal.
3. Clean the text: lower case, remove numbers, stop words and punctuation, strip whitespace, stem words.
4. Build a Document Term Matrix from the combined corpus.
5. Reduce the matrix to frequently occurring terms.
6. Convert the term frequencies to Yes/No factors.
7. Split the resulting data frame back into training, validation and testing partitions.
8. Train Naive Bayes and C5.0 models on the training data and compare them on the validation data.
9. Assess the best model with the testing data and save it for classifying new speeches.
A note on data preparation: to build and test models with Naive Bayes in R you must have the same set of levels for the independent variables in the training, validation and testing datasets. For example, if dealing with Yes/No values, you must ensure that each variable has both levels in all three sets of data. If a training variable only has ‘Yes’ responses but either the validation or testing data has both ‘Yes’ and ‘No’, you will get an error when you test your model (subscript out of bounds).
One way to avoid this is by converting all the variables to a “factor” (step 6 below) before splitting the data into training, validation and testing sets.
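As a minimal sketch of why this works (not part of the report’s code), a factor created before subsetting keeps both levels even when a subset happens to contain only one of them:
x <- c("Yes", "Yes", "Yes", "No")
f <- factor(x)          ## levels are fixed here: "No" "Yes"
trainSubset <- f[1:3]   ## this subset contains only "Yes" values...
levels(trainSubset)     ## ...but it still carries both levels: "No" "Yes"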
Load R libraries needed by the program.
## Clean up
rm(list = ls())
#Load libraries
require("tm") ## for text mining
require("R.utils") ## for countLines
require("reader") ## for n.readLines
require("SnowballC") ## for word stemming
require("e1071") ## for naiveBayes
## require("caret") ## for naiveBayes
require("gmodels") ## for prediction evaluation
require("C50") ## for J48 (C5.0) model
The experimental data consists of presidential speeches since 1960 plus a few classic essays by conservative and liberal thinkers. The training data consists of plain text files, one file per speech or essay, stored in a directory called SourceData (/Users/mitchellfawcett/Documents/Data Science/LeftRight/Data/SourceData/), which has Liberal and Conservative sub-directories.
168 of the files are Republican/Conservative and 236 are Democratic/Liberal.
The data files were divided randomly as follows: 60% training, 20% validation, 20% testing.
The next block of R code copies random files from the SourceData Liberal and Conservative directories to respective directories under the /Training, /Validation and /Testing directories.
## set the seed to make the results reproducible
set.seed(123)
## Set some path variables
sourceDir <- "/Users/mitchellfawcett/Documents/Data Science/LeftRight/Data/SourceData"
trainingDir <- "/Users/mitchellfawcett/Documents/Data Science/LeftRight/Data/TrainingData"
validationDir <- "/Users/mitchellfawcett/Documents/Data Science/LeftRight/Data/ValidationData"
testingDir <- "/Users/mitchellfawcett/Documents/Data Science/LeftRight/Data/TestingData"
## Create lists containing all the liberal and conservative file names.
liberalSourceFiles <- list.files(paste(sourceDir, 'Liberal', sep = '/'))
conservativeSourceFiles <- list.files(paste(sourceDir, 'Conservative', sep = '/'))
#### Divvy up the liberal files, 60%, 20%, 20%
## 60% of the sample size will be training
smp_size_train <- floor(0.60 * length(liberalSourceFiles))
## Index of random element numbers from the list of liberal files.
train_index <- sample(seq_len(length(liberalSourceFiles)), size = smp_size_train, replace = FALSE)
## This is the training data. The 40% not-training files will be divided 50/50 into validation and testing.
liberalSourceFiles_train <- liberalSourceFiles[train_index]
## This remainder will get split 50/50 below
liberalSourceFiles_not_train <- liberalSourceFiles[-train_index]
## Get the 50/50 index split of the not-training files to make the validation and testing sets.
smp_size_validation <- floor(0.50 * length(liberalSourceFiles_not_train))
validation_index <- sample(seq_len(length(liberalSourceFiles_not_train)), size = smp_size_validation, replace = FALSE)
## Validation data
liberalSourceFiles_validation <- liberalSourceFiles_not_train[validation_index]
## Test data
liberalSourceFiles_test <- liberalSourceFiles_not_train[-validation_index]
#### Divvy up the conservative files, 60%, 20%, 20%
## 60% of the sample size will be training
smp_size_train <- floor(0.60 * length(conservativeSourceFiles))
## Index of random element numbers from the list of conservative files.
train_index <- sample(seq_len(length(conservativeSourceFiles)), size = smp_size_train, replace = FALSE)
## This gets the training set. The 40% not-training files will be divided 50/50 into validation and testing.
conservativeSourceFiles_train <- conservativeSourceFiles[train_index]
conservativeSourceFiles_not_train <- conservativeSourceFiles[-train_index]
## Get the 50/50 split of the not-training files to make the validation and testing sets.
smp_size_validation <- floor(0.50 * length(conservativeSourceFiles_not_train))
validation_index <- sample(seq_len(length(conservativeSourceFiles_not_train)), size = smp_size_validation, replace = FALSE)
conservativeSourceFiles_validation <- conservativeSourceFiles_not_train[validation_index]
conservativeSourceFiles_test <- conservativeSourceFiles_not_train[-validation_index]
## Copy data files to where they need to go....
## Clear directory before copying to it.
#### Copy liberal files:
## Training
result0 <- file.remove(dir(paste(trainingDir, 'Liberal', sep = '/'), full.names=TRUE)) ## clear out directory
result1 <- file.copy(from = paste(sourceDir, 'Liberal', liberalSourceFiles_train, sep = '/'),
to = paste(trainingDir, 'Liberal', sep = '/'))
## Validation
result2 <- file.remove(dir(paste(validationDir, 'Liberal', sep = '/'), full.names=TRUE))
result3 <- file.copy(from = paste(sourceDir, 'Liberal', liberalSourceFiles_validation, sep = '/'),
to = paste(validationDir, 'Liberal', sep = '/'))
## Test
result4 <- file.remove(dir(paste(testingDir, 'Liberal', sep = '/'), full.names=TRUE))
result5 <- file.copy(from = paste(sourceDir, 'Liberal', liberalSourceFiles_test, sep = '/'),
to = paste(testingDir, 'Liberal', sep = '/'))
#### Copy conservative files:
## Training
result6 <- file.remove(dir(paste(trainingDir, 'Conservative', sep = '/'), full.names=TRUE))
result7 <- file.copy(from = paste(sourceDir, 'Conservative', conservativeSourceFiles_train, sep = '/'),
to = paste(trainingDir, 'Conservative', sep = '/'))
## Validation
result8 <- file.remove(dir(paste(validationDir, 'Conservative', sep = '/'), full.names=TRUE))
result9 <- file.copy(from = paste(sourceDir, 'Conservative', conservativeSourceFiles_validation, sep = '/'),
to = paste(validationDir, 'Conservative', sep = '/'))
## Test
result10 <- file.remove(dir(paste(testingDir, 'Conservative', sep = '/'), full.names=TRUE))
result11 <- file.copy(from = paste(sourceDir, 'Conservative', conservativeSourceFiles_test, sep = '/'),
to = paste(testingDir, 'Conservative', sep = '/'))
Each data file contains the text of one speech or one writing. The classification of each item as Conservative or Liberal is based simply on who the author is and their political and philosophical association. Classification of the training data was not based on “expert” analysis of ideas, words or semantics.
The corpus will serve as a single container for all the texts that can be manipulated using functions found in the tm package.
## Load the conservative documents into corpora
myCorpus_conservative_train <- VCorpus(DirSource("/Users/mitchellfawcett/Documents/Data Science/LeftRight/Data/TrainingData/Conservative"))
myCorpus_conservative_validate <- VCorpus(DirSource("/Users/mitchellfawcett/Documents/Data Science/LeftRight/Data/ValidationData/Conservative"))
myCorpus_conservative_test <- VCorpus(DirSource("/Users/mitchellfawcett/Documents/Data Science/LeftRight/Data/TestingData/Conservative"))
## Get the number of documents in the conservative folders
numberConservDocs_train <- nrow(as.matrix(summary(myCorpus_conservative_train)))
numberConservDocs_validate <- nrow(as.matrix(summary(myCorpus_conservative_validate)))
numberConservDocs_test <- nrow(as.matrix(summary(myCorpus_conservative_test)))
## Create a vector of labels for the conservative documents
conserveLabels_train <- replicate(numberConservDocs_train, 'conservative', simplify = "vector")
conserveLabels_validate <- replicate(numberConservDocs_validate, 'conservative', simplify = "vector")
conserveLabels_test <- replicate(numberConservDocs_test, 'conservative', simplify = "vector")
## Load the liberal documents into corpora
myCorpus_liberal_train <- VCorpus(DirSource("/Users/mitchellfawcett/Documents/Data Science/LeftRight/Data/TrainingData/Liberal"))
myCorpus_liberal_validate <- VCorpus(DirSource("/Users/mitchellfawcett/Documents/Data Science/LeftRight/Data/ValidationData/Liberal"))
myCorpus_liberal_test <- VCorpus(DirSource("/Users/mitchellfawcett/Documents/Data Science/LeftRight/Data/TestingData/Liberal"))
## Get the number of documents in the liberal folders
numberLiberalDocs_train <- nrow(as.matrix(summary(myCorpus_liberal_train)))
numberLiberalDocs_validate <- nrow(as.matrix(summary(myCorpus_liberal_validate)))
numberLiberalDocs_test <- nrow(as.matrix(summary(myCorpus_liberal_test)))
## Create a vector of labels for the liberal documents
liberalLabels_train <- replicate(numberLiberalDocs_train, 'liberal', simplify = "vector")
liberalLabels_validate <- replicate(numberLiberalDocs_validate, 'liberal', simplify = "vector")
liberalLabels_test <- replicate(numberLiberalDocs_test, 'liberal', simplify = "vector")
## Combine the corpora. The order is important so we can later identify which rows in the document term matrix belong to which dataset when we build the model and test it.
myCorpus <- c(myCorpus_conservative_train, myCorpus_liberal_train, myCorpus_conservative_validate, myCorpus_liberal_validate, myCorpus_conservative_test, myCorpus_liberal_test)
myCorpus
## <<VCorpus>>
## Metadata: corpus specific: 0, document level (indexed): 0
## Content: documents: 404
## Combine the two label factors. The order is important.
targetLabels <- c(conserveLabels_train, liberalLabels_train, conserveLabels_validate, liberalLabels_validate, conserveLabels_test, liberalLabels_test)
The next steps use various tm functions to clean the corpus for further use. This is where we convert all words to lower case, remove numbers, remove stopwords, remove punctuation, remove extra whitespace, and perform stemming.
## See: http://stackoverflow.com/questions/26834576/big-text-corpus-breaks-tm-map
## for why the next statement is important.
myCorpus <- tm_map(myCorpus,
content_transformer(function(x) iconv(x, to='UTF-8-MAC', sub='byte')),
mc.cores=1)
## Copy & change name of the corpus
myCorpus_clean <- myCorpus
## Convert all words to lowercase
myCorpus_clean <- tm_map(myCorpus_clean, content_transformer(tolower))
## Remove numbers
myCorpus_clean <- tm_map(myCorpus_clean, removeNumbers)
## Remove stop words
myCorpus_clean <- tm_map(myCorpus_clean, removeWords, stopwords())
## myCorpus_clean <- tm_map(myCorpus_clean, removeWords, c("laughter", "applause")) ## optionally remove transcript cues
## Remove punctuation - first create a function to replace punctuation with spaces
replacePunctuation <- function(x) {
gsub("[[:punct:]]+", " ", x)
}
myCorpus_clean <- tm_map(myCorpus_clean, content_transformer(replacePunctuation))
## Perform stemming
myCorpus_clean <- tm_map(myCorpus_clean, stemDocument)
## Remove extra whitespace
myCorpus_clean <- tm_map(myCorpus_clean, stripWhitespace)
## Examine purified contents
print(myCorpus_clean)
## <<VCorpus>>
## Metadata: corpus specific: 0, document level (indexed): 0
## Content: documents: 404
Create a Document Term Matrix (DTM) from the corpus. Each row in the matrix contains the data from a single speech or writing. The columns correspond to the words found in the text.
## Build a Document Term Matrix.
## Rows are documents, columns are words.
dataDTM <- DocumentTermMatrix(myCorpus_clean)
Here is the size of the Document Term matrix.
## [1] 404 18251
There are over 18,000 columns in the matrix, each column representing a different token (word). The next step reduces the matrix to words that occur 25 times or more across all of the training documents.
## minimum times a word must appear in the training data for it to be included in the model
n <- 25
train_freq_words <- findFreqTerms(dataDTM, n) ## words that appear at least n times
## str(train_freq_words)
frequentDTM <- dataDTM[ , train_freq_words]
Here is the size of the Document Term matrix and a sample of the data after limiting it to frequent terms.
## [1] 404 3151
## <<DocumentTermMatrix (documents: 5, terms: 6)>>
## Non-/sparse entries: 4/26
## Sparsity : 87%
## Maximal term length: 7
## Weighting : term frequency (tf)
##
## Terms
## Docs abandon abid abil abl abolish abort
## _president_bush_speeches_speech-3419 0 0 0 0 0 0
## _president_bush_speeches_speech-3420 0 0 1 0 0 0
## _president_bush_speeches_speech-3422 0 0 0 0 0 0
## _president_bush_speeches_speech-3423 0 0 0 1 0 0
## _president_bush_speeches_speech-3424 0 0 1 1 0 0
## The convert_counts() function below converts counts to Yes/No strings:
convert_counts <- function(x) {
x <- ifelse(x > 0, "Yes", "No")
}
yesnoDTM <- apply(frequentDTM, MARGIN = 2,
convert_counts)
## convert the DTM to a matrix
yesnoMatrix <- as.matrix(yesnoDTM)
## convert the matrix to a data frame
yesnoDF <- as.data.frame(yesnoMatrix)
## make the variables into factors. Need to cast back to a data frame after using lapply (list apply).
dataDF <- as.data.frame(lapply(yesnoDF, factor))
This is what the training matrix looks like after converting the term frequencies to yes/no.
## abandon abid abil abl abolish abort
## _president_bush_speeches_speech-3419 No No No No No No
## _president_bush_speeches_speech-3420 No No Yes No No No
## _president_bush_speeches_speech-3422 No No No No No No
## _president_bush_speeches_speech-3423 No No No Yes No No
## _president_bush_speeches_speech-3424 No No Yes Yes No No
Splitting the data back into training, validation and testing partitions depends on knowing the order in which the source documents were added to the corpus when it was built.
The counts of conservative and liberal documents in each split, determined in an earlier code block, identify the row ranges.
## Identify the start and end rows to be used for training, validation and testing
start_train <- 1
end_train <- numberConservDocs_train + numberLiberalDocs_train
start_validate <- end_train + 1
end_validate <- end_train + numberConservDocs_validate + numberLiberalDocs_validate
start_test <- end_validate + 1
end_test <- end_validate + numberConservDocs_test + numberLiberalDocs_test
## Split the data into three partitions
trainDF <- dataDF[start_train:end_train, ]
validateDF <- dataDF[start_validate:end_validate, ]
testDF <- dataDF[start_test:end_test, ]
## Create vectors for the outcome values
trainLabels <- targetLabels[start_train:end_train]
validateLabels <- targetLabels[start_validate:end_validate]
testLabels <- targetLabels[start_test:end_test]
## proportion of conservative to liberal documents in each partition
prop.table(table(trainLabels))
## trainLabels
## conservative liberal
## 0.4132231 0.5867769
prop.table(table(validateLabels))
## validateLabels
## conservative liberal
## 0.4125 0.5875
prop.table(table(testLabels))
## testLabels
## conservative liberal
## 0.4146341 0.5853659
## Be sure the outcomes vector trainLabels is a factor.
nbClassifier <- naiveBayes(as.factor(trainLabels) ~ ., data = trainDF, laplace = 1) ## e1071 package; note the argument name is lower-case "laplace"
## nbClassifier
validate_pred <- predict(nbClassifier, validateDF)
## Using gmodels package
CrossTable(validate_pred, validateLabels,
prop.chisq = FALSE, prop.t = FALSE,
dnn = c('predicted', 'actual'))
##
##
## Cell Contents
## |-------------------------|
## | N |
## | N / Row Total |
## | N / Col Total |
## |-------------------------|
##
##
## Total Observations in Table: 80
##
##
## | actual
## predicted | conservative | liberal | Row Total |
## -------------|--------------|--------------|--------------|
## conservative | 21 | 14 | 35 |
## | 0.600 | 0.400 | 0.438 |
## | 0.636 | 0.298 | |
## -------------|--------------|--------------|--------------|
## liberal | 12 | 33 | 45 |
## | 0.267 | 0.733 | 0.562 |
## | 0.364 | 0.702 | |
## -------------|--------------|--------------|--------------|
## Column Total | 33 | 47 | 80 |
## | 0.412 | 0.588 | |
## -------------|--------------|--------------|--------------|
##
##
## train C5.0 model
C5_model <- C5.0(trainDF, as.factor(trainLabels))
C5_model
##
## Call:
## C5.0.default(x = trainDF, y = as.factor(trainLabels))
##
## Classification Tree
## Number of samples: 242
## Number of predictors: 3151
##
## Tree size: 29
##
## Non-standard options: attempt to group attributes
## summary(C5_model)
C5_pred <- predict(C5_model, validateDF)
## Assess result
CrossTable(validateLabels, C5_pred,
prop.chisq = FALSE, prop.c = FALSE, prop.r = FALSE,
dnn = c('actual default', 'predicted default'))
##
##
## Cell Contents
## |-------------------------|
## | N |
## | N / Table Total |
## |-------------------------|
##
##
## Total Observations in Table: 80
##
##
## | predicted default
## actual default | conservative | liberal | Row Total |
## ---------------|--------------|--------------|--------------|
## conservative | 25 | 8 | 33 |
## | 0.312 | 0.100 | |
## ---------------|--------------|--------------|--------------|
## liberal | 25 | 22 | 47 |
## | 0.312 | 0.275 | |
## ---------------|--------------|--------------|--------------|
## Column Total | 50 | 30 | 80 |
## ---------------|--------------|--------------|--------------|
##
##
C5_model_boost <- C5.0(trainDF, as.factor(trainLabels), trials = 10)
C5_model_boost
##
## Call:
## C5.0.default(x = trainDF, y = as.factor(trainLabels), trials = 10)
##
## Classification Tree
## Number of samples: 242
## Number of predictors: 3151
##
## Number of boosting iterations: 10
## Average tree size: 19.5
##
## Non-standard options: attempt to group attributes
## summary(C5_model_boost)
## Predict with the C5.0 boosted model using the validation data.
C5_pred_boost <- predict(C5_model_boost, validateDF)
CrossTable(validateLabels, C5_pred_boost,
prop.chisq = FALSE, prop.c = FALSE, prop.r = FALSE,
dnn = c('actual default', 'predicted default'))
##
##
## Cell Contents
## |-------------------------|
## | N |
## | N / Table Total |
## |-------------------------|
##
##
## Total Observations in Table: 80
##
##
## | predicted default
## actual default | conservative | liberal | Row Total |
## ---------------|--------------|--------------|--------------|
## conservative | 24 | 9 | 33 |
## | 0.300 | 0.113 | |
## ---------------|--------------|--------------|--------------|
## liberal | 11 | 36 | 47 |
## | 0.138 | 0.450 | |
## ---------------|--------------|--------------|--------------|
## Column Total | 35 | 45 | 80 |
## ---------------|--------------|--------------|--------------|
##
##
The C5.0 boosted model offers the best performance of the three models.
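From the three validation cross tables, overall accuracy works out to (21 + 33) / 80 ≈ 0.68 for Naive Bayes, (25 + 22) / 80 ≈ 0.59 for the single C5.0 tree, and (24 + 36) / 80 = 0.75 for the boosted C5.0 model. The same figures can be computed directly from the prediction vectors above:
## Overall validation accuracy of the three models
c(naive_bayes = mean(validate_pred == validateLabels),
  c5_single   = mean(C5_pred == validateLabels),
  c5_boosted  = mean(C5_pred_boost == validateLabels))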
Assess the C5.0 boosted model using testing data. The model is also saved for future use.
## Save the C5 model for later use.
save(C5_model_boost, file = "/Users/mitchellfawcett/Documents/Data Science/LeftRight/Models/C5_model_boost")
C5_pred_boost_test <- predict(C5_model_boost, testDF)
## Assess boost
CrossTable(testLabels, C5_pred_boost_test,
prop.chisq = FALSE, prop.c = FALSE, prop.r = FALSE,
dnn = c('actual default', 'predicted default'))
##
##
## Cell Contents
## |-------------------------|
## | N |
## | N / Table Total |
## |-------------------------|
##
##
## Total Observations in Table: 82
##
##
## | predicted default
## actual default | conservative | liberal | Row Total |
## ---------------|--------------|--------------|--------------|
## conservative | 24 | 10 | 34 |
## | 0.293 | 0.122 | |
## ---------------|--------------|--------------|--------------|
## liberal | 8 | 40 | 48 |
## | 0.098 | 0.488 | |
## ---------------|--------------|--------------|--------------|
## Column Total | 32 | 50 | 82 |
## ---------------|--------------|--------------|--------------|
##
##
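That table corresponds to an overall test-set accuracy of (24 + 40) / 82 ≈ 0.78, which can be confirmed directly:
## Overall accuracy of the boosted model on the held-out test data
mean(C5_pred_boost_test == testLabels)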
Set up a function that runs the C5.0 model against a selected speech and reports whether the speech is conservative or liberal.
The function creates a corpus from the target speech, cleans it with the same methods used for the training data (lower case, remove numbers, etc.), builds a Document Term Matrix, and then runs the prediction.
require(C50)
set.seed(1234)
makePrediction <- function(x) {
## The target speech is located here:
targetSpeechPath <- x
## targetSpeechPath <- "/Users/mitchellfawcett/Documents/Data Science/LeftRight/TargetData"
## load the saved model "C5_model_boost"
load("/Users/mitchellfawcett/Documents/Data Science/LeftRight/Models/C5_model_boost")
## load corpus
targetCorpus <- VCorpus(DirSource(targetSpeechPath))
## inspect(targetCorpus)
## Prepare the target corpus
## See: http://stackoverflow.com/questions/26834576/big-text-corpus-breaks-tm-map
## for why the next statement is important.
targetCorpus <- tm_map(targetCorpus,
content_transformer(function(x) iconv(x, to='UTF-8-MAC', sub='byte')),
mc.cores=1)
## Copy & change name of the corpus
targetCorpus_clean <- targetCorpus
## Convert all words to lowercase
targetCorpus_clean <- tm_map(targetCorpus_clean, content_transformer(tolower))
## Remove numbers
targetCorpus_clean <- tm_map(targetCorpus_clean, removeNumbers)
## Remove stop words
targetCorpus_clean <- tm_map(targetCorpus_clean, removeWords, stopwords())
## Remove punctuation - first create a function to replace punctuation with spaces
replacePunctuation <- function(x) {
gsub("[[:punct:]]+", " ", x)
}
targetCorpus_clean <- tm_map(targetCorpus_clean, content_transformer(replacePunctuation))
## Perform stemming
targetCorpus_clean <- tm_map(targetCorpus_clean, stemDocument)
## Remove extra whitespace
targetCorpus_clean <- tm_map(targetCorpus_clean, stripWhitespace)
## Examine purified contents
## print(targetCorpus_clean)
## inspect(targetCorpus_clean)
## Build a Document Term Matrix.
## Rows are documents, columns are words.
targetdataDTM <- DocumentTermMatrix(targetCorpus_clean)
## dim(targetdataDTM)
## convert frequency to yes/no
## The convert_counts() function below converts counts to Yes/No strings:
convert_counts <- function(x) {
x <- ifelse(x > 0, "Yes", "No")
}
targetyesnoDTM <- apply(targetdataDTM, MARGIN = 2,
convert_counts)
## Convert to a matrix and transpose so the columns are words. With only one target
## document, apply() returns a plain vector rather than a one-row matrix, so as.matrix()
## yields a single column that must be transposed. (The multi-document training data didn't need this.)
targetyesnoMatrix <- t(as.matrix(targetyesnoDTM))
## convert the matrix to a data frame
targetyesnoDF <- as.data.frame(targetyesnoMatrix)
## make the variables into factors. Need to cast back to a data frame after using lapply (list apply).
targetdataDF <- as.data.frame(lapply(targetyesnoDF, factor))
###### Map the target speech words to the words used in the model so words in the target that are not in the model are excluded.
###### We need a data frame for the target speech that matches the columns in the training data.
## Get the list of words in the training data used to build the model
trainWords <-colnames(trainDF)
## Get the list of words in the target data
targetWords <- colnames(targetdataDF)
## Get a list of the words in the target data that are in the training data
intersectWords <- intersect(trainWords, targetWords)
## Get a list of words in the training data that are not found in the target data
diffWords <- setdiff(trainWords, targetWords)
## Make a data frame of Yeses for the words in target speech that are part of the training words
intersectDF <- data.frame(matrix(ncol = length(intersectWords), nrow = 1 ))
colnames(intersectDF) <- intersectWords
intersectDF[1, ] <- 'Yes'
## Make a data frame of Nos for the words in training data that are not part of the target speech
diffDF <- data.frame(matrix(ncol = length(diffWords), nrow = 1 ))
colnames(diffDF) <- diffWords
diffDF[1, ] <- 'No'
## Create the data frame that will be used for the prediction
targetDF <- cbind(intersectDF, diffDF)
## Sort its columns alphabetically
targetDF <- targetDF[, order(names(targetDF))]
## Make the prediction
C5_pred_boost_target <- predict(C5_model_boost, targetDF, type = 'prob')
## examine model
## summary(C5_model_boost)
## importance <- C5imp(C5_model_boost, metric = 'splits')
## display result of prediction
C5_pred_boost_target
}
Use the C5.0 boosted model to evaluate a target speech and determine if it is conservative or liberal.
The following three predictions were based on presidential nomination acceptance speeches.
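The calls that produced the results below are not echoed in the report; they would look roughly like this, assuming each acceptance speech is stored as a single plain-text file in its own directory (the directory names here are hypothetical):
## Hypothetical target directories, each assumed to hold one plain-text speech
makePrediction("/Users/mitchellfawcett/Documents/Data Science/LeftRight/TargetData/Trump")
makePrediction("/Users/mitchellfawcett/Documents/Data Science/LeftRight/TargetData/Clinton")
makePrediction("/Users/mitchellfawcett/Documents/Data Science/LeftRight/TargetData/Humphrey")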
Here is Donald Trump’s prediction:
## conservative liberal
## 1 0.784227 0.215773
Here is Hillary Clinton’s prediction:
## conservative liberal
## 1 0.3657169 0.6342831
Here is Hubert Humphrey’s prediction:
## conservative liberal
## 1 0.892277 0.107723
To return to the top click here.
Actually Trump has changed parties more than once. See http://www.washingtontimes.com/news/2015/jun/16/donald-trump-changed-political-parties-at-least-fi/↩
Conservative: Confucius, Cato the Elder, Edmund Burke, Goethe, Alexander Hamilton, Irving Babbitt, Eric Hoffer, Russell Kirk, Barry Goldwater, William F. Buckley Jr.
Liberal: Lao Tsu, Charles de Montesquieu, John Stuart Mill, Frederick Douglass, Mary Wollstonecraft, Harriet Martineau, Aristotle, Desiderius Erasmus, Thomas Paine, John Maynard Keynes↩