The goal of the ‘prediction assignment’ is to predict the manner in which (six) participants performed a weight lifting exercise. For this, I downloaded a training dataset and a test dataset. I built a model, used cross-validation, and calculated an expected out-of-sample error. In this write-up, I also describe how I built the model and why I made the choices that I did. Finally, I used the model to predict the (20) test cases.
Note: the document contains between approximately 1,435 words (counted from the Word version using OpenOffice) and 1,539 words (counted from the HTML version by an online tool).
First, I downloaded the test data and the training data. The R code is available in the repository (in the compiled version, some code is suppressed with echo=F).
I then read the data with read.csv. The raw training data will be in a data-frame variable called training and the raw testing data in testing. In the numeric variables, the value #DIV/0! was present; this value was treated as NA (otherwise the numeric variables could not be treated as numeric).
training <- read.csv('pml-training.csv', na.strings=c("NA", "#DIV/0!"))
testing <- read.csv('pml-testing.csv', na.strings=c("NA", "#DIV/0!"))
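As a quick sanity check (my addition; the output was not part of the original report), the dimensions of both raw data frames can be inspected before any cleaning:
# Dimensions of the raw datasets (exact values depend on the downloaded files)
dim(training)
dim(testing)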
I removed all timestamp variables: the testing data consists of (20) single rows on which to base the 20 predictions, so each prediction must be based on a single measurement ‘moment’, and therefore I did not treat the data as time series data.
library(dplyr)
training2 <- select(training, -contains('timestamp')) # Remove timestamp variables
new_window variable
The new_window variable seems to have a special meaning: in the testing dataset, all rows have new_window equal to 'no'. We therefore only use rows with new_window == 'no' to build the prediction model for the new data (in the testing set), and once we subset to those cases we can drop the variable.
training2 <- training2[training2$new_window == 'no',]
training2 <- select(training2, -contains('new_window'))
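To verify the claim about new_window (a check I would add; it was not shown in the original), one can tabulate the variable in both raw datasets:
# Check the distribution of new_window in the raw data;
# the testing set is expected to contain only 'no'
table(training$new_window)
table(testing$new_window)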
The variables X and user_name do not contain any actual measurement of how a participant performed an exercise, but may (help to) identify measurements, which may lead to overfitting if included. So I decided to drop those variables.
training2$X <- NULL
training2$user_name <- NULL
Variables where all rows have only NA left as a value (so the number of NA’s in that column is equal to the number of cases) have nothing left to contribute to the model, so we can remove those.
columns_with_NA_only <- apply(training2, 2,
                              function (c) sum(is.na(c)) == nrow(training2))
training2 <- training2[, !columns_with_NA_only]
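An optional check (my addition) shows how many all-NA columns were dropped and what the cleaned training set now looks like:
# Number of columns that contained only NA, and dimensions after cleaning
sum(columns_with_NA_only)
dim(training2)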
Because of noise in the sensor data, the authors of the paper referred to in the assignment chose Random Forest. Our situation is no different, so I also chose Random Forest, which is known to have good prediction accuracy (compared to, e.g., a single-tree approach). When I made the decision to start with random forest, I thought I could always try another approach if needed. However, we will see that RF worked very well and a revised strategy was unnecessary.
I decided to split the data into 75% to train the model and 25% to estimate the out-of-sample error. I decided to use 75% training data (instead of a lower value such as 60%) because:
- caret will already use resampling internally (therefore I think the risk of overfitting the model is already mitigated);
- 25% of the data still leaves enough cases for the testing set (that I have set aside).
I will use set.seed to make my work reproducible.
library(caret)
set.seed(13234)
inTrainingPart <- createDataPartition(training2$classe, p=0.75, list=F)
trainingPart <- training2[inTrainingPart,]
testingPart <- training2[-inTrainingPart,]
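A quick check (not in the original output) that the partition behaves as intended:
# Roughly 75% of the cleaned rows should end up in the training part
nrow(trainingPart) / nrow(training2)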
Package caret provides a training function that can be used with Random Forest and also provides sensible defaults. The package also supports multicore/parallel processing (through a registered backend such as doParallel); because of the size of the dataset and the algorithm and defaults used, I will turn it on.
library(doParallel)
cluster <- makeCluster(3) # Adapt as needed, depending on hardware!
registerDoParallel(cluster)
model <- train(classe ~ ., data=trainingPart, method='rf', verbose=T)
stopCluster(cluster) # Release the parallel workers once training has finished
caret will try several models and select the best model automatically. Now that we have such a (best) model (which caret selected for us), we will get some accuracy/error indications to show how well (or how poorly) the model did:
# Get accuracy of 'bestTune' model:
accuracy.of.best.model <- model$results$Accuracy[as.integer(row.names(model$bestTune))]
# Get OOB error rate of the final model:
# (note: by looking it up for the number of trees used, which is in 'ntree')
errorRate.OOB.Reported <- model$finalModel$err.rate[model$finalModel$ntree,1]
Results:
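The two quantities computed above can be displayed by evaluating them (their exact values depend on the random seed and the resampling, so they are not restated here):
accuracy.of.best.model
errorRate.OOB.Reported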
Based on the ‘high’ indication for accuracy and the ‘low’ indication for the OOB error rate, I decided to use this model for the next step: calculating an expected out-of-sample error rate using the 25% of the data we set aside (see below).
Note: According to the lectures, care should be taken to avoid overfitting with Random Forest, so cross-validation is important.
At the end of the video lecture on ‘Random Forest’ (week 3), though, Jeff Leek specifically states that the train function in caret will handle that for you. Breiman, a developer of Random Forest, explains in the (online) section “The out-of-bag (oob) error estimate” how the algorithm leaves a subset of cases (about one-third) out of each bootstrap sample, and how this yields an unbiased error estimate (proven in practice).
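For completeness, here is a minimal sketch (not the model fitted above; ctrl and model_cv are illustrative names) of how explicit k-fold cross-validation could be requested through trainControl instead of caret's default bootstrap resampling:
# Hypothetical alternative: explicit 5-fold cross-validation
ctrl <- trainControl(method='cv', number=5)
model_cv <- train(classe ~ ., data=trainingPart, method='rf', trControl=ctrl)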
We will nevertheless also explicitly validate the model's accuracy with the 25% of the ‘training data’ that we set aside and left untouched.
library(randomForest)
classesReference <- testingPart$classe
testingPart$classe <- NULL
classesPredicted <- predict(model$finalModel, newdata=testingPart)
confMatrix <- confusionMatrix(data=classesPredicted, reference=classesReference)
OutOfSampleErrorRate <- 1 - confMatrix$overall['Accuracy']; names(OutOfSampleErrorRate) = 'ErrorRate'
confMatrix
## Confusion Matrix and Statistics
##
## Reference
## Prediction A B C D E
## A 1367 7 0 0 0
## B 0 919 0 0 0
## C 0 2 838 3 0
## D 0 1 0 783 1
## E 0 0 0 0 881
##
## Overall Statistics
##
## Accuracy : 0.9971
## 95% CI : (0.9951, 0.9984)
## No Information Rate : 0.2847
## P-Value [Acc > NIR] : < 2.2e-16
##
## Kappa : 0.9963
## Mcnemar's Test P-Value : NA
##
## Statistics by Class:
##
## Class: A Class: B Class: C Class: D Class: E
## Sensitivity 1.0000 0.9892 1.0000 0.9962 0.9989
## Specificity 0.9980 1.0000 0.9987 0.9995 1.0000
## Pos Pred Value 0.9949 1.0000 0.9941 0.9975 1.0000
## Neg Pred Value 1.0000 0.9974 1.0000 0.9993 0.9997
## Prevalence 0.2847 0.1935 0.1745 0.1637 0.1837
## Detection Rate 0.2847 0.1914 0.1745 0.1631 0.1835
## Detection Prevalence 0.2861 0.1914 0.1756 0.1635 0.1835
## Balanced Accuracy 0.9990 0.9946 0.9994 0.9978 0.9994
The expected out-of-sample error is calculated as 0.0029.
The Kappa statistic (based on the ‘hold out’ dataset) is 0.996.
First, we will apply the same preprocessing to the testing dataset, which in this case consists mainly of removing columns (given the dataset and the use of Random Forest, no scaling or normalizing was needed). Then we will apply predict() to the model and the new data. The result will be used for the second part of the assignment: the submission of the predictions on the 20 test cases.
testing2 <- select(testing, -contains('timestamp'))
# We already checked that dataset 'testing' only contains rows with new_window == 'no';
# we remove the column here as well (as we did for the training data)
testing2 <- select(testing2, -contains('new_window'))
testing2$X <- NULL
testing2$user_name <- NULL
testing2 <- testing2[,!columns_with_NA_only]
# make sure the prediction order matches the problem IDs
testing2 <- arrange(testing2, problem_id)
predictions_on_pml_testing <- predict(model$finalModel, newdata=testing2)
predictions_on_pml_testing
## 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20
## B A B A A E D B A A B C B A E E A B B B
## Levels: A B C D E
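As an optional cross-check (not part of the original analysis), predicting through the caret wrapper should give the same labels here, since train() was called without any caret-level preprocessing; predictions_check is an illustrative name:
# The caret wrapper and the underlying randomForest object should agree on these 20 cases
predictions_check <- predict(model, newdata=testing2)
identical(as.character(predictions_check), as.character(predictions_on_pml_testing))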
To write out the final results to (individual) answer-files, I used the script suggested by the course assignment:
pml_write_files = function(x){
    n = length(x)
    for(i in 1:n){
        filename = paste0("problem_id_", i, ".txt")
        write.table(x[i], file=filename, quote=FALSE, row.names=FALSE, col.names=FALSE)
    }
}
pml_write_files(predictions_on_pml_testing)
The creation of the answer files finishes this writeup (and provides the files necessary for the ‘submission’ part of the assignment).