The goal of the ‘prediction assignment’ is to predict the manner in which (six) participants performed a weight lifting exercise. For this, I downloaded a training dataset and a test dataset. I built a model, used cross-validation, and calculated an expected out-of-sample error. In this write-up, I also describe how I built the model and why I made the choices that I did. Finally, I used the model to predict the (20) test cases.
Note: the document contains between approximately 1,435 words (counted from the Word version using OpenOffice) and 1,539 words (counted from the HTML version by an online tool).
First, I downloaded the test data and the training data. The R code is available in the repository (in the compiled version, some code is suppressed with echo=F).
I then read the data with read.csv. The raw training data will be in a data-frame variable called training and the raw testing data in testing. In the numeric variables, the value #DIV/0! was present; this value was treated as NA (otherwise the numeric variables could not be treated as numeric).
training <- read.csv('pml-training.csv', na.strings=c("NA", "#DIV/0!"))
testing <- read.csv('pml-testing.csv', na.strings=c("NA", "#DIV/0!"))
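As a quick sanity check (my addition; the output was not part of the original report), the dimensions of both raw data frames can be inspected before any cleaning:
# Dimensions of the raw datasets (exact values depend on the downloaded files)
dim(training)
dim(testing)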
I removed all timestamp variables: the testing data consists of (20) single rows on which to base the 20 predictions, so each prediction must be based on a single measurement ‘moment’, and therefore I did not treat the data as time series data.
library(dplyr)
training2 <- select(training, -contains('timestamp')) # Remove timestamp variables
new_window variable
The new_window variable seems to have a special meaning: in the testing dataset, all rows have new_window equal to 'no'. We therefore only use rows with new_window == 'no' to build the prediction model for the new data (in the testing set), and once we subset to those cases we can drop the variable.
training2 <- training2[training2$new_window == 'no',]
training2 <- select(training2, -contains('new_window'))
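To verify the claim about new_window (a check I would add; it was not shown in the original), one can tabulate the variable in both raw datasets:
# Check the distribution of new_window in the raw data;
# the testing set is expected to contain only 'no'
table(training$new_window)
table(testing$new_window)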
The variables X and user_name do not contain any actual measurement of how a participant performed an exercise, but may (help to) identify measurements, which may lead to overfitting if included. So I decided to drop those variables.
training2$X <- NULL
training2$user_name <- NULL
Variables where all rows have only NA left as a value (so the number of NA’s in that column is equal to the number of cases) have nothing left to contribute to the model, so we can remove those.
columns_with_NA_only <- apply(training2, 2,
                              function (c) sum(is.na(c)) == nrow(training2))
training2 <- training2[, !columns_with_NA_only]
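An optional check (my addition) shows how many all-NA columns were dropped and what the cleaned training set now looks like:
# Number of columns that contained only NA, and dimensions after cleaning
sum(columns_with_NA_only)
dim(training2)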
Because of noise in the sensor data, the authors of the paper referred to in the assignment chose Random Forest. Our situation is no different, so I also chose Random Forest, which is known to have good prediction accuracy (compared to, e.g., a single-tree approach). When I made the decision to start with random forest, I thought I could always try another approach if needed. However, we will see that RF worked very well and a revised strategy was unnecessary.
I decided to split the data into 75% to train the model and 25% to estimate the out-of-sample error. I decided to use 75% training data (instead of a lower value such as 60%) because:
- caret will already use resampling internally (therefore I think the risk of overfitting the model is already mitigated);
- 25% of the data still leaves enough cases for the testing set (that I have set aside).
I will use set.seed to make my work reproducible.
library(caret)
set.seed(13234)
inTrainingPart <- createDataPartition(training2$classe, p=0.75, list=F)
trainingPart <- training2[inTrainingPart,]
testingPart <- training2[-inTrainingPart,]
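A quick check (not in the original output) that the partition behaves as intended:
# Roughly 75% of the cleaned rows should end up in the training part
nrow(trainingPart) / nrow(training2)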
Package caret provides a training function that can be used with Random Forest and also provides sensible defaults. The package also supports multicore/parallel processing (through a registered backend such as doParallel); because of the size of the dataset and the algorithm and defaults used, I will turn it on.
library(doParallel)
cluster <- makeCluster(3) # Adapt as needed, depending on hardware!
registerDoParallel(cluster)
model <- train(classe ~ ., data=trainingPart, method='rf', verbose=T)
stopCluster(cluster) # Release the parallel workers once training has finished
caret will try several models and select the best model automatically. Now that we have such a (best) model (which caret selected for us), we will get some accuracy/error indications to show how well (or how poorly) the model did:
# Get accuracy of 'bestTune' model:
accuracy.of.best.model <- model$results$Accuracy[as.integer(row.names(model$bestTune))]
# Get OOB error rate of the final model:
# (note: by looking it up for the number of trees used, which is in 'ntree')
errorRate.OOB.Reported <- model$finalModel$err.rate[model$finalModel$ntree,1]
Results:
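The two quantities computed above can be displayed by evaluating them (their exact values depend on the random seed and the resampling, so they are not restated here):
accuracy.of.best.model
errorRate.OOB.Reported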
Based on the ‘high’ indication for accuracy and the ‘low’ indication for the OOB error rate, I decided to use this model for the next step: calculating an expected out-of-sample error rate using the 25% of the data we set aside (see below).
Note: According to the lectures, care should be taken to avoid overfitting with Random Forest, so cross-validation is important.
At the end of the video lecture on ‘Random Forest’ (week 3), though, Jeff Leek specifically states that the train function in caret will handle that for you. Breiman, a developer of Random Forest, explains in the (online) section “The out-of-bag (oob) error estimate” how the algorithm leaves a subset of cases (about one-third) out of each bootstrap sample, and how this yields an unbiased error estimate (proven in practice).
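For completeness, here is a minimal sketch (not the model fitted above; ctrl and model_cv are illustrative names) of how explicit k-fold cross-validation could be requested through trainControl instead of caret's default bootstrap resampling:
# Hypothetical alternative: explicit 5-fold cross-validation
ctrl <- trainControl(method='cv', number=5)
model_cv <- train(classe ~ ., data=trainingPart, method='rf', trControl=ctrl)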
We will nevertheless also explicitly validate the model's accuracy with the 25% of the ‘training data’ that we set aside and left untouched.
library(randomForest)
classesReference <- testingPart$classe
testingPart$classe <- NULL
classesPredicted <- predict(model$finalModel, newdata=testingPart)
confMatrix <- confusionMatrix(data=classesPredicted, reference=classesReference)
OutOfSampleErrorRate <- 1 - confMatrix$overall['Accuracy']; names(OutOfSampleErrorRate) = 'ErrorRate'
confMatrix
## Confusion Matrix and Statistics
##
## Reference
## Prediction A B C D E
## A 1367 7 0 0 0
## B 0 919 0 0 0
## C 0 2 838 3 0
## D 0 1 0 783 1
## E 0 0 0 0 881
##
## Overall Statistics
##
## Accuracy : 0.9971
## 95% CI : (0.9951, 0.9984)
## No Information Rate : 0.2847
## P-Value [Acc > NIR] : < 2.2e-16
##
## Kappa : 0.9963
## Mcnemar's Test P-Value : NA
##
## Statistics by Class:
##
## Class: A Class: B Class: C Class: D Class: E
## Sensitivity 1.0000 0.9892 1.0000 0.9962 0.9989
## Specificity 0.9980 1.0000 0.9987 0.9995 1.0000
## Pos Pred Value 0.9949 1.0000 0.9941 0.9975 1.0000
## Neg Pred Value 1.0000 0.9974 1.0000 0.9993 0.9997
## Prevalence 0.2847 0.1935 0.1745 0.1637 0.1837
## Detection Rate 0.2847 0.1914 0.1745 0.1631 0.1835
## Detection Prevalence 0.2861 0.1914 0.1756 0.1635 0.1835
## Balanced Accuracy 0.9990 0.9946 0.9994 0.9978 0.9994
The expected out-of-sample error is calculated as 0.0029.
The Kappa statistic (based on the ‘hold out’ dataset) is 0.996.
First, we will apply the same preprocessing to the testing dataset, which in this case consists mainly of removing columns (given the dataset and the use of Random Forest, no scaling or normalizing was needed). Then we will apply predict() to the model and the new data. The result will be used for the second part of the assignment: the submission of the predictions on the 20 test cases.
testing2 <- select(testing, -contains('timestamp'))
# We already checked that dataset 'testing' only contains rows with new_window == 'no';
# we remove the column here as well (as we did for the training data)
testing2 <- select(testing2, -contains('new_window'))
testing2$X <- NULL
testing2$user_name <- NULL
testing2 <- testing2[,!columns_with_NA_only]
# make sure the prediction order matches the problem IDs
testing2 <- arrange(testing2, problem_id)
predictions_on_pml_testing <- predict(model$finalModel, newdata=testing2)
predictions_on_pml_testing
## 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20
## B A B A A E D B A A B C B A E E A B B B
## Levels: A B C D E
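As an optional cross-check (not part of the original analysis), predicting through the caret wrapper should give the same labels here, since train() was called without any caret-level preprocessing; predictions_check is an illustrative name:
# The caret wrapper and the underlying randomForest object should agree on these 20 cases
predictions_check <- predict(model, newdata=testing2)
identical(as.character(predictions_check), as.character(predictions_on_pml_testing))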
To write out the final results to (individual) answer-files, I used the script suggested by the course assignment:
pml_write_files = function(x){
    n = length(x)
    for(i in 1:n){
        filename = paste0("problem_id_", i, ".txt")
        write.table(x[i], file=filename, quote=FALSE, row.names=FALSE, col.names=FALSE)
    }
}
pml_write_files(predictions_on_pml_testing)
The creation of the answer files finishes this writeup (and provides the files necessary for the ‘submission’ part of the assignment).