Synopsis

We use a Random Forest model to predict the type of activity performed by users of personal tracking devices.

Load the libraries

library(ggplot2)
library(caret)

Load the data and clean it

training = read.csv("./pml-training.csv",na.strings=c("NA",""))
testing = read.csv("./pml-testing.csv",na.strings=c("NA",""))

We remove the identification columns from the data so the model cannot key on artifacts such as the user name or the row index. The figure below shows how strongly the row index X is tied to classe, which would make it a spurious but highly predictive feature.

qplot(x = X, y = classe, data = training)

training = subset(training, select = -c(X,user_name,raw_timestamp_part_1,raw_timestamp_part_2,cvtd_timestamp,new_window))
testing = subset(testing, select = -c(X,user_name,raw_timestamp_part_1,raw_timestamp_part_2,cvtd_timestamp,new_window))

Also, let’s get rid of the variables that are mostly NAs, since they are unlikely to be useful predictors. If a variable has more than 50% NAs, we discard it.

n = nrow(training)  # count NAs relative to the number of training rows
ratio = 0.5
keep = colSums(is.na(training))<ratio*n
training = training[, keep]
testing  = testing [, keep]
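As a minimal sketch of how this filter behaves, here is the same logic applied to a tiny hypothetical data frame (toy values, not the PML data): a column that is more than 50% NA is dropped, while the rest are kept.

```r
# Toy illustration of the NA filter: drop columns with > 50% missing values.
df <- data.frame(good   = 1:10,
                 mostNA = c(1, 2, rep(NA, 8)))  # 8 of 10 values missing
n     <- nrow(df)
ratio <- 0.5
keep  <- colSums(is.na(df)) < ratio * n  # logical vector over columns
names(df)[keep]  # "good"
```

The same logical vector `keep` is applied to both the training and testing sets above, so the two retain identical columns.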

Model design

The model we choose to work with is a Random Forest, a popular method that performs well across many applications. Random Forests provide a built-in estimate of generalization error (the out-of-bag, or OOB, error), so a separate validation set is not strictly necessary. For this study, we additionally use 5-fold cross-validation during model building.

k = 5
set.seed(42)
model = train(classe ~ ., method = "rf", data = training,
              trControl = trainControl(method = "cv", number = k),
              allowParallel = TRUE, prox = TRUE)
print(model$finalModel)
## 
## Call:
##  randomForest(x = x, y = y, mtry = param$mtry, proximity = TRUE,      allowParallel = TRUE) 
##                Type of random forest: classification
##                      Number of trees: 500
## No. of variables tried at each split: 27
## 
##         OOB estimate of  error rate: 0.14%
## Confusion matrix:
##      A    B    C    D    E  class.error
## A 5578    1    0    0    1 0.0003584229
## B    5 3789    2    1    0 0.0021069265
## C    0    4 3418    0    0 0.0011689071
## D    0    0   10 3205    1 0.0034203980
## E    0    0    0    2 3605 0.0005544774

This model performs quite well: the expected out-of-sample error, defined as 1 minus the model’s accuracy on held-out data, is estimated to be very small (the OOB estimate of the error rate listed above, 0.14%).
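As a quick sanity check, the OOB error rate can be recomputed directly from the confusion matrix printed above: it is the fraction of off-diagonal (misclassified) observations.

```r
# Re-enter the confusion matrix from the model output above (rows = truth).
cm <- matrix(c(5578,    1,    0,    0,    1,
                  5, 3789,    2,    1,    0,
                  0,    4, 3418,    0,    0,
                  0,    0,   10, 3205,    1,
                  0,    0,    0,    2, 3605),
             nrow = 5, byrow = TRUE,
             dimnames = list(LETTERS[1:5], LETTERS[1:5]))
oob <- 1 - sum(diag(cm)) / sum(cm)  # 1 - accuracy
round(100 * oob, 2)  # 0.14
```

This reproduces the 0.14% figure reported by randomForest.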

Predictions

Now we apply our model to the test data and get the predictions. We output these to a text file so we can later submit them to the quiz.

predictions = predict(model, newdata = testing)
predictions = data.frame(predictions)
print(predictions)
##    predictions
## 1            B
## 2            A
## 3            B
## 4            A
## 5            A
## 6            E
## 7            D
## 8            B
## 9            A
## 10           A
## 11           B
## 12           C
## 13           B
## 14           A
## 15           E
## 16           E
## 17           A
## 18           B
## 19           B
## 20           B
write.csv(predictions,file="answers.csv",quote=FALSE)

Conclusions

Using a Random Forest model to predict the type of activity, we were able to correctly predict all 20 activities in the test data.

Appendix

The version history of this document can be found at the GitHub repository page. Here is the full code used in this document.

## ------------------------------------------------------------------------
library(ggplot2)
library(caret)

## ------------------------------------------------------------------------
training = read.csv("./pml-training.csv",na.strings=c("NA",""))
testing = read.csv("./pml-testing.csv",na.strings=c("NA",""))

## ------------------------------------------------------------------------
qplot(x = X, y = classe, data = training)
training = subset(training, select = -c(X,user_name,raw_timestamp_part_1,raw_timestamp_part_2,cvtd_timestamp,new_window))
testing = subset(testing, select = -c(X,user_name,raw_timestamp_part_1,raw_timestamp_part_2,cvtd_timestamp,new_window))

## ------------------------------------------------------------------------
n = nrow(training)  # count NAs relative to the number of training rows
ratio = 0.5
keep = colSums(is.na(training))<ratio*n
training = training[, keep]
testing  = testing [, keep]

## ------------------------------------------------------------------------
k = 5
set.seed(42)
model = train(classe ~ ., method = "rf", data = training,
              trControl = trainControl(method = "cv", number = k),
              allowParallel = TRUE, prox = TRUE)
print(model$finalModel)

## ------------------------------------------------------------------------
predictions = predict(model, newdata = testing)
predictions = data.frame(predictions)
print(predictions)
write.csv(predictions,file="answers.csv",quote=FALSE)
