The goal of this project was to create a predictive model for classifying human motion data. The data come from a Human Activity Recognition dataset: sensor captures of subjects doing a simple dumbbell curl weightlifting exercise. Each capture is labeled with one of 5 classes describing how the exercise was performed (one correct execution and four common mistakes). For each capture, sensors record measurements from the dumbbell, arm, waist, etc. The dataset’s authors used the time series nature of the data in their analysis to classify the motions. Our evaluation data, however, consists of single instants in time. Since we do not have the entire time series in the evaluation data, the aim is to build a predictive model that classifies the data from a single instant.
The objective is to classify the data for instants in time, even though the underlying structure of the data is a time series. I didn’t notice much variation of the data with the window time, so I removed the window time variable; ultimately I did not use any of the time or window information. I also removed the columns that contain NAs, e.g. all the summary statistics. I then tested several different modeling procedures with the caret package. I separated my code into functions in a few different files to ease reproduction. I did some exploratory analysis, plotting data vs. window time, pair plots of the data, and density plots. I didn’t discern any obvious patterns or correlations; the data seemed pretty noisy. So in my initial modeling I used all the non-sparse raw data columns, transformed by a simple centering and scaling. As I began training models, I also saved them so that I could avoid retraining them later.
What’s in the Data?

The data for the project includes a larger training set and a smaller evaluation data set.
To prepare the data, I removed the columns that I didn’t want to train on: initially the columns of summary statistics, and later also the first 7 columns with user_name, time and windowing information. The loading, basic processing and bifurcating routines are in the file process.R.
source('process.R');
## Loading required package: lattice
## Loading required package: ggplot2
load_data() # Loads data, creates pml global var
process(pml) # Cuts the data, creates pmldata var
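The column filtering described above can be sketched roughly as follows. This is a minimal illustration and not the actual contents of process.R: the helper name drop_unused_columns and the toy data frame are hypothetical, and it assumes missing values were already read in as NA.

```r
# Hypothetical helper; the real process() in process.R may differ.
drop_unused_columns <- function(df, n_meta = 7) {
  # Drop the leading metadata columns (row id, user name, timestamps, window info).
  df <- df[, seq_len(ncol(df)) > n_meta, drop = FALSE]
  # Keep only the columns that contain no missing values.
  complete <- colSums(is.na(df)) == 0
  df[, complete, drop = FALSE]
}

# Toy data: 7 metadata columns, one raw sensor column,
# one sparse summary-statistic column, and the class label.
toy <- data.frame(
  X = 1:3, user_name = c("a", "b", "c"), ts1 = 1:3, ts2 = 1:3,
  cvtd = c("x", "y", "z"), new_window = c("no", "no", "yes"), num_window = 1:3,
  roll_belt = c(1.1, 1.2, 1.3),
  max_roll_belt = c(NA, NA, 2.0),
  classe = factor(c("A", "B", "A"))
)
cleaned <- drop_unused_columns(toy)
print(names(cleaned))
```

On the toy data only the raw sensor column and the class label survive, which mirrors the 53-column pmldata used in the rest of the analysis.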
Let’s take a quick look at some of the pmldata variables.
dim(pmldata); summary(pmldata[,c(1,2,3,4,5,6,7,8,53)])
## [1] 19622 53
## roll_belt pitch_belt yaw_belt total_accel_belt
## Min. :-28.90 Min. :-55.8000 Min. :-180.00 Min. : 0.00
## 1st Qu.: 1.10 1st Qu.: 1.7600 1st Qu.: -88.30 1st Qu.: 3.00
## Median :113.00 Median : 5.2800 Median : -13.00 Median :17.00
## Mean : 64.41 Mean : 0.3053 Mean : -11.21 Mean :11.31
## 3rd Qu.:123.00 3rd Qu.: 14.9000 3rd Qu.: 12.90 3rd Qu.:18.00
## Max. :162.00 Max. : 60.3000 Max. : 179.00 Max. :29.00
## gyros_belt_x gyros_belt_y gyros_belt_z
## Min. :-1.040000 Min. :-0.64000 Min. :-1.4600
## 1st Qu.:-0.030000 1st Qu.: 0.00000 1st Qu.:-0.2000
## Median : 0.030000 Median : 0.02000 Median :-0.1000
## Mean :-0.005592 Mean : 0.03959 Mean :-0.1305
## 3rd Qu.: 0.110000 3rd Qu.: 0.11000 3rd Qu.:-0.0200
## Max. : 2.220000 Max. : 0.64000 Max. : 1.6200
## accel_belt_x classe
## Min. :-120.000 A:5580
## 1st Qu.: -21.000 B:3797
## Median : -15.000 C:3422
## Mean : -5.595 D:3216
## 3rd Qu.: -5.000 E:3607
## Max. : 85.000
I found that looking at the covariance matrix gives a quick and dirty view of the variables in the data set and a feel for the correlations. (Because the data is scaled first, this covariance matrix is the same as the correlation matrix.) The plot indicates that there is some, but probably not a lot of, highly correlated data:
library(lattice)
cv <- cov(scale(pmldata[,-53])) # scale the data before getting the covariance
levelplot(cv, aspect="iso", scales=list(x=list(rot=90)))
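To complement the plot, the strongly correlated pairs can also be listed numerically. The sketch below uses a small synthetic data frame (the columns a, b, c and the 0.9 cutoff are assumptions for illustration, not values from the analysis). It also checks that the covariance of scaled data equals the correlation matrix.

```r
set.seed(1)
# Synthetic stand-in for the sensor columns: b is built to track a closely.
x <- data.frame(a = rnorm(100))
x$b <- x$a * 2 + rnorm(100, sd = 0.1)
x$c <- rnorm(100)

# Covariance of centered and scaled data is the correlation matrix.
cv <- cov(scale(x))
same_as_cor <- isTRUE(all.equal(cv, cor(x)))

# List each pair once (upper triangle) whose absolute correlation exceeds the cutoff.
cutoff <- 0.9
high <- which(abs(cv) > cutoff & upper.tri(cv), arr.ind = TRUE)
high_pairs <- data.frame(
  var1 = rownames(cv)[high[, "row"]],
  var2 = colnames(cv)[high[, "col"]],
  r = cv[high]
)
print(high_pairs)
```

Applied to pmldata[,-53], the same steps would give a table of column pairs to inspect alongside the level plot.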
It can also be useful to visualize some of the variables with pair plots. I did this for several groups of variables that might be correlated. For example, pair plotting the accelerations for the dumbbell:
library(caret)
library(AppliedPredictiveModeling)
transparentTheme(trans = .3)
accel_vars<-grep("accel_dumb",names(pmldata))
featurePlot(x=pmldata[,accel_vars],y=pmldata$classe,
plot="pairs", auto.key=list(columns = 5))
There generally appeared to be a lot of overlap between the distinct classes for most of the independent variables. At this point I began trying to train some models with all of the data columns. I started by bifurcating the training data into a training and a test set and attempted to train several models on 20% of the training data. I then created a pre-processing routine, using the entire data set to estimate the centering and scaling parameters.
bifurcate(pmldata, p=0.2) # create training and testing data partitions
pp<-preProcess(pmldata) # default behavior is to do a centering and scaling transformation
# I also tried using PCA decompositions
pp2<-preProcess(pmldata, method=c("center","scale","pca"),thresh=.90)
pmltraint<-predict(pp,pmltraining) # this actually applies the basic centering and scaling transformation to the training data
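For readers who want to see what the centering and scaling step amounts to, here is a base R sketch. It is not caret's implementation, and the helper names fit_center_scale and apply_center_scale are hypothetical. Note that it shows a common variation in which the parameters are estimated from the training partition only and then reused on the test partition, whereas the analysis above estimated them from the entire data set.

```r
# Estimate per-column mean and standard deviation from the training data.
fit_center_scale <- function(train) {
  num <- vapply(train, is.numeric, logical(1))
  list(
    cols = names(train)[num],
    center = colMeans(train[, num, drop = FALSE]),
    scale = vapply(train[, num, drop = FALSE], sd, numeric(1))
  )
}

# Apply previously estimated parameters to any data with the same columns.
apply_center_scale <- function(pp, newdata) {
  newdata[pp$cols] <- scale(as.matrix(newdata[pp$cols]),
                            center = pp$center, scale = pp$scale)
  newdata
}

# Toy partitions (made-up values, not from the project data).
toy_train <- data.frame(x = c(1, 2, 3, 4), y = c(10, 20, 30, 40),
                        classe = factor(c("A", "B", "A", "B")))
toy_test <- data.frame(x = 2.5, y = 25,
                       classe = factor("A", levels = c("A", "B")))

pp_toy <- fit_center_scale(toy_train)
toy_train_t <- apply_center_scale(pp_toy, toy_train)
toy_test_t <- apply_center_scale(pp_toy, toy_test)
```

The factor column classe is left untouched, and the test row is shifted by the training means rather than its own.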
My model training routine used the caret function train. With this function a simple formula can be trained using many different learning models. I used method=“rf” to train Random Forest models, and “svmLinear”, “svmPoly” and “svmRadial” to train Support Vector Machine models. I tried using “nb” for Naive Bayes models, “glmnet” for a multiclass logistic regression model, and a few others without luck. The trainControl function sets up some meta parameters. Its method “cv”, with number=5, causes a 5-fold cross validation on the input data to estimate the out of sample error. Very convenient.
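ctrl <- trainControl(method="cv", number=5) # 5-fold cross validation
# "x" is a placeholder for the caret method name, e.g. "rf" or "svmPoly"
mx <- train(classe ~ ., data=pmltraint, method="x", trControl=ctrl)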
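To make the cross validation step less of a black box, the following base R sketch shows the mechanics of a k-fold estimate. It is not how caret implements it internally. The helper names make_folds and cv_accuracy are hypothetical, the built-in iris data stands in for the project data, and a trivial majority-class predictor stands in for a real model.

```r
# Randomly assign each of n rows to one of k roughly equal folds.
make_folds <- function(n, k = 5) sample(rep(seq_len(k), length.out = n))

# Fit on k-1 folds, score on the held-out fold, and average the k accuracies.
cv_accuracy <- function(data, k, fit_fn, predict_fn) {
  folds <- make_folds(nrow(data), k)
  acc <- vapply(seq_len(k), function(i) {
    held_out <- data[folds == i, , drop = FALSE]
    model <- fit_fn(data[folds != i, , drop = FALSE])
    mean(predict_fn(model, held_out) == held_out$classe)
  }, numeric(1))
  mean(acc)
}

# Stand-in "model": always predict the most frequent class seen in training.
fit_majority <- function(d) names(which.max(table(d$classe)))
predict_majority <- function(model, d) rep(model, nrow(d))

toy <- iris
names(toy)[5] <- "classe"

set.seed(42)
fold_sizes <- table(make_folds(nrow(toy), 5))
acc <- cv_accuracy(toy, k = 5, fit_fn = fit_majority, predict_fn = predict_majority)
print(acc)
```

Each row is held out exactly once, so the averaged accuracy is computed only on data the corresponding fit never saw.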
I focused my efforts on Support Vector Machine and Random Forest models, using 40% of the data and trying PCA decompositions of the training data. Once a model was trained, I tested it on the remaining data. The caret confusionMatrix command is useful for getting the out of sample accuracy and a table of predicted vs. true classes that shows where the misclassifications fall.
For example, the model mrf4 is a Random Forest model, trained with method=“rf” on 40% of the data, and had an accuracy of about 99%.
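pmltestt<-predict(pp,pmltesting) # transform the test data
# Note: confusionMatrix() takes the predictions first and the true classes second. The order here is reversed,
# so in the output below the rows labeled "Prediction" hold the true classes. The overall accuracy is unaffected.
confusionMatrix(pmltestt$classe, predict(mrf4, pmltestt))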
## Loading required package: randomForest
## randomForest 4.6-12
## Type rfNews() to see new features/changes/bug fixes.
## Confusion Matrix and Statistics
##
## Reference
## Prediction A B C D E
## A 4455 3 5 0 1
## B 36 2979 16 5 1
## C 0 19 2708 10 0
## D 0 0 27 2542 3
## E 0 1 3 17 2864
##
## Overall Statistics
##
## Accuracy : 0.9906
## 95% CI : (0.989, 0.9921)
## No Information Rate : 0.2861
## P-Value [Acc > NIR] : < 2.2e-16
##
## Kappa : 0.9882
## Mcnemar's Test P-Value : NA
##
## Statistics by Class:
##
## Class: A Class: B Class: C Class: D Class: E
## Sensitivity 0.9920 0.9923 0.9815 0.9876 0.9983
## Specificity 0.9992 0.9954 0.9978 0.9977 0.9984
## Pos Pred Value 0.9980 0.9809 0.9894 0.9883 0.9927
## Neg Pred Value 0.9968 0.9982 0.9961 0.9976 0.9996
## Prevalence 0.2861 0.1913 0.1758 0.1640 0.1828
## Detection Rate 0.2838 0.1898 0.1725 0.1620 0.1825
## Detection Prevalence 0.2844 0.1935 0.1744 0.1639 0.1838
## Balanced Accuracy 0.9956 0.9939 0.9896 0.9926 0.9983
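As a check on how the headline numbers follow from the table, the sketch below re-enters the printed counts by hand and recomputes the overall accuracy and the per-class sensitivity as reported. The variable names are chosen for illustration only.

```r
# Counts copied from the confusion matrix printed above (rows as printed).
cm <- matrix(
  c(4455,    3,    5,    0,    1,
      36, 2979,   16,    5,    1,
       0,   19, 2708,   10,    0,
       0,    0,   27, 2542,    3,
       0,    1,    3,   17, 2864),
  nrow = 5, byrow = TRUE,
  dimnames = list(Prediction = LETTERS[1:5], Reference = LETTERS[1:5])
)

# Overall accuracy: correctly classified cases (the diagonal) over all cases.
accuracy <- sum(diag(cm)) / sum(cm)

# Sensitivity as reported: diagonal entry over its "Reference" column total.
sensitivity <- diag(cm) / colSums(cm)

print(round(accuracy, 4))
print(round(sensitivity, 4))
```

The diagonal sums to 15548 out of 15695 cases, which gives the reported accuracy of 0.9906.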
For my initial round of model training I used 20% of the training data on all 5 classes. I was only able to get Random Forest and Support Vector Machine models to work as expected and reach reasonable accuracy. A Support Vector Machine model with a polynomial kernel of degree 3 had an accuracy on the testing data of ~93.4%, while Random Forest models had an accuracy of ~82%.
Next I trained models with 40% of the input data. Remarkably, Random Forest models improved to ~98.9% accuracy, and an SVM with a polynomial kernel of degree 3 reached ~98.3% accuracy. The SVM polynomial model took about 42 minutes to train, compared to less than 10 minutes for the Random Forest. I believe this is due to the SVM training trying out a larger variety of parameters and kernel degrees. I also trained a Random Forest model on the entire data set; this took about 38 minutes and estimated a training accuracy of 99.8%.
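With training times of tens of minutes, saving fitted models to disk pays off quickly. The sketch below shows one way this could be done with saveRDS and readRDS. The helper cached_fit is hypothetical and not taken from the project code, and a quick lm fit on the built-in cars data stands in for a slow caret train call.

```r
# Return the model stored at `path` if it exists; otherwise fit, store, and return it.
cached_fit <- function(path, fit_fn) {
  if (file.exists(path)) {
    return(readRDS(path))
  }
  model <- fit_fn()
  saveRDS(model, path)
  model
}

# Stand-in for an expensive training call; counts how often it actually runs.
n_fits <- 0
slow_fit <- function() {
  n_fits <<- n_fits + 1
  lm(dist ~ speed, data = cars)
}

model_path <- tempfile(fileext = ".rds")
m1 <- cached_fit(model_path, slow_fit)  # fits and writes the file
m2 <- cached_fit(model_path, slow_fit)  # reads the file, no refit
```

Deleting the file forces a refit, which is a simple way to invalidate the cache after changing the training data or parameters.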
Estimates of out of sample error are important in any predictive analysis, and for this purpose a simple k-fold cross validation is excellent. In caret this step can easily be incorporated into the model training process. I used a 5-fold cross validation during training.
The “good” models were applied to the 20 test cases in the evaluation set. The “good” models (those with accuracy of at least 98%) were in agreement on the classification.
The classifications I found on the evaluation set: B A B A A E D B A A B C B A E E A B B B
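Checking that several models agree on the evaluation set can be automated. The sketch below is one possible way to do it. The helper all_agree is hypothetical, and the short prediction vectors are toy values, not the project's actual model outputs.

```r
# TRUE if every prediction vector in the list matches the first one exactly.
all_agree <- function(preds) {
  first <- as.character(preds[[1]])
  all(vapply(preds, function(p) identical(as.character(p), first), logical(1)))
}

# Toy predictions from three imaginary models.
toy_preds <- list(
  rf    = c("B", "A", "B"),
  svm   = c("B", "A", "B"),
  other = c("B", "A", "C")
)

agree_two <- all_agree(toy_preds[c("rf", "svm")])  # TRUE
agree_all <- all_agree(toy_preds)                  # FALSE
```

In practice each list element would be the result of predict() on the transformed evaluation data for one of the saved models.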
Ultimately I was able to get decent accuracy without using the window time or the person as a dummy variable, and to create models with an accuracy of around 99%.