Motion Project

Author - Arnie Larson

Project is part of a Coursera Machine Learning class

HTML version available here

Course Taught By: Jeff Leek, Johns Hopkins University

Date: 2015-11-19

Introduction

The goal of this project was to create a predictive model for classification of human motion data. The data come from a Human Activity Recognition dataset of sensor captures of subjects performing a simple dumbbell curl weightlifting exercise. Each capture is classified into one of 5 classes describing how the exercise was performed (the correct form plus common mistakes). For each capture, sensors record measurements on the dumbbell, arm, waist, etc. The original authors used the time series nature of the data in their analysis to classify the motions. Our evaluation data, however, consists of single instants in time; since we do not have the entire time series to evaluate, the aim is to create a predictive model that classifies the data from a single instant.

Project Overview

The objective is to classify the data at single instants in time, even though the underlying structure of the data is a time series. I did not notice much variation of the data against the window time, so I removed the window time variable and ultimately did not use any of the time or window information. I also removed the columns that contain NA's, i.e. all the summary statistics. I then tested several different modeling procedures with the caret package, separating my code into functions in a few different files to ease reproduction. My exploratory analysis included plotting the data against window time, pairs plots of the data, and density plots; I did not discern obvious patterns or correlations, and the data seemed fairly noisy. So I used all the non-sparse raw data columns in my initial modeling, transformed by a simple centering and scaling. As I began training models, I also saved them so that I could avoid retraining them later.

Data Overview

What’s in the Data? The data for the project includes a larger training set and a smaller evaluation data set.

  • Training Data: 19622 rows, 160 columns, with 5 classifications
  • Evaluation Data: 20 rows, 160 columns, not classified

Dataset Columns

  • X: ordinal, just the row number
  • user_name: one of 6 different subjects
  • new_window: yes/no (yes marks the last row of a given time window)
  • num_window: ordinal, tracking the windows; also groups the dataset
  • raw_timestamp_part_1: unix time
  • raw_timestamp_part_2: window time, presumed to start from 0 within a window
  • cvtd_timestamp: a human-readable time
  • many data columns (accelerations, etc. from the arm/waist/dumbbell sensors)
  • data columns also include (sparse) summary statistics (min/max, mean, stdev)
  • the summary statistic columns also contain "#DIV/0!" entries (see the sketch after this list)
  • classe: { A - E }, the classification of each motion
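The sparse summary columns are easiest to handle at load time. As a minimal sketch (assuming the standard course file name pml-training.csv), the raw CSV can be read so that blanks and the "#DIV/0!" strings become NA, which makes the sparse columns easy to find:

# Read the raw CSV; treat blanks and "#DIV/0!" strings as NA
pml <- read.csv("pml-training.csv", na.strings = c("NA", "#DIV/0!", ""))

# Count missing values per column: columns are either complete or almost entirely NA
na_counts <- colSums(is.na(pml))
table(na_counts)
sum(na_counts > 0)   # number of sparse summary columns that get dropped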

Modeling Procedure

To prepare the data, I removed the columns that I did not want to train on: initially the summary statistic columns, and later the first 7 columns containing the user_name, time and windowing information. The loading, basic processing and bifurcating routines are in the file process.R.

source('process.R');
## Loading required package: lattice
## Loading required package: ggplot2
load_data()              # Loads data, creates pml global var
process(pml)        # Cuts the data, creates pmldata var
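
The actual column removal lives in process.R; as a rough sketch of the idea (not necessarily the author's exact code, and assuming the data were read with the NA handling sketched above), keep only the complete sensor columns and then drop the first seven bookkeeping columns:

keep    <- colSums(is.na(pml)) == 0   # keep only columns with no NA values
pmldata <- pml[, keep]
pmldata <- pmldata[, -(1:7)]          # drop X, user_name, timestamps and window columns
dim(pmldata)                          # 19622 x 53: 52 predictors plus classe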

Let’s take a quick look at some of the pmldata variables.

dim(pmldata); summary(pmldata[,c(1,2,3,4,5,6,7,8,53)])
## [1] 19622    53
##    roll_belt        pitch_belt          yaw_belt       total_accel_belt
##  Min.   :-28.90   Min.   :-55.8000   Min.   :-180.00   Min.   : 0.00   
##  1st Qu.:  1.10   1st Qu.:  1.7600   1st Qu.: -88.30   1st Qu.: 3.00   
##  Median :113.00   Median :  5.2800   Median : -13.00   Median :17.00   
##  Mean   : 64.41   Mean   :  0.3053   Mean   : -11.21   Mean   :11.31   
##  3rd Qu.:123.00   3rd Qu.: 14.9000   3rd Qu.:  12.90   3rd Qu.:18.00   
##  Max.   :162.00   Max.   : 60.3000   Max.   : 179.00   Max.   :29.00   
##   gyros_belt_x        gyros_belt_y       gyros_belt_z    
##  Min.   :-1.040000   Min.   :-0.64000   Min.   :-1.4600  
##  1st Qu.:-0.030000   1st Qu.: 0.00000   1st Qu.:-0.2000  
##  Median : 0.030000   Median : 0.02000   Median :-0.1000  
##  Mean   :-0.005592   Mean   : 0.03959   Mean   :-0.1305  
##  3rd Qu.: 0.110000   3rd Qu.: 0.11000   3rd Qu.:-0.0200  
##  Max.   : 2.220000   Max.   : 0.64000   Max.   : 1.6200  
##   accel_belt_x      classe  
##  Min.   :-120.000   A:5580  
##  1st Qu.: -21.000   B:3797  
##  Median : -15.000   C:3422  
##  Mean   :  -5.595   D:3216  
##  3rd Qu.:  -5.000   E:3607  
##  Max.   :  85.000

I found that looking at the covariance matrix of the scaled data (equivalently, the correlation matrix) gives a quick and dirty view of the variables in the data set and a feel for the correlations. The plot indicates that there is some, but probably not a lot of, strongly correlated data:

library(lattice)
cv <- cov(scale(pmldata[,-53]))  # scale the data before getting the covariance 
levelplot(cv, aspect="iso",  scales=list(x=list(rot=90)))

It can also be useful to visualize some of the variables with pairs plots. I did this for several types of variables that might have correlations. For example, a pairs plot of the accelerations for the dumbbell:

library(caret)
library(AppliedPredictiveModeling)
transparentTheme(trans = .3)
accel_vars<-grep("accel_dumb",names(pmldata))
featurePlot(x=pmldata[,accel_vars],y=pmldata$classe, 
            plot="pairs", auto.key=list(columns = 5)) 

There generally appeared to be a lot of overlap between the distinct classes for most of the independent variables. At this point I began trying to train some models with all of the data columns. I started by bifurcating the training data into a training and a test set and attempted to train several models on 20% of the training data. I then created a preprocessing routine, using the entire data set to estimate the centering and scaling parameters.

bifurcate(pmldata, p=0.2)     # create training and testing data partitions
pp<-preProcess(pmldata)       # default behavior is a centering and scaling transformation
# I also tried using PCA decompositions
pp2<-preProcess(pmldata, method=c("center","scale","pca"), thresh=.90)
pmltraint<-predict(pp,pmltraining)  # apply the centering and scaling transformation to the training data
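
The bifurcate() helper is defined in process.R; a plausible equivalent using caret's createDataPartition (assuming it fills the pmltraining and pmltesting variables used below) would be:

set.seed(1234)                         # seed chosen here only for illustration
inTrain     <- createDataPartition(pmldata$classe, p = 0.2, list = FALSE)
pmltraining <- pmldata[inTrain, ]      # 20% of the data used for training
pmltesting  <- pmldata[-inTrain, ]     # the rest held out for testing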

My model training routine used the caret function train, which can fit a simple formula with many different learning models. I used method="rf" to train Random Forest models, and "svmLinear", "svmPoly" and "svmRadial" to train Support Vector Machine models. I also tried "nb" for Naive Bayes models, "glmnet" for a multiclass logistic regression model, and a few others, without luck. The trainControl function sets up some meta parameters; its method "cv" with number=5 performs a 5-fold cross validation on the training data to estimate the out-of-sample error. Very convenient.

ctrl=trainControl(method="cv",number=5)
mx <- train(classe ~ ., data=pmltraint, method="x", trControl=ctrl)  # "x" is a placeholder for the method name, e.g. "rf"
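
As a concrete (illustrative) usage, the Random Forest and polynomial SVM fits can be trained on the transformed data and cached to disk so they never need to be retrained; the file names here are just examples:

ctrl  <- trainControl(method = "cv", number = 5)
mrf   <- train(classe ~ ., data = pmltraint, method = "rf",      trControl = ctrl)
msvmp <- train(classe ~ ., data = pmltraint, method = "svmPoly", trControl = ctrl)

saveRDS(mrf,   "model_rf.rds")         # cache the fitted models
saveRDS(msvmp, "model_svmPoly.rds")
# mrf <- readRDS("model_rf.rds")       # reload later without retraining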

I focused my efforts on Support Vector Machine and Random Forest models, using 40% of the data, and trying PCA decompositions of the training data. Once a model was trained, I tested it on the remaining data. The caret confusionMatrix command is useful for getting out-of-sample accuracy and a truth table showing Type I and II errors.

For example, the model mrf4 is a Random Forest model trained with method="rf" on 40% of the data; it had an accuracy a little better than 99%.

pmltestt<-predict(pp,pmltesting)  # transform the test data
confusionMatrix(pmltestt$classe, predict(mrf4, pmltestt))
## Loading required package: randomForest
## randomForest 4.6-12
## Type rfNews() to see new features/changes/bug fixes.
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction    A    B    C    D    E
##          A 4455    3    5    0    1
##          B   36 2979   16    5    1
##          C    0   19 2708   10    0
##          D    0    0   27 2542    3
##          E    0    1    3   17 2864
## 
## Overall Statistics
##                                          
##                Accuracy : 0.9906         
##                  95% CI : (0.989, 0.9921)
##     No Information Rate : 0.2861         
##     P-Value [Acc > NIR] : < 2.2e-16      
##                                          
##                   Kappa : 0.9882         
##  Mcnemar's Test P-Value : NA             
## 
## Statistics by Class:
## 
##                      Class: A Class: B Class: C Class: D Class: E
## Sensitivity            0.9920   0.9923   0.9815   0.9876   0.9983
## Specificity            0.9992   0.9954   0.9978   0.9977   0.9984
## Pos Pred Value         0.9980   0.9809   0.9894   0.9883   0.9927
## Neg Pred Value         0.9968   0.9982   0.9961   0.9976   0.9996
## Prevalence             0.2861   0.1913   0.1758   0.1640   0.1828
## Detection Rate         0.2838   0.1898   0.1725   0.1620   0.1825
## Detection Prevalence   0.2844   0.1935   0.1744   0.1639   0.1838
## Balanced Accuracy      0.9956   0.9939   0.9896   0.9926   0.9983

Results

For my initial round of model training I used 20% of the training data on all 5 classes. Only the Random Forest and Support Vector Machine models worked as expected with reasonable accuracy. A Support Vector Machine model with a polynomial kernel of degree 3 had an accuracy on the testing data of ~93.4%, while Random Forest models had an accuracy of ~82%.

Next I trained models with 40% of the input data. I found, remarkably, that Random Forest models had improved to ~ 98.9% accuracy and that an SVM with polynomial kernel of degree 3 had ~98.3% accuracy. The SVM polynomial model also took about 42 minutes to train compared to less than 10 minutes for the Random Forest. I believe this is due to the SVM model trying out a larger variety of parameters and kernel degrees. I also trained a Random Forest model on the entire data set - this took about 38 minutes and estimated a training accuracy of 99.8%.

Estimates of out-of-sample error are important in any predictive analysis, and for a problem of this size a simple k-fold cross validation works well. In caret this step can easily be incorporated into the model training process; I used a 5-fold cross validation during training (the snippet after the list below shows how the estimates are read off a fitted model).

Out of sample error rates:
  • Random Forest trained on 40% of data: 1.3%
  • Random Forest trained on 100% of data: 0.4%
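
These estimates come straight from the 5-fold cross validation that caret ran during training; for example, they can be read off a fitted model such as mrf4:

mrf4$resample                        # per-fold Accuracy and Kappa
mean(mrf4$resample$Accuracy)         # average cross-validated accuracy
1 - mean(mrf4$resample$Accuracy)     # estimated out-of-sample error rate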

The "good" models (those with accuracy of at least 98%) were applied to the 20 test cases in the evaluation set, and they were in agreement on the classification.
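
Generating those predictions follows the same preprocessing path used for the test data; the name of the loaded evaluation data frame (pmleval here) is illustrative:

pmlevalt <- predict(pp, pmleval)     # apply the same centering and scaling
predict(mrf4, pmlevalt)              # predicted classes for the 20 evaluation cases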

The classifications I found on the evaluation set: - B A B A A E D B A A B C B A E E A B B B

I was ultimately able to get good accuracy without using the window time or the person as a dummy variable, creating models with an accuracy of around 99%.