Financial Data Modeling

The idea for doing this project comes from the book Doing Data Science by Cathy O'Neil and Rachel Schutt.

Goals

Gather financial data, then build and investigate candidate features for modeling use cases. Compute daily log returns and daily changes in log volume. Build a volatility estimate from an exponentially weighted standard deviation of the log returns. Fit simple linear forecasting models on these features. Compare a trading strategy driven by these models against a random strategy and a buy and hold strategy.

  • Generate linear models using recent log returns, volume and volatility estimates.
  • Look for correlations between today's volatility and intraday high/low features and tomorrow's close.
  • Apply the model to future data and calculate returns on investment.
  • Compare the model returns to simply buying and holding the stock as well as stochastically buying and selling the stock.

Conclusion

The goal of this project was to code simple predictive models and compare their returns over some time period against "naive" strategies, e.g. buy and hold. In this notebook I use daily data for AAPL stock from January 1990 through March 2018. I train a model on data from 1996 through 2014 and apply it to the years 2015, 2016 and 2017. It is not surprising to me that I did not find any significant correlation between the model and next-day stock returns. The models do not show any predictive advantage over a stochastic model that randomly chooses to buy and sell on any given day, or over simply buying and holding.

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline
In [2]:
import pandas_datareader 
In [3]:
## Get some financial data
## Quandl is a private financial data provider.

start_date = '1990-01-01'
end_date = '2018-11-12'

# Quandl Key, add here:
key="YOURKEYHERE"
# this is the S&P 500 - judging by the code name, a monthly series rather than a daily one (unused below)
sp_code="MULTPL/SP500_REAL_PRICE_MONTH"


## To start, pull a single ticker (AAPL); the other readers are left commented out
#q1 = pandas_datareader.quandl.QuandlReader("AMZN",  start_date, end_date, api_key=key)
#q2 = pandas_datareader.quandl.QuandlReader("GOOGL",  start_date, end_date, api_key=key)
q3 = pandas_datareader.quandl.QuandlReader("AAPL",  start_date, end_date, api_key=key)
#q4 = pandas_datareader.quandl.QuandlReader("FB",  start_date, end_date, api_key=key)
#q = pandas_datareader.quandl.QuandlReader(sp_code,  start_date, end_date, api_key=key)
#amzn=q1.read()
#goog=q2.read()
aapl=q3.read()
In [4]:
# Note - data stops around 3/27/2018
# In previous notebooks I've looked at this data in more depth; here we'll take a peek,
# and then do a series of transformations before we look at modeling results

#print(amzn.shape, amzn.index.min(),amzn.index.max())
#print(goog.shape, goog.index.min(),goog.index.max())
print(aapl.shape, aapl.index.min(),aapl.index.max())
((7113, 12), Timestamp('1990-01-02 00:00:00'), Timestamp('2018-03-27 00:00:00'))
In [5]:
# Aapl
aapl.head()
Out[5]:
Open High Low Close Volume ExDividend SplitRatio AdjOpen AdjHigh AdjLow AdjClose AdjVolume
Date
2018-03-27 173.68 175.15 166.92 168.340 38962839.0 0.0 1.0 173.68 175.15 166.92 168.340 38962839.0
2018-03-26 168.07 173.10 166.44 172.770 36272617.0 0.0 1.0 168.07 173.10 166.44 172.770 36272617.0
2018-03-23 168.39 169.92 164.94 164.940 40248954.0 0.0 1.0 168.39 169.92 164.94 164.940 40248954.0
2018-03-22 170.00 172.68 168.60 168.845 41051076.0 0.0 1.0 170.00 172.68 168.60 168.845 41051076.0
2018-03-21 175.04 175.09 171.26 171.270 35247358.0 0.0 1.0 175.04 175.09 171.26 171.270 35247358.0
In [6]:
## OK - back in the late 1990s the adjusted close values were pretty low!
aapl[(aapl.index>'1997') & (aapl.index<'1999') ].AdjClose.plot()
Out[6]:
<matplotlib.axes._subplots.AxesSubplot at 0x10b49ee90>
In [7]:
# Oh right - there was a big split a few years back
aapl[['AdjClose','Close']].plot()
Out[7]:
<matplotlib.axes._subplots.AxesSubplot at 0x10b49e210>

Transformations on the data

First transform the data so that it is suitable for fitting linear models of the form y = Bx + C, where B and C are chosen to give the best fit to the data.

  • Reorder so that oldest is at the top.
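  • Add log returns, LR = log(AdjClose_t1) - log(AdjClose_t0), and normalized log returns (NLR).
  • Add log volume changes, LV = log(AdjVolume_t1) - log(AdjVolume_t0).
  • Add intraday volatility, IV = (AdjHigh - AdjLow)/AdjClose.
  • Add an exponentially weighted volatility of LR (vol) and the intraday close position, IC = (AdjClose - AdjLow)/(AdjHigh - AdjLow).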
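The reason for working with log returns can be sketched on a toy series: log returns add up over time, so their sum over a window equals the log of the total price ratio. The prices below are made up for illustration and are not AAPL data.

```python
import numpy as np
import pandas as pd

# Toy adjusted-close series (made-up prices, oldest first).
prices = pd.Series([100.0, 102.0, 101.0, 105.0, 103.0], name="AdjClose")

# Daily log return: log(P_t) - log(P_{t-1}); the first row has no predecessor.
log_ret = np.log(prices).diff()

# Log returns are additive: their sum equals log(P_last / P_first).
total_log_ret = log_ret.sum()
assert np.isclose(total_log_ret, np.log(prices.iloc[-1] / prices.iloc[0]))

# Convert the summed log return back to a simple return over the whole window.
total_simple_ret = np.exp(total_log_ret) - 1.0
print(round(total_simple_ret, 4))
```

The same identity is what lets the later cells recover a dollar price change from LR via exp(LR) - 1.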
In [8]:
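# Transformations we'll do:
# The data arrives newest-first, so reverse it to oldest-first before differencing
aapl=aapl.reindex(index=aapl.index[::-1])
# LR: daily log return; NLR: LR standardized with the full-sample mean and std
aapl['LR']=aapl[['AdjClose']].apply(lambda x: np.log(x) - np.log(x.shift(1)))
aapl['NLR']=(aapl.LR - aapl.LR.mean())/aapl.LR.std()
# vol: exponentially weighted std of LR. Note that halflife is measured in rows (days),
# so halflife=0.97 halves the weights in under one day; it is not a 0.97 decay factor
aapl['vol']=aapl.LR.ewm(halflife=0.97).std()
# LV: daily change in log volume
aapl['LV']=aapl[['AdjVolume']].apply(lambda x: np.log(x) - np.log(x.shift(1)))
# IV: intraday range relative to the close; IC: where the close sits in the day's range (0 = low, 1 = high)
aapl['IV']=(aapl.AdjHigh - aapl.AdjLow)/aapl.AdjClose
aapl['IC']=(aapl.AdjClose - aapl.AdjLow)/(aapl.AdjHigh - aapl.AdjLow)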
In [9]:
aapl[['LR','NLR','vol','LV','IV','IC']].head()
Out[9]:
LR NLR vol LV IV IC
Date
1990-01-02 NaN NaN NaN NaN 0.067114 0.900000
1990-01-03 0.006689 0.207857 NaN 0.126945 0.013333 0.000000
1990-01-04 0.003461 0.095719 0.002283 0.062969 0.039862 0.253333
1990-01-05 0.003184 0.086104 0.001576 -0.585766 0.033113 0.600000
1990-01-08 0.006601 0.204789 0.002081 -0.193942 0.026316 1.000000
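The vol column is quite jumpy from one day to the next, which follows from the halflife=0.97 setting. Here is a small sketch of how pandas turns a half-life into a smoothing weight (alpha = 1 - exp(-ln 2 / halflife)), and what half-life a 0.97 per-day decay factor would correspond to. Whether a 0.97 decay factor was the original intent is an assumption on my part; the helper names and the synthetic returns are illustrative only.

```python
import numpy as np
import pandas as pd

def alpha_from_halflife(halflife):
    # pandas' definition for ewm(halflife=...): alpha = 1 - exp(-ln(2) / halflife)
    return 1.0 - np.exp(-np.log(2.0) / halflife)

def halflife_from_decay(decay):
    # decay = 1 - alpha is the factor by which each older observation is down-weighted
    return np.log(0.5) / np.log(decay)

alpha_used = alpha_from_halflife(0.97)     # what halflife=0.97 implies
halflife_alt = halflife_from_decay(0.97)   # what a 0.97 decay factor would imply

# Sanity check on synthetic returns: both parameterizations give the same series.
rng = np.random.default_rng(0)
toy_lr = pd.Series(rng.normal(0.0, 0.02, 500))
vol_a = toy_lr.ewm(halflife=0.97).std()
vol_b = toy_lr.ewm(alpha=alpha_used).std()
assert np.allclose(vol_a.dropna(), vol_b.dropna())

print("alpha for halflife=0.97: {0:.3f}".format(alpha_used))
print("halflife for decay=0.97: {0:.1f} days".format(halflife_alt))
```

With halflife=0.97 roughly half of the weight sits on the most recent observation, so vol reacts almost immediately to a single large return. A decay factor of 0.97 would instead spread the weight over several weeks.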
In [10]:
# Look at the normalized log return (NLR) and the intraday close position (IC)
aapl[(aapl.index > '2002') & (aapl.index < '2003')][['NLR','IC']].plot(figsize=(12,6))
Out[10]:
<matplotlib.axes._subplots.AxesSubplot at 0x10c0a9d90>
In [11]:
# Are there any interesting correlations?
aapl[(aapl.index > '2002') & (aapl.index < '2003')][['NLR','IC']].corr()
Out[11]:
NLR IC
NLR 1.000000 0.696897
IC 0.696897 1.000000
In [13]:
## Look at the exponentially weighted volatility (vol) and the intraday volatility (IV)
aapl[(aapl.index > '2004') & (aapl.index < '2005')][['vol','IV']].plot(figsize=(12,6))
Out[13]:
<matplotlib.axes._subplots.AxesSubplot at 0x115e5cf50>
In [14]:
## are the vols correlated?
aapl[(aapl.index > '2004') & (aapl.index < '2005')][['vol','IV']].corr()
Out[14]:
vol IV
vol 1.000000 0.429172
IV 0.429172 1.000000
In [15]:
# OK, now add previous days' data:
# This adds NLR-1 ... NLR-4, the normalized log returns from 1 to 4 days earlier, as features for a y = Ax + B type linear problem.
# Since data is ordered old => new, a positive shift moves values down to later rows, so each row picks up older data
aapl['NLR-1']=aapl[['NLR']].apply(lambda x: x.shift(1))
aapl['NLR-2']=aapl[['NLR']].apply(lambda x: x.shift(2))
aapl['NLR-3']=aapl[['NLR']].apply(lambda x: x.shift(3))
aapl['NLR-4']=aapl[['NLR']].apply(lambda x: x.shift(4))
In [16]:
# And the future value - what the model is to predict, y = NLR+1
aapl['NLR+1']=aapl[['NLR']].apply(lambda x: x.shift(-1))
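Getting the shift direction wrong would leak future data into the features, so it is worth checking on a toy frame. This sketch uses made-up values and only the column naming from this notebook: on an oldest-first index, shift(1) brings in yesterday's value and shift(-1) brings in tomorrow's.

```python
import pandas as pd

# Toy frame, oldest row first, with easily recognizable values.
toy = pd.DataFrame(
    {"NLR": [0.1, 0.2, 0.3, 0.4, 0.5]},
    index=pd.date_range("2020-01-01", periods=5, freq="D"),
)

toy["NLR-1"] = toy["NLR"].shift(1)    # feature: previous day's value
toy["NLR+1"] = toy["NLR"].shift(-1)   # target: next day's value

# Rows usable for modeling have both a lagged feature and a target.
model_rows = toy.dropna()
print(model_rows)
```

Only the middle three rows survive dropna(): the first row has no lag and the last row has no target.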
In [17]:
# Peek at this:
aapl[['LR', 'AdjClose','NLR+1','NLR','NLR-1','NLR-2','NLR-3','NLR-4','vol','LV','AdjVolume','IV','IC']].head(6)
Out[17]:
LR AdjClose NLR+1 NLR NLR-1 NLR-2 NLR-3 NLR-4 vol LV AdjVolume IV IC
Date
1990-01-02 NaN 1.118093 0.207857 NaN NaN NaN NaN NaN NaN NaN 45799600.0 0.067114 0.900000
1990-01-03 0.006689 1.125597 0.095719 0.207857 NaN NaN NaN NaN NaN 0.126945 51998800.0 0.013333 0.000000
1990-01-04 0.003461 1.129499 0.086104 0.095719 0.207857 NaN NaN NaN 0.002283 0.062969 55378400.0 0.039862 0.253333
1990-01-05 0.003184 1.133101 0.204789 0.086104 0.095719 0.207857 NaN NaN 0.001576 -0.585766 30828000.0 0.033113 0.600000
1990-01-08 0.006601 1.140605 -0.364365 0.204789 0.086104 0.095719 0.207857 NaN 0.002081 -0.193942 25393200.0 0.026316 1.000000
1990-01-09 -0.009785 1.129499 -1.562685 -0.364365 0.204789 0.086104 0.095719 0.207857 0.009535 -0.164811 21534800.0 0.026575 0.630000
In [18]:
aapl[['LR', 'AdjClose','NLR+1','NLR','NLR-1','NLR-2','NLR-3','NLR-4','vol','LV','IV','IC']].describe().transpose()
Out[18]:
count mean std min 25% 50% 75% max
LR 7112.0 7.050564e-04 0.028789 -0.731247 -0.012852 0.000054 0.014317 0.286796
AdjClose 7113.0 2.828904e+01 43.039528 0.415743 1.229236 3.080479 43.196105 181.720000
NLR+1 7112.0 4.247633e-17 1.000000 -25.424949 -0.470909 -0.022607 0.472828 9.937587
NLR 7112.0 4.247633e-17 1.000000 -25.424949 -0.470909 -0.022607 0.472828 9.937587
NLR-1 7111.0 1.303291e-04 1.000010 -25.424949 -0.470355 -0.022505 0.472882 9.937587
NLR-2 7110.0 -9.279425e-05 0.999903 -25.424949 -0.470618 -0.022607 0.472579 9.937587
NLR-3 7109.0 2.497108e-05 0.999924 -25.424949 -0.469828 -0.022505 0.472775 9.937587
NLR-4 7108.0 9.810736e-05 0.999976 -25.424949 -0.469301 -0.022386 0.472828 9.937587
vol 7111.0 2.297688e-02 0.017499 0.001576 0.012274 0.019062 0.028985 0.470795
LV 7112.0 -2.273158e-05 0.436074 -4.030927 -0.260247 -0.022251 0.236022 4.025519
IV 7113.0 3.302855e-02 0.021041 0.003953 0.018308 0.028265 0.041760 0.277191
IC 7113.0 5.094351e-01 0.307046 0.000000 0.239617 0.500000 0.786885 1.000000
In [19]:
## The modeling process could start as simply as checking whether any variable correlates with NLR+1.
## ('NLR+1' is listed twice in the column selection, hence the duplicated row and column in the output.)
aapl[(aapl.index > '2002') & (aapl.index < '2003')][['NLR+1', 'AdjClose','NLR+1','NLR','NLR-1','NLR-2','NLR-3','NLR-4','vol','LV','IV','IC']].corr()
Out[19]:
NLR+1 AdjClose NLR+1 NLR NLR-1 NLR-2 NLR-3 NLR-4 vol LV IV IC
NLR+1 1.000000 -0.063285 1.000000 -0.024893 -0.124988 -0.092695 0.089660 -0.018519 0.011842 0.019582 0.080712 -0.101481
AdjClose -0.063285 1.000000 -0.063285 0.085514 0.081302 0.063937 0.054149 0.069897 0.048722 0.010819 -0.004778 0.126551
NLR+1 1.000000 -0.063285 1.000000 -0.024893 -0.124988 -0.092695 0.089660 -0.018519 0.011842 0.019582 0.080712 -0.101481
NLR -0.024893 0.085514 -0.024893 1.000000 -0.033522 -0.119694 -0.081344 0.090525 -0.049604 -0.067566 -0.014254 0.696897
NLR-1 -0.124988 0.081302 -0.124988 -0.033522 1.000000 -0.035378 -0.120355 -0.082399 -0.098442 0.017899 -0.061903 0.044624
NLR-2 -0.092695 0.063937 -0.092695 -0.119694 -0.035378 1.000000 -0.032962 -0.119734 -0.051526 0.001735 -0.063011 -0.016333
NLR-3 0.089660 0.054149 0.089660 -0.081344 -0.120355 -0.032962 1.000000 -0.031626 -0.083384 0.062351 -0.018205 -0.054379
NLR-4 -0.018519 0.069897 -0.018519 0.090525 -0.082399 -0.119734 -0.031626 1.000000 -0.066132 0.031008 -0.097918 0.136069
vol 0.011842 0.048722 0.011842 -0.049604 -0.098442 -0.051526 -0.083384 -0.066132 1.000000 0.071383 0.475645 0.037181
LV 0.019582 0.010819 0.019582 -0.067566 0.017899 0.001735 0.062351 0.031008 0.071383 1.000000 0.308876 0.043641
IV 0.080712 -0.004778 0.080712 -0.014254 -0.061903 -0.063011 -0.018205 -0.097918 0.475645 0.308876 1.000000 0.000201
IC -0.101481 0.126551 -0.101481 0.696897 0.044624 -0.016333 -0.054379 0.136069 0.037181 0.043641 0.000201 1.000000
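A rough way to judge the NLR+1 row of this matrix: if there were no true correlation, a sample correlation over n observations would have a standard error of about 1/sqrt(n), so values inside about 2/sqrt(n) are hard to tell from noise. This is a rule of thumb that assumes independent, roughly normal observations, which daily returns only loosely satisfy. The value n = 252 is the row count this notebook reports for the 2002 window.

```python
import numpy as np

n_days = 252  # rows in the 2002 window, as reported elsewhere in this notebook

# Rule-of-thumb noise band: about two standard errors of a sample correlation.
threshold = 2.0 / np.sqrt(n_days)

# Correlations with NLR+1, copied from the matrix above.
corr_with_target = {
    "NLR": -0.024893, "NLR-1": -0.124988, "NLR-2": -0.092695,
    "NLR-3": 0.089660, "NLR-4": -0.018519, "vol": 0.011842,
    "LV": 0.019582, "IV": 0.080712, "IC": -0.101481,
}

# Keep only the features whose correlation falls outside the noise band.
notable = {k: v for k, v in corr_with_target.items() if abs(v) > threshold}

print("threshold: {0:.3f}".format(threshold))
print("features outside the band:", sorted(notable))
```

By this yardstick none of the features clears the band for 2002, with NLR-1 coming closest.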

Testing and Applying Models - Part 1

Now that we've prepared the historical data, we'd like to test and apply a predictive model and a strategy for trading securities. This initial work explores the code used to compute the returns of the different models.

  • Does a model trained in the past predict closes in the future?
  • (Also, does a model trained on later data predict closes in the earlier period?)
  • Does a model beat just holding the stock?
  • Does a model beat a simple stochastic one?
In [44]:
## How do I see how much just holding AAPL would do?
# Let's say I make a bet of 1000 shares.  What would happen?

hold = aapl[(aapl.index > '2002') & (aapl.index < '2003')][['AdjClose','LR']] 

# Calculate the daily change: convert the log return to a simple return, then multiply by the previous day's price
hold['daily_change']=1000*(np.exp(hold['LR'])-1)*hold['AdjClose'].shift(1)
hold['investment_value']=1000*hold.AdjClose

initial_investment=1000*hold.AdjClose.iloc[0]  # first (oldest) row; .iloc makes the positional lookup explicit

# This works - but is hella ugly
#hold['overall_return']=np.ones(hold.count()[0])*initial_investment[0]*hold.AdjClose

hold['overall_return']=1000*hold.AdjClose - initial_investment
In [46]:
# calculate daily change, by looking at overall_return_1 - overall_return_0
hold['daily_change2']=hold[['overall_return']].apply(lambda x: x - x.shift(1))
hold['overall_return2']=hold.daily_change2.cumsum()
In [52]:
# checksum:
hold['chksm1']=hold['overall_return2']-hold['overall_return']
hold.chksm1.sum()
Out[52]:
0.0
In [60]:
del hold['chksm1']
hold.head()
Out[60]:
AdjClose LR daily_change investment_value overall_return daily_change2 overall_return2
Date
2002-01-02 1.497187 0.061967 NaN 1497.187374 0.000000 NaN NaN
2002-01-03 1.515179 0.011946 17.991951 1515.179325 17.991951 17.991951 17.991951
2002-01-04 1.522248 0.004654 7.068267 1522.247591 25.060218 7.068267 25.060218
2002-01-07 1.471485 -0.033916 -50.763005 1471.484586 -25.702788 -50.763005 -25.702788
2002-01-08 1.452850 -0.012745 -18.634521 1452.850065 -44.337308 -18.634521 -44.337308
In [75]:
hold.tail()
Out[75]:
AdjClose LR daily_change investment_value overall_return daily_change2 overall_return2
Date
2002-12-24 0.922730 -0.009012 -8.353406 922.730072 -574.457301 -8.353406 -574.457301
2002-12-26 0.925300 0.002782 2.570279 925.300351 -571.887023 2.570279 -571.887023
2002-12-27 0.903453 -0.023894 -21.847369 903.452982 -593.734392 -21.847369 -593.734392
2002-12-30 0.904096 0.000711 0.642570 904.095551 -593.091822 0.642570 -593.091822
2002-12-31 0.920802 0.018310 16.706812 920.802363 -576.385010 16.706812 -576.385010
In [78]:
# Plot the investment value (left axis) and the daily change (right axis)
hold['investment_value'].plot(grid=True, legend=True)
hold['daily_change'].plot(grid=True, legend=True, secondary_y=True)
Out[78]:
<matplotlib.axes._subplots.AxesSubplot at 0x1063bb050>
In [19]:
## OK, now let's make a stochastic model (perhaps a few) and see how it does.
# Let's say we bet on 1000 shares, and always sell on the next day after a bet.  What would happen?
# Also - what would the range of outcomes be, etc..

# First, what fraction of days in the pre-2002 data have positive returns? Close to 50% overall.
print(aapl[aapl.index<'2002'].count().iloc[0])
print(aapl[(aapl.index<'2002') & (aapl.LR > 0)].count().iloc[0])
# np.float was removed from NumPy; the builtin float does the same job
print(float(aapl[(aapl.index<'2002') & (aapl.LR > 0)].count().iloc[0]) / aapl[aapl.index<'2002'].count().iloc[0])
3028
1425
0.470607661823
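Is 47% really "close to 50%"? A quick check treats each day as an independent coin flip and asks how far 1425 up days out of 3028 sits from a fair coin. This is only a sketch: days are not independent, and the denominator also includes flat days and the first row with no return, both of which count as "not up".

```python
import math

n_days, n_up = 3028, 1425   # counts printed above

p_hat = n_up / n_days

# Standard error of a proportion under the fair-coin hypothesis p = 0.5.
se = math.sqrt(0.5 * 0.5 / n_days)

# Number of standard errors between the observed fraction and 0.5.
z = (p_hat - 0.5) / se

print("fraction of up days: {0:.4f}".format(p_hat))
print("z-score vs. a fair coin: {0:.2f}".format(z))
```

Under these simplifying assumptions the gap is about three standard errors, so "a bit under 50%" may be a fairer reading for the pre-2002 data than "50%". The stochastic model below still flips a fair coin.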
In [130]:
# First cut of a simple stochastic model

rand = aapl[(aapl.index > '2002') & (aapl.index < '2003')][['AdjClose','LR']] 
rand['make_bet']=np.random.randint(0,2, rand.shape[0])
print("number of days: ",rand.shape[0]," bets: ",rand.make_bet.sum())
# Calculate the daily change as today's price minus the previous day's price, times 1000 shares


#rand['daily_change']=1000*(np.exp(hold['LR'])-1)*hold['AdjClose'].shift(1)
rand['daily_change']=1000*rand[['AdjClose']].apply(lambda x: x - x.shift(1))
rand['investment_change']=rand['daily_change']*rand['make_bet']
rand['investment_return']=rand['investment_change'].cumsum()
('number of days: ', 252, ' bets: ', 116)
In [131]:
rand.head(10)
Out[131]:
AdjClose LR make_bet daily_change investment_change investment_return
Date
2002-01-02 1.497187 0.061967 0 NaN NaN NaN
2002-01-03 1.515179 0.011946 1 17.991951 17.991951 17.991951
2002-01-04 1.522248 0.004654 1 7.068267 7.068267 25.060218
2002-01-07 1.471485 -0.033916 0 -50.763005 -0.000000 25.060218
2002-01-08 1.452850 -0.012745 1 -18.634521 -18.634521 6.425697
2002-01-09 1.391163 -0.043387 0 -61.686690 -0.000000 6.425697
2002-01-10 1.364175 -0.019590 0 -26.987927 -0.000000 6.425697
2002-01-11 1.352609 -0.008515 0 -11.566254 -0.000000 6.425697
2002-01-14 1.359035 0.004739 1 6.425697 6.425697 12.851394
2002-01-15 1.394376 0.025672 1 35.341333 35.341333 48.192727
In [133]:
hold.head()
Out[133]:
AdjClose LR daily_change investment_value overall_return daily_change2 overall_return2
Date
2002-01-02 1.497187 0.061967 NaN 1497.187374 0.000000 NaN NaN
2002-01-03 1.515179 0.011946 17.991951 1515.179325 17.991951 17.991951 17.991951
2002-01-04 1.522248 0.004654 7.068267 1522.247591 25.060218 7.068267 25.060218
2002-01-07 1.471485 -0.033916 -50.763005 1471.484586 -25.702788 -50.763005 -25.702788
2002-01-08 1.452850 -0.012745 -18.634521 1452.850065 -44.337308 -18.634521 -44.337308
In [140]:
# Compare the holding vs random buying and selling
ax = hold['overall_return'].plot(grid=True, legend=True)
rand['investment_return'].plot(grid=True, legend=True)
hold['daily_change'].plot(grid=True, legend=True, secondary_y=True)
ax.set_ylabel('Cumulative Return [$]')
ax.right_ax.set_ylabel('Daily Price Change [$]')
ax.set_xlabel('Year')
Out[140]:
Text(0.5,0,u'Year')

Testing and Applying Models - Part 2

Here I train a simple linear model and apply it to future data. I also compare to the "buy and hold" and "stochastic" models and generate explanatory statistics and plots.

Goals

  • generate the return comparison graphs for 3 different years: 2015, 2016 and 2017
  • generate error bars on the stochastic model
  • does the stochastic model tend to the buy and hold model in the mean?

Note that since the stochastic model chooses to bet about half the time, its expected return is roughly half that of the buy and hold strategy.
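The half-of-hold expectation, and the spread around it, can be checked with a quick simulation. This sketch uses synthetic daily price changes (the drift and scale are made up) and many independent coin-flip strategies rather than the AAPL data, and it ignores the one-day lag between bet and payoff, which does not affect the expectation.

```python
import numpy as np

rng = np.random.default_rng(42)
n_days, n_sims, shares = 250, 5000, 1000

# Synthetic per-share daily price changes (made-up drift and scale).
daily_change = rng.normal(0.05, 1.0, n_days)

# One row of coin flips per simulated strategy; 1 means the shares are held for that day's change.
bets = rng.integers(0, 2, size=(n_sims, n_days))

# Final dollar P&L of each random strategy, and of always holding.
final_pnl = shares * (bets @ daily_change)
hold_pnl = shares * daily_change.sum()

mean_pnl, std_pnl = final_pnl.mean(), final_pnl.std()
print("hold P&L:        {0:10.2f}".format(hold_pnl))
print("random mean P&L: {0:10.2f}".format(mean_pnl))
print("random std P&L:  {0:10.2f}".format(std_pnl))
```

The standard deviation across simulations is the natural error bar for a single random run, and it is typically large compared with the mean, which is why one random draw can land above or below buy and hold in any given year.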

In [20]:
# Simple model -
# Train a simple model with AAPL on 1996 through 2014, then test the model on each later year

X_train = aapl[(aapl.index>'1996') & (aapl.index<'2015')][['NLR','NLR-1','NLR-2','NLR-3','vol','LV','IV','IC']]
Y_train = aapl[(aapl.index>'1996') & (aapl.index<'2015')][['NLR+1']]
In [21]:
# Let's test the model on various years (select out 2015, 2016 and 2017 for plots later)

X = aapl[(aapl.index>'2015') & (aapl.index<'2018')][['NLR','NLR-1','NLR-2','NLR-3','vol','LV','IV','IC']]
Y = aapl[(aapl.index>'2015') & (aapl.index<'2018')][['NLR+1']]

X15 = aapl[(aapl.index>'2015') & (aapl.index<'2016')][['NLR','NLR-1','NLR-2','NLR-3','vol','LV','IV','IC']]
Y15 = aapl[(aapl.index>'2015') & (aapl.index<'2016')][['NLR+1']]

X16 = aapl[(aapl.index>'2016') & (aapl.index<'2017')][['NLR','NLR-1','NLR-2','NLR-3','vol','LV','IV','IC']]
Y16 = aapl[(aapl.index>'2016') & (aapl.index<'2017')][['NLR+1']]

X17 = aapl[(aapl.index>'2017') & (aapl.index<'2018')][['NLR','NLR-1','NLR-2','NLR-3','vol','LV','IV','IC']]
Y17 = aapl[(aapl.index>'2017') & (aapl.index<'2018')][['NLR+1']]
In [22]:
X_train.describe().transpose()
Out[22]:
count mean std min 25% 50% 75% max
NLR 4784.0 0.009120 1.072609 -25.424949 -0.494113 -0.003268 0.509806 9.937587
NLR-1 4784.0 0.009229 1.072565 -25.424949 -0.493764 -0.003268 0.509806 9.937587
NLR-2 4784.0 0.009233 1.072563 -25.424949 -0.493764 -0.003268 0.509806 9.937587
NLR-3 4784.0 0.009310 1.072572 -25.424949 -0.493764 -0.002572 0.509806 9.937587
vol 4784.0 0.024499 0.019071 0.001967 0.013005 0.020015 0.030851 0.470795
LV 4784.0 -0.000127 0.418232 -1.820601 -0.250602 -0.026271 0.224533 2.830664
IV 4784.0 0.034869 0.022401 0.004150 0.019452 0.029125 0.044096 0.277191
IC 4784.0 0.509037 0.310176 0.000000 0.231474 0.508292 0.788937 1.000000
In [23]:
# Inspect the training features before creating the model
X_train.head()
Out[23]:
NLR NLR-1 NLR-2 NLR-3 vol LV IV IC
Date
1996-01-02 0.257739 -0.165892 -0.434549 0.320498 0.010242 -0.780885 0.015562 0.760000
1996-01-03 -0.024491 0.257739 -0.165892 -0.434549 0.007258 1.126808 0.031435 0.257426
1996-01-04 -0.646251 -0.024491 0.257739 -0.165892 0.012666 -0.359008 0.032003 0.188119
1996-01-05 2.816762 -0.646251 -0.024491 0.257739 0.056497 0.395767 0.084088 1.000000
1996-01-08 0.358777 2.816762 -0.646251 -0.024491 0.042763 -1.301554 0.043315 0.420000
In [24]:
Y_train.describe().transpose()
Out[24]:
count mean std min 25% 50% 75% max
NLR+1 4784.0 0.008991 1.072616 -25.424949 -0.494113 -0.003724 0.509806 9.937587
In [25]:
Y_train.head()
Out[25]:
NLR+1
Date
1996-01-02 -0.024491
1996-01-03 -0.646251
1996-01-04 2.816762
1996-01-05 0.358777
1996-01-08 -1.963353
In [88]:
from sklearn.linear_model import Ridge
from sklearn.linear_model import Lasso
In [89]:
from sklearn.metrics import mean_squared_error
In [90]:
# First model - ridge regression (the lasso model follows in the next cell)
regressor = Ridge(alpha=0.5)
regressor.fit(X_train, Y_train)
Out[90]:
Ridge(alpha=0.5, copy_X=True, fit_intercept=True, max_iter=None,
   normalize=False, random_state=None, solver='auto', tol=0.001)
In [111]:
r2 = Lasso(alpha=0.0001)
r2.fit(X_train, Y_train)
Out[111]:
Lasso(alpha=0.0001, copy_X=True, fit_intercept=True, max_iter=1000,
   normalize=False, positive=False, precompute=False, random_state=None,
   selection='cyclic', tol=0.0001, warm_start=False)
In [101]:
# Predictions for each of 2015, 2016, 2017
Y_predict = regressor.predict(X)

Y15_predict = regressor.predict(X15)
Y16_predict = regressor.predict(X16)
Y17_predict = regressor.predict(X17)
In [112]:
# Predictions for each of 2015, 2016, 2017 using the Lasso model
Y_predict2 = r2.predict(X)

Y15_predict2 = r2.predict(X15)
Y16_predict2 = r2.predict(X16)
Y17_predict2 = r2.predict(X17)
In [103]:
mean_squared_error(Y_predict, Y)
Out[103]:
0.2550722671112245
In [113]:
mean_squared_error(Y_predict2, Y)
Out[113]:
0.25551893583106344
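An MSE on its own is hard to read, so it helps to compare it with the MSE of a trivial forecast such as always predicting 0 (the full-sample mean of NLR). The sketch below defines a skill score of 1 - MSE_model / MSE_baseline. The function names and the toy arrays are illustrative and not part of the notebook; the ravel() call is there so the (n, 1) shaped Y and Ridge predictions used above would also work.

```python
import numpy as np

def mse(y_true, y_pred):
    # Flatten so that (n, 1) inputs and (n,) inputs are handled alike.
    y_true = np.asarray(y_true, dtype=float).ravel()
    y_pred = np.asarray(y_pred, dtype=float).ravel()
    return float(np.mean((y_true - y_pred) ** 2))

def skill_vs_constant(y_true, y_pred, constant=0.0):
    # 1.0 = perfect, 0.0 = no better than the constant forecast, < 0 = worse.
    y_true = np.asarray(y_true, dtype=float).ravel()
    baseline = mse(y_true, np.full(y_true.shape, constant))
    return 1.0 - mse(y_true, y_pred) / baseline

# Toy check with made-up targets.
y_toy = np.array([0.5, -0.3, 0.1, -0.4, 0.2])
perfect = skill_vs_constant(y_toy, y_toy)
useless = skill_vs_constant(y_toy, np.zeros(5))
print(perfect, useless)
```

In the notebook this would be called as skill_vs_constant(Y, Y_predict). A score near zero would say that the two MSE values above mostly reflect how quiet 2015 to 2017 returns were, rather than any forecasting skill.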
In [105]:
regressor.coef_
Out[105]:
array([[ 0.00487861, -0.00553775, -0.01053907,  0.0484403 , -2.53586982,
         0.00515271,  0.77611617, -0.18425681]])
In [114]:
r2.coef_
Out[114]:
array([ 3.06115900e-03, -6.00148672e-03, -1.09980476e-02,  4.76754539e-02,
       -3.29158265e+00,  0.00000000e+00,  1.16650863e+00, -1.79054810e-01])

What's in the modeled data

Column Name            Model        Details
daily_change           None         Difference in stock price between today and yesterday
hold_current_value     Hold         Today's AdjClose * number of shares purchased (hold investment value)
hold_current_return    Hold         Current return: today's investment value - initial investment
rand_make_bet          Random       Boolean on whether to buy on that day
rand_daily_change      Random       If a bet is made today, tomorrow will reflect that bet as (amount * daily_change)
predict                Predictive   Output of predictive model (real number)
predict_make_bet       Predictive   Boolean output of predictive model (1 if predict > 0)
predict_daily_change   Predictive   If a bet is made today, tomorrow will reflect that bet as (amount * daily_change)
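The bet-today, settle-tomorrow bookkeeping described in this table is the same for the random and predictive strategies, so it can be sketched as one small helper. The function name strategy_pnl and the toy prices are my own and not part of the notebook; the logic mirrors the columns above (daily price change times the previous day's bet, accumulated).

```python
import pandas as pd

def strategy_pnl(adj_close, make_bet, shares=1000):
    """Cumulative dollar P&L when a bet made on day t earns day t+1's price change."""
    daily_change = adj_close.diff()   # price_t - price_{t-1}
    held = make_bet.shift(1)          # yesterday's decision applies to today's change
    return (shares * daily_change * held).cumsum()

# Toy data: made-up prices and two bet patterns.
idx = pd.date_range("2020-01-01", periods=5, freq="D")
close = pd.Series([10.0, 11.0, 10.5, 12.0, 11.0], index=idx)
always = pd.Series(1, index=idx)              # betting every day reproduces buy and hold
some = pd.Series([1, 0, 1, 0, 0], index=idx)  # bets on day 1 and day 3 only

hold_like = strategy_pnl(close, always, shares=1)
picky = strategy_pnl(close, some, shares=1)
print(hold_like.iloc[-1], picky.iloc[-1])
```

Betting every day recovers the buy and hold return (last price minus first price), which is a handy sanity check on the shift direction.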
In [115]:
## Ok - gather the data for 2015


data_15 = aapl[(aapl.index > '2015') & (aapl.index < '2016')][['AdjClose','LR']] 

## Random model - randomly buy and sell stock, if this is a 1, buy (and sell the next day)
data_15['rand_make_bet']=np.random.randint(0,2, data_15.shape[0])
print("Number of days: {0}".format(data_15.shape[0]))
print("Number of random bets: {0}".format(data_15.rand_make_bet.sum()))

## Recall LR = log(C_t/C_(t-1)), so (exp(LR) - 1) * C_(t-1) = C_t - C_(t-1), the per-share price change
data_15['daily_change']=(np.exp(data_15['LR'])-1)*data_15['AdjClose'].shift(1)

## Static model - buy and hold
data_15['hold_current_value']=1000*data_15.AdjClose

initial_investment15=1000*data_15.AdjClose.iloc[0]  # first (oldest) row of the year
print("initial investment: {0:8.2f}".format(initial_investment15))

# This is the current return: today's value minus the initial investment
data_15['hold_current_return']=1000*data_15.AdjClose - initial_investment15

roi_hold_net15=data_15.hold_current_return.iloc[-1]

print("Final hold net return: {0:8.2f}".format(roi_hold_net15))

#rand['daily_change']=1000*(np.exp(hold['LR'])-1)*hold['AdjClose'].shift(1)
#data_15['rand_daily_change']=1000*data_15[['AdjClose']].apply(lambda x: x - x.shift(1))
data_15['rand_daily_change']=1000*data_15['daily_change']*data_15['rand_make_bet'].shift(1)
#data_15['rand_current_return']=data_15['rand_daily_change'].cumsum()

data_15['predict']=Y15_predict
data_15['predict_make_bet']=np.where(data_15['predict']>0, 1, 0)
print("Number of model bets: {0}".format(data_15.predict_make_bet.sum()))
data_15['predict_daily_change']=1000*data_15['daily_change']*data_15['predict_make_bet'].shift(1)
#data_15['predict_current_return']=data_15['predict_daily_change'].cumsum()

data_15['predict2']=Y15_predict2
data_15['predict_make_bet2']=np.where(data_15['predict2']>0, 1, 0)
print("Number of model bets: {0}".format(data_15.predict_make_bet2.sum()))
data_15['predict_daily_change2']=1000*data_15['daily_change']*data_15['predict_make_bet2'].shift(1)
#data_15['predict_current_return']=data_15['predict_daily_change'].cumsum()


# Calculate my return:  
print("\nBuy and Hold Results")
print("Net gain: {0:8.2f}".format(roi_hold_net15))
prct_roi_hold15= 100*roi_hold_net15/initial_investment15
print("Percent gain: {0:2.2f} %".format(prct_roi_hold15))

print("Random Results:")
roi_rand15=data_15['rand_daily_change'].cumsum().iloc[-1]
print("Net gain: {0:8.2f}".format(roi_rand15))
prct_roi_rand15=100*(roi_rand15 ) / initial_investment15
print("Percent gain: {0:2.2f} %".format(prct_roi_rand15))

print("Model Results:")
roi_model15=data_15['predict_daily_change'].cumsum().iloc[-1]
print("Net gain: {0:8.2f}".format(roi_model15))
prct_roi_model15=100*(roi_model15 ) / initial_investment15
print("Percent gain: {0:2.2f} %".format(prct_roi_model15))

# Note: the label says alpha=.001, but r2 was fit with Lasso(alpha=0.0001)
print("Model Results  (Lasso, alpha=.001):")
roi_model15_2=data_15['predict_daily_change2'].cumsum().iloc[-1]
print("Net gain: {0:8.2f}".format(roi_model15_2))
prct_roi_model15_2=100*(roi_model15_2 ) / initial_investment15
print("Percent gain: {0:2.2f} %".format(prct_roi_model15_2))
Number of days: 252
Number of random bets: 127
initial investment: 103863.96
Final hold net return: -2167.15
Number of model bets: 158
Number of model bets: 162

Buy and Hold Results
Net gain: -2167.15
Percent gain: -2.09 %
Random Results:
Net gain: -1403.68
Percent gain: -1.35 %
Model Results:
Net gain: -10773.05
Percent gain: -10.37 %
Model Results  (Lasso, alpha=.001):
Net gain: -16071.66
Percent gain: -15.47 %
In [116]:
# Ok - gather the data for 2016

data_16 = aapl[(aapl.index > '2016') & (aapl.index < '2017')][['AdjClose','LR']] 

## Random model - randomly buy and sell stock
data_16['rand_make_bet']=np.random.randint(0,2, data_16.shape[0])
print("Number of days: {0}".format(data_16.shape[0]))
print("Number of random bets: {0}".format(data_16.rand_make_bet.sum()))

## Recall LR = log(C_t/C_(t-1)), so (exp(LR) - 1) * C_(t-1) is the per-share price change
data_16['daily_change']=(np.exp(data_16['LR'])-1)*data_16['AdjClose'].shift(1)

## Static model - buy and hold
data_16['hold_current_value']=1000*data_16.AdjClose

initial_investment16=1000*data_16.AdjClose.iloc[0]  # first (oldest) row of the year
print("initial investment: {0:8.2f}".format(initial_investment16))

data_16['hold_current_return']=1000*data_16.AdjClose - initial_investment16

roi_hold_net16=data_16.hold_current_return.iloc[-1]
print("Final hold net return: {0:8.2f}".format(roi_hold_net16))


data_16['rand_daily_change']=1000*data_16['daily_change']*data_16['rand_make_bet'].shift(1)
data_16['predict']=Y16_predict
data_16['predict_make_bet']=np.where(data_16['predict']>0, 1, 0)
print("Number of model bets: {0}".format(data_16.predict_make_bet.sum()))
data_16['predict_daily_change']=1000*data_16['daily_change']*data_16['predict_make_bet'].shift(1)

data_16['predict2']=Y16_predict2
data_16['predict_make_bet2']=np.where(data_16['predict2']>0, 1, 0)
print("Number of model bets: {0}".format(data_16.predict_make_bet2.sum()))
data_16['predict_daily_change2']=1000*data_16['daily_change']*data_16['predict_make_bet2'].shift(1)
#data_15['predict_current_return']=data_15['predict_daily_change'].cumsum()


# Calculate my return:  
print("Buy and Hold Results")
print("Net gain: {0:8.2f}".format(roi_hold_net16))
prct_roi_hold16= 100*roi_hold_net16/initial_investment16
print("Percent gain: {0:2.2f} %".format(prct_roi_hold16))

print("Random Results:")
roi_rand16=data_16['rand_daily_change'].cumsum().iloc[-1]
print("Net gain: {0:8.2f}".format(roi_rand16))
prct_roi_rand16=100*(roi_rand16 ) / initial_investment16
print("Percent gain: {0:2.2f} %".format(prct_roi_rand16))

print("Model Results:")
roi_model16=data_16['predict_daily_change'].cumsum().iloc[-1]
print("Net gain: {0:8.2f}".format(roi_model16))
prct_roi_model16=100*(roi_model16 ) / initial_investment16
print("Percent gain: {0:2.2f} %".format(prct_roi_model16))

# Note: the label says alpha=.001, but r2 was fit with Lasso(alpha=0.0001)
print("Model Results  (Lasso, alpha=.001):")
roi_model16_2=data_16['predict_daily_change2'].cumsum().iloc[-1]
print("Net gain: {0:8.2f}".format(roi_model16_2))
prct_roi_model16_2=100*(roi_model16_2 ) / initial_investment16
print("Percent gain: {0:2.2f} %".format(prct_roi_model16_2))
Number of days: 252
Number of random bets: 128
initial investment: 101783.76
Final hold net return: 12605.69
Number of model bets: 161
Number of model bets: 167
Buy and Hold Results
Net gain: 12605.69
Percent gain: 12.38 %
Random Results:
Net gain: 27292.94
Percent gain: 26.81 %
Model Results:
Net gain: 19246.17
Percent gain: 18.91 %
Model Results  (Lasso, alpha=.001):
Net gain: 15901.36
Percent gain: 15.62 %
In [122]:
# Ok - gather the data for 2017

data_17 = aapl[(aapl.index > '2017') & (aapl.index < '2018')][['AdjClose','LR']] 

## Random model - randomly buy and sell stock
data_17['rand_make_bet']=np.random.randint(0,2, data_17.shape[0])
print("Number of days: {0}".format(data_17.shape[0]))
print("Number of random bets: {0}".format(data_17.rand_make_bet.sum()))

## Recall LR = log(C_t/C_(t-1)), so (exp(LR) - 1) * C_(t-1) is the per-share price change
data_17['daily_change']=(np.exp(data_17['LR'])-1)*data_17['AdjClose'].shift(1)

## Static model - buy and hold
data_17['hold_current_value']=1000*data_17.AdjClose

initial_investment17=1000*data_17.AdjClose.iloc[0]  # first (oldest) row of the year
print("initial investment: {0:8.2f}".format(initial_investment17))

data_17['hold_current_return']=1000*data_17.AdjClose - initial_investment17

roi_hold_net17=data_17.hold_current_return.iloc[-1]
print("Final hold net return: {0:8.2f}".format(roi_hold_net17))


data_17['rand_daily_change']=1000*data_17['daily_change']*data_17['rand_make_bet'].shift(1)
data_17['predict']=Y17_predict
data_17['predict_make_bet']=np.where(data_17['predict']>0, 1, 0)
print("Number of model bets: {0}".format(data_17.predict_make_bet.sum()))
data_17['predict_daily_change']=1000*data_17['daily_change']*data_17['predict_make_bet'].shift(1)

data_17['predict2']=Y17_predict2
data_17['predict_make_bet2']=np.where(data_17['predict2']>0, 1, 0)
print("Number of model bets: {0}".format(data_17.predict_make_bet2.sum()))
data_17['predict_daily_change2']=1000*data_17['daily_change']*data_17['predict_make_bet2'].shift(1)

# Calculate my return:  
print("Buy and Hold Results 2017")
print("Net gain: {0:8.2f}".format(roi_hold_net17))
prct_roi_hold17= 100*roi_hold_net17/initial_investment17
print("Percent gain: {0:2.2f} %".format(prct_roi_hold17))

print("Random Results:")
roi_rand17=data_17['rand_daily_change'].cumsum().iloc[-1]
print("Net gain: {0:8.2f}".format(roi_rand17))
prct_roi_rand17=100*(roi_rand17 ) / initial_investment17
print("Percent gain: {0:2.2f} %".format(prct_roi_rand17))

print("Model Results:")
roi_model17=data_17['predict_daily_change'].cumsum().iloc[-1]
print("Net gain: {0:8.2f}".format(roi_model17))
prct_roi_model17=100*(roi_model17 ) / initial_investment17
print("Percent gain: {0:2.2f} %".format(prct_roi_model17))

# Note: the label says alpha=.001, but r2 was fit with Lasso(alpha=0.0001)
print("Model Results  (Lasso, alpha=.001):")
roi_model17_2=data_17['predict_daily_change2'].cumsum().iloc[-1]
print("Net gain: {0:8.2f}".format(roi_model17_2))
prct_roi_model17_2=100*(roi_model17_2 ) / initial_investment17
print("Percent gain: {0:2.2f} %".format(prct_roi_model17_2))
Number of days: 249
Number of random bets: 126
initial investment: 114715.38
Final hold net return: 54514.62
Number of model bets: 158
Number of model bets: 161
Buy and Hold Results 2017
Net gain: 54514.62
Percent gain: 47.52 %
Random Results:
Net gain: 15926.57
Percent gain: 13.88 %
Model Results:
Net gain: 21241.47
Percent gain: 18.52 %
Model Results  (Lasso, alpha=.001):
Net gain: 22677.20
Percent gain: 19.77 %
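To see the three years side by side, the percent gains printed above can be collected into one small table. The numbers are copied from the outputs of the 2015, 2016 and 2017 cells; keep in mind that each "random" entry is a single draw of coin flips, so it carries a lot of noise.

```python
import pandas as pd

# Percent gains copied from the printed results for each year.
pct_gain = pd.DataFrame(
    {
        "hold":   [-2.09, 12.38, 47.52],
        "random": [-1.35, 26.81, 13.88],
        "ridge":  [-10.37, 18.91, 18.52],
        "lasso":  [-15.47, 15.62, 19.77],
    },
    index=[2015, 2016, 2017],
)

# Simple (unweighted) average gain per strategy across the three years.
mean_gain = pct_gain.mean()

# In which years did each linear model beat buy and hold?
beats_hold = pct_gain[["ridge", "lasso"]].gt(pct_gain["hold"], axis=0)

print(pct_gain)
print(mean_gain.round(2))
print(beats_hold)
```

Both linear models beat buy and hold only in 2016, and buy and hold has the highest average gain over the three years, which is in line with the conclusion at the top of the notebook.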
In [124]:
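# Scratch check of the title format string. Note that initial_investment (no year suffix) is a stale
# variable here: the value shown in the output matches initial_investment17, not the 2015 investment.
year=2015
"Investment Return on {0} in {1}".format(initial_investment, year)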
Out[124]:
'Investment Return on 114715.377802 in 2015'
In [42]:
# 2015 just loses, based on the last day, -2%
data_15.AdjClose.plot(legend=True,grid=True,title="AAPL Adjusted Close, 2015",figsize=(10,5))
Out[42]:
<matplotlib.axes._subplots.AxesSubplot at 0x1a18630490>
In [43]:
# 2016 is a big win, 12%
data_16.AdjClose.plot(legend=True,grid=True,title="AAPL Adjusted Close, 2016",figsize=(10,5))
Out[43]:
<matplotlib.axes._subplots.AxesSubplot at 0x10c0bfcd0>
In [45]:
# 2017 is a huge win, about 47.5%
data_17.AdjClose.plot(legend=True,grid=True,title="AAPL Adjusted Close, 2017",figsize=(10,5))
Out[45]:
<matplotlib.axes._subplots.AxesSubplot at 0x10c24c890>
In [117]:
## Plot the various returns - first 2015:
# Hold
# Random
# Applying the model
year=2015
ax = data_15['hold_current_return'].plot(grid=True, legend=True, figsize=(8,6), 
    title="Investment Return on {0:9.2f}$ in {1}".format(initial_investment15, year))
data_15['rand_daily_change'].cumsum().plot(grid=True, legend=True)
data_15['predict_daily_change'].cumsum().plot(grid=True, legend=True)
data_15['predict_daily_change2'].cumsum().plot(grid=True, legend=True)
# data_15['daily_change'].plot(grid=True, legend=True, secondary_y=True)
ax.set_ylabel('Cumulative Return [$]')
# ax.right_ax.set_ylabel('Daily Investment Flu [$]')
ax.set_xlabel('Year')
Out[117]:
Text(0.5,0,u'Year')
In [118]:
## Plot the various returns:
# Hold
# Random
# Applying the model
year = 2016
ax = data_16['hold_current_return'].plot(grid=True, legend=True, figsize=(8,6), 
    title="Investment Return on {0:9.2f}$ in {1}".format(initial_investment16, year))
data_16['rand_daily_change'].cumsum().plot(grid=True, legend=True)
data_16['predict_daily_change'].cumsum().plot(grid=True, legend=True)
data_16['predict_daily_change2'].cumsum().plot(grid=True, legend=True)
# data_16['daily_change'].plot(grid=True, legend=True, secondary_y=True)
ax.set_ylabel('Cumulative Return [$]')
# ax.right_ax.set_ylabel('Daily Investment Flu [$]')
ax.set_xlabel('Year')
Out[118]:
Text(0.5,0,u'Year')
In [123]:
## Plot the various returns:
# Hold
# Random
# Applying the model
year=2017
ax = data_17['hold_current_return'].plot(grid=True, legend=True, figsize=(8,6), 
    title="Investment Return on {0:9.2f}$ in {1}".format(initial_investment17, year))
data_17['rand_daily_change'].cumsum().plot(grid=True, legend=True)
data_17['predict_daily_change'].cumsum().plot(grid=True, legend=True)
data_17['predict_daily_change2'].cumsum().plot(grid=True, legend=True)
# data_17['daily_change'].plot(grid=True, legend=True, secondary_y=True)
ax.set_ylabel('Cumulative Return [$]')
# ax.right_ax.set_ylabel('Daily Investment Flu [$]')
ax.set_xlabel('Year')
Out[123]:
Text(0.5,0,u'Year')
In [65]:
## Ok - so these models are pretty crappy.  
# what is the range of outcomes for the random models?  And what does the distribution look like?
# 1. Simulate returns on (say 1000) random models
# 2. What's the mean and std of these outcomes?


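# The static data (.copy() so new columns go on an independent frame, not a view of aapl)
data_test15 = aapl[(aapl.index > '2015') & (aapl.index < '2016')][['AdjClose','LR']].copy()
# Dollar change per share for each day, recovered from the log return
data_test15['daily_change']=(np.exp(data_test15['LR'])-1)*data_test15['AdjClose'].shift(1)
results15 = np.zeros(1000)

# The stochastic model: flip a coin each day on whether to hold 1000 shares the next day
for i in range(1000):
    data_test15['rand_make_bet']=np.random.randint(0,2, data_test15.shape[0])
    data_test15['rand_daily_change']=1000*data_test15['daily_change']*data_test15['rand_make_bet'].shift(1)
    # Final cumulative return of this random strategy
    results15[i]=data_test15['rand_daily_change'].cumsum().iloc[-1]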

print("2015 Stochastic Mean Return: {0:8.2f}".format(results15.mean()))
print("2015 Stochastic Mean Return (percent): {0:8.2f}%".format(100*(results15.mean()/initial_investment15)))
print("2015 Stochastic Std: {0:8.2f}".format(results15.std()))
2015 Stochastic Mean Return:  -826.85
2015 Stochastic Mean Return (percent):    -0.80%
2015 Stochastic Std: 15080.42
In [66]:
# Look at 2016
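# The static data (.copy() so new columns go on an independent frame, not a view of aapl)
data_test16 = aapl[(aapl.index > '2016') & (aapl.index < '2017')][['AdjClose','LR']].copy()
# Dollar change per share for each day, recovered from the log return
data_test16['daily_change']=(np.exp(data_test16['LR'])-1)*data_test16['AdjClose'].shift(1)
results16 = np.zeros(1000)

# The stochastic model: flip a coin each day on whether to hold 1000 shares the next day
for i in range(1000):
    data_test16['rand_make_bet']=np.random.randint(0,2, data_test16.shape[0])
    data_test16['rand_daily_change']=1000*data_test16['daily_change']*data_test16['rand_make_bet'].shift(1)
    # Final cumulative return of this random strategy
    results16[i]=data_test16['rand_daily_change'].cumsum().iloc[-1]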
    
print("2016 Stochastic Mean Return: {0:8.2f}".format(results16.mean()))
print("2016 Stochastic Mean Return (percent): {0:8.2f}%".format(100*(results16.mean()/initial_investment16)))
print("2016 Stochastic Std: {0:8.2f}".format(results16.std()))
2016 Stochastic Mean Return:  6139.92
2016 Stochastic Mean Return (percent):     6.03%
2016 Stochastic Std: 12074.59
In [62]:
# Look at 2017
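# The static data (.copy() so new columns go on an independent frame, not a view of aapl)
data_test17 = aapl[(aapl.index > '2017') & (aapl.index < '2018')][['AdjClose','LR']].copy()
# Dollar change per share for each day, recovered from the log return
data_test17['daily_change']=(np.exp(data_test17['LR'])-1)*data_test17['AdjClose'].shift(1)
results17 = np.zeros(1000)

# The stochastic model: flip a coin each day on whether to hold 1000 shares the next day
for i in range(1000):
    data_test17['rand_make_bet']=np.random.randint(0,2, data_test17.shape[0])
    data_test17['rand_daily_change']=1000*data_test17['daily_change']*data_test17['rand_make_bet'].shift(1)
    # Final cumulative return of this random strategy
    results17[i]=data_test17['rand_daily_change'].cumsum().iloc[-1]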
    
print("2017 Stochastic Mean Return: {0:8.2f}".format(results17.mean()))
print("2017 Stochastic Mean Return (percent): {0:8.2f}%".format(100*(results17.mean()/initial_investment17)))
print("2017 Stochastic Std: {0:8.2f}".format(results17.std()))
2017 Stochastic Mean Return: 27199.69
2017 Stochastic Mean Return (percent):    23.71%
2017 Stochastic Std: 13271.84
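The three simulation cells differ only in the year. A minimal sketch of a reusable, vectorized version is shown below. The function name `simulate_random_bets`, its parameters and the toy input are assumptions for illustration and are not part of the notebook. It draws all coin flips at once with NumPy's `default_rng` instead of looping over a DataFrame, and it drops the first day, which has no previous-day bet.

```python
import numpy as np

def simulate_random_bets(daily_change, n_trials=1000, shares=1000, seed=None):
    """Final dollar return of n_trials random hold/skip strategies.

    Hypothetical helper: daily_change is the per-share dollar change for
    each trading day. Each day's change is kept or dropped according to a
    coin flip made the day before, as in the loop above.
    """
    rng = np.random.default_rng(seed)
    # Day 0 has no bet from a previous day, so it never contributes
    change = np.nan_to_num(np.asarray(daily_change, dtype=float)[1:])
    # bets[t, d] == 1 means trial t holds the stock over day d + 1
    bets = rng.integers(0, 2, size=(n_trials, change.size))
    return shares * (bets * change).sum(axis=1)

# Toy per-share changes (made-up numbers, only to show the call)
toy_change = [float("nan"), 1.0, -2.0, 0.5, 3.0]
toy_results = simulate_random_bets(toy_change, n_trials=1000, seed=0)
print("Mean: {0:8.2f}".format(toy_results.mean()))
print("Std:  {0:8.2f}".format(toy_results.std()))
```

With the notebook's data the call would presumably be `simulate_random_bets(data_test17['daily_change'])`. Because each bet is a fair coin flip, the mean should approach half of the summed daily changes times the number of shares. That is consistent with the 2017 stochastic mean of 27199.69 being roughly half of the buy-and-hold net gain of 54514.62.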

Model Results

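Looking more closely at my model results from 2015: there was a net gain of -14723.44 and a percent gain of -14.18 %. This sits on the left side of the roughly bell-shaped distribution of results from the stochastic model (mean -826.85, standard deviation 15080.42), about 0.9 standard deviations below its mean. In other words, the model's loss is within the range that random betting produces.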
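One way to quantify where the 2015 model result falls within the stochastic distribution is a standard score. The sketch below uses only the figures already printed in this notebook. The percentile relies on a normal approximation, which is an assumption, not something the notebook tests.

```python
import math

# Figures copied from the 2015 outputs in this notebook
model_return = -14723.44     # net gain of the linear model in 2015
stochastic_mean = -826.85    # mean return of the 1000 random strategies
stochastic_std = 15080.42    # their standard deviation

# Standard score of the model relative to the random strategies
z = (model_return - stochastic_mean) / stochastic_std

# Fraction of a normal distribution lying below z
# (assumption: the random returns are roughly normal)
percentile = 0.5 * (1 + math.erf(z / math.sqrt(2)))

print("z-score: {0:5.2f}".format(z))
print("Normal-approximation percentile: {0:4.1f} %".format(100 * percentile))
```

With the simulated array at hand, the empirical fraction `(results15 < model_return).mean()` would avoid the normality assumption altogether.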

In [67]:
# Quick and dirty histogram of the model outcomes
plt.hist(results15, color = 'blue', edgecolor = 'black', bins=25)
Out[67]:
(array([  1.,   3.,   3.,   2.,  12.,  18.,  29.,  50.,  48.,  91.,  70.,
         92.,  89., 124.,  86.,  82.,  60.,  50.,  36.,  25.,  15.,   7.,
          3.,   2.,   2.]),
 array([-50264.06680961, -46379.16524088, -42494.26367216, -38609.36210343,
        -34724.4605347 , -30839.55896598, -26954.65739725, -23069.75582852,
        -19184.8542598 , -15299.95269107, -11415.05112234,  -7530.14955361,
         -3645.24798489,    239.65358384,   4124.55515257,   8009.45672129,
         11894.35829002,  15779.25985875,  19664.16142747,  23549.0629962 ,
         27433.96456493,  31318.86613365,  35203.76770238,  39088.66927111,
         42973.57083983,  46858.47240856]),
 <a list of 25 Patch objects>)
In [73]:
# Quick and dirty density plot of the model outcomes
pd.DataFrame(results15).plot.density()
Out[73]:
<matplotlib.axes._subplots.AxesSubplot at 0x1a1d7e96d0>
In [87]:
ax = pd.DataFrame({'2015' : results15, '2016' : results16, '2017' : results17}).plot.hist(
    title='Histogram of Stochastic Model Returns', bins=30, alpha=0.5)
ax.set_xlabel('Return on Investment [$]')
Out[87]:
Text(0.5,0,u'Return on Investment [$]')