In his book The Signal and the Noise, Nate Silver examines several reasons why making forecasts is a subtle and difficult business. One place where forecasts are notoriously problematic is polling. In theory, a well-designed poll should be a good predictor of how a vote will turn out. In reality, many biases and systematic errors affect polls, and even when polling is done well there is a strong lag effect: two months out, even a good poll is still very noisy.
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
# Function to get a sample from an approximately binomial distribution:
# with probability `error` a respondent answers at random (p=0.5),
# otherwise they answer according to the true distribution (probability p)
def get_sample(n=50, p=0.7, error=0.1):
    def _sample():
        if np.random.random() > error:
            return np.random.binomial(1, p)
        else:
            return np.random.binomial(1, 0.5)
    return [_sample() for i in range(n)]
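As a quick sanity check (an illustrative addition), the mean of a large error-free sample should land close to p:
# Sanity check: with error=0 the sample mean should be close to p=0.7
np.mean(get_sample(n=100000, p=0.7, error=0.0))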
# Let's calculate the expected sample mean and the expected standard deviation
# We'll start with N = 10 people => about 14% error expected in the poll's estimate of the mean
N = 10
p = 0.7
_smean0 = N*p
_sstd0 = np.sqrt(N*p*(1-p))
print("Expected Sample Mean: {0:2.2f}".format(_smean0))
print("Expected Sample Std: {0:2.2f}".format(_sstd0))
print("Standard deviation as a percent: +/- {0:2.2f}".format(_sstd0/N))
# Let's calculate this again with 100 people
# => about 5% error expected in the estimate
N = 100
p = 0.7
_smean = N*p
_sstd = np.sqrt(N*p*(1-p))
print("Expected Sample Mean: {0:2.2f}".format(_smean))
print("Expected Sample Std: {0:2.2f}".format(_sstd))
print("Standard deviation as a percent: +/- {0:2.2f}".format(_sstd/N))
# OK, let's look at a sample with no systematic polling error (error = 0)
s1 = pd.DataFrame( {'sample1' : get_sample(n=N, p=p, error=0.0)})
# What does the data look like:
s1.head()
# This is our poll result (the sum of the responses); with N = 100 it can be read directly as a percentage estimate
s1.sample1.sum()
# How would this result differ if we polled N=1000 people?
# In this case the expected error drops to about 1.5%
N = 1000
p = 0.7
_smean2 = N*p
_sstd2 = np.sqrt(N*p*(1-p))
print("Expected Sample Mean: {0:2.2f}".format(_smean2))
print("Expected Sample Std: {0:2.2f}".format(_sstd2))
print("Standard deviation as a percent: +/- {0:2.2f}".format(_sstd2/N))
# Let's do this one more time. How would this result differ if we polled N=10000 people?
# The error is now about half a percent
N = 10000
p = 0.7
_smean3 = N*p
_sstd3 = np.sqrt(N*p*(1-p))
print("Expected Sample Mean: {0:2.2f}".format(_smean3))
print("Expected Sample Std: {0:2.2f}".format(_sstd3))
print("Standard deviation as a percent: +/- {0:2.4f}".format(_sstd3/N))
If we take repeated polls, we can see how well our estimate converges to the true mean. For example, if we repeat the poll $M$ times, we'd expect the error of the mean of those estimates to be about $\sigma / \sqrt{M}$. Here we'll do 20 repetitions of an N = 100 poll.
# Reset to N = 100 respondents per poll (N was changed above)
N = 100
s20 = pd.DataFrame()
for i in range(20):
    col = 's' + str(i)
    s20[col] = get_sample(n=N, p=p, error=0.0)
s20.head(3)
# The sum of each column is one poll result: 20 repetitions of N=100 samples
results = pd.DataFrame( {'estimate' : s20.sum(0)})
results.head()
# After 20 polls, the mean of the estimates should be a better estimate, with its own standard deviation of roughly sigma/sqrt(20-1)
print(results.estimate.mean())
# The theoretical standard deviation of the estimate of the mean: sigma/sqrt(20-1)
print("The estimate of the mean is likely to be within {0:2.2f} of the true mean".format(_sstd/np.sqrt(20-1)))
# This estimates the actual standard deviation
print("The estimate of the standard devaition is {0:2.2f}".format(results.estimate.std()))
# we can take a quick look at the histogram
results.hist(figsize=(8,4))
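As an extra check (illustrative, using pandas' built-in standard error), the standard error of the mean across the 20 poll estimates should be close to $\sigma/\sqrt{20} \approx 1$:
# Observed standard error of the mean across the 20 poll estimates
print("Observed standard error of the mean: {0:2.2f}".format(results.estimate.sem()))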
# Let's do this for a larger set of simulations, say 100
s100 = pd.DataFrame()
for i in range(100):
    col = 's' + str(i)
    s100[col] = get_sample(n=N, p=p, error=0.0)
r100 = pd.DataFrame( {'estimate' : s100.sum(0)})
# The mean of the 100 estimates should be a slightly better estimate of the real mean, within sigma/sqrt(100 - 1)
print("Real mean likely to be within {0:2.2f} of estimate of the mean".format(_sstd/np.sqrt(100-1)))
print("100 Trials, mean of the estimate {0:2.2f}".format(r100.estimate.mean()))
print("100 Trials, standard deviation of the estimate {0:2.2f}".format(r100.estimate.std()))
r100.hist()
# And again for a larger set of simulations, say 1000
s1000 = pd.DataFrame()
for i in range(1000):
    col = 's' + str(i)
    s1000[col] = get_sample(n=N, p=p, error=0.0)
r1000 = pd.DataFrame( {'estimate' : s1000.sum(0)})
# The mean of the 1000 estimates should be a slightly better estimate of the real mean, within sigma/sqrt(1000 - 1)
print("Real mean likely to be within {0:2.2f} of estimate of the mean".format(_sstd/np.sqrt(1000-1)))
print("1000 Trials, mean of the estimate {0:2.2f}".format(r1000.estimate.mean()))
print("1000 Trials, standard deviation of the estimate {0:2.2f}".format(r1000.estimate.std()))
r1000.hist(bins=20)
What if, in a real poll with a binary yes/no question, there's a systematic polling error? For example, what if 10% of the population misinterprets or misreads the question? How would this affect our polling results?
In this model, say, 90% of the polling is expected to sample from the true distribution $B_p$ while 10% is expected to sample stochastically from $B_{0.5}$
We can calculate the expected value of a polled response as $E[X] = 0.9 \, E[X \mid X \sim B_p] + 0.1 \, E[X \mid X \sim B_{0.5}] = 0.9 \times 0.7 + 0.1 \times 0.5 = 0.68$
# Real expected mean:
0.9*0.7 + 0.1 * 0.5
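A mixture of Bernoulli responses is itself a Bernoulli response with the mixed probability, so each respondent in this model effectively behaves like a draw from $B_{0.68}$. A quick illustrative simulation shows the shift:
# Illustrative: with a 10% error rate the sample mean drifts toward 0.68 rather than 0.70
np.mean(get_sample(n=100000, p=0.7, error=0.1))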
# Let's start with doing 20 trials
# However, now there's a 10% polling error rate, that is, 10% of respondents answer randomly
ss20 = pd.DataFrame()
for i in range(20):
    col = 's' + str(i)
    ss20[col] = get_sample(n=N, p=p, error=0.1)
# For a single trial, we expect the mean to go down from 70 to something closer to 68
ss20.s0.sum()
rs20 = pd.DataFrame( {'estimate' : ss20.sum(0)})
# The mean of the estimates should now *NO LONGER* land within sigma/sqrt(20 - 1) of the real population mean
print("Ideally, the real mean would be within {0:2.2f} of the estimate of the mean".format(_sstd/np.sqrt(20-1)))
print("However, ")
print("20 Trials, mean of the estimate {0:2.2f}".format(rs20.estimate.mean()))
print("20 Trials, standard deviation of the estimate {0:2.2f}".format(rs20.estimate.std()))
rs20.hist(bins=15)
# Let's do one more: 1000 trials with N=100, drawn from a mixture of Bernoulli(0.7) and Bernoulli(0.5)
# As before, there's a 10% polling error rate, that is, 10% of respondents answer randomly
ss1000 = pd.DataFrame()
for i in range(1000):
    col = 's' + str(i)
    ss1000[col] = get_sample(n=N, p=p, error=0.1)
rs1000 = pd.DataFrame( {'estimate' : ss1000.sum(0)})
# 1000 trials from our stochastic mixture
print("Ideally, real mean would likely to be within {0:2.2f} of estimate of the mean".format(_sstd/np.sqrt(1000-1)))
print("However, ")
print("1000 Trials, mean of the estimate {0:2.2f}".format(rs1000.estimate.mean()))
print("1000 Trials, standard deviation of the estimate {0:2.2f}".format(rs1000.estimate.std()))
# With 10% noise, the poll estimate is shifted away from the actual population value, from 70% to about 68%
rs1000.hist(bins=15)
What if the true population distribution were heavily in favor of something, say p = 0.85, i.e. 85% of the population would actually agree with the question, but the systematic error were higher, say 20%, that is, 20% of respondents misread the question?
What would the data show in this case?
In this case, the conclusions drawn from the poll trials would be quite different from the real underlying distribution.
# The real expected mean:
ss2_mean = 0.8*0.85 + 0.2 * 0.5
ss2_mean
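The systematic bias alone is therefore about 7 points, and (as a small illustrative calculation) it does not depend on how many people we poll:
# Systematic bias in this scenario: true population value minus the expected polled value
true_p = 0.85
polled_p = 0.8*0.85 + 0.2*0.5
print("Systematic bias: {0:2.2f}".format(true_p - polled_p))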
# Let's do one more: 1000 trials with N=100, drawn from a mixture of Bernoulli(0.85) and Bernoulli(0.5)
# Now there's a 20% polling error rate, that is, 20% of respondents answer randomly
ss2 = pd.DataFrame()
for i in range(1000):
    col = 's' + str(i)
    ss2[col] = get_sample(n=N, p=0.85, error=0.2)
rs2 = pd.DataFrame( {'estimate' : ss2.sum(0)})
# Ok, 1000 trials were performed from our new stochastic mixture
# This is a new rough estimate of the std (using p = 0.85)
_ss2_std = np.sqrt(N*0.85*0.15)
print("Ideally, the real mean would be within {0:2.2f} of the estimate of the mean".format(_ss2_std/np.sqrt(1000-1)))
print("However, ")
print("1000 Trials, mean of the estimate {0:2.2f}".format(rs2.estimate.mean()))
print("1000 Trials, standard deviation of the estimate {0:2.2f}".format(rs2.estimate.std()))
# Now - with this 20% noise, and the actual population distribution with p of 85%, the
# samples would no longer give a good estimate of the real population mean
rs2.hist(bins=15)
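To close the loop, here's an illustrative check (under the same assumptions as above) that polling more people only sharpens the estimate around the biased value near 78%, not around the true 85%:
# The systematic error does not average away with more respondents:
# even a very large single poll converges to ~0.78, not the true 0.85
for n in [100, 1000, 10000, 100000]:
    print("N = {0:6d}: poll estimate = {1:2.3f}".format(n, np.mean(get_sample(n=n, p=0.85, error=0.2))))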