The goal of this exercise is to simulate taking samples from a known distribution in order to investigate the behavaivor of the sample means. According to theory, if a sample of observations are independent and indentically sampled, then the mean of the sample will itself be distributed with a mean and variance related to the mean and variance of the underlying population. This simulation investigates this behaivor and makes a comparison with the theoretical expectation.
The simulation draws from the exponential distribution. This distribution is both similar in form and has an interesting relationship to the Poisson distribution. The exponential distribution arises by considering a continous random variable \(x\) representing the time between two adjacent events from a Poisson process or distribution.
\[ P(x, \lambda) = \lambda exp(-\lambda); \mu = E[x] = \int_0^{\infty} x P(x, \lambda) dx = \lambda^{-1}\] \[ \sigma^2 = Var[x] = E[(x-\mu)^2] = \int_0^{\infty} (x^2 - \mu^2) P(x, \lambda) dx = \lambda^{-2} \]
According to the central limit theorem, provided that the sample mean and variance are finite, if we take a sample of \(N\) observations from a known distribution, we expect the sample mean to be distributed normally with a mean, variance and standard deviation related to the underlying population distribution. In our case, the underlying population is the exponential distribution, with known \(\mu\) and \(\sigma\), and the sample average \(\bar x\) should be normally distributed with
\[ E[\bar x] = \mu = \lambda^{-1} \] \[ Var[\bar x] = \sigma^2 / N = 1 / (\lambda^{2} N)\] \[ SD[\bar x] = \sigma / \sqrt{N} = 1 / (\lambda \sqrt{N}) \]
This simulation creates a set of samples of \(N=40\) observations in order to investigate the distribution of the mean of the samples. To perform the simulation with sets of \(N\) observations the r function rexp(n, lambda)
is used to make independent observations. The observations are partitioned into sets of \(N=40\) observations. Each set of \(N=40\) observations is considered an independent sample. For the simulation \(\lambda = 0.2\) was chosen. For each sample the sample average \(\bar x\) is calculated and it’s distribution is investigated and compared to the theoretical expectation. With \(N\) and \(\lambda\) chosen, we can expect, in the limit of many simulated samplings, the average of the sample averages will approach \(\mu = \lambda^{-1}=5\), the variance of the sample averages will approach \(\sigma/ \sqrt{N} = 1 / (\lambda^{2} N) = 25/40 = 0.625\), and the standard deviation of the sample averages will appoach \(1 / (\lambda \sqrt{N}) =5/\sqrt{40} =0.79057\). For the 1000 simulations we can compare the simulatied distribution of \(\bar x\) to the theoretical normal distribution.
set.seed(123)
nsims <- 1000 # number of samples
lambda <- 0.2
n <- 40 # number of observations in each sample
# Collect the simulations in a data matrix, each row is a sample of n observations
simulation <- matrix(rexp(n*nsims,lambda),nsims)
# Calculate the mean for each sample:
means <- apply(simulation, 1, mean)
# The sample means (aveage of all the means)
mean(means) # Expect that 1/0.2 = 5
## [1] 5.011911
# The variance of the sample means
var(means) # Expect that sigma^2/N = 25/40 = 0.625
## [1] 0.6088292
# The standard deviation of the sample means
sd(means) # Expect that sigma/sqrt(N) = 5/sqrt(40) = 0.7905
## [1] 0.7802751
The simulated sample averages and distribution can be visualized with a simple histogram. We expect that on average about \(2/3\) of the simulations will have \(\bar x\) falling between \(5 \pm 0.7905\).
library(ggplot2)
q<-qplot(means, col="red",fill=I("green"),alpha=I(.2),fig.align='center',
binwidth=0.2, main="Simulation of 1000 N=40 sample means",
xlab="Sample Mean", ylab="Frequency")
q + geom_vline(xintercept=5+.79, col="blue", alpha=I(.5)) +
geom_vline(xintercept=5-.79, col="blue", alpha=I(.5))
This shows that the simulations of the draws of \(N=40\) observations indeed appears like it could be distributed normally with the expected theoretical \(\mu = 5\) and \(\sigma/ \sqrt{N} = 0.7905\). We can also plot a density estimate of this simulation against the theoretical normal distribution. This shows that the density of the 1000 simulations is becoming close to representative of the theoretical distribution.
q<-qplot(means,col="red",fill=I("green"),alpha=I(.2),geom="density",
main="Mass Density of Sample Means vs. Theory",
xlab="Sample Mean", ylab="Probability Density")
q+stat_function(fun=dnorm, args = list(mean = 5, sd = 0.7905), col="blue",alpha=I(.5))
That is pretty neat eh? This simple exercise attempts to show the central limit theorem. We found that our simuluation of samples from an exponential distribution does appear to show the expected behaivor: that the sample averages are distributed normally with mean and variance related to the Sample size \(N\) and the population mean \(\mu\) and variance \(\sigma^2\).