PyMC Introduction

6.6. PyMC Introduction#

Last revised 25-Oct-2021 by Dick Furnstahl (furnstahl.1@osu.edu);

Further revisions Dec-2023 by Daniel Phillips for PyMC3->PyMC

A good starting point for notebooks with PyMC examples is the official documentation site: https://docs.pymc.io/. We’ve adapted some examples from that site here and in other notebooks.

Aside. Here is a good quote from Rob Hicks on HMC and No U-Turn:

“The idea: rather than blindly stumbling around the posterior, use the posterior gradient to skate around the gradient contour. As you skate closer to a drop-off (gradient is steep and probability is lower), potential energy decreases and kinetic energy increases (since energy is always conserved). When this happens the skater is turned back uphill and pushed from the precipice and skates on along a posterior likelihood contour. The No U-Turn sampler keeps skating until the skater tries to turn back towards the original point.”

Imports#

import numpy as np
import scipy.stats as stats

import arviz as az
import matplotlib.pyplot as plt

import pymc as pm

# Recommended: document what PyMC version we are using
print(f'Running on PyMC v{pm.__version__}')

Running on PyMC v5.10.3

Basic setup of a model#

First we need to create a model, which will be an instance of the Model class. The model has references to all random variables (RVs) and computes the model log posterior (logp) and its gradients. We typically instantiate it using a with context. For example:

with pm.Model() as my_model:
    mu = pm.Normal('mu', mu=0, sigma=1)
    obs = pm.Normal('obs', mu=mu, sigma=1, observed=np.random.randn(100))

So my_model is an instance of the PyMC Model class, and we have set up a prior for mu in the form of a standard normal distribution (i.e., mean = 0 and standard deviation = 1). The last line sets up the likelihood, also distributed as a normal with observed data taken as 100 random draws from a standard normal distribution. The standard deviation sigma for the mu posterior is given. The goal will be to sample the posterior for mu.

Sampling#

The main entry point to MCMC sampling algorithms is via the pm.sample() function. By default, this function tries to auto-assign the right sampler(s) and auto-initialize if you don’t pass anything.

As you can see, on a continuous model, PyMC assigns the NUTS sampler, which is very efficient even for complex models. PyMC also runs variational inference (i.e. ADVI) to find good starting parameters for the sampler. Here we draw 1000 samples from the posterior and allow the sampler to adjust its parameters in an additional 500 iterations. These 500 samples are discarded by default:

with pm.Model() as my_NUTS_model:
    avg = pm.Normal('avg', mu=0, sigma=1)
    
    model = avg
    
    obs = pm.Normal('obs', mu=model, sigma=1, observed=np.random.randn(100))

    trace_NUTS = pm.sample(1000, tune=1000, return_inferencedata=True)

Auto-assigning NUTS sampler...
Initializing NUTS using jitter+adapt_diag...
Multiprocess sampling (4 chains in 4 jobs)
NUTS: [avg]

100.00% [8000/8000 00:00<00:00 Sampling 4 chains, 0 divergences]

Sampling 4 chains for 1_000 tune and 1_000 draw iterations (4_000 + 4_000 draws total) took 28 seconds.

with my_NUTS_model:    
    az.plot_trace(trace_NUTS);

../../_images/feef6d1fdf7d9d4fbd31250ebf36f0444908a3fb89b1c9cf94c98939cd2187e7.png

Available samplers#

PyMC offers a variety of samplers, found in pm.step_methods:

list(filter(lambda x: x[0].isupper(), dir(pm.step_methods)))

['BinaryGibbsMetropolis',
 'BinaryMetropolis',
 'CategoricalGibbsMetropolis',
 'CauchyProposal',
 'CompoundStep',
 'DEMetropolis',
 'DEMetropolisZ',
 'HamiltonianMC',
 'LaplaceProposal',
 'Metropolis',
 'MultivariateNormalProposal',
 'NUTS',
 'NormalProposal',
 'PoissonProposal',
 'STEP_METHODS',
 'Slice',
 'UniformProposal']

Commonly used step-methods besides NUTS include Metropolis and Slice. The claim is that for almost all continuous models, NUTS should be preferred. There are hard-to-sample models for which NUTS will be very slow causing many users to use Metropolis instead. This practice, however, is rarely successful. NUTS is fast on simple models but can be slow if the model is very complex or it is badly initialized. In the case of a complex model that is hard for NUTS, Metropolis, while faster, will have a very low effective sample size or not converge properly at all. A better approach is to instead try to improve initialization of NUTS, or reparameterize the model.

For completeness, other sampling methods can be passed to sample. Here is an example (Metropolis-Hastings):

with pm.Model() as my_Metropolis_model:
    mu = pm.Normal('mu', mu=0, sigma=1)
    obs = pm.Normal('obs', mu=mu, sigma=1, observed=np.random.randn(100))

    step = pm.Metropolis()
    trace_MH = pm.sample(1000, step=step, return_inferencedata=True)

    az.plot_trace(trace_MH);

Multiprocess sampling (4 chains in 4 jobs)
Metropolis: [mu]

100.00% [8000/8000 00:00<00:00 Sampling 4 chains, 0 divergences]

Sampling 4 chains for 1_000 tune and 1_000 draw iterations (4_000 + 4_000 draws total) took 28 seconds.
The rhat statistic is larger than 1.01 for some parameters. This indicates problems during sampling. See https://arxiv.org/abs/1903.08008 for details

../../_images/84673827983ca7a4d2393e8fbaf59fbef9f0cfd64adc9c23d153846e3f9c03ba.png

Analyze sampling results#

The most common used plot to analyze sampling results is the so-called trace-plot, now invoked with a arViz plot_trace command:

with my_NUTS_model:    
    az.plot_trace(trace_NUTS);

with pm.Model() as model:
    mu = pm.Normal('mu', mu=0, sigma=1)
    sd = pm.HalfNormal('sd', sigma=1)
    obs = pm.Normal('obs', mu=mu, sigma=sd, observed=np.random.randn(100))

    step1 = pm.Metropolis(vars=[mu])
    step2 = pm.Slice(vars=[sd])
    trace_2_samplers = pm.sample(10000, step=[step1, step2], cores=4, 
                                 return_inferencedata=True)

    az.plot_trace(trace_2_samplers);

Multiprocess sampling (4 chains in 4 jobs)
CompoundStep
>Metropolis: [mu]
>Slice: [sd]

100.00% [44000/44000 00:06<00:00 Sampling 4 chains, 0 divergences]

Sampling 4 chains for 1_000 tune and 10_000 draw iterations (4_000 + 40_000 draws total) took 34 seconds.

../../_images/4f4e01c9c876da2f7399d8d8d67f9b6c15f8321b72cebe9a772e0cf030b86e5c.png

Diagnostics#

with pm.Model() as model:
    display(az.summary(trace_2_samplers))

	mean	sd	hdi_3%	hdi_97%	mcse_mean	mcse_sd	ess_bulk	ess_tail	r_hat
mu	-0.023	0.104	-0.209	0.179	0.001	0.001	5621.0	5657.0	1.0
sd	1.034	0.074	0.902	1.179	0.000	0.000	36668.0	29073.0	1.0

with my_Metropolis_model:
    display(az.summary(trace_MH))

	mean	sd	hdi_3%	hdi_97%	mcse_mean	mcse_sd	ess_bulk	ess_tail	r_hat
mu	0.021	0.107	-0.196	0.219	0.004	0.004	604.0	588.0	1.02

with my_Metropolis_model:
    az.plot_forest(trace_MH, r_hat=True);

../../_images/af64b57d05351461f79753b9a500d6bf2763a8bfcc97af44eb485bd95f5a561e.png

with pm.Model() as my_Metropolis_model:
    pm.plot_posterior(trace_MH);

../../_images/cbc858c08720cc77ab168241e1d2ce51a68d11cd9533c87eff50d959baa73718.png

Examples from Rob Hicks#

See https://rlhick.people.wm.edu/stories/bayesian_7.html. We also have a notebook from his Bayesian 8 “story”.

We start with a very simple one parameter model and then move to slightly more complicated settings:

sigma = 3.  # standard deviation
mu = 10.    # mean
num_samples = 100  # 10**6

# sample from a normal distribution
data = stats.norm(mu, sigma).rvs(num_samples)  

# plot a histogram of the sampled data
num_bins = 20
plt.hist(data, bins=num_bins)
plt.show()

../../_images/184884eb1430818e38cd3174081c21bfc50c13012e7ac684019947d811dd2efa.png

Run the previous cell a few times to see the fluctuations. Crank up the number of samples to 10**6 to see a smoother histogram.

PyMC implementation#

We instantiate a Model with a descriptions of priors and the likelihood. Here, mu is defined to be a random variable (we want to sample this variable by generating a Markov chain) and we provide a prior distribution with associated hyper-parameters for it. The likelihood function is chosen to be Normal, with one parameter to be estimated (mu), and we use known \(\sigma\) (denoted as sigma). Our “dependent variable” is given by observed=data, where data is generated above and shown in the histogram. So we our implementing Bayes theorem in the form:

(6.2)#\[\begin{align} \newcommand{\Pr}{\textrm{pr}} \newcommand{\data}{\textbf{data}} \Pr(\mu | \sigma, \data) \propto \Pr(\data | \mu, \sigma) \times \Pr(\mu |\mu^0_\mu, \sigma^0_\mu) \end{align}\]

# parameters for the prior on mu
mu_prior = 8.
sigma_prior = 1.5  # Note this is our prior on the std of mu

# Could do this instead as:
#   basic_model = pm3.Model()
#   with basic_model:

with pm.Model() as basic_model:  

    # Prior for unknown model parameters (mean and sd of the normal pdf)
    mu = pm.Normal('Mean of Data', mu_prior, sigma_prior)
    
    # Likelihood (sampling distribution) of observations
    data_in = pm.Normal('Y_obs', mu=mu, sigma=sigma, observed=data)

Next we define how the Markov chain will be constructed. The example we are following set startvals to be the MAP and used a Metropolis step method. There always seems to be a complaint with the latest pyMC about using find_MAP to start the sampler.

chain_length = 10000

with basic_model:
    # obtain starting values via MAP (maximum a posteriori)
    startvals = pm.find_MAP(model=basic_model)  # model here is optional
    print(startvals)
    
    # instantiate sampler
    step = pm.Metropolis()   # Metropolis-Hastings

    # draw 10000 posterior samples for each chain (4 chains by default?)
    trace = pm.sample(draws=chain_length, step=step, start=startvals, 
                      return_inferencedata=True) 

100.00% [3/3 00:00<00:00 logp = -277.42, ||grad|| = 23.157]

{'Mean of Data': array(10.00394922)}

/var/folders/1s/3wbqg65x5k79l5ww8fg25yvh0000gn/T/ipykernel_15896/2949430162.py:12: FutureWarning: The `start` kwarg was renamed to `initvals` and can now do more. Please check the docstring.
  trace = pm.sample(draws=chain_length, step=step, start=startvals,
Multiprocess sampling (4 chains in 4 jobs)
Metropolis: [Mean of Data]

100.00% [44000/44000 00:14<00:00 Sampling 4 chains, 0 divergences]

Sampling 4 chains for 1_000 tune and 10_000 draw iterations (4_000 + 40_000 draws total) took 41 seconds.

# Plot the two chains
with basic_model:
    az.plot_trace(trace, figsize=(20,5));

../../_images/5901fe8643ed1bc849bd0de03c20ddd5b74edd7e548a46d119b2dd9f0db0900d.png

# Summary information on the Markov chains
with basic_model:
    display(az.summary(trace))
    

	mean	sd	hdi_3%	hdi_97%	mcse_mean	mcse_sd	ess_bulk	ess_tail	r_hat
Mean of Data	10.003	0.291	9.461	10.536	0.003	0.002	8321.0	9544.0	1.0

Remember that what we are generating is a posterior for the mean given the data and our (assumed) knowledge of the standard deviation.

So for the summary info we get the mean and standard deviation (sd) of the distribution, with an estimate of the Monte Carlo error. What does hpd stand for? “Highest posterior density” 2.5 and 97.5 are percentages, so one talks of a 95% hpd interval in this case.

From an answer online: “You create the parameter trace plots to make sure that your a priori distribution is well calibrated which is indicated by your parameters having sufficient state changes as the MCMC algorithm runs.”

“All the results are contained in the trace variable. This is a pymc results object. It contains some information that we might want to extract at times. Note that this data structure changed between pymc3 and pymc.

This was set up when we initiated our model (in specifying the prior for mu). With the variable names, we can extract chain values for each variable using trace.posterior[‘’].

Is this one chain or both chains? Check the length! Looks like both.

print(len(trace.posterior['Mean of Data']))
print(trace.posterior['Mean of Data'].shape)

4
(4, 10000)

Now for diagnostics.

Autocorrelation plots#

with basic_model:
    az.plot_autocorr(trace, figsize=(17,5));

../../_images/8442af66be99c477c5d4ba11b8b48a3585160cbab93080ad5d60dbd1f5a0d279.png

What do we see here? An autocorrelation time around 10 or so.

Acceptance rate#

First we have to flatten the samples to make it easier to compute the acceptance rate. But we can only flatten after we have converted to a regular array from a “Dataarray”.

samples=trace.posterior['Mean of Data'].values.flatten()

accept = np.sum(samples[1:] != samples[:-1])
print(accept)
print("Acceptance Rate: ", accept/samples.shape[0])

13590
Acceptance Rate:  0.33975

That looks like we have to work harder than one might have expected. It is taking the array of results and comparing each point to the previous one and including it in the sum if it is different. So if there wasn’t an acceptance, then the point remains the same. The ratio to the full length is the acceptance rate. Maybe we should define a function here instead.

def acceptance_rate(trace_array):
    """Calculate how many times the entry in the trace array changed compared
       to the total length.
    """
    changed = np.sum(trace_array[1:] != trace_array[:-1])
    total_length = trace_array.shape[0]
    return changed / total_length

acceptance_rate(samples)

0.33975

Gelman Rubin Diagnostic (quoted verbatim from the Hicks notebook)#

If our MH MCMC Chain reaches a stationary distribution, and we repeat the excercise multiple times, then we can examine if the posterior for each chain converges to the same place in the distribution of the parameter space.

Steps:

Run \(M>1\) Chains of length \(2 \times N\).
Discard the first \(N\) draws of each chain, leaving \(N\) iterations in the chain.
Calculate the within and between chain variance.
- Within chain variance: \(W = \frac{1}{M}\sum_{j=1}^M s_j^2 \) where \(s_j^2\) is the variance of each chain (after throwing out the first \(N\) draws).
- Between chain variance: \(B = \frac{N}{M-1} \sum_{j=1}^M (\bar{\theta_j} - \bar{\bar{\theta}})^2\) where \(\bar{\bar{\theta}}\) is the mean of each of the M means.
Calculate the estimated variance of \(\theta\) as the weighted sum of between and within chain variance. \(\hat{var}(\theta) = \left ( 1 - \frac{1}{N}\right ) W + \frac{1}{N}B\)
Calculate the potential scale reduction factor. \(\hat{R} = \sqrt{\frac{\hat{var}(\theta)}{W}}\)

We want this number to be close to 1. Why? This would indicate that the between chain variance is small. This makes sense, if between chain variance is small, that means both chains are mixing around the stationary distribution. Gelmen and Rubin show that when \(\hat{R}\) is greater than 1.1 or 1.2, we need longer burn-in.

Trying without find_MAP, i.e., not specifying start in pm.sample.

chain_length = 100000 

with basic_model:
    # obtain starting values via MAP
    #startvals = pm.find_MAP(model=basic_model)
    #print(startvals)
    # instantiate sampler
    step = pm.Metropolis() 

    # draw 5000 posterior samples
    trace = pm.sample(chain_length, step=step, return_inferencedata=True)

Multiprocess sampling (4 chains in 4 jobs)
Metropolis: [Mean of Data]

100.00% [404000/404000 00:25<00:00 Sampling 4 chains, 0 divergences]

Sampling 4 chains for 1_000 tune and 100_000 draw iterations (4_000 + 400_000 draws total) took 53 seconds.

# Plot the two chains
with basic_model:
    az.plot_trace(trace, figsize=(20,5));

../../_images/ab66ad951f4132a5efba3c900a38c5186d9a4bfe1701c49a0f99b6d051e33f35.png

“The diagnostics we have discussed are all univariate (they work perfectly when there is only 1 parameter to estimate). Other diagnostics have been derived for the multivariate case, but these are useful only when using Gibbs Samplers or other specialized versions of Metropolis-Hastings.

So most people examine univariate diagnostics for each variable, examine autocorrelation plots, acceptance rates and try to argue chain convergence based on that- unless they are using Gibbs or other specialized samplers.”

In-class exercise#

Let’s try to modify the code below to estimate sigma as well as the mean (first three cells are original code, copied from above):

sigma = 3.  # standard deviation
mu = 10.    # mean
num_samples = 100  # 10**6

# sample from a normal distribution
data = stats.norm(mu, sigma).rvs(num_samples)  


# plot a histogram of the sampled data
num_bins = 20
plt.hist(data, bins=num_bins)
plt.show()

../../_images/39c08e9c13f44c2c9a8cf61dbefd72693680ec6c15342e5b48bd1f15e4cea2a9.png

# parameters for the prior on mu
mu_mean_prior = 8.
mu_sd_prior = 1.5  # Note this is our prior on the std of mu

with pm.Model() as basic_model:

    # Priors for unknown model parameters
    mu = pm.Normal('Mean of Data', mu_mean_prior, mu_sd_prior)
    
    # Likelihood (sampling distribution) of observations
    data_in = pm.Normal('Y_obs', mu=mu, sigma=sigma, observed=data)

chain_length = 10000 

with basic_model:
    # obtain starting values via MAP
    startvals = pm.find_MAP(model=basic_model)
    print(startvals)
    # instantiate sampler
    step = pm.Metropolis() 

    # draw 10000 posterior samples
    trace = pm.sample(chain_length, step=step, start=startvals,
                      return_inferencedata=True) 

100.00% [3/3 00:00<00:00 logp = -277.69, ||grad|| = 24.375]

{'Mean of Data': array(10.10937835)}

/var/folders/1s/3wbqg65x5k79l5ww8fg25yvh0000gn/T/ipykernel_15896/1249115293.py:11: FutureWarning: The `start` kwarg was renamed to `initvals` and can now do more. Please check the docstring.
  trace = pm.sample(chain_length, step=step, start=startvals,
Multiprocess sampling (4 chains in 4 jobs)
Metropolis: [Mean of Data]

100.00% [44000/44000 00:02<00:00 Sampling 4 chains, 0 divergences]

Sampling 4 chains for 1_000 tune and 10_000 draw iterations (4_000 + 40_000 draws total) took 30 seconds.

# NOTE: currently there is an issue with geweke. Try again in the future.
# score=pm.geweke(trace, first=0.1, last=0.5, intervals=20)
# plt.scatter(score[0]['Mean of Data'][:,0],score[0]['Mean of Data'][:,1], 
#             marker = 'o', s=100)
# plt.axhline(-1.98, c='r')
# plt.axhline(1.98, c='r')
# plt.ylim(-2.5,2.5)
# plt.xlim(0-10,.5*trace['Mean of Data'].shape[0]/2+10)
# my_title = 'Geweke Plot Comparing first 10% and Slices of the Last 50%' +\
#            ' of Chain\nDifference in Mean Z score'
# plt.title(my_title)
# plt.show()

Ok, we’re trying it! (New code starts here.)

sigma = 3.  # standard deviation
mu = 10.    # mean
num_samples = 1000  # 100 # 10**6

# sample from a normal distribution
data = stats.norm(mu, sigma).rvs(num_samples)  


# plot a histogram of the sampled data
num_bins = 20
plt.hist(data, bins=num_bins)
plt.show()

../../_images/26e1fc3a2965f97df87e85b42f7e920186903d401ee482bdea998ca602b89277.png

# parameters for the prior on mu
mu_mean_prior = 8.
mu_sd_prior = 1.5  # Note this is our prior on the std of mu

sigma_mean_prior = 1.
sigma_sd_prior = 1.

with pm.Model() as two_param_model:

    # Priors for unknown model parameters
    mu = pm.Normal('Mean of Data', mu_mean_prior, mu_sd_prior)
    sigma = pm.Normal('SD of Data', sigma_mean_prior, sigma_sd_prior)
    
    # Likelihood (sampling distribution) of observations
    data_in = pm.Normal('Y_obs', mu=mu, sigma=sigma, observed=data)

chain_length = 10000 

with two_param_model:
    # obtain starting values via MAP
    startvals = pm.find_MAP(model=two_param_model)
    print(startvals)
    # instantiate sampler
    step = pm.Metropolis() 

    # draw 10000 posterior samples
    trace_two_param = pm.sample(chain_length, step=step, start=startvals,
                                return_inferencedata=True) 

100.00% [12/12 00:00<00:00 logp = -2,503.2, ||grad|| = 0.035828]

{'Mean of Data': array(10.08854313), 'SD of Data': array(2.93392367)}

/var/folders/1s/3wbqg65x5k79l5ww8fg25yvh0000gn/T/ipykernel_15896/2453781259.py:11: FutureWarning: The `start` kwarg was renamed to `initvals` and can now do more. Please check the docstring.
  trace_two_param = pm.sample(chain_length, step=step, start=startvals,
Multiprocess sampling (4 chains in 4 jobs)
CompoundStep
>Metropolis: [Mean of Data]
>Metropolis: [SD of Data]

100.00% [44000/44000 00:03<00:00 Sampling 4 chains, 0 divergences]

Sampling 4 chains for 1_000 tune and 10_000 draw iterations (4_000 + 40_000 draws total) took 31 seconds.

with two_param_model:
    az.plot_trace(trace_two_param, figsize=(20,5));

../../_images/ce9dfb98e5d0e8238c37e7b97a75667769837d2ffb3d1f2ccb8591a5108b3315.png

# NOTE: currently there is an issue with geweke. Try again in the future.
# score=pm.geweke(trace_two_param, first=0.1, last=0.5, intervals=20)
# plt.scatter(score[0]['Mean of Data'][:,0],score[0]['Mean of Data'][:,1], 
#             marker = 'o', s=100)
# plt.axhline(-1.98, c='r')
# plt.axhline(1.98, c='r')
# plt.ylim(-2.5,2.5)
# plt.xlim(0-10,.5*trace['Mean of Data'].shape[0]/2+10)
# my_title = 'Geweke Plot Comparing first 10% and Slices of the Last 50%' +\
#            ' of Chain\nDifference in Mean Z score'
# plt.title(my_title)
# plt.show()

# NOTE: currently there is an issue with geweke. Try again in the future.
# score=pm.geweke(trace_two_param, first=0.1, last=0.5, intervals=20)
# plt.scatter(score[0]['SD of Data'][:,0],score[0]['SD of Data'][:,1], marker = 'o', s=100)
# plt.axhline(-1.98, c='r')
# plt.axhline(1.98, c='r')
# plt.ylim(-2.5,2.5)
# #plt.xlim(0-10,.5*trace['SD of Data'].shape[0]/2+10)
# plt.title('Geweke Plot Comparing first 10% and Slices of the Last 50% of Chain\nDifference in SD Z score')
# plt.show()