Consistency of Markov Chain quasi-Monte Carlo for continuous state spaces

This post is based on the paper [CDO]

and my previous talks on this topic at the UNSW statistics seminar and the Dagstuhl Workshop in 2009. The slides of my talk at Dagstuhl can be found here. I give an illustration of the results rather than rigorous proofs; the latter can be found in the paper [CDO].

The classical paper on this topic is by [Chentsov 1967]:

N. N. Chentsov, Pseudorandom numbers for modelling Markov chains, Computational Mathematics and Mathematical Physics, 7, 218–233, 1967.

Further important steps were taken by A. Owen and S. Tribble, see doi: 10.1073/pnas.0409596102 and doi: 10.1214/07-EJS162. Recent papers of interest in this context are by A. Hinrichs, doi: 10.1016/j.jco.2009.11.003, and D. Rudolf, doi: 10.1016/j.jco.2008.05.005. The slides of the Dagstuhl presentations can be found here for Hinrichs and here for Rudolf.

1. Introduction

The task is to approximate an integral

\displaystyle  I_s(f) = \int_{\Omega} f(\boldsymbol{z}) \pi(\boldsymbol{z}) \,{\rm d} \boldsymbol{z}

for some integrand {f}, domain {\Omega \subseteq \mathbb{R}^s}, and density function {\pi} by some Monte Carlo type quadrature rule

\displaystyle  Q_{N,s}(f) = \frac{1}{N} \sum_{n=1}^{N} f(\boldsymbol{x}_n)

at some sample points {\boldsymbol{x}_1,\ldots,\boldsymbol{x}_{N} \in \Omega}. To this end one needs to generate sample points {\boldsymbol{x}_1,\ldots, \boldsymbol{x}_N} with distribution {\pi}.

Generating the sample points usually involves the following steps:

  • Generate random numbers {\boldsymbol{u}_1, \boldsymbol{u}_2, \ldots \sim U (0,1)^d}.
  • Use a transformation {\Psi:(0,1)^d \mapsto \Omega} such that {\Psi(\boldsymbol{u}_i) \sim \pi}.

Depending on the properties of the problem, {\Psi} may be obtained, for example, by the inversion method, by acceptance-rejection sampling, or by a Markov chain Monte Carlo (MCMC) update, as well as by many other methods. A small sketch of the inversion method is given below.
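To make the two steps concrete, here is a minimal Python sketch (my illustration, not from the paper): for the exponential density {\pi(z) = \lambda e^{-\lambda z}} on {(0,\infty)} the inversion method gives the closed form {\Psi(u) = -\log(1-u)/\lambda}.

```python
import numpy as np

# Sketch of the two steps above, assuming the inversion method as Psi.
# Target: the exponential density pi(z) = lam * exp(-lam * z) on (0, inf),
# whose inverse CDF is Psi(u) = -log(1 - u) / lam.

def psi_exponential(u, lam=1.0):
    """Map uniforms u in [0,1) to samples with exponential density pi."""
    return -np.log(1.0 - u) / lam

rng = np.random.default_rng(0)
u = rng.random(10_000)      # step 1: driver sequence u_1, u_2, ... ~ U(0,1)
x = psi_exponential(u)      # step 2: x_n = Psi(u_n) has distribution pi

# Quadrature rule Q_{N,s}(f) for f(z) = z; the exact integral is 1/lam = 1.
print(x.mean())             # should be close to 1.0
```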

2. Inverse Slice Sampler

We will use the slice sampler to illustrate the ideas of the paper. The following pictures show how the slice sampler works.

The points along the lines {1, 2, 3, \ldots} are chosen using a driver sequence {u_1, u_2, \ldots \in [0,1)}, whose elements are rescaled to the length of the corresponding line. Notice that if one multiplies the probability density function by a constant {c > 0}, then the algorithm yields exactly the same point set. Hence we do not need to assume that the probability density function integrates to {1}.

An introduction by R. Neal can be found at doi: 10.1214/aos/1056562461 and a Java applet illustrating the procedure can be found here.

Formally, the slice sampler in dimension one is defined in the following way. Let {\pi} be a probability density function on {\Omega\subset\mathbb{R}}. Let {\Omega' = \{ (y,x)\mid x\in\Omega,\ 0\le y\le \pi(x)\} \subset\mathbb{R}^{2}} and let {\pi'} be the uniform distribution on {\Omega'}. Choose a starting point {x_0 \in \Omega}. Then choose a random number {y_0} uniformly distributed in the interval {[0, \pi(x_0)]}. Next choose {x_1} uniformly distributed on the slice {\{z: \pi(z) \ge y_0\}}. Then choose a uniform random number {y_1} in the interval {[0, \pi(x_1)]}, and so on.
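Here is a minimal Python sketch of this recursion, assuming the unnormalised Gaussian density {\pi(x) = e^{-x^2/2}}, for which the slice {\{z: \pi(z) \ge y\}} is the explicit interval {[-\sqrt{-2\log y}, \sqrt{-2\log y}]} (this choice of density is mine, for illustration):

```python
import numpy as np

def pi_unnorm(x):
    """Unnormalised Gaussian density; constant factors do not matter here."""
    return np.exp(-0.5 * x * x)

def slice_sampler(x0, u):
    """One-dimensional slice sampler driven by u, two numbers per update."""
    xs = [x0]
    x = x0
    for n in range(len(u) // 2):
        y = max(u[2 * n], 1e-12) * pi_unnorm(x)  # y_n ~ U[0, pi(x_n)]; the max
        r = np.sqrt(-2.0 * np.log(y))            #   guards against u == 0
        x = -r + u[2 * n + 1] * 2.0 * r          # driver number rescaled to the
        xs.append(x)                             #   slice {z: pi(z) >= y} = [-r, r]
    return np.array(xs)

rng = np.random.default_rng(1)
driver = rng.random(2_000)   # i.i.d. driver; a CUD sequence could be used instead
chain = slice_sampler(x0=0.1, u=driver)
print(chain.mean(), chain.var())   # should be roughly 0 and 1
```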

3. Random Number Generation

In MCMC theory one assumes that the driver sequence {u_1, u_2, \ldots} is random. In practice, however, pseudo-random numbers are used. These pseudo-random numbers are deterministically generated sequences of numbers in {[0,1)} which appear random. That means they are designed to satisfy certain statistical tests; for example, they should be uniformly distributed in {[0,1)} and they should be independent of one another. On the other hand, because of the vast number of statistical tests one could apply to these pseudo-random numbers, they cannot pass every conceivable statistical test.

The following two questions arise from the above:

  • Which properties should the pseudo-random numbers satisfy if used in a MCMC algorithm in order to obtain the correct result? (Consistency)

  • If we know the property, can the pseudo-random numbers be improved (with respect to this property)? (Convergence)

What we are asking is: what is the criterion which should be applied to pseudo-random numbers used in MCMC algorithms, and, once this is known, how can we optimise pseudo-random numbers with respect to this criterion? Until now the usual approach has been to make the pseudo-random numbers appear as random as possible, since MCMC theory assumes random numbers as input.

Previous work in this direction is by [Chentsov 1967] cited above; Owen and Tribble 2005, doi: 10.1073/pnas.0409596102, and Tribble and Owen 2008, doi: 10.1214/07-EJS162, showed consistency if {\Omega} is a finite set. Our paper extends these results to infinite sets {\Omega \subseteq \mathbb{R}^s}.

Before we investigate quality criteria for pseudo-random numbers, let us consider how pseudo-random numbers are used in quasi-Monte Carlo (qMC). This topic is well studied and also gives some ideas on what might be important for pseudo-random numbers for MCMC.

4. Quasi-Monte Carlo

Let us consider a simple example of a function {f:[0,1] \mapsto \mathbb{R}}. We want to estimate {\int_0^1 f(x)\,{\rm d} x} using the average {\frac{1}{N} \sum_{n=1}^N f(x_n)}.

It is best to use points which are uniformly distributed in {[0,1]}, that is, with no `large gaps' in between them. A one-dimensional example of a low-discrepancy sequence is the van der Corput sequence, which is defined in the following way: Let {n \in \mathbb{N}_0} have base {2} representation {n = n_0 + n_1 2 + \cdots + n_a 2^a}, where {n_0, \ldots, n_a \in \{0,1\}}. Then define the radical inverse function

\displaystyle  \phi(n) = \frac{n_0}{2} + \frac{n_1}{2^2} + \cdots + \frac{n_a}{2^{a+1}}.

Then the van der Corput sequence is given by {\phi(0), \phi(1), \phi(2), \ldots}. The first few points of the van der Corput sequence are {0, 1/2, 1/4, 3/4, 1/8, 5/8, 3/8, 7/8, 1/16, 9/16, \ldots}. Hence it fills all the gaps as the number of points increases.
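In code, the radical inverse is easy to compute; here is a short Python version (a standard construction, written here for illustration):

```python
def radical_inverse(n, base=2):
    """Van der Corput radical inverse: mirror the base-b digits of n
    about the decimal point."""
    phi, denom = 0.0, 1.0
    while n > 0:
        n, digit = divmod(n, base)
        denom *= base
        phi += digit / denom
    return phi

# First points of the van der Corput sequence:
# 0, 1/2, 1/4, 3/4, 1/8, 5/8, 3/8, 7/8, 1/16, 9/16
print([radical_inverse(n) for n in range(10)])
```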

In qMC we do not assume that the driver sequence is random; rather, we think of the sequence {u_1, u_2, \ldots \in [0,1)} as a deterministic sequence with certain properties. The essential property here is that the sequence {u_1, u_2, \ldots} has `no large gaps' in between. The technical term for this statement is that the sequence {u_1, u_2, \ldots} has low discrepancy.

In statistical notation, the discrepancy is the Kolmogorov-Smirnov distance between the empirical distribution of the points and the uniform distribution.

Formally, the discrepancy of a point set {P = \{\boldsymbol{x}_1, \ldots, \boldsymbol{x}_N\} \subset [0,1)^s} is defined by

\displaystyle D^\ast_{N,s}(P) = \sup_{\boldsymbol{a} = (a_1,\ldots, a_s) \in [0,1]^s} \left|\frac{A_N(P, [\boldsymbol{0}, \boldsymbol{a}))}{N} - a_1\cdots a_s \right|,

where {A_N(P, [\boldsymbol{0}, \boldsymbol{a}))} is the number of points of {P} in the interval {[\boldsymbol{0}, \boldsymbol{a}) = \prod_{j=1}^s [0,a_j)}.
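In dimension {s = 1} the star discrepancy of {N} points can be computed exactly from the sorted points via a classical closed-form formula (a standard fact, not from the paper); a short sketch:

```python
import numpy as np

def star_discrepancy_1d(points):
    """Exact star discrepancy in dimension s = 1, using the classical
    formula D*_N = max_i max(i/N - x_(i), x_(i) - (i-1)/N)
    for the sorted points x_(1) <= ... <= x_(N) in [0,1)."""
    x = np.sort(np.asarray(points, dtype=float))
    n = len(x)
    i = np.arange(1, n + 1)
    return max(np.max(i / n - x), np.max(x - (i - 1) / n))

n = 1024
rng = np.random.default_rng(2)
print(star_discrepancy_1d(rng.random(n)))                            # ~ n^(-1/2)
print(star_discrepancy_1d((2 * np.arange(1, n + 1) - 1) / (2 * n)))  # 1/(2n), the minimum
```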

In general, qMC sequences are constructed to have low discrepancy.

Would such a sequence work for an MCMC algorithm? Consider the following example of the van der Corput sequence in a slice sampler.

The answer is No. The points obtained do not have the distribution properties given by the probability density function.
Low discrepancy guarantees good distribution properties, but the discrepancy does not measure correlations between points. The points of the van der Corput sequence are highly correlated, since {\phi(0), \phi(2), \phi(4), \ldots \in [0,1/2)} and {\phi(1), \phi(3), \phi(5), \ldots \in [1/2, 1)}. Hence they fail statistical tests for randomness.

The property we need is called complete uniform distribution (CUD).

Definition 1 Let {u_1, u_2, u_3, \ldots \in [0,1)} and define {\boldsymbol{u}^{(d)}_1 = (u_1,\ldots, u_d), \boldsymbol{u}^{(d)}_2 = (u_2,\ldots, u_{d+1}), \ldots \in [0,1)^d}. Then {(u_k)_{k \ge 1}} is completely uniformly distributed (CUD) if

\displaystyle \forall d \ge 1: D^\ast_{N,d}(\{\boldsymbol{u}^{(d)}_1,\ldots, \boldsymbol{u}^{(d)}_N\}) \rightarrow 0 \mbox{ as } N \rightarrow \infty.

Consider the van der Corput sequence with {d = 2}. Then the first few overlapping pairs are {\boldsymbol{u}_1^{(2)} = (0,1/2)}, {\boldsymbol{u}_2^{(2)} = (1/2,1/4)}, {\boldsymbol{u}_3^{(2)} = (1/4,3/4)}, {\boldsymbol{u}_4^{(2)} = (3/4,1/8)}, {\boldsymbol{u}_5^{(2)} = (1/8,5/8)}, and so on. Since consecutive points of the van der Corput sequence always lie in different halves of {[0,1)}, none of these pairs ever falls in {[0,1/2)^2} or {[1/2,1)^2}, so {D^\ast_{N,2} \ge 1/4} and the sequence is not CUD.
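This failure is easy to verify numerically; the following sketch (my illustration) forms the overlapping pairs and counts how many have both coordinates in the same half of {[0,1)}:

```python
def radical_inverse(n, base=2):
    """Van der Corput radical inverse (as in the sketch above)."""
    phi, denom = 0.0, 1.0
    while n > 0:
        n, digit = divmod(n, base)
        denom *= base
        phi += digit / denom
    return phi

u = [radical_inverse(n) for n in range(1_000)]
pairs = list(zip(u, u[1:]))           # overlapping pairs u_k^(2) = (u_k, u_{k+1})
same_half = sum((a < 0.5) == (b < 0.5) for a, b in pairs)
print(same_half)  # 0: no pair lands in [0,1/2)^2 or [1/2,1)^2, so not CUD
```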

5. Result

Roughly speaking, we prove that under certain assumptions: if the driver sequence {u_1, u_2, \ldots \in [0,1)} is CUD, then the MCMC algorithm converges to the correct result. Note that CUD sequences are not `more' random; rather, they have the right property.

The main assumptions are roughly speaking the following (though there are several other technical assumptions):

  • (i) The MCMC algorithm works if one uses random numbers as the driver sequence.
  • (ii) There exists a coupling region.

The first property just means that we need to assume that the MCMC algorithm works if one uses random numbers. In other words, if random numbers do not work, then there is a fundamental difficulty in the problem and we cannot expect pseudo-random numbers to fix it. In particular, if one chooses a point {\boldsymbol{x}_0} with the desired distribution and applies the update function {m} times to obtain a point {\boldsymbol{x}_m} based on an i.i.d. uniformly distributed driver sequence, then {\boldsymbol{x}_m} also has the desired distribution. If one update requires {d} random numbers, then to obtain {\boldsymbol{x}_m} we need {md} random numbers.

The second assumption is more subtle. We illustrate it using the slice sampler below. Let {X = (\boldsymbol{x}_n)_{n\ge 0}} and {Z = (\boldsymbol{z}_n)_{n \ge 0}} be two sequences with different starting points but which are updated using the same driver sequence {u_1, u_2, \ldots}.

Here, the fourth lines for {X} and {Z} lie in the coupling region; that is, if the vertical value of the fourth line of {X} is, say, {a} and the vertical value of the fourth line of {Z} is, say, {b}, then {\{x: \pi(x) \ge a\} = \{x: \pi(x) \ge b\}}. Hence, since both sequences are updated using the same driver sequence, the fifth lines of {X} and of {Z} will coincide. From then on we have {x_n = z_n}, and hence the sequences coincide except for a finite number of points.

We now illustrate the proof that the sequence {X = (\boldsymbol{x}_n)_{n \ge 0}} works. `Works' means that the points {\boldsymbol{x}_n} have the desired distribution. Since, by assumption, MCMC based on random numbers yields points with the desired distribution, we can choose another sequence {Z = (\boldsymbol{z}_n)_{n \ge 0}} which we know has the right distribution. We assume that both sequences are updated with the same driver sequence {u_1,u_2, \ldots}. Since {Z} is an ideal sequence, we can assume that {\boldsymbol{z}_0} has the right distribution, whereas for {\boldsymbol{x}_0} we cannot make such an assumption. From the coupling region it then follows that, with positive probability, after a finite number of steps the two sequences coincide (and stay the same from then on). Since {Z} works, it follows that, apart from a finite number of points which can be neglected, {X} works as well.
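The following sketch illustrates the coupling numerically, assuming the unnormalised density {\pi(x) = 1 + x} on {[0,1]} (my example, not from the paper): whenever the vertical step gives {y \le 1}, the slice is all of {[0,1]}, so two chains sharing the driver sequence produce the same next point and coincide from then on.

```python
import numpy as np

def step(x, u1, u2):
    """One slice sampler update for the unnormalised density pi(x) = 1 + x
    on [0, 1]; the slice {z : 1 + z >= y} is the interval [max(0, y-1), 1]."""
    y = u1 * (1.0 + x)                  # vertical step: y uniform in [0, pi(x)]
    left = max(0.0, y - 1.0)            # left endpoint of the slice
    return left + u2 * (1.0 - left)     # horizontal step: rescaled driver number

rng = np.random.default_rng(3)
u = rng.random(200)
x, z = 0.05, 0.95                       # two chains with different starting points
for n in range(100):
    x = step(x, u[2 * n], u[2 * n + 1])
    z = step(z, u[2 * n], u[2 * n + 1])
    if x == z:                          # exact coupling: both slices were [0, 1]
        print("chains coincide from step", n + 1, "on")
        break
```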

The details of the proof can be found in the paper [CDO].

6. Numerical Results

The numerical results by Tribble and Owen show improvements using certain completely uniformly distributed sequences; see Tribble's PhD thesis. Tribble found the largest improvements for the Gibbs sampler, where the driver sequence is produced by a linear feedback shift register (LFSR). The period length of the LFSR is chosen such that the full period can be used and, additionally, the input sequence is randomised.
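As a rough idea of such a driver sequence, here is a textbook 16-bit Galois LFSR in Python (standard example parameters; Tribble's actual generators and the randomisation step are more elaborate):

```python
def lfsr_uniforms(seed=0xACE1, mask=0xB400, nbits=16):
    """Galois LFSR with feedback polynomial x^16 + x^14 + x^13 + x^11 + 1
    (primitive, so the period is 2^16 - 1); states are rescaled to (0, 1)."""
    state = seed
    for _ in range(2**nbits - 1):
        lsb = state & 1
        state >>= 1
        if lsb:
            state ^= mask
        yield state / 2**nbits

u = list(lfsr_uniforms())
print(len(u), len(set(u)))  # 65535 distinct values: the full period is used
```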
