This post is based on the paper
- [CDO] S. Chen, J. Dick, and A. Owen, Consistency of Markov Chain Quasi-Monte Carlo on continuous state spaces, submitted 2009;
and my previous talks on this topic at the UNSW statistics seminar and the Dagstuhl Workshop in 2009. The slides of my talk at Dagstuhl can be found here. I give an illustration of the results rather than rigorous proofs, which can be found in the paper [CDO].
The classical paper on this topic is by [Chentsov 1967]:
N. N. Chentsov, Pseudorandom numbers for modelling Markov chains, Computational Mathematics and Mathematical Physics, 7, 218–233, 1967.
Further important steps were taken by A. Owen and S. Tribble, see doi: 10.1073/pnas.0409596102 and doi: 10.1214/07-EJS162. Recent papers of interest in this context are also by A. Hinrichs doi:10.1016/j.jco.2009.11.003, and D. Rudolf doi: 10.1016/j.jco.2008.05.005. The slides of the Dagstuhl presentations of Hinrichs can be found here and of Rudolf can be found here.
The task is to approximate an integral

$I(f) = \int_D f(x) \varphi(x) \, dx$

for some integrand $f$, domain $D \subseteq \mathbb{R}^d$, and density function $\varphi$ by some Monte Carlo type quadrature rule

$\frac{1}{N} \sum_{n=0}^{N-1} f(x_n)$

at some sample points $x_0, \ldots, x_{N-1} \in D$. To this end one needs to generate sample points with distribution $\varphi$.
Generating the sample points usually involves the following steps:
- Generate random numbers $u_n \in [0,1)^s$.
- Use a transformation $\psi$ such that $x_n = \psi(u_n)$ has distribution $\varphi$.
Depending on the properties of the problem, the transformation $\psi$ may be obtained by one of the following methods:
- the inversion method,
- the acceptance-rejection method,
- Markov chain Monte Carlo (MCMC),
as well as many others.
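As an illustration of the second step above (transforming uniform numbers $u_n$ into samples with a prescribed density), here is a minimal sketch of the inversion method; the exponential distribution and the function name `psi_exponential` are my own illustrative choices:

```python
import math
import random

def psi_exponential(u, rate=1.0):
    """Transformation psi for the Exponential(rate) distribution:
    the inverse of the CDF F(x) = 1 - exp(-rate * x),
    so x = -log(1 - u) / rate."""
    return -math.log(1.0 - u) / rate

rng = random.Random(42)
samples = [psi_exponential(rng.random()) for _ in range(100_000)]
mean = sum(samples) / len(samples)   # should be close to 1/rate = 1
```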
2. Inverse Slice Sampler
We will use the slice sampler to illustrate the ideas of the paper. The following pictures show how the slice sampler works.
The points along the lines are chosen using a driver sequence; each driver value is then rescaled to the length of the corresponding line. Notice that if one multiplies the probability density function by a constant, then the algorithm yields exactly the same point set. Hence we do not need to assume that the probability density function integrates to $1$.
Formally, the slice sampler in dimension one is defined in the following way. Let $\varphi$ be a probability density function on $\mathbb{R}$, and for $t > 0$ let $L(t) = \{x \in \mathbb{R} : \varphi(x) \ge t\}$ denote the level set at height $t$. Choose a starting point $x_0$. Then choose a random number uniformly distributed in the interval $[0, \varphi(x_0)]$, say $t_1$. Next choose a uniform random number on the level set $L(t_1)$, say $x_1$. Then choose a uniform random number $t_2$ in the interval $[0, \varphi(x_1)]$, and so on.
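A minimal Python sketch of this one-dimensional slice sampler follows; the rejection-based sampling of the level set and the truncation to a bounded interval are implementation choices of mine, not part of the formal definition:

```python
import math
import random

def slice_sampler(phi, x0, n_steps, rng, x_range=(-10.0, 10.0)):
    """One-dimensional slice sampler.  phi need not be normalized.
    The uniform draw on the level set {y : phi(y) >= t} is done by
    rejection from x_range, which assumes the support of phi is
    (essentially) contained in x_range."""
    xs = []
    x = x0
    for _ in range(n_steps):
        t = rng.random() * phi(x)      # vertical: uniform on [0, phi(x)]
        while True:                    # horizontal: uniform on the level set
            y = x_range[0] + rng.random() * (x_range[1] - x_range[0])
            if phi(y) >= t:
                x = y
                break
        xs.append(x)
    return xs

# target: unnormalized standard normal density
rng = random.Random(1)
xs = slice_sampler(lambda x: math.exp(-x * x / 2), 0.0, 20_000, rng)
mean = sum(xs) / len(xs)                           # close to 0
second_moment = sum(x * x for x in xs) / len(xs)   # close to 1
```

Note that multiplying `phi` by a constant leaves the accepted points unchanged, matching the remark above that the density need not be normalized.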
3. Random Number Generation
In MCMC theory one assumes that the driver sequence is random. In practice, however, pseudo-random numbers are used. These pseudo-random numbers are deterministically generated sequences of numbers in $[0,1)$ which appear random. That means they are designed to satisfy certain statistical tests; for example, they should be uniformly distributed in $[0,1)$ and they should be independent of one another. On the other hand, because of the vast number of statistical tests one could apply to these pseudo-random numbers, they cannot pass every conceivable statistical test.
The following two questions arise from the above:
- Which properties should the pseudo-random numbers satisfy if used in an MCMC algorithm in order to obtain the correct result? (Consistency)
- If we know the property, can the pseudo-random numbers be improved (with respect to this property)? (Convergence)
What we are asking is: what is the criterion which should be applied to pseudo-random numbers used in MCMC algorithms, and, once this is known, how can we optimise pseudo-random numbers with respect to this criterion? Until now the usual approach was to make the pseudo-random numbers appear as random as possible, since MCMC theory assumes random numbers as input.
Previous work in this direction is by [Chentsov 1967] cited above and by Owen and Tribble: Owen and Tribble 2005 doi: 10.1073/pnas.0409596102 and Tribble and Owen 2008 doi: 10.1214/07-EJS162 showed consistency when the state space is a finite set. Our paper extends these results to continuous (infinite) state spaces.
Before we investigate quality criteria for pseudo-random numbers, let us consider how pseudo-random numbers are used in quasi-Monte Carlo (qMC). This topic is well studied and also gives some ideas on what might be important for pseudo-random numbers for MCMC.
Let us consider a simple example of a function $f : [0,1] \to \mathbb{R}$. We want to estimate $\int_0^1 f(x) \, dx$ using the average $\frac{1}{N} \sum_{n=0}^{N-1} f(x_n)$.
It is best to use points which are uniformly distributed in $[0,1]$, that is, with no `large gaps' in between them. A one-dimensional example of a low discrepancy sequence is the van der Corput sequence, which is defined in the following way: Let $n \in \mathbb{N}_0$ have base $b$ representation $n = n_0 + n_1 b + n_2 b^2 + \cdots$, where $n_i \in \{0, 1, \ldots, b-1\}$. Then define the radical inverse function

$\phi_b(n) = \frac{n_0}{b} + \frac{n_1}{b^2} + \frac{n_2}{b^3} + \cdots.$
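The radical inverse simply mirrors the base-$b$ digits of $n$ about the radix point, which takes only a few lines of Python (the function name is my own):

```python
def radical_inverse(n, b=2):
    """Radical inverse in base b: reflect the digits of n about the
    radix point, e.g. n = 6 = (110)_2  ->  (0.011)_2 = 3/8."""
    inv, denom = 0.0, 1.0
    while n > 0:
        denom *= b
        inv += (n % b) / denom
        n //= b
    return inv

# first points of the base-2 van der Corput sequence
vdc = [radical_inverse(n) for n in range(8)]
# vdc == [0.0, 0.5, 0.25, 0.75, 0.125, 0.625, 0.375, 0.875]
```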
In qMC we are not assuming that the driver sequence is random, rather we think of the sequence as a deterministic sequence with certain properties. The essential property here is that the sequence has `no large gaps’ in between. The technical term for this statement is that the sequence has low discrepancy.
In statistical notation, the discrepancy is the Kolmogorov-Smirnov distance between the empirical distribution of the points and the uniform distribution.
Formally, the discrepancy of a point set $x_0, \ldots, x_{N-1} \in [0,1)$ is defined by

$D_N^\ast = \sup_{0 \le t \le 1} \left| \frac{A([0,t), N)}{N} - t \right|,$

where $A([0,t), N)$ is the number of points of $x_0, \ldots, x_{N-1}$ in the interval $[0,t)$.
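For a one-dimensional point set the supremum in this definition is attained (in the limit) just to the left or right of one of the points themselves, so the discrepancy can be computed exactly; the following sketch is my own illustration of this standard reduction:

```python
def star_discrepancy_1d(points):
    """Exact star discrepancy of a point set in [0,1).  For sorted
    points x_(0) <= ... <= x_(N-1) the supremum over t is attained
    approaching some x_(i) from the left or right, which gives
    D_N^* = max_i max(|i/N - x_(i)|, |(i+1)/N - x_(i)|)."""
    xs = sorted(points)
    n = len(xs)
    return max(max(abs(i / n - x), abs((i + 1) / n - x))
               for i, x in enumerate(xs))

# the first N points of the base-2 van der Corput sequence are very
# evenly spread: for N = 4 the discrepancy is 1/4, for N = 8 it is 1/8
d4 = star_discrepancy_1d([0.0, 0.5, 0.25, 0.75])
d8 = star_discrepancy_1d([0.0, 0.5, 0.25, 0.75, 0.125, 0.625, 0.375, 0.875])
```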
Can we then simply use a low discrepancy sequence, such as the van der Corput sequence, as the driver sequence for an MCMC algorithm? The answer is no: the points obtained do not have the distribution properties given by the probability density function.
Low discrepancy guarantees good distribution properties, but the discrepancy does not measure correlations between points. The points of the van der Corput sequence in base $2$ are highly correlated, since $x_n < \frac{1}{2}$ whenever $n$ is even and $x_n \ge \frac{1}{2}$ whenever $n$ is odd. Hence they fail statistical tests for randomness.
The property we need is called complete uniform distribution (CUD).
Definition 1. Let $u_0, u_1, u_2, \ldots \in [0,1)$ and for $s \ge 1$ define the overlapping $s$-tuples $\boldsymbol{x}_n^{(s)} = (u_n, u_{n+1}, \ldots, u_{n+s-1}) \in [0,1)^s$. Then $(u_n)_{n \ge 0}$ is completely uniformly distributed (CUD) if, for every $s \ge 1$, the sequence $(\boldsymbol{x}_n^{(s)})_{n \ge 0}$ is uniformly distributed in $[0,1)^s$, that is, its discrepancy tends to $0$ as $N \to \infty$.
Consider the van der Corput sequence in base $2$. Then the first few points are $0$, $\frac{1}{2}$, $\frac{1}{4}$, $\frac{3}{4}$, $\frac{1}{8}$, $\frac{5}{8}$, and so on. Every overlapping pair $(x_n, x_{n+1})$ has one coordinate below $\frac{1}{2}$ and one at or above $\frac{1}{2}$, so already for $s = 2$ the overlapping tuples are not uniformly distributed and the sequence is not CUD.
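This failure of the CUD property is easy to check numerically; the following sketch (my own illustration) counts how many overlapping pairs $(x_n, x_{n+1})$ fall into the squares $[0,\frac{1}{2})^2$ and $[\frac{1}{2},1)^2$:

```python
def radical_inverse(n, b=2):
    """Radical inverse in base b (repeated here for self-containment)."""
    inv, denom = 0.0, 1.0
    while n > 0:
        denom *= b
        inv += (n % b) / denom
        n //= b
    return inv

# base-2 van der Corput points x_1, ..., x_1000
u = [radical_inverse(n) for n in range(1, 1001)]

# Each overlapping pair (x_n, x_{n+1}) has exactly one coordinate below
# 1/2 (even index) and one at or above 1/2 (odd index), so no pair ever
# lands in [0,1/2)^2 or [1/2,1)^2.  For a CUD sequence, each of these
# squares would receive about 25% of the pairs.
pairs = list(zip(u, u[1:]))
lower_left = sum(1 for a, c in pairs if a < 0.5 and c < 0.5)
upper_right = sum(1 for a, c in pairs if a >= 0.5 and c >= 0.5)
```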
Roughly speaking, we prove that under certain assumptions: if the driver sequence is CUD, then MCMC converges to the correct result. Note that CUD sequences are not `more' random, but rather they have the right property.
The main assumptions are roughly speaking the following (though there are several other technical assumptions):
- (i) The MCMC algorithm works if one uses random numbers as driver sequence.
- (ii) There exists a coupling region.
The first property just means that we need to assume that the MCMC algorithm works if one uses random numbers. In other words, if random numbers do not work, then there is a fundamental difficulty in the problem and we do not expect pseudo-random numbers to fix it. In particular, if one chooses a point $x_0$ with the desired distribution and uses the update function $m$ times to obtain a point $x_m$ based on an i.i.d. uniformly distributed driver sequence, then $x_m$ also has the desired distribution. If one update requires $s$ random numbers, then to obtain $x_m$ we need $ms$ random numbers.
The second assumption is more subtle. We illustrate it using the slice sampler below. Let $(x_n)_{n \ge 0}$ and $(y_n)_{n \ge 0}$ be two sequences with different starting points but which are updated using the same driver sequence $(u_n)_{n \ge 0}$.
Here, the fourth lines for $(x_n)$ and $(y_n)$ lie in the coupling region: if the vertical value of the fourth line of $(x_n)$ is, say, $t_x$, and the vertical value of the fourth line of $(y_n)$ is, say, $t_y$, then the level sets determined by $t_x$ and $t_y$ coincide. Hence, since both sequences are updated using the same driver sequence, the fifth lines of $(x_n)$ and $(y_n)$ will coincide. Then $x_n = y_n$ for all $n \ge 5$, and hence the sequences coincide except for a finite number of points.
We now illustrate the proof that the sequence $(x_n)$ `works'. `Works' means that the points have the desired distribution. Since, by assumption, MCMC based on random numbers yields points with the desired distribution, we can choose another sequence $(y_n)$ of which we know that it has the right distribution. We assume that both sequences are updated with the same driver sequence $(u_n)$. Since $(y_n)$ is an ideal sequence we can assume that it has the right distribution, whereas for $(x_n)$ we cannot make such an assumption. Then from the coupling region it follows that, with positive probability, after a finite number of steps the two sequences coincide (and stay the same from then on). Since $(y_n)$ works it follows that, apart from a finite number of points which can be neglected, $(x_n)$ also works.
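The coupling of two chains driven by the same sequence can be demonstrated numerically for the slice sampler. The unnormalized density $\varphi(x) = \frac{1}{2} + x$ on $[0,1]$ below is my own illustrative choice: whenever the vertical draw falls below the minimum of $\varphi$ for both chains, both level sets equal $[0,1]$ and the two chains move to the same point.

```python
import random

def phi(x):
    """Unnormalized density on [0,1] with minimum value 1/2."""
    return 0.5 + x

def slice_step(x, u1, u2):
    """One slice-sampler update driven by the uniforms u1, u2.
    Since phi is increasing, the level set {y : phi(y) >= t} is the
    interval [max(0, t - 1/2), 1]."""
    t = u1 * phi(x)                  # vertical: uniform on [0, phi(x)]
    lo = max(0.0, t - 0.5)
    return lo + u2 * (1.0 - lo)      # horizontal: uniform on the level set

rng = random.Random(7)
x, y = 0.0, 1.0                      # two chains, different starting points
met = None
for n in range(1, 101):
    u1, u2 = rng.random(), rng.random()
    x, y = slice_step(x, u1, u2), slice_step(y, u1, u2)
    if met is None and x == y:
        met = n                      # the chains have coalesced
# whenever u1 <= 1/3, both vertical draws fall below min(phi) = 1/2,
# both level sets equal [0,1], and both chains move to the point u2
```

Once the chains meet they agree forever after, since the update is a deterministic function of the current point and the shared driver values.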
The details of the proof can be found in the paper [CDO].
6. Numerical Results
The numerical results by Tribble and Owen show improvements using certain completely uniformly distributed sequences; see Tribble's PhD thesis. Tribble found the largest improvements for the Gibbs sampler where the driver sequence is a linear feedback shift register (LFSR). The period length of the LFSR is chosen such that the full period can be used and, additionally, the input sequence is randomised.