Discrepancy bounds for uniformly ergodic Markov chain quasi-Monte Carlo

Recently we uploaded the paper

In a sense, this paper is an extension of the paper

In this paper we prove bounds on the discrepancy of Markov chains which are generated by a deterministic driver sequence u_1, u_2, \ldots, u_n \in [0,1]^s. We also prove a Koksma-Hlawka inequality. The main assumption which we use is uniform ergodicity of the transition (Markov) kernel. We describe the essential ingredients and results in the following.

The transition kernel of a Markov chain

Let (G,\mathcal{B}(G), \pi) be a probability space. We want to generate a Markov chain in the state space G whose distribution converges to the target distribution \pi. The idea of the transition kernel K:G \times \mathcal{B}(G) \to [0,1] is the following. Assume that X_i \in G is the ith point of the Markov chain. Let A \in \mathcal{B}(G) be an arbitrary set. Then the probability that the next point of the Markov chain is in A, is given by K(X_i, A). For a precise definition of the transition kernel see Definition 2 on page 6 here.

Let X_0 \in G be the starting point of our Markov chain. Then we use K(X_0,\cdot) to sample the next point X_1 of our Markov chain. The point X_2 is then sampled from the distribution K(X_1,\cdot), and so on. Asymptotically we then get that the distribution of X_i converges in some sense to the target distribution \pi. One can also write down the transition kernel for going j steps, denoted by K^j(X,A), namely it can be defined recursively by

\displaystyle K^j(X,A) = \int_G K^{j-1}(y,A) K(X, \mathrm{d} y),

where K^0(X,A) = 1_{X \in A} is the indicator function of the set A. Here, K(X, \mathrm{d} y) just means that we integration with respect to the probability measure K(X, \cdot).

This method is an important algorithm in simulation problems where it is not possible to generate samples which have distribution \pi. Instead one generates a Markov chain which has target distribution \pi and uses the chain to as the points for the simulation. Such algorithms are called Markov chain Monte Carlo.

There are many methods for generating the Markov chain, see for instance the Metropolis-Hastings algorithm or the slice sampler. However, those algorithms will not play a role in the following. We just assume that there is a function \varphi:G \times [0,1]^s \to G, which we call update function. This function is used to generate the Markov chain by X_i = \varphi(X_{i-1}, U_i), where U_i is a uniformly distributed random number in [0,1]^s.

Uniformly ergodic Markov chains

Uniform ergodicity is a property of the transition kernel which tells us how fast the distribution of X_i converges to the target distribution \pi. The assumption is of the following form

\displaystyle \sup_{A \in \mathcal{B}(G)} \sup_{X \in G} |K^i(X,A)-\pi(A)| \le M \alpha^i \quad \forall i \in \mathbb{N},

where M >0 and 0 \le \alpha < 1 are constants. This assumption is actually very natural as we explain for a special case in the following.

If the transition kernel K satisfies certain assumptions, we have

\displaystyle K(x,A) = \int_A k(x,y) \pi(\mathrm{d} y),

where k: G \times G \to \mathbb{R}_+ and the function k is in fact a reproducing kernel with some additional properties. The kernel k has a Karhunen-Loeve expansion

\displaystyle k(x,y) = \sum_{r=0}^\infty \lambda_r \psi_r(x) \overline{\psi_r(y)},

where \lambda_r are the eigenvalues and \psi_r are orthonormal eigenfunctions of the integral operator T(f)(y) := \int_G k(x,y) f(x) \mathrm{d} x (we assume that this operator is a compact operator). We have

\displaystyle 1 = \lambda_0 > \lambda_1 \ge \lambda_2 \ge \lambda_3 \ge \cdots \ge 0.

Let now

\displaystyle K^j(x,A) = \int_A k^j(x,y) \pi(\mathrm{d} y).

Then it can be shown using the recursive formula for K^j and the Karhunen-Loeve expansion of k that

\displaystyle k^j(x,y) = \sum_{r=0}^\infty \lambda_r^j \psi_r(x) \overline{\psi_r(y)} \quad \mbox{for all } j \in \mathbb{N}.

Thus k^j converges exponentially fast to k^\infty(x,y) = 1. (In fact, under certain assumptions, one can choose \alpha = \lambda_1.) But we have \int_A k^\infty(x,y) \pi(\mathrm{d} y) = \pi(A), the stationary distribution.

The discrepancy of the Markov chain

We measure the discrepancy of the empirical distribution of the first n points x_1, x_2, \ldots, x_n of the Markov chain from the target distribution \pi. The empirical distribution for a set A \in \mathcal{B}(G) is

\displaystyle \frac{1}{n} \sum_{i=1}^n 1_{x_i \in A},

where 1_{x_i \in A} is the indicator function of the set A. The local discrepancy is

\displaystyle \left|\frac{1}{n} \sum_{i=1}^n 1_{x_i \in A} - \pi(A) \right|.

Note that for arbitrary sets A one cannot get convergence in general (if G = [0,1]^d say and A is just the point set \{x_1,\ldots, x_n\}, then the local discrepancy is 1). One example is to consider boxes of the form \prod_{j=1}^d (-\infty, y_j) \cap G. For instance, one can consider the set of sets \mathcal{A} = \{ \prod_{j=1}^d (-\infty, y_j) \cap G : y_j \in \mathbb{R} \cup \{-\infty, \infty\}. We call the set \mathcal{A} the set of test sets. The discrepancy is now defined by

\displaystyle \sup_{A \in \mathcal{A}} \left|\frac{1}{n} \sum_{i=1}^n 1_{x_i \in A} - \pi(A) \right|.

Main result

The main result of the paper is the proof that for any starting point x_0 \in G there exists a finite driver sequence u_1, u_2, \ldots, u_n such that the Markov chain x_1, x_2, \ldots, x_n generated by this driver sequence is bounded by

\displaystyle \sup_{A \in \mathcal{A}} \left| \frac{1}{n} \sum_{i=1}^n 1_{x_i \in A} - \pi(A) \right| \le C \sqrt{\frac{\log n}{n}},

for some constant independent of n. See Corollary 5 here.

Several further extensions would be interesting. One is to remove the dependence of the driver sequence u_1, u_2, \ldots on the starting point. Further one would like to have a driver sequence which can be used for a large class of update functions. In practice one wants to have explicit constructions of such driver sequences. Ultimately one would also like to have driver sequences for which one can achieve a rate of convergence beyond the Monte Carlo rate, at least, for some restricted class of problems. Answers to these questions would be very interesting, but they seem to be quite challenging, at least at the moment.


Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out /  Change )

Google+ photo

You are commenting using your Google+ account. Log Out /  Change )

Twitter picture

You are commenting using your Twitter account. Log Out /  Change )

Facebook photo

You are commenting using your Facebook account. Log Out /  Change )


Connecting to %s