# Discrepancy bounds for uniformly ergodic Markov chain quasi-Monte Carlo

In a sense, this paper is an extension of an earlier paper.

In this paper we prove bounds on the discrepancy of Markov chains which are generated by a deterministic driver sequence $u_1, u_2, \ldots, u_n \in [0,1]^s$. We also prove a Koksma-Hlawka inequality. The main assumption which we use is uniform ergodicity of the transition (Markov) kernel. We describe the essential ingredients and results in the following.

## The transition kernel of a Markov chain

Let $(G,\mathcal{B}(G), \pi)$ be a probability space. We want to generate a Markov chain in the state space $G$ whose distribution converges to the target distribution $\pi$. The idea of the transition kernel $K:G \times \mathcal{B}(G) \to [0,1]$ is the following. Assume that $X_i \in G$ is the $i$th point of the Markov chain, and let $A \in \mathcal{B}(G)$ be an arbitrary set. Then the probability that the next point of the Markov chain lies in $A$ is given by $K(X_i, A)$. For a precise definition of the transition kernel see Definition 2 on page 6 here.

Let $X_0 \in G$ be the starting point of our Markov chain. Then we use $K(X_0,\cdot)$ to sample the next point $X_1$ of our Markov chain. The point $X_2$ is then sampled from the distribution $K(X_1,\cdot)$, and so on. The distribution of $X_i$ then converges, in a suitable sense, to the target distribution $\pi$. One can also write down the transition kernel for going $j$ steps, denoted by $K^j(X,A)$; it can be defined recursively by

$\displaystyle K^j(X,A) = \int_G K^{j-1}(y,A) K(X, \mathrm{d} y),$

where $K^0(X,A) = 1_{X \in A}$ is the indicator function of the set $A$. Here, $K(X, \mathrm{d} y)$ just means that we integrate with respect to the probability measure $K(X, \cdot)$.
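For a finite state space this recursion is easy to make concrete: the kernel becomes a row-stochastic matrix, $K^0$ the identity matrix, and the recursion is matrix multiplication. A minimal sketch (the chain and all numbers here are hypothetical, chosen only for illustration):

```python
import numpy as np

# Hypothetical 3-state chain; K[x, y] = probability of moving from x to y.
K = np.array([
    [0.5, 0.3, 0.2],
    [0.2, 0.6, 0.2],
    [0.1, 0.3, 0.6],
])

def kernel_power(K, j):
    """j-step transition kernel via the recursion K^j = K @ K^(j-1).

    K^0 is the identity: starting from x, in 0 steps the chain is in A
    exactly when x is in A (the indicator 1_{x in A})."""
    Kj = np.eye(K.shape[0])
    for _ in range(j):
        Kj = K @ Kj
    return Kj

# K^j(x, A) for A = {0, 2}: sum the entries of row x over the states in A.
K5 = kernel_power(K, 5)
prob = K5[1, [0, 2]].sum()  # start in state 1, land in {0, 2} after 5 steps
```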

This method is an important algorithm in simulation problems where it is not possible to generate samples which have distribution $\pi$ directly. Instead one generates a Markov chain which has target distribution $\pi$ and uses the chain points for the simulation. Such algorithms are called Markov chain Monte Carlo.

There are many methods for generating the Markov chain, see for instance the Metropolis-Hastings algorithm or the slice sampler. However, those algorithms will not play a role in the following. We just assume that there is a function $\varphi:G \times [0,1]^s \to G$, which we call the update function. This function is used to generate the Markov chain via $X_i = \varphi(X_{i-1}, U_i)$, where $U_i$ is a uniformly distributed random vector in $[0,1]^s$.
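As a sketch of what an update function can look like, here is a random-walk Metropolis update on $G = [0,1]$ (viewed as a circle) written in the form $\varphi(x,u)$ with $u \in [0,1]^2$: one coordinate drives the proposal, the other the accept/reject decision. The target density and all constants are illustrative assumptions, not taken from the paper:

```python
import math
import random

def phi(x, u):
    """Update function for a random-walk Metropolis chain on [0,1].

    u = (u1, u2) in [0,1]^2: u1 drives the proposal, u2 the accept/reject
    step. Target density (unnormalized, on the circle [0,1)):
    pi(x) ~ exp(-8*(x - 0.5)**2). All concrete choices are illustrative."""
    u1, u2 = u
    y = (x + 0.25 * (2 * u1 - 1)) % 1.0   # symmetric proposal, wrapped mod 1
    ratio = math.exp(-8 * (y - 0.5) ** 2) / math.exp(-8 * (x - 0.5) ** 2)
    return y if u2 <= min(1.0, ratio) else x  # accept with prob min(1, ratio)

# Generate the chain x_i = phi(x_{i-1}, U_i) from a (here pseudo-random)
# driver sequence U_1, U_2, ... in [0,1]^2.
rng = random.Random(0)
x = 0.1
chain = []
for _ in range(1000):
    x = phi(x, (rng.random(), rng.random()))
    chain.append(x)
```

In the paper the $U_i$ are replaced by a deterministic driver sequence $u_i$; the update function itself stays the same.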

## Uniformly ergodic Markov chains

Uniform ergodicity is a property of the transition kernel which tells us how fast the distribution of $X_i$ converges to the target distribution $\pi$. The assumption is of the following form

$\displaystyle \sup_{A \in \mathcal{B}(G)} \sup_{X \in G} |K^i(X,A)-\pi(A)| \le M \alpha^i \quad \forall i \in \mathbb{N},$

where $M >0$ and $0 \le \alpha < 1$ are constants. This assumption is actually very natural as we explain for a special case in the following.
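For a finite state space the quantity $\sup_{A} |K^i(X,A)-\pi(A)|$ is the total variation distance between the $X$-th row of the $i$-step transition matrix and $\pi$, which equals half the $\ell_1$ distance. A sketch that checks the geometric decay numerically for a hypothetical chain:

```python
import numpy as np

# Hypothetical 3-state chain.
K = np.array([
    [0.5, 0.3, 0.2],
    [0.2, 0.6, 0.2],
    [0.1, 0.3, 0.6],
])
# Stationary distribution: left eigenvector of K for eigenvalue 1.
w, V = np.linalg.eig(K.T)
pi = np.real(V[:, np.argmax(np.real(w))])
pi = pi / pi.sum()

def tv_bound(K, pi, i):
    """sup_A sup_x |K^i(x, A) - pi(A)| = max_x (1/2) * ||K^i(x, .) - pi||_1."""
    Ki = np.linalg.matrix_power(K, i)
    return 0.5 * np.abs(Ki - pi).sum(axis=1).max()

# Geometric decay: tv_bound(K, pi, i) <= M * alpha**i for some alpha < 1.
decay = [tv_bound(K, pi, i) for i in range(1, 8)]
```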

If the transition kernel $K$ satisfies certain assumptions, we have

$\displaystyle K(x,A) = \int_A k(x,y) \pi(\mathrm{d} y),$

where $k: G \times G \to \mathbb{R}_+$ is in fact a reproducing kernel with some additional properties. The kernel $k$ has a Karhunen-Loeve expansion

$\displaystyle k(x,y) = \sum_{r=0}^\infty \lambda_r \psi_r(x) \overline{\psi_r(y)},$

where the $\lambda_r$ are the eigenvalues and the $\psi_r$ are orthonormal eigenfunctions of the integral operator $T(f)(y) := \int_G k(x,y) f(x) \pi(\mathrm{d} x)$ (we assume that this operator is compact). We have

$\displaystyle 1 = \lambda_0 > \lambda_1 \ge \lambda_2 \ge \lambda_3 \ge \cdots \ge 0.$

Let now

$\displaystyle K^j(x,A) = \int_A k^j(x,y) \pi(\mathrm{d} y).$

Then it can be shown using the recursive formula for $K^j$ and the Karhunen-Loeve expansion of $k$ that

$\displaystyle k^j(x,y) = \sum_{r=0}^\infty \lambda_r^j \psi_r(x) \overline{\psi_r(y)} \quad \mbox{for all } j \in \mathbb{N}.$

Thus $k^j$ converges exponentially fast to $k^\infty(x,y) = 1$. (In fact, under certain assumptions, one can choose $\alpha = \lambda_1$.) But we have $\int_A k^\infty(x,y) \pi(\mathrm{d} y) = \pi(A)$, the stationary distribution.
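The claim that the convergence rate is governed by $\lambda_1$ can be checked numerically for a symmetric (hence reversible) transition matrix, where the eigendecomposition plays the role of the Karhunen-Loeve expansion: the deviation of $K^j$ from its rank-one limit (all rows equal to $\pi$) decays like $\lambda_1^j$. A sketch with a hypothetical $3$-state chain:

```python
import numpy as np

# Hypothetical symmetric 3-state chain; symmetry gives uniform pi.
K = np.array([
    [0.6, 0.3, 0.1],
    [0.3, 0.4, 0.3],
    [0.1, 0.3, 0.6],
])
pi = np.full(3, 1.0 / 3.0)

# Eigenvalues sorted decreasingly: lambda_0 = 1 > lambda_1 >= lambda_2.
lams = np.sort(np.linalg.eigvalsh(K))[::-1]
lam1 = lams[1]

# Deviation of K^j from the stationary limit decays like lam1**j.
j = 10
dev = np.abs(np.linalg.matrix_power(K, j) - pi).max()
```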

## The discrepancy of the Markov chain

We measure the discrepancy of the empirical distribution of the first $n$ points $x_1, x_2, \ldots, x_n$ of the Markov chain from the target distribution $\pi$. The empirical distribution for a set $A \in \mathcal{B}(G)$ is

$\displaystyle \frac{1}{n} \sum_{i=1}^n 1_{x_i \in A},$

where $1_{x_i \in A}$ is the indicator function of the set $A$. The local discrepancy is

$\displaystyle \left|\frac{1}{n} \sum_{i=1}^n 1_{x_i \in A} - \pi(A) \right|.$

Note that for arbitrary sets $A$ one cannot get convergence in general (if $G = [0,1]^d$, say, and $A$ is just the point set $\{x_1,\ldots, x_n\}$, then the local discrepancy is $1$). One therefore restricts the sets $A$ to a smaller collection, for example boxes of the form $\prod_{j=1}^d (-\infty, y_j) \cap G$. That is, one considers the collection $\mathcal{A} = \{ \prod_{j=1}^d (-\infty, y_j) \cap G : y_j \in \mathbb{R} \cup \{-\infty, \infty\} \}$. We call $\mathcal{A}$ the set of test sets. The discrepancy is now defined by

$\displaystyle \sup_{A \in \mathcal{A}} \left|\frac{1}{n} \sum_{i=1}^n 1_{x_i \in A} - \pi(A) \right|.$
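For $d = 1$, $G = [0,1]$ and $\pi$ the uniform distribution, this supremum over the test boxes $[0, y)$ can be computed exactly from the sorted points via the classical star-discrepancy formula. A sketch (the uniform target is an assumption made for illustration):

```python
def star_discrepancy_1d(points):
    """Discrepancy of points in [0,1] w.r.t. the uniform distribution,
    with test sets A = [0, y). For sorted points x_(1) <= ... <= x_(n):
        D_n* = max_i max( i/n - x_(i), x_(i) - (i-1)/n ).
    """
    xs = sorted(points)
    n = len(xs)
    d = 0.0
    for i, x in enumerate(xs, start=1):
        d = max(d, i / n - x, x - (i - 1) / n)
    return d

# The centered regular grid (2i-1)/(2n) attains the optimal value 1/(2n).
n = 100
grid = [(2 * i - 1) / (2 * n) for i in range(1, n + 1)]
d_grid = star_discrepancy_1d(grid)  # approximately 1/(2n)
```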

## Main result

The main result of the paper is the proof that for any starting point $x_0 \in G$ there exists a finite driver sequence $u_1, u_2, \ldots, u_n$ such that the discrepancy of the Markov chain $x_1, x_2, \ldots, x_n$ generated by this driver sequence satisfies

$\displaystyle \sup_{A \in \mathcal{A}} \left| \frac{1}{n} \sum_{i=1}^n 1_{x_i \in A} - \pi(A) \right| \le C \sqrt{\frac{\log n}{n}},$

for some constant $C > 0$ independent of $n$. See Corollary 5 here.

Several further extensions would be interesting. One is to remove the dependence of the driver sequence $u_1, u_2, \ldots$ on the starting point. Further one would like to have a driver sequence which can be used for a large class of update functions. In practice one wants to have explicit constructions of such driver sequences. Ultimately one would also like to have driver sequences for which one can achieve a rate of convergence beyond the Monte Carlo rate, at least, for some restricted class of problems. Answers to these questions would be very interesting, but they seem to be quite challenging, at least at the moment.