A Good Definition of Randomness

Most mathy people have a pretty good mental model of what a random process is (for example, generating a sequence of 20 independent bits).

I think most mathy people also have the intuition that there’s a sense in which an individual string like 10101001110000100101 is more “random” than 00000000000000000000 even though both strings are equally likely under the above random process, but they don’t know how to formalize it, and may even doubt that there is any way to make sense of this intuition.

Mathematical logic (or maybe theoretical computer science) has a method for quantifying the randomness of individual strings: given a string \sigma, the Kolmogorov complexity C(\sigma) of \sigma is the length of the shortest Turing machine that outputs it.
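Kolmogorov complexity itself is not computable, but the length of a compressed encoding gives a crude, computable upper-bound proxy that is enough to see the intuition. Here is a minimal sketch, assuming zlib as a stand-in for "shortest description"; that choice, the helper name compressed_length, and the string length of 1000 are all assumptions made for illustration, not part of the definition.

```python
import random
import zlib


def compressed_length(data: bytes) -> int:
    """Length in bytes of the zlib-compressed form of `data`.

    Only a rough upper-bound proxy for Kolmogorov complexity
    (ignoring the constant-size decompressor), not the real thing.
    """
    return len(zlib.compress(data, 9))


n = 1000
all_zeros = b"0" * n

# Assumption: a seeded pseudo-random generator stands in for "typical" bits.
rng = random.Random(0)
typical = bytes(rng.choice(b"01") for _ in range(n))

zeros_len = compressed_length(all_zeros)
typical_len = compressed_length(typical)
print(zeros_len, typical_len)
```

The all-zeros string compresses to a small fraction of the size of the typical-looking one. One caveat: the "typical" string here comes from a short seeded program, so its true Kolmogorov complexity is actually low. zlib simply cannot see that, which is a reminder that compression only ever gives an upper bound.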

In this blog post, I would like to explain why I think this is a very satisfying definition.

Keeping Grounded

I think a good way to help avoid philosophical quagmires when thinking about randomness is to recognize that random numbers are useful in the real world, and to make sure that your thinking about randomness preserves that.

For example, there are algorithms P(\sigma) that take a fixed length n string \sigma, and produce the correct answer to whatever problem they’re trying to solve on some large proportion \alpha of all the length n strings.  Then a good approach would just be to feed P a random \sigma, and you’ll get the right answer with probability \alpha.

Just to give a concrete example: a very familiar way that random numbers are useful is to estimate the average of a large list of numbers by taking a random sample and averaging them.  You might have a list of 1000 numbers (say, bounded between 0 and 10), and have  \sigma encode a set of 100 indices, then P(\sigma) will return the average of the numbers at those indices.  If you say that P succeeds for this problem if it returns an average that’s within some fixed tolerance of the true average, then you can work out \alpha for the given tolerance (although I think getting exact numbers for this problem is actually pretty tricky).
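The sampling example can be sketched in a few lines. This is a toy version under stated assumptions: the list of numbers is made up (uniform on [0, 10]), the tolerance of 1.0 is arbitrary, and \alpha is estimated empirically by repeated trials rather than worked out exactly. The function names P and estimate_alpha are just labels for this sketch.

```python
import random


def P(numbers, indices):
    """Average of the entries of `numbers` at the sampled indices."""
    return sum(numbers[i] for i in indices) / len(indices)


def estimate_alpha(numbers, sample_size, tolerance, trials, rng):
    """Fraction of random index sets on which P lands within `tolerance`
    of the true average."""
    true_mean = sum(numbers) / len(numbers)
    hits = 0
    for _ in range(trials):
        # sigma encodes a set of `sample_size` distinct indices.
        sigma = rng.sample(range(len(numbers)), sample_size)
        if abs(P(numbers, sigma) - true_mean) <= tolerance:
            hits += 1
    return hits / trials


rng = random.Random(0)
numbers = [rng.uniform(0, 10) for _ in range(1000)]  # made-up data
alpha_hat = estimate_alpha(numbers, sample_size=100, tolerance=1.0,
                           trials=2000, rng=rng)
print(alpha_hat)
```

With these settings the estimate comes out very close to 1, which matches the idea that P is correct on a large proportion of its inputs.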

The reason that I think that Kolmogorov complexity is a good account of randomness is that the above story “factors” through Kolmogorov complexity in the following way: For any computable P where \alpha is high enough (in a sense to be made precise below), there is an integer c such that:

  1. For all \sigma with C(\sigma|n) > c, P(\sigma) returns a correct answer.
  2. Almost all \sigma (of the given length n) have C(\sigma|n) > c.

That is, Kolmogorov complexity lets you view the problem as follows: Any string of high complexity will yield the right answer when fed into P, so the only role of randomness is as an easy way to generate a string of a high Kolmogorov complexity.

As a note: the notation C(\sigma|n) means the length of the shortest Turing machine that outputs \sigma when given n as an input.  The reason for using this concept instead of C(\sigma) is that we want to, e.g., consider any string of all 0s to be low complexity, even if the length of the string happens to be a high complexity number.

Some Rough Intuitions

The intuition for why almost all strings \sigma should have high Kolmogorov complexity is that there are only so many Turing machines: For example, there are 2^n strings of length n but only 2^{n-c}-1 Turing machines of length \leq n - c - 1, and each machine outputs at most one string, so the proportion of strings of length n with Kolmogorov complexity \geq n - c must be at least 1 - 2^{-c}.
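Spelled out, the counting argument looks like this (a sketch, assuming Turing machines are encoded as binary strings, so there are exactly 2^k of each length k):

```latex
\begin{align*}
\#\{\text{machines of length} \le n-c-1\}
  &\le \sum_{k=0}^{n-c-1} 2^k = 2^{n-c} - 1, \\
\frac{\#\{\sigma : |\sigma| = n,\ C(\sigma) \ge n-c\}}{2^n}
  &\ge \frac{2^n - (2^{n-c} - 1)}{2^n} > 1 - 2^{-c}.
\end{align*}
```

For instance, with c = 10, more than 1 - 2^{-10} (about 99.9%) of the strings of length n have complexity at least n - 10.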

The intuition for why P(\sigma) should be correct for all strings of sufficiently high complexity is as follows: We’re presuming that P(\sigma) is correct for most strings, and that P is computable.  If P(\sigma) isn’t correct, then \sigma can be described fairly succinctly: namely, as the ith string \tau of length n for which P(\tau) isn’t correct.  This is a short description because, by presumption, there are few such strings, so i is small.
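To make the "ith failing string" description concrete, here is a toy sketch. The failure set is hypothetical (strings whose first 12 bits are all 0, chosen only so that failures are rare), and the names fails and ith_failure are labels for this illustration. The point is that a failing string of length n can be recovered from n and a small index i alone.

```python
from itertools import product
from math import ceil, log2


def fails(sigma):
    """Hypothetical failure set: strings whose first 12 bits are all 0."""
    return not any(sigma[:12])


def ith_failure(n, i):
    """Return the i-th (0-based) length-n string on which `fails` is true,
    enumerating strings in lexicographic order."""
    seen = 0
    for sigma in product((0, 1), repeat=n):
        if fails(sigma):
            if seen == i:
                return sigma
            seen += 1
    raise IndexError("fewer than i + 1 failing strings")


n = 16
num_failures = sum(1 for s in product((0, 1), repeat=n) if fails(s))
bits_for_index = ceil(log2(num_failures))  # bits needed to write down i
sigma = ith_failure(n, 5)
print(num_failures, bits_for_index, sigma)
```

Here only 16 of the 65536 strings fail, so 4 bits of index (plus the fixed program above) pin down any failing string of length 16. In other words, failing strings have a description much shorter than n.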

Formalization

I said above that this fact about Kolmogorov complexity only holds if \alpha is high enough.  How can we formalize this? One approach would be to consider a sequence of algorithms P_i instead of a single P as above.  Each algorithm P_i should return a correct answer on at least 1-2^{-i} of its input strings.  Furthermore, the different algorithms should be consistent: specifically, if P_i(\sigma) returns the correct answer, then so should P_j(\sigma) for j > i.

Now, if we kept the size n of the input string \sigma fixed, then this would be trivial, since for i greater than n, P_i would have to return the correct answer on any string.  So we should also consider algorithms P_{i,n} that take input strings of length n and give a correct answer on at least 1-2^{-i} of those strings.  (And if i>n, we will have to define “correct answer” for P_{i,n} so that every input string returns a correct answer.  Thus P_{i,n} won’t be very useful, but we can look at P_{i,n'} for higher n's.)

In fact, it turns out we can describe all of this just in terms of the sets of input strings on which the algorithms return a correct answer.

Definition: A P-test \delta is an assignment of a natural number to each finite string \sigma such that, for each n and m, the number of \sigma of length n such that \delta(\sigma) \geq m is \leq 2^{n - m}.

If \sigma has length n, then \delta(\sigma) = m corresponds to m being the smallest m_0 such that P_{m_0,n}(\sigma) returns a correct answer in our discussion above.
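As a sanity check on the definition, here is a small sketch that brute-forces the counting condition, read as an upper bound: at most 2^{n - m} strings of length n may score m or higher, as in Li and Vitányi. The test delta (number of leading zeros) is a toy choice for illustration, and delta_bad is a deliberately broken one; both names are made up for this sketch.

```python
from itertools import product


def delta(sigma):
    """Toy test (an assumption for illustration): number of leading zeros."""
    count = 0
    for bit in sigma:
        if bit != 0:
            break
        count += 1
    return count


def delta_bad(sigma):
    """Broken test: flags every string as maximally non-random."""
    return len(sigma)


def satisfies_counting_bound(test, n):
    """Check #{sigma of length n : test(sigma) >= m} <= 2^(n - m)
    for every m from 0 to n + 1 (no string scores above n here)."""
    strings = list(product((0, 1), repeat=n))
    for m in range(n + 2):
        count = sum(1 for s in strings if test(s) >= m)
        # Compare count * 2^m with 2^n to stay in integer arithmetic.
        if count * 2**m > 2**n:
            return False
    return True


good = [satisfies_counting_bound(delta, n) for n in range(1, 11)]
bad = [satisfies_counting_bound(delta_bad, n) for n in range(1, 11)]
print(all(good), any(bad))
```

The leading-zeros test meets the bound with equality for every m up to n, so it is about as aggressive as a test is allowed to be. The broken test fails because it assigns a high score to far too many strings.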

Theorem (Martin-Löf?):  For any computable P-test \delta, there is a constant c such that:  For all \sigma of length n and natural numbers m, if C(\sigma|n) \geq n + c - m, then \delta(\sigma) \leq m.

Furthermore, the proportion of \sigma such that C(\sigma|n) \geq n + c - m is at least 1 - 2^{c-m}.

I think this was one of Martin-Löf’s original theorems but I’m actually not sure.  It’s a rephrasing of the results in Section 2.4 of Li and Vitányi’s book.

So, there is a complexity bound such that any string of high enough complexity will return a correct answer when plugged into the algorithm.  However, m may have to be made high (which corresponds to making \alpha high) to ensure that there are a large number of such high complexity strings (or any at all).

What about Noisy Data?

The algorithms discussed above are all deterministic: that is, they correspond to things like Monte Carlo integration rather than averaging noisy data collected from the real world.

So what about noisy data?  Random numbers are also useful in analyzing real world data, but the theorem above only applies to computable algorithms.  The answer is so simple that it seems like cheating: if you model the noise in your data as coming from some infinite binary sequence X, you can simply redo the whole thing but with Turing machines that have access to X!  In other words, you won’t get theorems about C(\sigma), but you will get theorems about C^X(\sigma), which is the length of the shortest Turing machine that has access to X and outputs \sigma.

What about Infinite Random Sequences?

Above we considered algorithms that knew ahead of time how many random bits we need.  What about algorithms that might request a random bit at any time?  This is also handled by Kolmogorov complexity: here we say that an infinite binary sequence is Martin-Löf random if there is some c such that each prefix of the sequence of length n has complexity at least n - c.  (There actually has to be a technical change to the definition of complexity of finite strings in this case.)

As in the finite case, there’s a theorem saying that any sufficiently robust algorithm will yield a correct answer on any Martin-Löf random sequence.

One thing I like about this framework is that it provides an idea for what it means for a single infinite sequence to be random.  For example, people often say that the primes are random (in fact, it’s one of their main points of interest).  Since the primes are computable, they aren’t random in this sense, but this gives an idea of what it might mean: perhaps there’s some programming language that encapsulates “non-number-theoretic” ideas in some way, and some sequence derived from the primes can be shown to be “Martin-Löf” random with Turing machines replaced by this weaker description language.  But this is pure speculation.

 

2 thoughts on “A Good Definition of Randomness”

  1. I think most mathy people also have the intuition that there’s a sense in which an individual string like 10101001110000100101 is more “random” than 0000000000000000000 even though both strings are equally likely under the above random process, but they don’t know how formalize it

    Without having read the rest of this post, here’s my stab at it: abbabaabaababa gets parsed by the mind into an equivalence class with about (or exactly) that ratio of a’s to b’s, but aaaaaaaaaaaaa † gets sent into its own equivalence class, and aaaaaaaaaabaa and aaaabaaaaaaaa would get sent into their own equivalence class too.

    † keeping that number of ab’s because that “looks close enough”: the exact topic at hand
