Wednesday, October 7, 2009
In the previous blog I argued that a truly random sequence can (and should) be defined as an incompressible one: a sequence from which nothing can be learned. Moreover, it can be shown mathematically that certain deterministic chaotic dynamical systems can indeed generate such random sequences.
If randomness is a slippery subject, complexity is worse. Complexity has turned into an entire research field of its own, but one that is criticized for not even having a good definition of its central term. In fact, there seem to be 31 definitions floating around. Complexity is one of those things for which we seem to have a clear intuition but which is hard to capture in math.
Let's try to build some intuition first. Consider the tree-image above and compare it to, say, a completely white image and an image of white noise (your television screen when it receives no signal). We feel that the white image is boring because it is utterly predictable, but the noise image is also boring because there is nothing to predict about it. The tree-image, on the contrary, seems interesting: there is lots of structure. We want our complexity measure to peak at the tree-image but vanish at boring images and random images. Therefore, complexity should not be equated with randomness.
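A rough way to see this distinction is with an off-the-shelf compressor. Here zlib stands in (crudely) for the ideal compressor, and three byte strings stand in for the three images: a "white" one, a "noise" one, and a "structured" one (a made-up pattern of varying runs, chosen just for illustration):

```python
import random
import zlib

random.seed(0)

n = 10_000
white = bytes(n)                                        # utterly predictable
noise = bytes(random.randrange(256) for _ in range(n))  # nothing to predict
# crude stand-in for "structure": long runs whose lengths vary
structured = b"".join(bytes([i % 256]) * (5 + i % 7) for i in range(1000))

for name, data in [("white", white), ("noise", noise), ("structured", structured)]:
    ratio = len(zlib.compress(data, 9)) / len(data)
    print(f"{name:10s} compressed to {ratio:.0%} of original size")
```

The white string collapses to almost nothing, the noise string barely shrinks at all, and the structured string lands in between: compressibility alone already separates "boring" from "random", but it does not by itself single out the interesting middle case.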
An attractive definition was given by Murray Gell-Mann (Nobel laureate and inventor of the quark). I will discuss a slightly adapted version of his story. We return to the MDL principle (minimum description length). Given a string of symbols, we wish to compress it optimally. That is, we want to build a model with a bunch of parameters, P, and use that model to predict most of the symbols in the sequence correctly. The symbols that were predicted incorrectly will have to be stored separately. One can imagine a communication game in which I want to send the string to you. Instead of sending the string, I send the model and only those symbols which were predicted wrongly by the model. If you use my model on your end to generate the predictable symbols, you have all you need to reconstruct the original string (use the model at the locations where its predictions are correct, but use the separate list of corrections where the model was wrong). The trick is now to find the model of optimal complexity, in the sense that the total information necessary to encode the model plus the unpredictable symbols is minimized. In other words, I want the procedure that maximally compresses the original string.
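The communication game can be sketched in code. Everything below is an illustrative toy of my own choosing, not part of Gell-Mann's story: the model class is periodic strings, an exception costs its position plus the correct symbol, and the bit counts are back-of-the-envelope rather than a real code design:

```python
import math

def two_part_length(s: str, period: int) -> float:
    """Bits for a two-part code of s: a periodic model of the given
    period, plus corrections for the symbols the model got wrong.
    (A toy sketch -- real MDL uses carefully designed codes.)"""
    bits_per_symbol = math.log2(len(set(s)))
    model = s[:period]                      # the "parameters": one period
    model_bits = period * bits_per_symbol
    # each exception: its position (log2 len(s) bits) plus the true symbol
    errors = sum(1 for i, c in enumerate(s) if model[i % period] != c)
    correction_bits = errors * (math.log2(len(s)) + bits_per_symbol)
    return model_bits + correction_bits

s = "AB" * 50                              # mostly periodic string...
s = s[:30] + "X" + s[31:]                  # ...with one corrupted symbol
raw_bits = len(s) * math.log2(len(set(s)))
best = min(range(1, 11), key=lambda p: two_part_length(s, p))
print("best period:", best)
print("two-part bits:", round(two_part_length(s, best), 1), "vs raw bits:", round(raw_bits, 1))
```

The search picks the period-2 model "AB" and pays a small separate cost for the single corrupted symbol; the total is far below sending the raw string, which is exactly the compression the communication game is after.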
There is something fundamentally important about this procedure, because it guarantees that the model will be optimally predictive of future data. If your model was too simple, there was structure left in the string that you could have predicted. If, on the other hand, you use a model that is too complex, you will have "modeled" the randomness of the input string (this is called overfitting), which is obviously not helpful either. Living things survive because they are able to predict the world (and leverage that). It is our business to compress the input stream of our senses!
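Here is a minimal sketch of that trade-off in the prediction-game spirit, using a hypothetical biased coin. The simple model captures the only learnable structure (the bias); the "overfit" model memorizes the training sequence and thereby models its randomness, so it predicts fresh data worse:

```python
import random

random.seed(1)

# A biased coin: heads with probability 0.7. The learnable structure
# is just that bias; everything else is noise.
train = [1 if random.random() < 0.7 else 0 for _ in range(1000)]
test  = [1 if random.random() < 0.7 else 0 for _ in range(1000)]

# Simple model: always predict the majority symbol seen in training.
majority = 1 if sum(train) > len(train) / 2 else 0
simple_acc = sum(t == majority for t in test) / len(test)

# "Overfit" model: memorize the training sequence position by position
# and replay it on the test sequence.
overfit_acc = sum(t == p for t, p in zip(test, train)) / len(test)

print("simple model test accuracy :", simple_acc)   # in expectation 0.7
print("overfit model test accuracy:", overfit_acc)  # in expectation 0.7**2 + 0.3**2 = 0.58
```

The memorizing model is perfect on its own training data, yet on new flips it does measurably worse than the one-parameter majority model: the extra "parameters" encoded noise, not structure.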
We are now ready to define complexity: it is the information necessary to encode the model (the parameters), excluding the residual random bits of the string that remained unpredictable. A white image contains very little information to begin with, and all of it can be predicted by a trivial model (no random bits), so the model part is still very small. A random string, on the other hand, is completely unpredictable: even though it carries a lot of information (a la Shannon), the model part is zero. Both cases have vanishing complexity, as we wanted, while the tree-image will have lots of it.
Complexity is a relative quantity. For instance, it depends on factors such as the length of the string: for short strings it doesn't pay off to use a complicated model. This is not so bad in my opinion, because imagine a primordial world in which all that has ever been created is the string "GOD". The string is just three symbols embedded in nothingness. It will only acquire its structure (meaning) after we read the bible.
Another, more serious objection is the fact that the true complexity is incomputable: it can be shown that, in general, one cannot decide whether the model you have is really the optimal one. Imagine the fractal image below.
The image looks wonderfully complex. Who would have thought that there is a really simple algorithm to produce it? So, practically speaking, complexity is defined relative to how good a modeler you are.
Then there is the issue of scale. If we view the world at a large scale, it may look very predictable. However, the world is composed of quadrizillions of elementary particles moving chaotically. At the small scale the world is random, but at a larger scale order emerges through averaging. This may be best illustrated by going back to the coin toss. If we are asked to predict the next outcome of a coin flip, we are at a loss. However, if we are asked to predict the total number of heads over the next 100 coin tosses, we can make a good guess (50 for a fair coin). This is because, even though every sequence of 0's and 1's is equally likely, there are many more sequences with 50 heads and 50 tails than there are sequences with, for instance, only heads.
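The counting argument can be checked directly with binomial coefficients (the 40-to-60 window below is just an illustrative choice):

```python
from math import comb

# Any particular sequence of 100 fair coin flips has probability 2**-100,
# but the number of sequences with exactly k heads is C(100, k).
total = 2 ** 100
print("P(all heads)        =", 1 / total)
print("P(exactly 50 heads) =", comb(100, 50) / total)
# The bulk of all sequences has a near-even head count:
bulk = sum(comb(100, k) for k in range(40, 61)) / total
print("P(40..60 heads)     =", bulk)
```

All-heads is one sequence out of 2^100, while roughly 8% of all sequences have exactly 50 heads and the overwhelming majority land between 40 and 60: order at the aggregate scale out of randomness at the level of single flips.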
Finally there is the issue of the time needed to compress and decompress the string. We may have found a great model that does a wonderful job of compressing the input string but requires a very long computation time for encoding and decoding. This seems important, at least biologically. If I have a model of lions that takes 5 hours to use (i.e., to make predictions with), then it is not very useful in terms of my chances of survival. As another example, consider the Mandelbrot fractal again. There is a very short prescription to generate it, but one needs considerable time to actually compute the image (in fact, this connects to the previous point, because you would first have to specify the scale at which to stop, otherwise it would take infinitely long).
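A minimal sketch of that trade-off: the whole prescription for the Mandelbrot set fits in a few lines, yet the work to render it grows with the resolution and the iteration cutoff you choose (the grid size and the cutoff of 100 below are arbitrary choices):

```python
def mandelbrot(width=60, height=24, max_iter=100):
    """Render an ASCII Mandelbrot set: '#' for points that have not
    escaped after max_iter iterations of z -> z*z + c."""
    rows = []
    for j in range(height):
        row = ""
        for i in range(width):
            c = complex(-2.0 + 2.8 * i / width, -1.2 + 2.4 * j / height)
            z = 0j
            n = 0
            while abs(z) <= 2 and n < max_iter:
                z = z * z + c
                n += 1
            row += "#" if n == max_iter else " "
        rows.append(row)
    return "\n".join(rows)

print(mandelbrot())
```

The description is tiny, but the running time is proportional to pixels times iterations, and there is no finished picture at all until you fix both: exactly the point about choosing a scale at which to stop.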
In machine learning, researchers have also started to realize the importance of computation in addition to prediction quality. They acknowledge that modern-day problems are so large that a really good model may take too long to train and to use for predictions. So the computational budget is factored into the total equation, favoring faster algorithms and models over slower ones even when the latter are more accurate (it is in this sense that machine learning is really different from statistics, and in fact the reason why machine learning sits in the computer science department and not the mathematics department).
Should we adapt our definition of complexity to accommodate running time? Gell-Mann thinks so, and he argues for a bound on the computation time. There are indeed definitions of complexity, such as "logical depth", that do just that. But if we include computation time, shouldn't we also include the amount of memory we need during computation (since the two can be traded off)? Uh-oh, we are descending into the bottomless pit called complexity research, with its 31 definitions. This may be a good point to stop. :-)