There are related bodies of work by other people going in other directions, but in my case the emphasis is on using the idea of algorithmic complexity to obtain incompleteness results. I became interested in this as a teenager and have worked on it ever since.
Let me tell you that story. History is extremely complicated, with many different points of view. What will make my account simple is the unity of purpose imposed on a field that is a personal creation, that has a central spine, that pulls a single thread. What did it feel like to do that? In fact, it's not something I did. It's as if the ideas wanted to be expressed through me.
It is an overwhelming experience to feel possessed by promising new ideas. This happened to me as a teenager, and I have spent the rest of my life trying to develop the ideas that flooded my mind then. These ideas were deep enough to merit 45 years of effort, and I feel that more work is still needed. There are many connections with crucial concepts in other fields: physics, biology, philosophy, theology, artificial intelligence... Let me try to remember what happened to me... The history of a person's life, that's just gossip. But the history of a person's ideas, that is real, that is important, that is where you can see creativity at work. That is where you can see new ideas springing into being.
I was also fascinated by computers, and by the computer as a mathematical concept. In 1936 Turing derived incompleteness from uncomputability. My work follows in Turing's footsteps, not Gödel's, but adds the idea of looking at the size of computer programs.
For example, let's call a program Q ``elegant'' if no program written in the same language that is smaller than Q produces the same output. Can we prove that individual programs are elegant? In general, no. Any given formal axiomatic system can only enable us to show that finitely many programs are elegant.
It's easy to see that this must be so. Just consider a program P that calculates the output of the first provably elegant program that is larger than P. P runs through all the possible proofs in the formal axiomatic system until it finds the first proof that an individual program Q larger than P is elegant, and then P runs Q and returns Q's output as its (P's) output.
If you assume that only true theorems can be proved in your formal axiomatic system, then P is too small to be able to produce the same output as Q. If P actually succeeds in finding the program Q, then we have a contradiction. Therefore Q is never found, which means that no program that is bigger than P can be proven to be elegant.
So how big is P? Well, it must include a big subroutine for running through all the possible proofs of the formal axiomatic system. The rest of P, the main program, is rather small; P is mostly that big subroutine. That's the key thing, to focus on the number of bits in that subroutine.
So let's define the algorithmic complexity of a formal axiomatic system to be the size in bits of the smallest program for running through all the proofs and producing all the theorems. Then we can state what we just proved like this: You can't prove that a program is elegant if its size is substantially larger than the algorithmic complexity of the formal axiomatic system that you are using.
Instead of saying ``a formal axiomatic system of algorithmic complexity N,'' I'll just say ``N bits of axioms.'' So if you have N bits of axioms, then no program larger than N + c bits in size can be proven to be elegant. That's the result we just proved.
A more sophisticated example is the number I call Ω, which is the halting probability of a computer running a program produced one bit at a time by repeatedly tossing a coin. Because it is a probability, this number has to be between zero and one. Imagine writing it out in binary:
These bits are peculiar, they are irreducible mathematical information. This means that a formal axiomatic system with N bits of axioms can enable you to determine at most N + c bits of Ω. Essentially the only way to determine bits of Ω is to add that information directly to your axioms. Even though Ω is a single well-defined real number (once you fix the programming language), its bits have no structure, no pattern, none at all, they are irredundant, irreducible mathematical information.
In other words, the bits of Ω are mathematical facts that are true for no reason, no reason simpler than themselves.
So that's the basic idea, and those are my two favorite results, but the devil is in the details. You can spend your life on those details, and I did.
The essay question is what do you conclude if you find a pin on the moon. [This was before the first lunar landing.] My answer is that this means that somebody must have visited before you, because a pin is not natural, it is artificial, the product of intelligence. And, I remark, what this means is that there is a small program to calculate it, to create it. That's how we can tell that the pin has structure and is artificial. And, contrariwise, something natural would not have a description that can be compressed into a small program, because it was not designed.
And then, as a throw-away remark, I state that a random thing is one that cannot be compressed into a smaller program. More precisely, I am speaking about a digital description of an object, not about the object itself. In other words, in 1962 I give the following
However, I quickly forget about this definition, because I am having so much fun learning how to write, debug and run computer programs in the Science Honors Program. And I am given the run of the math stacks at Columbia University and can hold in my hands and study the collected works of Euler and other priceless volumes.
Also that summer, I get the first incompleteness result that I will publish, UB1, an upper bound on the provable lower bounds on run-time complexity in any given formal axiomatic system. This is published in 1970 in a Rio de Janiero Pontifícia Universidade Católica research report, and only there. [While writing up that report in Rio, I realize I can also obtain an upper bound on the provable lower bounds on program-size complexity.]
Another discovery that summer, UB2, is that one can diagonalize over the output of all programs that provably calculate total functions f:N→N to obtain a faster growing computable total function F:N→N. That is to say, given any formal axiomatic system, one can construct a computer program from it that calculates a total function f:N→N, but the fact that this program calculates a total function f:N→N cannot be proved within the formal axiomatic system, because f goes to infinity too quickly. ``Calculates a total function f:N→N'' merely means that every time we give the program f a natural number n as input, it eventually outputs a single natural number f(n) and then halts.
The result UB1 is actually a corollary of UB2, since all lower bounds on run-time complexity are computable total functions.
Now one would say that the proof of UB2 is an instance of Cantor diagonalization, but in my opinion it's really closer to Paul du Bois-Reymond's theorem on orders of infinity. His theorem is that for any scale of rates of growth, any infinite list of functions that go to infinity faster and faster, for example
there is another function
that goes to infinity even more quickly. As far as I know, Paul du Bois-Reymond's work was independent of Cantor's.
Note the Cantor ordinal number ω as a subscript. We can then form
and then
Continuing in this manner, we get to
and onwards and upwards, incredibly fast-growing functions all.
I had read about Paul du Bois-Reymond's work in a monograph by G. H. Hardy called Orders of Infinity. [I learned the calculus from Hardy's A Course of Pure Mathematics, and also enjoyed A Mathematician's Apology and Hardy and Wright, An Introduction to the Theory of Numbers.]
My point of view changes between 1964 and 1965. In my 1964 work, only infinite computations are considered. In 1965, on the contrary, only finite computations are considered, computations that produce a single output. Furthermore, my interest shifts from run-time complexity to program-size complexity.
In a footnote they remark that the theory of games seems to require a quantum-mechanical world in which God plays dice. Not really, I say to myself. Another logical possibility would be that the theory of games tells you to use a random sequence of choices, but you cannot compute this sequence of choices from the theory.
You see, in the game of matching pennies, if a theory can tell you exactly what to do, you can predict what your opponent will do and beat him. The solution to the paradox is either that the theory asks you to use physical randomness, which is unpredictable, or that you have a theory that says you must make mathematically random or unstructured choices, and the contradiction is avoided because these are in fact uncomputable (a notion taken from Turing).
The third leg of the stool comes from reading Shannon, who defines a message to be random or have maximum entropy if it cannot be compressed, if it cannot be encoded more compactly. Obviously the most general possible decoder would be a universal Turing machine, a general-purpose computer.
With my 1962 definition (R1), I have managed to connect game theory, information theory and computability theory. Now all I have to do is work out the details.
Now it's the summer vacation between my first and second year at City College, and I attempt to carry out the plan. At the beginning of the summer, the road forward seems blocked, but I keep trying. Later in the summer, ideas begin to flood into my mind. I write a single paper that is the size of a small book. At the request of the editors, I later divide it in two, and delete much material to save space.
This paper ``On the length of programs for computing finite binary sequences''---part one is published in 1966 and part two is published in 1969, both in the ACM Journal---presents three different theories of program-size complexity (and embryonic versions of ideas that I would explore for years):
Let's define the complexity of a bit string to be the size of the smallest program that computes it.
In each case, theory (A), (B) or (C), I show that most n-bit strings have complexity close to the maximum possible, and I determine asymptotic formulas for the maximum possible complexity of an n-bit string. These maximum or near maximum complexity strings are defined to be random. To show that this is reasonable, I prove, for example, that these strings are ``normal'' in Borel's sense. This means that all possible blocks of bits of the same size occur in such strings approximately the same number of times, an equi-distribution property.
I start with theory (A) because that seems the most straightforward thing to do. The idea in theory (A) is to eliminate all the redundancy in a real programming language. Then I switch to theory (B), in which I don't eliminate the redundancy, I live with it. The proofs are prettier; more subtle, not so heavy-handed. However, in theories (A) and (B) I cannot figure out how to show that a small amount of structure in an n-bit string will force its complexity to dip below the maximum possible complexity for n-bit strings.
To solve this, I switch from Turing machines to binary programs and theory (C). Theory (C) solves the problem, but feels too easy, like stealing candy from a baby. For example, in theory (C) it is trivial to show that most n-bit strings have close to the maximum possible complexity. And this maximum possible complexity is precisely n + 1, not an asymptotic estimate as in theories (A) and (B).
[To get (max complexity n-bit string) = n + 1, theory (C) has to be a bit more complicated.
(C2) is the version of theory (C) given in ``On the length of programs for computing finite binary sequences: statistical considerations'' (ACM Journal, 1969), the second of the two papers put together from my 1965 randomness manuscript.]
By the way, in theories (A) and (B), randomness definition (R1) does not apply, because the size of programs is measured in states, not bits. It is necessary to use a slightly different definition of randomness:
In theory (C), (R1) works fine (but so does (R2), which is more general).
Theory (C) is essentially the same as the one independently proposed by Kolmogorov at about the same time (1965). [Solomonoff was the first person to publish the idea of program-size complexity---in fact, (C)---but he did not propose a definition of randomness.]
However, I am dissatisfied with theory (C); the absence of subadditivity disturbs me. What is subadditivity? The usual definition is that a function f is subadditive if f(x + y) ≤ f(x) + f(y). I mean something slightly different. Subadditivity holds if the complexity of computing two objects together (also known as their joint complexity) is bounded by the sum of their individual complexities. [In the case of joint complexity the computer has two outputs, or outputs a pair of objects, whatever you prefer.] In other words, subadditivity means that you can combine subroutines by concatenating them, without having to add information to indicate where the first subroutine ends and the second one begins. This makes it easy to construct big programs. Complexity is subadditive in theories (A) and (B), but not in theory (C).
Last but not least, ``On the length of programs for computing finite binary sequences'' contains what I would now call a Berry paradox proof that program-size complexity is uncomputable. This seed was to grow into my 1970 work on incompleteness, where I refer to the Berry paradox explicitly for the first time.
This first information-theoretic incompleteness result is immediately published in a Rio de Janiero Pontifícia Universidade Católica research report and also as an AMS Notices abstract, and comes out the next year (1971) as a note in the ACM SIGACT News.
I obtain a LISP 1.5 Programmer's Manual in Brazil and start writing LISP interpreters and inventing LISP dialects. [Around 1973, I give courses on LISP and on computability and metamathematics at the University of Buenos Aires.]
(D) is the mature theory, I believe. I immediately put this out as an IBM research report. I present this paper at the opening plenary session of the IEEE International Symposium on Information Theory in Notre Dame, Indiana, in October 1974. It is published in the ACM Journal in 1975 as ``A theory of program size formally identical to information theory.''
There are three key ideas in this paper: self-delimiting programs, a new definition of relative complexity, and the idea of getting program-size results indirectly from probabilistic, measure-theoretic arguments involving the probability P(x) that a program will calculate x. I call this the algorithmic probability of x. [Solomonoff tried to define P(x) but could not get P(x) to converge since he wasn't working with self-delimiting programs.] Summing P(x) over all possible outputs x yields the halting probability Ω:
permits us to translate complexities into probabilities and vice versa. Here the complexity H(x) of x is the size in bits of the smallest program for calculating x, and the O(1) indicates that the difference between the two sides of the equation is bounded.
Incidentally, (*) implies that there are few minimum or near-minimum size programs for calculating something, few minimal descriptions. That is, an elegant program for calculating something is essentially unique. [For more on this, see my book Exploring Randomness (2001). I had proven this result in theory (C) in 1972, but that proof wasn't published until 1976, in ``Information-theoretic characterizations of recursive infinite strings'' in Theoretical Computer Science.]
Where did the halting probability Ω come from? How did I come up with it? Well, already in part two of my first paper on randomness, ``On the length of programs for computing finite binary sequences: statistical considerations'' (ACM Journal, 1969), I use a Heine-Borel-style algorithm to exhibit a specific example of a random infinite sequence of bits. I think it is important to come up with specific examples.
The halting probability Ω is a natural example of a random infinite sequence of bits. Besides providing a connection with the work of Turing, Ω makes randomness more concrete and more believable.
Furthermore, once you have a natural example of randomness, you immediately get an incompleteness theorem from it, as I point out in ``Gödel's theorem and information'' (International Journal of Theoretical Physics, 1982). My work in 1987 on Ω and its diophantine equation makes this fully explicit. I had been aware of this opportunity for getting a dramatic incompleteness result for some time. [See the end of the introductory section of ``Information-theoretic limitations of formal systems'' (ACM Journal, 1974).]
(*) is nice but it is only part of the story. The true reward for changing from theory (C) to (D) is this spectacular result:
In words, (the joint complexity of two objects) is equal to the sum of (the absolute complexity of the first object) plus (the relative complexity of the second object given the first object).
To get (**), self-delimiting programs aren't enough, you also need the right definition of relative complexity. [Levin claims to have published theory (D) first. However he missed this vital part of theory (D).] I had used relative complexity in my big 1965 randomness manuscript, but had eliminated it to save space. In my 1975 ACM Journal paper I take up relative complexity again, but define H(x|y), the complexity of x given y, to be the size in bits of the smallest self-delimiting program for calculating x if we are given for free, not y directly, but a minimum-size self-delimiting program for y.
And (**) implies that the extent to which computing two things together is cheaper than computing them separately, also known as the mutual information
is essentially the same, within O(1), of the extent to which knowing x helps us to know y
and this in turn is essentially the same, within O(1), of the extent to which knowing y helps us to know x
This is so pretty that I decide never to use theory (C) again. For (D) doesn't just restore subadditivity to (C), it reveals an elegant new landscape with sharp results instead of messy error terms. From now on, theory (D) only.
In the paper ``Algorithmic entropy of sets'' (Computers & Mathematics with Applications, 1976, written at the end of 1975), I attempt to extend the self-delimiting approach to programs for generating infinite sets of output. Much remains to be done. [If this interests you, please see the discussion of infinite computations in the last chapter of Exploring Randomness (2001).]
This topic is important, because I think of a formal axiomatic system as a computation that produces theorems. My measure of the complexity of a formal axiomatic system is therefore the size in bits of the smallest self-delimiting program for generating the infinite set of theorems.
Even though I spend most of my time on this engineering project, in 1982 I publish ``Gödel's theorem and information'' in the International Journal of Theoretical Physics. [At roughly the same time I fulfill a childhood dream by building my own telescope: I join a telescope-making club and grind a 6 inch f/8 mirror for a Newtonian reflector in a basement workshop at the Hayden Planetarium of the Museum of Natural History.] This is later included with my 1974 IEEE Transactions on Information Theory paper in the influential anthology Tymoczko, New Directions in the Philosophy of Mathematics, Princeton University Press, 1998.
My 1985 publication ``An APL2 gallery of mathematical physics: a course outline'' gives computational working models of physical phenomena to be inserted between chapters of Einstein and Infeld, The Evolution of Physics. This APL2 physics simulation software is extremely concise. [Besides learning the physics and some numerical analysis, I wanted to get a feel for the algorithmic complexity of the laws of physics.]
Some of these new ideas are presented in the paper ``Incompleteness theorems for random reals'' (1987). This paper contains a proof of the basic result that an N-bit formal axiomatic system cannot enable you to determine more than N + c bits of Ω, bits that may be scattered and do not have to be together at the beginning of Ω. This is the measure-theoretic proof that I give in the Cambridge book. After writing this paper, work on the book begins in earnest.
Using work by Jones and Matijasevic on Hilbert's 10th problem, I calculate a 200-page diophantine equation for Ω. [It's actually what's called an exponential diophantine equation.] This equation has thousands of unknowns and a parameter k, and has finitely or infinitely many whole-number solutions depending on whether the kth bit of Ω is, respectively, a 0 or a 1. To get this equation, I convert a register machine program for a LISP interpreter into a diophantine equation, and then I plug into that equation a LISP program for computing lower bounds on Ω.
Simultaneously, World Scientific publishes a collection of my papers, Information, Randomness and Incompleteness.
I also publish four papers on LISP program-size complexity:
In these four papers several different dialects of LISP are studied, and appropriate versions of the halting probability ΩLISP are invented for each of them, together with the corresponding incompleteness theorems. The techniques developed in my complexity theories (A) and (B) are put to good use here.
These LISP Ω numbers may not be as random as the fully random Ω number in theory (D), but, like the so-called random infinite sequences in my original 1965 randomness paper, they come close. For example, they are Borel normal for blocks of all sizes in any base.
These 1992 papers are immediately included in my second World Scientific volume, Information-Theoretic Incompleteness (1992).
I lecture at a meeting on reductionism at Cambridge University. The transcript of that lecture, ``Randomness in arithmetic and the decline and fall of reductionism in pure mathematics,'' appears later in Cornwell, Nature's Imagination, Oxford University Press, 1995.
Now, 30 years after starting to work on program-size complexity, I can finally run programs and measure their size.
Several years are needed to complete this concrete version of AIT and to present it in my three Springer-Verlag volumes, The Limits of Mathematics (1998), The Unknowable (1999), and Exploring Randomness (2001). These books come with LISP software and a LISP interpreter.
Honorary doctorate, University of Maine.
Springer-Verlag publishes Conversations with a Mathematician, a collection of some of my lecture transcripts and interviews.
I'm invited to present a paper at a philosophy congress in Bonn, Germany, in September. For this purpose, I begin to study philosophy, particularly Leibniz's work on complexity, which I am led to by a hint in a book by Hermann Weyl.
My paper appears two years later (2004) in a proceedings volume published by the Academy Press of the Berlin Academy that was founded by Leibniz. It is reprinted as the second appendix in my book Meta Math! (2005).
Write Meta Math!, a high-level popularization of AIT, published the following year by Pantheon Books (2005). This book is not just an explanation of previous work; it presents a système du monde.
Meta Math! follows Émile Borel in questioning the existence of the bulk of the real numbers, a train of thought further developed in my paper ``How real are real numbers?'' (International Journal of Bifurcation and Chaos, 2006). [Vladimir Tasic pointed out to me that Borel has a know-it-all real number---a 1927 version of the Ω number. My paper on the ontological status of real numbers is dedicated to Borel's memory.]
Honorary president of the scientific committee of the Valparaíso Complex Systems Institute in Chile. Member of the scientific advisory panel of FQXi, devoted to Foundational Questions in Physics & Cosmology.
A collection of some of my philosophy papers is published in Turin. Full member, Académie Internationale de Philosophie des Sciences.
To paraphrase Einstein, this timeline will have fulfilled its purpose if it shows how the efforts of a lifetime hang together, and why they lead to certain definite expectations. [See the final sentence of Einstein's Autobiographical Notes.] In particular, I think it would be fruitful to explore the following topics.