Monday, March 8, 2010

Million Monkeys on a Million Typewriters and Other Failed Experiments

Many, many years ago I came across a most compelling thought experiment (you must have heard it as well). It goes like this:

If a million monkeys typed on a million typewriters for a million years, one of them would be sure to type out one of Shakespeare's plays.

Of course, you would need a million English teachers to read all the typed material, as well as thousands of zookeepers to tend to the monkeys - but that would create jobs. Perhaps this grand experiment could be funded by the Jobs Bill. It would surely stimulate the economy, and we could sell the poems, short stories, novels, and other works created by the monkeys. So, it would be revenue-neutral.

MY EXPERIMENT

I decided to do the experiment using the latest in computer tools: my laptop plus an Excel spreadsheet. You can download the Excel spreadsheet here. Every time you press the F9 key on your PC keyboard, a "monkey" produces a paragraph.

Now, to give the "monkey" every possible advantage, I created a special keyboard that has more "E" keys and "A" keys and so on, according to their frequency in English text. The keyboards have no numbers or special characters. Half the monkeys work on keyboards with no punctuation, since any good English teacher can visually pick words out of a stream of characters, and phrases and sentences out of a stream of words. The other half have keyboards that include some comma and period keys.
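For readers who would rather not open Excel, here is a minimal Python sketch of the same idea. The letter weights below are rough illustrative values, my own approximation, not the exact frequencies used in the spreadsheet:

```python
import random

# Approximate relative frequencies of English letters (illustrative values),
# plus a space key weighted heavily enough to produce word-sized gaps.
weights = {
    ' ': 180, 'e': 102, 't': 75, 'a': 65, 'o': 62, 'i': 57, 'n': 57,
    's': 51, 'h': 50, 'r': 48, 'd': 35, 'l': 33, 'u': 23, 'c': 22,
    'm': 20, 'w': 19, 'f': 18, 'g': 16, 'y': 16, 'p': 15, 'b': 12,
    'v': 8, 'k': 6, 'j': 1, 'x': 1, 'q': 1, 'z': 1,
}

keys = list(weights)
counts = list(weights.values())

def monkey_paragraph(n_keystrokes=500):
    """One 'monkey' hits n_keystrokes keys on the frequency-weighted keyboard."""
    return ''.join(random.choices(keys, weights=counts, k=n_keystrokes))

print(monkey_paragraph())
```

Each run plays the role of one press of F9: a fresh paragraph of weighted-random characters for the "English teacher" to scan for words.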

Below is the BEST example I could find in quite a few tries. Words are outlined in red.
I really expected to see more words, some phrases, and perhaps a sentence or two. NADA!
Download the Excel spreadsheet here and try it yourself! Perhaps your "monkeys" will have better luck.


WHY DID IT FAIL?


This is a good example of how the most compelling thought experiments can mislead even the most intelligent among us. Indeed, it would probably take an overly intelligent person to buy into this concept.

It turns out that Wikipedia has the explanation for the failure of our expectations:

"... If there are as many monkeys as there are particles in the observable universe (10^80), and each types 1,000 keystrokes per second for 100 times the life of the universe (10^20 seconds), the probability of the monkeys replicating even a short book is nearly zero."

THE "PARTS" PROBLEM

One of the many valuable things I learned about from Prof. Howard Pattee, when he was my teacher and Chairman of my PhD Committee, was the "parts problem".

For a partially random process of assembly to result in anything of value, the parts must be in the right proportion to the thing you intend to produce. In the case of the Million Monkeys, the parts are random letters and we are looking for a book-length result (a Shakespeare play).

All we got were a few English words sprinkled among lots of gibberish.

Had we started with words as the parts, and had they been in proportion to their frequency in the English language, we would have obtained better results.

For example, there are about 10,000 English words that constitute over 99% of all English writing. Say we had a "keyboard" with 100,000 keys, with each word having as many keys as justified by its frequency in the English language. A "monkey" typing thousands of keystrokes on such a "keyboard" would be quite likely to produce a number of grammatical phrases and even sentences, and perhaps several meaningful sentences. The "monkey" might even produce a unique, original thought.
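Here is what that word-level "keyboard" might look like in Python. The word list and counts are stand-ins I invented for illustration; a real run would use an actual frequency table derived from a large corpus:

```python
import random

# A toy stand-in for the 100,000-key word keyboard: each word gets a number
# of "keys" proportional to its (invented, illustrative) frequency count.
word_keys = {
    'the': 700, 'of': 360, 'and': 280, 'a': 230, 'to': 220, 'in': 210,
    'is': 110, 'was': 95, 'he': 90, 'for': 85, 'it': 80, 'with': 70,
    'as': 68, 'his': 65, 'on': 60, 'be': 58, 'at': 45, 'by': 45,
    'had': 43, 'not': 42, 'boy': 5, 'girl': 5, 'ball': 3, 'loves': 2,
}

words = list(word_keys)
counts = list(word_keys.values())

def word_monkey(n_keystrokes=30):
    """A 'monkey' typing on the word-frequency keyboard."""
    return ' '.join(random.choices(words, weights=counts, k=n_keystrokes))

print(word_monkey())
```

Every keystroke is now guaranteed to be a real English word, so the "English teachers" only have to hunt for grammatical runs, not for words.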

We'd be still further ahead if we used the computer to impose grammatical structure. The "monkey" would have three "keyboards": the first would have keys for SUBJECT, the second for VERB, and the third for OBJECT. Thus, each sentence would be of the form "The BOY LOVES the GIRL" or "JACK HITS the BALL", etc. Of course, most sentences, while grammatical, would not be meaningful: "The GIRL HITS the BEDROOM" or "The GRAPEFRUIT LOVES the SHOES", etc.
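A sketch of the three-keyboard scheme, with word lists of my own choosing:

```python
import random

# Three "keyboards," one per slot of the fixed Subject-Verb-Object frame.
# The word lists are illustrative picks, not a serious vocabulary.
subjects = ['The BOY', 'The GIRL', 'JACK', 'The GRAPEFRUIT']
verbs = ['LOVES', 'HITS', 'SEES', 'EATS']
objects = ['the GIRL', 'the BALL', 'the BEDROOM', 'the SHOES']

def svo_sentence():
    """One keystroke on each keyboard yields one grammatical sentence."""
    return f"{random.choice(subjects)} {random.choice(verbs)} {random.choice(objects)}."

for _ in range(5):
    print(svo_sentence())
```

Because the frame itself is fixed, every output is grammatical; randomness is confined to the slots, which is exactly why the hit rate for meaningful sentences jumps so dramatically.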

Of course, this system would create only the simplest of simple sentences.

For more natural sentence structure, we could use larger parts, such as phrases, or perhaps ready-made fill-in-the-blank sentences like those we used to play as party games.

Like many things in the natural and artificial world, written language is a hierarchical set of structures. English consists of letter characters that, taken in groups, form words. (But you can't just take random letters and get a word; you need a proper ordering of vowels and consonants, etc.) At the next level up are simple and compound sentences made up of groups of words. (But that cannot be random either; you need subject-verb-object substructures.) At the next level come paragraphs, then sections, then chapters, etc.

HOW THE GENETIC SYSTEM SOLVES THE "PARTS" PROBLEM

When we decode the genome of some animal, we express it in a series of four nucleotides: A, T, G, C. Each of these letters stands for a molecular assemblage containing a dozen or two atoms. Sequences of these letters (in the "genotype") code for various amino acids, and chains of amino acids form proteins. The stretches of DNA that code for proteins are what we call "genes", and these give rise to physical characteristics (in the "phenotype").

The genetic system long ago settled on a really neat hierarchical system where the lowest levels are very stable and most are common between different species. When DNA is copied, there are multiple instances of the codes for the really important proteins. There are correction mechanisms for many types of mutations (copying errors). The same is true at the next level, which we call "genes", and at the level above that, of multiple genes working in concert, etc.

Notice how, in the genetic system that has been evolving over the past three or four billion years, the "parts" at each level are appropriately sized for their jobs. My Optimal Span Hypothesis (http://iraknol.wordpress.com/article/optimal-span-3ncxde0rz8dtk-2/) provides a basis, founded in well-established information theory, for how hierarchical systems are most effectively organized.

A TEXT GENERATOR THAT REALLY WORKS

Here are excerpts from a "Post-Modernist" academic paper I just generated:

Reinventing Modernism: Neosemioticist objectivism, capitalism and Derridaist reading
Andreas Porter
Department of Ontology, Massachusetts Institute of Technology


1. Derridaist reading and patriarchial conceptualism
In the works of Burroughs, a predominant concept is the distinction between figure and ground. Therefore, subtextual dialectic theory states that truth is unattainable.

The main theme of the works of Burroughs is the failure, and hence the futility, of precapitalist sexuality. In a sense, several discourses concerning patriarchial conceptualism may be discovered.

If Batailleist 'powerful communication' holds, we have to choose between subtextual dialectic theory and cultural narrative. But the creation/destruction distinction prevalent in Burroughs's The Last Words of Dutch Schultz emerges again in Naked Lunch.

2. Expressions of dialectic
The characteristic theme of Hamburger's[1] analysis ...

1. Hamburger, O. ed. (1972) Subtextual dialectic theory and Derridaist reading. University of California Press ...


The above "scholarly paper" and as many as you'd want to see like it, are available at Communications from Elsewhere.

The computer program behind this feat starts with parts that are very large. Indeed, each paper has a Title, Authors, Sections (with paragraphs and sentences), and Citations. Each of these has a set form. The only randomness is the insertion of words from certain lists into specified blanks. The results are quite compelling.
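A stripped-down sketch of that approach, using toy templates and word lists of my own invention (not the actual grammar behind Communications from Elsewhere):

```python
import random

# Toy word lists; the spellings mirror the quoted generator output above.
thinkers = ['Burroughs', 'Derrida', 'Bataille', 'Foucault']
isms = ['subtextual dialectic theory', 'patriarchial conceptualism',
        'precapitalist sexuality', 'cultural narrative']

# Fixed sentence templates; only the blanks are filled in at random.
templates = [
    'In the works of {t}, a predominant concept is {i}.',
    'The main theme of the works of {t} is the failure of {i}.',
    'If {i} holds, we have to choose between {i2} and {i3}.',
]

def sentence():
    t = random.choice(thinkers)
    i, i2, i3 = random.sample(isms, 3)
    return random.choice(templates).format(t=t, i=i, i2=i2, i3=i3)

print(' '.join(sentence() for _ in range(3)))
```

The "parts" here are whole sentence frames, which is why the output hangs together so much better than anything the letter-level monkeys produced.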

Indeed, if you gave one of these papers to a group of intelligent people who were not experts in post-modernism, many would accept them as peer-reviewed material. And a Post-Modernist journal might peer-review and accept the paper for publication! (See the Sokal Affair.)
Ira Glickstein

5 comments:

Howard Pattee said...

The monkey theorem is conceptually stimulating and can be instructive for probability theorists, but it is of little interest to linguists, scientists, and artists because any random string of symbols is meaningless.

What we call a language must have meaning, by definition. Meaning in any sign or symbol system requires establishing a triadic relation between symbol, interpreter, and referent (see Wiki “C. S. Peirce” on signs). This largely arbitrary but fixed relation is partially codified in what we call a dictionary, but a dictionary is circular. It takes much more direct experience than reading a dictionary to ground a word’s meaning in the real world. The interpreter (cell, brain, or computer hardware) is what determines meaning, not the sequence of marks.

In other words, a language whether natural, mathematical, or artificial, is a set of arbitrary but fixed constraints on the order of symbol sequences, and because it is ordered it is not random, by definition. Such nonrandom constraints are necessary for meaning in any medium.

Igor Stravinsky said it in his Poetics of Music: “The more constraints one imposes, the more one frees oneself from the chains that shackle the spirit . . . and the arbitrariness of the constraint serves only to obtain precision of execution.”

Ira Glickstein said...

Correct Howard, "...any random string of symbols is meaningless."

Yet we believe that life originated on Earth via random processes. The best explanation you and I know started with randomly-generated groups of proteins that happened to form autocatalytic cycles that reproduced themselves. Some "lucky" autocatalytic cycle happened to generate primitive RNA-like strings that reproduced in what has been called "RNA World". There must have been zillions of different "lucky" RNA Worlds that came and went until one happened to generate "super-lucky" primitive DNA-like double strings.

Up to this point, we were depending upon random events, betting (as in the million monkey caper) something would come together with "meaning" - whatever that is in a world without sentient beings.

I believe you told me you were involved with, or around at the time of, the Stanley Miller and Harold Urey experiment that actually generated some amino acids from inorganic precursors.

So, it seems the Laws of Physics and Chemistry dictate that random processes will generate organic compounds. Though Miller-Urey did not succeed in generating more than the amino acids, do you believe such experiments could eventually generate primitive single-cell life?

Isn't this akin to the million monkeys eventually generating at least one coherent short story - perhaps not one that is exactly the same as any ever generated by a known writer like Shakespeare, but one that nevertheless would have meaning for any person who understands written English?

Once primitive single-cell, DNA-based, reproducible life was established on Earth, was it not more or less inevitable that it would evolve into more complex forms, including multi-cell life? Perhaps not exactly the same as the life we have now, but somewhat similar?

So, back to your observations about "meaning". In the story of the origin of life on Earth that you and I basically accept, when did "meaning" evolve? In the case of the million monkeys, when does "meaning" evolve? Is it when one of the English teachers happens upon a sentence or paragraph or short story that has meaning to him or her, or was it when that particular monkey happened to type it?

Ira Glickstein

Howard Pattee said...

Ira, we're back to our old argument. The view you like is that random events are just our ignorance of events that are all strictly determined. So in your view life was inevitably determined with the initial conditions at the big bang.

The other view is that nothing is deterministic. All events are just probabilistic, but some have very high (or low) probabilities that are experimentally indistinguishable from determinism.

Since no one has thought of any way to finally empirically test either assumption, they must be considered metaphysical faiths.

These two positions are actually only our models of reality and they are both useful but incompatible [complementary] models.

Max Planck emphasized that, "For it is clear to everybody that there must be an unfathomable gulf between a probability, however small, and an absolute impossibility . . . Thus [deterministic] dynamics and statistics cannot be regarded as interrelated.”

Ira Glickstein said...

Right, Howard, our old argument! (Just like an old married couple :^)

OK, I am willing to take the "nothing is deterministic" stance from now on in this thread.

Given that "All events are just probabilistic", when do you think "meaning" arose in the Miller-Urey experiment that produced amino acids from non-organic precursors? (And, am I misremembering your involvement?)

Do you think it is possible that some future Miller-Urey-like experiment, starting with non-organic precursors, will eventually yield primitive reproducing life-forms? I believe such an experiment (assuming success) would produce something like primitive RNA and DNA (but most likely with different code details). Do you agree?

Assuming the million monkeys are "just probabilistic", and one happens to turn out a paragraph that you and I find "meaningful" and perhaps even "poignant", when did that "meaning" happen? When we read it or when the monkey typed it?

Sorry for all these questions, but you are, after all, my most influential professor!

Ira Glickstein

Howard Pattee said...

Ira wants to know: “when did that "meaning" happen? When we read it or when the monkey typed it?” My short answer is: “When we read it.”

But Ira is not alone in his question! The long answer is that this is the basic question of epistemology, a case of what philosophers call the mind-matter problem or more generally the symbol-matter problem. That is, when does any collection or pattern of matter become more than just physical and chemical substance? When does a molecule become a message?

Many biologists and semioticians argue that the first message was at the origin of life. That required the cell’s genetic message. Certainly DNA has meaning for the cell. It is the heritable record of past selection events that controls reproduction. At the other end of the hierarchy of evolved meanings cognitive scientists wonder when brain matter becomes conscious.

In physics this is called the measurement problem: When does the material measuring instrument produce a symbolic result? In quantum theory they say it is when the wave function loses its entanglement (decoheres) and becomes a classical probability. When and how that happens is very mysterious. In any case, most everyone (e.g., Heisenberg, Pauli, Bohr, von Neumann) agrees that to measure anything one must make a sharp epistemic cut between the measuring device and the system being measured.

In artificial intelligence this is the problem of when the hardware and voltages in a computer can be called symbolic. Harnad calls this the symbol-grounding problem. Searle’s “Chinese room” is an example of the conceptual problem. The Turing Test is the classic example.

Excuse my lecturing, but this is the problem I have worried about for 50 years. Google “epistemic cut” if you want to worry about it further.