Do Baboons really care about letter-pairs?

Monkey-reading, predictive patterns and machine learning.

Yoav Goldberg, April 2012.

Monkeys can Identify English Words

A recent paper in Science describes a cool experiment in which 6 baboons are taught by Jonathan Grainger to distinguish 4-letter English words (“talk”) from 4-letter gibberish words (“snux”).

This result received a considerable amount of press, and you can read some of the popular media coverage here and here.

While it is not very surprising that animals can identify individual letters (even pigeons can do it), the new baboon study demonstrated that the monkeys could learn to do much more than that. The monkeys learned to operate at the level of letter combinations, and to discriminate between valid and invalid letter combinations based on some unknown underlying patterns that govern the formation of English words and tell them apart from non-English, gibberish words. The baboons were not just memorizing a set of English words -- they learned to generalize and correctly categorize 4-letter words that were not shown to them before. The monkeys correctly identified about 75% of the words shown to them. This is impressive.

What do the monkeys learn?

But what are these patterns that the baboons learned to follow? The authors of the paper suggest that (emphasis is mine):

baboons were not simply memorizing the word stimuli but had learned to discriminate words from nonwords on the basis of differences in the frequency of letter combinations in the two categories of stimuli (i.e., statistical learning).

Our results indicate that baboons were coding the word and nonword stimuli as a set of letter identities arranged in a particular order. Baboons had learned to discriminate different letters from each other (letter identity) and to associate those letter identities with positional information. Their coding of the statistical dependencies between position-coded letters is reflected in (i) their ability to discriminate novel words from nonwords (i.e., generalization), (ii) the significant correlation between bigram frequency and the accuracy of responses to words, and (iii) the increase in errors in response to nonword stimuli that were orthographically more similar to known words.

That is, the claim is that baboons learn patterns based on the frequency of a letter in a certain position within a word, and maybe about the frequency of different pairs of adjacent letters (bigrams).

Is that really the case? Maybe. But in my view it is very hard to tell: while frequency probably had something to do with it, I am not sure the positional information is used. Why do I think so? Because you can learn to predict as accurately as the monkeys without looking at position information.

Online-Learning and Linear Models

(Readers familiar with some machine learning can go ahead and skip to the next section.)

Machine learning is a branch of computer-science that deals with learning patterns from data.

The learning protocol in the baboon experiment is similar to what machine-learning researchers call “online-learning”. It is informative to compare the performance of the baboons to that of a simple online-learning algorithm. Online-learning algorithms are learning methods that follow this protocol: the learner (algorithm or baboon) is shown examples (4-letter words) one at a time. After seeing an example, the learner makes a prediction (English or non-English). After making the prediction, the real answer is revealed (monkeys get a reward for a correct prediction and a flashing screen for an incorrect one) and the learner can use this information in order to do better on future examples. Figuratively, the learner updates its beliefs after each example, but what are its beliefs? This depends on the representation.

Let’s assume that each word is represented as a list of properties (for example, a property might be “the first letter is A”, “the word contains the letter combination OG” or “the word contains the letter C”). Then, a belief could be represented as a set of weights (numbers) associated with these properties. A positive weight for a property indicates that the property provides evidence for being an English word, while a negative weight indicates evidence against being an English word.

By summing the weights associated with the individual properties of a given word, we can tell whether the positive evidence is heavier than the negative evidence (the sum is larger than 0, and we predict an English word) or not (the sum is equal to or smaller than 0, and we predict non-English). This simplistic model is called a “linear prediction model”, and the task of learning under this model amounts to assigning a weight to each of the properties such that predictions are correct.
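To make the prediction rule concrete, here is a minimal Python sketch of a linear prediction model. The property names and the weight values are made up by hand for illustration; they are not learned from the baboon data, and the function names `score` and `predict` are my own.

```python
# Hypothetical, hand-picked weights (illustration only, not learned values).
weights = {"contains_TH": 2.0, "starts_with_Z": -1.5, "contains_Q": -1.0}

def score(properties, weights):
    """Sum the weights of the given properties; unseen properties count as 0."""
    return sum(weights.get(p, 0.0) for p in properties)

def predict(properties, weights):
    """Predict English (True) only if the summed evidence is strictly above 0."""
    return score(properties, weights) > 0

print(predict(["contains_TH"], weights))                   # True
print(predict(["starts_with_Z", "contains_Q"], weights))   # False
```

Note that a word with no known properties sums to 0 and is therefore predicted to be non-English, matching the rule described above.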

A simple online learning algorithm for a linear prediction model is known as the perceptron method, and is due to Frank Rosenblatt (1957). The algorithm works as follows:

- Start with all weights assigned to 0.

- After seeing an example (a set of properties), calculate the prediction based on the properties and the weights.

- If the prediction is correct, move on to the next example.

- Otherwise, update the current set of weights:

  - if the truth is “Non English”, subtract 1 from the weights of the word properties.

  - if the truth is “English”, add 1 to the weights of the word properties.

This algorithm essentially learns to associate negative weights with frequent non-word properties, and positive weights with frequent real-word properties.
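The steps above can be sketched in a few lines of Python. This is a minimal illustration under my own naming (`perceptron_online` is a hypothetical function, and the three-example stream is a toy made up for the demonstration), not the program used for the results reported here.

```python
from collections import defaultdict

def perceptron_online(examples):
    """One online pass over (properties, is_english) pairs.

    Returns the learned weights and the number of correct predictions.
    """
    weights = defaultdict(int)          # every weight starts at 0
    correct = 0
    for properties, is_english in examples:
        total = sum(weights[p] for p in properties)
        predicted_english = total > 0   # a sum of 0 or less means non-English
        if predicted_english == is_english:
            correct += 1                # correct prediction: no update
            continue
        delta = 1 if is_english else -1  # mistake: move weights toward the truth
        for p in properties:
            weights[p] += delta
    return weights, correct

# Toy stream of examples (made up for illustration).
stream = [
    (["T_1", "A_2"], True),
    (["X_1", "Q_2"], False),
    (["T_1", "Q_2"], True),
]
weights, correct = perceptron_online(stream)
print(correct)           # 2
print(weights["T_1"])    # 1
```

The first example is misclassified (all weights are 0, so the prediction is non-English), which pushes the weights of its properties up by 1; the next two examples are then predicted correctly.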

Of course, the set of properties used to represent words is crucial to what the algorithm can learn. For example, the property “the word has 4 letters” is completely uninformative in this experiment, as all words have 4 letters, while the property “the word starts with ZK” would be quite informative, as English words are very unlikely to start with this letter sequence.

Monkey vs. Machine

If we assume that the linear-model and the perceptron learning method are simple enough to be a lower bound on what monkeys can do (meaning that the monkeys’ representation and learning process are more elaborate than those used by the perceptron algorithm -- not an unreasonable assumption in my view considering what we know about brain structure), we can see what kind of information is needed to perform the task at a certain level by looking at what the perceptron algorithm is capable of under different word representations.  Maybe it’s sufficient to look at the first letter of each word and ignore the rest? Or maybe we need something much more elaborate?

Let’s see how the perceptron algorithm performs on the learning task the monkeys faced.

I downloaded the data used in the monkey experiment[1], and wrote a simple Python program implementing the perceptron algorithm. I used the subset of the data seen by VIOLETTE, the “dumbest” monkey in the group.

I begin with the word representation suggested in the paper, encoding the letter identity and its position.  In this representation the word “TALK” is represented as the four properties “T_1, A_2, L_3, K_4” while the word “ARCK” is represented as “A_1, R_2, C_3, K_4”.
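As a small sketch of this encoding (the function name `positional_features` is my own, and I assume positions are numbered from 1 as in the examples above):

```python
def positional_features(word):
    """Encode each letter together with its 1-based position, e.g. 'T_1'."""
    return [f"{letter}_{i}" for i, letter in enumerate(word.upper(), start=1)]

print(positional_features("TALK"))   # ['T_1', 'A_2', 'L_3', 'K_4']
print(positional_features("ARCK"))   # ['A_1', 'R_2', 'C_3', 'K_4']
```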

After seeing 100 words, the algorithm has a cumulative accuracy of 67%. After 1000 words, the accuracy is 79%; by the 2000th word it is 83%, and by the 40,000th example it is 90%. The accuracy at 1000 words is already much higher than the monkeys’ average of 75% accuracy after seeing all the words.
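By cumulative accuracy I mean the fraction of correct predictions among all the examples seen so far, where each prediction is made before the true answer is revealed. A minimal sketch of that bookkeeping, assuming we have recorded one True/False outcome per example (the function name and the four toy outcomes are illustrative only):

```python
from itertools import accumulate

def cumulative_accuracy(outcomes):
    """Given per-example outcomes (True = correct), return the running accuracy."""
    running_correct = accumulate(int(o) for o in outcomes)
    return [c / n for n, c in enumerate(running_correct, start=1)]

# Toy outcomes: one early mistake followed by three correct predictions.
print(cumulative_accuracy([False, True, True, True])[-1])   # 0.75
```

Because early mistakes stay in the count forever, the cumulative figure understates how well the learner is doing late in the sequence.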

Let’s try a simpler representation. Maybe we could get good predictions by looking only at the first letter of each word? Under this representation, the word “TALK” is represented as a single property “T” while the word “ARCK” is represented as “A”. This is clearly a dumb representation. Still, the algorithm can get an accuracy of 60% using this representation, and of almost 62% by looking only at the third letter. The baboons probably looked at more than one letter.

What if we look at the pair of the first two letters? Under this representation, the word “TALK” will be represented as a single property “TA” and the word “ARCK” as “AR”. The algorithm learns to predict with 80% accuracy. Better than the baboons.

How about the middle letter-pair? Under this representation, the word “TALK” will be represented as “AL” and the word “ARCK” as “RC”. Now the algorithm learns to predict with 85% accuracy -- much better than the baboons.

What if we look at all the consecutive letter pairs? Here “TALK” is represented as the 3 properties “TA, AL, LK”. The algorithm learns to predict with a striking 97% accuracy.

Another representation would be the set of letters in a word, without the position information. Under this representation, both “TALK” and “KTAL” would be represented as the four properties “T, A, K, L”. We look at the letters that compose each word, but not at the order between them. Accuracy is 76%, very similar to what the monkeys did.
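For reference, the simpler representations tried in this section can each be written as a one-line feature function. The names below are my own and this is only a sketch of the idea, not the code behind the numbers reported here; each function assumes a 4-letter word.

```python
def first_letter(word):
    return [word[0]]                     # "TALK" -> ["T"]

def third_letter(word):
    return [word[2]]                     # "TALK" -> ["L"]

def first_pair(word):
    return [word[:2]]                    # "TALK" -> ["TA"]

def middle_pair(word):
    return [word[1:3]]                   # "TALK" -> ["AL"]

def all_bigrams(word):
    # every pair of adjacent letters: "TALK" -> ["TA", "AL", "LK"]
    return [word[i:i + 2] for i in range(len(word) - 1)]

def letter_set(word):
    # the letters of the word with order discarded (sorted only for display)
    return sorted(set(word))

print(all_bigrams("TALK"))                         # ['TA', 'AL', 'LK']
print(letter_set("TALK") == letter_set("KTAL"))    # True
```

Since `letter_set` builds a set, a repeated letter contributes a single property, which is one reasonable reading of "the set of letters in a word".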

We could go on and on trying different representations and seeing how the algorithm performs with them. You could download the code and experiment for yourself, by changing the “features = …” line in the file. But these results are already enough to demonstrate my point: the frequencies of individual letters, without position information, are sufficient to perform as well as the monkeys under a very simple learning model. By looking at position or bigram information, the simple model is far smarter than the monkeys.

So what does this tell us?

I have shown that a simple algorithm can learn to do just as well as the baboons by looking only at the different letters in a word, without regard to their order. Can we conclusively say that baboons don’t consider letter position when “reading”? Not really. Perhaps the baboons are not as smart as our algorithm. Maybe they do look at letter-pairs and letter positions. But what we can conclusively say is that in the specific data presented to the monkeys, the letter frequencies are sufficiently informative to perform at their level, using a simple learning method. This means that while the baboons certainly learned to “read”, it is not clear what kinds of information they used to do so. Every dataset contains patterns to be discovered. Some patterns are informative, others are less so. Some make sense, some less so. These patterns can be picked up by either monkeys or machines, and it is very difficult, by just looking at the performance, to tell which kinds of patterns are being learned.

The main claim of the paper is interesting and valid: one does not need to know a language in order to distinguish real words in that language from some other letter sequences. It can be done by statistical algorithms, and it can be done by monkeys.  But the secondary claim, that this is achieved by the monkeys looking at letter-positions and letter-combinations, is far less convincing.

[1] Actually, the learning algorithm saw a different example sequence than the monkeys. The spreadsheet contains the list of words shown to each monkey, the word category (English or non-English), and the number of times each word was shown, but it does not contain the order in which the words were presented. I created the sequence by building a list in which each word occurs the number of times it was presented to the monkeys, and then randomly shuffling the list. The results are very consistent across different shufflings.
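A sketch of how such a sequence can be built in Python. The row layout (word, category, number of presentations), the function name and the two toy rows are assumptions made for illustration; the actual spreadsheet columns may be arranged differently.

```python
import random

def build_sequence(rows, seed=0):
    """Expand (word, is_english, times_shown) rows into a shuffled example list."""
    sequence = []
    for word, is_english, times_shown in rows:
        # repeat each word as many times as it was presented
        sequence.extend([(word, is_english)] * times_shown)
    # a seeded generator makes each shuffling reproducible
    random.Random(seed).shuffle(sequence)
    return sequence

# Toy rows (made up): TALK shown 3 times, SNUX shown twice.
rows = [("TALK", True, 3), ("SNUX", False, 2)]
print(len(build_sequence(rows)))   # 5
```

Running the learner with several different `seed` values is a simple way to check that the results do not depend on one particular ordering.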