Consonant Class Matching (Turchin, Peiros, Gell-Mann) Page 1

Analyzing Genetic Connections between Languages

by Matching Consonant Classes

Peter Turchin (a), Ilia Peiros (b), Murray Gell-Mann (b, 1)

a Department of Ecology and Evolutionary Biology, University of Connecticut, Storrs,

Connecticut 06269, USA

b Santa Fe Institute, 1399 Hyde Park Road, Santa Fe, New Mexico 87501, USA

1 To whom correspondence should be addressed. E-mail: mgm@santafe.edu

Corresponding author: Murray Gell-Mann, Santa Fe Institute, 1399 Hyde Park Road,

Santa Fe, New Mexico 87501, USA. Tel. 505-946-2745; fax: 505-982-0565. E-mail:

mgm@santafe.edu.

Author contributions: PT, IP, and MG-M designed research; IP provided linguistic data;

PT performed statistical analyses; and PT, IP, and MG-M wrote the paper.

On-line Supporting Information:

http://cliodynamics.info/data/ConsClass_SI.doc

http://cliodynamics.info/data/SuppInfo.xls

Abstract

The idea that the Turkic, Mongolian, Tungusic, Korean, and Japanese

languages are genetically related (the “Altaic hypothesis”) remains

controversial within the linguistic community. In an effort to resolve such

controversies, we propose a simple approach to analyzing genetic

connections between languages. The Consonant Class Matching (CCM)

method uses strict phonological identification and permits no changes in

meanings. This allows us to estimate the probability that the observed

similarities between a pair (or more) of languages occurred by chance alone.

The CCM procedure yields reliable statistical inferences about historical

connections between languages: it classifies languages correctly for well-

known families (Indo-European and Semitic) and does not appear to yield

false positives. The quantitative patterns of similarity that we document for

languages within the Altaic family are similar to those in the non-

controversial Indo-European family. Thus, if the Indo-European family is

accepted as real, the same conclusion should also apply to the Altaic family.

Introduction

Tracing “genetic” relationships between languages is sometimes a source of controversy

in comparative linguistics. For example, within the linguistic community there is not

universal acceptance of the Altaic family, i.e., the idea that the Turkic, Mongolian,

Tungusic, Korean, and Japanese languages are genetically related (share a common

ancestor) (1). Even the recent publication of the Etymological Dictionary of the Altaic

Languages (2) did not put an end to this controversy (3, 4). The critics claim that the

observed similarities can be due either to chance resemblances or to “areal

convergence”—borrowing resulting from cultural contacts (discussion in Ref. (5)).

To demonstrate that languages belong to the same linguistic family it is best to

trace them back to their common ancestor (= proto-language of this family), with known

sound system, grammar, and partial lexicon. In most cases such proto-languages have to

be reconstructed. According to the standard methods of comparative linguistics this can

be done only if potentially related languages preserve a sufficient number of proto-

language morphemes. Through analysis of such morphemes linguists establish a system

of correspondences between the sound systems of daughter languages. For example,

many German words beginning in /c/ (z in orthography) have the same meaning as

English words beginning in /t/ (Zunge:tongue, Zahn:tooth, etc.), while initial German /t/

corresponds to English /d/, as in trinken:drink or trocken:dry. A set of such observations

is used to reconstruct the phonological system of the proto-language and the forms of its

individual morphemes (phonological reconstruction). The meanings of the morphemes

are reconstructed using much less rigorous methods. One problem here is that there can

be a substantial semantic shift between two related words (cognates). An example is

English clean and German klein ‘small’—although these words are known to be

cognates (the original meaning was ‘neat, clean’) they now have rather different

meanings.

So far, proto-languages of only a limited number of language families have been

properly reconstructed, thus demonstrating that the languages forming these families are

related. As proto-languages of most proposed families are yet to be reconstructed,

linguists still lack convincing evidence on possible relationships between languages. To

compensate for the lack of information linguists use a variety of provisional methods

ranging from inspection-based judgments to more formalized lexicostatistics. The

assumption here is that if languages are related they should have lexical morphemes of

common origin having identical meanings from the Swadesh 100-item list (6). Since no

changes in meanings are accepted, semantic connections between the morphemes are

straightforward. Still, phonological identification of relatedness is not based in this case

on a system of correspondences* and therefore is not strict enough, with some similarities

being possibly due to chance.

Here we propose a procedure based on lexicostatistics that does use strict

phonological identification and permits no exceptions. This approach allows us to

estimate the probability that the observed similarities between a pair (or more) of

languages occurred by chance alone (7, 8). By design the proposed method is

“conservative”: we go to great lengths to minimize the possibility of false positives

* Another application of lexicostatistics requires good knowledge of comparative phonology and

etymologies and is used to generate classifications of linguistic families, based on the number of

etymologically identical words shared by each pair of languages studied.

(concluding that languages are related when in fact they are not). Such an approach,

which places a heavy burden of proof on anyone favoring a genetic relationship, is far

from optimal, but we adopt it to avoid polemical controversies while applying our

method to cases such as that of the Altaic family. The method is not a substitute for the

more sophisticated approaches of comparative linguistics. Rather, it provides a procedure

for testing hypotheses of genetic relationships without relying on matters of choice or

judgment.

Methods

Linguistic data

The linguistic data (lexicostatistical lists of individual languages) are taken from a

collection of databases prepared by participants of the Evolution of Human Languages

Project (Santa Fe Institute, USA) and the Tower of Babel Project (Moscow, Russia). We

code each root (= main lexical morpheme) in the 100-word list for each language by

replacing its first two consonants with generic consonant classes, following a suggestion

put forward by Dolgopolsky (14). Table 1 gives the mapping of consonants to the nine

classes. We have performed this procedure for 53 Eurasian and North African languages

(see Supporting Information for the list of languages).

The measure of similarity between two languages is the proportion of roots of the

same meaning whose first two consonant classes match. For example, English nose and

German Nase (both coded NS) are classified as similar while dog and Hund (TK versus

#N) are classified as dissimilar. The German Zunge, coded CN, and English tongue,

coded TN, are also classified as dissimilar, even though they are cognates. Our measure

of similarity misses systematic sound correspondences that cut across our consonant

classes. (In addition, it omits information contained in vowels and in any consonants

other than the first two.)
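
The similarity measure described above can be sketched in a few lines of Python. This is an illustrative sketch rather than the authors' code: the function name `similarity_index` and the three-item word lists are assumptions, and the two-letter codes are simply those quoted in the examples above (the full consonant-to-class mapping of Table 1 is not reproduced here).

```python
def similarity_index(codes_a, codes_b):
    """Proportion of meanings whose two-consonant class codes match.

    codes_a, codes_b: dicts mapping a meaning to a class code such as "NS".
    Only meanings present in both lists are compared.
    """
    shared = [m for m in codes_a if m in codes_b]
    if not shared:
        raise ValueError("no shared meanings to compare")
    matches = sum(codes_a[m] == codes_b[m] for m in shared)
    return matches / len(shared)


# Codes taken from the examples in the text (English vs. German).
english = {"nose": "NS", "dog": "TK", "tongue": "TN"}
german = {"nose": "NS", "dog": "#N", "tongue": "CN"}

si = similarity_index(english, german)
print(f"SI = {si:.2f}")
```

Only nose/Nase counts as a match here; tongue/Zunge is scored as a mismatch even though the words are cognates, which is the conservative behavior the text describes.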

Statistical Analyses

The next step after determining the proportion of matches between two 100-word

lists is to estimate the statistical significance of this result. A naïve approach assumes that

the probability of a match between the first consonants or the second ones is one in nine

(the number of consonant classes) and the probability of both consonants matching is 9⁻²,

or one in eighty-one. With this method we would expect, on average, a bit more than one

match (100/81=1.2) in a list of 100 words. This approach is, however, flawed in several

ways. First, some consonant classes are more common than others, and therefore the

random chance of both consonants matching is, on average, greater than 1:81. Second,

presence of a certain consonant in one position may affect the probability of finding

another consonant in the other position. In other words, the assumption of independence

may not be warranted. Finally, we must deal with such irregularities as missing or

multiple words in some positions.
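
The first flaw can be made concrete with a small calculation. The sketch below compares the naive one-in-nine assumption with a skewed set of class frequencies; the function name `chance_match_probability` and the skewed frequencies are hypothetical and are not taken from the paper's data. For simplicity the skewed case still treats the two consonant positions as independent, which is exactly the second assumption questioned above.

```python
def chance_match_probability(class_freqs):
    """Probability that two independent draws fall in the same class."""
    total = sum(class_freqs)
    return sum((f / total) ** 2 for f in class_freqs)


# Naive assumption: nine equally frequent classes.
uniform = [1] * 9
p_one = chance_match_probability(uniform)   # 1/9 for a single position
p_both = p_one ** 2                         # 1/81 if positions are independent
expected_naive = 100 * p_both               # about 1.2 matches per 100 words

# Hypothetical skewed frequencies (NOT from the paper's data).
skewed = [30, 20, 15, 10, 8, 7, 5, 3, 2]
p_one_skewed = chance_match_probability(skewed)
expected_skewed = 100 * p_one_skewed ** 2

print(f"naive:  {expected_naive:.2f} expected matches")
print(f"skewed: {expected_skewed:.2f} expected matches")
```

Because the sum of squared frequencies is smallest when all classes are equally common, any unevenness raises the chance-match probability above 1:81.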

We use the bootstrap method (15) to estimate the statistical significance of the

observed proportion of matches between word lists of two languages (the Similarity

Index, SI). The procedure works as follows. We randomly select a root from List 1 and

match it with a random root from List 2 (there are two alternative methods of random

selection, see the next paragraph for the explanation). Repeating this step 100 times, we

calculate the “bootstrap SI” (the proportion of matches between two random 100-word

lists). Next, we replicate this procedure many times (e.g., 10,000 iterations) and use the

10,000 bootstrap SIs to approximate the probability distribution of the SI under the null

hypothesis (that any matches are due to chance). Finally, we determine the proportion of

bootstrapped SIs that is equal to or greater than the index calculated for the original lists.

This gives us an estimate of the probability of observing this value (or a larger one) under

the null hypothesis. The smaller this estimated probability, the greater our degree of

belief that the proportion of observed matches could not arise by mere chance.
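
The procedure just described can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the names `si` and `bootstrap_p_value`, the fixed random seed, and the toy word lists are all hypothetical, and the lists are assumed to be already coded, aligned by meaning, and free of missing values.

```python
import random


def si(pairs):
    """Proportion of (code_a, code_b) pairs whose codes are identical."""
    return sum(a == b for a, b in pairs) / len(pairs)


def bootstrap_p_value(list_a, list_b, n_iter=10_000, seed=0):
    """Estimate P(bootstrap SI >= observed SI) under the chance-only null.

    list_a, list_b: class codes aligned by meaning (same length).
    """
    rng = random.Random(seed)
    n = len(list_a)
    observed = si(list(zip(list_a, list_b)))
    hits = 0
    for _ in range(n_iter):
        # Pair a random root from list A with a random root from list B,
        # sampling with replacement, n times.
        sample = [(rng.choice(list_a), rng.choice(list_b)) for _ in range(n)]
        if si(sample) >= observed:
            hits += 1
    return observed, hits / n_iter


# Hypothetical coded lists (20 distinct two-class codes); not real data.
codes = [a + b for a in "NSTKC" for b in "NSTK"]
observed, p = bootstrap_p_value(codes, codes, n_iter=2_000)
print(f"observed SI = {observed:.2f}, estimated P = {p:.4f}")
```

Because the random pairs are drawn from the two lists themselves, the null distribution automatically reflects each language's own class frequencies and any dependence between the first and second consonant.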

There are two ways to perform random selection: with or without replacement. In

the first case (the classic bootstrap) after a word is chosen from the list, and matched with

a word from the other language’s list, the word is put back. In other words, the same

word can be chosen several times (and, therefore, some other words are never chosen).

The alternative procedure (known as the permutation test) is to sample without

replacement, so that each word is selected once. We repeated our analyses using both the

bootstrap and the permutation test and obtained similar results. However, the permutation

test was slightly more permissive (it gave a greater proportion of false positives), and

therefore we report only the bootstrap results. We routinely used 10,000 bootstrap

iterations to construct the probability distribution of the SI, but in cases where all

bootstrapped SIs were smaller than the observed one, we reran the analysis with 1 million

iterations. Thus, P < 10⁻⁶ means that the observed SI was greater than all of the 1 million

bootstrapped SIs.
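
For comparison, the sampling-without-replacement alternative can be sketched by shuffling one list against the other, so that every root is used exactly once per iteration. The function name `permutation_p_value`, the seed, and the toy lists are hypothetical; this is an illustration of the general technique, not the code used for the reported analyses.

```python
import random


def permutation_p_value(list_a, list_b, n_iter=10_000, seed=0):
    """Estimate P(permuted SI >= observed SI), sampling without replacement."""
    rng = random.Random(seed)
    n = len(list_a)
    observed = sum(a == b for a, b in zip(list_a, list_b)) / n
    shuffled = list(list_b)
    hits = 0
    for _ in range(n_iter):
        rng.shuffle(shuffled)  # each root of list B is used exactly once
        perm_si = sum(a == b for a, b in zip(list_a, shuffled)) / n
        if perm_si >= observed:
            hits += 1
    return observed, hits / n_iter


# Hypothetical coded lists (20 distinct two-class codes); not real data.
codes = [a + b for a in "NSTKC" for b in "NSTK"]
rotated = codes[1:] + codes[:1]  # same codes, but no position matches

obs_same, p_same = permutation_p_value(codes, codes, n_iter=2_000)
obs_rot, p_rot = permutation_p_value(codes, rotated, n_iter=2_000)
print(f"identical lists: SI = {obs_same:.2f}, P = {p_same:.4f}")
print(f"rotated lists:   SI = {obs_rot:.2f}, P = {p_rot:.4f}")
```

With either sampling scheme, an estimate of zero from n_iter iterations only shows that P is below roughly 1/n_iter, which is why a larger run is needed before reporting a bound such as one in a million.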

Our approach allows for missing words. Thus, the SI is the number of matches

divided by the number of possible matches (subtracting observations with missing

values). Missing values are handled during the bootstrap in exactly the same manner.

That is, a bootstrapped SI may also have a number less than 100 in the denominator, if

missing values happened to be chosen during the sampling process.
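
The handling of missing words can be illustrated with a short sketch. The names `si_counts` and `bootstrap_counts`, the use of `None` to mark a gap, and the four-item lists (including the code KR) are hypothetical choices made for this example only.

```python
import random


def si_counts(pairs):
    """Return (matches, usable) for pairs, skipping any pair with a gap (None)."""
    usable = [(a, b) for a, b in pairs if a is not None and b is not None]
    matches = sum(a == b for a, b in usable)
    return matches, len(usable)


def bootstrap_counts(list_a, list_b, rng):
    """One bootstrap replicate; gaps drawn by chance shrink the denominator."""
    n = len(list_a)
    sample = [(rng.choice(list_a), rng.choice(list_b)) for _ in range(n)]
    return si_counts(sample)


# Hypothetical four-item lists; None marks a missing word.
list_a = ["NS", "TK", None, "TN"]
list_b = ["NS", "#N", "KR", None]

matches, usable = si_counts(zip(list_a, list_b))
print(f"observed SI = {matches}/{usable}")   # 1/2

rng = random.Random(0)
b_matches, b_usable = bootstrap_counts(list_a, list_b, rng)
print(f"bootstrap SI = {b_matches}/{b_usable}")
```

Returning the counts rather than the ratio keeps the sketch safe in the rare replicate where every sampled pair contains a gap; how such replicates should be treated is a design decision the text does not specify.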

Results

Testing the Method on the Indo-European and Semitic Families

Before tackling the Altaic family, we test how well this Consonant Class Matching

(CCM) method works on the well-studied Indo-European family. We distinguish between

using modern languages for this purpose and using attested or reconstructed ancient

languages. Applying the procedure to 21 modern Indo-European (IE) languages

(additional tables are in Supporting Information) we find that it reliably identifies such

branches as Indic, Slavic, Germanic, and Romance (SIs varying between 45 and 77%, all

statistically significant at P < 10⁻⁶). By contrast, similarity between languages belonging

to different branches is much lower (between 1 and 21%). A particularly interesting

comparison is between Germanic and Indic languages (Table 2). The SIs are very low,

between 1 and 7%. Half of the comparisons are not significant at the 0.05 level, while all

but one of the rest are weakly significant at 0.01 < P < 0.05.

Both the Indic and the Germanic groups reveal themselves beyond any doubt,

while the genetic relation between these two groups is not convincingly demonstrated by

Table 2. We recall that the validity of the IE family was originally established not on the

basis of modern languages but rather by comparing ancient ones, which are much closer

to each other. The results of the CCM method (Table 3a) reflect the greater degree of

similarity (all comparisons are significant at least at the P < 0.02 level, and most at much

higher significance levels). The SI between Old High German and Old Indian, in

particular, is 14%. The probability of this overlap happening by chance is vanishingly

small (< 10⁻⁶). When we apply the CCM approach to several ancient Semitic languages

(Table 3b) we find that SIs for all comparisons are highly significant (P ≪ 10⁻⁶).

The improved resolution obtained with ancient languages is not surprising. The

longer the period since the two languages diverged, the more opportunity there has been

for roots in the 100-item list to “mutate” and become dissimilar (that is, cross into a

different phonetic class) or to be replaced (as a result of a semantic shift). As time passes,

the degree of similarity between any two genetically related languages should eventually

decline to the point where in direct comparison it is indistinguishable from random noise.

However, if we keep applying the procedure of reconstructing proto-languages we may

be able to defeat that phenomenon.

The Indo-European and Semitic families are unusual in that they enjoy such a rich

abundance of attested ancient languages. Does that mean that we cannot investigate

genetic relationships when ancient written sources are lacking? As suggested just above,

one possible approach to this problem is to use reconstructed proto-languages. When we

apply the CCM method to the proto-languages of four IE branches, we obtain the same

pattern as for attested ancient languages (Table 4a). For example, the SI between the

Proto-Iranian and the Proto-Germanic languages is 13%. By contrast, in pairwise

comparisons between five modern Germanic languages (German, English, Dutch,

Icelandic, and Swedish) and two modern Iranian languages (Kurdish and Ossetian) it

ranges between 5 and 10% (average = 7%).

Using reconstructed proto-languages can sometimes yield even better results than

using attested old languages, as is shown in the Iranian–Germanic comparison. The SIs

between Old High German and Avestan or Classical Persian are only 9–10%, whereas the

overlap between Proto-Germanic and Proto-Iranian is 13% (and the statistical

significance of the result increases by several orders of magnitude). This improvement is

at least partially due to the greater age of Proto-Germanic and Proto-Iranian compared

with Old High German and Classical Persian respectively.

It should be mentioned, however, that the main issue is not the age of the

languages, but the degree to which they resemble their proto-languages. Ancient

languages are usually more archaic in this sense, as they retain many features of their

proto-languages, both in phonology and lexicon. At the same time some modern

languages are also quite archaic, for example, Lithuanian. Therefore the role played by

this language in Indo-European studies is similar to that of Ancient Greek, Latin and

other ancient languages. In some cases a proto-language can be only a thousand years

old, but because of its archaic character its relations with other (proto-)languages can be

identified even by the CCM method.

Applying the methodology to the Altaic family

Next, we use the CCM approach to test the reality of the Altaic family. We have four

independent reconstructions (2, 9): Proto-Turkic, Proto-Mongolian, Proto-Tungus, and

Proto-Japanese (Korean dialects are too similar to one another to justify a reconstruction

of Proto-Korean). We also calculated the degree of similarity between these four

languages and Proto-Eskimo, because Mudrak (9, 10) proposed that Eskimo languages

are closely related to the Altaic family. The SIs for the four Altaic proto-languages (Table
