Consonant Class Matching (Turchin, Peiros, Gell-Mann) Page 1
Analyzing Genetic Connections between Languages
by Matching Consonant Classes
Peter Turchin (a), Ilia Peiros (b), Murray Gell-Mann (b, 1)
(a) Department of Ecology and Evolutionary Biology, University of Connecticut, Storrs,
Connecticut 06269, USA
(b) Santa Fe Institute, 1399 Hyde Park Road, Santa Fe, New Mexico 87501, USA
(1) To whom correspondence should be addressed. E-mail: mgm@santafe.edu
Corresponding author: Murray Gell-Mann, Santa Fe Institute, 1399 Hyde Park Road,
Santa Fe, New Mexico 87501, USA. Tel. 505-946-2745; fax: 505-982-0565. E-mail:
mgm@santafe.edu.
Author contributions: PT, IP, and MG-M designed research; IP provided linguistic data;
PT performed statistical analyses; and PT, IP, and MG-M wrote the paper.
On-line Supporting Information:
Abstract
The idea that the Turkic, Mongolian, Tungusic, Korean, and Japanese
languages are genetically related (the “Altaic hypothesis”) remains
controversial within the linguistic community. In an effort to resolve such
controversies, we propose a simple approach to analyzing genetic
connections between languages. The Consonant Class Matching (CCM)
method uses strict phonological identification and permits no changes in
meanings. This allows us to estimate the probability that the observed
similarities between a pair (or more) of languages occurred by chance alone.
The CCM procedure yields reliable statistical inferences about historical
connections between languages: it classifies languages correctly for well-
known families (Indo-European and Semitic) and does not appear to yield
false positives. The quantitative patterns of similarity that we document for
languages within the Altaic family are similar to those in the non-
controversial Indo-European family. Thus, if the Indo-European family is
accepted as real, the same conclusion should also apply to the Altaic family.
Introduction
Tracing “genetic” relationships between languages is sometimes a source of controversy
in comparative linguistics. For example, within the linguistic community there is not
universal acceptance of the Altaic family, i.e., the idea that the Turkic, Mongolian,
Tungusic, Korean, and Japanese languages are genetically related (share a common
ancestor) (1). Even the recent publication of the Etymological Dictionary of the Altaic
Languages (2) did not put an end to this controversy (3, 4). The critics claim that the
observed similarities can be due either to chance resemblances or to “areal
convergence”—borrowing resulting from cultural contacts (discussion in Ref. (5)).
To demonstrate that languages belong to the same linguistic family it is best to
trace them back to their common ancestor (the proto-language of the family), with a known
sound system, grammar, and partial lexicon. In most cases such proto-languages have to
be reconstructed. According to the standard methods of comparative linguistics this can
be done only if potentially related languages preserve a sufficient number of proto-
language morphemes. Through analysis of such morphemes linguists establish a system
of correspondences between the sound systems of daughter languages. For example,
many German words beginning in /c/ (z in orthography) have the same meaning as
English words beginning in /t/ (Zunge:tongue, Zahn:tooth, etc.), while initial German /t/
corresponds to English /d/, as in trinken:drink or trocken:dry. A set of such observations
is used to reconstruct the phonological system of the proto-language and the forms of its
individual morphemes (phonological reconstruction). The meanings of the morphemes
are reconstructed using much less rigorous methods. One problem here is that there can
be a substantial semantic shift between two related words (cognates). An example is
English clean and German klein ‘small’—although these words are known to be
cognates (the original meaning was ‘neat, clean’) they now have rather different
meanings.
So far, proto-languages of only a limited number of language families have been
properly reconstructed, thus demonstrating that the languages forming these families are
related. As proto-languages of most proposed families are yet to be reconstructed,
linguists still lack convincing evidence on possible relationships between languages. To
compensate for the lack of information linguists use a variety of provisional methods
ranging from inspection-based judgments to more formalized lexicostatistics. The
assumption here is that if languages are related they should have lexical morphemes of
common origin having identical meanings from the Swadesh 100-item list (6). Since no
changes in meanings are accepted, semantic connections between the morphemes are
straightforward. Still, phonological identification of relatedness is not based in this case
on a system of correspondences* and therefore is not strict enough, with some similarities
being possibly due to chance.
Here we propose a procedure based on lexicostatistics that does use strict
phonological identification and permits no exceptions. This approach allows us to
estimate the probability that the observed similarities between a pair (or more) of
languages occurred by chance alone (7, 8). By design the proposed method is
“conservative”: we go to great lengths to minimize the possibility of false positives
(concluding that languages are related when in fact they are not). Such an approach,
which places a heavy burden of proof on anyone favoring a genetic relationship, is far
from optimal, but we adopt it to avoid polemical controversies while applying our
method to cases such as that of the Altaic family. The method is not a substitute for the
more sophisticated approaches of comparative linguistics. Rather, it provides a procedure
for testing hypotheses of genetic relationships without relying on matters of choice or
judgment.
* Another application of lexicostatistics requires good knowledge of comparative phonology and
etymologies and is used to generate classifications of linguistic families, based on the number of
etymologically identical words shared by each pair of languages studied.
Methods
Linguistic data
The linguistic data (lexicostatistical lists of individual languages) are taken from a
collection of databases prepared by participants of the Evolution of Human Languages
Project (Santa Fe Institute, USA) and the Tower of Babel Project (Moscow, Russia). We
code each root (= main lexical morpheme) in the 100-word list for each language by
replacing its first two consonants with generic consonant classes, following a suggestion
put forward by Dolgopolsky (14). Table 1 gives the mapping of consonants to the nine
classes. We have performed this procedure for 53 Eurasian and North African languages
(see Supporting Information for the list of languages).
The measure of similarity between two languages is the proportion of roots of the
same meaning whose first two consonant classes match. For example, English nose and
German Nase (both coded NS) are classified as similar while dog and Hund (TK versus
#N) are classified as dissimilar. The German Zunge, coded CN, and English tongue,
coded TN, are also classified as dissimilar, even though they are cognates. Our measure
of similarity misses systematic sound correspondences that cut across our consonant
classes. (In addition, it omits information contained in vowels and in any consonants
other than the first two.)
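The similarity measure can be sketched in a few lines of Python. This is a minimal illustration, not the authors' implementation: it assumes the roots have already been coded into two-letter consonant-class strings (the mapping of Table 1 is not reproduced here), and the function name similarity_index is hypothetical. The three word pairs are the ones quoted above.

```python
def similarity_index(codes_a, codes_b):
    """Proportion of same-meaning roots whose two-class codes are identical.

    Both lists must be aligned by meaning (item i in each list has the
    same gloss). Codes are assumed to be precomputed class strings.
    """
    if len(codes_a) != len(codes_b):
        raise ValueError("word lists must be aligned by meaning")
    matches = sum(a == b for a, b in zip(codes_a, codes_b))
    return matches / len(codes_a)


# Codes quoted in the text: nose/Nase, dog/Hund, tongue/Zunge.
english = ["NS", "TK", "TN"]
german = ["NS", "#N", "CN"]
si = similarity_index(english, german)  # only nose/Nase matches
```

Note that tongue/Zunge counts as a mismatch here even though the words are cognates, which is the conservative behavior described above.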
Statistical Analyses
The next step after determining the proportion of matches between two 100-word
lists is to estimate the statistical significance of this result. A naïve approach assumes that
the probability of a match between the first consonants or the second ones is one in nine
(the number of consonant classes) and the probability of both consonants matching is 9^-2,
or one in eighty-one. With this method we would expect, on average, a bit more than one
match (100/81 ≈ 1.2) in a list of 100 words. This approach is, however, flawed in several
ways. First, some consonant classes are more common than others, and therefore the
random chance of both consonants matching is, on average, greater than 1:81. Second,
presence of a certain consonant in one position may affect the probability of finding
another consonant in the other position. In other words, the assumption of independence
may not be warranted. Finally, we must deal with such irregularities as missing or
multiple words in some positions.
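The first objection can be illustrated numerically. The sketch below compares the naive uniform case with an unevenly distributed set of class frequencies. The skewed frequencies are invented for illustration only and are not taken from our data, and the calculation still assumes independence between the two positions, which the second objection calls into question.

```python
def chance_match_probability(freqs):
    """Probability that two independent draws from `freqs` give the same class."""
    total = sum(freqs)
    return sum((f / total) ** 2 for f in freqs)


uniform = [1] * 9                           # nine equally common classes
skewed = [30, 20, 15, 10, 8, 7, 5, 3, 2]    # hypothetical, illustration only

# Probability that both consonant positions match, assuming independence.
p_uniform_pair = chance_match_probability(uniform) ** 2   # equals 1/81
p_skewed_pair = chance_match_probability(skewed) ** 2

expected_uniform = 100 * p_uniform_pair     # about 1.2 matches per 100 words
expected_skewed = 100 * p_skewed_pair       # larger than the naive figure
```

Any departure from equal frequencies raises the chance of a match, because the sum of squared probabilities is smallest when all classes are equally likely.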
We use the bootstrap method (15) to estimate the statistical significance of the
observed proportion of matches between word lists of two languages (the Similarity
Index, SI). The procedure works as follows. We randomly select a root from List 1 and
match it with a random root from List 2 (there are two alternative methods of random
selection, see the next paragraph for the explanation). Repeating this step 100 times, we
calculate the “bootstrap SI” (the proportion of matches between two random 100-word
lists). Next, we replicate this procedure many times (e.g., 10,000 iterations) and use the
10,000 bootstrap SIs to approximate the probability distribution of the SI under the null
hypothesis (that any matches are due to chance). Finally, we determine the proportion of
bootstrapped SIs that is equal to or greater than the index calculated for the original lists.
This gives us an estimate of the probability of observing this value (or a larger one) under
the null hypothesis. The smaller this estimated probability, the greater our degree of
belief that the proportion of observed matches could not arise by mere chance.
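The procedure just described can be sketched as follows. This is an illustrative reading of the steps above rather than the code used for the analyses: the function name, the seed parameter, and the toy word lists are assumptions, and roots are again taken to be precoded class strings.

```python
import random


def bootstrap_p_value(list1, list2, observed_si,
                      n_iter=10_000, n_words=100, seed=0):
    """Estimate P(bootstrap SI >= observed SI) under the null hypothesis.

    Each bootstrap list pairs `n_words` randomly chosen roots from
    list1 with randomly chosen roots from list2 (with replacement).
    """
    rng = random.Random(seed)
    at_least = 0
    for _ in range(n_iter):
        matches = sum(rng.choice(list1) == rng.choice(list2)
                      for _ in range(n_words))
        if matches / n_words >= observed_si:
            at_least += 1
    return at_least / n_iter


# Toy lists (hypothetical codes), with a reduced number of iterations.
toy1 = ["NS", "TK", "TN", "KR", "PT"]
toy2 = ["NS", "#N", "CN", "KR", "MN"]
p = bootstrap_p_value(toy1, toy2, observed_si=0.4, n_iter=1000)
```

A returned value of 0 means only that no bootstrap SI reached the observed one in the iterations performed, so it is better reported as P < 1/n_iter than as zero.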
There are two ways to perform random selection: with or without replacement. In
the first case (the classic bootstrap) after a word is chosen from the list, and matched with
a word from the other language’s list, the word is put back. In other words, the same
word can be chosen several times (and, therefore, some other words are never chosen).
The alternative procedure (known as the permutation test) is to sample without
replacement, so that each word is selected once. We repeated our analyses using both the
bootstrap and the permutation test and obtained similar results. However, the permutation
test was slightly more permissive (it gave a greater proportion of false positives), and
therefore we report only the bootstrap results. We routinely used 10,000 bootstrap
iterations to construct the probability distribution of the SI, but in cases where all
bootstrapped SIs were smaller than the observed one, we reran the analysis with 1 million
iterations. Thus, P < 10^-6 means that the observed SI was greater than all of 1 million
bootstrapped SIs.
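For comparison, the permutation alternative (sampling without replacement) can be sketched as below. This is a hedged illustration under the same assumptions as before (precoded class strings, hypothetical function name and seed parameter). It shuffles one list against the other so that every root is used exactly once per iteration.

```python
import random


def permutation_p_value(list1, list2, n_iter=10_000, seed=0):
    """Estimate P(permuted matches >= observed matches) under the null.

    list2 is shuffled on every iteration, so each root is used exactly
    once (sampling without replacement).
    """
    if len(list1) != len(list2):
        raise ValueError("word lists must be aligned by meaning")
    rng = random.Random(seed)
    observed = sum(a == b for a, b in zip(list1, list2))
    shuffled = list(list2)  # copy, so the caller's list is not modified
    at_least = 0
    for _ in range(n_iter):
        rng.shuffle(shuffled)
        matches = sum(a == b for a, b in zip(list1, shuffled))
        if matches >= observed:
            at_least += 1
    return at_least / n_iter
```

Because the permutation keeps the class frequencies of each list fixed, its null distribution has somewhat less spread than the bootstrap's, which is one plausible reading of why it was slightly more permissive in our analyses.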
Our approach allows for missing words. Thus, the SI is the number of matches
divided by the number of possible matches (subtracting observations with missing
values). Missing values are handled during the bootstrap in exactly the same manner.
That is, a bootstrapped SI may also have a number less than 100 in the denominator, if
missing values happened to be chosen during the sampling process.
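The treatment of missing words can be sketched in the same spirit. Here a missing root is represented by None, which is a convention adopted for this illustration only. The function names are hypothetical, and the exact bookkeeping in the original analyses may differ.

```python
import random


def si_with_missing(codes_a, codes_b):
    """SI computed over the pairs in which both roots are present."""
    pairs = [(a, b) for a, b in zip(codes_a, codes_b)
             if a is not None and b is not None]
    if not pairs:
        return None  # no comparable pairs, SI is undefined
    return sum(a == b for a, b in pairs) / len(pairs)


def bootstrap_si_with_missing(list1, list2, rng, n_words=100):
    """One bootstrap SI; sampled missing values shrink the denominator."""
    sample_a = [rng.choice(list1) for _ in range(n_words)]
    sample_b = [rng.choice(list2) for _ in range(n_words)]
    return si_with_missing(sample_a, sample_b)


# Four meanings, two of them missing in one language or the other.
lang_a = ["NS", None, "TK", "TN"]
lang_b = ["NS", "TK", None, "CN"]
si = si_with_missing(lang_a, lang_b)  # 1 match out of 2 comparable pairs
```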
Results
Testing the Method on the Indo-European and Semitic Families
Before tackling the Altaic family, we test how well this Consonant Class Matching
(CCM) method works on the well-studied Indo-European family. We distinguish between
using modern languages for this purpose and using attested or reconstructed ancient
languages. Applying the procedure to 21 modern Indo-European (IE) languages
(additional tables are in Supporting Information) we find that it reliably identifies such
branches as Indic, Slavic, Germanic, and Romance (SIs varying between 45 and 77%, all
statistically significant at P < 10^-6). By contrast, similarity between languages belonging
to different branches is much lower (between 1 and 21%). A particularly interesting
comparison is between Germanic and Indic languages (Table 2). The SIs are very low,
between 1 and 7%. Half of the comparisons are not significant at the 0.05 level, while all
but one of the rest are weakly significant at 0.01 < P < 0.05.
Both the Indic and the Germanic groups reveal themselves beyond any doubt,
while the genetic relation between these two groups is not convincingly demonstrated by
Table 2. We recall that the validity of the IE family was originally established not on the
basis of modern languages but rather by comparing ancient ones, which are much closer
to each other. The results of the CCM method (Table 3a) reflect the greater degree of
similarity (all comparisons are significant at least at the P < 0.02 level, and most at much
higher significance levels). The SI between Old High German and Old Indian, in
particular, is 14%. The probability of this overlap happening by chance is vanishingly
small (< 10^-6). When we apply the CCM approach to several ancient Semitic languages
(Table 3b) we find that SIs for all comparisons are highly significant (P << 10^-6).
The improved resolution obtained with ancient languages is not surprising. The
longer the period since the two languages diverged, the more opportunity there has been
for roots in the 100-item list to “mutate” and become dissimilar (that is, cross into a
different phonetic class) or to be replaced (as a result of a semantic shift). As time passes,
the degree of similarity between any two genetically related languages should eventually
decline to the point where in direct comparison it is indistinguishable from random noise.
However, if we keep applying the procedure of reconstructing proto-languages we may
be able to defeat that phenomenon.
The Indo-European and Semitic families are unusual in that they enjoy such a rich
abundance of attested ancient languages. Does that mean that we cannot investigate
genetic relationships when ancient written sources are lacking? As suggested just above,
one possible approach to this problem is to use reconstructed proto-languages. When we
apply the CCM method to the proto-languages of four IE branches, we obtain the same
pattern as for attested ancient languages (Table 4a). For example, the SI between the
Proto-Iranian and the Proto-Germanic languages is 13%. By contrast, in pairwise
comparisons between five modern Germanic languages (German, English, Dutch,
Icelandic, and Swedish) and two modern Iranian languages (Kurdish and Ossetian) it
ranges between 5 and 10% (average = 7%).
Using reconstructed proto-languages can sometimes yield even better results than
using attested old languages, as is shown in the Iranian–Germanic comparison. The SIs
between Old High German and Avestan or Classical Persian are only 9–10%, whereas the
overlap between Proto-Germanic and Proto-Iranian is 13% (and the statistical
significance of the result increases by several orders of magnitude). This improvement is
at least partially due to the greater age of Proto-Germanic and Proto-Iranian compared
with Old High German and Classical Persian respectively.
It should be mentioned, however, that the main issue is not the age of the
languages, but the degree to which they resemble their proto-languages. Ancient
languages are usually more archaic in this sense, as they retain many features of their
proto-languages, both in phonology and lexicon. At the same time some modern
languages are also quite archaic, for example, Lithuanian. Therefore the role played by
this language in Indo-European studies is similar to that of Ancient Greek, Latin and
other ancient languages. In some cases a proto-language can be only a thousand years
old, but because of its archaic character its relations with other (proto-)languages can be
identified even by the CCM method.
Applying the methodology to the Altaic family
Next, we use the CCM approach to test the reality of the Altaic family. We have four
independent reconstructions (2, 9): Proto-Turkic, Proto-Mongolian, Proto-Tungus, and
Proto-Japanese (Korean dialects are too similar to one another to justify a reconstruction
of Proto-Korean). We also calculated the degree of similarity between these four
languages and Proto-Eskimo, because Mudrak (9, 10) proposed that Eskimo languages
are closely related to the Altaic family. The SIs for the four Altaic proto-languages (Table