Glottobank

Glottobank is an international research consortium established to document and understand the world’s linguistic diversity. Glottobank team members are pursuing this goal on two fronts. First, we have established five global databases documenting variation in language structure (Grambank), lexicon (Lexibank), paradigm systems (Parabank), numerals (Numeralbank), and phonetic changes (Phonobank). In doing so, we seek to develop new methods in language documentation, compile data on the world’s languages and make this data accessible and useful. Second, we are developing methods to use this data to make inferences about human prehistory, relationships between languages and processes of language change. We anticipate that data will start to become available in 2022.

Grambank

Grambank is a database of structural (typological) features of language. It consists of 195 logically independent features (most of them binary) spanning all subdomains of morphosyntax. The Grambank feature questionnaire has been filled in, based on reference grammars, for more than 2,000 languages. The aim is to eventually reach as many as 3,000 languages. The database can be used to investigate language prehistory, the geographical distribution of features, language universals and the functional interaction of structural features.
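
As a rough illustration of how mostly binary feature codings can be compared, the sketch below computes the proportion of differing values between two languages over the features coded for both. The feature IDs, the values, and the use of None for "not coded" are assumptions made for this example, not the actual Grambank format.

```python
from typing import Dict, Optional

# Hypothetical codings: 1 = present, 0 = absent, None = not coded.
Coding = Dict[str, Optional[int]]


def feature_distance(a: Coding, b: Coding) -> Optional[float]:
    """Proportion of differing values over features coded in both languages."""
    shared = [
        f for f in a
        if f in b and a[f] is not None and b[f] is not None
    ]
    if not shared:
        return None  # no overlap, so no distance can be computed
    differing = sum(1 for f in shared if a[f] != b[f])
    return differing / len(shared)


# Invented toy codings for two unnamed languages.
lang_a = {"F001": 1, "F002": 0, "F003": None, "F004": 1}
lang_b = {"F001": 1, "F002": 1, "F003": 0, "F004": 0}

# F003 is skipped because it is not coded for lang_a;
# two of the three remaining features differ.
print(feature_distance(lang_a, lang_b))
```

Ignoring features that are uncoded in either language keeps gaps in the source grammars from being counted as differences.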

Lexibank

Lexibank is a public database and repository for lexical data from the languages of the world. Currently, Lexibank contains lexemes and cognate judgments from about 2,500 languages spanning Africa, Europe, Asia, the Pacific, and the Americas. The database will be used to refine cognate judgments, infer language relationships, construct language phylogenies, test hypotheses about language history, investigate factors that affect the mode and tempo of language evolution, model sound change, and facilitate quantitative comparisons with other types of linguistic data. The initial focus of Lexibank will be on compiling basic or core vocabulary, but ultimately the database will be expanded to cover the full range of the lexicon in all the world’s languages.

Parabank

Parabank is a large database of selected paradigmatic structures found in the world’s languages, focusing on the patterning of formal similarities and identities (or syncretisms) between cells in these paradigms (cf. I vs. me but you vs. you). It is motivated by the observation that different languages and language families have significantly different patterns in their syncretisms and that at least some of these are stable through time. In addition, information arranged in matrices gains additional power because of the large number of values that can be calculated by comparing every cell with every other cell.
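
The cell-by-cell comparison can be sketched as follows. This is a minimal illustration using the English pronoun example above; the cell labels and the dictionary layout are assumptions made for the example, not the Parabank data model.

```python
from itertools import combinations


def syncretisms(paradigm):
    """Return the pairs of cells whose forms are identical."""
    return [
        (cell_1, cell_2)
        for (cell_1, form_1), (cell_2, form_2) in combinations(paradigm.items(), 2)
        if form_1 == form_2
    ]


# Hypothetical encoding: (person/number, role) -> form.
english = {
    ("1sg", "subject"): "I",
    ("1sg", "object"): "me",
    ("2sg", "subject"): "you",
    ("2sg", "object"): "you",
}

n_cells = len(english)
# Every cell is compared with every other cell: n * (n - 1) / 2 comparisons.
n_comparisons = n_cells * (n_cells - 1) // 2
pairs = syncretisms(english)

print(n_comparisons)
print(pairs)
```

Even this four-cell paradigm yields six comparisons, and the count grows quadratically with the number of cells, which is the source of the additional power mentioned above.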

Because the paradigms we explore are ubiquitous across the world’s languages, our working hypothesis is that paradigmatic syncretisms can provide a significant signal of linguistic relationships through time, and the database is designed to allow the systematic exploration of morphosyntactic features by linguistic typologists and evolutionary biologists. Additionally, Parabank will be an important resource for identifying and quantifying some of the important mechanisms by which the design space of language evolves. Initially, the database will assemble paradigms of free pronouns, verb agreement, and a subset of kin terms, with subsequent plans to incorporate demonstratives, interrogatives, indefinite pronouns, negative pronouns, numeral systems, and other promising linguistic subsystems with paradigmatic structure.

Parabank will be led by Nick Evans, Simon Greenhill and Kyla Quinn, all based at the Australian Research Council Centre of Excellence for the Dynamics of Language (CoEDL), at the Australian National University (ANU), but welcomes the participation of any interested researcher. Funding will primarily come from the CoEDL.

Numeralbank

Numeralbank is a public database and repository on numeral systems in the world’s languages. It is motivated by the idea that number words do not just form an important part of most languages, but constitute systems that serve as essential tools at the intersection of culture, language, and cognition. Numeralbank can be used to classify numeral systems according to their properties, to document the geographical distribution of system types, to investigate commonalities and differences in system properties across languages, to reconstruct the most likely ancestral states, and to explore possible limits to and constraints on the striking diversity in how people count. Initially, the database will allow for analyses within and across systems, but the ultimate goal is to support tests of hypotheses on linguistic, cognitive, and cultural factors that may drive the emergence and evolution of numeral systems.

Entries in Numeralbank are largely based on data collected by Eugene Chan as part of the long-running project "Numeral Systems of the World's Languages" that was hosted at the former Department of Linguistics at the MPI for Evolutionary Anthropology in Leipzig. The data is now hosted at the Department of Cultural and Linguistic Evolution at the MPI for Evolutionary Anthropology in Leipzig. The Numeralbank database is designed and maintained by Hans-Jörg Bibiko. The Numeralbank team consists of (in alphabetical order) Andrea Bender, Hans-Jörg Bibiko, Robert Forkel, Simon Greenhill, Russell Gray, Harald Hammarström, Fiona Jordan, and Annemarie Verkerk.

Phonobank

Phonobank aims to establish a cross-linguistic comparative database of sound patterns, sound correspondences, and sound shifts. Our starting point is collections of multiple phonetic alignments of cognate sets in language families. All sounds are linked to a cross-linguistic phonetic alphabet that provides distinctive features and segment descriptions. The ultimate goals of the database are to support the computational linguistic comparison of word forms and to serve as a basis for improving methods of computer-assisted cognate detection, sound reconstruction, and the building of linguistic phylogenies from sound correspondences.
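
To illustrate how sound correspondences can be read off aligned cognate sets, the sketch below counts, for each pair of languages, how often two segments share an alignment column. The language labels, the forms, and the data layout are invented for the example and do not reflect the Phonobank format.

```python
from collections import Counter
from itertools import combinations


def count_correspondences(alignments):
    """Count segment pairs that share an alignment column, per language pair."""
    counts = Counter()
    for alignment in alignments:
        langs = sorted(alignment)
        width = len(alignment[langs[0]])
        # Every row of one alignment must have the same number of columns.
        if any(len(alignment[lang]) != width for lang in langs):
            raise ValueError("rows of an alignment must have equal length")
        for lang_1, lang_2 in combinations(langs, 2):
            for seg_1, seg_2 in zip(alignment[lang_1], alignment[lang_2]):
                counts[(lang_1, lang_2, seg_1, seg_2)] += 1
    return counts


# Invented toy alignments for two hypothetical languages A and B;
# "-" marks a gap.
alignments = [
    {"A": ["p", "a", "t", "-"], "B": ["f", "a", "d", "a"]},
    {"A": ["p", "i"], "B": ["f", "i"]},
]

counts = count_correspondences(alignments)
print(counts[("A", "B", "p", "f")])
```

A correspondence that recurs across many cognate sets, such as p : f in the toy data, is the kind of pattern that reconstruction and cognate detection methods build on.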

Methods and Tools

The Glottobank team is developing a suite of methods and tools for analysing comparative linguistic data. For example, using the BEAST 2 software platform, we have created a Bayesian framework for phylogeographic inference of language expansion in space and time. BEASTling is a program designed to help linguists easily prepare Bayesian phylogenetic analyses of linguistic data using the BEAST 2 platform. It automates many tedious data-preparation tasks, features close integration with the Glottolog language catalog, and strives to follow established best practices for computational linguistic phylogenetics. LingPy is a Python library for quantitative tasks in historical linguistics. It offers state-of-the-art algorithms for pairwise and multiple phonetic alignment analyses, automatic cognate detection, and various tools to explore and curate lexical data. Finally, CLDF and associated standards are aimed at providing an interface between databases and tools which will enable easier sharing of data and code.
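
For readers unfamiliar with pairwise alignment, the sketch below implements the textbook Needleman-Wunsch global alignment on lists of segments. It is not LingPy's algorithm or API: the function name, the scoring values, and the simple match/mismatch scheme are assumptions chosen for brevity, whereas dedicated tools use phonetically informed scoring.

```python
def align(seq_a, seq_b, match=1, mismatch=-1, gap=-1):
    """Globally align two segment lists (Needleman-Wunsch)."""
    n, m = len(seq_a), len(seq_b)
    # score[i][j] = best score for aligning seq_a[:i] with seq_b[:j]
    score = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        score[i][0] = i * gap
    for j in range(1, m + 1):
        score[0][j] = j * gap
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            sub = match if seq_a[i - 1] == seq_b[j - 1] else mismatch
            score[i][j] = max(
                score[i - 1][j - 1] + sub,  # align the two segments
                score[i - 1][j] + gap,      # gap in seq_b
                score[i][j - 1] + gap,      # gap in seq_a
            )
    # Trace back from the bottom-right corner to recover one best alignment.
    out_a, out_b = [], []
    i, j = n, m
    while i > 0 or j > 0:
        if i > 0 and j > 0:
            sub = match if seq_a[i - 1] == seq_b[j - 1] else mismatch
            if score[i][j] == score[i - 1][j - 1] + sub:
                out_a.append(seq_a[i - 1])
                out_b.append(seq_b[j - 1])
                i, j = i - 1, j - 1
                continue
        if i > 0 and score[i][j] == score[i - 1][j] + gap:
            out_a.append(seq_a[i - 1])
            out_b.append("-")
            i -= 1
        else:
            out_a.append("-")
            out_b.append(seq_b[j - 1])
            j -= 1
    return out_a[::-1], out_b[::-1], score[n][m]


# Invented toy forms, given as lists of segments.
row_a, row_b, total = align(["p", "a", "t"], ["p", "t"])
print(row_a, row_b, total)
```

Working on lists of segments rather than raw strings means that multi-character segments can be aligned as single units.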

Funding

In addition to the time and energy of members of the consortium, Glottobank is supported by the Max Planck Institute for the Science of Human History, a Royal Society of New Zealand Marsden Grant (grant #13-UOA-121) and the ARC Centre of Excellence for the Dynamics of Language.