How to build a cross-linguistic database

One of the best things about being a linguist is that language data is all around us. We interact with it in so many ways every day: listening to podcasts, talking to housemates, and even shopping at grocery stores. It’s easy to start paying closer attention to the language we hear and see, but how can we learn from all this data to better understand how language works? In this post, I’m going to focus on one aspect of this: understanding how other languages work by building a cross-linguistic database.

To do this, I’ll talk about a specific example, a certain kind of agreement with nouns (which I call nominal concord, but I’ll just call it agreement with nouns in this post). The English words this/that change their form based on whether the noun they modify is singular or plural: e.g., these bananas vs. this banana. Other languages do similar things. Many people who have studied a Romance language like Spanish remember that articles and adjectives (among other things) have to match the gender of the noun they modify: e.g., la casa blanca ‘the white house’ vs. el edificio blanco ‘the white building’. How do we figure out the properties of this process so that we can use it to understand language better?

A map showing the presence of concord (i.e., agreement with nouns) in the world’s languages. Data from Norris (2019), map created in R with the lingtypology package.

Collecting language examples for the database

Since you’re already swimming in data, you could approach data collection in a very grassroots way. You could travel, taking pictures of language you see in the world, or you could take screenshots of other languages you encounter on blogs or social media. But this can take a long time when you’re looking for data that shows you a specific thing. To develop the database more rigorously, we can turn to more formal sources. We can use grammars, which are book-length reference guides to the properties of a language. Some of these are freely available as books (e.g., from Language Science Press or Pacific Linguistics) or as PhD dissertations (especially for low-resourced languages). Some language data is already compiled and available in online databases (like the World Atlas of Language Structures or the Universal Dependencies Corpus) or established corpora (like the Corpus of Contemporary American English or, for something completely different, the Estonian corpora available at keelveeb.ee).

There is an important aside here: if we want to understand how language in general works, we have to make sure we’re not looking too closely at one particular language or language family. The languages that most people in North America and Western Europe are familiar with are Indo-European languages. Because these languages are so familiar, there is a reflexive tendency to view their properties as normal or common. To go back to the example of agreement with nouns, we might think that it’s normal for articles and adjectives to agree with nouns, because that’s what they do in Spanish. It’s important to remember that we don’t actually know if that’s true! The only way to know for sure is to build a database that does not overrepresent any one language or language family.
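One simple way to keep an eye on overrepresentation is to count how many languages each family contributes and, if needed, cap that number. The following is a minimal sketch, assuming each entry records a language name and its family. The tiny list and the function names are illustrative only, and capping per family is just one possible balancing strategy.

```python
from collections import Counter, defaultdict

# Illustrative entries: (language, family). A real sample would be much larger.
LANGUAGES = [
    ("Spanish", "Indo-European"),
    ("English", "Indo-European"),
    ("Estonian", "Uralic"),
    ("Finnish", "Uralic"),
    ("Pondi", "Ulmapo"),
]


def family_counts(languages):
    """Count how many languages each family contributes to the sample."""
    return Counter(family for _, family in languages)


def cap_per_family(languages, max_per_family):
    """Keep at most max_per_family languages from each family, in input order."""
    kept = []
    seen = defaultdict(int)
    for name, family in languages:
        if seen[family] < max_per_family:
            kept.append((name, family))
            seen[family] += 1
    return kept


print(family_counts(LANGUAGES))
print(cap_per_family(LANGUAGES, max_per_family=1))
```

Printing the counts first makes any imbalance visible before deciding whether (and how strictly) to cap.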

Selecting a feature/tag set

Now that we’ve talked about sources you can use for your data, the next thing to determine is the set of features or tags that you’ll use to structure the data you collect. Naturally, you will want to focus on features that seem relevant for understanding whatever you’re looking at. If you were looking at agreement with nouns, you could catalog which words agree with nouns and what properties they agree with.

So, after collecting your data and storing it (e.g., in a Google Doc), you would also record what words agree with nouns in that language and what properties were relevant for the agreement. There are many features of a language that are likely irrelevant for what you’re looking at. You might need to do some exploratory analysis first to get the lay of the land before you decide on your feature set.
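Once a tag set is settled, it helps to write it down somewhere a script can check it, so that typos in tags do not silently split one category into two. Here is a small sketch, assuming a hypothetical tag set (four word classes, three features) and a hypothetical record layout; neither is a standard, so adjust both to your own question.

```python
# Hypothetical tag set: the word classes to check and the noun properties
# they might agree with.
WORD_CLASSES = {"article", "demonstrative", "numeral", "adjective"}
FEATURES = {"gender", "number", "case"}


def validate_record(record):
    """Return a list of problems found in one language record (empty if none)."""
    problems = []
    for word_class, features in record["concord"].items():
        if word_class not in WORD_CLASSES:
            problems.append(f"unknown word class: {word_class}")
        for feature in features:
            if feature not in FEATURES:
                problems.append(f"unknown feature: {feature} ({word_class})")
    return problems


# Spanish articles and adjectives agree in gender and number (la casa blanca).
spanish = {
    "name": "Spanish",
    "concord": {
        "article": ["gender", "number"],
        "adjective": ["gender", "number"],
    },
}

# A record with two typos, to show what the check reports.
typo_record = {"name": "Example", "concord": {"adjectiv": ["gendre"]}}

print(validate_record(spanish))      # → []
print(validate_record(typo_record))
```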

Managing the data and features

I mentioned that you might choose to store the examples you collect in a Google Doc. What about the features? There are several options for managing databases of this type. A non-exhaustive list of options:

  1. Spreadsheet (e.g., Excel/Sheets): easy to read; allows sorting “by hand”
  2. CSV: a bit harder to read, but can be easily opened by a spreadsheet program or fed into R or Python for more sophisticated computational or statistical analysis
  3. JSON: easier to read with human eyes (in my opinion); easily read by all sorts of programming languages

Saving the data in a format that can be easily read by something like Python is a good investment. You can write scripts to read in all the data you’ve collected and then tabulate any numbers you find relevant. The approach I use is to save each language as an individual JSON file so that updating the database is simple: all I have to do is add the new JSON file(s) to the proper directory. Then I can run the scripts I’ve written to see how the database has changed.
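To make the one-file-per-language idea concrete, here is a minimal sketch of a script that reads every JSON file in a directory and tallies which word classes agree with nouns. The record layout (a "name" plus a "concord" mapping from word class to a list of features) and the function names are assumptions made for this example, not the actual schema or scripts behind the database. The demo writes two toy files to a temporary directory so that it runs anywhere.

```python
import json
import tempfile
from collections import Counter
from pathlib import Path


def load_languages(directory):
    """Read every *.json file in directory into a list of dicts."""
    return [
        json.loads(path.read_text(encoding="utf-8"))
        for path in sorted(Path(directory).glob("*.json"))
    ]


def count_agreeing_classes(languages):
    """Count, per word class, how many languages show agreement on it."""
    counts = Counter()
    for language in languages:
        for word_class, features in language["concord"].items():
            if features:  # a non-empty feature list means the class agrees
                counts[word_class] += 1
    return counts


# Demo: two toy records, one file per language.
toy = [
    {"name": "Spanish", "concord": {"adjective": ["gender", "number"]}},
    {"name": "English", "concord": {"demonstrative": ["number"], "adjective": []}},
]
with tempfile.TemporaryDirectory() as tmp:
    for entry in toy:
        path = Path(tmp) / f"{entry['name'].lower()}.json"
        path.write_text(json.dumps(entry), encoding="utf-8")
    languages = load_languages(tmp)

print(count_agreeing_classes(languages))
```

Adding a language then amounts to dropping one more file into the directory and rerunning the script.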

A piece of a JSON file containing information about agreement with nouns in Finnish.

I know I went through this part pretty quickly—I’ll share more about it in a subsequent post!

Once we have enough languages in the database, we can start to extract insights about how different languages do or don’t behave similarly, and through that, we start to learn about language in general. When I built a database like the example I have been discussing in this post, I learned that demonstratives and adjectives are the most likely categories to agree with nouns. In fact, if demonstratives agree with nouns in a language, then it is likely that adjectives will, too. This is just one way in which English turns out to be weird: demonstratives (this/that) agree in English, but adjectives don’t—we don’t say heavies books!
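A tendency like “if demonstratives agree, adjectives probably do too” can be checked as a conditional proportion: among the languages where demonstratives agree, what share also has agreeing adjectives? The sketch below uses invented toy data purely to show the calculation; the rows and the resulting number are not results from any real database.

```python
def conditional_rate(languages, if_class, then_class):
    """Share of languages with agreement on if_class that also have it on then_class."""
    antecedent = [lang for lang in languages if lang.get(if_class)]
    if not antecedent:
        return None  # no language satisfies the "if" part, so nothing to measure
    both = [lang for lang in antecedent if lang.get(then_class)]
    return len(both) / len(antecedent)


# Invented toy sample: True means the word class agrees with the noun.
toy_sample = [
    {"name": "A", "demonstrative": True, "adjective": True},
    {"name": "B", "demonstrative": True, "adjective": True},
    {"name": "C", "demonstrative": True, "adjective": False},  # English-like
    {"name": "D", "demonstrative": False, "adjective": True},
    {"name": "E", "demonstrative": False, "adjective": False},
]

rate = conditional_rate(toy_sample, "demonstrative", "adjective")
print(f"{rate:.2f}")  # → 0.67
```

Swapping the two arguments gives the reverse implication, which is worth checking too, since a tendency can hold in one direction only.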

Congratulating yourself

That’s that! As with lots of language work, the most time-intensive part is not really analyzing the results; it’s ensuring that the analysis is based on good data. Collecting good data can take a long time, especially when you’re pulling it from a lot of different languages. So, if you feel compelled to make a database like this one, you might as well start now! A database that would be very useful for both academic and applied contexts would be one that catalogs the different word orders a language can use depending on the discourse context (e.g., topicalization or focusing). There are related databases on WALS, but they’re less targeted. Any linguists reading this will know that that database could be quite difficult to construct!

A simpler task would be to contribute to an existing database. The Universal Dependencies Corpus is a great example. Right now, the most developed samples in the database are Indo-European languages (and a few other major world languages of Asia). As a result, the Universal Dependencies Corpus is unfortunately still biased!

Some language in the wild from Estonia: the translation of Thoreau’s “Why should we live with such hurry and waste of life?”

Conctypo: What, why, and how

In the past year or so since leaving my academic position and relocating to San Francisco, I have spent my “work time” learning and doing different things. Some of that has been developing my technical skills, and some of it has been continuing the research program I developed while in academia. In particular, nominal concord continues to be an obsession of mine, and I still have unanswered questions that I suspect nobody will answer if I don’t. The most satisfying result is when I’m able to marry these two pursuits by using my increased technical skills to improve my research effectiveness. Today, I’m going to introduce the research project I’ve spent the most time with, my typological sample of nominal concord, aka Conctypo.

I’ve debated whether to kick off this series from the very beginning, but I decided instead to start where I am right now. I’ll dig into the past in some subsequent posts.

What is nominal concord and what is Conctypo?

If you’ve studied a European language before, you’ve probably encountered the phenomenon that I (and others) call nominal concord. For example, in the Spanish phrases la casa blanca ‘the white house’ and el edificio blanco ‘the white building’, the words for ‘the’ (la/el) and ‘white’ (blanca/blanco) change their form based on the noun that they modify. In this instance, it’s because the noun casa ‘house’ is feminine and the noun edificio ‘building’ is masculine. This is an example of nominal concord.

A schematic representation of nominal concord. There are orange-colored lines connecting the feminine noun casa 'house' to its modifiers. There are green-colored lines connecting the masculine noun edificio 'building' to its modifiers.
A graphical representation of nominal concord for gender in Spanish

More technically, nominal concord is the agreement process in language whereby modifiers of a noun (e.g., adjectives, numerals, or demonstratives) must match the noun they modify in particular features (e.g., gender, number, or case). Nominal concord is a well-known process in linguistics, perhaps because it is widespread in Indo-European languages. But it also exists outside of Indo-European: it is found on all six inhabited continents (sorry, Antarctica, but you don’t count as inhabited).

Conctypo is a typological sample of nominal concord in the world’s languages. As of this writing, I (with the help of research assistants while I was at OU) have collected data on 244 languages. The first time I presented the project was at the LSA meeting in 2019. The paper and the entire data set (including only 174 languages for better genetic/geographic balance) are available in the SHAREOK archive: A typological perspective on nominal concord. Since then, I have stopped managing the data with spreadsheets and now store the information in JSON files. All that I have to do to update the database is add the JSON files for the new languages and run the Python scripts I’ve written to pull relevant numbers. But more on that in a later post!

Why build this typological database?

While there are many broad tendencies in language structure (go tool around WALS if you never have), languages also have plenty of idiosyncratic properties. When devising models of language structure, a reasonable approach (to my mind) would be to use common properties as the foundations of the theory. In order to build a theory of nominal concord, we would need an understanding of what the common properties of nominal concord are. That’s where Conctypo comes in: the cross-linguistic sample can tell us what is common in concord systems. In turn, when looking at the concord system of a particular language, we can correctly identify idiosyncratic properties as idiosyncratic (instead of mistaking them for plausibly general properties of concord).

A map of the world with dots scattered throughout. The green dots show languages with nominal concord. The gray dots show languages without nominal concord.
A map showing languages with concord (green dots) and languages without (gray dots). Made in R with the lingtypology package. See this tweet thread for more concord map images.

To put a finer point on this, let me discuss Indo-European briefly. Often, when a researcher brings up nominal concord, they use data from an Indo-European language to highlight its behavior. Concord systems in other languages are compared to Indo-European systems, with the implicit assumption that Indo-European systems are normal or common. Yet without a cross-linguistic understanding of concord systems, we can’t be sure this is true! Concord in Indo-European languages is robust and regular; commonly, gender and number are represented on nearly every word modifying a noun. But concord in the world’s languages could be more sporadic. It could involve, for example, number on some words and gender on other words.

Schematic representations of concord systems. On top, a system like Spanish, where both the article and the adjective must agree with the noun’s feminine and plural features. On the bottom, a hypothetical system where the article only agrees in gender and the adjective only agrees in number.

In such a world, Indo-European concord would perhaps be overzealous. The only way to know (well, the only way to feel more assured) is to go to the data.
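The contrast between the two kinds of system can also be stated as a simple check on the data: does every agreeing word class mark the same set of features, or do different classes mark different ones? The following sketch assumes a hypothetical representation in which each word class maps to the list of features it agrees in; the function name and layout are made up for illustration.

```python
def is_uniform(concord):
    """True if every agreeing word class agrees in the same set of features."""
    # Word classes with an empty feature list do not agree and are ignored.
    feature_sets = {frozenset(features) for features in concord.values() if features}
    return len(feature_sets) <= 1


# Spanish-like: article and adjective both agree in gender and number.
spanish_like = {"article": ["gender", "number"], "adjective": ["gender", "number"]}

# Hypothetical split system: the article agrees only in gender,
# the adjective only in number.
split_system = {"article": ["gender"], "adjective": ["number"]}

print(is_uniform(spanish_like))  # → True
print(is_uniform(split_system))  # → False
```

Counting how many languages in a sample pass this check would be one rough way to ask how typical the Indo-European pattern is.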

At the time I started gathering data, I knew of no other typological investigation of nominal concord. Thus, if I wanted to know the answers to these questions, I had to find them myself. After about a year and a half, I found the work of Ranko Matasović and İsa Kerem Bayırlı, who have collected their own concord or concord-related typological samples. We do not all document the same properties, though, so the more, the merrier!

How did we collect the data?

We look for linguistic examples on three different kinds of words:

  1. Demonstratives: words like this/these or that/those
  2. Cardinal numerals greater than ‘one’: number words like two, three, or seven. We specifically avoid ordinal numerals like second, third, or seventh, as these often behave like adjectives. We also avoid one because it shows idiosyncratic behavior in some languages; the goal is to treat numerals, as far as possible, as a distinct category.
  3. Adjectives: words like green, tall, old, etc. This can get tricky as some languages lack a clearly defined adjective class.

To find the examples needed, we look in published sources, including PhD dissertations. Ideally, the source is a grammar, i.e., a reference guide to the linguistic properties of the language. Failing a suitable grammar, we use other types of writing (ideally published, but for some languages, the only available material may be unpublished). We look through the source to find suitable attested examples, where “suitable” means something like “If the language had concord, we would be able to see it in this example.” I take pictures, take screenshots, or copy the text of the example and save it in a Google Doc for archival purposes (and in case I ever want to check my work).

In Pondi (Ulmapo; Papua New Guinea), adjectives show concord in number (Barlow, 2020:77)

Once I have finished documenting all three word classes, I can update the database. I wrote a program in Python that asks me the requisite questions and then creates a properly formatted JSON file and saves it in the proper place (more on this program later!).
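For readers who want to try something similar, here is a rough sketch of a question-and-answer entry script. It is not the actual program described above: the prompts, the record layout, the file-naming rule, and the function names are all assumptions made for this example. The question function is passed in as a parameter so that the demo can run with scripted answers instead of waiting at the keyboard.

```python
import json
import tempfile
from pathlib import Path

# The three word classes documented for each language.
WORD_CLASSES = ["demonstrative", "numeral", "adjective"]


def build_record(ask=input):
    """Ask for a language name and, per word class, the features it agrees in."""
    record = {"name": ask("Language name: ").strip(), "concord": {}}
    for word_class in WORD_CLASSES:
        answer = ask(f"Features on {word_class}s (comma-separated, blank for none): ")
        record["concord"][word_class] = [
            feature.strip() for feature in answer.split(",") if feature.strip()
        ]
    return record


def save_record(record, directory):
    """Write the record to <directory>/<name>.json and return the path."""
    filename = record["name"].lower().replace(" ", "_") + ".json"
    path = Path(directory) / filename
    path.write_text(json.dumps(record, indent=2, ensure_ascii=False), encoding="utf-8")
    return path


# Scripted answers stand in for typing at the prompt: English demonstratives
# agree in number; numerals and adjectives do not agree.
answers = iter(["English", "number", "", ""])
record = build_record(ask=lambda prompt: next(answers))

with tempfile.TemporaryDirectory() as tmp:
    saved = save_record(record, tmp)
    saved_name = saved.name
    saved_text = saved.read_text(encoding="utf-8")

print(saved_name)  # → english.json
print(saved_text)
```

For interactive use, call build_record() with no arguments so that it falls back to the built-in input, and pass a real directory to save_record.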

And the work continues…

My work on this project continues in the form of data collection, computational streamlining, and pursuing theoretical implications. Until next time!