A perspective on data usability from the concord typology project

Cross-linguistic studies involve a lot of data collection. If we want to make the most progress towards understanding something, it makes sense that we should make our work usable for people besides ourselves. We might be able to get some additional help!

When I started the concord typology project as a researcher, I knew I wanted to keep track of my data in such a way that other researchers could easily build on the work I had done. I also wanted to make sure that anybody who wanted to retrace my steps would be able to without having to build the study from scratch.  After writing the proceedings paper based on the initial results, I uploaded the data (with the paper) to OU/OSU/OCU’s SHAREOK archive. Here’s what’s in there:

  • Main article: the proceedings paper (geared towards academic audiences)
  • Research data (spreadsheet): all the coding and classification I did based on the data that I collected; easy to digest reasonably quickly
  • Research data (read me): an explanation of the contents of the archive
  • Research data (examples + examples appendix): the actual linguistics examples that the spreadsheet is based on; sort of like research notes and thus not as easy to digest

I spent some time cleaning up the data to prepare it for eyes other than my own, but it’s hard to perfect it on the first pass. I called it good enough and then got back to work collecting more data.

This is an attempt to use a slightly blurry and slightly purple clipping from the research data spreadsheet as an artistic way to break up the flow of text. Hashtag data is art 😆

One year later, someone finds and uses the data!

About a year after I published my data in the archive, I got an email from Kyle Mahowald, an assistant professor of linguistics (at UC Santa Barbara) who is interested in computational modeling of cross-linguistic studies (like mine). He stumbled upon my data and decided to start building a model based on the data that could account for issues of genetic and geographic proximity. This was, of course, very exciting: somebody was building on the work that I started! I spent a good deal of time getting the data ready for the archive, so seeing that somebody found it and used it made all that effort worthwhile.

This brings me to my next point: when you think about making your data usable, consider the user carefully, and make sure the data is as usable as possible. In the version in the SHAREOK archive, I made a choice that negatively affected usability. When Kyle initially wrote the model, it was making a few predictions that were strikingly different from my results. The issue arose from the coding schema I used for for the spreadsheets. In brief: there was an overlap in some of the labels I had used, and so the script was treating some distinct labels as though they were the same. The bug was an easy fix, but we only noticed it because Kyle and I started collaborating and discussing the model he developed. It got me thinking about how I could make my data not only available, but (even more) useable.

Hammer, Sledgehammer, Mallet, Tool, Striking, Hitting
If you give a very simple program a hammer, it’s gonna start looking for nails (or whatever this smashed thing is).

Iterating towards better usability

After my initial conversations with Kyle, I put some work in to improve the usability of my data. I wanted something that achieved a balance between these three things:

  1. Easy to read (for humans)
  2. Easy to process (for computers)
  3. Easy to update (for me)

For this, I settled on storing the coding for each language as a JSON file (as I mentioned in this post). I find JSON files relatively easy to read—especially if you save them with some formatting for readability—and they can be easily converted to other formats. I wrote a few scripts to convert the existing overlapping labels into a system without overlap. And to add new data, all I have to do is add a JSON file for the language I just documented (which I’ve already written a script for).

I now store and update the data on OSF, a free and open platform for sharing research. This means anybody (even you, dear reader) can download the current state of the study this very moment! If you don’t have experience working with JSON files yourself, don’t worry: I have a script on my github that processes the important data from all the JSON files and saves it as a single CSV. So, even if you haven’t used anything besides Excel/Sheets, you can still look at the data!

Keep your data user-friendly!

To sum up, the big lesson here is to keep your data user-friendly. If you want whoever uses your data—be they a research colleague, a coworker, or a client—to be able to build on the work you’ve put in, think about how they might use the data and try to make that as easy as possible.

Books, Pages, Story, Stories, Notes, Reminder, Remember

A very simple spelling corrector for Estonian

If you’ve spent any time looking at online NLP resources, you’ve probably run into spelling correctors. Writing a simple but reasonably accurate and powerful spelling corrector can be done with very few lines of code. I found this sample program by Peter Norvig (first written in 2006) that does it in about 30 lines. As an exercise, I decided to port it over to Estonian. If you want to do something similar, here’s what you’ll need to do.

First: You need some text!

Norvig’s program begins by processing a text file—specifically, it extracts tokens based on a very simple regular expression.

import re
from collections import Counter

def words(text): return re.findall(r'\w+', text.lower())

WORDS = Counter(words(open('big.txt').read()))

The program builds its dictionary of known “words” by parsing a text file—big.txt—and counting all the “words” it finds in the text file, where “word” for the program means any continuous string of one or more letters, digits, and the underscore _ (r'\w+'). The idea is that the program can provide spelling corrections if it is exposed to a large number of correct spellings of a variety of words. Norvig’s ran his original program on just over 1 million words, which resulted in a dictionary of about 30,000 unique words.

To build your own text file, the easiest route is to use existing corpora, if available. For Estonian, there are many freely available corpora. In fact, Sven Laur and colleagues built clear workflows for downloading and processing these corpora in Python (estnltk). I decided to use the Estonian Reference Corpus. I excluded the chatrooms part of the corpus (because it was full of spelling errors), but I still ended up with just north of 3.5 million unique words in a corpus of over 200 million total words.

Measuring string similarity through edit distance

Norvig takes care to explain how the program works both mechanically (i.e., the code) and theoretically (i.e., probability theory). I want to highlight one piece of that: edit distance. Edit distance is a means to measure similarity between two strings based on how many changes (e.g., deletions, additions, transpositions, …) must be made to string1 in order to yield string2.

A diagram showing single edits made to the string <paer>
Four different changes made to ‘paer’ to create known words.

The spelling corrector utilizes edit distance to find suitable corrections in the following way. Given a test string, …

  1. If the string matches a word the program knows, then the string is a correctly spelled word.
  2. If there are no exact matches, generate all strings that are one change away from the test string.
    • If any of them are words the program knows, choose the one with the greatest frequency in the overall corpus.
  3. If there are no exact matches or matches at an edit distance of 1, check all strings that are two changes away from the test string.
    • If any of them are words the program knows, choose the one with the greatest frequency in the overall corpus.
  4. If there are still no matches, return the test string—there is nothing similar in the corpus, so the program can’t figure it out.

The point in the program that generates all the strings that are one change away is given below. This is the next place where you’ll need to edit the code to adapt it for another language!

def edits1(word):
#     "All edits that are one edit away from `word`."
    letters    = 'abcdefghijklmnopqrstuvwxyz'
    splits     = [(word[:i], word[i:])    for i in range(len(word) + 1)]
    deletes    = [L + R[1:]               for L, R in splits if R]
    transposes = [L + R[1] + R[0] + R[2:] for L, R in splits if len(R)>1]
    replaces   = [L + c + R[1:]           for L, R in splits if R for c in letters]
    inserts    = [L + c + R               for L, R in splits for c in letters]
    return set(deletes + transposes + replaces + inserts)

Without getting into the technical details of the implementation, the code takes an input string and returns a set containing all strings that differ from the input in only one way: with a deletion, transposition, replacement, or insertion. So, if our input was ‘paer’, edits1 would return a set including (among other thing) par, paper, pare, and pier.

The code I’ve represented above will need to be edited to be used with many non-English languages. Can you see why? The program relies on a list of letters in order to create replaces and inserts. Of course, Estonian does not have the same alphabet as English! So for Estonian, you have to change the line that sets the value for letters to match the Estonian alphabet (adding ä, ö, õ, ü, Å¡, ž; subtracting c, q, w, x, y):

    letters    = 'aäbdefghijklmnoöõprsštuüvzž'

Once you make that change, it should be up and running! Before wrapping up this post, I want to discuss one key difference between English and Estonian that can lead to some different results.

A difference between English and Estonian: morphology!

In Norvig’s original implementation for English, a corpus of 1,115,504 words yielded 32,192 unique words. I chopped my corpus down to the same length, and I found a much larger number of unique words: 170,420! What’s going on here? Does Estonian just have a much richer vocabulary than English? I’d say that’s unlikely; rather, this has to do with what the program treats as a word. As far as the program is concerned, be, am, is, are, were, was, being, been are all different words, because they’re different sequences of characters. When the program counts unique words, it will count each form of be as a unique word. There is a long-standing joke in linguistics that we can’t define what a word is, but many speakers have the intuition is and am are not “different words”: they’re different forms of the same word.

The problem is compounded in Estonian, which has very rich morphology. The verb be in English has 8 different forms, which is high for English. Most verbs in English have just 4 or 5. In Estonian, most verbs have over 30 forms. In fact, it’s similar for nouns, which all have 12-14 “unique” forms (times two if they can be pluralized). Because this simple spelling corrector defines word as roughly “a unique string of letters with spaces on either side”, it will treat all forms of olema ‘be’ as different words.

Why might this matter? Well, this program uses probability to recommend the most likely correction for any misspelled words: choose the word (i) with the fewest changes that (ii) is most common in the corpus. Because of how the program defines “word”, the resulting probabilities are not about words on a higher level, they’re about strings, e.g., How frequent is the string ‘is’ in the corpus? As a result, it’s possible that a misspelling of a common word could get beaten by a less common word (if, for example, it’s a particularly rare form of the common word). This problem could be avoided by calculating probabilities on a version of the corpus that has been stemmed, but in truth, the real answer is probably to just build a more sophisticated spelling corrector!

Spelling correction: mostly an English problem anyway

Ultimately, designing spelling correction systems based on English might lead them to have an English bias, i.e., to not necessarily work as effectively on other languages. But that’s probably fine, because spelling is primarily an English problem anyway. When something is this easy to put together, you may want to do it just for fun, and you’ll get to practice some things—in this case, building a data set—along the way.

Conctypo: What, why, and how

In the past year or so since leaving my academic position and relocating to San Francisco, I have spent my “work time” learning and doing different things. Some of that has been developing my technical skills, and some of it has been continuing the research program I developed while in academia. In particular, nominal concord continues to be an obsession of mine, and I still have unanswered questions that I think nobody will find the answers to if not me. The most satisfying result is when I’m able to marry these two pursuits by using my increased technical skills to improve my research effectiveness. Today, I’m going to introduce the research project I’ve spent the most time with, my typological sample of nominal concord, aka Conctypo.

I’ve debated whether to kick off this series from the very beginning, but I decided instead to start where I am right now. I’ll dig into the past in some subsequent posts.

What is nominal concord and what is Conctypo?

If you’ve studied a European language before, you’ve probably encountered the phenomenon that I (and others) call nominal concord. For example, in the Spanish phrases la casa blanca ‘the white house’ and el edificio blanco ‘the white building’, the words for ‘the’ (la/el) and ‘white’ (blanca/blanco) change their form based on the noun that they modify. In this instance, it’s because the noun casa ‘house’ is feminine and the noun edificio ‘building’ is masculine. This is an example of nominal concord.

A schematic representation of nominal concord. There are orange-colored lines connecting the feminine noun casa 'house' to its modifiers. There are green-colored lines connecting the masculine noun edificio 'building' to its modifiers.
A graphical representation of nominal concord for gender in Spanish

More technically, nominal concord is the agreement process in language whereby modifiers of a noun (eg, adjectives, numerals, or demonstratives) must match the noun they modify in particular features (eg, gender, number, or case). Nominal concord is a well-known process in linguistics, perhaps due to the fact that it is widespread in Indo-European languages. But it exists outside of the range of Indo-European languages: it is found on all 6 inhabited continents (sorry, Antarctica, but you don’t count as inhabited).

Conctypo is a typological sample of nominal concord in the world’s languages. As of this writing, I (with the help of research assistants while I was at OU) have collected data on 244 languages. The first time I presented about the project was at the LSA meeting in 2019. The paper and entire data set (including only 174 languages for better genetic/geographic balance) is available in the SHAREOK archive here: A typological perspective on nominal concord. Since then, I have stopped managing the data with spreadsheets and now store the information in JSON files. All that I have to do to update the database is add the JSON files for the new languages and run the Python scripts I’ve written to pull relevant numbers. But more on that in a later post!

Why build this typological database?

While there are many broad tendencies in language structure—go tool around WALS if you never have—languages also have plenty of idiosyncratic properties. When devising models of language structure, a reasonable approach (to my mind) would be to use common properties as the foundations of the theory. In order to build a theory of nominal concord, we would need an understanding of what the common properties of nominal concord are. That’s where Conctypo comes in— the cross-linguistic sample can tell us what is common in concord systems. In turn, when looking at the concord system of a particular language, we can correctly identify idiosyncratic properties as idiosyncratic (instead of mistaking them for plausibly general properties of concord).

A map of the world with dots scattered throughout. The green dots show languages with nominal concord. The gray dots show languages without nominal concord.
A map (image) showing languages with concord (green dots) and languages without (gray dots). Made in R with the lingtypology package. See this tweet thread for more concord map images.

To put a finer point on this, let me discuss Indo-European briefly. Often, when a researcher brings up nominal concord, they use data from an Indo-European language to highlight its behavior. Concord systems in other languages are compared to Indo-European systems, with the implicit assumption that Indo-European systems are normal or common. Yet without a cross-linguistic understanding of concord systems, we can’t be sure this is true! Concord in Indo-European languages is robust and regular; commonly, gender and number are represented on nearly every word modifying a noun. But concord in the world’s languages could be more sporadic. It could involve, for example, number on some words and gender on other words.

A schematic representation of two kinds of concord systems.
Schematic representations of concord systems. On top, a system like Spanish, where both the article and the adjective must agree with the noun’s feminine and plural features. On the bottom, a hypothetical system where the article only agrees in gender and the adjective only agrees in number.

In this world, Indo-European concord would be perhaps overzealous. The only way to know—well, the only way to feel more assured—is to go to the data.

At the time I started gathering data, I knew of no other typological investigation of nominal concord. Thus, if I wanted to know the answers to these questions, I had to find them myself. After about a year and a half, I found the work of Ranko Matasović and İsa Kerem Bayırlı, who have collected their own concord or concord-related typological samples. We do not all document the same properties, though, so the more, the merrier!

How did we collect the data?

We look for linguistic examples on three different kinds of words:

  1. Demonstratives: words like this/these or that/those
  2. Cardinal numerals greater than ‘one’: number words like two, three, or seven. We specifically avoid ordinal numerals like second, third, or seventh as these often behave like adjectives. We also avoid one because it shows idiosyncratic behaviors in some languages—the goal was to try as much as possible to look at numerals as a distinct category.
  3. Adjectives: words like green, tall, old, etc. This can get tricky as some languages lack a clearly defined adjective class.

To find the examples needed, we look in published sources, including PhD dissertations. Ideally, this would be a grammar, i.e., a reference guide to the linguistic properties of the language. Failing a suitable grammar, we will use other types of writing (ideally published, but for some languages, the only available material may be unpublished). We look through the grammar to find suitable attested examples, where “suitable” means something like If the language had concord, we would be able to see it in this example. I take pictures, take screenshots, or copy the text of the example and save it in a Google Doc for archival purposes (and in case I ever want to check my work).

Linguistic examples showing adjective concord for number in the Pondi language.
In Pondi (Ulmapo; Papua New Guinea), adjectives show concord in number (Barlow, 2020:77)

Once I have finished documenting all three word classes, I can update the database. I wrote a program in Python that asks me the requisite questions and then creates a properly formatted JSON file and saves it in the proper place (more on this program later!).

And the work continues…

My work on this project continues in the form of data collection, computational streamlining, and pursuing theoretical implications. Until next time!