What I (a linguist) did while searching for a job (in tech)

This is a post (or perhaps the first post in a series) about the search for my first job in the language + tech industry space. I could have called this something like What I Did to Get a Job, but one of my opinions about getting your first industry job is that there is no silver bullet, so I can’t say that any of these things were necessary for getting my first job. However, I did do some things while searching for a job and then did manage to get one (first a contract job and then a full-time job).

To say that again another way: there is no one thing that you must do to be able to get a job, nor is there one thing that will enable you to easily find a job if you do it. Unfortunately! So what you should do is just keep trying to develop your skills.


What I did to learn/practice coding (mostly Python):

  • Free intro course (keyword FREE); I did Codecademy’s Python 2 course, because their Python 3 course costs money.
    • There are differences between Python v2 and v3 but they are minor and you will pick them up.
    • This will help you learn the basic data structures.
  • Automate the Boring Stuff is a good book, and all the chapters are available online (just scroll down the linked page):
    • Learn to manipulate files (reading/writing CSVs, text files, and JSONs): Chapters 9, 13, 14, 16 (at least!)
  • Free mini-courses on Kaggle
    • They have great bite-sized courses on a variety of topics, e.g., Intro to Python, Pandas, intro to ML, advanced ML, data cleaning
    • When you’re done, you can even add a certificate to your LinkedIn profile, which you should do!
  • NLP-focused content
    • Introduction to spaCy (a great library for NLP)
    • NLTK book
    • My two cents: don’t worry about the syntactic details; focus on internalizing the steps of an NLP pipeline in a broad sense
      • I got overly concerned with knowing this well—I never really got there, and I also haven’t had to know it well for either of my jobs or for my interviews.
      • The file management stuff is more important!

The main thing I have done at my jobs with Python is file and data management. For file management, learn to do things like (i) list all the files in a directory, (ii) perform the same operation on every file in a directory (or just every .csv or .txt file…), etc. For data management, you want to be able to manipulate different file types; I would say focus on being able to import and export CSVs and JSONs. In my opinion, you should prioritize learning (some of) the pandas library, because it’s powerful and also used by data scientists. I have some aspects of pandas memorized by now, but I still look things up all the time. That being said, I once had a coding interview where the code was written with the csv package, and I couldn’t remember the syntax. I just tried to do my best in the interview, but I did feel afterwards like I should make sure to remember how the csv package works. For language work, the csv package is also probably quite a workable solution—some of my colleagues at both jobs have used csv instead of pandas.
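To make those tasks concrete, here’s a minimal sketch (the directory and file names are invented for illustration, and the toy CSV is created up front just so the loop has something to process):

```python
from pathlib import Path

import pandas as pd

# Invented demo directory with one toy CSV (stands in for real data files)
data_dir = Path("demo_data")
data_dir.mkdir(exist_ok=True)
pd.DataFrame({"Language ": ["Estonian"], " Case": ["inessive"]}).to_csv(
    data_dir / "languages.csv", index=False
)

# (i) list all the files in a directory
all_files = [p.name for p in data_dir.iterdir() if p.is_file()]

# (ii) perform the same operation on every .csv file in the directory:
# here, strip whitespace from the column headers and lowercase them
for csv_path in data_dir.glob("*.csv"):
    df = pd.read_csv(csv_path)
    df.columns = [c.strip().lower() for c in df.columns]
    df.to_csv(csv_path, index=False)

# import a CSV and export it as JSON
df = pd.read_csv(data_dir / "languages.csv")
df.to_json(data_dir / "languages.json", orient="records")
```

None of this is fancy, but it really is the bread and butter: directories, loops over files, and round-tripping between CSV and JSON.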

Learning ML/AI/NLP

The key thing here is that you need to focus on being conversant in these concepts rather than necessarily being able to write them. Maybe you can get to that, too, but step one is understanding what the pieces are. The reason to watch and read these things is not so you can necessarily do this work (speaking for myself, I do not build ML models at work), but so you know enough about them to know kind of how the enterprise works. To be clear, there are some “linguist” jobs where you build models, so if you’re interested in it, it’s definitely worthwhile. However, you can secure a job without building a model. Again, these are the things that I read or watched while I was searching, not necessarily “must watch” or “must read” sources!

We did not get to see the models in my job at Amazon. I still think it was helpful for me to know a little bit how these programs work so I could talk/think about how the data would be used. 

“Do a project”

A lot of people told me some version of “do a project.” If your feeling when you hear that is “Sounds good, but what?”, I don’t blame you! As someone offering advice, it is easy to think that project ideas abound, because once you start working, so many more ideas come to you. However, when I was searching, I just didn’t know what projects I could do. Most identifiable NLP projects (i.e., papers) are multi-authored, so I felt that I couldn’t make headway there (to say nothing of the fact that I was still learning!). I definitely didn’t think I was doing any projects that were big enough to count as projects (whatever that means), and at times, I felt like I couldn’t do one even if I wanted to.

Surprise! I actually did some projects

Now that I’ve been working about 8 months, I can see that I did “do a project” a few times. Here is a reasonably complete list.

  • Poketext: scraped Pokédex entries from Bulbapedia pages and saved them in one giant text file
    • When I started this, I had in mind that I was building a corpus. I don’t know what the corpus would be used for— probably a good idea to have an answer to that.
    • This gave me some experience web scraping with the Python module called BeautifulSoup
  • Poketext (part deux): tried to fix issues in some of the sentences that I saw using spaCy
    • Some of the sentences in the Pokédex would refer to the Pokémon by name, but some would just use a pronoun or an NP like “this Pokémon”.
    • I created an algorithm to replace the pronouns/NPs with the name of the Pokémon, and I encountered some interesting issues in the process.
  • Concord: I converted my typological database from a somewhat unwieldy spreadsheet to a format that was more computationally friendly and easily updated.
    • Each individual language in the data set is one JSON file
    • I wrote scripts to import all the JSON files from the directory and output descriptive stats about the data set
    • I wrote a program that creates a data file for any language that is new to the data set (after I’ve documented it myself). It asks me a few questions about the language and then creates the JSON file.
    • Because I did this, Kyle Mahowald found my data and we started a research collaboration.
    • FWIW: this is the project that ended up being the most useful (imo) for getting my first job.
  • Estonian spell checker: Messed with some Estonian data to adapt Peter Norvig’s very simple English spell checker for Estonian.
  • My blog!: Though I do sometimes write about esoteric issues connected to my linguistic research, I also tried to write some posts addressing language + data in a friendly, accessible way.
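For the Concord project above, the import-and-summarize step can be sketched like this (a sketch only: the directory layout and field names here are invented stand-ins for my real schema, and the two toy entries exist just so the script has something to summarize):

```python
import json
from pathlib import Path

# Invented layout: one JSON file per language, with two toy entries
lang_dir = Path("languages")
lang_dir.mkdir(exist_ok=True)
for name, concord in [("estonian", True), ("khasi", False)]:
    (lang_dir / f"{name}.json").write_text(
        json.dumps({"name": name, "case_concord": concord})
    )

# import every JSON file in the directory...
languages = [json.loads(p.read_text()) for p in lang_dir.glob("*.json")]

# ...and output a descriptive stat about the data set
n_concord = sum(lang["case_concord"] for lang in languages)
print(f"{len(languages)} languages, {n_concord} with case concord")
```

One JSON per language means adding a new language never touches the existing files, which is what made the data set easy to update.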

My recommendation is that you put these projects on GitHub and/or on a website, no matter how small you think they are. If recruiters or hiring managers get curious enough about you to look for your web presence, you want to give them something to chew on. Projects that are representative of your current skills make the most sense—if they’re not flashy enough for a particular job, then trust me: you don’t want that job (right now)!

Additional project ideas

Feel free to riff on any of the project ideas stated above. If none of those sound fun to you, here are some ideas to inspire any do-a-projects you might pursue.

  • Build a test set for an imaginary classifier: Test sets are smaller than training sets, so it won’t take you as long to annotate them.
    • Columns (one idea): label/annotation, URL, first paragraph, first paragraph tokenized or split on spaces (eg, using SPLIT() in Sheets)
    • Classifier ideas:
      • X or not X: news articles that are or are not about a certain thing (eg, about accidents or not, about the stock market or not, or even subjective categories)
      • native vs non-native (or: native vs. translated): sentences/paragraphs that are or are not from native speakers.
  • Build a corpus with web data (possibly scraping with BeautifulSoup if you can’t find the data more easily): collect examples with the idea that a person could design an annotation using the data you’ve collected.
    • There are lots of “corpus of movie review” tutorials online— these won’t “make you stand out”, but you will learn a lot from doing them.
    • I also saw something that involved pulling all the dialogue from a TV show and using that as data
    • Could use this corpus to feed into a project like the one above.
  • Playing with old research data: If you have any research data that can be put into spreadsheet format, try to do something with it in python.
    • Maybe that’s data visualization,
    • maybe that’s creating randomized samples of the data,
    • maybe that’s writing a script that would let you add to the data (e.g., what fields does the script need to ask for),
    • maybe you want to create a database of linguistic examples that you have gathered in your fieldwork/research and tag it for relevant information (eg, these are my relative clause examples, these are my wh-question examples, …)
    • If you don’t have any spreadsheet data from your research, you could download some samples from wals.info and play with those. Eg, can you figure out how to download 3 samples from WALS and combine them all into one file (so that if a language appears multiple times, you consolidate all of its values into one row)? This stuff isn’t super hard to learn!
    • just do something to give yourself something to work on and slowly figure out
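The WALS exercise above (combine several samples, consolidate each language into one row) is a nice fit for pandas. Here’s a sketch with toy tables standing in for the downloaded samples—real WALS files have different column names, so treat the fields below as invented:

```python
import pandas as pd

# Toy stand-ins for three downloaded WALS feature samples
s1 = pd.DataFrame({"language": ["Estonian", "Khasi"],
                   "order_of_object_and_verb": ["OV", "VO"]})
s2 = pd.DataFrame({"language": ["Estonian"], "number_of_cases": ["10 or more"]})
s3 = pd.DataFrame({"language": ["Khasi"], "gender": ["none"]})

# Stack the samples, then consolidate to one row per language,
# keeping the first non-missing value in each column
combined = (
    pd.concat([s1, s2, s3], ignore_index=True)
      .groupby("language", as_index=False)
      .first()
)
```

The `.first()` step works here because pandas groupby skips missing values, so each language’s row picks up whichever sample actually had a value for that column.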

The important thing to remember: there’s no secret beyond trying

Unfortunately, if you don’t already have computational or statistical skills, it can be hard to show a recruiter that you can contribute. There is no secret to this task— you just have to maintain your ability to try new things and keep waiting for a little bit of luck. I don’t mean this in a hokey All You Need to Do Is Try kind of way. I just mean it’s probably not helpful to worry about whether you have done The One Thing You’re Supposed to Do. Such a thing probably doesn’t exist. When I wasn’t worried about that possibility, I kept working on projects as long as I found them fun/interesting/whatever, and when I stopped feeling excited about them, I moved on to something else. I never did finish that ML course even though I told myself I would. It’s okay— just keep trying to practice or learn.

Case and K

Before you read too far, let me issue this disclaimer: when I say case in this blogpost, I am never talking about syntactic Case with a capital C.

If you ask anybody who works on generative nominal morphosyntax where case is, my guess is that most of them will bring up KP, a head that (most of the time) is assumed to take a DP complement. The earliest citation I’m aware of for KP is Lamontagne & Travis (1987) (see also Travis & Lamontagne, 1992), but since as early as 2005, people have been using KP without citation. I think many/most NP generative syntax folks would not disagree with a statement like, “KP is the location of case features,” but there are in fact very few works that carefully explore the connection between K and case morphemes.

Case = K: Case particles

I can’t get too in the weeds with this (b/c blog), but here’s one example where the connection between case and K is brought up. At the beginning of their paper which is mostly about nominal licensing (but does use KP), Bittner and Hale (1996:4) suggest that the order of case particles and nominal phrases tracks that of verb and object.

In Mískito (ISO miq; Misumalpan, Honduras/Nicaragua), verbs follow objects and case particles follow NPs (Bittner and Hale, 1996:4).
In Khasi (ISO kha; Austroasiatic, Bangladesh/India), verbs precede objects and case particles precede NPs (Bittner and Hale, 1996:4).

But they do not cite or report on the results of a typological study. And Dryer’s sample of case affixes does not include case particles. However, the border between adpositions and case is fuzzy, and since adpositions also closely track VO order, I expect it’s true that case particles do likewise (to the extent that a border between case particles and adpositions can be established).

Case ≠ K: Case concord

The mapping between case and K is clearest in these case particle languages, because there is one case morpheme and one syntactic locus (and 1, as they say, = 1). But there are languages where case appears multiple times per NP (languages with case concord), and there it is not clear what the connection is between K and case. Take, for example, Estonian:

In Estonian (ISO ekk; Uralic, Estonia), case is marked on many of the words inside NP. In this example, inessive case -s appears on each word (Norris, 2018: 539).

In my NLLT paper (and in my other work on case concord), I do treat case as originating on K in some sense, but I do not specify how the K head itself is realized. I believe the same is true for Ingason (2016), who discusses case concord in Icelandic (hmm, where have I heard of that before?). When people would ask me about this when I was in grad school, I would provide a joke answer of, “Oh, I like to pretend the K head explodes and rains down its pieces on the heads below,” but I did not have a real answer. The only plausible answer (that I don’t anticipate working out any time soon) is that K is realized as case on the noun.

There are some other approaches, too, like treating case as a feature assigned to a phrase rather than originating on a head (Baker and Kramer, 2014:148) or inserted as postsyntactic morphemes (eg, Embick and Noyer, 2001).

Case ≟ K: other case suffixes

The missing piece of this investigation imo is languages with case suffixes but no case concord (or at least, no robust case concord). I have wondered: in a language with no case concord but a case suffix on N, how does case end up on N? Sometimes, nothing special needs to be said. In an N-final language like Turkish, case could be a suffix in K and just end up landing on N because they happen to be adjacent. So I went looking in my concord sample for languages that were [-Nfinal, +case, -case concord] to see where the case morpheme ended up.

A number of these languages are coded by Dryer as having postpositional clitics—the case marker ends up on whatever word is last in the NP (or perhaps there are some restrictions, but the idea is that case can attach to a variety of bases). There were also a number of languages reported as having case suffixes, but when I looked more closely, I found that many of these languages might actually have postpositional clitics instead. I only say “many” because I don’t have clear data for some of them, but importantly: I don’t have any examples showing a bound case formative on a non-final N in a language without case concord. (!!) What! Here are a couple of examples to show what I mean:

Yuchi (ISO yuc; isolate, North America)

Yuchi is coded as having case suffixes, but in Mary Linn’s grammar of the language, I found a couple examples where the case morpheme was not on the noun, but on its modifier. This would be evidence for labeling Yuchi case as a postpositional clitic. (Dryer’s data come from a different source for Yuchi, so I can’t say for sure what the discrepancy is.)

From Linn’s grammar of Yuchi, the locative case marker ‘-le’ meaning “back to” attaches to a numeral, not the noun.

Fur (ISO fvr; Fur (controversially Nilo-Saharan), CAF/Chad/Sudan)

Fur is also characterized as having case suffixes, but in examples from Tucker & Bryan (1966), the case marker attaches after postnominal adjectives.

From Tucker & Bryan (1966), but alas, I don’t have a page number! Look at the second sentence where “-si” attaches to “futa.” In the third row, we again see a case marker (this time “-ŋ”) attaching to postnominal “futa”.

Again, if these examples are representative (and, of course, assuming it’s reasonable to treat adjectives as different from nouns in Fur), then these are more like postpositional clitics, too.

Hang on, I’ve lost the thread.

Blogs are hard. The point is this: Nobody has (to my knowledge) a worked out demonstration of what needs to be said to maintain that case morphemes are connected to a high head in nominal phrases (“call it K, if you like”). There should be one— or there should be something talking about why that can’t work. I say that because there is such interesting work on gender and number in this domain. Why not case? I guess it could be because case has a more indirect relationship to the noun and thus has less noun-related idiosyncrasy, by and large. Or it could be that in many languages, cases are just tiny and/or dependent adpositions, and there’s not a lot of morphology to adpositions generally.

If I wanted to take the time to re-write this (not really how blog posts work), it might look like this:

  1. Are case formatives realizations of K?
  2. Easiest stuff: case particles
  3. Pretty easy stuff: peripheral case affixes/clitics in N-final languages
  4. Pretty hard stuff: Case concord
  5. Pretty “does it exist” stuff: that puppy-ACC fuzzy is a pattern we don’t expect and, importantly, we don’t see it (very often)

Somebody get into case formatives! End of blog post.

Kinds of hybrid agreement and analyses thereof

How is a linguistics blog post different from a linguistics article? I think one key thing is that they’re short. So I’mma try to keep this short!

Hybrid agreement (in gender)

The example below demonstrates the complexities of what is starting to be standardly called “Hybrid Agreement.”

BCS hybrid agreement: some words are masculine, other words are feminine

What’s particularly of note is that the adjective stare ‘old’ is feminine, but the demonstrative ovi ‘these’ is masculine. Note there is optionality here— the demonstrative could also be feminine. Hybrid agreement has been front and center in the debate around the headedness of nominal phrases. Salzmann (2018) argues on the basis of hybrid agreement that NPs cannot be headed by N, and Bruening (2020) reanalyzes the data in a framework where N is the head. I’m not going to recapitulate the discussion here (because this is a blog post!), but there are some key properties of this pattern in BCS (as well as the non-BCS patterns of hybrid agreement that are sometimes discussed, e.g., by Landau (2016)):

  1. Lexical: only certain lexical items show this hybrid behavior
  2. Construction-general: This hybrid behavior shows up in a variety of syntactic contexts (eg, NP internal, verbs, pronouns)
  3. Optionality/variation: Hybrid agreement occurs “optionally”, which I use here to mean “presence of identifiable hybrid agreement is not required for grammaticality.”

Because of these properties, the debates around hybrid agreement have always involved the question of how much information is encoded in the lexical representation of a noun. From the seminal monograph by Wechsler and Zlatić (2003) to Bruening’s (2020) update of the broad strokes of that approach, capturing hybrid agreement via additional lexical information explains (or some other word if you don’t like “explains” here) the three properties in the following ways.

  1. Lexical: lexical information is known to vary from word to word. If hybrid behavior is lexically-encoded, we expect it to be localized to certain lexical items but not others.
  2. Construction-general: Lexical properties are most compelling when they are not affected by the syntactic contexts in which they appear (that’s why they’re lexical). If hybrid behavior is lexically encoded, we expect that hybrid agreement would be visible in many syntactic constructions.
  3. Optionality/variation: The two parts of hybrid agreement—e.g., masculine and feminine features in the case of this BCS pattern—are not encoded in exactly the same way. We expect to see different behavior (or it’s at least not a surprise to see it) because of how processes access lexical information (e.g., what kinds of encoding they pay attention to). This can result in surface variation or optionality.

Finnish/Estonian hybrid agreement in number

In Finnish and Estonian (and possibly other Finnic languages where the patterns are not well documented), another kind of hybrid agreement pattern occurs.

Estonian hybrid agreement: some words are singular, other words are plural

The catalyst for this hybrid agreement is a numeral (anything other than ‘one’). Material to the right of the numeral is singular in form, and material to the left is plural in form. The numeral itself is also singular in form (yes, numerals in Finnish and Estonian clearly distinguish plural and singular forms; see my LSA paper for some examples and references). There is also a case distinction here, but only sometimes, and I’m not going to talk about it, since this is my blog and I will not be entertaining a lexical treatment of case in this post or ever. But Finnic hybrid number is rather different from the more well-beaten paths of hybrid gender.

  1. Not lexical: nearly every noun that can be counted in Finnish/Estonian exhibits this number split. The exceptions that exist are in fact nouns which exceptionally do not show hybrid agreement—they’re plural on both sides (see my LSA paper on this, for example).
  2. Construction-specific: this is a property of numeral-noun constructions and numeral-noun constructions only (or, if you twist my arm, fine, we could just say it’s in vaguely non-universal quantificational contexts). We do not see this number split in other areas (e.g., not in simple NPs).
  3. Obligatory: As far as I know, this property of Estonian and Finnish is fully obligatory. It is ungrammatical to count plural nouns with singular numerals, and to the best of my knowledge, it is ungrammatical (if not completely, then very nearly so) to use a singular demonstrative in a numeral-noun construction with a numeral ≠ ‘one’. Perhaps a rigorous corpus study would reveal examples in some corner of the data, but my own fieldwork and the normative grammars certainly suggest that the only option is a plural demonstrative.

And just like that, the blog post is over.

Well, this is already verging on too long for a blog post, so let me try to concisely say what the point is. The Finnish/Estonian form of hybrid agreement is not a lexical pattern. (Landau (2016) actually does touch briefly on Finnish in his excellent work on hybrid agreement, but as I discuss in my LSA paper, the analysis is really only sketched. And anyway, Landau’s analysis of these patterns is also not actually lexical.) It’s thus not obvious how the lexicalist analyses of hybrid agreement—which I have not discussed in detail, because this is a blog post—can generalize to the Finnish/Estonian form of hybrid agreement. I have a 3/4 (ha! Hybrid agreement joke) completed squib on the topic—posting this in part to make sure I’m not missing any obvious beeves.

Of course, “well just because you call them the same thing does not mean they’re the same thing.” I’m not saying the Finnish/Estonian pattern and the BCS (etc.) pattern must have the same analysis because both can be called “hybrid agreement.” But I am saying that the Finnish/Estonian pattern must have an analysis. If your analysis of BCS (etc) hybrid agreement is part of a bigger point about the architecture of the grammar, then I contend it is important to consider how Finnish/Estonian fit into that architecture, too, now that you know the pattern exists.

A perspective on data usability from the concord typology project

Cross-linguistic studies involve a lot of data collection. If we want to make the most progress towards understanding something, it makes sense that we should make our work usable for people besides ourselves. We might be able to get some additional help!

When I started the concord typology project as a researcher, I knew I wanted to keep track of my data in such a way that other researchers could easily build on the work I had done. I also wanted to make sure that anybody who wanted to retrace my steps would be able to do so without having to build the study from scratch. After writing the proceedings paper based on the initial results, I uploaded the data (with the paper) to OU/OSU/OCU’s SHAREOK archive. Here’s what’s in there:

  • Main article: the proceedings paper (geared towards academic audiences)
  • Research data (spreadsheet): all the coding and classification I did based on the data that I collected; easy to digest reasonably quickly
  • Research data (read me): an explanation of the contents of the archive
  • Research data (examples + examples appendix): the actual linguistics examples that the spreadsheet is based on; sort of like research notes and thus not as easy to digest

I spent some time cleaning up the data to prepare it for eyes other than my own, but it’s hard to perfect it on the first pass. I called it good enough and then got back to work collecting more data.

This is an attempt to use a slightly blurry and slightly purple clipping from the research data spreadsheet as an artistic way to break up the flow of text. Hashtag data is art 😆

One year later, someone finds and uses the data!

About a year after I published my data in the archive, I got an email from Kyle Mahowald, an assistant professor of linguistics (at UC Santa Barbara) who is interested in computational modeling of cross-linguistic studies (like mine). He had stumbled upon my data and decided to start building a model on top of it that could account for issues of genetic and geographic proximity. This was, of course, very exciting: somebody was building on the work that I started! I spent a good deal of time getting the data ready for the archive, so seeing that somebody found it and used it made all that effort worthwhile.

This brings me to my next point: when you think about making your data usable, consider the user carefully, and make sure the data is as usable as possible. In the version in the SHAREOK archive, I made a choice that negatively affected usability. When Kyle initially wrote the model, it was making a few predictions that were strikingly different from my results. The issue arose from the coding schema I used for the spreadsheets. In brief: there was an overlap in some of the labels I had used, and so the script was treating some distinct labels as though they were the same. The bug was an easy fix, but we only noticed it because Kyle and I started collaborating and discussing the model he developed. It got me thinking about how I could make my data not only available, but (even more) usable.

If you give a very simple program a hammer, it’s gonna start looking for nails (or whatever this smashed thing is).

Iterating towards better usability

After my initial conversations with Kyle, I put some work in to improve the usability of my data. I wanted something that achieved a balance between these three things:

  1. Easy to read (for humans)
  2. Easy to process (for computers)
  3. Easy to update (for me)

For this, I settled on storing the coding for each language as a JSON file (as I mentioned in this post). I find JSON files relatively easy to read—especially if you save them with some formatting for readability—and they can be easily converted to other formats. I wrote a few scripts to convert the existing overlapping labels into a system without overlap. And to add new data, all I have to do is add a JSON file for the language I just documented (which I’ve already written a script for).

I now store and update the data on OSF, a free and open platform for sharing research. This means anybody (even you, dear reader) can download the current state of the study this very moment! If you don’t have experience working with JSON files yourself, don’t worry: I have a script on my github that processes the important data from all the JSON files and saves it as a single CSV. So, even if you haven’t used anything besides Excel/Sheets, you can still look at the data!
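The idea behind that JSON-to-CSV script can be sketched in a few lines (this is a sketch, not the actual script: the directory name and fields below are invented, and a single toy JSON stands in for the real per-language files):

```python
import csv
import json
from pathlib import Path

# Invented layout: one JSON file per language, with a toy entry for illustration
lang_dir = Path("concord_json")
lang_dir.mkdir(exist_ok=True)
(lang_dir / "estonian.json").write_text(
    json.dumps({"name": "Estonian", "family": "Uralic", "case_concord": "yes"})
)

# Pull the important fields from every JSON file and save them as one CSV
records = [json.loads(p.read_text()) for p in sorted(lang_dir.glob("*.json"))]
with open("concord.csv", "w", newline="") as f:
    writer = csv.DictWriter(f, fieldnames=["name", "family", "case_concord"])
    writer.writeheader()
    writer.writerows(records)
```

I used the standard-library csv module here on purpose: the output is a plain CSV that opens straight into Excel/Sheets, no extra installs required.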

Keep your data user-friendly!

To sum up, the big lesson here is to keep your data user-friendly. If you want whoever uses your data—be they a research colleague, a coworker, or a client—to be able to build on the work you’ve put in, think about how they might use the data and try to make that as easy as possible.


A very simple spelling corrector for Estonian

If you’ve spent any time looking at online NLP resources, you’ve probably run into spelling correctors. Writing a simple but reasonably accurate and powerful spelling corrector can be done with very few lines of code. I found this sample program by Peter Norvig (first written in 2006) that does it in about 30 lines. As an exercise, I decided to port it over to Estonian. If you want to do something similar, here’s what you’ll need to do.

First: You need some text!

Norvig’s program begins by processing a text file—specifically, it extracts tokens based on a very simple regular expression.

import re
from collections import Counter

def words(text): return re.findall(r'\w+', text.lower())

WORDS = Counter(words(open('big.txt').read()))

The program builds its dictionary of known “words” by parsing a text file—big.txt—and counting all the “words” it finds, where a “word” for the program is any contiguous string of one or more letters, digits, and underscores (r'\w+'). The idea is that the program can provide spelling corrections if it is exposed to a large number of correct spellings of a variety of words. Norvig ran his original program on just over 1 million words, which resulted in a dictionary of about 30,000 unique words.

To build your own text file, the easiest route is to use existing corpora, if available. For Estonian, there are many freely available corpora. In fact, Sven Laur and colleagues built clear workflows for downloading and processing these corpora in Python (estnltk). I decided to use the Estonian Reference Corpus. I excluded the chatrooms part of the corpus (because it was full of spelling errors), but I still ended up with just north of 3.5 million unique words in a corpus of over 200 million total words.

Measuring string similarity through edit distance

Norvig takes care to explain how the program works both mechanically (i.e., the code) and theoretically (i.e., probability theory). I want to highlight one piece of that: edit distance. Edit distance is a means to measure similarity between two strings based on how many changes (e.g., deletions, additions, transpositions, …) must be made to string1 in order to yield string2.

Four different changes made to ‘paer’ to create known words.

The spelling corrector utilizes edit distance to find suitable corrections in the following way. Given a test string, …

  1. If the string matches a word the program knows, then the string is a correctly spelled word.
  2. If there are no exact matches, generate all strings that are one change away from the test string.
    • If any of them are words the program knows, choose the one with the greatest frequency in the overall corpus.
  3. If there are no exact matches or matches at an edit distance of 1, check all strings that are two changes away from the test string.
    • If any of them are words the program knows, choose the one with the greatest frequency in the overall corpus.
  4. If there are still no matches, return the test string—there is nothing similar in the corpus, so the program can’t figure it out.
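Those four steps condense to just a few lines of Python. Here’s a self-contained sketch: edits1, which generates all strings one edit away, is reproduced from Norvig’s program, and the toy WORDS counts are made up for illustration (the real counter is built from big.txt):

```python
from collections import Counter

# Toy corpus counts standing in for the dictionary built from big.txt
WORDS = Counter({"pier": 50, "pare": 10, "paper": 200})

def known(words):
    "The subset of `words` that appear in the dictionary."
    return set(w for w in words if w in WORDS)

def edits1(word):
    "All edits that are one edit away from `word` (from Norvig's program)."
    letters    = 'abcdefghijklmnopqrstuvwxyz'
    splits     = [(word[:i], word[i:]) for i in range(len(word) + 1)]
    deletes    = [L + R[1:] for L, R in splits if R]
    transposes = [L + R[1] + R[0] + R[2:] for L, R in splits if len(R) > 1]
    replaces   = [L + c + R[1:] for L, R in splits if R for c in letters]
    inserts    = [L + c + R for L, R in splits for c in letters]
    return set(deletes + transposes + replaces + inserts)

def edits2(word):
    "All edits that are two edits away from `word`."
    return (e2 for e1 in edits1(word) for e2 in edits1(e1))

def correction(word):
    "Steps 1-4: exact match, then distance 1, then distance 2, else give up."
    candidates = known([word]) or known(edits1(word)) or known(edits2(word)) or [word]
    return max(candidates, key=lambda w: WORDS[w])
```

With these toy counts, correction('paer') returns 'paper'—the most frequent known word one edit away—while a string with no close matches comes back unchanged.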

The point in the program that generates all the strings that are one change away is given below. This is the next place where you’ll need to edit the code to adapt it for another language!

def edits1(word):
    "All edits that are one edit away from `word`."
    letters    = 'abcdefghijklmnopqrstuvwxyz'
    splits     = [(word[:i], word[i:])    for i in range(len(word) + 1)]
    deletes    = [L + R[1:]               for L, R in splits if R]
    transposes = [L + R[1] + R[0] + R[2:] for L, R in splits if len(R)>1]
    replaces   = [L + c + R[1:]           for L, R in splits if R for c in letters]
    inserts    = [L + c + R               for L, R in splits for c in letters]
    return set(deletes + transposes + replaces + inserts)

Without getting into the technical details of the implementation, the code takes an input string and returns a set containing all strings that differ from the input in only one way: with a deletion, transposition, replacement, or insertion. So, if our input were ‘paer’, edits1 would return a set including (among other things) par, paper, pare, and pier.

The code above will need to be edited for use with many non-English languages. Can you see why? The program relies on a list of letters in order to create the replaces and inserts. Of course, Estonian does not have the same alphabet as English! So for Estonian, you have to change the line that sets the value of letters to match the Estonian alphabet (adding ä, ö, õ, ü, š, ž; subtracting c, q, w, x, y):

    letters    = 'aäbdefghijklmnoöõprsštuüvzž'
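With that one change, edits1 can generate candidates containing Estonian letters. A quick self-contained check (just the edits1 step, not the whole corrector):

```python
def edits1(word):
    "All edits one edit away from `word`, using the Estonian alphabet."
    letters    = 'aäbdefghijklmnoöõprsštuüvzž'
    splits     = [(word[:i], word[i:])    for i in range(len(word) + 1)]
    deletes    = [L + R[1:]               for L, R in splits if R]
    transposes = [L + R[1] + R[0] + R[2:] for L, R in splits if len(R) > 1]
    replaces   = [L + c + R[1:]           for L, R in splits if R for c in letters]
    inserts    = [L + c + R               for L, R in splits for c in letters]
    return set(deletes + transposes + replaces + inserts)

# 'õun' ('apple') is now one replacement away from the misspelling 'oun',
# while candidates containing 'c' are never generated.
print('õun' in edits1('oun'))
```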

Once you make that change, it should be up and running! Before wrapping up this post, I want to discuss one key difference between English and Estonian that can lead to some different results.

A difference between English and Estonian: morphology!

In Norvig’s original implementation for English, a corpus of 1,115,504 words yielded 32,192 unique words. I chopped my corpus down to the same length, and I found a much larger number of unique words: 170,420! What’s going on here? Does Estonian just have a much richer vocabulary than English? I’d say that’s unlikely; rather, this has to do with what the program treats as a word. As far as the program is concerned, be, am, is, are, were, was, being, been are all different words, because they’re different sequences of characters. When the program counts unique words, it will count each form of be as a unique word. There is a long-standing joke in linguistics that we can’t define what a word is, but many speakers have the intuition that is and am are not “different words”: they’re different forms of the same word.

The problem is compounded in Estonian, which has very rich morphology. The verb be in English has 8 different forms, which is high for English. Most verbs in English have just 4 or 5. In Estonian, most verbs have over 30 forms. In fact, it’s similar for nouns, which all have 12-14 “unique” forms (times two if they can be pluralized). Because this simple spelling corrector defines word as roughly “a unique string of letters with spaces on either side”, it will treat all forms of olema ‘be’ as different words.

Why might this matter? Well, this program uses probability to recommend the most likely correction for any misspelled words: choose the word (i) with the fewest changes that (ii) is most common in the corpus. Because of how the program defines “word”, the resulting probabilities are not about words on a higher level, they’re about strings, e.g., How frequent is the string ‘is’ in the corpus? As a result, it’s possible that a misspelling of a common word could get beaten by a less common word (if, for example, the misspelling points to a particularly rare form of the common word). This problem could be avoided by calculating probabilities on a version of the corpus that has been stemmed, but the real answer is probably to build a more sophisticated spelling corrector!
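To make the issue concrete, here is a toy illustration with invented numbers (the Estonian words are real, the counts are made up): the lemma olema ‘be’ dwarfs a word like ohtlik ‘dangerous’ overall, but any single inflected form of olema can still be rarer than ohtlik.

```python
from collections import Counter

# Hypothetical string-level frequencies: the frequency of olema 'be'
# is split across its many inflected forms.
form_counts = Counter({
    'olen': 40, 'oled': 35, 'on': 300, 'oleme': 20, 'olete': 15, 'ollakse': 5,
    'ohtlik': 120,   # a less common lemma with a single counted form
})

lemma_of = {'olen': 'olema', 'oled': 'olema', 'on': 'olema', 'oleme': 'olema',
            'olete': 'olema', 'ollakse': 'olema', 'ohtlik': 'ohtlik'}

# Re-aggregate the counts at the lemma level.
lemma_counts = Counter()
for form, n in form_counts.items():
    lemma_counts[lemma_of[form]] += n

# At the string level, 'ohtlik' outranks most individual forms of olema...
print(form_counts['ohtlik'] > form_counts['oleme'])
# ...even though the lemma olema is far more frequent overall.
print(lemma_counts['olema'] > lemma_counts['ohtlik'])
```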

Spelling correction: mostly an English problem anyway

Ultimately, designing spelling correction systems based on English might lead them to have an English bias, i.e., to not necessarily work as effectively on other languages. But that’s probably fine, because spelling is primarily an English problem anyway. When something is this easy to put together, you may want to do it just for fun, and you’ll get to practice some things—in this case, building a data set—along the way.

How to build a cross-linguistic database

One of the best things about being a linguist is that language data is all around us. We interact with it in so many ways every day: listening to podcasts, talking to housemates, and even shopping at grocery stores. It’s easy to start paying closer attention to the language we hear and see, but how can we learn from all this data to better understand how language works? In this post, I’m going to focus on one aspect of this: understanding how other languages work by building a cross-linguistic database.

To do this, I’ll talk about a specific example, a certain kind of agreement with nouns (which I call nominal concord, but I’ll just call it agreement with nouns in this post). The English words this/that change their form based on whether the noun they modify is singular or plural: e.g., these bananas vs. this banana. Other languages do similar things. Many people who have studied a Romance language like Spanish remember that articles and adjectives (among other things) have to match the gender of the noun they modify: e.g., la casa blanca ‘the white house’ vs. el edificio blanco ‘the white building’. How do we figure out the properties of this process so that we can use it to understand language better?

A map showing the presence of concord (i.e., agreement with nouns) in the world’s languages. Data from Norris (2019), map created in R with the lingtypology package.

Collecting language examples for the database

Since you’re already swimming in data, you could approach data collection in a very grassroots way. You could travel, taking pictures of language you see in the world, or you could take screenshots of other languages you encounter on blogs or social media. But this can take a long time when you’re looking for data that shows you a specific thing. To develop the database more rigorously, we can turn to more formal sources. We can use grammars, which are book-length reference guides to the properties of languages. Some of these are freely available as books (e.g., Language Science Press, Pacific Linguistics) or PhD dissertations (especially for low-resource languages). Some language data is already compiled and available in online databases (like World Atlas of Language Structures, or Universal Dependencies Corpus) or established corpora (like the Corpus of Contemporary American English, or for something completely different, Estonian corpora available at keelveeb.ee).

There is an important aside here: if we want to understand how language in general works, we have to make sure we’re not looking too closely at one particular language or language family. The languages that most people in North America and Western Europe are familiar with are Indo-European languages. Because these languages are so familiar, there is a reflexive tendency to view their properties as normal or common. To go back to the example of agreement with nouns, we might think that it’s normal for articles and adjectives to agree with nouns, because that’s what they do in Spanish. It’s important to remember that we don’t actually know if that’s true! The only way to know for sure is to build a database that does not have overrepresentation of one language or language family.

Selecting a feature/tag set

Now that we’ve talked about sources you can use for your data, the next thing to determine is the set of features or tags that you’ll use to structure the data you collect. Naturally, you will want to focus on features that seem relevant for understanding whatever you’re looking at. If you were looking at agreement with nouns, you could catalog which words agree with nouns and what properties they agree with.

So, after collecting your data and storing it (e.g., in a Google Doc), you would also record what words agree with nouns in that language and what properties were relevant for the agreement. There are many features of a language that are likely irrelevant for what you’re looking at. You might need to do some exploratory analysis first to get the lay of the land before you decide on your feature set.

Managing the data and features

I mentioned that you might choose to store the examples you collect in a Google Doc. What about the features? There are several options for managing databases of this type. A non-exhaustive list of options:

  1. Spreadsheet (e.g., Excel/Sheets): easy to read; allows sorting “by hand”
  2. CSV: a bit harder to read, but can be easily opened by a spreadsheet program or fed into R or Python for more sophisticated computational or statistical analysis.
  3. JSON: easier to read with human eyes (in my opinion), easily read with all sorts of programming languages

Saving the data in a format that can be easily read by something like Python is a good investment. You can write scripts to read in all the data you’ve collected and then tabulate any numbers you find relevant. The approach I use is to save each language as an individual JSON file so that updating the database is simple: all I have to do is add the new JSON file(s) to the proper directory. Then I can run the scripts I’ve written to see how the database has changed.

Image showing a sample of a JSON file
A piece of a JSON file containing information about agreement with nouns in Finnish.
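A minimal sketch of that read-and-tabulate step, assuming a hypothetical `languages/` directory of per-language JSON files with a made-up `agreeing_words` field (my real schema differs):

```python
import json
from collections import Counter
from pathlib import Path

def tabulate(data_dir='languages'):
    "Count how many languages show agreement on each word class."
    counts = Counter()
    n_languages = 0
    for path in Path(data_dir).glob('*.json'):
        with open(path, encoding='utf-8') as f:
            lang = json.load(f)
        n_languages += 1
        # `agreeing_words` is a hypothetical field listing word classes
        # (e.g., "demonstrative", "adjective") that agree with nouns.
        counts.update(lang.get('agreeing_words', []))
    return n_languages, counts
```

Updating the database is then just a matter of dropping a new JSON file into the directory and rerunning the script.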

I know I went through this part pretty quickly—I’ll share more about it in a subsequent post!

Once we have enough languages in the database, we can start to extract insights about how different languages do or don’t behave similarly, and through that, we start to learn about language in general. When I built a database like the example I have been discussing in this post, I learned that demonstratives and adjectives are the most likely categories to agree with nouns. In fact, if demonstratives agree with nouns in a language, then it is likely that adjectives will, too. This is just one way in which English turns out to be weird: demonstratives (this/that) agree in English, but adjectives don’t—we don’t say heavies books!

Congratulating yourself

That’s that! As with lots of language work, the most time-intensive part is not really analyzing the results, it’s ensuring that the analysis is based on good data. Collecting good data can take a long time, especially when you’re pulling it from a lot of different languages. So, if you feel compelled to make a database like this one, you might as well start now! A database that would be very useful for both academic and applied contexts would be one that catalogs different word orders that can be used based on the discourse context (e.g., topicalization or focusing)—there are related databases on WALS but they’re less pointed. Any linguists reading this will know that that database could be quite difficult to construct!

A simpler task would be to contribute to an existing database. The Universal Dependencies Corpus is a great example. Right now, the most developed samples in the database are Indo-European languages (and a few other major world languages of Asia). As a result, the Universal Dependencies Corpus is unfortunately still biased!

Some language in the wild from Estonia—the translation of Thoreau’s “Why should we live with such hurry and waste of life?”

Conctypo: What, why, and how

In the past year or so since leaving my academic position and relocating to San Francisco, I have spent my “work time” learning and doing different things. Some of that has been developing my technical skills, and some of it has been continuing the research program I developed while in academia. In particular, nominal concord continues to be an obsession of mine, and I still have unanswered questions that I think nobody will find the answers to if not me. The most satisfying result is when I’m able to marry these two pursuits by using my increased technical skills to improve my research effectiveness. Today, I’m going to introduce the research project I’ve spent the most time with, my typological sample of nominal concord, aka Conctypo.

I’ve debated whether to kick off this series from the very beginning, but I decided instead to start where I am right now. I’ll dig into the past in some subsequent posts.

What is nominal concord and what is Conctypo?

If you’ve studied a European language before, you’ve probably encountered the phenomenon that I (and others) call nominal concord. For example, in the Spanish phrases la casa blanca ‘the white house’ and el edificio blanco ‘the white building’, the words for ‘the’ (la/el) and ‘white’ (blanca/blanco) change their form based on the noun that they modify. In this instance, it’s because the noun casa ‘house’ is feminine and the noun edificio ‘building’ is masculine. This is an example of nominal concord.

A schematic representation of nominal concord. There are orange-colored lines connecting the feminine noun casa 'house' to its modifiers. There are green-colored lines connecting the masculine noun edificio 'building' to its modifiers.
A graphical representation of nominal concord for gender in Spanish

More technically, nominal concord is the agreement process in language whereby modifiers of a noun (e.g., adjectives, numerals, or demonstratives) must match the noun they modify in particular features (e.g., gender, number, or case). Nominal concord is a well-known process in linguistics, perhaps due to the fact that it is widespread in Indo-European languages. But it exists well beyond Indo-European: it is found on all six inhabited continents (sorry, Antarctica, but you don’t count as inhabited).

Conctypo is a typological sample of nominal concord in the world’s languages. As of this writing, I (with the help of research assistants while I was at OU) have collected data on 244 languages. The first time I presented about the project was at the LSA meeting in 2019. The paper and entire data set (restricted to 174 languages for better genetic/geographic balance) is available in the SHAREOK archive here: A typological perspective on nominal concord. Since then, I have stopped managing the data with spreadsheets and now store the information in JSON files. All that I have to do to update the database is add the JSON files for the new languages and run the Python scripts I’ve written to pull relevant numbers. But more on that in a later post!

Why build this typological database?

While there are many broad tendencies in language structure—go tool around WALS if you never have—languages also have plenty of idiosyncratic properties. When devising models of language structure, a reasonable approach (to my mind) would be to use common properties as the foundations of the theory. In order to build a theory of nominal concord, we would need an understanding of what the common properties of nominal concord are. That’s where Conctypo comes in— the cross-linguistic sample can tell us what is common in concord systems. In turn, when looking at the concord system of a particular language, we can correctly identify idiosyncratic properties as idiosyncratic (instead of mistaking them for plausibly general properties of concord).

A map of the world with dots scattered throughout. The green dots show languages with nominal concord. The gray dots show languages without nominal concord.
A map (image) showing languages with concord (green dots) and languages without (gray dots). Made in R with the lingtypology package. See this tweet thread for more concord map images.

To put a finer point on this, let me discuss Indo-European briefly. Often, when a researcher brings up nominal concord, they use data from an Indo-European language to highlight its behavior. Concord systems in other languages are compared to Indo-European systems, with the implicit assumption that Indo-European systems are normal or common. Yet without a cross-linguistic understanding of concord systems, we can’t be sure this is true! Concord in Indo-European languages is robust and regular; commonly, gender and number are represented on nearly every word modifying a noun. But concord in the world’s languages could be more sporadic. It could involve, for example, number on some words and gender on other words.

A schematic representation of two kinds of concord systems.
Schematic representations of concord systems. On top, a system like Spanish, where both the article and the adjective must agree with the noun’s feminine and plural features. On the bottom, a hypothetical system where the article only agrees in gender and the adjective only agrees in number.

In this world, Indo-European concord would be perhaps overzealous. The only way to know—well, the only way to feel more assured—is to go to the data.

At the time I started gathering data, I knew of no other typological investigation of nominal concord. Thus, if I wanted to know the answers to these questions, I had to find them myself. After about a year and a half, I found the work of Ranko Matasović and İsa Kerem Bayırlı, who have collected their own concord or concord-related typological samples. We do not all document the same properties, though, so the more, the merrier!

How did we collect the data?

We look for linguistic examples involving three different kinds of words:

  1. Demonstratives: words like this/these or that/those
  2. Cardinal numerals greater than ‘one’: number words like two, three, or seven. We specifically avoid ordinal numerals like second, third, or seventh as these often behave like adjectives. We also avoid one because it shows idiosyncratic behaviors in some languages—the goal was to try as much as possible to look at numerals as a distinct category.
  3. Adjectives: words like green, tall, old, etc. This can get tricky as some languages lack a clearly defined adjective class.

To find the examples needed, we look in published sources, including PhD dissertations. Ideally, this would be a grammar, i.e., a reference guide to the linguistic properties of the language. Failing a suitable grammar, we will use other types of writing (ideally published, but for some languages, the only available material may be unpublished). We look through the grammar to find suitable attested examples, where “suitable” means something like If the language had concord, we would be able to see it in this example. I take pictures, take screenshots, or copy the text of the example and save it in a Google Doc for archival purposes (and in case I ever want to check my work).

Linguistic examples showing adjective concord for number in the Pondi language.
In Pondi (Ulmapo; Papua New Guinea), adjectives show concord in number (Barlow, 2020:77)

Once I have finished documenting all three word classes, I can update the database. I wrote a program in Python that asks me the requisite questions and then creates a properly formatted JSON file and saves it in the proper place (more on this program later!).
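A minimal sketch of what such a prompt-and-save script might look like (the prompts and field names here are invented for illustration; the real program asks more questions). The `ask` parameter just makes the prompting function swappable:

```python
import json
from pathlib import Path

def new_language_entry(out_dir='languages', ask=input):
    "Prompt for a language's concord facts and save them as a JSON file."
    entry = {
        'name': ask('Language name: '),
        'family': ask('Language family: '),
        # Comma-separated word classes that show concord,
        # e.g., "demonstrative, numeral, adjective"
        'concord_on': [w.strip()
                       for w in ask('Word classes with concord: ').split(',')
                       if w.strip()],
        'source': ask('Source (grammar, dissertation, ...): '),
    }
    out = Path(out_dir)
    out.mkdir(exist_ok=True)
    # One JSON file per language, named after the language.
    path = out / (entry['name'].lower() + '.json')
    with open(path, 'w', encoding='utf-8') as f:
        json.dump(entry, f, ensure_ascii=False, indent=2)
    return path
```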

And the work continues…

My work on this project continues in the form of data collection, computational streamlining, and pursuing theoretical implications. Until next time!