This is a post about (or the first post in a series about) the search for my first job in the language + tech industry space. I perhaps could have called this something like What I Did to Get a Job, but one of my opinions about getting your first industry job is that there is no silver bullet, so I can’t say that any of these things were necessary for getting my first job. What I can say is that I did these things while searching and then did manage to get a job (first a contract job and then a full-time job).
To say that again another way: there is no one thing that you must do to be able to get a job, nor is there one thing that will make finding a job easy. Unfortunately! So what you should do is just keep trying to develop your skills.
What I did to learn/practice coding (mostly Python):
- Free intro course (keyword FREE); I did Codecademy’s Python 2 course, because their Python 3 course costs money.
- There are differences between Python 2 and Python 3, but for the basics they are minor and you will pick them up.
- This will help you learn the basic data structures.
- Automate the Boring Stuff with Python is a good book, and all the chapters are available online (just scroll down the linked page):
- Learn to manipulate files (reading/writing CSVs, text files, and JSONs): Chapters 9, 13, 14, 16 (at least!)
- Free mini-courses on Kaggle
- They have great bite-sized courses on a variety of topics, e.g., Intro to Python, Pandas, intro to ML, advanced ML, data cleaning
- When you’re done, you can even add a certificate to your LinkedIn profile, which you should do!
- NLP-focused content
- Introduction to spaCy (great library for NLP)
- NLTK book
- My two cents: don’t worry about the syntactic details; focus on internalizing the steps of an NLP pipeline in a broad sense
- I got overly concerned with knowing this well—I never really got there, and I also haven’t had to know it well for either of my jobs or for my interviews.
- The file management stuff is more important!
The main thing I have done at my jobs with Python is file and data management. For file management, learn to do things like (i) list all the files in a directory, (ii) perform the same operation on every file in a directory (or just every .csv or .txt file…), etc. For data management, you want to be able to manipulate different file types; I would say focus on being able to import and export CSV and JSON files. In my opinion, you should prioritize learning (some of) the pandas library, because it’s powerful and also used by data scientists. I have some aspects of pandas memorized by now, but I still look things up all the time. That being said, I once had a coding interview where the code was written with Python’s built-in csv module, and I couldn’t remember the syntax. I just tried to do my best in the interview, but I did feel afterwards like I should make sure to remember how the csv module works. For language work, the csv module is also probably quite a workable solution: some of my colleagues at both jobs have used csv instead of pandas.
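To make those file-management tasks a bit more concrete, here is a minimal sketch that uses only the standard library (pathlib, csv, and json). It is just one way to do it, not the code from my jobs: the function names, the `source_file` field, and the idea of dumping everything into one JSON file are all made up for illustration.

```python
import csv
import json
from pathlib import Path


def list_files(directory, suffix=".csv"):
    """Return the files in `directory` with the given extension, sorted by name."""
    return sorted(
        p for p in Path(directory).iterdir()
        if p.is_file() and p.suffix == suffix
    )


def read_csv_rows(path):
    """Read a CSV file into a list of dicts, one per row, keyed by the header."""
    with open(path, newline="", encoding="utf-8") as f:
        return list(csv.DictReader(f))


def csv_folder_to_json(directory, out_path):
    """Do the same thing to every .csv in a folder: collect its rows into one JSON file."""
    all_rows = []
    for path in list_files(directory, ".csv"):
        for row in read_csv_rows(path):
            # Hypothetical extra field so each row remembers which file it came from.
            row["source_file"] = path.name
            all_rows.append(row)
    with open(out_path, "w", encoding="utf-8") as f:
        json.dump(all_rows, f, ensure_ascii=False, indent=2)
    return len(all_rows)
```

Calling something like `csv_folder_to_json("my_data", "combined.json")` on a folder of your own CSVs (both names are placeholders) would write one JSON file and return the number of rows it collected. The same loop works for .txt files if you swap the suffix and the reading function.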
The key thing here is that you need to focus on being conversant in the concepts below (machine learning and UX design) rather than necessarily being able to implement them. Maybe you can get to that, too, but step one is understanding what the pieces are. The reason to watch and read these things is not so you can necessarily do this work (speaking for myself, I do not build ML models at work), but so you know enough about them to know roughly how the enterprise works. To be clear, there are some “linguist” jobs where you build models, so if you’re interested in it, it’s definitely worthwhile. However, you can secure a job without building a model. Again, these are the things that I read or watched while I was searching, not necessarily “must watch” or “must read” sources!
- StatQuest videos on YouTube (start here: A Gentle Introduction to Machine Learning)
- The first 3 weeks of Stanford’s ML course on Coursera
- I stopped after three because I wasn’t enjoying it (sometimes that’s reason enough to stop doing something!), but in retrospect, I encountered a lot of new concepts in those three weeks and saw some examples that have stuck with me.
- Videos in the Michigan UX course on Coursera:
- enroll for free, watch all the videos in a week, unenroll so you don’t pay
- apparently you may also be able to “audit” the course so you don’t have to complete it in a week.
- I watched this to get familiar with the general design process.
We did not get to see the models in my job at Amazon. I still think it was helpful for me to know a little bit about how these programs work so I could talk/think about how the data would be used.
“Do a project”
A lot of people told me some version of “do a project.” If your feeling when you hear that is “Sounds good, but what?” I don’t blame you! As someone offering advice, it is easy to think that project ideas abound, because once you start working, so many more ideas come to you. However, when I was searching, I just didn’t know what projects I could do. Most identifiable NLP projects (i.e., papers) are multi-authored, so I felt that I couldn’t make headway there (let alone the fact that I was still learning!). I definitely didn’t think I was doing any projects that were big enough to count as projects (whatever that means), and at times, I felt like I couldn’t even if I wanted to.
Surprise! I actually did some projects
Now that I’ve been working about 8 months, I can see that I did “do a project” a few times. Here is a reasonably complete list.
- Poketext: scraped Pokédex entries from Bulbapedia pages and saved them in one giant text file
- When I started this, I had in mind that I was building a corpus. I don’t know what the corpus would be used for— probably a good idea to have an answer to that.
- This gave me some experience web scraping with the Python module called BeautifulSoup
- Poketext (part deux): used spaCy to try to fix issues that I saw in some of the sentences
- Some of the sentences in the Pokédex would refer to the Pokémon by name, but some would just use a pronoun or an NP like “this Pokémon”.
- I created an algorithm to replace the pronouns/NPs with the name of the Pokémon, and I encountered some interesting issues in the process.
- Concord: I converted my typological database from a somewhat unwieldy spreadsheet to a format that was more computationally friendly and easily updated.
- Each individual language in the data set is one JSON file
- I wrote scripts to import all the JSON files from the directory and output descriptive stats about the data set
- I wrote a program that creates a data file for any language that is new to the data set (after I’ve documented it myself). It asks me a few questions about the language and then creates the JSON file.
- Because I did this, Kyle Mahowald found my data and we started a research collaboration.
- FWIW: this is the project that ended up being the most useful (imo) for getting my first job.
- Estonian spell checker: Messed with some Estonian data to adapt Peter Norvig’s very simple English spell checker for Estonian.
- I had to find a suitable training corpus for the spell checker— that’s how I found estnltk.
- I used the process as material for a blog post.
- My blog!: Though I do sometimes write about esoteric issues connected to my linguistic research, I also tried to write some posts addressing language + data in a friendly, accessible way.
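To give a flavor of the Poketext (part deux) idea, here is a deliberately naive sketch of replacing “this Pokémon” or a sentence-initial “It” with the Pokémon’s name. This is not my actual algorithm (which used spaCy); it swaps in plain regular expressions, and the patterns and the function name are assumptions chosen to keep the example short.

```python
import re

# Phrases that stand in for the Pokémon's name. This pattern is an assumption;
# real Pokédex text contains other variants it will miss.
GENERIC_NP = re.compile(r"\b[Tt]his Pokémon\b")

# A capitalized "It" at the start of the entry or right after sentence-final
# punctuation. Requiring whitespace after it leaves "Its" and "It's" alone.
INITIAL_IT = re.compile(r"(^|(?<=[.!?]\s))It(?=\s)")


def name_the_subject(entry, name):
    """Replace generic references to the Pokémon with its name (naive version)."""
    # A function is used as the replacement so the name is inserted literally.
    entry = GENERIC_NP.sub(lambda match: name, entry)
    return INITIAL_IT.sub(lambda match: name, entry)
```

With an invented entry like “This Pokémon stores electricity. It releases the energy when angry.” and the name “Pikachu”, both sentences end up starting with “Pikachu”. The interesting issues show up exactly where this sketch gives up: a lowercase “it” in the middle of a sentence, or an “It” that refers to something other than the Pokémon.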
My recommendation is that you put these projects on GitHub and/or on a website, no matter how small you think they are. If recruiters or hiring managers get curious enough about you to look for your web presence, you want to give them something to chew on. Projects that are representative of your current skills make the most sense. If they’re not flashy enough for a particular job, then trust me: you don’t want that job (right now)!
Additional project ideas
Feel free to riff on any of the project ideas stated above. If none of those sound fun to you, here are some ideas to inspire any do-a-projects you might pursue.
- Build a test set for an imaginary classifier: Test sets are smaller than training sets, so it won’t take you as long to annotate them.
- Columns (one idea): label/annotation, URL, first paragraph, first paragraph tokenized or split on spaces (e.g., using SPLIT() in Sheets)
- Classifier ideas:
- X or not X: news articles that are or are not about a certain thing (e.g., about accidents or not, about the stock market or not, or even subjective categories)
- native vs non-native (or: native vs. translated): sentences/paragraphs that are or are not from native speakers.
- Build a corpus with web data (possibly scraping with BeautifulSoup if you can’t find the data more easily): collect examples with the idea that a person could design an annotation task around the data you’ve collected.
- There are lots of “corpus of movie reviews” tutorials online. These won’t “make you stand out”, but you will learn a lot from doing them.
- I also saw something that involved pulling all the dialogue from a TV show and using that as data
- Could use this corpus to feed into a project like the one above.
- Playing with old research data: If you have any research data that can be put into spreadsheet format, try to do something with it in Python.
- Maybe that’s data visualization,
- maybe that’s creating randomized samples of the data,
- maybe that’s writing a script that would let you add to the data (e.g., what fields does the script need to ask for),
- maybe you want to create a database of linguistic examples that you have gathered in your fieldwork/research and tag it for relevant information (e.g., these are my relative clause examples, these are my wh-question examples, …)
- If you don’t have any spreadsheet data from your research, you could download some samples from wals.info and play with those. E.g., can you figure out how to download 3 samples from WALS and combine them all into one file, so that if a language appears multiple times, you consolidate all of its values into one row? This stuff isn’t super hard to learn!
- just do something to give yourself something to work on and slowly figure out
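For the WALS idea in particular, here is a rough sketch of combining several downloaded samples into one file with one row per language, using only the csv module. It assumes each sample is a CSV with a `language` column plus one or more feature columns; that layout is a simplification I made up for the example, so check the column names in whatever you actually download.

```python
import csv
from pathlib import Path


def merge_samples(paths, key="language"):
    """Combine several CSV samples into one dict per language."""
    merged = {}  # language name -> all the values seen for that language
    for path in paths:
        with open(path, newline="", encoding="utf-8") as f:
            for row in csv.DictReader(f):
                # Start a new entry the first time a language shows up,
                # then add this file's columns to it.
                merged.setdefault(row[key], {key: row[key]}).update(row)
    return list(merged.values())


def write_rows(rows, out_path, key="language"):
    """Write the merged rows to one CSV, leaving missing values blank."""
    # Collect every column that appears in any row, with the key column first.
    columns = [key] + sorted({c for row in rows for c in row} - {key})
    with open(out_path, "w", newline="", encoding="utf-8") as f:
        writer = csv.DictWriter(f, fieldnames=columns, restval="")
        writer.writeheader()
        writer.writerows(rows)
    return columns
```

A language that only appears in one of the samples still gets a row; its other columns are simply left empty. If two samples share a feature column, the later file’s value wins, which is a choice you might want to change. pandas can do the same job with merges, but writing it by hand once is good practice.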
The important thing to remember: there’s no secret beyond trying
Unfortunately, if you don’t already have computational or statistical skills, it can be hard to show a recruiter that you can contribute. There is no secret to this task— you just have to maintain your ability to try new things and keep waiting for a little bit of luck. I don’t mean this in a hokey All You Need to Do Is Try kind of way. I just mean it’s probably not helpful to worry about whether you have done The One Thing You’re Supposed to Do. Such a thing probably doesn’t exist. When I wasn’t worried about that possibility, I kept working on projects as long as I found them fun/interesting/whatever, and when I stopped feeling excited about them, I moved on to something else. I never did finish that ML course even though I told myself I would. It’s okay— just keep trying to practice or learn.