Language Datasets and You: A Primer


What are language datasets? Examples that most teachers will be familiar with are word lists such as the General Service List or the Academic Word List.

There are some publicly available sources of language datasets (for example, the Speech and Language Data Repository or the Language Goldmine), yet most won’t be of much immediate use to teachers. Fortunately, some teachers, such as Paul Raine, are putting such datasets into forms that fellow professionals can readily use.

I would like to make the case that playing with such datasets ourselves can be beneficial.

It is reasonable to assume that to write well it is necessary (but not sufficient) to read well. Similarly, spending time playing with language data can benefit language awareness, or knowledge about language.

For example, I was reading an article titled Towards an n-Grammar of English, which argues for using continuous sequences of words (n-grams) taken from corpora as the basis for an alternative grammar syllabus. It uses a publicly available dataset of 5-grams to make its case. As I was reading the paper, I wanted to see how the authors derived their example language patterns.

My first thought was to download the text file and import it into Excel. One problem: the text file contains more rows than Excel can handle. One option is to split the file over several sheets in Excel. However, this is cumbersome, so another option is to use what is called an IPython Notebook.
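
The splitting can itself be scripted if you do want to stay in Excel. Below is a minimal sketch in Python, assuming a plain text file called 5grams.txt (a filename I have made up for illustration); Excel allows at most 1,048,576 rows per sheet, so 1,000,000-line pieces fit comfortably.

```python
from itertools import islice

# Split a large text file into 1,000,000-line pieces that Excel can open.
# "5grams.txt" is a hypothetical filename for illustration.
with open("5grams.txt", encoding="utf-8") as src:
    part = 0
    while True:
        lines = list(islice(src, 1_000_000))
        if not lines:
            break
        with open(f"5grams_part{part}.txt", "w", encoding="utf-8") as out:
            out.writelines(lines)
        part += 1
```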

IPython Notebook is an environment that lets you combine computer code, text, images, and graph plots in a single document. It was originally designed as a way to present reproducible work.

Below is a screenshot of an (incomplete) notebook for the article I was reading. Learning the commands is relatively straightforward, depending on what you want to do.

[Screenshot from the example notebook]

The screenshot shows that the first command imports a module called pandas, which will be used to query the data. The next command imports the data file, which is tab-separated. For those interested in exploring Python notebooks, there are many resources available on the net. Usually, when I want to look up a command, I include the word “pandas” in the search.
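
To give a flavour of those first two commands, here is a minimal sketch; the filename and column names are my assumptions for illustration, so adjust them to match the actual download.

```python
import pandas as pd

# Read the tab-separated 5-gram file into a dataframe.
# The filename and column names here are assumptions, not the real schema.
df = pd.read_csv("5grams.txt", sep="\t", header=None,
                 names=["ngram", "tags", "frequency"])

df.head()  # peek at the first few rows
```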

As an example of how making an IPython notebook helped me understand the article, take my initial confusion over why “I don’t want to” was not in the top 100 n-grams, even though “I don’t want to” has 12,659 instances. Using the notebook I saw that the grammar pattern which instantiates it, [ppis1 vd0 xx vvi to], has only 51 types (or rows in the dataset), whereas the number one ranked pattern, [at nn1 io at nn1], has 7,272 rows.

ppis1 – 1st person singular subjective personal pronoun (I)
vd0 – do, base form (finite)
xx – not, n’t
vvi – infinitive (e.g. to give… It will work…)
at – article (e.g. the, no)
nn1 – singular common noun (e.g. book, girl)
io – of (as preposition)

(from the CLAWS7 tagset)
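
Here is a rough sketch of the kind of query involved, reusing the hypothetical dataframe and column names from the loading sketch above; the two counts in the comments are the ones reported in the text.

```python
# How many distinct 5-grams (rows, i.e. types) instantiate each tag pattern?
# "tags" is the assumed column holding the CLAWS7 tag sequence.
types_per_pattern = df.groupby("tags").size().sort_values(ascending=False)

# The pattern behind "I don't want to" has few types...
print(types_per_pattern.get("ppis1 vd0 xx vvi to"))  # 51
# ...while the top-ranked pattern has thousands.
print(types_per_pattern.get("at nn1 io at nn1"))     # 7272
```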

Note: links to information on how to set up a Python notebook, and to the n-gram grammar paper, are included in the example notebook.

Datasets can also come from research papers. I have used a word list of the top 150 phrasal verbs and their most common meanings to create a phrasal verb dictionary. This is a step beyond simply querying a dataset (as can be done using an IPython Notebook or Excel) and may not be for everyone. However, I imagine many teachers have used paper-based word lists when designing lessons, so such datasets and ways of manipulating them will not be completely unfamiliar.
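
For the curious, the first step of that kind of manipulation might look something like the sketch below; the filename and column layout are hypothetical, not the actual list from the paper.

```python
import csv

# Build a phrasal verb -> meanings lookup from a word list.
# "phrasal_verbs.csv" and its columns (verb, meaning) are hypothetical.
phrasal_verbs = {}
with open("phrasal_verbs.csv", newline="", encoding="utf-8") as f:
    for row in csv.DictReader(f):
        phrasal_verbs.setdefault(row["verb"], []).append(row["meaning"])

print(phrasal_verbs.get("carry out"))
```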

Luckily, as mentioned before, people like Paul Raine are using publicly available datasets in ways that are easy for teachers to use. On his apps4efl site he has a paired-sentence app that uses the Tatoeba corpus of sentence pairs (which internet users have translated), a wiki cloze app (that uses Wikipedia data), video activities (using YouTube) and so on (see the list below).

The best-known type of language dataset is the corpus. Interfaces to such data, such as the BYU interfaces to COCA (the Corpus of Contemporary American English) and the BNC (the British National Corpus), are the most popular. I won’t go into detail about exploiting such data; those interested can read more about such datasets on my blog, or over at the G+ Corpus Linguistics community. Suffice it to say that this kind of data is now better supported on the net.

Hopefully this short primer on the value of language datasets will encourage you to start exploring them; if you already are, why not drop a comment? And if you know of other publicly available language datasets, please share them too!

List of datasets:

Speech and Language Data Repository

Language Goldmine

COCA n-grams

Thanks to Paul Raine for the following list of sources he uses for apps4efl:

Wikis (Creative Commons license)
Native English Wikipedia (via API) en.wikipedia.org
Simple English Wikipedia (via API) simple.wikipedia.org
Native English WikiNews (via API) en.wikinews.org

Videos
TED (via download) www.ted.com (Creative Commons non-derivative)

VOA Learn English (via download) learningenglish.voanews.com (Public Domain, copyright info here: http://learningenglish.voanews.com/info/about_us/1374.html)

Example sentences
Tatoeba corpus (via download: http://tatoeba.org/eng/downloads) www.tatoeba.org (Creative Commons license)

Dictionaries
The Open Multilingual Wordnet (via download: http://compling.hss.ntu.edu.sg/omw/) (Creative Commons license)
CMU Pronouncing Dictionary (via download: http://www.speech.cs.cmu.edu/cgi-bin/cmudict) (BSD license)

Wordlists
The New General Service List, New Academic Word List (via download: http://www.newgeneralservicelist.org/new-ngsl-japanese-defs/) (Creative Commons license)
