Language Datasets and You: A Primer

Collocate Clusters

What are language datasets? Examples that most teachers will be familiar with are word lists such as the General Service List or the Academic Word List.

There are some publicly available sources for language datasets (for example the Speech and Language Data Repository or the Language Goldmine), yet most won’t be of much immediate use to teachers. Fortunately, some teachers, like Paul Raine, are putting such datasets into a form that is usable by fellow professionals.

I would like to make the case that playing with such datasets ourselves can be beneficial.

It is reasonable to assume that to write well it is necessary (but not sufficient) to read well. Similarly, spending time playing with language data can benefit language awareness or knowledge about language.

For example, I was reading an article titled Towards an n-Grammar of English, which argues for using continuous sequences of words (n-grams) taken from corpora as a basis for an alternative grammar syllabus. It uses a publicly available language dataset of 5-grams to make its case. As I was reading the paper I wanted to see how the authors derived their example language patterns.

My first thought was to download the text file and import it into Excel. One problem: the text file contains more rows than Excel can take. One option here is to split the file over several sheets in Excel. However, this is cumbersome, so another option is to use what is called an IPython Notebook.

IPython Notebook is an environment that allows you to combine computer code, text, images and graph plots in one document. It was originally designed as a way to show reproducible work.

Below is a screenshot of an (incomplete) notebook for the article I was reading. Learning commands is relatively straightforward depending on what you want to do.

Screenshot from example notebook

The screenshot shows that the first command imports a module called pandas, which will be used to query the data. The next command imports the data file, which is tab-separated. For those interested in exploring Python notebooks, there are many resources available on the net. Usually, when I want to look up a command, I include the word “pandas” in a search.
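For readers who cannot see the screenshot, here is a minimal sketch of those two steps. It is not a copy of my notebook: the column names and the three sample rows are made up for illustration (only the figure 12659 comes from the dataset discussed here), and the assumed layout is one frequency column, five word columns and five tag columns with no header row.

```python
import csv
import io

import pandas as pd

# Made-up stand-in for the downloaded file: tab-separated, no header row.
# The layout (frequency, five words, five tags) is an assumption.
SAMPLE = (
    "12659\ti\tdo\tn't\twant\tto\tppis1\tvd0\txx\tvvi\tto\n"
    "300\tthe\tend\tof\tthe\tday\tat\tnn1\tio\tat\tnn1\n"
    "200\tthe\trest\tof\tthe\tworld\tat\tnn1\tio\tat\tnn1\n"
)

# Hypothetical column names, chosen for this sketch only.
COLUMNS = ["freq", "w1", "w2", "w3", "w4", "w5", "t1", "t2", "t3", "t4", "t5"]


def load_ngrams(source):
    """Read a tab-separated n-gram file (a path or file-like object)."""
    return pd.read_csv(
        source,
        sep="\t",                # the file is tab-separated
        header=None,             # no header row assumed
        names=COLUMNS,
        quoting=csv.QUOTE_NONE,  # treat quote marks as ordinary characters
        keep_default_na=False,   # keep words such as "null" or "nan" as text
    )


ngrams = load_ngrams(io.StringIO(SAMPLE))
print(len(ngrams))    # number of rows loaded
print(ngrams.head())  # first few rows
```

With a real download you would pass the file path to `load_ngrams` instead of the in-memory sample. Unlike Excel, pandas has no fixed row limit; the size of the file it can hold depends on your computer's memory.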

One example of how making an IPython notebook helped me understand the article concerns my initial confusion about why “I don’t want to” was not in the top 100 n-grams, even though “I don’t want to” has 12659 instances. Using the notebook I saw that the grammar pattern which this n-gram instantiates, [ppis1 vd0 xx vvi to], has only 51 types (or rows in the dataset), whereas the number one ranked pattern, [at nn1 io at nn1], has 7272 rows.

Tag key (from the CLAWS7 tagset): ppis1 – 1st person sing. subjective personal pronoun (I); vd0 – do, base form (finite); xx – not, n’t; vvi – infinitive (e.g. to give… It will work…); at – article (e.g. the, no); nn1 – singular common noun (e.g. book, girl); io – of (as preposition).
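The comparison of types and instances above can be sketched in pandas roughly as follows. This is an illustration under assumptions, not the paper's method: the layout is simplified so that the five tags are already joined into one hypothetical `pattern` column, and every row and frequency other than 12659 is invented.

```python
import pandas as pd

# Tiny made-up sample; every frequency except 12659 is invented.
rows = [
    (12659, "i do n't want to", "ppis1 vd0 xx vvi to"),
    (950, "i do n't like to", "ppis1 vd0 xx vvi to"),
    (300, "the end of the day", "at nn1 io at nn1"),
    (200, "the rest of the world", "at nn1 io at nn1"),
    (100, "the top of the hill", "at nn1 io at nn1"),
]
ngrams = pd.DataFrame(rows, columns=["freq", "ngram", "pattern"])


def summarise_patterns(df):
    """Per tag pattern: number of rows (types) and summed frequency (instances)."""
    summary = df.groupby("pattern").agg(
        types=("freq", "count"),   # how many different n-grams share the pattern
        instances=("freq", "sum"), # how often those n-grams occur in total
    )
    # Rank by number of types, most varied pattern first.
    return summary.sort_values("types", ascending=False)


summary = summarise_patterns(ngrams)
print(summary)
```

In this toy sample the article-noun-of-article-noun pattern comes first when ranked by types, even though the “I don’t …” pattern has far more instances. That is the same kind of gap that confused me at first: a single very frequent n-gram does not make its pattern a highly ranked one if few other n-grams share it.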

Note: links to information on how to set up a Python notebook and to the n-gram grammar paper are included in the example notebook.

Datasets can also come from research papers. I have used a word list of the top 150 phrasal verbs and their most common meanings to create a phrasal verb dictionary. This is a step beyond simply querying a dataset (as can be done using an IPython Notebook or Excel) and may not be for everyone. However, I imagine many teachers have used paper-based word lists when designing lessons, so such datasets and ways of manipulating them will not be completely unfamiliar.

Luckily, as mentioned before, people like Paul Raine are using publicly available datasets in ways that are easy for teachers to use. On his apps4efl site he has a paired sentence app that uses the Tatoeba Corpus of sentence pairs (which internet users have translated), a wiki cloze app (that uses Wikipedia data), video activities (using YouTube) and so on (see list below).

The best-known type of dataset is the corpus. Interfaces to such data, such as the BYU interfaces to COCA (Corpus of Contemporary American English) or the BNC (British National Corpus), are the most popular. I won’t go into detail about exploiting such data; those interested can read more about such datasets on my blog, or over at the G+ Corpus Linguistics community. Suffice it to say that this kind of data is becoming better supported on the net.

Hopefully this short primer on the value of language datasets will encourage you to start exploring them; or, if you already do, why not drop a comment? Readers may also know of publicly available language datasets that they would like to share. If so, please share!

List of datasets:

Speech and Language Data Repository

Language Goldmine

COCA n-grams

Thanks to Paul Raine for the following that he uses for apps4efl:

Wikis (Creative Commons license)
Native English Wikipedia (via API) en.wikipedia.org
Simple English Wikipedia (via API) simple.wikipedia.org
Native English WikiNews (via API) en.wikinews.org

TED (via download) www.ted.com (Creative Commons non-derivative)

VOA Learn English (via download) learningenglish.voanews.com (Public Domain, copyright info here: http://learningenglish.voanews.com/info/about_us/1374.html)

Example sentences
Tatoeba corpus (via download: http://tatoeba.org/eng/downloads) www.tatoeba.org (Creative Commons license)

The Open Multilingual Wordnet (via download: http://compling.hss.ntu.edu.sg/omw/) (Creative Commons license)
CMU Pronouncing Dictionary (via download: http://www.speech.cs.cmu.edu/cgi-bin/cmudict) (BSD license)

The New General Service List, New Academic Word List (via download: http://www.newgeneralservicelist.org/new-ngsl-japanese-defs/) (Creative Commons license)



Where’s the doc? A meditation on the transcendent nature of digital media with Joe Tomei

I recently had the opportunity to get a JALT SIG officer up to speed on some of the things I had been using as a coordinator for the THT-SIG. One question surprised me: while we were working in Google Docs, he asked how he could download the document for his reference. I realized that one of the controlling metaphors that framed his thinking was that the electronic document is an object that is moved from place to place. While this is true on some level, in that what we were working on could probably be pointed to as a series of 1s and 0s in a Google server farm somewhere, I tried to explain that he needed to stop thinking of the documents as objects.

In working with people on various projects, I find that the metaphor ‘the file is an object’ is quite common. Asking how to download a Google Drive file in order to have their own copy is just one example. Four people get into a folder in Google Drive and, if they are not used to it, they feel they have to create their own documents, so you have four copies of the same document, all with different edits. Students will start a Google presentation by uploading a PowerPoint file to share with me (because they are told to share first and then edit), and they continue to work on the PowerPoint file, upload a completed version and then wonder why they can’t see it in the class folder.

from the 1984 Apple desktop

This is certainly understandable. Apple, stealing from Xerox PARC and an idea by Alan Kay, started the metaphor of the desktop, where you could organize and keep your files. You drag files to a trashcan to delete them. You send them by attaching them to an email. You save them on your computer, you put them on USB sticks. In truth, it is hard to get away from the idea that the file is an object. And it has its uses, especially when we think about security and backup.

But to become more comfortable with the digital mobile world, you can’t hold on to that metaphor so tightly. Thinking about ‘where’ a document is leads you, when creating things, to adopt behaviors that actually cause problems. It encourages people not to think about how they name files or about using tags and categories. It discourages collaboration because it makes one think that things are not accessible, or has people gloss over how they make things accessible.

To show how this works: when I create a document I need to work on, I first share it with my other Google accounts so I can access it from any account. (I don’t do this with everything, just the things I am pretty sure I may be working on remotely.) If it is for a class, I place it in a folder that is shared with all the students in the class, so they have access to it and can also add what they have created to share. The same applies if it is for a group, such as a JALT SIG or chapter. I also use this system for student seminar papers. Google Drive allows synchronous chat and simultaneous editing, so it is possible to watch what a student is writing and guide him or her while they are composing.

Doing this with students is a bit of a hassle, because teaching a new group of students these points all over again can be strangely frustrating. I say strangely because the whole idea of teaching in an institutional setting is that you are teaching groups of students who move on, so we should be used to that. But when it comes to technology, the understanding that we may have to teach the same thing again and again doesn’t come as often as it should.

In an article about Neil Postman, co-author of Teaching as a Subversive Activity, a particular favorite of mine, there was this:

“There’s this kind of dialogue around technology where people dump on each other for ‘not getting it,’” Lanier says. “Postman does not seem to be vulnerable to that accusation: He was old-fashioned but he really transcended that. I don’t remember him saying, ‘When I was a kid, things were better.’ He called on fundamental arguments in very broad terms – the broad arc of human history and ethics.”

I suspect that none of the things I have set out in this post is new to anyone who would visit a place called Digital Mobile Language Learning. And I’m sure a lot of us have gradually let go of this metaphor as we use cloud applications. However, if the person you are teaching or working with doesn’t have this understanding, your possibilities for collaboration are restricted and you are tied to the understanding of the person you are sharing with. Using technology is not simply a matter of understanding it individually; it depends on our collective understanding of the tools. So the next time some technological collaboration doesn’t seem to be going as well as it should, consider discussing how your collaborators are thinking of their files. You might be surprised.