Language Datasets and You: A Primer

Collocate Clusters

What are language datasets? Examples most teachers will be familiar with are word lists such as the General Service List or the Academic Word List.

There are some publicly available sources for language datasets (for example the Speech and Language Data Repository or the Language Goldmine), yet most won’t be of much immediate use to teachers. However, some teachers, such as Paul Raine, are repackaging such datasets in forms that are usable by fellow professionals.

I would like to make the case that playing with such datasets ourselves can be beneficial.

It is reasonable to assume that to write well it is necessary (but not sufficient) to read well. Similarly, spending time playing with language data can benefit language awareness, or knowledge about language.

For example, I was reading an article titled Towards an n-Grammar of English, which argues for using continuous sequences of words (n-grams) taken from corpora as the basis for an alternative grammar syllabus. It uses a publicly available dataset of 5-grams to make its case. As I was reading the paper, I wanted to see how the authors derived their example language patterns.

My first thought was to download the text file and import it into Excel. One problem: the text file contains more rows than Excel can handle. One option here is to split the file over several sheets in Excel. However, this is cumbersome, so another option is to use what is called an IPython Notebook.

IPython Notebook is an environment that lets you combine computer code, text, images, and graph plots in a single document. It was originally designed as a way to present reproducible work.

Below is a screenshot of an (incomplete) notebook for the article I was reading. Learning the commands is relatively straightforward, depending on what you want to do.

Screenshot from example notebook

The screenshot shows that the first command imports a module called pandas, which will be used to query the data. The next command imports the data file, which is tab-separated. For those interested in exploring IPython notebooks, there are many resources available on the net. Usually when I want to look up a command I include the word “pandas” in the search.
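Those two commands can be sketched as follows. Note that the column names (ngram, pattern, freq) and the sample rows are my own assumptions for illustration; with the real 5-gram file you would pass its path to pd.read_csv instead of the in-memory buffer.

```python
import io
import pandas as pd

# A tiny stand-in for the real tab-separated 5-gram file.
# Columns and values here are invented, not the paper's own data.
sample = (
    "ngram\tpattern\tfreq\n"
    "the end of the day\tat nn1 io at nn1\t5122\n"
    "i do n't want to\tppis1 vd0 xx vvi to\t12659\n"
)

# pd.read_csv handles tab-separated files via sep="\t".
df = pd.read_csv(io.StringIO(sample), sep="\t")

# Peek at the data to check the import worked.
print(df.shape)  # (2, 3): two rows, three columns
```

Unlike Excel, pandas has no hard row limit beyond available memory, which is why the notebook approach sidesteps the file-splitting problem.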

As an example of how making an IPython notebook helped me understand the article, consider my initial confusion over why “I don’t want to” was not in the top 100 n-grams, despite having 12,659 instances. Using the notebook I saw that the grammar pattern which instantiates it, [ppis1 vd0 xx vvi to], has only 51 types (or rows in the dataset), whereas the number-one-ranked pattern, [at nn1 io at nn1], has 7,272 rows.
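The type-versus-token distinction behind that confusion can be computed with a pandas groupby. Again, the sample rows and column names below are invented for illustration, not taken from the actual dataset.

```python
import io
import pandas as pd

# Illustrative data only: two n-grams share one pattern, one has its own.
sample = (
    "ngram\tpattern\tfreq\n"
    "the end of the day\tat nn1 io at nn1\t5122\n"
    "the side of the road\tat nn1 io at nn1\t2710\n"
    "i do n't want to\tppis1 vd0 xx vvi to\t12659\n"
)
df = pd.read_csv(io.StringIO(sample), sep="\t")

# Types per pattern: how many distinct n-grams (rows) instantiate it.
types = df.groupby("pattern").size()

# Tokens per pattern: total corpus instances summed across those rows.
tokens = df.groupby("pattern")["freq"].sum()

print(types["at nn1 io at nn1"])      # 2 types in this toy sample
print(tokens["ppis1 vd0 xx vvi to"])  # 12659 tokens
```

A pattern ranked by types can thus sit far above a pattern ranked by tokens, which is exactly the discrepancy the notebook revealed.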

ppis1 – 1st person sing. subjective personal pronoun (I); vd0 – do, base form (finite); xx – not, n’t; vvi – infinitive (e.g. to give…, It will work…); at – article (e.g. the, no); nn1 – singular common noun (e.g. book, girl); io – of (as preposition)
(from the CLAWS7 tagset)

Note. Links to information on how to set up a python notebook and to the n-gram grammar paper are included in the example notebook.

Datasets can also come from research papers. I have used a word list of the top 150 phrasal verbs and their most common meanings to create a phrasal verb dictionary. This is a step beyond simply querying a dataset (as can be done using an IPython Notebook or Excel) and may not be for everyone. However, I imagine many teachers have used paper-based word lists when designing lessons, so such datasets, and ways of manipulating them, will not be completely unfamiliar.

Luckily, as mentioned before, people like Paul Raine are using publicly available datasets in ways that are easy for teachers to use. On his apps4efl site he has a paired-sentence app that uses the Tatoeba Corpus of sentence pairs (which internet users have translated), a wiki cloze app (which uses Wikipedia data), video activities (using YouTube), and so on (see list below).

The best-known type of language dataset is the corpus. Interfaces to such data, such as the BYU interfaces to COCA (Corpus of Contemporary American English) or the BNC (British National Corpus), are the most popular. I won’t go into detail about exploiting such data; those interested can read more about such datasets on my blog, or over at the G+ Corpus Linguistics community. Suffice it to say that this kind of data is becoming better supported on the net.

Hopefully this short primer on the value of language datasets may encourage you to start to explore them; or, if you are already, why not drop a comment? Readers may also know of publicly available language datasets that they would like to share. If so, please share!

List of datasets:

Speech and Language Data Repository

Language Goldmine

COCA n-grams

Thanks to Paul Raine for the following that he uses for apps4efl:

Wikis (Creative Commons license)
Native English Wikipedia (via API) en.wikipedia.org
Simple English Wikipedia (via API) simple.wikipedia.org
Native English WikiNews (via API) en.wikinews.org

TED (via download) www.ted.com (Creative Commons non-derivative)

VOA Learn English (via download) learningenglish.voanews.com (Public Domain, copyright info here: http://learningenglish.voanews.com/info/about_us/1374.html)

Example sentences
Tatoeba corpus (via download: http://tatoeba.org/eng/downloads) www.tatoeba.org (Creative Commons license)

The Open Multilingual Wordnet (via download: http://compling.hss.ntu.edu.sg/omw/) (Creative Commons license)
CMU Pronouncing Dictionary (via download: http://www.speech.cs.cmu.edu/cgi-bin/cmudict) (BSD license)

The New General Service List, New Academic Word List (via download: http://www.newgeneralservicelist.org/new-ngsl-japanese-defs/) (Creative Commons license)



Death to “Death by Powerpoint”

As the semester gets fully into gear, the first hints of final projects and presentations start to take form. It is a good point to nip in the bud the “death by powerpoint” syndrome that turns up near the end of the semester, when groups of students present on class-related topics. We are all familiar with death by powerpoint: those boring, text-laden, bullet-pointed sets of slides that get heads nodding and eyelids drooping.

The best antidote is not simply to use another tool (there are dozens out there; I teach ones like Prezi, PowToon, and Google Drive Slides). The better solution is to think about presentations in a new way, one that works with your brain rather than against it. Learning is best when emotion is linked to the ideas, and a well-chosen image is great at eliciting emotion. Video is even better. That is why your slides should have few words and be simple enough to allow people to concentrate on you, the presenter. All those details? Put them in a handout for after the presentation.

The best example of this approach to presenting is Garr Reynolds, with his Presentation Zen. A book, a video, and a website give you an idea of what is involved, but a recent TEDx presentation offers a 17-minute look at the psychology of presentations. Well worth the time. His references to Japanese culture serve as helpful metaphors.

He also has a 4-page text outline (pdf) of the steps and ideas you should try to convey to your students while they work on their presentations.



Digital Sojourn: The final frontier of Language Learning?

Spock as a child in school on Vulcan (Star Trek, 2009)


O.K., so this is not the image of Spock everyone remembers. And with the recent passing of one of the world’s most iconic sci-fi legends, perhaps I should have tried to fit Leonard Nimoy’s classic “Live long and prosper” image in here somewhere (perhaps I will still find a way). But it just so happens that I was looking at this image of the Vulcan educational system from J. J. Abrams’s version of Star Trek last week and wondering what our earthly final frontier of language learning might actually be. I believe I have come up with an answer: digital sojourn.

One of the big names in intercultural studies is Michael Byram. He described the difference between someone who is just beginning their study of intercultural communication and someone who has attained intercultural communicative competence. The former he describes as a tourist, the latter as a sojourner. The tourist is just passing through, perhaps looking for an interesting intercultural souvenir to take home; they look forward to returning to their familiar surroundings with everything just the way they left it. The sojourner, on the other hand, is fundamentally changed by their travels, effects change in others they meet along the way, and returns home to effect change upon their own culture (Byram, 1997).

With the digital age upon us, it seems we now have the opportunity to update our pedagogy with regard to helping students develop their intercultural competence. With intercultural communication tied so closely to language learning, and intercultural communicative competence parallel to competence in the use of English as a lingua franca, we have an obligation as teachers to pursue its development in our students.

Spock is famous for stating, “Computers make excellent and efficient servants, but I have no wish to serve under them.” As language teachers, we are often reminded of this when arguments arise surrounding technology in the classroom. At times the technological landscape before us may seem to demand that we serve under it. I think it is safe to say that we would rather find ways to make it serve us.

I see the final frontier of language learning as that of digital sojourn. By digital sojourn I mean the use of technology to support language learners’ efforts to spend extended amounts of time “traveling” within a particular culture and among its people. It is impractical for all of our students to spend years physically traveling in a foreign culture; however, with technology they can digitally sojourn.

There are many ways we as teachers can support digital sojourn inside and outside of our classroom, and with mobile technology becoming more and more prevalent the possibilities continue to grow. Before we are ready to travel the galaxy seeking new intercultural encounters, we should use those resources currently at our disposal to develop our intercultural skills here on earth. After all, we still have 48 years to go until “First Contact”.  \V/

Image by Dave Daring @ DeviantArt