Corpora

The Linguistics Lab maintains a collection of natural language corpora. Some highlights are listed below. Many of these have been released through the Linguistic Data Consortium (LDC). UGA's membership in the LDC is funded annually by contributions from cooperating academic units.  Faculty and students in these cooperating units enjoy unrestricted access to the entire collection.

Cooperating Academic Units 2022-2023
Department of Linguistics
Institute for Artificial Intelligence
Department of Language and Literacy Education
Department of Romance Languages
Latin American and Caribbean Studies Institute
The Willson Center for Humanities & Arts
The Linguistic Atlas Project
University of Georgia Libraries

For information about becoming a cooperating academic unit, please contact John Hale via <linglab@uga.edu>

The Brown Corpus

This was the first million-word electronic corpus of English, created in 1961 at Brown University. It spans fifteen categories of text.
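
For quick exploration, the Brown Corpus is also bundled with the NLTK Python library; below is a minimal sketch, assuming the nltk package and its brown data files are installed.

    # Minimal sketch: browsing the Brown Corpus via NLTK.
    # Assumes `pip install nltk`; the download call fetches the data if needed.
    import nltk
    nltk.download("brown")
    from nltk.corpus import brown

    print(brown.categories())                     # the fifteen text categories, e.g. 'news', 'fiction'
    print(len(brown.words()))                     # just over a million word tokens
    print(brown.words(categories="news")[:10])    # first few tokens of the news category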

The Penn Treebank

Manually corrected phrase structure trees for English, including 1.2 million words of newspaper text from the Wall Street Journal.
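
NLTK likewise ships a small sample of the Wall Street Journal material, which is handy for seeing the annotation style before working with the full LDC release; a minimal sketch, assuming nltk and its treebank sample are installed:

    # Minimal sketch: reading Penn Treebank-style parses from NLTK's bundled
    # sample (a small fraction of the full LDC release).
    import nltk
    nltk.download("treebank")
    from nltk.corpus import treebank

    tree = treebank.parsed_sents()[0]     # first parsed sentence in the sample
    tree.pretty_print()                   # draw the phrase structure tree as text
    print(treebank.tagged_words()[:5])    # (word, part-of-speech) pairs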

COCA: Corpus Of Contemporary American English

531 million tokens of American English sampled from 1990--2017 across categories such as Academic, Fiction, Magazine, Newspaper, and Spoken. This version is annotated with lemmas and parts of speech. Provided courtesy of the UGA Library.

COHA: Corpus Of Historical American English 

400+ million words from the period 1810--2008 facilitate diachronic investigation. Provided courtesy of the UGA Library.

Early English Books

Phases 1 and 2. Provided courtesy of William Kretzschmar and the Text Creation Partnership.

Royal Society Corpus 

78 million words from the Philosophical Transactions of the Royal Society of London, 1665--1920, with lemmas, parts of speech and spelling normalization. For more information see Fischer et al., LREC 2020.

Cronfa Electroneg o Gymraeg - Electronic Corpus of Welsh

1 million words of written Welsh tagged with parts of speech, lemmas and a classification of morphophonemic mutation types. Courtesy of Jonathan Jones.

British National Corpus

100 million words of text from a variety of British sources, annotated with parts of speech and lemmas as well as sociolinguistic variables such as speaker age, social class and geographical region. 91% of the material was published between 1985 and 1993.

AudioBNC

Audio and all available transcriptions of the 7.5 million words of the spoken portion of the British National Corpus, including TextGrids aligned at the word and phone level, with associated speaker metadata. For more details see Coleman et al. 2012.
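
The word- and phone-level TextGrids are standard Praat files; here is a minimal Python sketch of reading one, assuming the third-party textgrid package and a placeholder file name:

    # Minimal sketch: inspecting one AudioBNC alignment TextGrid.
    # Assumes `pip install textgrid`; the file name below is a placeholder.
    import textgrid

    tg = textgrid.TextGrid.fromFile("example_alignment.TextGrid")
    for tier in tg:                         # e.g. word-level and phone-level tiers
        print(tier.name)
        for interval in list(tier)[:5]:     # first few labelled intervals
            print(interval.minTime, interval.maxTime, interval.mark)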

SpokenBNC 2014

11.4 million tokens, orthographically transcribed from smartphone recordings made between 2012 and 2016. Substantial speaker metadata is included. Annotations include parts of speech, lemmas and a system of semantic tags. For more details see Brezina et al. 2018.

International Corpus of English

English varieties from the UK, Canada, East Africa, Hong Kong, India, Ireland, Jamaica, the Philippines, Singapore and the USA. Each component corpus contains about one million words. The ICE-GB word annotations (but not syntactic trees) are searchable using IMS Open Corpus Workbench.

Arabic Treebank

Approximately 800 thousand words of newswire text from Agence France-Presse annotated with parts of speech, morphology and phrase structure.

DEFT Spanish Treebank

About 100 thousand words from both Spanish newswire and discussion forums, with extensive morphological and syntactic annotations.

AnCora-ES Spanish Treebank

About 500 thousand words of Spanish newswire, with extensive morphological and syntactic annotations.

El Corpus del Español, Web/Dialects

2 billion words of Spanish from 21 different countries, with lemmas and parts of speech. Courtesy of Mark Davies.

CETEMPúblico

180 million words from the Portuguese newspaper "Público", 1991--1998, with morphological and syntactic annotations.

Brazilian Portuguese Web as Corpus

About 2 billion words from Brazilian web pages, collected in WaCky style (see below). Courtesy of Aline Villavicencio.

Corpus do Português

1 billion words from web pages in Brazil, Portugal, Angola and Mozambique, with POS tags from Eckhard Bick's "PALAVRAS" tagger. Courtesy of Mark Davies.

French Treebank

About 650 thousand words drawn from the newspaper Le Monde, 1989--1994, annotated with syntactic constituents, syntactic categories, lemmas and compounds.

SPMRL2014

Dependency, constituency and morphology annotations for Arabic, Basque, French, German, Hebrew, Hungarian, Korean, Polish and Swedish.

NEGRA corpus

355 thousand tokens from the German newspaper Frankfurter Rundschau, annotated with syntactic structures.

EuroParl

About 40 million words of European Parliamentary proceedings aligned across translations into English, German, Spanish, French, Italian and Dutch.
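
Each language pair in the release is distributed as two line-aligned plain-text files; here is a minimal sketch of pairing sentences, assuming the conventional europarl-v7 file names (illustrative only):

    # Minimal sketch: reading a sentence-aligned EuroParl language pair.
    # The file names follow the common release convention but are assumptions here.
    from itertools import islice

    with open("europarl-v7.de-en.en", encoding="utf-8") as en_file, \
         open("europarl-v7.de-en.de", encoding="utf-8") as de_file:
        # The files are aligned line by line: line N of one is the translation
        # of line N of the other.
        for en_line, de_line in islice(zip(en_file, de_file), 3):
            print(en_line.strip(), "|||", de_line.strip())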

CALLHOME

This corpus consists of transcribed 5--10 minute segments from 120 telephone calls, each about 30 minutes in length. The calls were placed by native speakers of English from various places in North America, mostly to family members or close friends. Roughly 90 of the calls were placed to persons living outside North America, but all are in English. Holdings include both audio and transcripts.

The Buckeye Speech Corpus

This corpus comprises recordings of 40 speakers from Columbus, Ohio, made between 1992 and 2000. Each interview lasted about 60 minutes, and the corpus totals more than 300,000 words of speech. Created primarily for studying phonological variation in American English, it was gathered via a "modified sociolinguistic interview format" to capture a representative sample of the forms and frequencies of phonological variants. The corpus's .wav files are transcribed and force-aligned at the segment level.

CELEX2

Orthography, phonology, morphology and attestation frequency information for words in English, German and Dutch.

Concretely Annotated New York Times

About 1.3 billion words from articles that appeared in the New York Times 1982--2007, with automatically assigned lemmas and part-of-speech tags.

WaCky corpora

Between 1.2 and 1.9 billion tokens each of French, German and Italian, as crawled from the world wide web. Also includes about 800 million tokens of English Wikipedia as it was in 2009. These corpora are annotated with lemmas and parts of speech. For more details see Baroni et al. 2009.
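
The WaCky corpora are distributed in a one-token-per-line "vertical" format with XML-style sentence tags; below is a hedged sketch of scanning such a file, assuming tab-separated word/POS/lemma columns (the column order may differ by release) and a placeholder file name:

    # Hedged sketch: iterating over sentences in a WaCky-style vertical file.
    # One token per line, tab-separated columns, XML-like <s>...</s> sentence tags.
    # The column order (word, POS, lemma) is an assumption, not guaranteed.
    def sentences(path):
        sent = []
        with open(path, encoding="utf-8") as fh:
            for line in fh:
                line = line.rstrip("\n")
                if line.startswith("<"):                  # structural tag such as <text> or <s>
                    if line.startswith("</s>") and sent:  # end of sentence
                        yield sent
                        sent = []
                    continue
                cols = line.split("\t")
                if len(cols) >= 3:
                    sent.append((cols[0], cols[1], cols[2]))  # (word, pos, lemma)
            if sent:                                      # flush a trailing sentence, if any
                yield sent

    # Usage (the path is a placeholder):
    # for sent in sentences("ukwac.sample.vert"):
    #     print(sent[:5])
    #     break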

Bible Corpus

Parallel translations of the Bible into 100 languages. For more information see Christodouloupoulos and Steedman 2015.

BLM

A corpus of tweets collected by Jordan Graham as part of her MA thesis. These tweets all include the word "police" and either the hashtag #BlackLivesMatter or the hashtag #BlueLivesMatter, and are dated either May 25--26, 2020 or June 3--4, 2020. Each subset comprises 2,000 tweets, except for the May #BlueLivesMatter set, where the scraping operation yielded only 81.