Resources
Centre for Corpus Research
The Centre for Corpus Research at Birmingham has a wide range of corpus resources and tools for research purposes.
Corpora and interfaces
Bank of English
The Bank of English is now hosted on CQPweb, on a Birmingham server. This gives Birmingham staff and students access to the Bank of English and other corpora through a single interface. Registration requires a bham.ac.uk email address.
Please note that the previous version of the Bank of English, which was hosted on the Titania server, is now closed. At Birmingham, the Bank of English is now available only on CQPweb.
British Sign Language Corpus Project
Access to some of the video data and ELAN annotation files that form the British Sign Language (BSL) Corpus, based at University College London, is available here. The British Sign Language Corpus was created during 2008-2011 as a joint venture involving five UK universities, led by Dr Adam Schembri, who is now based at the University of Birmingham and continues to work on corpus-based approaches to the study of BSL linguistics.
CLiC
CLiC is a web application for the corpus linguistic analysis of Dickens’s novels and other literary texts. The web app is being developed as part of the CLiC Dickens project, a collaboration between the University of Birmingham and the University of Nottingham, funded by the AHRC. Please see the CLiC Dickens project site.
CorporaCoCo
CorporaCoCo is an R package that identifies statistically significant differences in co-occurrence counts between two corpora and reports an effect size and confidence interval for each difference it identifies. The package produces high-quality, customizable plots for use in reports.
EuroCoAT
EuroCoAT (European Corpus of Academic Talk) provides transcripts of academic conversations between undergraduate Erasmus students (L1 Spanish) and their lecturers at different host universities. The EuroCoAT project is a collaboration between the Universities of Extremadura, Birmingham, Limerick, Dalarna and VU Amsterdam.
BNCweb
BNCweb is a web-based client program for searching and retrieving lexical, grammatical and textual data from the British National Corpus (BNC). Registration requires a bham.ac.uk email address.
Sketch Engine
Institutional access to the Sketch Engine interface is available on any computer on the University network (but not outside the network). Approximately 160 corpora are included.
Wordbanks Online
Institutional access to the Wordbanks Online service is available on any computer on the University network (but not outside the network). Wordbanks Online is the HarperCollins interface for the 550-million-word version of the Bank of English.
WordSmith Tools
University of Birmingham users can run the networked version of the program from any machine on the University network, provided that they are logged into the network.
CLAWS Part of Speech Tagger
The Centre has a licence for the CLAWS tagger (UCREL, Lancaster). Staff or students who are interested in POS-tagging large quantities of data for research should contact Paul Thompson.
Wmatrix
The Centre also has a licence for the Wmatrix suite of semantic and POS annotation and analysis tools developed by Paul Rayson (UCREL, Lancaster). Staff or students who are interested in using Wmatrix for research should contact Paul Thompson.
The Centre has a large number of corpora for use by researchers and students at the university. These include:
- AHRC corpora
- Australian Corpus of English
- British Academic Spoken English corpus
- British Academic Written English corpus
- British National Corpus
- Bank of English
- Brown Corpus
- COCA
- COLT
- DANTE
- Europarl
- Freiburg-LOB (FLOB)
- Freiburg-Brown (Frown)
- German Parole
- Global Web-Based English (GloWbE)
- The Helsinki Corpus of English Texts: Diachronic Part
- The Helsinki Corpus of Older Scots
- ICLE
- Innsbruck Computer-Archive of Machine-Readable English Texts (ICAMET)
- The International Corpus of English - East African component
- Italian Corpus
- Italian Newspaper
- Kolhapur Corpus (India)
- Lampeter
- Lancaster/IBM Spoken English Corpus (SEC)
- LOB Corpus
- London Lund Corpus
- MICASE corpus
- Multilingual Plato
- Newdigate Newsletters
- Polytechnic of Wales Corpus
- Wellington New Zealand Spoken
- Wellington New Zealand Written
- Wolverhampton Business English
Access to some of these corpora is restricted; for further information, contact Paul Thompson at ccr@contacts.bham.ac.uk