The American National Corpus (ANC) is a consortium project to develop a linguistic corpus of American English comparable to the British National Corpus. The corpus provides a snapshot of American English as used across a wide range of written and spoken genres, and is designed for language and linguistic research and teaching. The ANC website provides details of the consortium members and project contacts; a summary of aims and organisation; and a copy of the original proposal to create the corpus.
The website BADIP : Banca Dati dell'Italiano Parlato, from the Karl-Franzens-Universität of Graz, makes available materials for the analysis and study of spoken Italian. Users are also invited to contribute to the development of the database by identifying problems with the site, communicating particular needs for studies and research, and suggesting further material. At the time of this review the site held a main collection of texts of spoken Italian (Corpus LIP - corpus del lessico di frequenza dell'italiano parlato), created between 1990 and 1992 by a group of linguists directed by Prof Tullio De Mauro. It contains 469 texts (about 490,000 words) collected in four cities: Milan; Florence; Rome; and Naples. The section Tipologia dei testi shows how the texts have been gathered into five main groups, depending on how and where the dialogue or conversation took place. Users can also find data on the kinds of speakers involved in the dialogues and the length of the conversations. The transcriptions contain grammatical notes and other linguistic explanations.
The Bookmarks for Corpus-based Linguists site contains a comprehensive collection of corpus linguistics links, compiled by Dr David Lee (City University of Hong Kong). The links are annotated and organised in groups such as: corpora, collections, data archives; software, tools, frequency lists; references, papers, journals; teaching and miscellaneous; people, places and conferences. What makes the site particularly useful is that the collection of links is extensive (about 1,000) but nevertheless easy to navigate thanks to the categorisation into groups. The annotation is clear and helpful. The site claims to be aimed primarily at linguists and language teachers who work with corpora. It could also be of use to those working in computational linguistics and people interested in data-driven learning, text analysis, computer-assisted language learning (CALL), or lexicography.
The BASE website offers information about and access to the British Academic Spoken English (BASE) corpus. The corpus consists of recordings made in a variety of university departments, grouped into four broad disciplinary groups with 40 lectures and 10 seminars in each: Arts and Humanities; Social Sciences; Physical Sciences; and Life Sciences. The recordings, together making up around 1.6 million word tokens, have been transcribed and tagged, and the transcriptions can be downloaded from the website in XML format. The lecture portion of the corpus can also be accessed through the Sketch Engine corpus analysis interface (subscription-based, with a free 30-day trial). The BASE corpus is a valuable resource for the investigation of language use in academic contexts, and the website contains a list of publications and conference papers which refer to BASE data. In addition to the BASE manual, the site also provides access to an Excel spreadsheet with information about the individual lectures and seminars, such as: title; department; audience; date of recording; speakers; duration. A link is provided to a selection of interviews with academic staff made in relation to the BASE corpus. The corpus can also be ordered via the Oxford Text Archive (OTA) website (formerly part of the Arts and Humanities Data Service (AHDS)), on completion of a request access form.
The British National Corpus (BNC) website offers information about and access to the BNC, a 100-million word corpus of written and spoken English. The BNC was compiled according to carefully designed criteria and contains a wide variety of written and spoken language. The written texts (90 million words) were taken from a range of fiction and non-fiction domains, mostly dating from 1975 onwards. The spoken samples (10 million words) include material from different contexts and regions produced by speakers of different ages and social backgrounds. The corpus is a key resource used for research and teaching in a number of areas, such as: lexicography; natural language processing; applied and theoretical linguistics.
The BNC website describes how the corpus was created and offers comprehensive information about its content and structure. Information on how to use the customised search software (Xaira or SARA) is also available, in the form of step-by-step guides and sample queries. The Simple Search function on the site allows users to see how often a word or phrase occurs in the corpus, and to retrieve up to 50 examples. Links are provided to other sites which offer access to the corpus or to resources created on the basis of it, such as word lists.
Unrestricted access to the corpus requires a user licence, which can be obtained by purchasing a copy of the corpus on DVD or by registering for the Subscription Service. A 30-day free trial is available to those who register and download a copy of the search software.
The BYU Corpus of American English is a very large collection of texts made freely available online via a dedicated search interface. The interface allows the user to search the corpus for words and phrases and display the results as a concordance with limited context. In addition to searching for exact words or phrases, users can exploit wildcards in their searches, search for lemma and part-of-speech information, look for collocates, and make semantically-based queries, amongst other things. The corpus initially consists of around 360 million words, with equal amounts from each year from 1990 to 2007. New material will be added at least twice a year. The texts are drawn from a variety of sources and are divided into five genres of equal size: spoken; fiction; popular magazines; newspapers; and academic journals. The search interface is simple to use, and offers functions that are not generally found in corpus search tools, such as the ability to find synonyms and compare similar words. A help file is available, and information about how to use this very powerful tool is also provided in the form of a five-minute guided tour. The BYU Corpus of American English is a valuable resource for anyone interested in looking at how English, especially American English, is used today. The composition of the corpus makes it particularly suitable for comparisons across time periods or genres.
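To illustrate the kind of wildcard query such interfaces support, here is a minimal Python sketch (not the BYU software itself; the toy corpus and the mapping of '*' to "any run of word characters" are assumptions for illustration) that expands a wildcard pattern into a regular expression and counts matches per genre:

    import re

    def wildcard_counts(pattern, texts_by_genre):
        # Treat '*' as "any run of word characters", as in many corpus query syntaxes.
        regex = re.compile(r'\b' + pattern.replace('*', r'\w*') + r'\b', re.IGNORECASE)
        return {genre: len(regex.findall(text)) for genre, text in texts_by_genre.items()}

    toy_corpus = {
        'spoken': "so I was going to say, you know, going forward",
        'academic': "the data suggest that the goal was attainable",
    }
    print(wildcard_counts('go*', toy_corpus))  # {'spoken': 2, 'academic': 1}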
The website of the Centre for English Corpus Linguistics at the Catholic University of Louvain contains useful information about corpus linguistics and in particular learner corpora. The site offers a comprehensive learner corpus bibliography and a description of the ICLE (International Corpus of Learner English) project. The site also provides annotated links to other resources, such as corpus linguistics websites, discussion lists, and concordancers. The major difference between this site and other corpus linguistics sites is the emphasis on learner corpora: research, resources, and publications. The page about ICLE contains information about the project and the resulting corpus resources, as well as advice for those who wish to join the project or create a similar corpus. Similar information is provided about LINDSEI (The Louvain International Database of Spoken English Interlanguage), a corpus of spoken learner language. There is also information about the LOCNESS corpus, a corpus of essays by native English speakers used for comparative studies.
This corpus contains 979,831 words, made up of 1,723 articles taken from three daily French newspapers: "Le Monde" (576 articles / 355,046 words), "L'Humanité" (576 articles / 367,486 words) and "La Dépêche du Midi" (571 articles / 257,299 words). The articles were published in 2002 and 2003, and each belongs to one of six categories: editorial; cultural; sports; national news; international news; and finance. Articles were taken from the newspapers on the 4th, 12th, 20th and 28th of each month. If no article in a given category was available on one of those days, the article from the following day was taken instead; failing that, the article from the preceding day, and so on. This resource is available via the Oxford Text Archive (OTA) website, and can be downloaded as a zipped file.
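The fallback rule amounts to a small search outwards from each target date, trying later days before earlier ones. The sketch below is a hypothetical reconstruction in Python (the file names are invented for illustration):

    from datetime import date, timedelta

    def pick_article(available, target, max_offset=14):
        """Return the article nearest to `target`, preferring later days,
        mirroring the sampling rule described above."""
        for offset in range(max_offset + 1):
            for candidate in (target + timedelta(days=offset),
                              target - timedelta(days=offset)):
                if candidate in available:
                    return available[candidate]
        return None

    available = {date(2002, 3, 5): 'editorial-2002-03-05.txt'}
    print(pick_article(available, date(2002, 3, 4)))  # falls forward one day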
Corpus Internacional do Português (CINTIL, International Corpus of Portuguese) is an annotated corpus of some one million word tokens compiled at the Linguistics Center of the University of Lisbon. This page provides a search tool for accessing the corpus. The result of a search is a concordance that allows the user to sort the lines and to expand the context. The corpus is divided into a written part, consisting of fiction, and a spoken part comprising both formal and informal conversations. It is annotated with tags for part of speech, inflection, and multi-word named entities, and the site links to the tagging guidelines as a PDF file. The corpus will be released to the research community in the future, but at the time of review there was no information about when this will happen. This is a valuable tool for anyone researching or studying the Portuguese language or corpus linguistics.
CLIPS : corpora e lessici di italiano parlato e scritto (corpora and lexicons of spoken and written Italian) is a project hosted by the University of Naples. It makes available online a wealth of written documents and audio files, which are free to use for research purposes. Users can access about 100 hours of speech from 15 different cities in Italy: Bari; Bergamo; Bologna; Cagliari; Catanzaro; Florence; Genoa; Lecce; Milan; Naples; Palermo; Parma; Perugia; Rome; and Venice. Male and female voices are equally represented. Many of the recordings are accompanied by transcriptions in PDF format. A variety of different text typologies have been used, such as: radio and television broadcasts; interviews; dialogues; non-professional speakers reading aloud; telephone conversations between 300 speakers and a hotel desk operator; and read speech by a selection of professional speakers recorded in an anechoic chamber. The project has been co-ordinated by various universities and colleges in Italy and a detailed outline of its development is given by Federico Albano Leoni, the Project Director. This resource is an extremely valuable source of primary material for scholars of Italian linguistics; it would also be of use to teachers of Italian looking for new audio and printed material containing contemporary Italian for use in class.
ConcApp is a free and user-friendly text analysis program. It offers concordances, collocations and word frequency statistics, and can also be used to edit text files. ConcApp is exceptionally easy to download and start via its simple and clear interface. Support is offered for English, Japanese, Chinese, Thai, Russian and most European character sets in Unicode. Users can download the program files from the website for free, without any registration. The software is designed to run under Windows operating systems (ME, NT / 2000, XP and Vista).
Concordance : Software for Concordancing and Text Analysis is the website of a software package for Windows which generates concordances, word lists, and indices from single and multiple texts. Results can be printed; saved; or exported as text, HTML, or Web Concordance files for dissemination online. The Web Concordance provides a user-friendly interface for exploring texts processed with Concordance: it allows the viewing of the original text; a wordlist with frequencies for each entry; and hyperlinks from entries to their occurrences in the text. Concordance supports multiple languages and alphabets, and can convert from OEM to ANSI character sets, and from Unix to PC files. The software is available on a commercial basis, though an evaluation version is available for download from the website. The software has been developed by R.J.C. Watt (University of Dundee) and is used by the author for the teaching of Shelley, Coleridge, Keats, and Blake.
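A keyword-in-context (KWIC) concordance of the kind these packages produce is straightforward to sketch in Python (an illustrative toy, not the Concordance software itself):

    def kwic(tokens, node, width=4):
        """Yield one keyword-in-context line per occurrence of `node`."""
        for i, tok in enumerate(tokens):
            if tok.lower() == node.lower():
                left = ' '.join(tokens[max(0, i - width):i])
                right = ' '.join(tokens[i + 1:i + 1 + width])
                yield f'{left:>30}  {tok}  {right}'

    text = 'The cat sat on the mat and the dog sat by the door'.split()
    for line in kwic(text, 'sat'):
        print(line)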
Corpora4Learning.net is a website created and maintained by Dr Sabine Braun at the University of Surrey. It contains an extensive selected bibliography of publications concerned with the use of corpora in teaching; an annotated set of links to English corpora; a list of links to useful tools and websites, also annotated; and some information about two projects Dr Braun is involved in. This site is a useful starting point for anyone looking for publications or resources about the use of corpora in teaching and learning English.
Corpus Chambers-Le Baron D'Articles de Recherche en Français (The Chambers-Le Baron Corpus of Research Articles in French) contains 1,045,872 words, made up of 160 articles taken from 20 journals. The articles included were published between 1998 and 2006. They belong to one of ten categories: media/culture; literature; linguistics and language learning; social anthropology; law; economics; sociology and social sciences; philosophy; history; and communication. The articles were selected on the basis that they concerned studies in the humanities and social sciences in a very broad sense of the term, were peer-reviewed, and were written by native speakers of French. The corpus can be downloaded as a plain text file from the Oxford Text Archive website (formerly part of the Arts and Humanities Data Service (AHDS)), but as use is restricted to non-commercial purposes, users wishing to access this resource are requested to apply for approval by filling in a short form on the site.
The Corpus del Español is a 100-million-word corpus of Spanish, funded by the National Endowment for the Humanities in the USA. The corpus is searchable online through a quick and useful search engine created by Professor Mark Davies. It contains 20 million words of text from the period 1200-1500, 40 million from 1500-1800 and 40 million from 1800-2000.
Corpus Encoding Standard (CES) is an online set of guidelines developed by the Expert Advisory Group on Language Engineering Standards (EAGLES), designed to serve as a set of encoding standards for corpus-based work in natural language processing and language engineering research and applications. The CES is a subset of SGML (Standard Generalised Markup Language) compliant with the TEI (Text Encoding Initiative) Guidelines. The document starts with an overview of the general principles of corpus encoding, and the 'recommendations common to all documents', which include a description of SGML syntax and a discussion of issues related to character sets, including the International Phonetic Alphabet. The sections that follow describe: building the TEI header; encoding of primary data; and encoding of linguistic annotation. For primary data the CES identifies three levels of encoding, from the minimum encoding level required for CES conformance to more detailed tagging. The section on linguistic annotation includes chapters on: locators; encoding conventions for segmentation and grammatical annotation; and encoding conventions for parallel text alignment. The document ends with: a bibliography; lists of relevant standards and URLs; the CES DTD; the tag index; and recommendations on using the CES. The page provides links to an XML version of the CES DTD and a list of projects using the CES.
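As a feel for what 'building the TEI header' involves, the following Python sketch assembles a minimal header using standard TEI element names (a schematic illustration only; the actual CES DTD imposes further requirements beyond what is shown here):

    import xml.etree.ElementTree as ET

    # A minimal TEI-style header: file description with title and source.
    header = ET.Element('teiHeader')
    file_desc = ET.SubElement(header, 'fileDesc')
    title_stmt = ET.SubElement(file_desc, 'titleStmt')
    ET.SubElement(title_stmt, 'title').text = 'Sample corpus text, electronic edition'
    source_desc = ET.SubElement(file_desc, 'sourceDesc')
    ET.SubElement(source_desc, 'bibl').text = 'Details of the original print source'

    print(ET.tostring(header, encoding='unicode'))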
Corpus Linguistics is a site designed as a supplement to the book Corpus Linguistics, although it can be used on its own. The project is funded by IHE (Innovation in Higher Education). The site consists of four major sections: Early Corpus Linguistics and the Chomskyan Revolution; What is a Corpus and What is in it?; Quantitative Data; and The Use of Corpora in Language Studies. Each of these sections gives a detailed overview of the subject mentioned. The site offers a good introduction to corpus linguistics, its development, its methods for language study, its limitations and advantages. It will be particularly useful as a teaching resource for students of corpus linguistics, although it is no substitute for the book which is far more detailed and has more sections (the site covers only the first four chapters).
The Corpus of Early English Correspondence Sampler (CEECS) is an electronic resource which can be downloaded from the Oxford Text Archive website (formerly part of the Arts and Humanities Data Service (AHDS)). The 0.45 million word Corpus of Early English Correspondence Sampler was created from the larger Corpus of Early English Correspondence. CEECS covers the years 1418-1680, and consists of 1,147 letters written by 194 writers. The selection criteria were arbitrary, as only 23 editions which were no longer in copyright could be included, but CEECS is nevertheless a fairly representative sample of the full corpus. COCOA markup references are used. Access to this resource is restricted, and hence users are requested to complete a short online form to apply for a copy.
The Corpus of English Dialogues (CED) is an electronic resource comprising dialogues from 1560 to 1760. It can be downloaded from the Arts and Humanities Data Service (AHDS) website; however, access to the material is restricted, and users are asked to complete a short Web form to apply for a copy. To give a picture of spoken interaction of the past, as mediated through written records, the CED contains 1.2 million words drawn both from texts which include constructed dialogue and from those which purportedly record language from authentic speech situations. There are five main text types in the CED: drama comedy; didactic works (language manuals and other handbooks); fiction; trial proceedings; and witness depositions. The corpus texts have been coded to indicate features such as: foreign language; narration; compilers' comments; editorial comments and emendations; and font changes. The CED comprises 177 text files, and is distributed in plain text and XML formats, accompanied by a PDF guide to the corpus.
This is the website for the project Corpus of Spoken Israeli Hebrew (CoSIH), which started in 2000 at Tel Aviv University. Its aim is to provide a representative corpus of Hebrew (5 million words) as spoken today by different groups in society, taking into account such factors as: sex; age; profession; social and economic background; and education. The project was launched to fill a gap in the field of corpus linguistics and to provide a resource for research and general educational purposes. The website is mainly of benefit to researchers. The site has a simple design, and the text is available in both English and Hebrew. Among other things it describes: the rationale for the project; its aims; its design; and the sampling procedures used. A list of useful references is also included. At the time of review the site had not been updated since 2004.
The Corpus of Spontaneous Japanese website presents the data and preliminary results from a large-scale national research project on spontaneous spoken Japanese. The whole corpus contains approximately 650 hours of spontaneous speech of various kinds, recorded between 1999 and 2003; this provides up-to-date data on current spoken language in Japan. Parallel English and Japanese versions of the website provide: an introduction to the study; information on the sources, accompanied by soundfiles of samples of speech; details of the transcription and annotation used; preliminary analyses; and references. Two kinds of speech, academic presentation speeches (APS) and simulated public speaking (SPS), were the main sources, but some material was also taken from interviews with the subjects about their APS or SPS, and from recordings of the subjects reading short passages aloud. This is a detailed and valuable source of data for researchers in Japanese linguistics. The project is a collaboration between the National Institute for Japanese Language, the Communications Research Laboratory, and the Tokyo Institute of Technology.
The COSMAS system gives online access to the German language corpora of the Institut für Deutsche Sprache in Mannheim, Germany. This is the world's largest collection of German text corpora for linguistic research. A 1.1 billion word portion, free of copyright restrictions, is publicly available. In addition, invited guests have access to the whole COSMAS corpus collection (currently 1.85 billion words). The corpora on offer include classic literary texts, national and regional newspapers, the works of Marx and Engels, spoken language in transcribed form, morphosyntactically annotated texts and some unique corpora focussing on texts about the "Wende" (the collapse of the GDR and the reunification of Germany).
Registered users can submit queries online and obtain concordances and word frequency counts. The most heavily used COSMAS feature is "Collocation Analysis and Clustering", a method for discovering hierarchical structures within a set of search hits based on the collocational patterning of the search terms. There is extensive help on the use of the system (in German). Users must register with COSMAS to obtain a username and password; registration is free, and the account can be used for subsequent visits to the site. Commercial exploitation of the results obtained from COSMAS is not permitted.
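The collocation analysis behind that feature can be illustrated with a simple association measure. The sketch below ranks co-occurring words by pointwise mutual information, one common collocation statistic (not COSMAS's own, more elaborate algorithm):

    import math
    from collections import Counter

    def pmi_collocates(tokens, node, window=3):
        """Rank words co-occurring with `node` within `window` tokens by PMI."""
        freq = Counter(tokens)
        pairs = Counter()
        for i, tok in enumerate(tokens):
            if tok == node:
                neighbours = tokens[max(0, i - window):i] + tokens[i + 1:i + 1 + window]
                pairs.update(neighbours)
        n = len(tokens)
        scores = {w: math.log2((c / n) / ((freq[node] / n) * (freq[w] / n)))
                  for w, c in pairs.items()}
        return sorted(scores.items(), key=lambda item: -item[1])

    tokens = 'strong tea and strong coffee but powerful computers'.split()
    print(pmi_collocates(tokens, 'strong', window=1))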
This is the personal website of Costas Gabrielatos, a PhD student in English linguistics at the University of Lancaster. It contains a bibliography of his articles and presentations concerned with the use of corpora in linguistic research and teaching. Most of the items are accessible either as PDF files or as web pages. In addition there is a list of links to useful resources and tools found on the Web.
The objective of this project is the compilation and analysis of representative Croatian texts - both older and contemporary - in the form of a corpus suitable for all kinds of linguistic research. The Web pages are in Croatian and English. Users can get immediate free online access to the Croatian National Corpus. The service allows the user to submit queries to the corpus and obtain concordance lines. The corpus is under development and at the time of review contained 101 million words. A list of the most frequently occurring words in the corpus is also available for viewing. In addition to the corpus, users may also query the Croatian Electronic Text Archive (Hrvatski elektronski tekstovni arhiv). There is a page giving extensive information about the goals of the project, plus a page of links to other Croatian electronic texts and reference corpora in other languages.
Cyril Belica: Kookkurrenzdatenbank CCDB - V3.2 (co-occurrence database), developed by the Institut für Deutsche Sprache (Institute for German Language) in Mannheim, Germany, is a corpus-linguistic experimentation platform for researching and theorising about relationships between language constituents. The database is meant for further research in the field of collocation analysis in modern German. It holds 220,309 analysed words that can be browsed or searched and shown in context (one line). There are further options for showing related collocations; semantic proximity; self-organising maps; or near synonyms. A second search window offers a search box for co-occurrences, covering 770,202 words. The database is built on a 2.2 billion word corpus of modern German from the corpus linguistics programme of the Institut für Deutsche Sprache. A map showing the latest visitors to the site by country is a more lighthearted feature. This database is an excellent tool for lexicographers and can be used for corpus-based research in modern German.
The Czech National Corpus (CNC) is a key resource for researching the Czech language as it is used today. The easiest way to explore the possibilities of the Czech National Corpus is through the Internet interface to the CNC. The Internet-accessible corpus PUBLIC contains about 20,000,000 words; it is a selection from the large corpus SYN2000 (100,000,000 words) with the same genre composition. The Internet access has several limits (besides the corpus size): searches for phrases or combinations are not possible, only single words are searchable; the context around the search results is limited to 60 characters; and the corpus is not morphologically annotated. Even with these limits it is possible to get a sense of the computational approach to language study. For more serious linguistic research, full access to SYN2000 is offered for free. First, users have to download the corpus manager GCQP, which allows: unlimited context to be shown around the search result; searching for phrases; searches based on morphology; sorting of results; the display of information about the text source, type (newspaper, fiction, etc.) and genre; saving of selected concordances to a local disk; and access to various statistics. The corpus SYN2000 arose from text sources provided to CNC/ÚČNK for non-commercial use, so everybody interested in corpus access has to sign a declaration (in Czech) that they will not use information retrieved from the CNC for commercial profit.
Data-Intensive Linguistics (DIL) is an online book introducing tools and techniques for using text corpora and giving the basics of statistical natural language processing. It presents: UNIX corpus tools; probability and information theory and their application to computational linguistics; fundamental techniques of probabilistic language modelling; and implementation techniques for corpus tools. The book is divided into five chapters: an overview of DIL and its historical roots; finding information in text (tools for: finding and displaying text; concordances; and collocations); collecting and annotating corpora (corpus design; SGML (Standard Generalised Markup Language) for computational linguistics; annotation tools); statistics for DIL; and applications of DIL. The book is a useful resource for anyone interested in computational linguistics, corpus linguistics and natural language processing.
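As a taste of the probabilistic language modelling the book covers, here is a minimal maximum-likelihood bigram model in Python (an illustrative sketch, not code from the book):

    from collections import Counter

    def bigram_model(tokens):
        """Maximum-likelihood bigram probabilities P(w2 | w1) = c(w1 w2) / c(w1)."""
        unigrams = Counter(tokens)
        bigrams = Counter(zip(tokens, tokens[1:]))
        return {pair: count / unigrams[pair[0]] for pair, count in bigrams.items()}

    tokens = 'the cat sat on the mat the cat slept'.split()
    model = bigram_model(tokens)
    print(model[('the', 'cat')])  # 2 occurrences of 'the cat' / 3 of 'the' = 0.666...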
The Arts and Humanities Data Service (AHDS) has published this guide to good practice, Developing Linguistic Corpora, and a free online version is made available here (a print version may be purchased from the site). The Guide is edited by Martin Wynne from the Literature, Languages and Linguistics branch of the AHDS, which is hosted by the Oxford Text Archive. A selection of experts in various areas of corpus construction offer advice in a readable and largely non-technical style to help the reader to ensure that their corpus is well designed and fit for the intended purpose. This Guide is aimed at those who are at some stage of building a linguistic corpus. Little or no knowledge of corpus linguistics or computational procedures is assumed, although more advanced users will also find the guidelines useful. The site contains the following chapters: Corpus and Text: Basic Principles, John Sinclair (Tuscan Word Centre); Adding Linguistic Annotation, Geoffrey Leech (Lancaster University); Metadata for Corpus Work, Lou Burnard (University of Oxford); Character Encoding in Corpus Construction, Anthony McEnery and Richard Xiao (Lancaster University); Spoken Language Corpora, Paul Thompson (University of Reading); Archiving, Distribution and Preservation, Martin Wynne (University of Oxford); Appendix: How to make a corpus, John Sinclair (Tuscan Word Centre).
The Dialogue Diversity Corpus (DDC) is a collection of dialogue transcripts from a wide range of situations, including a medical interview; academic tutoring; a telephone travel service; and friends interacting. These transcripts are freely available for research in human interaction. Version 2.0 retains access to all of the sources that were available through the original release, and adds a hyperlink to a page on finding dialogue transcripts and records, with finding aids including several text corpora and links to data-sharing organisations.
ELISA: English Language Interview corpus as a Second-language Application is being developed at the Eberhard Karls University, Tübingen and the University of Surrey. Its aim is to become a resource for language learning and teaching, and for interpreter training. It consists of video recordings of interviews with native English speakers from, for example, England; Scotland; Ireland; Australia; and the US. The interviews have been transcribed. The site is a demo that gives free access to a number of video clips and transcriptions as text and XML files. In addition there is a search engine that allows searches in the transcripts and presents the results as concordances or word counts. The material on the website is free for use in research, teaching and study with due recognition of the project. This is a valuable resource for anyone interested in corpus linguistics, spoken English or applied linguistics.
The Emille Corpus is an electronic resource which can be downloaded from the Oxford Text Archive website (formerly part of the Arts and Humanities Data Service (AHDS)). The encoding format used is SGML. The collection consists of: 30 million words of monolingual written data (Gujarati, Tamil, Hindi, and Punjabi news website articles); 600,000 words of monolingual spoken data (Hindi, Urdu, Punjabi, Bengali, and Gujarati radio broadcasts); 120,000 words of parallel data in each of English, Hindi, Urdu, Punjabi, Bengali, and Gujarati (taken from UK government leaflets). The resource is freely available, although users are asked to agree to a brief statement of terms and conditions.
The Empirical Language Research (ELR) journal is an online peer-reviewed e-journal for any kind of linguistic research based on empirical corpus data. ELR is the relaunched English Language Research journal, renamed to reflect the emphasis on the use of empirical data for linguistic research. The journal will focus particularly on empirical approaches to linguistic theory; multilingual corpora and translation; data-driven learning; natural language processing; and corpus-driven lexicography and lexicology. Launching the journal as a free online e-journal is a statement in support of 'the movement towards making academic research open and freely available rather than obscure, expensive, and inaccessible'. At the time of review the journal contained only two issues with three articles. This publication is of interest to both scholars and students of empirical linguistics.
The English and Irish Language Terminology Database is a downloadable resource available from the Arts and Humanities Data Service (AHDS) website in RTF format. Thirty-nine terminology lists and 20 dictionaries - the result of continuous work on Irish language terminology since 1922 - were input into the database. The database contains over 260,000 terms, constituting one of the largest terminology databases in the world. An Coiste Téarmaíochta (the Terminology Committee) of Foras na Gaeilge was responsible for the creation of these terms. The following areas are covered: science; commerce; computing; sport; history; religion; and current affairs.
English Language and Linguistics is a biannual journal which focuses on the description of the English language within the framework of contemporary linguistics. It covers a range of theoretical perspectives, including syntax; morphology; phonology; semantics; pragmatics; corpus linguistics; and lexis. The site has a link to Cambridge Journals Online, where free tables of contents and abstracts of articles, starting with volume 1, 1997, are provided. For registered users, there is the additional benefit of email alerting. The journal is available to institutions in print and electronic form, and to individuals in print only. Discounts are available to members of the European Society for the Study of English, the Linguistic Society of America, and the International Association of Teachers of English as a Foreign Language.
The English Language of the North-West in the Late Modern English Period website introduces a corpus of previously untranscribed letters written to Richard Orford, a steward at Lyme Hall in Cheshire, between 1761 and 1790. The collection is held in the John Rylands University Library of Manchester. These are unselfconscious practical letters, often by uneducated people, on matters of business, farming, mining, and social relations. The resulting Corpus of Late Eighteenth-Century Prose contains about 300,000 words, available free for download as a single text file for electronic searching or as three linked HTML files for maximum readability. The corpus can be ordered via the Oxford Text Archive (formerly part of the Arts and Humanities Data Service (AHDS)), or from the project manager on completion of the request access form.
The website of the English-Norwegian Parallel Corpus (ENPC) offers information about the ENPC project and the corpus itself. The corpus was developed at the Department of British and American Studies of the Universitetet i Oslo (University of Oslo), and consists of original Norwegian texts and translations from and into English. It is intended for contrastive analysis of the two languages and for translation studies. More detailed information about the corpus can be found in the ENPC manual, available on the site. The purpose of the manual is to describe the structure of the corpus and explain its markup. The manual starts with a description of the corpus, its aims and collection development policy, and proceeds to an explanation of its markup. The encoding is in broad agreement with the TEI Guidelines, though the ENPC DTD differs from the TEI DTD in some respects, mainly through the addition of new tags and entities (all modifications to the TEI DTD are described in Appendix 3 to the document). The chapter on markup includes a detailed description of the encoding recommended for the header, text and its divisions, paragraphs, S-units, words, headings, punctuation, highlighting and quotation, foreign elements, notes, lists, figures, editorial comments, links and other textual elements. A further chapter covers tags used for linguistic analysis, including the markup for direct speech and thought, and word-class tagging. The manual also describes the software written for the project, namely the Translation Corpus Aligner, which aligns texts automatically at the sentence level (a simplified sketch of this kind of alignment follows below), and the Translation Corpus Explorer, which is a browser for parallel texts. The manual offers a list of texts included in the corpus and a list of word-class elements allowed by the ENPC DTD, with notes on their usage. Links to publications (until 2001) and people involved with the project can also be found on the site, together with links to extensions of the project.
Although the site is no longer updated, the information remains relevant.
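Sentence alignment of the kind the Translation Corpus Aligner performs is classically done with length-based dynamic programming (Gale & Church, 1993). The Python sketch below is a much-simplified illustration of that idea, not the ENPC tool itself: it pairs sentences whose character lengths are similar and lets unmatched sentences incur a penalty.

    import functools

    def align_sentences(src, tgt, skip_penalty=50):
        """Tiny length-based sentence aligner: minimise total length mismatch.
        Real aligners also allow 1-2 and 2-1 pairings; this sketch does not."""
        @functools.lru_cache(maxsize=None)
        def best(i, j):
            if i == len(src) and j == len(tgt):
                return 0, ()
            candidates = []
            if i < len(src) and j < len(tgt):  # 1-1 pairing
                cost, rest = best(i + 1, j + 1)
                candidates.append((abs(len(src[i]) - len(tgt[j])) + cost,
                                   ((i, j),) + rest))
            if i < len(src):                   # source sentence left unmatched
                cost, rest = best(i + 1, j)
                candidates.append((skip_penalty + cost, rest))
            if j < len(tgt):                   # target sentence left unmatched
                cost, rest = best(i, j + 1)
                candidates.append((skip_penalty + cost, rest))
            return min(candidates)
        return best(0, 0)[1]

    english = ['The weather was fine.', 'We went for a long walk in the hills.']
    norwegian = ['Været var fint.', 'Vi gikk en lang tur i åsene.']
    print(align_sentences(english, norwegian))  # ((0, 0), (1, 1))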
Entwicklung und Implementierung eines Datenbanksystems zur Speicherung und Verarbeitung von Textkorpora is an online dissertation on the design of database systems for the storage and processing of text corpora. The dissertation starts with: an introduction to corpus linguistics; and an overview of some early and modern corpora including: the British National Corpus (BNC); the Bank of English; and the German language corpora developed at the Institut für Deutsche Sprache in Mannheim. The author discusses various aspects of corpus annotation, including: the choice of part-of-speech tag sets; automatic part-of-speech tagging; disambiguation; and parsing. The chapter on corpus analysis tools gives an overview of: text analysis and concordancing software; and such corpus analysis systems as: SARA developed for the BNC; COSMAS developed for work with German language corpora at the Institut für Deutsche Sprache in Mannheim; and the IMS Corpus Workbench developed at the Institut für Maschinelle Sprachverarbeitung in Stuttgart.
Subsequent chapters discuss the encoding of texts for linguistic corpora using SGML (Standard Generalised Markup Language) and the TEI (Text Encoding Initiative) Guidelines. The discussion of corpus markup includes 'Tokenisierung' (tokenisation) - the identification of whitespace, words, sentences, figures and punctuation to be encoded - and the building of corpus and text headers following the guidelines for the TEI header. The rest of the dissertation describes the design and building of a text database using the corpus database system CORSICA.
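Tokenisation and sentence splitting are easy to sketch naively with regular expressions; the following illustrative Python is a toy version of the step described, glossing over abbreviations and other hard cases a real tokeniser must handle:

    import re

    def tokenise(text):
        """Split text into word tokens (with internal apostrophes) and punctuation."""
        return re.findall(r"\w+(?:'\w+)?|[^\w\s]", text)

    def split_sentences(text):
        """Break after ., ! or ? when followed by whitespace and a capital letter."""
        return re.split(r'(?<=[.!?])\s+(?=[A-Z])', text)

    print(split_sentences('The corpus grew. It now has headers.'))
    print(tokenise("It's encoded in SGML, isn't it?"))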
The Edinburgh University Speech Timing Archive and Corpus of English (EUSTACE) is a speech corpus comprising 4608 spoken sentences recorded for speech timing research at the Department of Theoretical and Applied Linguistics at the University of Edinburgh. The full corpus is available for downloading and is intended to be useful for phonetics researchers and speech technologists working on synthesis and recognition. Example sentences are available for playback on the website, together with documentation including details of the experimental design, recording procedure, labelling methodology and original research results. The complete archive, available for downloading, includes a structured list of the sentences, the speech recordings and the label files, plus full documentation. Speech waveform files are available in WAV (RIFF) format and SD (ESPS) format. The downloadable corpus is free, and licensed for non-commercial use only. The original research was funded by the Engineering and Physical Sciences Research Council (EPSRC) and the production of the website was funded by the Moray Endowment Fund of Edinburgh University.
EXMARaLDA stands for Extensible Markup Language for Discourse Annotation and is a system of data formats and tools for the annotation and transcription of spoken language. The system contains tools for analysing and querying annotated corpora. The markup is done in XML and the software is implemented in Java, allowing the tools to run on a multitude of operating systems. The software is freely available for downloading and the site encourages feedback. The English version of the site contains a 'Partitur' editor, which lays out transcriptions of speech like the parts in a musical score; a corpus management system; and a query tool for annotated data. The German version adds an annotation tool and a TEI document explorer. In addition there are some downloadable smaller annotated corpora; documentation; and a list of publications, some articles from which are available online. This is a useful site for anyone interested in text analysis; linguistic annotation; or corpus linguistics.
FreeLing is an open-source language analysis tool suite, freely provided under the GNU General Public License of the Free Software Foundation. The tools have been developed by the TALP Research Center at the Polytechnic University of Catalonia. The software is programmed in C++ and runs under Linux, but instructions for porting to other platforms are provided. At the time of review the suite contained English, Spanish, Galician, Italian and Catalan dictionaries; a text tokenisation tool; a sentence splitting tool; a tool for morphological analysis; and a part-of-speech tagger, among other components. There is an online analyser that allows analysis of smaller samples as a demonstration, and there is extensive online documentation, including manuals. Registration is free and quick but is needed for some of the features on the site. The site provides clear and useful instructions for installing the software. This is a powerful and very useful resource, but it demands some knowledge of Linux and of how to install software and prepare the system, or alternatively of how to compile and run C++ programs under other operating systems. Although this is not plug-and-play software, the suite is a very useful tool for those interested in corpus linguistics and text analysis, especially in English, Spanish, Catalan, Galician or Italian.
Funded by the ESRC, AHRB and the British Academy, the French learner language oral corpora website aims to promote research in French language learning by providing a range of corpora (sets of recordings and their transcripts) from a number of research projects. Well-organised and easy to navigate, the site is divided into three main sections: the French learner language oral corpora project; the individual corpora which it catalogues; and further resources for researchers. The first section outlines the overarching project, the corpora it includes and the computerised Child Language Data Exchange System (CHILDES) employed by the researchers who carried out the projects. Access to the data is possible through downloading the CHILDES software, clear instructions for which are given in the Beginner's Guide pages under the section entitled 'Other Resources'. The second section elaborates on the descriptions given of the corpora in the first section. For each corpus, details of the research project, the tasks set and the learners involved are given. There is then a collection of the data produced by learners of French participating in each project. The primary data can be downloaded in a number of formats including sound, transcriptions, tagged and XML. The final section, 'Other Resources', offers an extensive bibliography, details of future conferences and a list of other related research projects. Collaboratively produced and run by the Modern Language Schools of both Southampton and Newcastle Universities, this website constitutes an accessible and useful resource for those engaged in teaching and research in this field. This resource can also be downloaded in XML format from the Oxford Text Archive (OTA) website (formerly part of the Arts and Humanities Data Service (AHDS)).
The British Academic Written English (BAWE) corpus is a project involving the universities of Warwick, Reading and Oxford Brookes, funded by the Economic and Social Research Council (ESRC). The corpus contains some 3,000 academic student assignments, divided into four disciplinary areas (Arts and Humanities; Social Sciences; Life Sciences; and Physical Sciences) and across four levels of study. The corpus is accessible through the Open Sketch Engine, which allows online searches in the corpus. The corpus itself is available free of charge to researchers who register with the Oxford Text Archive. This is a valuable resource for researchers within the subject areas of corpus linguistics and English language studies.
The Gateway to Corpus Linguistics on the Internet is an excellent Web resource that aims to direct users to corpus linguistics materials - both for academic and non-academic purposes - that are available online. It offers annotated links to a wide variety of resources including: research centres; projects; events and mailing lists related to the field; online tutorials for corpus linguistics and concordancing; corpora of different languages (with particular emphasis on English and German) and text archives; software; sites devoted to data-driven learning; and miscellaneous online resources such as electronic journals, dictionaries, and sites maintained by individual linguists. The site's author, Yvonne Breyer, also offers a bibliography of printed material for corpus linguistics, which will be expanded in the future, and a further bibliography relevant to forensic linguistics. At the time of review the site had not been updated for some time. This gateway is a substantial and well-organised collection of links; while it does not claim to be exhaustive, it offers a comprehensive range of resources and will help anyone working within the field of corpus linguistics to locate material online easily.
The German Parole Corpus is an electronic resource which can be downloaded from the Oxford Text Archive (formerly part of the Arts and Humanities Data Service (AHDS)). This corpus of approximately 23 million words contains written texts of the modern German language, subdivided into four domains: books; newspapers; periodicals; and miscellaneous. The encoding format used is TEI P3 SGML. The material can be accessed free of charge, although users are asked to agree to a short statement of terms and conditions.
Originally supported by the Economic and Social Research Council, the GerManC website presents a corpus-building project conducted within the School of Languages, Linguistics and Cultures of the University of Manchester. The project aims to build a corpus of written German covering the period 1650-1800. The structure of the corpus will parallel that of similar historical linguistic corpora of English, such as the ARCHER project or the Helsinki Corpus. A pilot study of German newspapers, started in 2006, was successfully completed in April 2007, and the 100,000-word corpus created then is available online. The annotated texts, as well as the accompanying documentation covering the complete list of texts, the procedures for building the corpus, and the annotation codes, are initially made available through the GerManC website, and in future through the Oxford Text Archive. A list of conference papers on GerManC and links to contact details for researchers on the project complete this informative site on an excellent research project.
This is the website of the University of Cambridge project, A Historical Corpus of the Welsh Language. The project ran from 2001 to 2004, and its architects hope to extend it further in the future. The project aims to produce a historical corpus of Welsh texts from the Early Modern Welsh period (1500-1850) in a readily searchable electronic format for researchers in Celtic studies and historical linguistics. The corpus is produced in a format that conforms to the standards of the Text Encoding Initiative (TEI). Texts can be viewed, browsed, searched and downloaded in different formats (including the original XML). The project has received funding from the Arts and Humanities Research Board (now the AHRC) Resource Enhancement award.
The Hypermedia Corpus of Spoken Japanese is a joint research project involving universities in Japan and the USA. It has been running since the mid-1990s and is supported by the Japanese Ministry of Education. The samples from the corpus presented on this website consist of free speech, role plays and conversations involving native speakers and foreign learners of Japanese. The corpus is unusual in making the data available in digital video and audio form (movies) as well as in full transcripts (in Japanese script), thus allowing users to study intonation, facial expressions, gestures and other features of non-verbal communication without recourse to complex linguistic notation systems. Teachers of Japanese as a foreign language as well as researchers in linguistics will therefore find this corpus of value. The information pages are in both English and Japanese, but the transcriptions of the spoken extracts are in Japanese only.
The ICAME website contains information about the International Computer Archive of Modern and Medieval English, an organisation of linguists and information scientists working with English corpora. The organisation distributes a number of machine-readable collections of text and makes available information about work done on these and other English corpora through a comprehensive online bibliography. Additions to the bibliography can be made via the site. The ICAME Journal is one of the leading journals in corpus linguistics, dating back to 1979, and electronic versions of the publication are available on the site (PDF format). Another valuable resource on the site is the Corpora email list page, which not only provides information about the list but also offers access to the archives of all previous messages. The website also provides manuals for the corpora and text collections distributed by ICAME, among them the Brown; LOB; FLOB; Frown; Helsinki; and London-Lund corpora. The ICAME corpus collection itself is available online, but only to registered ICAME users. A CD-ROM may be purchased from the site. ICAME holds an annual conference, about which some information can be found on the site. This site will be of interest to corpus users and academic linguists interested in the potential of electronic language processing.
The International Corpus of English (ICE) website presents a corpus compilation project that aims to provide comparable corpora of English from different English-speaking regions around the world. Each corpus will contain one million words of spoken and written language, taken from a wide range of sources and situations. There is a common corpus design that is being used by every compilation team, and a common scheme for grammatical annotation, thus ensuring compatibility between the corpora. The site describes the corpus design and annotation schemes and provides information about the different ICE teams, including information about the different varieties of English, bibliographical references and related links. As of January 2008, the following corpora are available for download: Hong Kong; East Africa; India; Philippines; and Singapore. The corpora from Great Britain and New Zealand are available on CD. Sample sound files can be found on the website.
International Journal of Corpus Linguistics (IJCL) is a scholarly periodical publishing new contributions in the growing area of corpus linguistic research. Corpus linguistics provides computational methods for extracting linguistic knowledge on the basis of systematic empirical analysis of naturally occurring language. The IJCL home page offers an overview of the journal and guidelines for contributions, as well as contents and abstracts. Additional information about the journal, subscription information and a link to IngentaJournals (giving online access to subscribed users) is available on the page. Contents are available online from volume 1, issue 1, 1996, while contents and abstracts are available from volume 5, issue 1, 2000.
The IntraText Digital Library website makes available hundreds of texts in over 35 languages, ranging from Sardinian to Tetum. The works cover a wide range of subjects, including literary, religious, philosophical, legal and scientific works; the religious section alone contains over 3,000 texts. There is a range of corpora of use to students of linguistics. The works are presented in hypertext form with concordances, and the site can be searched by language, author, or title. The site is a wonderful resource for linguists and students of literature or cultural studies.
The ItalNet project consists of two major collections of interest to linguists: the Opera del Vocabolario Italiano and FIOLA, the Franco-Italian online archive. In addition to these, the website provides links to: the International Gramsci Society home page and online journal; the inventory catalogue of the drawings in the Biblioteca Ambrosiana, Milan; and the website of the exhibition Renaissance Dante in print (1472-1629). The Opera del Vocabolario Italiano is a database of early Italian writing, including works written before 1375 (the year of Boccaccio's death). It currently contains approximately 2,000 documents, including the prose and poetry of Dante, Petrarch, Boccaccio, and other less famous poets, and also merchants' records and medieval chronicles.
The collection totals over 21 million running words, and around 480,000 distinct lexical forms. The texts have been classified by genre, and information is also available on their date of composition and linguistic area. The collection is available as a searchable database over the Internet, provided the user is registered with ItalNet, or they are accessing the database via an ARTFL subscribing institution. One can search for single and multiple words and phrases across the whole collection, or limit searches to single authors and works, time periods and linguistic area. Results are available as detailed concordance or keyword-in-context (the latter showing a single line of text only for each occurrence). For each occurrence, an abbreviated reference is given (indicating page numbers), and a full bibliography is attached at the end of the results. In addition, results can also be obtained as a table listing the number of occurrences of the keyword/phrase and the reference, in descending order of popularity. This is expressed as a simple count rather than a percentage. Depending on one's Web browser, one may print off or save results as HTML or plain text files. It is not possible to access the full-text of any single work contained in the Opera del Vocabolario Italiano.
FIOLA - the Franco-Italian online archive - is a new and at present very small collection of texts written in a mix of French and Italian. It currently contains only two documents, 'La Guerra di Attila' and 'l'Entrée d'Espagne', though further texts are being prepared for inclusion. It will concentrate on works written between the 12th century and the Renaissance. The collection is available as a searchable database: one can look for words or phrases, and results give a count of all occurrences plus a concordance. It is possible to browse through the full texts of FIOLA, though for this facility an ARTFL username and password must first be obtained.
The IViE corpus: English Intonation in the British Isles website provides information about the Intonational Variation in English (IViE) project and access to the IViE corpus. The project examined cross-varietal and stylistic variation in English intonation, and was funded by the ESRC. It ran between 1997 and 2001 at the Phonetics Laboratory, University of Oxford and the Department of Linguistics, University of Cambridge. The corpus created by the project includes 36 hours of speech recordings of nine urban varieties of English (London, Cambridge, Cardiff, Leeds, Bradford, Liverpool, Belfast, Dublin). Three of the varieties represent the speech of ethnic minority groups. The recordings were collected among 16-year-olds in secondary schools and represent several different speaking styles. Part of the corpus has been prosodically transcribed. The corpus is freely available for academic research and teaching purposes and can be downloaded from the website or searched online. Information about the corpus and the research based on it can be found on the website. A number of the publications by the project can be accessed online. The corpus can also be ordered via the Oxford Text Archive (OTA) website (formerly part of the Arts and Humanities Data Service (AHDS)), on completion of a request access form.
JRC-Acquis Multilingual Parallel Corpus is a collection of European Union legal texts in 22 of the member states' languages, aligned and encoded in XML, providing an invaluable tool for linguistic research and a resource for computational linguistic applications. The corpus consists of a selection of texts from the Acquis Communautaire (AC), the total body of European Union (EU) law applicable in the EU member states, and contains some 636 million word tokens. The languages included are: Bulgarian; Czech; Danish; Dutch; English; Estonian; Finnish; French; German; Greek; Hungarian; Italian; Latvian; Lithuanian; Maltese; Polish; Portuguese; Romanian; Slovak; Slovene; Spanish; and Swedish. The language pairs have been aligned automatically, using two different sets of software, and have not been proofread by humans. The texts are legal documents from different countries expressing EU legislation, and are thus not necessarily translations of each other. For example, the sub-corpus of aligned Finnish and Maltese texts most likely consists not of direct translations of each other but of translations or interpretations of a separate original text; they are nevertheless parallel texts useful for translation studies or comparative studies. The complete corpus - as separate texts in different languages or as aligned language pairs, in two versions - is downloadable from the site. In addition there is a bibliography of publications concerning the project, some of which are downloadable as PDF files. This makes this a valuable tool for anyone interested in translation studies, comparative linguistics or European languages in general.
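Working with such aligned XML is straightforward with a standard parser. The Python sketch below reads a hypothetical aligned-pair file (the element names here are invented for illustration; the real JRC-Acquis distribution format may differ) and prints the sentence pairs:

    import xml.etree.ElementTree as ET

    # Hypothetical aligned-pair markup, for illustration only.
    sample = '''<alignment langs="en-de">
      <pair>
        <src>This Regulation shall enter into force.</src>
        <tgt>Diese Verordnung tritt in Kraft.</tgt>
      </pair>
    </alignment>'''

    root = ET.fromstring(sample)
    for pair in root.iter('pair'):
        print(pair.findtext('src'), '|', pair.findtext('tgt'))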
The KRYS I corpus is a set of some 6,300 documents, in PDF format, collected by students at the University of Glasgow. The documents have been classified into 70 different genres within ten broader categories. The students were each given one of these genres and were told to collect up to 100 documents available on the Web within that genre; the sampling itself thus becomes part of the research. Some 5,300 documents were reclassified by independent researchers, and for a substantial part of the documents the genre assigned differed between the initial and the secondary classification. The different classifications are included in the metadata associated with the documents. The site contains information about the sampling and the methods used when collecting the data. The corpus is available for research purposes, subject to the demands of the copyright holders. This is a unique resource, valuable for researchers and students in, for example, the areas of automatic text classification; text mining; and pattern recognition.
The Lampeter Corpus of Early Modern English Tracts is a collection of non-literary prose texts covering the period between 1640 and 1740. The period is enclosed between the outbreak of the Civil War in 1642 and the beginnings of the Industrial Revolution in the 18th century, and is marked by the standardisation of British English. The corpus consists of 120 texts (tracts and pamphlets), subdivided into ten decades and six domains: religion; politics; economy; science; law; and miscellaneous. Each domain is represented by two texts in each decade, and the total comes to 1.1 million words. The texts are encoded according to the guidelines of the Text Encoding Initiative (TEI), using Standard Generalised Markup Language (SGML). They are available free of charge for scholarly research and are aimed at linguists and historians.
This corpus consists of two collections of seventeenth-century English "newsbooks". Both were drawn from the Thomason Tracts collection, which is held at the British Library and available in graphical form via Early English Books Online (EEBO). The construction of these electronic versions was in both cases funded by the British Academy. The newsbooks cover a wide range of news (especially foreign and political news) from the time of their publication, the first five and a half months of Oliver Cromwell's rule as Lord Protector. Important contemporary events include Glencairn's Rebellion in Scotland, the negotiation of a peace with Holland, and Queen Christina of Sweden's abdication. This resource is available via the Oxford Text Archive (OTA) website, and can be downloaded as a zipped file in XML format.
The website of the Lancaster Speech, Thought and Writing Presentation Written Corpus describes two aspects of the project, whose focus is the investigation of the nature of speech, thought and writing presentation in narrative texts. The project is divided into the Spoken Corpus and the Written Corpus. The Spoken Corpus is described separately and features an online handbook to accompany the corpus; users can listen to, or view, a sample of the Spoken Corpus, and there is a list of publications and a slide presentation. The Written Corpus is likewise accompanied by a handbook, a sample, and a list of publications. The project received funding from the Arts and Humanities Research Board (AHRB) within the Research Grants scheme. The resource can also be downloaded from the Oxford Text Archive (OTA) website (formerly part of the Arts and Humanities Data Service (AHDS)).
The Language Technology Activities in the Web Technology Sector website contains information about the activities of the Joint Research Centre (JRC) of the European Commission. The idea is to use language technology to overcome the language barriers between different European languages and to combat the information overload encountered on the Web, and to that end the centre works with document analysis and retrieval systems. The website contains information about different tools and methods for document analysis and retrieval, along with reports and articles in the area in PDF format. To facilitate this research two important language resources have been created, both built up from parallel texts from the European Commission. The JRC-Acquis is a large aligned parallel corpus containing parallel texts in 22 languages. The DGT-TM translation memory is a collection of translations between languages; although smaller and more limited than the JRC-Acquis, most of the alignments in the translation memory have been manually corrected. Both resources are freely downloadable from the website in XML format, with information about the encoding. Although the site is not easy to navigate, it contains some very useful information and resources and will be of value to researchers and students in computational and corpus linguistics, especially in the area of parallel corpora and translation studies.
A Linguistic Atlas of Early Middle English, 1150-1325 (LAEME) is an interactive online atlas, designed to enable regional and chronological linguistic study of English during this period. The Atlas complements the printed 'Linguistic Atlas of Late Mediaeval English' (LALME), which covers the period immediately following that of LAEME. Resources provided as part of the atlas include: a comprehensive introduction to the atlas, its contents and uses; a corpus of lexico-grammatically tagged texts (in a searchable database); a database of information regarding LAEME corpus sources; information on the software used by LAEME, with instructions on concordancing, dictionary-making and map-making; and a corpus of etymologies and changes. Searches are performed mainly via 'task' buttons, which bring up search fields relevant to particular interests, namely: mapping; concordancing; timetables; tagged texts; and dictionaries. The maps illustrating regional word usages are particularly useful for those researching the origins of a particular work or manuscript. LAEME is designed specifically as a non-commercial teaching and research resource, to be cited as per a printed text. This resource would be of use both to linguists and to medievalists studying manuscripts of the period.
The Linguistic Data Consortium (LDC) is an open consortium of universities, companies and government research laboratories. It creates, collects and distributes speech and text databases, lexicons, and other resources for research and development purposes. The intended use of LDC-Online is to facilitate linguistic research and development; it retrieves only concordances or statistical summaries, not whole documents. The LDC's catalogue currently contains over 200 corpora of language data and continues to expand. These corpora are usually available on CD-ROM from the LDC, with full details of their contents provided on the website. To use the full range of services provided by the LDC, membership is required, details of which are provided at the site. The website requires registration in order to access all its constituent parts.
Literary and Linguistic Computing (LLC) is a quarterly journal published by the Association for Literary and Linguistic Computing. Individual subscription to the journal provides automatic Association membership. LLC focuses on the application of computing and information technology to literature and language research and teaching: digital libraries; corpus databases; electronic dictionaries; electronic publishing and teaching. The site gives access to: contents and abstracts dating back to 1986; instructions for authors; online alerting service; and links to related journals.
The Michigan Corpus of Academic Spoken English (MICASE) comprises recordings and transcriptions of nearly 200 hours (over 1.7 million words) of English spoken within academic contexts at the University of Michigan. The corpus includes spoken English from academic staff and students, native and non-native speakers, across all subject areas. The academic settings include lectures; symposia; student presentations; seminars; tutorials; and vivas. The types of discourse events include monologues, panels, and interactive sessions. The corpus is browsable and searchable by a range of metadata including: speech event type; academic faculty and discipline; participant level or role; discourse mode; gender; age; and native speaker status.
Search results are displayed as a concordance (KWIC), with the option of specifying additional attributes (e.g. gender, age) for display with each result. Browse results display a list of files matching the criteria, accompanied by information on the number of occurrences of any specified term, the recording length, and the word count of the transcript. Each transcript is accompanied by a header containing metadata, and the transcript itself is available as either an HTML or SGML file. The transcripts are not encoded with part-of-speech tags, although the SGML encoding, which is based on the Text Encoding Initiative (TEI) Guidelines, does indicate changes of speaker (with metadata), pauses, events, and overlaps.
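Concordance displays of this kind are straightforward to reproduce. The short Python sketch below prints a generic KWIC view of a keyword in running text; it illustrates the display format only and is not MICASE's own search software.

import re

def kwic(text, keyword, width=30):
    # Print each occurrence of the keyword centred in a fixed-width context.
    for match in re.finditer(r"\b%s\b" % re.escape(keyword), text, re.I):
        left = text[max(0, match.start() - width):match.start()].rjust(width)
        right = text[match.end():match.end() + width].ljust(width)
        print("%s [%s] %s" % (left, match.group(), right))

kwic("The corpus includes spoken English from staff and students; "
     "the corpus is browsable and searchable.", "corpus")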
The website provides supporting information about the project including a bibliography of research, transcription conventions, categories, and help on searching. There is also a small set of teaching materials and research presentations incorporating MICASE. The project was assisted in the transcription of digital audio files by SoundScriber software, which is also freely downloadable from the site.
The Middle English Grammar Project (MEG) is funded by the Norwegian Research Council and based at the University of Stavanger, Norway and the University of Glasgow. The eventual aim of the project is to produce a reference grammar of Middle English, based on a corpus of electronic texts. The project's website provides: an introduction to the project and its methods; a description of work currently being done by project members; a list of related sites; a list of related publications by project staff; news and contact information. The site also gives access to HTML and PDF versions of the MEG corpus of electronic texts, which can be browsed by dialect region. This site would be of interest to those studying linguistics or Middle English.
Mike Scott's Web is the homepage of Mike Scott and contains links to software developed by him. Most important is the link to WordSmith Tools, a very powerful and useful set of programs for analysing texts and text corpora. It supports a variety of input formats, does not require pre-indexing of the texts, and its settings can be adjusted to take account of text characteristics and tagging; a very wide range of languages and character sets are supported. The software can create concordances, word lists and sets of prominent keywords, and there are also utilities for text manipulation and processing. The website also features wordlists derived from large reference corpora, which can be used with the KeyWords tool to identify words which occur with an unexpected frequency in a given text. There is an FAQ and installation instructions, plus news and updates. The user can download a demo version of the program from this site, which has full functionality except for a limit on the number of query results returned. A licence for the full version can be obtained from Oxford University Press.
The MPQA Releases - Corpus and Opinion Recognition System website contains information about, and gives access to, the MPQA (multi-perspective question answering) opinion corpus, a collection of news articles annotated for opinions and sentiments. The corpus is annotated with a scheme that encodes the opinions and sentiments expressed in the texts in terms of contextual polarity. The site contains information about the corpus and the instructions used for annotating it. The corpus itself is freely available and a request to download the texts can be sent from the webpage; a lexicon is downloadable directly from the page. In addition, the website enables the user to request OpinionFinder, a computer program that automatically identifies subjective sentences as well as various aspects of subjectivity within sentences. This website is a useful resource for researchers and students of corpus linguistics, computational linguistics and semantics.
The New Zealand English (NZE) website is a compilation of materials provided by a number of researchers on various aspects of English as spoken in New Zealand. The site has a simple, clear layout and provides articles on the origins, social variation and sounds of NZE. An extensive bibliography on NZE is also provided. The research projects listed include: the NZE Dictionary Centre; Corpora of NZE; the NZE Journal; the 'Origins of NZE Project'; the English On-line Project (resources for teaching); and Evaluating English Accents Worldwide. Brief descriptions of the projects and contact details for further information are also provided.
The website associated with the Newcastle Electronic Corpus of Tyneside English (NECTE) describes a project aiming to improve access to, and promote the re-use of, dialect recordings made in the Newcastle conurbation between 1969 and 1994. The original corpus consisted of 86 loosely structured interviews, most of which were subsequently phonetically and orthographically transcribed. Interviewees were drawn from a sample of the population of Gateshead in North-East England, spanning various social classes and age groups, and were encouraged to talk about their life histories and their attitudes to the local dialect. The more recent corpus (the ESRC-funded Phonological Variation and Change in Contemporary Spoken English), recorded in the early 1990s, set out to examine salient patterns of phonological variation and change in contemporary spoken British English, focusing on localised versus non-localised patterns of change. The NECTE project has amalgamated the two corpora and created the first TEI-conformant electronic vernacular corpus, in a range of formats (sound files as well as phonetic and orthographic transcriptions that are also part-of-speech tagged). The site provides documentation about the original resources and the NECTE team's enhancement of them, information about the people involved, publications resulting from the project, references, links, and appendices. The transcriptions and audio files themselves are not accessible online. The site should be of use to anyone interested in the Geordie dialect, linguistics, sociology and sociolinguistics, and to members of the local public interested in changes in Tyneside expressions, folklore and reminiscences. The project was funded by the AHRC under its Resource Enhancement scheme. The resource can also be downloaded in XML format from the Oxford Text Archive (OTA) website (formerly part of the Arts and Humanities Data Service (AHDS)).
OLAC : Open Language Archives Community is the website of an international partnership of institutions and individuals who are creating a worldwide virtual library of language resources by: developing consensus on best current practice for the digital archiving of language resources; and developing a network of interoperating repositories and services for housing and accessing such resources. This website is useful for those searching for language resources, such as: language corpora; linguistic databases; and language documentation. It is also of prime importance for those interested in how to describe language resources and how to share those descriptions. The site contains descriptions of the OLAC community and the various technical standards which it is developing. There are links to search interfaces which can be used to search the participating repositories for language resources.
Online Corpora is a collection of large-scale linguistic corpora, compiled by Professor Mark Davies at Brigham Young University. The corpora freely available on the site are: the Corpus of Contemporary American English (COCA); the British National Corpus (BNC, not compiled by Professor Davies); the TIME Magazine corpus; the Corpus del Español; and the Corpus do Português. The site allows the corpora to be searched for words and phrases, with the results displayed as keywords in context, where it is possible to expand the context further. A word can be qualified by its part of speech to limit the results further. The search engine and interface are easy to use, and this is a valuable tool for anyone interested in linguistic research.
Paris Speech in the Past is a collection of semi-literary representations of vernacular French speech from the 16th to 19th centuries, preceded by a set of tax rolls from late 17th-century Paris. The material can be downloaded as a zipped collection of RTF documents from the Oxford Text Archive website (formerly part of the Arts and Humanities Data Service (AHDS)). Access to the files is free of charge, although users are requested to agree to a brief statement of terms and conditions.
The Parsed Corpus of Early English Correspondence (PCEEC) consists of 4,970 letters from 84 different collections, written between 1410 and 1695, and contains some 2.2 million word tokens in total. The corpus was compiled by the Sociolinguistics and Language History project team at the Department of English, University of Helsinki. The corpus is part-of-speech tagged and syntactically annotated, and the website gives information about the different tagging schemes used. The corpus is designed to be compatible with CorpusSearch, a suite of search tools designed by Beth Randall at the University of Pennsylvania. This is a valuable resource for anyone researching or studying the development of the English language. The corpus is distributed by the Oxford Text Archive (OTA) (formerly part of the Arts and Humanities Data Service (AHDS)) and may be used, subject to copyright restrictions, on completion of a request access form.
The Penn-Helsinki Parsed Corpus of Historical English is a corpus of prose text samples of Middle English, Early Modern English and Modern British English. The corpus has been annotated for syntactic structure, allowing the user to search for syntactic features as well as text strings. The corpus is available to institutions on a subscription basis, and this website describes the corpus and provides instructions for its use. The corpus contains a total of 1.3 million words, from over 50 text samples, each of which is given in three forms: a text file, a part-of-speech tagged file and a parsed file. There is also an additional file with philological and bibliographical information about each text. The website provides: a condensed version of the annotation manual; provenances for each of the texts used in the corpus; instructions for the search engine; and, in PDF format, the complete manual for the CorpusSearch program.
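Searching such a corpus means matching labelled brackets rather than plain strings; queries against the real corpus are written for the CorpusSearch program mentioned above. Purely as a generic illustration of the idea (not CorpusSearch syntax), the following Python sketch extracts every subtree with a given node label from a Penn-style bracketed parse.

def find_nodes(parse, label):
    # Return the bracketed substrings rooted at nodes whose label starts
    # with `label` (note: "(NP" also matches "(NP-SBJ").
    hits, i = [], 0
    token = "(" + label
    while (i := parse.find(token, i)) != -1:
        depth = 0
        for j in range(i, len(parse)):  # walk to the matching close bracket
            if parse[j] == "(":
                depth += 1
            elif parse[j] == ")":
                depth -= 1
                if depth == 0:
                    hits.append(parse[i:j + 1])
                    break
        i += len(token)
    return hits

tree = "(IP-MAT (NP-SBJ (PRO he)) (MD shall) (VB come))"
print(find_nodes(tree, "NP-SBJ"))  # ['(NP-SBJ (PRO he))']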
The Gutenberg Project contains a collection of thousands of German literary classics in full-text versions. The novels are ordered alphabetically by author and there is also a search facility. The site provides a brief description of each author and their works, not all of which are present in the archive. As well as novels, the collection contains 20,000 poems and over 2,000 fairy tales and fables. The site also has useful links to several other online literature projects in a variety of different European languages, including Swedish, Finnish, Dutch, and Norwegian. The site is updated regularly as new works are added to the collection and is a valuable resource for German-language texts. The Project is part of Spiegel Online.
QAMUS is a website created by the lexicographer Tim Buckwalter which describes the procedures of Arabic lexicography, including: compiling a corpus; producing token frequencies; concordancing; and morphological parsing. The site provides detailed information on all these topics and on the approaches required, in particular, for word identification and morphological analysis in Arabic. The site's guidelines are supplemented by examples, tables and concordance files for illustration. The text of the website is mainly in English, with some Arabic words and transliterated texts. Although the transliterated texts might be difficult for the novice user to read, the table of the transliteration system provided on the site will help to identify corresponding letters which cannot be identified immediately. This is a very useful site for specialists in Arabic lexicography and Arabic natural language processing.
Research and Development Unit for English Studies (RDUES) is the website of a research unit, based at the University of Central England, which consists of a team of corpus linguists and statisticians engaged in developing electronic databases and tools for the description of modern English language in use. Since the Unit's inception in 1989, work has progressed on various projects, all of which are summarised on the website. These have included: Neologisms in Journalistic Text; Analysis of Verbal Interaction and Automated Text Retrieval (AVIATOR); Automatic Collocational Retrieval of NYMs (ACRONYM); Analysis and Prediction of Innovation in the Lexicon (APRIL); System of Hypermatrix Analysis, Retrieval, Evaluation and Summarisation (SHARES); and WebCorp, a suite of tools for accessing the World Wide Web as a corpus. Most of the databases are not directly accessible from this site, although demonstration entries are provided in some instances. The WebCorp search engine is, however, publicly available. The site includes a bibliography of RDUES publications, some of which are available online. Many of the project description pages are also accompanied by more specific bibliographies.
This Web page describes the project Recent Grammatical Change in British and American English: A Corpus-based Approach, conducted by Professor Geoffrey Leech of the University of Lancaster. The project aims to chart and analyse changes in the frequency of use of the English language within the thirty-year period 1961-1991. The site lists his publications on the subject, and describes the Brown family of corpora of British and American written English used in the project: the Brown Corpus, the LOB (Lancaster-Oslo/Bergen) Corpus, the FROWN (Freiburg-Brown) Corpus and the FLOB (Freiburg-Lancaster-Oslo/Bergen) Corpus. The focus is on areas of change in the usage of modal auxiliaries, semi-modals, aspect, tense and mood, and other areas such as noun phrase categories, questions and punctuation. The findings are described on the site and compared with provisional findings regarding spoken English. The project received a Research Grant from the Arts and Humanities Research Board (AHRB).
The Scottish Corpus of Texts and Speech (SCOTS) project aims to create a collection of audio and visual material and texts in electronic form relating to language use in Scotland (Scots and Scottish English, as well as other community languages). The constantly evolving corpus is available online and is intended to present a linguistic picture of contemporary Scotland. It contains over 1,100 documents and over 4 million words. The documents collected date from 1945 onwards, and most of the spoken texts have been recorded since 2000. A search and browse facility is provided. The project, based in the School of English and Scottish Language and Literature at Glasgow University, is funded by the Arts and Humanities Research Council (AHRC). Aside from the corpus itself, the website provides basic background information on the project, details of the people involved, links to related sites, and an opportunity to suggest texts or receive further information on the project. This is a valuable site for students and researchers of Scots language and literature, but would also be of use to anyone with an interest in Scottish language or culture.
The online resource SCRIBE - Spoken Corpus of British English provides information on a pilot project that 'investigated the construction of a corpus of spoken British English'. The project ran in the academic year 1989/90 and was funded by the UK Department of Trade and Industry and the UK Science and Engineering Research Council. Research was facilitated by a partnership between University College London, Cambridge University, Edinburgh University, the Speech Research Unit, and the National Physical Laboratory; the resource is part of the UCL website. Despite the project's short duration, resulting from a shortage of funding, a substantial prototype corpus was collected and partially annotated. The resource describes the current status of the project and provides its existing documentation in 'The SCRIBE Manual', which can be viewed online in HTML format. There are also samples of annotated audio recordings which can be downloaded, grouped into two categories: samples of many-talker recordings and samples of few-talker recordings. Both categories provide recordings of male and female speakers, representing four dialect areas: South East, Glasgow, Leeds and Birmingham. This resource will be of interest, and use, to researchers of spoken English and corpus linguistics.
SMULTRON (Stockholm MULtilingual TReebank) is a parallel treebank initially developed by the Computational Linguistics Group at the Department of Linguistics, Stockholm University. It contains around 1,000 sentences from a novel and some reports, in three languages (Swedish, English and German), which have been part-of-speech tagged and syntactically annotated. The site contains a list of publications regarding the project, with the articles downloadable as PDF files, and information about the project and the tagging schemes and methods used. To obtain the corpus, free of charge, an application form can be filled in on the site. The site is somewhat confusing, but this is still a valuable resource for anyone interested in Swedish, English or German grammar or in translation studies.
Språkbanken (the Swedish Language Bank) holds a large, comprehensive electronic corpus of written Swedish, based on works of fiction, legal texts, official reports and daily newspapers. The corpus runs to about 75 million words. Information is given for all words on frequency of use through the whole corpus and within individual texts, and the encoding includes part-of-speech annotations for all words. The collection is searchable through the Web or Telnet. Individual words can be analysed, and there is limited support for multiple-word and phrase searching. Results show frequency of use, the keyword in context (the length of surrounding text can be altered), and citations. Results can be saved as a file and obtained through anonymous FTP from the Bank of Swedish. Included in the corpus are the collected works of C. J. L. Almqvist, C. M. Bellman, and August Strindberg. There is also a historical corpus of Old Swedish, consisting of about 2 million words, and a corpus of 19th-century novels consisting of about 3.7 million words. This is a valuable resource for anyone interested in Swedish or corpus linguistics.
The Survey of English Usage is one of the first corpora of the English language, started in 1959 by Prof. Randolph Quirk. It contains a million-word corpus of written and spoken English collected between 1955 and 1985, originally held on paper and now computerised. The project was continued by Prof. Sidney Greenbaum, director until 1996, and Dr Bas Aarts, its current director. A new corpus, the International Corpus of English (ICE), was started in 1990, with twenty centres around the world each preparing a million-word corpus of their own national or regional variety of English. The first to be completed is the British English corpus, ICE-GB, which is downloadable or available on CD-ROM. The site offers an excellent introduction to the project, as well as links to other projects; an extensive bibliography of books and articles using materials from the Survey; a list of MA and PhD theses written in the English department of University College London; annual reports; and links to other linguistic resources.
Text Analysis Portal for Research at the University of Alberta (TAPoR) is a project that aims to collect and make available tools for text analysis. It functions as a portal for a large set of useful computer tools that allow the researcher to visualise and analyse any plain, HTML or XML text. One important idea is that the portal should give the user access to these tools without their having to download and install software on their own computers; the tools of TAPoR are freely available online. The site is somewhat confusing, but contains not only the tools but also tutorials and instructions for their use, and a set of texts called 'recipes', which function as instructions for a wide range of research methods. This site contains invaluable tools for anyone interested in text analysis, whether for linguistic or for literary studies.
Text Corpora and Corpus Linguistics is a useful resource for people interested in working with text corpora. It does not contain original material but offers an extensive list of links to other sites: corpora of various types and languages; software for text analysis and tagging; courses in corpus linguistics; bibliographies; and online papers and materials on corpus linguistics. The site is not particularly user-friendly for the novice researcher, but may be useful for the more experienced one, who will uncover some new sources of information.
The TITUS (Thesaurus Indogermanischer Text- und Sprachmaterialien) project Web page is a multilingual online text retrieval system for Indo-European languages. The project started in 1987 with the creation of a digital collection in ancient Indo-European languages. The site contains texts in the following language families: Vedic; Sanskrit; Middle and Modern Indic; Old, Middle, and Modern Iranian; Anatolian; Tocharian; Armenian; Baltic; Slavic; Germanic; Greek; Italic; Celtic; Caucasian; Uralic; Proto-Cretan; Semitic; and Dravidic. Some material needs special software, which is freely available from the site. The site also makes available: teaching material, such as detailed language maps and audio materials; news related to the area of study; an FAQ section; information about jobs in this area of research; an events diary; links to external related projects and institutions; Indo-European courses, mainly in Germany and Austria; and a bibliography. Technical information, such as Unicode documentation and relevant software, is also available from the site. A number of the texts may be of interest to scholars of religion, including a selection of Buddhist and Hindu works, Avestan (Zoroastrian) texts, and multiple Bible versions, including the Septuagint (the Greek translation of the Old Testament). The user should note that the site uses split frames, which can sometimes complicate navigation.
This website offers an interface to the text of Time magazine from 1923 to the present day, over 100 million words in all. Users can search for a word or phrase and retrieve all instances of the string in context. Searches can be restricted to a particular period, and they may include information about part of speech (word class). The results can be displayed in different ways, allowing the user to see, for example, how the frequency of a word has changed over time. The interface also allows for the retrieval of collocates (words that occur near the search term). The resource offers a powerful way to explore the English language as published in Time magazine over the years. The interface is easy to use, and the accompanying help texts provide ample information about how to use it, as well as suggestions of the kinds of questions that can be answered using the tool. This resource would be of use to anyone interested in the English language, language change, American English, or corpus linguistics. It also offers a valuable tool for looking at cultural and historical events as reported in Time magazine.
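The idea behind collocate retrieval can be pictured in a few lines of code. The Python sketch below counts words within a fixed window either side of a node word; it illustrates the general technique only, since the website computes collocates server-side with its own settings.

from collections import Counter

def collocates(tokens, node, window=4):
    # Count words within `window` tokens either side of each node word.
    counts = Counter()
    for i, tok in enumerate(tokens):
        if tok == node:
            lo, hi = max(0, i - window), min(len(tokens), i + window + 1)
            counts.update(t for j, t in enumerate(tokens[lo:hi], lo) if j != i)
    return counts

text = "the economy grew while the economy of europe slowed".split()
print(collocates(text, "economy").most_common(3))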
Treebank Wiki is an online collection of links to treebanks: text corpora marked up with syntactic structure, intended as analytical material in corpus linguistics. The treebanks listed are in a variety of languages, including: English; Portuguese; Catalan; Spanish; Danish; Dutch; Czech; Romanian; Russian; Slovenian; and Italian. Many are available in the graphical format eGXL. The wiki also includes links to: the SFB 673-X1 project, which deals with multimodal alignment corpora; and Indogram, a project which examines the automatic induction of probabilistic document grammars as models of web genres.
This is the website for the Tycho Brahe project, based at the University of São Paulo. The project aims to research the relationship between prosody and syntax in the process of language change that led from Classical Portuguese to Modern European Portuguese. As well as linguistic and mathematical research, the project is producing the Tycho Brahe Parsed Corpus of Historical Portuguese and a Comparative Tagged Corpus of Spoken Modern European Portuguese and Brazilian Portuguese. The former comprises texts written by Portuguese authors between 1550 and 1850, made available electronically for educational and research purposes; the user must complete an access-request form to download the texts, and a link to the corpus is available through this site. The latter consists of categorised recorded registers of speakers of both dialects. The main website features all the papers written as part of the project between 1998 and 2003, downloadable in PDF or Word format, with abstracts also available. The user may also access details of the sub-projects in HTML, and information about the project's meetings and seminars. Although this site seems no longer to be updated, it will be of interest to anyone working within the field of Portuguese linguistics.
The UAM (Universidad Autonoma de Madrid) CorpusTool site gives access to downloadable software that enables the annotation of text corpora. The tool allows the creation of an annotation scheme within the framework of Systemic Functional Linguistics, which may then be used to annotate texts. The annotation is done in XML and is saved in separate files, which allows for overlapping analyses, and uses system networks to describe the texts. The software allows searches of the texts according to features coded in the analysis. The site contains a manual for running the software and for designing and using annotation schemes. This tool is useful for researchers and students who need to annotate and analyse texts according to the theory of Systemic Functional Linguistics.
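The stand-off arrangement can be pictured with a minimal Python sketch: the annotation lives in a separate XML file and points into the untouched source text by character offsets, which is why two analyses can overlap freely. The element names and feature labels below are invented for illustration and do not reproduce the UAM CorpusTool's actual file format.

import xml.etree.ElementTree as ET

text = "The chairman thanked the committee warmly."

standoff = """<analysis layer="transitivity">
  <segment start="0" end="42" features="clause"/>
  <segment start="13" end="20" features="process"/>
</analysis>"""

for seg in ET.fromstring(standoff).iter("segment"):
    start, end = int(seg.get("start")), int(seg.get("end"))
    print(seg.get("features"), "->", repr(text[start:end]))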
This is the website of UCREL, the University Centre for Computer Corpus Research on Language, a research centre based in the Department of Computing and the Department of Linguistics and English Language at Lancaster University. The research group is dedicated in particular to corpus linguistics (the analysis of large bodies of text). The resource offers details of the centre's research; related current, past and forthcoming events; technical online papers (articles dealing with corpora and computational linguistics, and corpus manuals) which can be downloaded as PDFs; a list of publications; and corpus annotation. The website also provides links to the ACL Anthology (a digital archive of research papers in computational linguistics), to further corpus software developed at Lancaster University, and to other relevant corpus linguistics sites, resources and tools.
The corpus consists of 1,489 essays written by 440 Swedish university students of English at three different levels, the majority in their first term of full-time studies. The total number of words is 1,221,265, which means an average essay length of 820 words. The most typical essay, from the first term, is somewhat shorter, averaging 777 words. The resource is available via the Oxford Text Archive (OTA) website, as a zipped plain text file.
This website presents a corpus of Medieval Welsh prose (from 1350 to 1425) including around 1.8 million words in 100 texts from 28 manuscripts. The corpus is fully searchable, and includes word lists which can be filtered by language and manuscript. Additionally, texts have been categorised by genre (Astronomy; Genealogy; Geography; Grammar; History; Law; Mabinogion; Medical; Religious; Romance; Wisdom). The project has been funded by the Arts and Humanities Research Council (AHRC).
WordHoard is a project based at Northwestern University consisting of a tool for the close reading and scholarly analysis of deeply tagged texts. In addition, the site contains several corpora of tagged texts designed to be used with the tool: Early Greek Epic (Homer, Hesiod, and the Homeric Hymns in the original Greek, with English and/or German translations); the complete Chaucer; all the poetical works of Spenser; and all plays and poems by Shakespeare. The website consists mainly of an extensive manual and documentation of the project, which explains how to use the WordHoard tool and discusses research methods and possibilities. The texts available are lemmatised and tagged for parts of speech, and have markers for speaker name, speaker gender and speaker mortality. This is a very useful tool for systematic research on deeply tagged texts and, in addition, serves as a discussion of corpus linguistic research methods within literary research.
The York Poetry Corpus is an annotated selection of Old English poetic texts from the Helsinki Corpus of English Texts. It contains 71,490 words; the size of the corpus is approximately 2.5 megabytes. It was funded by an ESRC grant. The York Poetry Corpus is part of a larger project aiming to produce syntactically annotated corpora for all stages of the English language, and is intended for students and scholars studying the history of the English language. The corpus is freely available for educational and research purposes; viewing the manuals is unrestricted, but the texts themselves may be viewed only after filling out an access request form. The York Poetry Corpus can also be ordered via the Oxford Text Archive (OTA) website (formerly part of the Arts and Humanities Data Service (AHDS)), upon completion of a request form.