PAROLE-DK and ePAROLE

The resources were prepared by Ole Norling-Christensen, Britt-Katrin Keson, Jørg Asmussen and others.

Aim of the PAROLE project

The aim of the PAROLE project (1996-1998) was to build and publish comprehensive, basic and reusable written-language resources for all EU languages. This was to take the form of:

General-language text corpora of 20 million words for each of the following 14 languages: Danish, English, Finnish, Flemish, French, Greek, Dutch, Irish, Italian, Catalan, Norwegian, Portuguese, Swedish and German;
Language technology lexicons with 20,000 lemmas for each of the following 12 languages: Danish, English, Finnish, French, Greek, Dutch, Italian, Catalan, Portuguese, Spanish, Swedish and German.

The particular value of the resources lay not only in their size and the number of languages they covered, but first and foremost in the fact that they were built according to common standards and specifications.

Text corpora

The text corpora were compiled and annotated according to the same guidelines:

The texts were to be selected on the basis of agreed common criteria; for example, the texts had to have been produced after 1970, and the following publication media had to be represented: book, newspaper, periodical and miscellaneous;
all texts were to be annotated in the same markup format (PAROLE DTD), covering both bibliographical information and text structure down to paragraph level;
a 250,000-word subset of the corpus was to be annotated with morphosyntactic information using a common PAROLE tagset, which was to be essentially the same for all languages, though with room for a number of tags marking features specific to a single language. The Danish version of this part of the corpus is publicly available and can be downloaded from this page.

Language technology lexicons

The lexicons are harmonised according to a common model developed for the purpose (the PAROLE model), which makes it possible to encode morphological and syntactic information for all the languages involved. All the lexicons are thus built according to the same design principles and linguistic specifications and use the same format.

Project participants

Center for Language Technology (Denmark)
Centro de Linguistica da Universidade de Lisboa (Portugal)
Department of General Linguistics, University of Helsinki (Finland)
Department of Swedish, Språkdata, University of Gothenburg (Sweden)
Det Danske Sprog-og Litteraturselskab, DSL (Denmark)
Erli (France)
Fundacion Bosch Gimpera, Universitat de Barcelona (Spain)
Institut d'Estudis Catalans (Spain)
Institut für Deutsche Sprache (Germany)
Institut National de la Langue Francaise, CNRS (France)
Instituut voor Nederlandse lexicologie (now Instituut voor de Nederlandse taal) (The Netherlands)
Institute for Language, Speech and Hearing, University of Sheffield (United Kingdom)
Institute Teangelaiochta Eireann (Ireland)
Instituto de Engenharia de Sistemas e Computadores (Portugal)
Université de Liege BELTEXT (Belgium)
University of Birmingham (United Kingdom)
University of Pisa (Italy)
Institute for Language and Speech Processing, R.C. "Athena" (Greece)

Resources for download

You can freely download the following PAROLE resources by clicking on the links:

The morphosyntactically annotated corpus PAROLE-DK, comprising 250,000 words, with documentation
ePAROLE: beta version of the morphosyntactically annotated corpus PAROLE-DK, tagged with the ePOS tagset. The corpus itself is not documented yet, but the tagset is described in Design of the ePOS tagger

This work (PAROLE-DK and ePAROLE – Morphosyntactically Annotated Danish Language Corpus, by Ole Norling-Christensen, Britt-Katrin Keson, Jørg Asmussen, The Society for Danish Language and Literature, DSL), identified by The Society for Danish Language and Literature, DSL, is free of known copyright restrictions.

The corpora consist of sentences and short excerpts (quotations) in random order. They contain no complete texts.

PAROLE-DK and ePAROLE

Aim of the PAROLE Project

The aim of the PAROLE project (1996-1998) was to compile and make available large, generic and reusable written language resources for all EU languages, more specifically:

General language text corpora of 20 million words for each of the following 14 languages: Catalan, Danish, Dutch, English, Finnish, Flemish, French, German, Greek, Irish, Italian, Norwegian, Portuguese and Swedish;
Computational lexicons with 20,000 lemmas for each of the following 12 languages: Catalan, Danish, Dutch, English, Finnish, French, German, Greek, Italian, Portuguese, Spanish and Swedish.

The value of these resources lies not only in the size and number of languages covered by the project, but also in the fact that they are built according to common standards and specifications.

Text Corpora

The text corpora have been compiled and annotated following the same guidelines:

Texts are selected on the basis of specified common parameters for time of production (after 1970) and proportionate representation of the textual material according to publication medium: book, newspaper, periodical and miscellaneous;
all texts are annotated using the same mark-up format (PAROLE DTD) as regards bibliographical information and text structure (annotation at the level of paragraph);
a subset of the corpus (250,000 words) is morphosyntactically annotated according to a common core PAROLE tagset, extended with a set of language-specific features.
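
As a rough illustration of the selection parameters above, the following Python sketch filters candidate texts by production year and publication medium, and works out how large the annotated subset is relative to a 20-million-word corpus. The `TextRecord` class, its field names and the sample records are hypothetical and are not taken from the PAROLE DTD. Reading "after 1970" as strictly later than 1970 is also an assumption, and the sketch does not model the proportionate representation of media.

```python
from dataclasses import dataclass

# Publication media named in the guidelines above.
MEDIA = {"book", "newspaper", "periodical", "miscellaneous"}

CORPUS_WORDS = 20_000_000   # target corpus size per language
ANNOTATED_WORDS = 250_000   # morphosyntactically annotated subset


@dataclass
class TextRecord:
    """Hypothetical bibliographic record; field names are assumptions."""
    title: str
    year: int
    medium: str


def is_eligible(record: TextRecord) -> bool:
    """Check the two common parameters: produced after 1970, listed medium."""
    # Assumption: "after 1970" is read as strictly later than 1970.
    return record.year > 1970 and record.medium in MEDIA


def annotated_share() -> float:
    """Fraction of a 20-million-word corpus covered by the annotated subset."""
    return ANNOTATED_WORDS / CORPUS_WORDS


# Made-up sample records, for illustration only.
candidates = [
    TextRecord("A", 1965, "book"),
    TextRecord("B", 1988, "newspaper"),
    TextRecord("C", 1992, "radio"),
]
eligible = [r.title for r in candidates if is_eligible(r)]
print(eligible)                      # ['B']
print(f"{annotated_share():.2%}")    # 1.25%
```

Under these assumptions the annotated subset covers 1.25% of each 20-million-word corpus.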

Computational Lexicons

For the lexicons, harmonisation is achieved by developing a common model (the PAROLE model) which caters for the encoding of morphological and syntactic information in all languages; thus, all the lexicons are built according to the same design principles and linguistic specifications and are encoded in the same representation format.

Project Partners

Center for Language Technology (Denmark)
Centro de Linguistica da Universidade de Lisboa (Portugal)
Department of General Linguistics, University of Helsinki (Finland)
Department of Swedish, Språkdata, University of Gothenburg (Sweden)
Det Danske Sprog-og Litteraturselskab, DSL (Denmark)
Erli (France)
Fundacion Bosch Gimpera, Universitat de Barcelona (Spain)
Institut d'Estudis Catalans (Spain)
Institut für Deutsche Sprache (Germany)
Institut National de la Langue Francaise, CNRS (France)
Instituut voor Nederlandse lexicologie (now Instituut voor de Nederlandse taal) (The Netherlands)
Institute for Language, Speech and Hearing, University of Sheffield (United Kingdom)
Institute Teangelaiochta Eireann (Ireland)
Instituto de Engenharia de Sistemas e Computadores (Portugal)
Université de Liege BELTEXT (Belgium)
University of Birmingham (United Kingdom)
University of Pisa (Italy)
Institute for Language and Speech Processing, R.C. "Athena" (Greece)

Resources for download

You can freely download the following PAROLE resources by clicking on the links:

Morphosyntactically annotated PAROLE-DK corpus comprising 250,000 words, including documentation
ePAROLE – beta version of the morphosyntactically annotated PAROLE-DK corpus tagged with the ePOS tagset. The corpus is not documented yet; for a description of the tagset, refer to Design of the ePOS tagger

These corpus resources comprise sentences or shorter excerpts in arbitrary order. They do not contain full texts.