The zip file contains two lists, referred to below as the including list and the excluding list.
The technical format of these lists is plain text with LF newlines as used by Unix, Linux, and OS X. This may cause formatting issues in Windows-based text processors, which can often be solved by opening the list in a text editor and saving it under a new name. Each line comprises three items: the part of speech (POS) of a lemma, the lemma itself, and its frequency. These items are separated by TAB characters. The following snippet shows the first ten lines of the excluding list.
T	i	0.032249628510297
V	være	0.0309023882233708
C	og	0.029584070147617
P	en	0.0253413695013101
P	den	0.0248728148892572
T	på	0.015317332743123
T	til	0.0152345449047462
P	det	0.0147142978135353
U	at	0.0144963754376622
T	af	0.014170235977948
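The three-field line format described above can be read with a few lines of Python. The following is a minimal sketch, not an official reader: the names `Entry`, `parse_lines`, and `read_list` are hypothetical, and the UTF-8 encoding is an assumption, since the text only states that the files are plain text with LF newlines.

```python
from typing import Iterable, List, NamedTuple


class Entry(NamedTuple):
    pos: str          # POS marker, e.g. "T", "V", "P"
    lemma: str        # the lemma itself
    frequency: float  # relative frequency in the corpus


def parse_lines(lines: Iterable[str]) -> List[Entry]:
    """Parse TAB-separated 'POS, lemma, frequency' lines into entries."""
    entries = []
    for raw in lines:
        line = raw.rstrip("\r\n")
        if not line:
            continue  # skip empty lines, e.g. a trailing newline
        pos, lemma, freq = line.split("\t")
        entries.append(Entry(pos, lemma, float(freq)))
    return entries


def read_list(path: str, encoding: str = "utf-8") -> List[Entry]:
    """Read a whole list file. The encoding is an assumption."""
    with open(path, encoding=encoding, newline="") as handle:
        return parse_lines(handle)


# The first two lines of the snippet above, used as sample input.
sample = "T\ti\t0.032249628510297\nV\tvære\t0.0309023882233708\n"
entries = parse_lines(sample.splitlines())
```

Splitting on TAB only, rather than on any whitespace, keeps the three fields apart even if a lemma itself were to contain a space.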
The lemmas are tagged with one of the POS-markers from the table below. The including list contains words of any POS whereas the excluding list (ex-list in the table) only includes words of those POS marked with a dot.
|U||"unique"||at, som, der||•|
POW in the table above means 'part of word'. As hyphens and apostrophes are defined as word delimiters in the underlying corpus, some word parts occur in the including list. These belong to one of the following types:
The lemma frequency is a real number indicating the number of occurrences of all forms of the lemma in the underlying corpus divided by the size of the corpus in tokens. The underlying corpus has a size of approximately 880 million tokens and comprises text material from 1983 until 2016.
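Because the frequency is a count divided by the corpus size, it can be turned back into a rough number of occurrences or into a per-million rate. The sketch below assumes the approximate corpus size of 880 million tokens given above, so the resulting counts are estimates only; the function names are hypothetical.

```python
# Approximate corpus size stated in the text; the exact figure is not given.
CORPUS_TOKENS = 880_000_000


def estimated_occurrences(frequency: float,
                          corpus_tokens: int = CORPUS_TOKENS) -> int:
    """Estimate how often a lemma occurs: frequency times corpus size."""
    return round(frequency * corpus_tokens)


def per_million(frequency: float) -> float:
    """Express a relative frequency as occurrences per million tokens."""
    return frequency * 1_000_000


# The preposition "i" from the snippet: roughly 28 million occurrences.
i_count = estimated_occurrences(0.032249628510297)
```

The per-million form is often easier to compare across corpora of different sizes than the raw relative frequency.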
By downloading the following corpus resources, you agree to the conditions for the use of DSL's language resources. You need a password to unzip the resource files. To obtain a password, please send an email to firstname.lastname@example.org with a brief description of the purpose(s) for which you intend to use the resources.
Edited by Jørg Asmussen · 2018-07-04
The underlying corpus is the 2016 version of the BAKSPEJLET Corpus, which is used by the editorial staff of The Danish Dictionary. Learn more about the BAKSPEJLET Corpus and other corpora compiled by DSL (in Danish).