The zip-file contains two lists:
The lists are plain text files with LF newlines, as used by Unix, Linux, and OS X. This may cause formatting issues in Windows-based text processors; these can often be solved by opening the list in a text editor and saving it under a new name. Each line comprises three items: the part of speech (POS) of a lemma, the lemma itself, and its frequency. These items are separated by TAB characters. The following snippet shows the first ten lines of the excluding list.
T	i	0.032249628510297
V	være	0.0309023882233708
C	og	0.029584070147617
P	en	0.0253413695013101
P	den	0.0248728148892572
T	på	0.015317332743123
T	til	0.0152345449047462
P	det	0.0147142978135353
U	at	0.0144963754376622
T	af	0.014170235977948
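The three-column, TAB-separated layout described above can be read with a few lines of Python. The following is only a minimal sketch: the names `Entry`, `parse_lines`, and `load_list` are hypothetical helpers, and the UTF-8 encoding is an assumption, since the file encoding is not stated here.

```python
from typing import Iterable, Iterator, List, NamedTuple


class Entry(NamedTuple):
    pos: str          # POS marker, e.g. "T", "V", "C"
    lemma: str        # the lemma itself
    frequency: float  # relative frequency in the corpus


def parse_lines(lines: Iterable[str]) -> Iterator[Entry]:
    """Yield one Entry per non-empty line of 'POS<TAB>lemma<TAB>frequency'."""
    for line in lines:
        line = line.rstrip("\r\n")  # tolerate CRLF in case the file was re-saved on Windows
        if not line:
            continue
        pos, lemma, freq = line.split("\t")
        yield Entry(pos, lemma, float(freq))


def load_list(path: str, encoding: str = "utf-8") -> List[Entry]:
    """Read a whole list file. The encoding default is an assumption."""
    with open(path, encoding=encoding) as handle:
        return list(parse_lines(handle))


# Usage with the first three lines of the snippet above.
sample = (
    "T\ti\t0.032249628510297\n"
    "V\tvære\t0.0309023882233708\n"
    "C\tog\t0.029584070147617\n"
)
entries = list(parse_lines(sample.splitlines()))
for entry in entries:
    print(entry.pos, entry.lemma, entry.frequency)
```

A line with more or fewer than three fields raises a `ValueError` in this sketch, which makes malformed input visible rather than silently skipping it.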
The lemmas are tagged with one of the POS-markers from the table below. The including list contains words of any POS whereas the excluding list (ex-list in the table) only includes words of those POS marked with a dot.
|U||“unique”||at, som, der||•|
POW in the table above means ‘part of word’. Because hyphens and apostrophes are defined as word delimiters in the underlying corpus, some word parts occur in the including list. These belong to one of the following types:
The lemma frequency is a real number indicating the number of occurrences of all forms of the lemma in the underlying corpus divided by the size of the corpus in tokens. The underlying corpus has a size of approximately 880 million tokens and comprises text material from 1983 until 2016.
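Because the frequency is a proportion of the corpus, it can be converted back into an approximate occurrence count or into a per-million figure. The sketch below assumes the approximate corpus size of 880 million tokens given above, so the resulting counts are estimates only; `CORPUS_TOKENS`, `estimated_occurrences`, and `per_million` are hypothetical names introduced for illustration.

```python
# Approximate corpus size in tokens, as stated in the description (not an exact figure).
CORPUS_TOKENS = 880_000_000


def estimated_occurrences(relative_frequency: float,
                          corpus_tokens: int = CORPUS_TOKENS) -> int:
    """Estimate how often all forms of a lemma occur in the corpus."""
    return round(relative_frequency * corpus_tokens)


def per_million(relative_frequency: float) -> float:
    """Express a relative frequency as occurrences per million tokens."""
    return relative_frequency * 1_000_000


# The most frequent lemma in the excluding list, "i", has frequency 0.032249628510297.
freq_i = 0.032249628510297
print(estimated_occurrences(freq_i))  # roughly 28 million occurrences
print(per_million(freq_i))            # roughly 32,250 per million tokens
```

The per-million figure does not depend on the corpus size and is therefore the safer value to compare against other frequency lists.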
Note: By downloading resources from this site, you agree to the conditions for using them. You need a password to unzip the resource files. To obtain a password, please send an email to firstname.lastname@example.org with a brief description of the purpose(s) for which you intend to use the resources.