Most frequently used lemmas in Danish

The zip-file contains two lists:

  1. The 10,000 most frequently used lemmas of Danish, including proper nouns and numerals.
  2. The 10,000 most frequently used lemmas of Danish, excluding proper nouns and numerals.

The lists are plain text files with LF newlines, as used by Unix, Linux, and OS X. This may cause formatting issues in some Windows-based text processors; these can often be solved by opening the list in a text editor and saving it under a new name. Each line comprises three items: the part of speech (POS) of a lemma, the lemma itself, and its frequency. These items are separated by TAB characters. The following snippet shows the first ten lines of the excluding list.

T	i	0.032249628510297
V	være	0.0309023882233708
C	og	0.029584070147617
P	en	0.0253413695013101
P	den	0.0248728148892572
T	på	0.015317332743123
T	til	0.0152345449047462
P	det	0.0147142978135353
U	at	0.0144963754376622
T	af	0.014170235977948
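
The three-column, TAB-separated layout shown above can be read with a few lines of Python. The following is only a minimal sketch: the names LemmaEntry, parse_lemma_list and load_lemma_list are made up for illustration, and the UTF-8 encoding is an assumption, since the encoding of the lists is not stated here.

```python
from typing import List, NamedTuple


class LemmaEntry(NamedTuple):
    """One line of a list: POS tag, lemma, relative frequency."""
    pos: str
    lemma: str
    frequency: float


def parse_lemma_list(text: str) -> List[LemmaEntry]:
    """Parse TAB-separated lines of the form POS<TAB>lemma<TAB>frequency."""
    entries = []
    for line_number, line in enumerate(text.splitlines(), start=1):
        if not line.strip():
            continue  # skip empty lines, e.g. at the end of the file
        fields = line.split("\t")
        if len(fields) != 3:
            raise ValueError(
                f"line {line_number}: expected 3 fields, got {len(fields)}"
            )
        pos, lemma, frequency = fields
        entries.append(LemmaEntry(pos, lemma, float(frequency)))
    return entries


def load_lemma_list(path: str, encoding: str = "utf-8") -> List[LemmaEntry]:
    """Read a list from disk. The default encoding is an assumption."""
    with open(path, encoding=encoding) as handle:
        return parse_lemma_list(handle.read())


# The first three lines of the snippet above, used as sample input.
SAMPLE = (
    "T\ti\t0.032249628510297\n"
    "V\tvære\t0.0309023882233708\n"
    "C\tog\t0.029584070147617\n"
)

entries = parse_lemma_list(SAMPLE)
for entry in entries:
    print(entry.pos, entry.lemma, entry.frequency)
```

Splitting on the TAB character only, rather than on any whitespace, keeps a field intact even if it should contain a space.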

The lemmas are tagged with one of the POS-markers from the table below. The including list contains words of any POS whereas the excluding list (ex-list in the table) only includes words of those POS marked with a dot.

Tag  POS             Example       Ex-list
A    adjective       god
C    conjunction     og
D    adverb          ud
EW   POW lex.item    anti@
I    interjection    ja
L    numeral         13
LW   POW numeral     10
M    POW morph.item  @erne
NC   common noun     år
NP   proper noun     Danmark
NW   POW noun        tv
P    pronoun         den
T    preposition     i
U    “unique”        at, som, der
V    verb            være

POW in the table above means ‘part of word’. As hyphens and apostrophes are defined as word delimiters in the underlying corpus, some word parts occur in the including list. These carry one of the POW tags from the table: EW, LW, M, or NW.

The lemma frequency is a real number indicating the number of occurrences of all forms of the lemma in the underlying corpus divided by the size of the corpus in tokens. The underlying corpus has a size of approximately 880 million tokens and comprises text material from 1983 until 2016.
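
Because the frequency is a ratio of occurrences to corpus size, it can be turned back into a rough occurrence count or into the common "per million tokens" measure. The sketch below assumes a corpus size of exactly 880 million tokens, which the text gives only as an approximation, so the resulting counts are estimates. The function and constant names are made up for illustration.

```python
# Approximate corpus size as stated in the text (an assumption, not an exact figure).
APPROX_CORPUS_TOKENS = 880_000_000


def estimated_occurrences(relative_frequency: float,
                          corpus_tokens: int = APPROX_CORPUS_TOKENS) -> int:
    """Estimate the absolute number of occurrences of a lemma."""
    return round(relative_frequency * corpus_tokens)


def per_million(relative_frequency: float) -> float:
    """Express a relative frequency as occurrences per million tokens."""
    return relative_frequency * 1_000_000


# Frequency of the preposition "i", taken from the first line of the excluding list.
frequency_i = 0.032249628510297

print(estimated_occurrences(frequency_i))  # roughly 28 million occurrences
print(per_million(frequency_i))            # roughly 32,000 per million tokens
```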


