Word2vec model for Danish

Compiled by Nicolai Hartvig SĂžrensen, the Society for Danish Language and Literature (DSL).

The semantic models were trained on DSL's text corpora using the Python library Gensim's (ƘehƯƙek & Sojka 2010; ƘehƯƙek 2013) implementation of the Word2vec algorithm (Mikolov et al. 2013a, 2013b). The models are provided in three formats:

The models are trained with 500 features and a window of 5 words around the target word. Words that occur fewer than 5 times in the corpus are ignored, and skip-gram is used as the training algorithm. The model "DSL_skipgram_2020.model" was trained on a corpus of more than 1 billion running words, with texts from 1983 through 2019.
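
For reference, these settings correspond roughly to the following Gensim call. This is a minimal sketch, not DSL's actual training pipeline: the corpus file and its one-tokenized-sentence-per-line format are assumptions, and the parameter names follow Gensim 4.x (in Gensim 3.x and earlier, vector_size was called size):

>>> from gensim.models import Word2Vec
>>> from gensim.models.word2vec import LineSentence
>>> # hypothetical corpus file: one tokenized sentence per line
>>> sentences = LineSentence('/path/to/corpus.txt')
>>> model = Word2Vec(
...     sentences,
...     vector_size=500,  # 500 features
...     window=5,         # 5 words of context around the target word
...     min_count=5,      # ignore words occurring fewer than 5 times
...     sg=1,             # skip-gram training algorithm
... )
>>> model.save('/path/to/dsl_skipgram_2020.model')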

A Word2vec model can be used to suggest semantically similar words for a given word. The following example shows the 10 words most semantically similar to bredbÄndsforbindelse ('broadband connection'), listed with the most similar word first:

>>> from gensim import models
>>> MODEL_FILE = '/path/to/dsl_skipgram_2020.model'
>>> model = models.Word2Vec.load(MODEL_FILE)
>>> model.wv.most_similar(positive=[u'bredbÄndsforbindelse'])
[(u'internetforbindelse', 0.8066693544387817), 
(u'netforbindelse', 0.7755526900291443), 
(u'internetopkobling', 0.7506338953971863), 
(u'fastnettelefon', 0.744118332862854), 
(u'opkobling', 0.7368448972702026), 
(u'adslforbindelse', 0.7236030697822571), 
(u'adsl', 0.7201685905456543), 
(u'dataforbindelse', 0.7131620049476624), 
(u'internetadgang', 0.6997736692428589), 
(u'netadgang', 0.689690113067627)]

The number after each word is a similarity score, the cosine similarity between the two word vectors: the closer it is to 1, the greater the semantic similarity according to the model.
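
The score for a specific pair of words can also be queried directly with Gensim's similarity method, using the model loaded above; the exact printed value may vary slightly with the Gensim version:

>>> # cosine similarity between two specific words
>>> model.wv.similarity(u'bredbÄndsforbindelse', u'internetforbindelse')
0.80666935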

Semantic similarity can also be computed for words that have never been described in Danish dictionaries, as the following example shows (shubidua is the name of a Danish pop band, as are the semantically similar words):

>>> model.wv.most_similar(positive=[u'shubidua'])
[(u'gasolin', 0.7520411610603333), 
(u'gnags', 0.7255157828330994), 
(u'steppeulvene', 0.6861495971679688), 
(u'danseorkestret', 0.6719813346862793), 
(u'sweethearts', 0.6715674996376038), 
(u'rocazino', 0.6714069843292236), 
(u'moonjam', 0.6660619974136353), 
(u'kandis', 0.6593234539031982), 
(u'nephew', 0.6578758955001831), 
(u'outlandish', 0.6574865579605103)]

Because words are represented as vectors, they can also be added to and subtracted from one another. If you subtract the vector for england from the vector for london and add the vector for japan, you get the capital of Japan as the closest result:

>>> wv_tokyo = model.wv['london'] - model.wv['england'] + model.wv['japan']
>>> model.wv.most_similar(positive=[wv_tokyo])[:3]
[('tokyo', 0.7956316471099854),
 ('japan', 0.6992685794830322),
 ('london', 0.6554198265075684)]
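
The same analogy can be expressed directly through the positive and negative arguments of most_similar, which also excludes the input words themselves from the results, so tokyo should appear without japan and london crowding the top of the list:

>>> # same arithmetic, with the input words filtered out of the results
>>> model.wv.most_similar(positive=['london', 'japan'], negative=['england'], topn=3)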

Note that misspelled words occur in the model and that certain characters, e.g. "-", have been removed from the words. As the examples show, the words are not lemmatized: no distinction is made between base forms and inflected forms.
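
A query word may therefore need the same normalization as the training data before lookup. A minimal sketch, assuming (based on the examples) that the vocabulary is lowercased and hyphen-free; the normalize helper is hypothetical, not part of the model:

>>> def normalize(word):
...     # strip hyphens and lowercase, mirroring the model's preprocessing
...     return word.lower().replace('-', '')
...
>>> normalize('Shu-bi-dua')
'shubidua'
>>> normalize('Shu-bi-dua') in model.wv  # vocabulary check, Gensim 4.x
True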

Download

Before downloading the material, you must accept the applicable conditions regarding copyright, use, and crediting.