| I. IDENTIFYING INFORMATION |
|
| Title* |
Swedish analogy test set v1.0
|
| Subtitle |
Swedish semantic and syntactic similarity test set
|
| Created by* |
Tosin Adewumi (tosin.adewumi@ltu.se), ML Group, LTU
|
| Publisher(s)* |
Språkbanken Text (sb-info@svenska.gu.se)
|
| Link(s) / permanent identifier(s)* |
https://spraakbanken.gu.se/en/resources/analogy
|
| License(s)* |
CC BY 4.0
|
| Abstract* |
The Swedish analogy test set follows the format of the original Google version. However, it is larger and balanced across the 2 major categories: it contains 20,638 samples in total, made up of 10,381 semantic and 10,257 syntactic samples. It is also roughly balanced across the syntactic subsections. There are 5 semantic subsections and 6 syntactic subsections. The dataset was constructed partly from the samples in the English version, with the help of tools dedicated to Swedish translation, and it was proofread by two native speakers (with a percentage agreement of 98.93%).
|
| Funded by* |
Vinnova (grant no. 2019-02996)
|
| Cite as |
[1]
|
| Related datasets |
Part of the SuperLim collection (https://spraakbanken.gu.se/en/resources/superlim).
|
|
|
| II. USAGE |
|
| Key applications |
Intrinsic evaluation of Swedish word embeddings
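
This sheet does not name an evaluation measure, so the following is only a minimal sketch of how such a test set is commonly used: predict the fourth word of each sample with the vector-offset method and report accuracy. The function names, the toy vectors, and the choice to count out-of-vocabulary samples as wrong are all assumptions made for illustration; the example words are not taken from the dataset.

```python
import math
from typing import Dict, List, Optional, Sequence, Tuple

Vector = Sequence[float]


def cosine(u: Vector, v: Vector) -> float:
    """Cosine similarity between two vectors (0.0 if either has zero norm)."""
    dot = sum(x * y for x, y in zip(u, v))
    norm_u = math.sqrt(sum(x * x for x in u))
    norm_v = math.sqrt(sum(y * y for y in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def predict_fourth(emb: Dict[str, Vector], a: str, b: str, c: str) -> Optional[str]:
    """Return the word closest to b - a + c, excluding a, b and c themselves."""
    if any(w not in emb for w in (a, b, c)):
        return None  # assumption: a sample with unknown words gets no prediction
    target = [vb - va + vc for va, vb, vc in zip(emb[a], emb[b], emb[c])]
    candidates = [w for w in emb if w not in (a, b, c)]
    if not candidates:
        return None
    return max(candidates, key=lambda w: cosine(emb[w], target))


def analogy_accuracy(emb: Dict[str, Vector],
                     samples: List[Tuple[str, str, str, str]]) -> float:
    """Fraction of samples whose fourth word is predicted exactly."""
    if not samples:
        return 0.0
    correct = sum(1 for a, b, c, d in samples if predict_fourth(emb, a, b, c) == d)
    return correct / len(samples)


# Hypothetical 2-dimensional embeddings, chosen by hand for the illustration.
toy_embeddings = {
    "man": [1.0, 0.0],
    "kvinna": [1.0, 1.0],
    "kung": [3.0, 0.0],
    "drottning": [3.0, 1.0],
    "hus": [-1.0, -2.0],
}
toy_samples = [
    ("man", "kvinna", "kung", "drottning"),
    ("kung", "drottning", "man", "kvinna"),
]
accuracy = analogy_accuracy(toy_embeddings, toy_samples)
```

Reporting accuracy separately for the semantic and syntactic categories, as well as overall, would match the way the dataset is divided.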
|
| Intended task(s)/usage(s) |
|
| Recommended evaluation measures |
|
| Dataset function(s) |
Testing
|
| Recommended split(s) |
Test set only
|
|
|
| III. DATA |
|
| Primary data* |
Text
|
| Language* |
Swedish
|
| Dataset in numbers* |
Total of 20,638 samples; 10,381 semantic samples and 10,257 syntactic samples
|
| Nature of the content* |
Each sample contains 2 pairs of words, so there are 4 related words per line.
|
| Format* |
Follows the format of the original Google version: each line holds one sample made up of 2 pairs of words (4 words in total).
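
As a rough sketch of how lines in this layout could be read, the snippet below splits each line into its 4 words. It assumes whitespace-separated words and, as in the original Google file, subsection header lines that start with a colon; neither detail is confirmed by this sheet. The `Analogy` type, the `parse_analogies` function, and the sample lines are hypothetical and not taken from the dataset.

```python
from typing import Iterable, Iterator, NamedTuple, Optional


class Analogy(NamedTuple):
    section: Optional[str]  # subsection name, if a header line was seen
    a: str
    b: str
    c: str
    d: str


def parse_analogies(lines: Iterable[str]) -> Iterator[Analogy]:
    """Yield one Analogy per well-formed line of 4 whitespace-separated words."""
    section: Optional[str] = None
    for raw in lines:
        line = raw.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith(":"):
            # Assumed header convention: ": subsection-name"
            section = line[1:].strip()
            continue
        words = line.split()
        if len(words) != 4:
            continue  # skip lines that do not hold exactly 2 word pairs
        yield Analogy(section, *words)


# Hypothetical input lines for illustration only.
sample_lines = [
    ": capital-common-countries",
    "Stockholm Sverige Oslo Norge",
    "",
    "too few words",
]
parsed = list(parse_analogies(sample_lines))
```

For a real file, the same function could be applied to an open file handle, for example `parse_analogies(open(path, encoding="utf-8"))`, where `path` points to the downloaded test set.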
|
| Data source(s)* |
Partly based on the English version by: Mikolov, T., Chen, K., Corrado, G., & Dean, J. (2013). Efficient estimation of word representations in vector space. arXiv preprint arXiv:1301.3781. New additions were made using the following online tools: https://bab.la and https://en.wiktionary.org/wiki/
|
| Data collection method(s)* |
The dataset was compiled from part of the English version (Mikolov, T., Chen, K., Corrado, G., & Dean, J. (2013). Efficient estimation of word representations in vector space. arXiv preprint arXiv:1301.3781), which was translated. An additional data source is en.wiktionary.org/wiki. Two Swedish native speakers then proofread the finished version, and the inter-annotator agreement score was calculated.
|
| Data selection and filtering* |
Does not apply
|
| Data preprocessing* |
Does not apply
|
| Data labeling* |
Does not apply
|
| Annotator characteristics |
Two Swedish native speakers
|
|
|
| IV. ETHICS AND CAVEATS |
|
| Ethical considerations |
|
| Things to watch out for |
|
|
|
| V. ABOUT DOCUMENTATION |
|
| Data last updated* |
2021-05-12
|
| Which changes have been made, compared to the previous version* |
Some linguistic errors and typos in the previous version have been corrected by Lars Borin and Aleksandrs Berdicevskis.
|
| Access to previous versions |
None
|
| This document created* |
2021-05-20, Tosin Adewumi
|
| This document last updated* |
2021-05-20, Tosin Adewumi
|
| Where to look for further details |
[2],[1]
|
| Documentation template version* |
v1.0
|
|
|
| VI. OTHER |
|
| Related projects |
|
|
|
| References |
[1] Adewumi, T. P., Liwicki, F., & Liwicki, M. (2020). Corpora compared: The case of the Swedish Gigaword & Wikipedia corpora. arXiv preprint arXiv:2011.03281.
[2] Adewumi, T. P., Liwicki, F., & Liwicki, M. (2020). Exploring Swedish & English fastText Embeddings with the Transformer. arXiv preprint arXiv:2007.16007.
|