I. IDENTIFYING INFORMATION
Title* Swedish FAQ (mismatched) v1.0
Subtitle Frequently asked questions from Swedish authorities' websites with shuffled answers
Created by* Aleksandrs Berdicevskis (aleksandrs.berdicevskis@gu.se)
Publisher(s)* Språkbanken Text (sb-info@svenska.gu.se)
Link(s) / permanent identifier(s)* https://spraakbanken.gu.se/en/resources/faq
License(s)* CC BY 4.0
Abstract* This dataset contains questions and answers collected from FAQs on the websites of Swedish authorities (Försäkringskassan, Skatteverket, Vårdguiden). The dataset is part of the SuperLim collection. Each QA pair belongs to a category (e.g. "Förälder :: Väntar barn :: Föräldrapenning innan barnet fötts :: Vanliga frågor"). There are 513 QA pairs and 50 categories. The number of QA pairs per category varies considerably. Within each category, the answers are randomly shuffled. The task is to match questions and answers.
Funded by* Vinnova (grant no. 2020-02523)
Cite as
Related datasets Part of the SuperLim collection (https://spraakbanken.gu.se/en/resources/superlim)
II. USAGE
Key applications Machine Learning, Question Answering, Evaluation of language models
Intended task(s)/usage(s) (1) Evaluate models on the following task: given a question, find the matching answer, if it is present.
Recommended evaluation measures (1) Accuracy
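A minimal sketch of the recommended measure, assuming a model returns one predicted answer string per question and that a prediction counts as correct only if it is identical to the gold answer. The function name `accuracy` and the exact-string comparison are illustrative assumptions, not part of the dataset release.

```python
def accuracy(predicted, gold):
    """Share of questions whose predicted answer equals the gold answer."""
    if len(predicted) != len(gold):
        raise ValueError("predicted and gold must have the same length")
    if not gold:
        raise ValueError("cannot compute accuracy on an empty list")
    # Exact string match is an assumption of this sketch.
    correct = sum(p == g for p, g in zip(predicted, gold))
    return correct / len(gold)
```

With four questions of which three are matched correctly, this returns 0.75.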
Dataset function(s) Testing
Recommended split(s) Test data only
III. DATA
Primary data* Text
Language* Swedish
Dataset in numbers* 513 question-answer pairs; 50 categories; 3 sources. The number of QA pairs per category varies considerably.
Nature of the content* Each QA pair belongs to a certain category (e.g. "Förälder :: Väntar barn :: Föräldrapenning innan barnet fötts :: Vanliga frågor"). Within each category, the answers are randomly shuffled. Importantly, the shuffling always occurs only within categories. This is necessary, since many questions cannot be interpreted without the background provided by the category (e.g. "Måste bilen registreras i mitt namn?" from the category "Förälder :: Om ditt barn har en funktionsnedsättning :: Bilstöd för barn :: Vanliga frågor"). Moreover, different categories can potentially contain similar questions ("Hur mycket pengar får jag?") with different answers.
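The within-category shuffling can be illustrated with the sketch below. It is not the script used to build the dataset; it assumes each row is a dictionary with "category" and "correct_answer" keys, and the function name `shuffle_within_categories` and the `seed` parameter are hypothetical.

```python
import random
from collections import defaultdict


def shuffle_within_categories(rows, seed=0):
    """Return copies of rows with a "candidate_answer" drawn by
    permuting the correct answers inside each category."""
    rng = random.Random(seed)

    # Group row positions by category so answers never cross categories.
    by_category = defaultdict(list)
    for index, row in enumerate(rows):
        by_category[row["category"]].append(index)

    shuffled = [dict(row) for row in rows]  # leave the input untouched
    for indices in by_category.values():
        answers = [rows[i]["correct_answer"] for i in indices]
        rng.shuffle(answers)
        for i, answer in zip(indices, answers):
            shuffled[i]["candidate_answer"] = answer
    return shuffled
```

Because the permutation is random, a candidate answer can coincide with the correct answer, and rerunning with a different seed is likely to produce different candidates.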
Format* Tab-separated, seven columns:
"category id" -- unique integer id of the category;
"source" -- the name of the authority from which the QAs were collected;
"category";
"question";
"candidate_answer" -- the answer that was obtained by the random shuffling (if the shuffling is rerun, the "candidate_answer" is likely to be different);
"correct_answer" -- the original, correct answer that matches the question; "candidate_answer" may or may not be the same as "correct_answer";
"link" -- the URL of the page from which questions and answers were collected. In most cases, there is one link per category; some categories share the link.
Data source(s)* Websites of the Swedish Social Insurance Agency (Försäkringskassan, https://www.forsakringskassan.se/), the central national infrastructure for Swedish healthcare online (Vårdguiden, https://www.1177.se/), and the Swedish Tax Agency (Skatteverket, https://skatteverket.se/). Specific links can be found in the dataset.
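The seven-column tab-separated layout described under Format can be read with Python's standard csv module, as sketched below. The sketch assumes that the file has a header row whose column names are spelled exactly as listed under Format and that fields are not quoted. The function names `read_rows` and `is_match` and the file name in the usage comment are hypothetical.

```python
import csv
import io


def read_rows(handle):
    """Read tab-separated rows into dictionaries keyed by column name."""
    # QUOTE_NONE keeps quotation marks inside answers as literal text.
    reader = csv.DictReader(handle, delimiter="\t", quoting=csv.QUOTE_NONE)
    return list(reader)


def is_match(row):
    """True if the shuffled candidate happens to be the correct answer."""
    return row["candidate_answer"] == row["correct_answer"]


# Usage with a file on disk (the file name is a placeholder):
# with open("faq.tsv", encoding="utf-8", newline="") as handle:
#     rows = read_rows(handle)
```

Comparing "candidate_answer" with "correct_answer" per row gives a binary label that can serve as the gold standard for the matching task.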
Data collection method(s)* Questions and answers were collected manually from various pages of the aforementioned websites. The pages were found either by searching for "vanliga frågor" and "frågor och svar" on the websites or navigating using the websites' menu.
Data selection and filtering* QA pairs were excluded if the answers a) had complicated formatting; b) were divided into several subsections with subheadings; c) contained images and widgets; d) were very long; e) consisted of just one word (Ja or Nej). Categories that contained fewer than two suitable QA pairs were excluded. The contents of the websites were not exhausted; more categories are available. When different categories contained identical or nearly identical QA pairs, all instances but one were deleted (if only the questions were similar, but the answers were different, the pair was kept).
Data preprocessing* All line breaks and all other formatting were removed. Bulleted lists were simplified (the list is preceded by ":", every item by "-"). HTTP links were removed. Some long answers were shortened.
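Two of these steps, removing line breaks and removing links, can be approximated as in the sketch below. How the preprocessing was actually carried out is not documented here, so the regular expression and the function name `clean_answer` are assumptions. The sketch does not cover list simplification or shortening.

```python
import re

# Assumption: a link is any run of non-space characters after http:// or https://.
LINK_PATTERN = re.compile(r"https?://\S+")


def clean_answer(text):
    """Drop links, then collapse line breaks and repeated spaces."""
    without_links = LINK_PATTERN.sub("", text)
    # split() with no arguments splits on any whitespace, including newlines.
    return " ".join(without_links.split())
```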
Data labeling* No labeling was added (questions and answers are matched from the beginning, given the nature of the data).
Annotator characteristics PhD in linguistics; fluent non-native speaker of Swedish
Inter-annotator agreement
IV. ETHICS AND CAVEATS
Ethical considerations
Things to watch out for
V. ABOUT DOCUMENTATION
Data last updated* 2021-04-16, v1.0
Which changes have been made, compared to the previous version* This is the first official version
Access to previous versions
This document created* 2021-05-05, Aleksandrs Berdicevskis
This document last updated* 2021-05-19, Aleksandrs Berdicevskis
Where to look for further details
VI. OTHER
Related projects
References