A chatbot needs data for two main reasons: to know what people are saying to it, and to know what to say back.
An effective chatbot requires a massive amount of training data in order to quickly resolve user requests without human intervention. However, the main obstacle to the development of a chatbot is obtaining realistic and task-oriented dialog data to train these machine learning-based systems. Chatbots are only as good as the training they are given.
We have compiled a definitive list of the best conversational datasets for training a chatbot, broken down into question-answer data, customer support data, dialog data, and multilingual data.
Question-Answer Datasets for Chatbot Training
AmbigQA is a new open-domain question answering task that consists of predicting a set of question-answer pairs, where each plausible answer is paired with a disambiguated rewrite of the original question. The dataset covers 14,042 open-domain questions drawn from NQ-open.
Break is a dataset for question understanding, aimed at training models to reason about complex questions. It consists of 83,978 natural language questions, annotated with a new meaning representation, the Question Decomposition Meaning Representation (QDMR). Each example includes the natural question and its QDMR representation.
CommonsenseQA is a multiple-choice question answering dataset that requires different types of commonsense knowledge to predict the correct answers. It contains 12,102 questions, each with one correct answer and four distractor answers. The dataset is provided in two main training/validation/test splits: the "random split", which is the main evaluation split, and the "question token split".
CoQA is a large-scale dataset for building conversational question answering systems. It contains 127,000 questions with answers, obtained from 8,000 conversations about text passages from seven different domains.
DROP is a 96K-question benchmark, created adversarially, in which a system must resolve references in a question, perhaps to multiple positions in the input paragraph, and perform discrete operations on them (such as adding, counting or sorting). These operations require a much more complete understanding of paragraph content than was required for previous datasets.
DuReader 2.0 is a large-scale, open-domain Chinese dataset for machine reading comprehension (MRC) and question answering (QA). It contains over 300K questions, 1.4M evidence documents and corresponding human-generated answers.
HotpotQA is a question answering dataset featuring natural multi-hop questions, with a strong emphasis on supporting facts to enable more explainable question answering systems. The dataset consists of 113,000 Wikipedia-based QA pairs.
NarrativeQA is a dataset constructed to encourage deeper understanding of language. It requires reading and reasoning over entire books or movie scripts, and contains approximately 45,000 free-text question-answer pairs. The dataset can be used in two modes: (1) reading comprehension on summaries and (2) reading comprehension on whole books/scripts.
Natural Questions (NQ) is a large-scale corpus for training and evaluating open-domain question answering systems, and the first to replicate the end-to-end process in which people find answers to questions. It consists of 300,000 naturally occurring questions, along with human-annotated answers from Wikipedia pages, for use in training QA systems. In addition, it includes 16,000 examples where the answers (to the same questions) are provided by 5 different annotators, useful for evaluating the performance of the learned QA systems.
NewsQA aims to help the research community build algorithms capable of answering questions that require human-level understanding and reasoning skills. Built on CNN articles from the DeepMind Q&A dataset, it is a reading comprehension dataset of 120,000 question-answer pairs.
OpenBookQA is inspired by open-book exams that assess human understanding of a subject. The open book that accompanies the questions is a set of 1,329 elementary-level science facts. Approximately 6,000 questions focus on understanding these facts and applying them to new situations.
QASC is a question answering dataset that focuses on sentence composition. It consists of 9,980 8-way multiple-choice questions on elementary school science (8,134 train, 926 dev, 920 test), and is accompanied by a corpus of 17M sentences.
QuAC (Question Answering in Context) is a dataset for modeling, understanding, and participating in information-seeking dialogs. It contains 14K information-seeking QA dialogs (100K questions in total). Each data instance is an interactive dialog between two crowd workers: (1) a student who asks a sequence of free-form questions to learn as much as possible about a hidden Wikipedia text, and (2) a teacher who answers the questions by providing short excerpts (spans) of the text. QuAC introduces challenges not found in existing machine comprehension datasets: its questions are often more open-ended, unanswerable, or only meaningful in the context of the dialog.
Question-and-answer dataset: This corpus includes Wikipedia articles, factual questions manually generated from them, and answers to these manually generated questions for use in academic research.
Quora Question Pairs: A set of Quora questions for determining whether pairs of question texts actually correspond to semantically equivalent queries. It contains more than 400,000 lines of potential duplicate question pairs.
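To give a feel for the duplicate-question task, here is a minimal Python sketch of a word-overlap baseline. It is not the method used to build or label the Quora data. The function names, the sample questions, and the 0.5 threshold are all assumptions chosen for illustration, and word overlap is only a rough stand-in for semantic equivalence.

```python
import re

def tokenize(text: str) -> set:
    """Lowercase the text and return its set of alphanumeric tokens."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))

def jaccard(a: str, b: str) -> float:
    """Jaccard similarity: shared tokens divided by all distinct tokens."""
    tokens_a, tokens_b = tokenize(a), tokenize(b)
    if not tokens_a and not tokens_b:
        return 1.0  # two empty questions are trivially identical
    return len(tokens_a & tokens_b) / len(tokens_a | tokens_b)

def is_duplicate(a: str, b: str, threshold: float = 0.5) -> bool:
    """Flag a pair as a likely duplicate (threshold is an arbitrary assumption)."""
    return jaccard(a, b) >= threshold

# Hypothetical question pairs, not taken from the dataset.
pairs = [
    ("How do I learn Python quickly?", "How can I learn Python quickly?"),
    ("How do I learn Python quickly?", "What is the capital of France?"),
]
for q1, q2 in pairs:
    print(q1, "|", q2, "->", is_duplicate(q1, q2))
```

A baseline like this breaks down on paraphrases that share few words, which is exactly the case a model trained on labeled duplicate pairs is meant to handle.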
RecipeQA is a dataset for multimodal understanding of recipes. It consists of more than 36,000 automatically generated question-answer pairs from approximately 20,000 unique recipes with step-by-step instructions and images. Each RecipeQA question involves multiple modalities such as titles, descriptions or images, and working towards an answer requires (i) joint understanding of images and text, (ii) capturing the temporal flow of events, and (iii) understanding procedural knowledge.
The Stanford Question Answering Dataset (SQuAD) is a reading comprehension dataset consisting of questions posed by crowd workers on a set of Wikipedia articles, where the answer to each question is a segment of text, or span, of the corresponding reading passage. With more than 100,000 question-answer pairs on more than 500 articles, SQuAD is significantly larger than previous reading comprehension datasets. SQuAD2.0 combines the 100,000 questions from SQuAD1.1 with more than 50,000 new unanswerable questions written adversarially by crowd workers to look like answerable ones.
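The idea that an answer is a span of the passage can be illustrated with a short Python sketch. The passage, the questions, and the helper name `extract_answer` are made up for illustration. The field names `text`, `answer_start`, and `is_impossible` follow the layout of the published SQuAD 2.0 JSON files, but the record shape here is a simplified assumption, not the exact schema.

```python
from typing import Optional

# Hypothetical passage and records, simplified for illustration.
context = "The Amazon rainforest covers much of the Amazon basin of South America."

examples = [
    {
        "question": "What does the Amazon rainforest cover?",
        "is_impossible": False,
        # The answer is stored as text plus a character offset into the passage.
        "answers": [{"text": "much of the Amazon basin", "answer_start": 29}],
    },
    {
        # SQuAD2.0-style question with no answer in the passage.
        "question": "Who discovered the Amazon rainforest?",
        "is_impossible": True,
        "answers": [],
    },
]

def extract_answer(context: str, example: dict) -> Optional[str]:
    """Return the answer span sliced from the passage, or None if unanswerable."""
    if example["is_impossible"]:
        return None
    answer = example["answers"][0]
    start = answer["answer_start"]
    end = start + len(answer["text"])
    span = context[start:end]
    # Sanity check: the offset must point at the recorded answer text.
    if span != answer["text"]:
        raise ValueError("answer_start does not match the answer text")
    return span

for example in examples:
    print(example["question"], "->", extract_answer(context, example))
```

A model trained on this kind of data predicts the start and end positions of the span, and for unanswerable questions it has to learn to abstain.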
TREC QA Collection: TREC has had a question answering track since 1999. In each track, the task was defined so that systems had to retrieve small fragments of text containing an answer to open-domain and closed-domain questions.
TyDi QA is a question answering dataset covering 11 typologically diverse languages with 204K question-answer pairs. The languages in TyDi QA are diverse in terms of their typology (the set of linguistic characteristics that each language expresses), so models that perform well on this set are expected to generalize to a large number of languages around the world. It contains linguistic phenomena that would not be found in English-only corpora.
The WikiQA corpus: A publicly available set of question and sentence pairs collected and annotated for research on open-domain question answering. To reflect the true information needs of general users, the authors used Bing query logs as the source of questions. Each question is linked to a Wikipedia page that potentially contains the answer.
Yahoo Language Data: This page presents manually curated QA datasets from Yahoo Answers.
Customer Support Datasets for Chatbot Training
Customer Support on Twitter: This Kaggle dataset includes more than 3 million tweets and responses from leading brands on Twitter.
Relational Strategies in Customer Service Dataset: A dataset of travel-related customer service data from four sources: conversation logs from three commercial customer service intelligent virtual agents (IVAs) and airline forums on TripAdvisor.com during the month of August 2016.
Ubuntu Dialogue Corpus: Consists of nearly one million two-person conversations from Ubuntu chat logs, in which users sought technical support for various Ubuntu-related issues. The dataset contains 930,000 dialogs and over 100,000,000 words.
Dialogue Datasets for Chatbot Training
A dataset of 502 dialogs with 12,000 annotated utterances between a user and an assistant discussing movie preferences in natural language. The data were collected using the Wizard-of-Oz method between two paid workers, one of whom acts as an "assistant" and the other as a "user".
ConvAI2 dataset: The dataset contains more than 2000 dialogs for a PersonaChat contest, where human evaluators recruited through the Yandex.Toloka crowdsourcing platform chatted with bots submitted by teams.
Cornell Movie-Dialogs Corpus: This corpus contains an extensive collection of metadata-rich fictional conversations extracted from raw movie scripts: 220,579 conversational exchanges between 10,292 movie character pairs involving 9,035 characters from 617 movies.
Maluuba goal-oriented dialogue: An open dialog dataset in which the conversation is aimed at accomplishing a task or making a decision, in particular finding flights and a hotel. The dataset contains complex conversations and decisions covering over 250 hotels, flights and destinations.
Multi-Domain Wizard-of-Oz dataset (MultiWOZ): A comprehensive collection of written conversations covering multiple domains and topics. The dataset contains 10,000 dialogs, and is at least an order of magnitude larger than any previous task-oriented annotated corpus.
The NPS Chat Corpus: This corpus consists of 10,567 messages out of approximately 500,000 messages collected from various online chat services in accordance with their terms of service.
Santa Barbara Corpus of Spoken American English: This dataset contains approximately 249,000 words of transcription, audio and timestamps at the level of individual intonation units.
SGD (Schema-Guided Dialogue) dataset: Contains over 16K multi-domain conversations covering 16 domains. The dataset exceeds the size of existing task-oriented dialog corpora, while highlighting the challenges of building large-scale virtual assistants. It provides a challenging test bed for a number of tasks, including language understanding, slot filling, dialog state tracking, and response generation.
Semantic Web Interest Group IRC Chat Logs: These automatically generated IRC chat logs have been available in RDF on a daily basis since 2004, including timestamps and nicknames.
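Tracking what the user wants across turns (often called dialog state tracking) can be pictured with a small Python sketch. Everything in it is hypothetical: the slot names, the utterances, and the `update_state` helper are invented for illustration and do not reflect the SGD schema. The slot values are written by hand here, whereas a real system would predict them from each utterance.

```python
def update_state(state: dict, turn_slots: dict) -> dict:
    """Return a new dialog state where this turn's slots override older values."""
    new_state = dict(state)       # copy so earlier states stay unchanged
    new_state.update(turn_slots)  # newer values replace older ones
    return new_state

# Hypothetical conversation with hand-annotated slot values per turn.
turns = [
    {"utterance": "I need a flight to Paris.", "slots": {"destination": "Paris"}},
    {"utterance": "Leaving on Friday.", "slots": {"departure_day": "Friday"}},
    {"utterance": "Actually, make it Rome.", "slots": {"destination": "Rome"}},
]

state = {}
for turn in turns:
    state = update_state(state, turn["slots"])
    print(turn["utterance"], "->", state)
```

The third turn shows why the state has to be carried forward: the user revises an earlier choice, and only the accumulated state reflects both the new destination and the departure day given earlier.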
Multilingual Datasets for Chatbot Training
EXCITEMENT datasets: These datasets, available in English and Italian, contain negative feedback from customers giving reasons for their dissatisfaction with a given company.
NUS Corpus: This corpus was created for the normalization and translation of social media texts. It was built by randomly selecting 2,000 messages from the NUS English SMS corpus and then translating them into formal Chinese.
OPUS is a growing collection of translated texts from the web. The OPUS project aims to convert and align free online data, add linguistic annotation, and provide the community with a publicly available parallel corpus. It contains dialog datasets as well as other types of datasets.
We recently updated our website with a list of the best open-sourced datasets used by ML teams across industries. We are constantly updating this page, adding more datasets to help you find the best training data you need for your projects. Find more open-sourced datasets here.