WikiHowNFQA

WikiHowNFQA is a non-factoid question-answering dataset built from 'how-to' content on WikiHow, featuring 11,746 human-authored answers and 74,527 supporting documents. It offers researchers an opportunity to tackle the challenges of generating comprehensive answers from multiple documents and grounding those answers in the real-world context the supporting documents provide.

About WikiHowNFQA

The WikiHowNFQA dataset is derived from WikiHow, a popular online platform that provides how-to guides on a wide range of topics. The dataset is structured to include a question, a set of related documents, and a human-authored answer. The questions are non-factoid, requiring comprehensive, multi-sentence answers. The related documents provide the necessary information to generate an answer.

Dataset Structure

Each instance in WikiHowNFQA contains a question, a set of related documents, and a human-authored answer. The dataset is divided into two parts:

QA Part

This part contains questions, answers, and links to web archive snapshots of related HTML pages. It is available for download on Hugging Face. Each dataset instance includes:

  • article_id: An integer identifier for the article, corresponding to the article_id field in the WikiHow API.
  • question: The non-factoid instructional question.
  • answer: The human-written answer to the question, corresponding to the article summary on the WikiHow website.
  • related_document_urls_wayback_snapshots: A list of URLs to web archive snapshots of related documents, corresponding to the references in the WikiHow article.
  • split: The split of the dataset that the instance belongs to ('train', 'validation', or 'test').
  • cluster: An integer identifier for the cluster that the instance belongs to.

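To make the record layout concrete, here is a minimal sketch of one QA-part instance as a Python dictionary. The field names follow the list above; every value is an illustrative placeholder, not a real dataset entry.

```python
# Illustrative QA-part instance. Field names match the dataset card;
# all values below are made-up placeholders, not real data.
qa_instance = {
    "article_id": 12345,                       # integer id from the WikiHow API (placeholder)
    "question": "How to grow basil indoors?",  # non-factoid instructional question
    "answer": "Place the pot on a sunny windowsill, water when the soil "
              "feels dry, and pinch back the stems to encourage growth.",
    "related_document_urls_wayback_snapshots": [
        "https://web.archive.org/web/20200101000000/https://example.com/basil-guide",
    ],
    "split": "train",  # one of 'train', 'validation', or 'test'
    "cluster": 7,      # cluster identifier (placeholder)
}
```
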
Document Content Part

This part contains parsed HTML content from the related documents. It is available to research groups after signing a Data Transfer Agreement with RMIT University. Each instance includes:

  • article_id: The unique identifier of the article on the WikiHow website.
  • original_url: The original URL of the web page containing the article.
  • archive_url: The URL of a snapshot of the web page from archive.org. The snapshot is the version closest to when the article was created or modified.
  • parsed_text: The plain text parsed from the URL, as text passages with all HTML markup and page structure removed.
  • parsed_md: The text parsed into Markdown format, which preserves formatting such as tables and lists when extracting text content from the web page.

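A Document Content instance can be sketched the same way. Again, the field names come from the list above and all values are placeholders.

```python
# Illustrative Document Content instance; all values are placeholders.
doc_instance = {
    "article_id": 12345,
    "original_url": "https://example.com/basil-guide",
    "archive_url": "https://web.archive.org/web/20200101000000/https://example.com/basil-guide",
    "parsed_text": "Basil needs six hours of sun per day. Water when the top inch of soil is dry.",
    "parsed_md": "## Growing basil\n\n- Six hours of sun per day\n- Water when the top inch of soil is dry",
}
```
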
Dataset Instances

Leaderboard


Automatic evaluation reports Rouge-1/2/L and BertScore; human evaluation reports how often annotators preferred the model answer, preferred the gold answer, or judged the two a tie.

Model                     Rouge-1   Rouge-2   Rouge-L   BertScore   Prefer Model   Prefer Gold   Tie
DPR + BART                   39.8      12.4      23.0       0.881             13            52    35
text-davinci-003             32.2       8.5      19.7       0.873             18            53    29
DPR + text-davinci-003       35.4       9.2      20.2       0.868             56            15    29

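As a rough illustration of what the Rouge-1 column measures, here is a simplified unigram-overlap F1 in pure Python. This is a pared-down sketch, not the official ROUGE toolkit, which adds stemming and other refinements, so it will not match reported scores exactly.

```python
from collections import Counter

def rouge1_f1(candidate: str, reference: str) -> float:
    """Simplified ROUGE-1 F1: clipped unigram overlap between candidate and reference."""
    cand = Counter(candidate.lower().split())
    ref = Counter(reference.lower().split())
    # Clipped overlap: each candidate word counts at most as often as it appears in the reference.
    overlap = sum(min(count, ref[word]) for word, count in cand.items())
    if overlap == 0:
        return 0.0
    precision = overlap / sum(cand.values())
    recall = overlap / sum(ref.values())
    return 2 * precision * recall / (precision + recall)
```
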
Download

The Document Content part is accessible after filling out one of the following forms and sending it to lurunchik@gmail.com. If you are able to have the form completed by authorized personnel at your university, please use the Institutional Form (the preferred method). Otherwise, complete the Individual Form yourself. An example of a completed Individual Form is provided below for reference.