Structured Contents extends to Snapshot API

We’re excited to announce the early beta release of Structured Contents in the Snapshot API. This first release, available to testing partners, provides parsed Wikipedia articles as structured JSON (NDJSON files packaged in a tar.gz archive) with a consistent schema. It covers six Wikipedia languages: English, French, German, Italian, Spanish, and Portuguese.

Overview

In September 2023 we launched the beta Structured Contents On-demand API endpoint and have since received invaluable feedback from developers. A common request was bulk access to the structured information via the Snapshot API; we’re fulfilling that request now.

Similar to the On-demand endpoint, the Structured Contents Snapshot endpoint includes parsed Wikipedia data such as abstracts, Wikidata QIDs, short descriptions, main images, infoboxes, and article sections with links. We are actively evaluating additional parsed elements, such as lists and tables, to incorporate in future updates. EDIT: References and citations have been added to all Structured Contents endpoints as of March 2025.
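For anyone planning to work with a snapshot once they have access, here is a minimal sketch of streaming articles out of a downloaded archive. It assumes only what is described above: NDJSON files (one JSON object per line) inside a tar.gz archive. The function name `iter_articles` and the archive file name in the usage note are hypothetical, and the sketch does not assume any particular field names from the schema.

```python
import io
import json
import tarfile
from typing import Iterator


def iter_articles(archive_path: str) -> Iterator[dict]:
    """Yield one parsed JSON object per NDJSON line in a tar.gz archive.

    Hypothetical helper: the archive layout (one or more NDJSON files
    inside the tarball) is an assumption based on the format description.
    """
    with tarfile.open(archive_path, mode="r:gz") as archive:
        for member in archive:
            # Skip directories and other non-file entries.
            if not member.isfile():
                continue
            raw = archive.extractfile(member)
            if raw is None:
                continue
            # Decode lazily so a large snapshot is never held fully in memory.
            for line in io.TextIOWrapper(raw, encoding="utf-8"):
                line = line.strip()
                if line:  # tolerate blank lines between records
                    yield json.loads(line)
```

A call such as `for article in iter_articles("snapshot.tar.gz"): ...` (with a placeholder file name) would then process one article at a time. Streaming line by line is a deliberate choice here, since a full-language snapshot is likely too large to load at once.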

Beta Access

This early beta release is intended for QA testing with our testing partners to help refine the Structured Contents feature before a broader release. If you are interested in becoming a testing partner, please express your interest to our sales team.

Open Datasets

We’ve released a version of the Structured Contents datasets on some open platforms frequented by the AI community. These datasets aren’t updated as often as our snapshots, and they don’t include more recent features like parsed references and citations. They do provide open access, which lets us gather important feedback from these communities to help improve our product for everybody.

Dataset on Hugging Face

Alongside this release, we’re also making available a Hugging Face dataset of the new beta Structured Contents snapshots (French and English Wikipedia). All of the information regarding the Hugging Face dataset is posted on our blog.

Dataset on Kaggle

We’ve also made the beta Structured Contents snapshots (French and English Wikipedia) available on Kaggle; see the Kaggle dataset article.

Chuck Reynolds, Staff Product Manager, Growth