In the early 2000s, the Internet was still new and had grown out of the existing media framework. Its peer-to-peer nature and the freedom to publish produced the popular adage, “Don’t believe everything you read on the Internet”, and with it a generalized, colloquial mistrust.
That issue never went away. In fact, it’s certainly worse today. The proliferation of misinformation and low-quality media is no secret. Yet from the same space and culture grew Wikipedia and its sister projects. With a high standard of ethics and a committed community with strong editorial standards, the 300+ editions of Wikipedia that millions of monthly readers consume were born. Today, dozens of studies back the information quality of the encyclopedia, as do the many reusers that depend on Wikimedia data for their systems and products.
With the advent of LLMs, misinformation, disinformation, and low-quality information no longer affect only humans. Luckily, Wikimedia projects continue to be a fountain of high-quality training data. As LLMs become more performant and demand fewer tokens for training, model makers looking to secure the quality of their responses may want to follow the pack and, as the Washington Post reported EleutherAI did for The Pile dataset, upsample Wikipedia multiple times in their training data.
Using Wikimedia Enterprise, model makers and other reusers can easily consume Wikimedia project data via our APIs, and with our Credibility Signals features, Enterprise customers can vet that data as they consume it.
What are Credibility Signals?
Credibility Signals are data points across Wikimedia Enterprise’s suite of APIs that help reusers decide when, how, and why to use what is written in the online encyclopedia. From a single new word or figure in a table to an entire new article, across all languages, these signals give our API users greater visibility into the decisions made on the projects by their loyal editing communities. Judging, ruling on, and maintaining content quality relies on standards and heuristics codified by the different communities, but because this information is not centralized (neither within a single project nor across them), it is hard to access. Credibility signals shine a light on that blind spot, for example:
- Our in-house machine learning models evaluate references, edits, and content quality
- Machine-readable tracking of community editorial decisions, e.g., no-index tags and page protections, enable nuanced, contextual and timely notification about critical topics
- Refined signals like breaking news flags, along with inputs like hourly pageviews and edit frequency, isolate new and emerging content
- Collective policies and partner committees enable preemptive safety from threats and bad actors
No Data is Objective
Data is all about context. Inside the wiki-verse, contributions are judged on the tenet of “verifiability, not truth”. Much like peer review in science, on the wikis consensus defines what is closest to “truth”. Editing, quality, and credibility are community-determined. Thus, context is driven by a revision’s editor, the sources cited, and what is written.
An edit’s trust can only increase after several revisions: a single revision, even if by the most trusted editor, cannot significantly signal trust. It is the group, the critical mass, that generates an edit’s trustworthiness. We can serialize our community’s work, programmatically replicating these practices from across 300+ language projects. We can maintain databases and build relationships with these groups through other teams and community liaisons within the Foundation, ensuring that we understand changes and maintain the quality of our signals. It is important to note that because these practices differ from community to community, some signals will not be available in every language.
Use Cases
- Deciding what specific information to show in your search engine’s knowledge panel
- See our article on how to build your own knowledge panel
- Disambiguation for RAG systems
- More precise training for your LLM
- Knowledge Graph building
Example Credibility Signals fields
See our Data Dictionary for all available credibility signals, labeled with a purple badge. Here are a few examples:
‘Protection’
The protection signal is a translation of protection policies on Wikipedia. Each language project has its own policies and different levels of editor permissions required to revise a given article. To see the available protection levels for English Wikipedia, for example, use this API call. You can gather the following:
- Type: the type of event that the protection is applied to
- Level: editor status needed to operate in the type of protection
- Expiry: timestamp indicating how long this protection or restriction is active; can be “infinite”, “indefinite”, “infinity”, or “never” for a never-expiring protection
Articles are given protected status by language project admins when they are at particular risk of vandalism, for example, when the subject of the article is trending or is a famous political figure or athlete.
It is important to note that an article with protected status in one language may not have it in another.
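For illustration, here is a minimal Python sketch (not an official client) that checks which protections are still in force, assuming an article payload shaped like the On-demand example shown later in this post:

from datetime import datetime, timezone

# Expiry values that mean the protection never lapses (assumption based on the list above).
NEVER_EXPIRES = {"infinite", "indefinite", "infinity", "never"}

def active_protections(article: dict) -> list[dict]:
    """Return the protection entries that are still in force for an article payload."""
    now = datetime.now(timezone.utc)
    active = []
    for protection in article.get("protection", []):
        expiry = protection.get("expiry", "")
        if expiry in NEVER_EXPIRES:
            active.append(protection)
            continue
        # Otherwise expiry is a timestamp, e.g. "2024-08-17T02:41:19Z".
        expires_at = datetime.fromisoformat(expiry.replace("Z", "+00:00"))
        if expires_at > now:
            active.append(protection)
    return active

For the 2024 Summer Olympics payload shown below, this would report an “edit” protection at the “autoconfirmed” level until 2024-08-17.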
‘Visibility’
If the editing community has flagged a particular, often old, revision as containing potentially damaging information, they will change its visibility. When a revision is flagged in this way, it is removed entirely from the encyclopedia’s history; this is called “oversighting”. These three booleans offer insight into whether an article’s body, the revision’s editor, or an edit comment may contain harmful data. Oversighted revisions may contain personally identifiable information or extremely offensive language.
When one of these returns “false”, it indicates where the potentially harmful data is (see the short sketch after the list below).
- Text: indicates if the text of this particular revision is visible
- Editor: indicates if the editor name of this particular revision is visible
- Comment: indicates if the comment attached to this particular revision is visible
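Here is a small sketch of how a reuser might act on these flags; the nested field names are assumptions based on the list above, so check the Data Dictionary for the exact schema:

def redaction_reasons(version: dict) -> list[str]:
    """List which parts of a revision the community has hidden (oversighted)."""
    visibility = version.get("visibility", {})  # assumed field name for the three booleans
    reasons = []
    if visibility.get("text") is False:
        reasons.append("revision text is hidden")
    if visibility.get("editor") is False:
        reasons.append("editor name is hidden")
    if visibility.get("comment") is False:
        reasons.append("edit comment is hidden")
    return reasons

A cautious reuser might skip ingesting any revision for which redaction_reasons(...) is non-empty.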
‘Version.editor’
Editor-specific signals that can help contextualize a revision. This is the “who” of the revision. How long has an editor been editing? How many edits have they made? Do they have special seniority rights to delete, alter or block certain articles or other editors? You can read more about editor levels and rights on English Wikipedia here. This signal distills the most important and clearest data from that deep policy into the fields below (a small usage sketch follows the list).
- Identifier: unique MediaWiki ID for the editor
- Name: username of the editor
- Edit Count: number of edits this editor has made so far
- Groups: set of groups this editor belongs to
- Is Bot: signals if editor is a bot or not
- Is Anonymous: signals if editor is anonymous, meaning they do not log in to edit
- Date Started: displays the date of the editor’s first edit in RFC3339 format
- Is Admin: indicates if the editor is an admin
- Is Patroller: indicates if the editor is a patroller
- Has Advanced Rights: indicates if the editor has advanced rights
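As an illustration, here is a minimal Python sketch that folds these fields into a coarse editor-context summary; the field names follow the list above, and nothing here is an official scoring rule:

from datetime import datetime, timezone

def editor_context(editor: dict) -> dict:
    """Summarize who made a revision, using the version.editor fields listed above."""
    started = editor.get("date_started")  # RFC3339, e.g. "2022-10-17T00:28:35Z"
    tenure_days = None
    if started:
        started_at = datetime.fromisoformat(started.replace("Z", "+00:00"))
        tenure_days = (datetime.now(timezone.utc) - started_at).days
    return {
        "name": editor.get("name"),
        "is_bot": editor.get("is_bot", False),
        "is_anonymous": editor.get("is_anonymous", False),
        "edit_count": editor.get("edit_count", 0),
        "tenure_days": tenure_days,
        "has_elevated_rights": bool(
            editor.get("is_admin")
            or editor.get("is_patroller")
            or editor.get("has_advanced_rights")
        ),
    }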
You may be interested in using the output of these three models to support your ingestion preferences (a small thresholding sketch follows this list):
- RevertRisk
- ReferenceRisk (upcoming)
- Article Quality (upcoming)
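For example, a minimal sketch that uses RevertRisk (the only one of the three currently surfaced in the payload, under version.scores, as shown in the next section) to hold risky revisions out of an ingestion pipeline; the 0.7 threshold is an arbitrary assumption to tune for your own use case:

def should_hold_for_review(version: dict, threshold: float = 0.7) -> bool:
    """Hold a revision out of ingestion when RevertRisk thinks it is likely to be reverted."""
    revertrisk = version.get("scores", {}).get("revertrisk", {})
    probability_reverted = revertrisk.get("probability", {}).get("true", 0.0)
    return probability_reverted >= threshold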
How do you get credibility signals from Wikimedia Enterprise APIs?
Credibility Signal metadata is included with every article payload in all Wikimedia Enterprise APIs. The following is an On-demand JSON response payload for the 2024 Summer Olympics article with only credibility signals fields included:
{
  "name": "2024 Summer Olympics",
  "protection": [
    {
      "type": "edit",
      "level": "autoconfirmed",
      "expiry": "2024-08-17T02:41:19Z"
    }
  ],
  "version": {
    "identifier": 1240518903,
    "comment": "Attempting to simplify \"Nations\" cell in infobox.",
    "scores": {
      "revertrisk": {
        "prediction": true,
        "probability": {
          "false": 0.27714723348617554,
          "true": 0.7228527665138245
        }
      }
    },
    "editor": {
      "identifier": 44680730,
      "name": "AFC Vixen",
      "edit_count": 1632,
      "groups": [
        "extendedconfirmed",
        "*",
        "user",
        "autoconfirmed"
      ],
      "date_started": "2022-10-17T00:28:35Z"
    },
    "number_of_characters": 214211,
    "size": {
      "value": 214855,
      "unit_text": "B"
    },
    "maintenance_tags": {}
  },
  "watchers_count": 527
}
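To pull such a payload yourself, the sketch below shows one way to call the On-demand API and keep only the credibility signals fields; the endpoint path, request shape, and fields parameter are assumptions for illustration, so consult the Wikimedia Enterprise documentation and your own credentials for the exact contract:

import requests

# Assumed endpoint for illustration; see the Enterprise docs for the current On-demand API contract.
API_URL = "https://api.enterprise.wikimedia.com/v2/articles/{name}"

def fetch_credibility_signals(name: str, access_token: str) -> list[dict]:
    """Fetch an article on demand and keep only the credibility-signal fields shown above."""
    response = requests.post(
        API_URL.format(name=name),
        headers={"Authorization": f"Bearer {access_token}"},
        json={"fields": ["name", "protection", "version", "watchers_count"]},  # assumed filter syntax
        timeout=30,
    )
    response.raise_for_status()
    articles = response.json()  # typically one entry per matching project/language
    return [
        {
            "name": article.get("name"),
            "protection": article.get("protection", []),
            "version": article.get("version", {}),
            "watchers_count": article.get("watchers_count"),
        }
        for article in articles
    ]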
Definition of terms
- Verifiability: A central tenet of Wikipedia’s editing policy. Anyone using the encyclopedia must be able to understand where the information in an article comes from. Is there a source? This is key to maintaining the integrity of information through discussion and argumentation.
- Revertrisk: A machine learning model that outputs the likelihood that a given edit, across Wikipedia language projects, will be reverted by an editor. It is a strong heuristic for quality.
- Vandalism: Any disruptive or malicious editing behavior across all Wikimedia projects. Naïvely added incorrect information is not considered vandalism, but a natural part of the growth of the encyclopedia.
- Breaking news: any new article or new edit on Wikipedia about a real-life event.
- Content Reusers: Organizations that use Wikimedia content off-platform in their own products.
- The Foundation: Refers to the Wikimedia Foundation. Created in 2003 as a non-profit, two years after Wikipedia. WMF hosts and fundraises for the Wikimedia projects and develops tools to empower our communities of editors and content reusers.
- The Community: Individual Volunteers. 100K+ active editors distributed around the world who self-organize to curate and edit 300+ language versions of Wikipedia and other sister projects.
— Francisco Navas, Product Manager
Photo Credits
Wellcome Library, London, CC BY 4.0, via Wikimedia Commons