How a study gets on this atlas

Every automated step between a paper appearing on PubMed and its page going live here. No hand-waving. No magic.

The pipeline, in order

The atlas ingests new studies once a day. Each stage runs as a scheduled job on Vercel (one stage per hour, 06:00–22:00 UTC). A stage only reads what previous stages produced; nothing skips ahead.

01

Fetch

PubMed query via NCBI E-utilities

Fixed ME/CFS + long-COVID + comparator search terms, version-controlled. Result: a raw import row per paper, storing title, abstract, authors, journal, year, and the raw PubMed record. No LLM. No interpretation.
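A minimal sketch of the fetch step's URL construction, assuming the JSON `retmode` of NCBI E-utilities. The search term and `retmax` value here are illustrative, not the atlas's actual version-controlled query; `reldate=1` with `datetype="edat"` restricts results to records added to Entrez in the last day, matching the once-a-day schedule.

```python
from urllib.parse import urlencode

EUTILS_BASE = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils"

def daily_esearch_url(term: str, retmax: int = 500) -> str:
    """Build the E-utilities esearch URL for one daily fetch window."""
    params = {
        "db": "pubmed",
        "term": term,          # the version-controlled search expression
        "retmode": "json",
        "retmax": retmax,
        "datetype": "edat",    # filter on Entrez date added
        "reldate": 1,          # within the last day
    }
    return f"{EUTILS_BASE}/esearch.fcgi?{urlencode(params)}"
```

The PMIDs this returns would then be passed to `efetch.fcgi` to pull the raw records stored in the import rows.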

02

Normalise

Pure string work

Shapes the raw record into canonical fields (title casing, author-list shape, journal normalisation, year parsing). Still no LLM.

03

Deduplicate

Near-duplicate detection

Checks title + author + year against existing studies. Near-duplicates (the same paper appearing as both a preprint and a journal version) are merged into a single Study with both sources attached.
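A heuristic sketch of the title + author + year check, using standard-library fuzzy matching. The similarity threshold and the one-year tolerance (a preprint and its journal version can straddle a year boundary) are assumptions, not the atlas's actual parameters.

```python
from difflib import SequenceMatcher

def _norm(s: str) -> str:
    return " ".join(s.lower().split())

def is_near_duplicate(a: dict, b: dict, threshold: float = 0.92) -> bool:
    """Same first-author surname, years within one of each other,
    and near-identical titles -> treat as one Study."""
    if abs(a["year"] - b["year"]) > 1:
        return False
    if _norm(a["first_author"]) != _norm(b["first_author"]):
        return False
    ratio = SequenceMatcher(None, _norm(a["title"]), _norm(b["title"])).ratio()
    return ratio >= threshold
```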

04

Lineage check

Citation graph

Detects when a new paper cites or replicates an existing atlas entry. Stores the link. Does not alter either study’s text.
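The link-storage side of this stage can be sketched as a pure set operation, assuming PMIDs as identifiers. Only references that already exist as atlas entries become edges; neither study's text is touched.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class LineageLink:
    citing_pmid: str
    cited_pmid: str

def lineage_links(citing_pmid: str,
                  reference_pmids: set[str],
                  atlas_pmids: set[str]) -> list[LineageLink]:
    """Record an edge for every reference already in the atlas."""
    return [LineageLink(citing_pmid, pmid)
            for pmid in sorted(reference_pmids & atlas_pmids)]
```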

05

Classify

LLM, structured, fixed prompt

Input: title + abstract. Output: five provenance enums — publicationType (RCT / observational / review / preprint / …), peerReviewStatus, evidenceLevel (E0 strongest → E3 preliminary), caseDefinitionQuality, diseaseContext. Model: Claude Haiku. These enums control the wording the next stage is allowed to use.
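The "structured" part of this stage can be sketched with closed enums: the LLM's strings are coerced into a fixed vocabulary, so anything outside it fails loudly rather than flowing into stage 06. Two of the five enums are shown; the member lists come from the examples in the text, but the lowercase string values and the validator's shape are assumptions.

```python
from enum import Enum

class PublicationType(str, Enum):
    RCT = "rct"
    OBSERVATIONAL = "observational"
    REVIEW = "review"
    PREPRINT = "preprint"

class EvidenceLevel(str, Enum):
    E0 = "E0"  # strongest
    E1 = "E1"
    E2 = "E2"
    E3 = "E3"  # preliminary

def validate_classification(llm_output: dict) -> dict:
    """Coerce the model's output into the fixed enums; an unknown
    value raises ValueError instead of reaching the summariser."""
    return {
        "publicationType": PublicationType(llm_output["publicationType"]),
        "evidenceLevel": EvidenceLevel(llm_output["evidenceLevel"]),
    }
```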

06

Summarise

LLM, design-aware prompt (generator v1)

Input: title + abstract + the five enums. Output: nine structured fields (plain-language summary, advanced summary, why it matters, what it does not prove, observed findings, inferred conclusions, remaining questions, methodological strengths, limitations). The prompt tells the model what wording discipline applies to this study’s design (e.g. no causal verbs unless RCT + strong evidence level; bridge qualifier for non-ME/CFS populations).
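How the enums gate the wording can be sketched as rule assembly before the prompt is built. The two rules mirror the examples in the text; equating "strong evidence level" with E0, and the `disease_context` values, are assumptions, and the real gen.v1 prompt may carry more rules.

```python
def wording_rules(publication_type: str,
                  evidence_level: str,
                  disease_context: str) -> list[str]:
    """Assemble the wording constraints injected into the generator prompt."""
    rules: list[str] = []
    # Causal verbs only for RCTs at the strongest evidence level.
    if not (publication_type == "rct" and evidence_level == "E0"):
        rules.append("Do not use causal verbs; report associations only.")
    # Non-ME/CFS populations need an explicit bridge qualifier.
    if disease_context != "me_cfs":
        rules.append("Add a bridge qualifier: population is not ME/CFS.")
    return rules
```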

07

Draft

Private database row

Packages the nine fields into a draft Study record. Not visible on the public site yet.

08

Moderate

Auto-moderation + lint gate (enforce mode)

Applies deterministic checks (retracted? duplicate? weak case definition? preprint? psychosomatic paradigm without consensus evidence?) and the Pass B lint gate. Any draft with a CRITICAL wording flag (efficacy leakage, causal leakage, population mismatch, preprint-settled claim, etc.) is routed to a HOLD queue and does not auto-publish. The HOLD queue is manually triaged by the maintainer.
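The enforce-mode routing reduces to one set intersection. The snake_case flag identifiers are this sketch's rendering of the flag names in the text, not the scanner's actual vocabulary.

```python
CRITICAL_FLAGS = {
    "efficacy_leakage",
    "causal_leakage",
    "population_mismatch",
    "preprint_settled_claim",
}

def route_draft(lint_flags: set[str]) -> str:
    """Enforce mode: any CRITICAL wording flag routes the draft to the
    HOLD queue for manual triage; a clean draft proceeds to stage 09."""
    return "HOLD" if lint_flags & CRITICAL_FLAGS else "AUTO_PUBLISH"
```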

09

Auto-publish

Public atlas

Flips the draft to public. From this point the study page is indexable by search engines.

Versioning

Each LLM prompt carries a version string (class.v…, gen.v…). When the prompt changes, the version changes. Each deterministic scanner carries a version string (currently pass-b.v1.4.2026-04-17). Every study row stores which prompt version produced its text.
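The per-row stamp can be sketched as an immutable record. The scanner string is the one quoted above; the classifier and generator version strings used here are placeholders, not the atlas's actual current values.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class GenerationProvenance:
    """Stored on every study row: the versions that produced its text."""
    classifier_prompt: str  # a class.v… string
    generator_prompt: str   # a gen.v… string
    scanner: str            # a pass-b.v… string
```

Freezing the record means a study's provenance can only change by writing a new row, never by mutating an old one.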

What the pipeline does not do

  • It does not re-rank studies by importance.
  • It does not call any paper “good” or “bad”.
  • It does not aggregate results across studies.
  • It does not produce recommendations.
  • It does not answer patient questions.

Current corpus

  • 6,129 public study pages.
  • 16 research topics.
  • Generator v1 active. Scanner v1.4 active. Lint gate in enforce mode.

See the editorial policy for the contract this pipeline is held to.