How a study gets on this atlas
Every automated step between a paper existing on PubMed and a page being live here. No hand-waving. No magic.
The pipeline, in order
The atlas ingests new studies once a day. Each stage runs as a scheduled job on Vercel (one stage per hour, 06:00–22:00 UTC). A stage only reads what previous stages produced; nothing skips ahead.
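The "nothing skips ahead" rule can be sketched as an ordering check. This is a minimal illustration, not the atlas's actual scheduler: the stage names follow the order on this page, while `Record`, `can_run`, and `run_stage` are hypothetical names invented for the example.

```python
from dataclasses import dataclass, field

# Stage names follow the order on this page; everything else is illustrative.
STAGES = [
    "fetch", "normalise", "deduplicate", "lineage_check", "classify",
    "summarise", "draft", "moderate", "auto_publish",
]


@dataclass
class Record:
    pmid: str
    completed: list = field(default_factory=list)  # stages already run, in order


def can_run(stage: str, record: Record) -> bool:
    """A stage may run only when exactly the earlier stages have completed."""
    idx = STAGES.index(stage)
    return record.completed == STAGES[:idx]


def run_stage(stage: str, record: Record) -> bool:
    """Mark the stage complete if its prerequisites are met; otherwise do nothing."""
    if not can_run(stage, record):
        return False
    record.completed.append(stage)
    return True
```

With this check, a record that has only been fetched cannot be classified or summarised until normalisation, deduplication, and the lineage check have run on it.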
Fetch
PubMed query via NCBI E-utilities
Fixed ME/CFS + long-COVID + comparator search terms, version-controlled. Result: a raw import row per paper, storing title, abstract, authors, journal, year, and the raw PubMed record. No LLM. No interpretation.
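As a rough sketch of what a fixed E-utilities query looks like, the snippet below builds an ESearch URL without sending it. The endpoint and the `db`, `term`, `reldate`, `datetype`, `retmax`, and `retmode` parameters are part of NCBI's public ESearch interface; the search term, the one-day window, and the function name are assumptions made for illustration and are not the atlas's real terms.

```python
from urllib.parse import urlencode

ESEARCH = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"

# Hypothetical search term; the atlas's version-controlled terms are not shown here.
EXAMPLE_TERM = '"chronic fatigue syndrome"[Title/Abstract] OR "long COVID"[Title/Abstract]'


def build_esearch_url(term: str, days_back: int = 1, retmax: int = 200) -> str:
    """Build (but do not send) a PubMed ESearch request URL."""
    params = {
        "db": "pubmed",
        "term": term,
        "reldate": days_back,  # restrict to records from the last N days
        "datetype": "edat",    # apply the window to the Entrez date
        "retmax": retmax,      # maximum number of PMIDs to return
        "retmode": "json",
    }
    return f"{ESEARCH}?{urlencode(params)}"
```

Keeping the term as a constant in source control is one simple way to make the search "fixed" and auditable: any change to it shows up as a diff.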
Normalise
Pure string work
Shapes the raw record into canonical fields (title casing, author-list shape, journal normalisation, year parsing). Still no LLM.
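A minimal sketch of this kind of pure string work, assuming a comma-separated author string and a free-text date field. The function names and the exact rules are hypothetical; the real normaliser may differ.

```python
import re


def collapse_whitespace(text: str) -> str:
    """Replace any run of whitespace with a single space and trim the ends."""
    return re.sub(r"\s+", " ", text).strip()


def parse_year(raw: str):
    """Return the first plausible four-digit year in a date string, else None."""
    match = re.search(r"\b(19|20)\d{2}\b", raw)
    return int(match.group(0)) if match else None


def normalise_authors(raw: str) -> list:
    """Split a comma-separated author string into a clean list, dropping empties."""
    parts = raw.split(",")
    return [collapse_whitespace(p).rstrip(".") for p in parts if p.strip()]
```

For example, `parse_year("2023 Mar 14")` yields `2023`, and `normalise_authors("Smith J,  Jones AB.")` yields `["Smith J", "Jones AB"]`. Every function is deterministic, so the same raw record always produces the same canonical fields.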
Deduplicate
Near-duplicate detection
Checks title + author + year against existing studies. Near-duplicates (for example, the same paper appearing as both a preprint and a journal version) are merged into a single Study with both sources attached.
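One way to sketch a title + author + year check is with the standard library's `difflib.SequenceMatcher`. The 0.9 similarity threshold, the one-year tolerance (a preprint and its journal version can fall in adjacent years), and the first-author comparison are all assumptions for illustration, not the atlas's actual matching rules.

```python
import re
from difflib import SequenceMatcher


def title_key(title: str) -> str:
    """Lowercase the title and drop punctuation so cosmetic differences vanish."""
    return re.sub(r"[^a-z0-9 ]", "", title.lower()).strip()


def is_near_duplicate(a: dict, b: dict, threshold: float = 0.9) -> bool:
    """Compare two records with 'title', 'authors' (list), and 'year' (int) keys."""
    if not a["authors"] or not b["authors"]:
        return False
    same_first_author = a["authors"][0].lower() == b["authors"][0].lower()
    close_year = abs(a["year"] - b["year"]) <= 1  # assumed tolerance
    similarity = SequenceMatcher(
        None, title_key(a["title"]), title_key(b["title"])
    ).ratio()
    return same_first_author and close_year and similarity >= threshold
```

Requiring all three signals to agree keeps the check conservative: two papers with similar titles but different first authors are left as separate studies.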
Lineage check
Citation graph
Detects when a new paper cites or replicates an existing atlas entry. Stores the link. Does not alter either study’s text.
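The citation half of this check can be sketched as a set lookup. The snippet assumes the new paper's reference list is already available as PMIDs, which this page does not state; detecting replications would need more than this. The function name and the PMIDs in the usage note are made up.

```python
def find_lineage_links(new_pmid: str, cited_pmids: list, atlas_pmids: set) -> list:
    """Return (citing, cited) pairs for references that are already atlas entries.

    The function only produces link pairs; it never touches study text.
    """
    unique_cited = dict.fromkeys(cited_pmids)  # de-duplicate, keep first-seen order
    return [
        (new_pmid, cited)
        for cited in unique_cited
        if cited in atlas_pmids and cited != new_pmid
    ]
```

Called with `find_lineage_links("900", ["111", "555", "111"], {"111", "222"})`, it returns `[("900", "111")]`: one link for the reference that is already in the atlas, and nothing for the reference that is not.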
Classify
LLM, structured, fixed prompt
Input: title + abstract. Output: five provenance enums, namely publicationType (RCT / observational / review / preprint / …), peerReviewStatus, evidenceLevel (E0 strongest → E3 preliminary), caseDefinitionQuality, and diseaseContext. Model: Claude Haiku. These enums control the wording the next stage is allowed to use.
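Because later stages depend on these enums, structured model output is usually validated before it is stored. The sketch below checks that all five fields are present and that evidenceLevel is in range. The five field names come from this page; the intermediate levels E1 and E2 are inferred from the E0 to E3 range, and the function name is hypothetical.

```python
REQUIRED_FIELDS = (
    "publicationType",
    "peerReviewStatus",
    "evidenceLevel",
    "caseDefinitionQuality",
    "diseaseContext",
)

# E1 and E2 are assumed from the stated E0 (strongest) to E3 (preliminary) range.
EVIDENCE_LEVELS = ("E0", "E1", "E2", "E3")


def validate_classification(output: dict) -> list:
    """Return a list of problems with a classifier output; an empty list means valid."""
    errors = []
    for name in REQUIRED_FIELDS:
        value = output.get(name)
        if not isinstance(value, str) or not value:
            errors.append(f"missing or empty field: {name}")
    level = output.get("evidenceLevel")
    if isinstance(level, str) and level and level not in EVIDENCE_LEVELS:
        errors.append(f"unknown evidenceLevel: {level}")
    return errors
```

A fuller version would also check each of the other four fields against its own allowed values; those lists are not given in full here, so the sketch leaves them out rather than guess.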
Summarise
LLM, design-aware prompt (generator v1)
Input: title + abstract + the five enums. Output: nine structured fields (plain-language summary, advanced summary, why it matters, what it does not prove, observed findings, inferred conclusions, remaining questions, methodological strengths, limitations). The prompt tells the model which wording discipline applies to this study’s design (e.g. no causal verbs unless RCT + strong evidence level; bridge qualifier for non-ME/CFS populations).
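The two example rules can be sketched as a function from enums to prompt instructions. Several details are assumptions made for illustration: that only E0 counts as a "strong" evidence level, that diseaseContext uses the literal value "ME/CFS", and the wording of the instructions themselves.

```python
# Assumption: only E0 counts as a "strong" evidence level.
STRONG_LEVELS = {"E0"}

# Assumption: diseaseContext uses this literal value for ME/CFS populations.
MECFS_CONTEXT = "ME/CFS"


def wording_rules(enums: dict) -> list:
    """Derive wording instructions for the summariser prompt from the enums."""
    rules = []
    causal_allowed = (
        enums["publicationType"] == "RCT"
        and enums["evidenceLevel"] in STRONG_LEVELS
    )
    if not causal_allowed:
        rules.append("Do not use causal verbs; describe associations only.")
    if enums["diseaseContext"] != MECFS_CONTEXT:
        rules.append("Add a bridge qualifier: the study population is not ME/CFS.")
    return rules
```

An RCT at E0 in an ME/CFS population yields no extra restrictions, while an observational long-COVID study yields both. Deriving the rules in code rather than asking the model to infer them keeps the discipline deterministic and testable.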
Draft
Private database row
Packages the nine fields into a draft Study record. Not visible on the public site yet.
Moderate
Auto-moderation + lint gate (enforce mode)
Applies deterministic checks (retracted? duplicate? weak case definition? preprint? psychosomatic paradigm without consensus evidence?) and the Pass B lint gate. Any draft with a CRITICAL wording flag (efficacy leakage, causal leakage, population mismatch, preprint-settled claim, etc.) is routed to a HOLD queue and does not auto-publish. The HOLD queue is manually triaged by the maintainer.
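The routing rule (any CRITICAL flag means HOLD) can be sketched with a toy scanner. This is not the Pass B implementation: the verb list, the `WARN` tier, and all names are hypothetical, and the causal check is simplified to "non-RCT text must not contain a causal verb".

```python
import re
from dataclasses import dataclass


@dataclass(frozen=True)
class Flag:
    code: str
    severity: str  # "CRITICAL", or "WARN" (an assumed lower tier)


# Hypothetical, deliberately tiny verb list for illustration only.
CAUSAL_VERBS = re.compile(r"\b(causes|cures|prevents)\b", re.IGNORECASE)


def lint(text: str, publication_type: str) -> list:
    """Toy scanner: flag causal verbs in summaries of non-RCT studies."""
    flags = []
    if publication_type != "RCT" and CAUSAL_VERBS.search(text):
        flags.append(Flag("causal_leakage", "CRITICAL"))
    return flags


def route(flags: list) -> str:
    """Any CRITICAL flag sends the draft to HOLD; otherwise it may publish."""
    if any(flag.severity == "CRITICAL" for flag in flags):
        return "HOLD"
    return "PUBLISH"
```

Here `route(lint("Drug X cures fatigue.", "observational"))` returns `"HOLD"`, while a summary that only reports an association passes through. Because the gate is deterministic, the same draft always gets the same routing decision.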
Auto-publish
Public atlas
Flips the draft to public. From this point the study page is indexable by search engines.
Versioning
Each LLM prompt carries a version string (class.v…, gen.v…). When the prompt changes, the version changes. Each deterministic scanner also carries a version string (currently pass-b.v1.4.2026-04-17). Every study row stores which prompt version produced its text.
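A version string like the scanner's can be parsed into comparable parts. The sketch below assumes the layout is name, then `v` plus major and minor numbers, then an ISO date, which is inferred from the single example on this page; the function name is hypothetical.

```python
import re
from datetime import date

# Assumed layout: <name>.v<major>.<minor>.<YYYY-MM-DD>
VERSION_PATTERN = re.compile(
    r"^(?P<name>[a-z][a-z0-9-]*)\.v(?P<major>\d+)\.(?P<minor>\d+)"
    r"\.(?P<date>\d{4}-\d{2}-\d{2})$"
)


def parse_scanner_version(version: str) -> dict:
    """Split a scanner version string into name, numeric version, and release date."""
    match = VERSION_PATTERN.match(version)
    if match is None:
        raise ValueError(f"unrecognised version string: {version!r}")
    return {
        "name": match.group("name"),
        "version": (int(match.group("major")), int(match.group("minor"))),
        "released": date.fromisoformat(match.group("date")),
    }
```

Parsing `pass-b.v1.4.2026-04-17` this way gives the name `pass-b`, the version tuple `(1, 4)`, and the date 2026-04-17, so stored rows can be compared against the active scanner by version or by date.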
What the pipeline does not do
- It does not re-rank studies by importance.
- It does not call any paper “good” or “bad”.
- It does not aggregate results across studies.
- It does not produce recommendations.
- It does not answer patient questions.
Current corpus
- 6,129 public study pages.
- 16 research topics.
- Generator v1 active. Scanner v1.4 active. Lint gate in enforce mode.
See the editorial policy for the editorial contract this pipeline is held to.