How a study gets on this atlas

Every automated step between a paper appearing on PubMed and its page going live here. No hand-waving. No magic.

The pipeline, in order

The atlas ingests new studies once a day. Each stage runs as a scheduled job on Vercel (one stage per hour, 06:00–22:00 UTC). A stage only reads what previous stages produced; nothing skips ahead.

01

Fetch

PubMed query via NCBI E-utilities

Fixed ME/CFS + long-COVID + comparator search terms, version-controlled. Result: a raw import row per paper, storing title, abstract, authors, journal, year, and the raw PubMed record. No LLM. No interpretation.
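A minimal sketch of the fetch step's URL construction, assuming the JSON `retmode` of NCBI E-utilities. The search term and `retmax` value here are illustrative, not the atlas's actual version-controlled query; `reldate=1` with `datetype="edat"` restricts results to records added to Entrez in the last day, matching the once-a-day schedule.

```python
from urllib.parse import urlencode

EUTILS_BASE = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils"

def daily_esearch_url(term: str, retmax: int = 500) -> str:
    """Build the E-utilities esearch URL for one daily fetch window."""
    params = {
        "db": "pubmed",
        "term": term,          # the version-controlled search expression
        "retmode": "json",
        "retmax": retmax,
        "datetype": "edat",    # filter on Entrez date added
        "reldate": 1,          # within the last day
    }
    return f"{EUTILS_BASE}/esearch.fcgi?{urlencode(params)}"
```

The PMIDs this returns would then be passed to `efetch.fcgi` to pull the raw records stored in the import rows.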

02

Normalise

Pure string work

Shapes the raw record into canonical fields (title casing, author-list shape, journal normalisation, year parsing). Still no LLM.

03

Deduplicate

Near-duplicate detection

Checks title + author + year against existing studies. Near-duplicates (the same paper appearing as both a preprint and a journal version) are merged into a single Study with both sources attached.
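A heuristic sketch of the title + author + year check, using standard-library fuzzy matching. The similarity threshold and the one-year tolerance (a preprint and its journal version can straddle a year boundary) are assumptions, not the atlas's actual parameters.

```python
from difflib import SequenceMatcher

def _norm(s: str) -> str:
    return " ".join(s.lower().split())

def is_near_duplicate(a: dict, b: dict, threshold: float = 0.92) -> bool:
    """Same first-author surname, years within one of each other,
    and near-identical titles -> treat as one Study."""
    if abs(a["year"] - b["year"]) > 1:
        return False
    if _norm(a["first_author"]) != _norm(b["first_author"]):
        return False
    ratio = SequenceMatcher(None, _norm(a["title"]), _norm(b["title"])).ratio()
    return ratio >= threshold
```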

04

Lineage check

Citation graph

Detects when a new paper cites or replicates an existing atlas entry. Stores the link. Does not alter either study’s text.
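The link-storage side of this stage can be sketched as a pure set operation, assuming PMIDs as identifiers. Only references that already exist as atlas entries become edges; neither study's text is touched.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class LineageLink:
    citing_pmid: str
    cited_pmid: str

def lineage_links(citing_pmid: str,
                  reference_pmids: set[str],
                  atlas_pmids: set[str]) -> list[LineageLink]:
    """Record an edge for every reference already in the atlas."""
    return [LineageLink(citing_pmid, pmid)
            for pmid in sorted(reference_pmids & atlas_pmids)]
```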

05

Classify

LLM, structured, fixed prompt

Input: title + abstract. Output: five provenance enums — publicationType (RCT / observational / review / preprint / …), peerReviewStatus, evidenceLevel (E0 strongest → E3 preliminary), caseDefinitionQuality, diseaseContext. Model: Claude Haiku. These enums control the wording the next stage is allowed to use.
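The "structured" part of this stage can be sketched with closed enums: the LLM's strings are coerced into a fixed vocabulary, so anything outside it fails loudly rather than flowing into stage 06. Two of the five enums are shown; the member lists come from the examples in the text, but the lowercase string values and the validator's shape are assumptions.

```python
from enum import Enum

class PublicationType(str, Enum):
    RCT = "rct"
    OBSERVATIONAL = "observational"
    REVIEW = "review"
    PREPRINT = "preprint"

class EvidenceLevel(str, Enum):
    E0 = "E0"  # strongest
    E1 = "E1"
    E2 = "E2"
    E3 = "E3"  # preliminary

def validate_classification(llm_output: dict) -> dict:
    """Coerce the model's output into the fixed enums; an unknown
    value raises ValueError instead of reaching the summariser."""
    return {
        "publicationType": PublicationType(llm_output["publicationType"]),
        "evidenceLevel": EvidenceLevel(llm_output["evidenceLevel"]),
    }
```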

06

Summarise

LLM, design-aware prompt (generator v1)

Input: title + abstract + the five enums. Output: nine structured fields (plain-language summary, advanced summary, why it matters, what it does not prove, observed findings, inferred conclusions, remaining questions, methodological strengths, limitations). The prompt tells the model what wording discipline applies to this study’s design (e.g. no causal verbs unless RCT + strong evidence level; bridge qualifier for non-ME/CFS populations).
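How the enums gate the wording can be sketched as rule assembly before the prompt is built. The two rules mirror the examples in the text; equating "strong evidence level" with E0, and the `disease_context` values, are assumptions, and the real gen.v1 prompt may carry more rules.

```python
def wording_rules(publication_type: str,
                  evidence_level: str,
                  disease_context: str) -> list[str]:
    """Assemble the wording constraints injected into the generator prompt."""
    rules: list[str] = []
    # Causal verbs only for RCTs at the strongest evidence level.
    if not (publication_type == "rct" and evidence_level == "E0"):
        rules.append("Do not use causal verbs; report associations only.")
    # Non-ME/CFS populations need an explicit bridge qualifier.
    if disease_context != "me_cfs":
        rules.append("Add a bridge qualifier: population is not ME/CFS.")
    return rules
```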

07

Draft

Private database row

Packages the nine fields into a draft Study record. Not visible on the public site yet.

08

Moderate

Auto-moderation + lint gate (enforce mode)

Applies deterministic checks (retracted? duplicate? weak case definition? preprint? psychosomatic paradigm without consensus evidence?) and the Pass B lint gate. Any draft with a CRITICAL wording flag (efficacy leakage, causal leakage, population mismatch, preprint-settled claim, etc.) is routed to a HOLD queue and does not auto-publish. The HOLD queue is manually triaged by the maintainer.
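The enforce-mode routing reduces to one set intersection. The snake_case flag identifiers are this sketch's rendering of the flag names in the text, not the scanner's actual vocabulary.

```python
CRITICAL_FLAGS = {
    "efficacy_leakage",
    "causal_leakage",
    "population_mismatch",
    "preprint_settled_claim",
}

def route_draft(lint_flags: set[str]) -> str:
    """Enforce mode: any CRITICAL wording flag routes the draft to the
    HOLD queue for manual triage; a clean draft proceeds to stage 09."""
    return "HOLD" if lint_flags & CRITICAL_FLAGS else "AUTO_PUBLISH"
```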

09

Auto-publish

Public atlas

Flips the draft to public. From this point the study page is indexable by search engines.

Versioning

Each LLM prompt carries a version string (class.v…, gen.v…). When the prompt changes, the version changes. Each deterministic scanner carries a version string (currently pass-b.v1.4.2026-04-17). Every study row stores which prompt version produced its text.
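The per-row stamp can be sketched as an immutable record. The scanner string is the one quoted above; the classifier and generator version strings used here are placeholders, not the atlas's actual current values.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class GenerationProvenance:
    """Stored on every study row: the versions that produced its text."""
    classifier_prompt: str  # a class.v… string
    generator_prompt: str   # a gen.v… string
    scanner: str            # a pass-b.v… string
```

Freezing the record means a study's provenance can only change by writing a new row, never by mutating an old one.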

What the pipeline does not do

  • It does not re-rank studies by importance.
  • It does not call any paper “good” or “bad”.
  • It does not aggregate results across studies.
  • It does not produce recommendations.
  • It does not answer patient questions.

Current corpus

  • 6,129 public study pages.
  • 16 research topics.
  • Generator v1 active. Scanner v1.4 active. Lint gate in enforce mode.

See the editorial policy for the contract this pipeline is held to.