Position Paper  ·  Accepted at WCCI 2026 — IJCNN

Just as Humans Need Vaccines, So Do Models: Model Immunization to Combat Falsehoods

1Vector Institute, Toronto, Canada  ·  2University of Central Florida, USA  ·  3University of Cincinnati, USA
4Independent Researcher, USA  ·  5University of Groningen, The Netherlands
*Corresponding author: shaina.raza@torontomu.ca    Work done outside of Amazon.
Figure: Biological vaccination vs. model immunization: controlled exposure builds resistance.
Figure: Immunization fine-tuning: periodic injection of labeled falsehoods (5–10% of tokens) alongside truthful data.

The immunization analogy. Just as biological vaccines use controlled exposure to weakened pathogens to train immune responses, model immunization uses controlled exposure to labeled falsehoods to train rejection responses.

Abstract

Large language models (LLMs) reproduce misinformation by learning the linguistic patterns that make falsehoods persuasive — such as hedging, false presuppositions, and citation fabrication — rather than merely memorizing false facts. We propose model immunization: supervised fine-tuning on curated (false claim, correction) pairs injected as small “vaccine doses” (5–10% of tokens) alongside truthful data. Unlike post-hoc filtering or preference-based alignment, immunization provides direct negative supervision on labeled falsehoods. Across four open-weight model families, immunization improves TruthfulQA accuracy by 12 points and misinformation rejection by 30 points with negligible capability loss. We outline design requirements — dosage, labeling, quarantine, diversity — and call for standardized vaccine corpora and benchmarks that test generalization, making immunization a routine component of responsible LLM development.

Index Terms: large language models, misinformation, bias, generative AI, responsible AI

Immunization Pipeline

The immunization pipeline diagram

Data curation separates truthful and false examples; falsehoods enter a quarantined repository with fact-checker verification before controlled injection during fine-tuning. Validation tests truthfulness and robustness to held-out misinformation; deployment includes continuous monitoring with feedback for iterative refinement. A governance layer ensures accountability and auditability throughout.
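The paper releases no code, so the "controlled injection" step above can only be sketched. A minimal illustration, assuming a simple supervised fine-tuning setup where `build_vaccine_mix`, the field names, and the `(claim, correction)` pair format are all hypothetical choices, not the authors' implementation:

```python
import random

def build_vaccine_mix(truthful, vaccine_pairs, dosage=0.05, seed=0):
    """Blend truthful fine-tuning examples with labeled (false claim,
    correction) 'vaccine' pairs so the pairs make up roughly `dosage`
    of the final mix. Falsehoods keep an explicit label, mirroring the
    quarantine requirement: they are never mixed in unmarked."""
    rng = random.Random(seed)
    # Solve n / (len(truthful) + n) = dosage for the number of doses n.
    n_vaccine = round(dosage * len(truthful) / (1 - dosage))
    doses = [
        {"prompt": claim, "target": correction, "label": "falsehood"}
        for claim, correction in rng.choices(vaccine_pairs, k=n_vaccine)
    ]
    mix = [{"prompt": p, "target": t, "label": "truthful"} for p, t in truthful]
    mix += doses
    rng.shuffle(mix)
    return mix
```

The explicit `label` field is what distinguishes this from ordinary data augmentation: the trainer can route labeled falsehoods to a rejection/correction objective rather than a plain next-token loss.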

Experimental Results

Table I — Efficacy Across Model Families (5% vaccine dosage)

| Model | TQA Base | TQA Imm. | Δ TQA | Rej. Base | Rej. Imm. | Δ Rej. |
|---|---|---|---|---|---|---|
| Phi-3-mini (3.8B) | 38.2 | 47.6 | +9.4 | 41.5 | 68.0 | +26.5 |
| Llama-2-7B-Chat | 35.1 | 48.3 | +13.2 | 43.0 | 74.5 | +31.5 |
| Mistral-7B-Instruct | 42.3 | 55.8 | +13.5 | 47.5 | 79.0 | +31.5 |
| Llama-3-8B-Instruct | 44.7 | 58.2 | +13.5 | 51.0 | 82.5 | +31.5 |
| Average | 40.1 | 52.5 | +12.4 | 45.8 | 76.0 | +30.2 |

Table II — Cross-Domain Transfer (health-only training)

| Test Domain | Rejection Rate | Gap vs. In-Domain |
|---|---|---|
| Health (in-domain) | 84.2% | (baseline) |
| Political (held-out) | 61.8% | –22.4 |
| Science (held-out) | 68.5% | –15.7 |
| Avg. held-out | 65.2% | –19.0 |

Cross-domain transfer is partial but meaningful, suggesting the model learns discourse-level patterns rather than only the specific claims seen in training.

Table III — Dosage Ablation (Mistral-7B-Instruct)

| Dosage | TruthfulQA | Rejection | MMLU |
|---|---|---|---|
| 0% (base) | 42.3 | 47.5 | 58.4 |
| 2% | 47.8 (+5.5) | 64.5 (+17.0) | 58.2 |
| 5% ✓ recommended | 55.8 (+13.5) | 79.0 (+31.5) | 57.9 |
| 10% | 57.2 (+14.9) | 83.5 (+36.0) | 57.1 |
| 20% | 56.8 (+14.5) | 85.0 (+37.5) | 55.2 (–3.2) |

Truthfulness gains plateau beyond 10%. At 20% dosage, MMLU falls 3.2 points below the base model (1.9 points below the 10% setting), supporting the 5–10% recommendation.

| Technique | Input | Goal | Factuality Signal |
|---|---|---|---|
| Adversarial training | perturbations | robustness | none |
| RLHF | preferences | alignment | indirect |
| Immunization | falsehoods | truthfulness | direct |
| Post-hoc detection | outputs | filtering | reactive |

(a) Comparison by input type, goal, and factuality signal.
[Timeline over Training → Fine-tuning → Inference, placing Adversarial Training, RLHF, Model Immunization, and Post-hoc Detection at their respective stages.]
(b) Lifecycle timeline: when each defense applies.
Figure 3: Misinformation defense techniques across the LLM lifecycle. Immunization provides direct negative supervision on labeled falsehoods during fine-tuning, distinct from indirect preference signals (RLHF) and reactive output filtering (post-hoc detection).

Linguistic Patterns in Misinformation

Misinformation is fundamentally a linguistic phenomenon. Beyond incorrect facts, false claims exhibit recurring discourse patterns that make them persuasive. These are learnable features that models acquire from training data — and that immunization specifically targets.

| Pattern | Description and Example |
|---|---|
| Hedged assertion | Weasel words avoiding commitment: “Some experts believe...”, “It has been reported that...” |
| False presupposition | Smuggles a false premise into discourse: “When did the government admit the cover-up?” |
| Citation fabrication | Mimics scholarly sourcing: “According to a Stanford study...” (nonexistent) |
| Emotive amplification | Substitutes affect for evidence: “The SHOCKING truth they don’t want you to know” |
| False balance | Presents fringe views as equally credible: “While most scientists say X, others argue Y” |
| Temporal manipulation | Implies causation through proximity: “Shortly after the vaccine rollout, deaths increased” |
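Because these patterns are surface-detectable, curators can use cheap heuristics to triage candidate falsehoods for the quarantined repository before human fact-checking. A toy sketch, where the regexes and function name are illustrative keyword matchers (real curation would use trained classifiers, not keywords):

```python
import re

# One rough regex per discourse pattern from the table above.
# These are illustrative triage heuristics, not the paper's method.
PATTERNS = {
    "hedged_assertion": re.compile(
        r"\b(some experts believe|it has been reported that)\b", re.I),
    "citation_fabrication": re.compile(
        r"\baccording to a \w+ study\b", re.I),
    "emotive_amplification": re.compile(
        r"\b(shocking|they don['\u2019]t want you to know)\b", re.I),
    "false_balance": re.compile(
        r"\bwhile most scientists say\b.*\bothers argue\b", re.I),
    "temporal_manipulation": re.compile(
        r"\bshortly after\b.*\b(increased|rose|spiked)\b", re.I),
}

def flag_patterns(text):
    """Return the names of discourse patterns matched in `text`."""
    return [name for name, rx in PATTERNS.items() if rx.search(text)]
```

Flagged texts would then go to fact-checker verification, consistent with the quarantine step in the pipeline; a clean triage pass is what keeps unlabeled falsehoods out of the fine-tuning mix.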

BibTeX

@article{ModelImmunization2026,
  title   = {Just as Humans Need Vaccines, So Do Models: Model Immunization to Combat Falsehoods},
  author  = {Raza, Shaina and Qureshi, Rizwan and Farooq, Azib and Lotif, Marcelo
             and Chadha, Aman and Pandya, Deval and Emmanouilidis, Christos},
  journal = {WCCI (IJCNN)},
  year    = {2026},
  url     = {https://www.arxiv.org/abs/2505.17870}
}