Position Paper  ·  Accepted at WCCI 2026 — IJCNN

Just as Humans Need Vaccines, So Do Models: Model Immunization to Combat Falsehoods

1Vector Institute, Toronto, Canada  ·  2University of Central Florida, USA  ·  3University of Cincinnati, USA
4Independent Researcher, USA  ·  5University of Groningen, The Netherlands
*Corresponding author: shaina.raza@torontomu.ca    Work done outside of Amazon.
Figure: Biological vaccination vs. model immunization: controlled exposure builds resistance.
Figure: Immunization fine-tuning: periodic injection of labeled falsehoods (5–10% of tokens) alongside truthful data.

The immunization analogy. Just as biological vaccines use controlled exposure to weakened pathogens to train immune responses, model immunization uses controlled exposure to labeled falsehoods to train rejection responses.

Abstract

Large language models (LLMs) reproduce misinformation by learning the linguistic patterns that make falsehoods persuasive — such as hedging, false presuppositions, and citation fabrication — rather than merely memorizing false facts. We propose model immunization: supervised fine-tuning on curated (false claim, correction) pairs injected as small “vaccine doses” (5–10% of tokens) alongside truthful data. Unlike post-hoc filtering or preference-based alignment, immunization provides direct negative supervision on labeled falsehoods. Across four open-weight model families, immunization improves TruthfulQA accuracy by 12 points and misinformation rejection by 30 points with negligible capability loss. We outline design requirements — dosage, labeling, quarantine, diversity — and call for standardized vaccine corpora and benchmarks that test generalization, making immunization a routine component of responsible LLM development.

Index Terms: large language models, misinformation, bias, generative AI, responsible AI

Immunization Pipeline

The immunization pipeline diagram

Data curation separates truthful and false examples; falsehoods enter a quarantined repository with fact-checker verification before controlled injection during fine-tuning. Validation tests truthfulness and robustness to held-out misinformation; deployment includes continuous monitoring with feedback for iterative refinement. A governance layer ensures accountability and auditability throughout.
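The paper releases no code, so the "controlled injection" step above can only be sketched. A minimal illustration, assuming a simple supervised fine-tuning setup where `build_vaccine_mix`, the field names, and the `(claim, correction)` pair format are all hypothetical choices, not the authors' implementation:

```python
import random

def build_vaccine_mix(truthful, vaccine_pairs, dosage=0.05, seed=0):
    """Blend truthful fine-tuning examples with labeled (false claim,
    correction) 'vaccine' pairs so the pairs make up roughly `dosage`
    of the final mix. Falsehoods keep an explicit label, mirroring the
    quarantine requirement: they are never mixed in unmarked."""
    rng = random.Random(seed)
    # Solve n / (len(truthful) + n) = dosage for the number of doses n.
    n_vaccine = round(dosage * len(truthful) / (1 - dosage))
    doses = [
        {"prompt": claim, "target": correction, "label": "falsehood"}
        for claim, correction in rng.choices(vaccine_pairs, k=n_vaccine)
    ]
    mix = [{"prompt": p, "target": t, "label": "truthful"} for p, t in truthful]
    mix += doses
    rng.shuffle(mix)
    return mix
```

The explicit `label` field is what distinguishes this from ordinary data augmentation: the trainer can route labeled falsehoods to a rejection/correction objective rather than a plain next-token loss.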

Experimental Results

Table I — Efficacy Across Model Families (5% vaccine dosage)

| Model | TQA Base | TQA Imm. | Δ TQA | Rej. Base | Rej. Imm. | Δ Rej. |
|---|---|---|---|---|---|---|
| Phi-3-mini (3.8B) | 38.2 | 47.6 | +9.4 | 41.5 | 68.0 | +26.5 |
| Llama-2-7B-Chat | 35.1 | 48.3 | +13.2 | 43.0 | 74.5 | +31.5 |
| Mistral-7B-Instruct | 42.3 | 55.8 | +13.5 | 47.5 | 79.0 | +31.5 |
| Llama-3-8B-Instruct | 44.7 | 58.2 | +13.5 | 51.0 | 82.5 | +31.5 |
| Average | 40.1 | 52.5 | +12.4 | 45.8 | 76.0 | +30.2 |

Table II — Cross-Domain Transfer (health-only training)

| Test Domain | Rejection Rate | Gap vs. In-Domain |
|---|---|---|
| Health (in-domain) | 84.2% | (baseline) |
| Political (held-out) | 61.8% | –22.4 |
| Science (held-out) | 68.5% | –15.7 |
| Avg. held-out | 65.2% | –19.0 |

Cross-domain transfer is partial but meaningful, suggesting the model learns discourse-level patterns rather than only the specific claims seen in training.

Table III — Dosage Ablation (Mistral-7B-Instruct)

| Dosage | TruthfulQA | Rejection | MMLU |
|---|---|---|---|
| 0% (base) | 42.3 | 47.5 | 58.4 |
| 2% | 47.8 (+5.5) | 64.5 (+17.0) | 58.2 |
| 5% ✓ recommended | 55.8 (+13.5) | 79.0 (+31.5) | 57.9 |
| 10% | 57.2 (+14.9) | 83.5 (+36.0) | 57.1 |
| 20% | 56.8 (+14.5) | 85.0 (+37.5) | 55.2 (–3.2) |

Truthfulness gains plateau beyond 10%. At 20% dosage, MMLU falls 3.2 points below the base model (1.9 points below the 10% setting), supporting the 5–10% recommendation.

| Technique | Input | Goal | Factuality Signal |
|---|---|---|---|
| Adversarial training | perturbations | robustness | none |
| RLHF | preferences | alignment | indirect |
| Immunization | falsehoods | truthfulness | direct |
| Post-hoc detection | outputs | filtering | reactive |

(a) Comparison by input type, goal, and factuality signal.
[Timeline over Training → Fine-tuning → Inference, placing Adversarial Training, RLHF, Model Immunization, and Post-hoc Detection at their respective stages.]
(b) Lifecycle timeline: when each defense applies.
Figure 3: Misinformation defense techniques across the LLM lifecycle. Immunization provides direct negative supervision on labeled falsehoods during fine-tuning, distinct from indirect preference signals (RLHF) and reactive output filtering (post-hoc detection).

Linguistic Patterns in Misinformation

Misinformation is fundamentally a linguistic phenomenon. Beyond incorrect facts, false claims exhibit recurring discourse patterns that make them persuasive. These are learnable features that models acquire from training data — and that immunization specifically targets.

| Pattern | Description and Example |
|---|---|
| Hedged assertion | Weasel words avoiding commitment: “Some experts believe...”, “It has been reported that...” |
| False presupposition | Smuggles a false premise into discourse: “When did the government admit the cover-up?” |
| Citation fabrication | Mimics scholarly sourcing: “According to a Stanford study...” (nonexistent) |
| Emotive amplification | Substitutes affect for evidence: “The SHOCKING truth they don’t want you to know” |
| False balance | Presents fringe views as equally credible: “While most scientists say X, others argue Y” |
| Temporal manipulation | Implies causation through proximity: “Shortly after the vaccine rollout, deaths increased” |
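Because these patterns are surface-detectable, curators can use cheap heuristics to triage candidate falsehoods for the quarantined repository before human fact-checking. A toy sketch, where the regexes and function name are illustrative keyword matchers (real curation would use trained classifiers, not keywords):

```python
import re

# One rough regex per discourse pattern from the table above.
# These are illustrative triage heuristics, not the paper's method.
PATTERNS = {
    "hedged_assertion": re.compile(
        r"\b(some experts believe|it has been reported that)\b", re.I),
    "citation_fabrication": re.compile(
        r"\baccording to a \w+ study\b", re.I),
    "emotive_amplification": re.compile(
        r"\b(shocking|they don['\u2019]t want you to know)\b", re.I),
    "false_balance": re.compile(
        r"\bwhile most scientists say\b.*\bothers argue\b", re.I),
    "temporal_manipulation": re.compile(
        r"\bshortly after\b.*\b(increased|rose|spiked)\b", re.I),
}

def flag_patterns(text):
    """Return the names of discourse patterns matched in `text`."""
    return [name for name, rx in PATTERNS.items() if rx.search(text)]
```

Flagged texts would then go to fact-checker verification, consistent with the quarantine step in the pipeline; a clean triage pass is what keeps unlabeled falsehoods out of the fine-tuning mix.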

BibTeX

@article{ModelImmunization2026,
  title   = {Just as Humans Need Vaccines, So Do Models: Model Immunization to Combat Falsehoods},
  author  = {Raza, Shaina and Qureshi, Rizwan and Farooq, Azib and Lotif, Marcelo
             and Chadha, Aman and Pandya, Deval and Emmanouilidis, Christos},
  journal = {WCCI (IJCNN)},
  year    = {2026},
  url     = {https://www.arxiv.org/abs/2505.17870}
}