Reducing Hallucinations in LLMs via Factuality-Aware Preference Learning

¹Vector Institute for Artificial Intelligence, ²University of Cincinnati, ³University of Calgary
Factual Alignment DPO

Overview of the main approach: factual alignment using DPO.

Abstract

Preference alignment methods such as RLHF and Direct Preference Optimization (DPO) improve instruction following, but they can also reinforce hallucinations when preference judgments reward fluency and confidence over factual correctness. We introduce F-DPO (Factuality-aware Direct Preference Optimization), a simple extension of DPO that uses only binary factuality labels. F-DPO (i) applies a label-flipping transformation that corrects misordered preference pairs so the chosen response is never less factual than the rejected one, and (ii) adds a factuality-aware margin that emphasizes pairs with clear correctness differences, while reducing to standard DPO when both responses share the same factuality. We construct factuality-aware preference data by augmenting DPO pairs with binary factuality indicators and synthetic hallucinated variants. Across seven open-weight LLMs (1B–14B), F-DPO consistently improves factuality and reduces hallucination rates relative to both base models and standard DPO. On Qwen3-8B, F-DPO reduces hallucination rates by 5× (from 0.424 to 0.084) while improving factuality scores by 50% (from 5.26 to 7.90). F-DPO also generalizes to out-of-distribution benchmarks: on TruthfulQA, Qwen2.5-14B achieves +17% MC1 accuracy (0.500 to 0.585) and +49% MC2 accuracy (0.357 to 0.531). F-DPO requires no auxiliary reward model, token-level annotations, or multi-stage training.
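The description above corresponds to a margin-shifted DPO objective; written out (a sketch in our notation, mirroring Algorithm 1 below rather than quoting the paper's exact statement):

\mathcal{L}_{\text{F-DPO}}(\theta) = -\,\mathbb{E}_{(x,\,y_w,\,y_l,\,h_w,\,h_l)\sim\mathcal{D}} \left[ \log \sigma\!\left( \beta \left( \log\frac{\pi_\theta(y_w \mid x)}{\pi_\theta(y_l \mid x)} - \log\frac{\pi_{\mathrm{ref}}(y_w \mid x)}{\pi_{\mathrm{ref}}(y_l \mid x)} - \lambda\,\Delta h \right) \right) \right], \qquad \Delta h = h_l - h_w \in \{0, 1\},

where h_w, h_l are the binary hallucination labels after the label-flipping step and λ ≥ 0 is the factuality penalty; with Δh = 0 (or λ = 0) the objective reduces to standard DPO.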

Table 1: Comparison of factuality-alignment properties across related DPO-based methods.

Methods compared: Standard DPO, MASK-DPO, FactTune, Context-DPO, FLAME, SafeDPO, Self-alignment DPO, and Ours (F-DPO).
Properties compared: single-stage training, label correction, factuality margin, hallucination penalty, external-model free, response-level supervision, compute efficiency.

Architecture

Figure: F-DPO architecture overview.

Data Pipeline

Figure: Factuality-aware preference data construction pipeline.
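As a rough illustration of the data described in the abstract (DPO pairs augmented with binary factuality indicators and synthetic hallucinated variants), a single training record and the label-flipping check from Algorithm 1 could look as follows in Python; the field names and values are illustrative, not the released schema:

# Illustrative preference record: a DPO pair plus binary hallucination labels
# (0 = factual, 1 = hallucinated); the rejected response is a synthetic hallucinated variant.
example = {
    "prompt": "Who wrote the novel Frankenstein?",
    "chosen": "Frankenstein was written by Mary Shelley and first published in 1818.",
    "rejected": "Frankenstein was written by Bram Stoker in 1897.",
    "h_chosen": 0,
    "h_rejected": 1,
}

def flip_if_misordered(ex):
    # Phase 1 of Algorithm 1: swap roles whenever the chosen response is less factual
    # than the rejected one, so that h_rejected - h_chosen is never negative.
    if ex["h_rejected"] - ex["h_chosen"] < 0:
        ex["chosen"], ex["rejected"] = ex["rejected"], ex["chosen"]
        ex["h_chosen"], ex["h_rejected"] = ex["h_rejected"], ex["h_chosen"]
    return ex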

Algorithm 1: FactualDPO Training

Input: Dataset D, reference policy π_ref, penalty λ, temperature β
Output: Trained policy π_θ

π_θ ← π_ref
// Phase 1: Label Transformation
for each (x, y_w, y_l, h_w, h_l) ∈ D do
    Δh ← h_l − h_w
    if Δh < 0 then
        (y_w, y_l, h_w, h_l) ← (y_l, y_w, h_l, h_w)
// Phase 2: Training
for each iteration do
    Sample minibatch B ⊆ D
    for each (x, y_w, y_l, h_w, h_l) ∈ B do
        Δh ← h_l − h_w            // recompute after transformation; now Δh ∈ {0, 1}
        m ← log(π_θ(y_w | x) / π_θ(y_l | x)) − log(π_ref(y_w | x) / π_ref(y_l | x))
        m_fact ← m − λ · Δh       // λ ∈ {0, 2, 4, 6, 8, 10, 20, 30, 50, 100}
    L ← −(1/|B|) ∑_B log σ(β · m_fact)
    Update θ via gradient descent on L
return π_θ
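A minimal PyTorch sketch of the Phase 2 loss, assuming per-sequence log-probabilities have already been computed under the policy and the frozen reference model; the function name, argument names, and default hyperparameter values are ours, not the released code:

import torch
import torch.nn.functional as F

def fdpo_loss(pi_logp_w, pi_logp_l, ref_logp_w, ref_logp_l, delta_h, beta=0.1, lam=4.0):
    # pi_logp_w / pi_logp_l:   policy log-probs of the chosen / rejected response, shape (batch,)
    # ref_logp_w / ref_logp_l: the same quantities under the frozen reference model
    # delta_h:                 h_l - h_w after label flipping, so each entry is 0 or 1
    m = (pi_logp_w - pi_logp_l) - (ref_logp_w - ref_logp_l)  # implicit reward margin m
    m_fact = m - lam * delta_h                               # factuality-aware margin
    return -F.logsigmoid(beta * m_fact).mean()               # average over the minibatch

When delta_h is zero for every pair in the batch (or lam is zero), this is exactly the standard DPO objective, matching the reduction property stated in the abstract.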

Table 2: Ablation — Effect of Label Flipping

Model | Method | Flip | Fact. ↑ | Hal. ↓ | Win ↑
Qwen2.5-14B | Standard DPO | ✗ | 7.90 | 0.080 | —
Qwen2.5-14B | Standard DPO | ✓ | 8.33 | 0.036 | 0.65
Qwen2.5-14B | F-DPO | ✗ | 8.49 | 0.032 | 0.70
Qwen2.5-14B | F-DPO | ✓ | 8.84 | 0.008 | 0.78
Qwen3-8B | Standard DPO | ✗ | 6.14 | 0.302 | —
Qwen3-8B | Standard DPO | ✓ | 6.32 | 0.280 | 0.53
Qwen3-8B | F-DPO | ✗ | 7.14 | 0.150 | 0.66
Qwen3-8B | F-DPO | ✓ | 7.90 | 0.084 | 0.70
Qwen2-7B | Standard DPO | ✗ | 6.50 | 0.238 | —
Qwen2-7B | Standard DPO | ✓ | 6.95 | 0.176 | 0.62
Qwen2-7B | F-DPO | ✗ | 7.14 | 0.150 | 0.66
Qwen2-7B | F-DPO | ✓ | 7.60 | 0.082 | 0.70
LLaMA-3-8B | Standard DPO | ✗ | 6.00 | 0.290 | —
LLaMA-3-8B | Standard DPO | ✓ | 6.35 | 0.260 | 0.59
LLaMA-3-8B | F-DPO | ✗ | 6.50 | 0.234 | 0.56
LLaMA-3-8B | F-DPO | ✓ | 7.00 | 0.154 | 0.72
Gemma-2-9B | Standard DPO | ✗ | 8.04 | 0.092 | —
Gemma-2-9B | Standard DPO | ✓ | 8.27 | 0.064 | 0.53
Gemma-2-9B | F-DPO | ✗ | 8.06 | 0.088 | 0.49
Gemma-2-9B | F-DPO | ✓ | 8.26 | 0.068 | 0.57

Green: best per model   Yellow: second-best / improvement   “—” indicates not applicable

BibTeX

@article{FactualAlignment2026,
  title={Reducing Hallucinations in LLMs via Factuality-Aware Preference Learning},
  author={Sindhuja Chaduvula and Ahmed Radwan and Azib Farooq and Yani Ioannou and Shaina Raza},
  journal={arXiv preprint arXiv:2601.03027},
  year={2026},
  url={https://github.com/VectorInstitute/Factual-Preference-Alignment}
}

Acknowledgments

Resources used in preparing this research were provided, in part, by the Province of Ontario and the Government of Canada through CIFAR, as well as companies sponsoring the Vector Institute. This research was funded by the European Union’s Horizon Europe research and innovation programme under the AIXPERT project (Grant Agreement No. 101214389), which aims to develop an agentic, multi-layered, GenAI-powered framework for creating explainable, accountable, and transparent AI systems.