ARCHIVE · ENTRY 02 · CLASS: PUBLIC

arXiv Paper · 2025–present

LLM Reliability Research

Hallucination detection · LLM Lens

Detecting hallucinations from inside the model — reading hidden states before unreliable text reaches a user.

how it resolves
Prompt → Model → Hidden states → Probe → Reliability signal
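
A rough sketch of that flow in PyTorch, under loud assumptions: random tensors stand in for real hidden states (with Hugging Face Transformers these would typically come from a forward pass with `output_hidden_states=True`), last-token pooling is just one possible choice, and `CrossLayerProbe` is a hypothetical illustration, not one of the architectures from the paper.

```python
import torch
import torch.nn as nn


class CrossLayerProbe(nn.Module):
    """Hypothetical probe: one linear scorer per layer, mixed by learned layer weights."""

    def __init__(self, num_layers: int, hidden_dim: int):
        super().__init__()
        # A separate linear scorer for each layer's representation.
        self.layer_scorers = nn.ModuleList(
            [nn.Linear(hidden_dim, 1) for _ in range(num_layers)]
        )
        # Learned mixing weights over layers (softmax-normalised in forward).
        self.layer_logits = nn.Parameter(torch.zeros(num_layers))

    def forward(self, pooled: torch.Tensor) -> torch.Tensor:
        # pooled: (batch, num_layers, hidden_dim)
        per_layer = torch.stack(
            [scorer(pooled[:, i]) for i, scorer in enumerate(self.layer_scorers)],
            dim=1,
        ).squeeze(-1)  # (batch, num_layers)
        weights = torch.softmax(self.layer_logits, dim=0)
        logit = (per_layer * weights).sum(dim=1)  # (batch,)
        return torch.sigmoid(logit)  # reliability signal in [0, 1]


torch.manual_seed(0)

# Assumed toy sizes; real models have far more layers and wider hidden states.
batch, num_layers, seq_len, hidden_dim = 8, 4, 6, 16

# Stand-in for hidden states collected from every layer of a language model.
fake_hidden_states = torch.randn(batch, num_layers, seq_len, hidden_dim)

# Pool each layer to a single vector by taking the last token's hidden state.
pooled = fake_hidden_states[:, :, -1, :]

probe = CrossLayerProbe(num_layers, hidden_dim)
with torch.no_grad():
    scores = probe(pooled)

print(scores.shape)
```

The probe here is untrained, so its scores are meaningless; in a real setup it would be fitted on labelled hidden states before being read as a reliability signal.
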
Problem

Language models hallucinate confidently, and most detection happens after generation with external verifiers. Can the unreliability be read directly from the model's own internal representations, cheaply and at inference time?

Built

PyTorch pipelines that extract hidden-state embeddings from models such as LLaMA and Qwen; cross-layer and hierarchical probing architectures that model correlations across layers; scalable generation of labelled datasets; and a full evaluation suite (ROC-AUC, PR-AUC, F1, Expected Calibration Error, Brier score, threshold optimization). All of it ran on A100/H100 GPUs.
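
Two of the calibration metrics named above are small enough to sketch in plain Python. This is a minimal illustration, not the project's evaluation code: the function names, the equal-width binning on the predicted probability, and the toy scores are all assumptions.

```python
def brier_score(probs, labels):
    """Mean squared gap between predicted probability and the 0/1 label."""
    return sum((p - y) ** 2 for p, y in zip(probs, labels)) / len(probs)


def expected_calibration_error(probs, labels, n_bins=10):
    """Binary ECE variant: per bin, compare mean predicted probability
    with the observed rate of positive labels, weighted by bin size."""
    bins = [[] for _ in range(n_bins)]
    for p, y in zip(probs, labels):
        # Equal-width bins over [0, 1]; p == 1.0 goes into the last bin.
        idx = min(int(p * n_bins), n_bins - 1)
        bins[idx].append((p, y))

    total = len(probs)
    ece = 0.0
    for bucket in bins:
        if not bucket:
            continue
        mean_prob = sum(p for p, _ in bucket) / len(bucket)
        positive_rate = sum(y for _, y in bucket) / len(bucket)
        ece += (len(bucket) / total) * abs(positive_rate - mean_prob)
    return ece


# Toy probe outputs (probability that a response is hallucinated) and labels.
toy_probs = [0.9, 0.8, 0.2, 0.1]
toy_labels = [1, 1, 0, 0]

brier = brier_score(toy_probs, toy_labels)
ece = expected_calibration_error(toy_probs, toy_labels)
print(f"Brier: {brier:.3f}  ECE: {ece:.3f}")
```

Lower is better for both. Brier score rewards sharp, correct probabilities, while ECE only asks whether the stated confidence matches the observed frequency.
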

Role

Research intern — experimentation pipelines, probing architectures, dataset construction, and evaluation. Co-author on the resulting paper.

Proof
  • Co-authored arXiv:2604.06277 on distilling hallucination signals into transformer representations.
  • Built a 15k-sample representation-level hallucination dataset with full hidden states.
  • Built five probing architectures for internal hallucination detection without inference-time verifiers.
Stack
PyTorch · Transformers · Hugging Face · CUDA · A100/H100