ARCHIVE · ENTRY 02 · CLASS: PUBLIC
LLM Reliability Research
Hallucination detection · LLM Lens
Detecting hallucinations from inside the model — reading hidden states before unreliable text reaches a user.
Language models hallucinate confidently, and most detection happens after generation with external verifiers. Can the unreliability be read directly from the model's own internal representations, cheaply and at inference time?
The work covered PyTorch pipelines that extract hidden-state embeddings from models such as LLaMA and Qwen; cross-layer and hierarchical probing architectures that model correlations across layers; scalable labelled-dataset generation; and a full evaluation suite (ROC-AUC, PR-AUC, F1, Expected Calibration Error, Brier score, and threshold optimization), with experiments run on A100 and H100 GPUs.
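Two of the calibration metrics named above are easy to state in code. The following is a minimal sketch in plain Python, not the project's evaluation suite: the function names, the equal-width binning, and the default of 10 bins are assumptions made for illustration. It uses the binary, positive-class form of Expected Calibration Error, where each bin compares the mean predicted hallucination probability with the observed hallucination rate.

```python
from typing import Sequence


def brier_score(probs: Sequence[float], labels: Sequence[int]) -> float:
    """Mean squared error between predicted probabilities and 0/1 labels."""
    return sum((p - y) ** 2 for p, y in zip(probs, labels)) / len(probs)


def expected_calibration_error(
    probs: Sequence[float], labels: Sequence[int], n_bins: int = 10
) -> float:
    """Binary ECE with equal-width bins over the predicted probability.

    Assumption: this is the positive-class variant. Other definitions bin
    on the confidence of the predicted class instead.
    """
    bins = [[] for _ in range(n_bins)]
    for p, y in zip(probs, labels):
        # p == 1.0 would index past the end, so clamp to the last bin.
        idx = min(int(p * n_bins), n_bins - 1)
        bins[idx].append((p, y))

    total = len(probs)
    ece = 0.0
    for bucket in bins:
        if not bucket:
            continue
        mean_prob = sum(p for p, _ in bucket) / len(bucket)
        positive_rate = sum(y for _, y in bucket) / len(bucket)
        # Weight each bin's calibration gap by its share of the samples.
        ece += (len(bucket) / total) * abs(mean_prob - positive_rate)
    return ece


# Toy probe outputs: label 1 means "hallucinated".
example_probs = [0.9, 0.1, 0.8, 0.3]
example_labels = [1, 0, 1, 0]
example_brier = brier_score(example_probs, example_labels)
example_ece = expected_calibration_error(example_probs, example_labels)
```

Lower is better for both scores. A probe can rank samples well (high ROC-AUC) and still be poorly calibrated, which is why these are usually reported alongside the ranking metrics.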
Research intern — experimentation pipelines, probing architectures, dataset construction, and evaluation. Co-author on the resulting paper.
- Co-authored arXiv:2604.06277 on distilling hallucination signals into transformer representations.
- Built a 15k-sample representation-level hallucination dataset with full hidden states.
- Built five probing architectures for internal hallucination detection without inference-time verifiers.
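To make the idea of probing across layers concrete, here is a hedged PyTorch sketch of one simple way to combine per-layer hidden states: a learned softmax weighting over layers followed by a small classifier. It is not one of the five architectures from the project. The class name `CrossLayerProbe`, the tensor layout `(batch, num_layers, hidden_dim)`, and all sizes are assumptions chosen for illustration, and random tensors stand in for real hidden states.

```python
import torch
from torch import nn


class CrossLayerProbe(nn.Module):
    """Hypothetical probe: mix layers with learned weights, then classify."""

    def __init__(self, num_layers: int, hidden_dim: int, proj_dim: int = 64):
        super().__init__()
        # One learnable scalar per layer, turned into mixing weights by softmax.
        self.layer_logits = nn.Parameter(torch.zeros(num_layers))
        self.head = nn.Sequential(
            nn.Linear(hidden_dim, proj_dim),
            nn.ReLU(),
            nn.Linear(proj_dim, 1),
        )

    def forward(self, hidden_states: torch.Tensor) -> torch.Tensor:
        # hidden_states: (batch, num_layers, hidden_dim), one vector per layer.
        weights = torch.softmax(self.layer_logits, dim=0)
        pooled = torch.einsum("l,bld->bd", weights, hidden_states)
        # One hallucination logit per sample.
        return self.head(pooled).squeeze(-1)


torch.manual_seed(0)
probe = CrossLayerProbe(num_layers=8, hidden_dim=32)
fake_hidden = torch.randn(4, 8, 32)            # stand-in for extracted states
fake_labels = torch.tensor([1.0, 0.0, 1.0, 0.0])  # 1 = hallucinated

logits = probe(fake_hidden)
loss = nn.functional.binary_cross_entropy_with_logits(logits, fake_labels)
loss.backward()  # gradients reach both the layer weights and the head
```

Because the probe only reads stored activations, it adds a small forward pass on top of generation rather than a second model call, which is the practical appeal of detection from internal representations.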