LLM Reliability Research

Hallucination detection · LLM Lens

Detecting hallucinations from inside the model — reading hidden states before unreliable text reaches a user.

How it resolves

Prompt → Model → Hidden states → Probe → Reliability signal
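The last two stages of this flow (hidden states in, reliability signal out) can be sketched as a small PyTorch module. This is a minimal illustration under stated assumptions, not the architecture from the paper: the class name `CrossLayerProbe`, the attention-style pooling over layers, and the `probe_dim` size are all hypothetical, and the input is assumed to be one already-pooled vector per layer (for example the last-token state). In Hugging Face Transformers, per-layer states are typically obtained by calling the model with `output_hidden_states=True`; random tensors stand in for them here so the sketch runs without downloading a model.

```python
import torch
import torch.nn as nn


class CrossLayerProbe(nn.Module):
    """Hypothetical probe: shared projection per layer, then learned pooling across layers."""

    def __init__(self, hidden_dim: int, probe_dim: int = 64):
        super().__init__()
        self.project = nn.Linear(hidden_dim, probe_dim)   # shared across layers
        self.layer_scores = nn.Linear(probe_dim, 1)       # one relevance score per layer
        self.classifier = nn.Linear(probe_dim, 1)         # final reliability logit

    def forward(self, hidden_states: torch.Tensor) -> torch.Tensor:
        # hidden_states: (batch, num_layers, hidden_dim), one vector per layer (assumed)
        z = torch.tanh(self.project(hidden_states))            # (batch, layers, probe_dim)
        weights = torch.softmax(self.layer_scores(z), dim=1)   # weights sum to 1 over layers
        pooled = (weights * z).sum(dim=1)                      # (batch, probe_dim)
        logit = self.classifier(pooled).squeeze(-1)            # (batch,)
        return torch.sigmoid(logit)                            # score in [0, 1]


torch.manual_seed(0)
batch, num_layers, hidden_dim = 4, 8, 32            # toy sizes, not real model dimensions
fake_hidden = torch.randn(batch, num_layers, hidden_dim)  # stand-in for extracted states
probe = CrossLayerProbe(hidden_dim)
scores = probe(fake_hidden)
print(scores.shape)
```

The probe is untrained here, so the scores carry no meaning; in practice it would be fit with a binary cross-entropy loss against hallucination labels while the language model itself stays frozen.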

Problem

Language models hallucinate confidently, and most detection happens after generation with external verifiers. Can the unreliability be read directly from the model's own internal representations, cheaply and at inference time?

Built

PyTorch pipelines that extract hidden-state embeddings from models such as LLaMA and Qwen; cross-layer and hierarchical probing architectures that model correlations across layers; scalable generation of labelled datasets; and a full evaluation suite (ROC-AUC, PR-AUC, F1, Expected Calibration Error, Brier score, threshold optimization), all run on A100/H100 GPUs.
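Three of the evaluation metrics listed above are simple enough to sketch in plain Python. The function names, the 10-bin equal-width binning for Expected Calibration Error, and the toy scores are assumptions for illustration; the project's actual evaluation code is not shown in the source. Labels are taken to be 1 for "hallucinated" and 0 otherwise, and `probs` is the probe's predicted probability of the positive class.

```python
from typing import Sequence, Tuple


def brier_score(probs: Sequence[float], labels: Sequence[int]) -> float:
    """Mean squared error between predicted probability and the 0/1 label."""
    return sum((p - y) ** 2 for p, y in zip(probs, labels)) / len(probs)


def expected_calibration_error(
    probs: Sequence[float], labels: Sequence[int], n_bins: int = 10
) -> float:
    """Binned ECE: size-weighted gap between mean predicted probability
    and observed positive rate in each equal-width bin."""
    bins = [[] for _ in range(n_bins)]
    for p, y in zip(probs, labels):
        idx = min(int(p * n_bins), n_bins - 1)  # p == 1.0 falls in the last bin
        bins[idx].append((p, y))
    ece = 0.0
    for bucket in bins:
        if not bucket:
            continue
        mean_prob = sum(p for p, _ in bucket) / len(bucket)
        positive_rate = sum(y for _, y in bucket) / len(bucket)
        ece += (len(bucket) / len(probs)) * abs(positive_rate - mean_prob)
    return ece


def best_f1_threshold(
    probs: Sequence[float], labels: Sequence[int]
) -> Tuple[float, float]:
    """Try every observed score as a threshold; return (threshold, F1) with the best F1."""
    best_t, best_f1 = 0.5, -1.0
    for t in sorted(set(probs)):
        tp = sum(1 for p, y in zip(probs, labels) if p >= t and y == 1)
        fp = sum(1 for p, y in zip(probs, labels) if p >= t and y == 0)
        fn = sum(1 for p, y in zip(probs, labels) if p < t and y == 1)
        f1 = 2 * tp / (2 * tp + fp + fn) if tp else 0.0
        if f1 > best_f1:
            best_t, best_f1 = t, f1
    return best_t, best_f1


# Toy probe outputs, invented for illustration only.
probs = [0.9, 0.8, 0.3, 0.2]
labels = [1, 1, 0, 0]
print(brier_score(probs, labels))
print(expected_calibration_error(probs, labels))
print(best_f1_threshold(probs, labels))
```

The threshold sweep is quadratic in the number of samples, which is fine for a sketch; on a dataset of thousands of examples a sorted single pass or a library routine would be the more practical choice.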

Role

Research intern — experimentation pipelines, probing architectures, dataset construction, and evaluation. Co-author on the resulting paper.

Proof

Co-authored arXiv:2604.06277 on distilling hallucination signals into transformer representations.
Built a 15k-sample representation-level hallucination dataset with full hidden states.
Five probing architectures for internal hallucination detection without inference-time verifiers.

Stack

PyTorch · Transformers · Hugging Face · CUDA · A100/H100

Read the paper ↗

