How language models fail — and how systems can respond.
My work focuses on LLM reliability: detecting hallucinations through internal transformer representations. I study whether hidden states, cross-layer behavior, and uncertainty signals can flag unreliable generations before they ever reach a user.
Read reliability from inside the model rather than from an external checker after the fact, at a cost low enough to run at inference time.
Hidden-state probing
Whether internal activations encode hallucination signals across layers and datasets — and how to read them.
Cross-model portability
Whether reliability detectors trained on one model family transfer to another, or have to be re-learned.
Reliability controllers
Lightweight control layers that can detect, abstain, or seek evidence during generation, before text reaches a user.
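As a rough illustration of the detect / abstain / seek-evidence idea, the sketch below maps a hallucination-risk score to one of three actions. It is not the controller from this work: the `Action` enum, the `choose_action` function, and both threshold values are hypothetical names and numbers chosen for the example, and the risk score is assumed to come from some upstream detector that outputs a value in [0, 1].

```python
from enum import Enum


class Action(Enum):
    ANSWER = "answer"                # risk is low: release the generation
    SEEK_EVIDENCE = "seek_evidence"  # risk is uncertain: retrieve or verify first
    ABSTAIN = "abstain"              # risk is high: decline to answer


def choose_action(risk: float,
                  answer_below: float = 0.3,
                  abstain_above: float = 0.7) -> Action:
    """Map a hallucination-risk score in [0, 1] to a control action.

    The thresholds are illustrative placeholders, not tuned values.
    """
    if not 0.0 <= risk <= 1.0:
        raise ValueError("risk must be in [0, 1]")
    if answer_below > abstain_above:
        raise ValueError("answer_below must not exceed abstain_above")
    if risk < answer_below:
        return Action.ANSWER
    if risk > abstain_above:
        return Action.ABSTAIN
    return Action.SEEK_EVIDENCE
```

In practice the two thresholds would be chosen on held-out data, for example by trading off coverage against the error rate among released answers.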
Weakly Supervised Distillation of Hallucination Signals into Transformer Representations
A weak-supervision framework, a 15k-sample representation-level hallucination dataset with full transformer hidden states, and five probing architectures for detecting hallucinations from internal activations — without external inference-time verification.
Read arXiv:2604.06277 ↗
Working on LLM reliability and hallucination detection by studying internal transformer representations, uncertainty signals, and cross-layer behavior in modern language models.
- Built PyTorch experimentation pipelines to extract hidden-state embeddings from transformer models such as LLaMA and Qwen for token-level and sequence-level hallucination analysis.
- Designed cross-layer and hierarchical transformer architectures to model correlations across hidden states for better hallucination classification and uncertainty estimation.
- Created scalable dataset generation workflows using Hugging Face models to synthesize prompts, responses, and labeled hallucination corpora for benchmarking.
- Implemented evaluation frameworks spanning ROC-AUC, PR-AUC, F1, Expected Calibration Error, Brier score, and threshold optimization for reliability assessment.
- Ran reproducible GPU-heavy experiments across CUDA and Colab A100/H100 environments while managing checkpoints, experiment logs, and training pipelines.
- Co-authored the arXiv paper 'Weakly Supervised Distillation of Hallucination Signals into Transformer Representations', released on April 7, 2026.
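Two of the calibration metrics named in the list above, the Brier score and Expected Calibration Error, can be sketched in a few lines of plain Python. This is a minimal illustration and not the evaluation code from the project: the function names are hypothetical, the inputs are assumed to be predicted hallucination probabilities with binary labels (1 = hallucinated), and the ECE variant shown bins by predicted probability with equal-width bins, which is only one of several common definitions.

```python
def brier_score(probs, labels):
    """Mean squared difference between predicted probability and 0/1 label."""
    if not probs or len(probs) != len(labels):
        raise ValueError("probs and labels must be non-empty and equal length")
    return sum((p - y) ** 2 for p, y in zip(probs, labels)) / len(probs)


def expected_calibration_error(probs, labels, n_bins=10):
    """Binary ECE with equal-width bins over the predicted probability.

    For each bin, compare the mean predicted probability with the observed
    fraction of positives, then average the gaps weighted by bin size.
    """
    if not probs or len(probs) != len(labels):
        raise ValueError("probs and labels must be non-empty and equal length")
    bins = [[] for _ in range(n_bins)]
    for p, y in zip(probs, labels):
        # A probability of exactly 1.0 is placed in the last bin.
        idx = min(int(p * n_bins), n_bins - 1)
        bins[idx].append((p, y))
    total = len(probs)
    ece = 0.0
    for members in bins:
        if not members:
            continue
        mean_prob = sum(p for p, _ in members) / len(members)
        frac_pos = sum(y for _, y in members) / len(members)
        ece += (len(members) / total) * abs(mean_prob - frac_pos)
    return ece
```

A detector that predicts 0.8 on five examples, four of which are hallucinated, is perfectly calibrated in this sense (ECE close to 0) even though its Brier score is nonzero, which is why the two metrics are usually reported together.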