Research

How language models fail — and how systems can respond.

My work focuses on LLM reliability: detecting hallucinations through internal transformer representations. I study whether hidden states, cross-layer behavior, and uncertainty signals can flag unreliable generations before they ever reach a user.

Detection path: Prompt → Model → Hidden states → Probe → Reliability signal

Reliability is read from inside the model rather than from an external checker after the fact, which keeps detection cheap enough to run at inference time.
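As a rough illustration of the probe step in the detection path, the sketch below fits a logistic-regression probe on synthetic stand-ins for hidden states. Everything here is an assumption made for illustration: the function names (`train_linear_probe`, `reliability_signal`, `make_synthetic_states`), the 8-dimensional vectors, and the synthetic data are hypothetical, and a linear probe is only the simplest possible choice, not necessarily one of the architectures used in the paper.

```python
import math
import random


def sigmoid(z: float) -> float:
    # Numerically stable logistic function.
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def train_linear_probe(states, labels, lr=0.1, epochs=200):
    """Fit logistic regression with batch gradient descent.

    states: list of hidden-state vectors (lists of floats).
    labels: 1 = hallucinated, 0 = faithful (hypothetical labeling convention).
    """
    dim = len(states[0])
    n = len(states)
    w = [0.0] * dim
    b = 0.0
    for _ in range(epochs):
        grad_w = [0.0] * dim
        grad_b = 0.0
        for x, y in zip(states, labels):
            p = sigmoid(sum(wi * xi for wi, xi in zip(w, x)) + b)
            err = p - y  # gradient of the log loss w.r.t. the logit
            for i in range(dim):
                grad_w[i] += err * x[i]
            grad_b += err
        w = [wi - lr * g / n for wi, g in zip(w, grad_w)]
        b -= lr * grad_b / n
    return w, b


def reliability_signal(w, b, state):
    """Estimated probability that the generation behind `state` is hallucinated."""
    return sigmoid(sum(wi * xi for wi, xi in zip(w, state)) + b)


def make_synthetic_states(n, dim, seed=0):
    # Synthetic stand-in for real hidden states: the first coordinate is
    # shifted by the label, the remaining coordinates are pure noise.
    rng = random.Random(seed)
    states, labels = [], []
    for _ in range(n):
        y = rng.randint(0, 1)
        shift = 1.0 if y == 1 else -1.0
        x = [rng.gauss(shift, 1.0)] + [rng.gauss(0.0, 1.0) for _ in range(dim - 1)]
        states.append(x)
        labels.append(y)
    return states, labels


states, labels = make_synthetic_states(200, 8)
w, b = train_linear_probe(states, labels)
predictions = [int(reliability_signal(w, b, x) >= 0.5) for x in states]
accuracy = sum(p == y for p, y in zip(predictions, labels)) / len(labels)
print(f"training accuracy on synthetic states: {accuracy:.2f}")
```

In a real pipeline the synthetic vectors would be replaced by activations taken from a chosen layer of the model, and accuracy would be measured on held-out data rather than on the training set.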

Open questions

Hidden-state probing

Whether internal activations encode hallucination signals across layers and datasets — and how to read them.

Cross-model portability

Whether reliability detectors trained on one model family transfer to another, or have to be re-learned.

Reliability controllers

Lightweight control layers that can detect, abstain, or seek evidence during generation, before text reaches a user.
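One way to picture such a control layer is as a small policy that maps a probe's hallucination score to an action. The sketch below is a minimal, hypothetical version: the `Action` enum, the `decide` function, and both threshold values are assumptions chosen for illustration, not a description of an existing controller.

```python
from enum import Enum


class Action(Enum):
    ANSWER = "answer"                # release the generated text
    SEEK_EVIDENCE = "seek_evidence"  # e.g. trigger retrieval before answering
    ABSTAIN = "abstain"              # decline to answer


def decide(
    hallucination_prob: float,
    evidence_threshold: float = 0.4,  # hypothetical value
    abstain_threshold: float = 0.8,   # hypothetical value
) -> Action:
    """Map a hallucination probability to a control action."""
    if not 0.0 <= hallucination_prob <= 1.0:
        raise ValueError("hallucination_prob must lie in [0, 1]")
    if evidence_threshold > abstain_threshold:
        raise ValueError("evidence_threshold must not exceed abstain_threshold")
    if hallucination_prob >= abstain_threshold:
        return Action.ABSTAIN
    if hallucination_prob >= evidence_threshold:
        return Action.SEEK_EVIDENCE
    return Action.ANSWER


for score in (0.1, 0.5, 0.9):
    print(score, decide(score).value)
```

In practice the two thresholds would be tuned on validation data, trading the cost of unnecessary abstentions against the cost of letting a hallucination through.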

Output

arXiv · 2026

Weakly Supervised Distillation of Hallucination Signals into Transformer Representations

A weak-supervision framework, a 15k-sample representation-level hallucination dataset with full transformer hidden states, and five probing architectures for detecting hallucinations from internal activations — without external inference-time verification.

Read arXiv:2604.06277 ↗

In practice — LLM Lens

Working on LLM reliability and hallucination detection by studying internal transformer representations, uncertainty signals, and cross-layer behavior in modern language models.

Built PyTorch experimentation pipelines to extract hidden-state embeddings from transformer models such as LLaMA and Qwen for token-level and sequence-level hallucination analysis.
Designed cross-layer and hierarchical transformer architectures to model correlations across hidden states for better hallucination classification and uncertainty estimation.
Created scalable dataset generation workflows using Hugging Face models to synthesize prompts, responses, and labeled hallucination corpora for benchmarking.
Implemented evaluation frameworks spanning ROC-AUC, PR-AUC, F1, Expected Calibration Error, Brier score, and threshold optimization for reliability assessment.
Ran reproducible GPU-heavy experiments across CUDA and Colab A100/H100 environments while managing checkpoints, experiment logs, and training pipelines.
Co-authored the arXiv paper 'Weakly Supervised Distillation of Hallucination Signals into Transformer Representations', released on April 7, 2026.

Evaluation

ROC-AUC · PR-AUC · F1 · Expected Calibration Error · Brier score · Threshold optimization
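To make three of these metrics concrete, the sketch below computes the Brier score, a binned Expected Calibration Error, and an F1-maximizing threshold in plain Python. The function names and the toy scores are assumptions for illustration. The ECE shown is one common binary variant (equal-width bins over the predicted positive-class probability), and other definitions exist. ROC-AUC and PR-AUC are left out to keep the example short.

```python
def brier_score(probs, labels):
    """Mean squared error between predicted probabilities and 0/1 labels."""
    return sum((p - y) ** 2 for p, y in zip(probs, labels)) / len(labels)


def expected_calibration_error(probs, labels, n_bins=10):
    """Binned ECE: weighted gap between mean predicted probability and
    observed positive rate, using equal-width bins over [0, 1]."""
    bins = [[] for _ in range(n_bins)]
    for p, y in zip(probs, labels):
        idx = min(int(p * n_bins), n_bins - 1)  # p == 1.0 goes in the last bin
        bins[idx].append((p, y))
    n = len(labels)
    ece = 0.0
    for bucket in bins:
        if not bucket:
            continue
        mean_prob = sum(p for p, _ in bucket) / len(bucket)
        positive_rate = sum(y for _, y in bucket) / len(bucket)
        ece += (len(bucket) / n) * abs(mean_prob - positive_rate)
    return ece


def f1_at_threshold(probs, labels, threshold):
    """F1 of the positive class when predicting 1 for p >= threshold."""
    tp = sum(1 for p, y in zip(probs, labels) if p >= threshold and y == 1)
    fp = sum(1 for p, y in zip(probs, labels) if p >= threshold and y == 0)
    fn = sum(1 for p, y in zip(probs, labels) if p < threshold and y == 1)
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def best_f1_threshold(probs, labels):
    """Search the observed scores for the threshold with the highest F1.
    Ties resolve to the lowest such threshold."""
    candidates = sorted(set(probs))
    return max(candidates, key=lambda t: f1_at_threshold(probs, labels, t))


# Toy scores (hypothetical): predicted hallucination probabilities and labels.
probs = [0.1, 0.4, 0.35, 0.8]
labels = [0, 0, 1, 1]

threshold = best_f1_threshold(probs, labels)
print("Brier score:", brier_score(probs, labels))
print("ECE:", expected_calibration_error(probs, labels))
print("best threshold:", threshold, "F1:", f1_at_threshold(probs, labels, threshold))
```

The threshold should be chosen on a validation split and then frozen, since tuning it on the test set inflates the reported F1.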