Research

How language models fail — and how systems can respond.

My work focuses on LLM reliability: detecting hallucinations through internal transformer representations. I study whether hidden states, cross-layer behavior, and uncertainty signals can flag unreliable generations before they ever reach a user.

Detection path: Prompt → Model → Hidden states → Probe → Reliability signal

The goal is to read reliability from inside the model rather than from an external checker after the fact, at a cost low enough to run at inference time.
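As a rough illustration of the last two steps of that path (hidden states to probe to reliability signal), the sketch below fits a logistic-regression probe on feature vectors. It is not the method from the paper: the function names (`train_probe`, `reliability_signal`, `fake_hidden_state`) are hypothetical, and the "hidden states" are synthetic Gaussian vectors standing in for real activations, which would normally be extracted from a model.

```python
import math
import random


def sigmoid(z: float) -> float:
    # Numerically stable logistic function.
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def train_probe(features, labels, lr=0.1, epochs=200):
    """Fit a logistic-regression probe with full-batch gradient descent."""
    dim = len(features[0])
    n = len(features)
    w = [0.0] * dim
    b = 0.0
    for _ in range(epochs):
        grad_w = [0.0] * dim
        grad_b = 0.0
        for x, y in zip(features, labels):
            p = sigmoid(sum(wi * xi for wi, xi in zip(w, x)) + b)
            err = p - y  # gradient of the log loss w.r.t. the logit
            for i in range(dim):
                grad_w[i] += err * x[i]
            grad_b += err
        w = [wi - lr * g / n for wi, g in zip(w, grad_w)]
        b -= lr * grad_b / n
    return w, b


def reliability_signal(w, b, hidden_state):
    """Estimated probability that a generation is unreliable (label 1)."""
    return sigmoid(sum(wi * xi for wi, xi in zip(w, hidden_state)) + b)


# ASSUMPTION: synthetic vectors stand in for real hidden states.
rng = random.Random(0)


def fake_hidden_state(label, dim=8):
    # Label 1 (hallucinated) is shifted up, label 0 (faithful) shifted down.
    shift = 1.0 if label == 1 else -1.0
    return [rng.gauss(shift, 1.0) for _ in range(dim)]


labels = [i % 2 for i in range(200)]
features = [fake_hidden_state(y) for y in labels]
w, b = train_probe(features, labels)
```

At inference time a probe like this costs one dot product per scored state, which is what makes reading the signal during generation plausible; whether real activations are as cleanly separable as this toy data is exactly the open question.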

Open questions
Q1: Hidden-state probing

Whether internal activations encode hallucination signals across layers and datasets — and how to read them.

Q2: Cross-model portability

Whether reliability detectors trained on one model family transfer to another, or have to be re-learned.

Q3: Reliability controllers

Lightweight control layers that can detect unreliable output, abstain, or seek evidence during generation, before text reaches a user.
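One simple way to picture such a controller is a threshold policy over a risk score in [0, 1]. The sketch below is only an assumption about what a control layer could look like: the `Action` names, the `choose_action` function, and both threshold values are hypothetical, not taken from the research described here.

```python
from enum import Enum


class Action(Enum):
    ANSWER = "answer"                # risk is low: emit the text
    SEEK_EVIDENCE = "seek_evidence"  # risk is moderate: retrieve or verify first
    ABSTAIN = "abstain"              # risk is high: decline to answer


def choose_action(risk: float,
                  evidence_threshold: float = 0.4,
                  abstain_threshold: float = 0.8) -> Action:
    """Map a hallucination-risk score to a control action.

    ASSUMPTION: thresholds are illustrative; in practice they would be
    tuned on held-out data.
    """
    if not 0.0 <= risk <= 1.0:
        raise ValueError("risk must be a probability in [0, 1]")
    if evidence_threshold > abstain_threshold:
        raise ValueError("evidence_threshold must not exceed abstain_threshold")
    if risk >= abstain_threshold:
        return Action.ABSTAIN
    if risk >= evidence_threshold:
        return Action.SEEK_EVIDENCE
    return Action.ANSWER
```

A call such as `choose_action(0.55)` falls in the middle band and returns `Action.SEEK_EVIDENCE`. Keeping the policy separate from the detector means the thresholds can be adjusted without retraining whatever produces the risk score.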

Output
arXiv · 2026

Weakly Supervised Distillation of Hallucination Signals into Transformer Representations

A weak-supervision framework, a 15k-sample representation-level hallucination dataset with full transformer hidden states, and five probing architectures for detecting hallucinations from internal activations — without external inference-time verification.

Read: arXiv:2604.06277
In practice — LLM Lens

I work on LLM reliability and hallucination detection by studying internal transformer representations, uncertainty signals, and cross-layer behavior in modern language models.

  • Built PyTorch experimentation pipelines to extract hidden-state embeddings from transformer models such as LLaMA and Qwen for token-level and sequence-level hallucination analysis.
  • Designed cross-layer and hierarchical transformer architectures to model correlations across hidden states for better hallucination classification and uncertainty estimation.
  • Created scalable dataset generation workflows using Hugging Face models to synthesize prompts, responses, and labeled hallucination corpora for benchmarking.
  • Implemented evaluation frameworks spanning ROC-AUC, PR-AUC, F1, Expected Calibration Error, Brier score, and threshold optimization for reliability assessment.
  • Ran reproducible GPU-heavy experiments across CUDA and Colab A100/H100 environments while managing checkpoints, experiment logs, and training pipelines.
  • Co-authored the arXiv paper 'Weakly Supervised Distillation of Hallucination Signals into Transformer Representations', released on April 7, 2026.
Evaluation
ROC-AUC · PR-AUC · F1 · Expected Calibration Error · Brier score · Threshold optimization
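For readers less familiar with the calibration metrics listed here, the following is a minimal stdlib-only sketch of the Brier score, a binary Expected Calibration Error with equal-width bins, and an F1-maximizing threshold search. The function names and the choice of 10 bins are assumptions for illustration; they are not the evaluation code used in the project.

```python
def brier_score(probs, labels):
    """Mean squared difference between predicted probability and 0/1 label."""
    return sum((p - y) ** 2 for p, y in zip(probs, labels)) / len(probs)


def expected_calibration_error(probs, labels, n_bins=10):
    """Binary ECE: weighted gap between mean predicted probability and the
    observed positive rate, over equal-width probability bins."""
    bins = [[] for _ in range(n_bins)]
    for p, y in zip(probs, labels):
        idx = min(int(p * n_bins), n_bins - 1)  # p == 1.0 goes in the last bin
        bins[idx].append((p, y))
    total = len(probs)
    ece = 0.0
    for bucket in bins:
        if not bucket:
            continue
        mean_p = sum(p for p, _ in bucket) / len(bucket)
        positive_rate = sum(y for _, y in bucket) / len(bucket)
        ece += (len(bucket) / total) * abs(mean_p - positive_rate)
    return ece


def best_f1_threshold(probs, labels):
    """Try each distinct score as a threshold; return (threshold, F1) with
    the highest F1, predicting positive when p >= threshold."""
    best_t, best_f1 = 0.5, -1.0
    for t in sorted(set(probs)):
        tp = sum(1 for p, y in zip(probs, labels) if p >= t and y == 1)
        fp = sum(1 for p, y in zip(probs, labels) if p >= t and y == 0)
        fn = sum(1 for p, y in zip(probs, labels) if p < t and y == 1)
        f1 = 2 * tp / (2 * tp + fp + fn) if tp else 0.0
        if f1 > best_f1:
            best_t, best_f1 = t, f1
    return best_t, best_f1


# Toy scores: higher means "more likely hallucinated"; labels are 0/1.
example_probs = [0.1, 0.2, 0.8, 0.9]
example_labels = [0, 0, 1, 1]
example_brier = brier_score(example_probs, example_labels)
example_ece = expected_calibration_error(example_probs, example_labels)
example_threshold, example_f1 = best_f1_threshold(example_probs, example_labels)
```

Ranking metrics (ROC-AUC, PR-AUC) and calibration metrics (ECE, Brier) answer different questions: a detector can order examples well while still reporting probabilities that are too confident, so the two groups are usually reported together.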