Papers and Projects - Josh Engels

Papers

How Transparent is DiffusionGemma? arXiv, 2026. Joshua Engels, Callum McDougall, Bilal Chughtai, ..., Rohin Shah, and Neel Nanda. Paper | Blog | Twitter

Building Production-Ready Probes for Gemini. arXiv, 2026. Janos Kramar, Joshua Engels, Zhengxuan Wang, Bilal Chughtai, Rohin Shah, Neel Nanda, and Arthur Conmy. Paper | Twitter

Training on Documents About Monitoring Leads to CoT Obfuscation. arXiv, 2026. Reilly Haskins, Bilal Chughtai, and Joshua Engels. Paper | Blog | Twitter

When Reading the Chain of Thought Falls Short: A Testbed for Reasoning Trace Analysis. MI Workshop @ ICML 2026. Daria Ivanova, Riya Tyagi, Joshua Engels, and Neel Nanda. Paper | Blog

Designing Effective Monitor-Based Interventions for Mitigating Reward Hacking During RL. MI Workshop @ ICML 2026. Aria Wong, Joshua Engels, and Neel Nanda. Paper | Blog

Scaling Laws For Scalable Oversight. NeurIPS 2025 (Spotlight). Joshua Engels*, David Baek*, Subhash Kantamneni*, and Max Tegmark. Paper | Code | Twitter

Are Sparse Autoencoders Useful? A Case Study in Sparse Probing. ICML 2025. Subhash Kantamneni*, Joshua Engels*, Senthooran Rajamanoharan, Max Tegmark, and Neel Nanda. Paper | Blog | Code | Twitter

Low Rank Adapting Models for Sparse Autoencoders. ICML 2025. Mathew Chen*, Joshua Engels*, and Max Tegmark. Paper | Code | Twitter

Simple Mechanistic Explanations for Out-Of-Context Reasoning. R2-FM Workshop @ ICML 2025. Atticus Wang*, Joshua Engels*, Oliver Clive-Griffin*, Senthooran Rajamanoharan, and Neel Nanda. Paper

Dense SAE Latents Are Features, Not Bugs. NeurIPS 2025. Xiaoqing Sun, Alessandro Stolfo, Joshua Engels, Ben Wu, Senthooran Rajamanoharan, Mrinmaya Sachan, and Max Tegmark. Paper

Decomposing the Dark Matter of Sparse Autoencoders. TMLR 2025. Joshua Engels, Logan Smith, and Max Tegmark. Paper | Code | Twitter

The Geometry of Concepts: Sparse Autoencoder Feature Structure. Entropy 2025. Yuxiao Li, Eric J. Michaud, David D. Baek, Joshua Engels, Xiaoqing Sun, and Max Tegmark. Paper

Efficient Dictionary Learning with Switch Sparse Autoencoders. ICLR 2025. Anish Mudide, Joshua Engels, Eric J. Michaud, Max Tegmark, and Christian Schroeder de Witt. Paper | Code | Twitter

Not All Language Model Features Are Linear. ICLR 2025. Joshua Engels, Eric J. Michaud, Isaac Liao, Wes Gurnee, and Max Tegmark. Paper | Code | Twitter | Talk

* indicates equal contribution

Talks

Developing Areas of Alignment Science — June 2026

A Pragmatic Vision for Interpretability — December 2025

SAEs: Progress and Limitations — March 2025

Not All Language Model Features Are Linear — December 2024

Other Projects and Writing

LLM-Driven Feature Discovery — June 2026

Why Do Naive SFT Filters For Safety Properties Fail? — June 2026

SFT Drives Gemini's Safety Properties — June 2026

Building and Evaluating Model Diffing Agents — June 2026

Thought Editing: Steering Models by Editing Their Chain of Thought — February 2026

Brief Explorations in LLM Value Rankings — January 2026

Can We Interpret Latent Reasoning Using Current Mechanistic Interpretability Tools? — December 2025

Prompting Models to Obfuscate Their CoT — December 2025

How Can Interpretability Researchers Help AGI Go Well? — December 2025

A Pragmatic Vision for Interpretability — December 2025

Current LLMs Seem to Rarely Detect CoT Tampering — November 2025

Negative Results on Group SAEs — May 2025

Interim Research Report: Mechanisms of Awareness — May 2025

Spreadsheet of 50 Weird LLM Phenomenon — May 2025

TinySAE: A Minimal SAE Implementation — March 2025

Algorithms Papers (pre 2025)

Approximate Nearest Neighbor Search with Window Filters. ICML 2024. Joshua Engels, Benjamin Landrum, Shangdi Yu, Laxman Dhulipala, and Julian Shun. Paper | Code

DESSERT: An Efficient Algorithm for Vector Set Search with Vector Set Queries. NeurIPS 2023. Joshua Engels, Benjamin Coleman, Vihan Lakshman, and Anshumali Shrivastava. Paper | Code

Practical Near Neighbor Search via Group Testing. NeurIPS 2021 (Spotlight). Joshua Engels, Benjamin Coleman, and Anshumali Shrivastava. Paper | Code