Understanding Search in Transformers
Previously, I was at Conjecture, working with Janus and Nicholas Kees on mechanistic interpretability for LLMs. As part of my thesis work and my role as a research lead for AI Safety Camp and UnSearch, we continued work on toy models trained on spatial tasks (in particular, mazes).
Some relevant outputs:
- maze-dataset, a package for generating and working with maze datasets, providing a wide range of output formats suitable for anything from VLMs to autoregressive text models (arXiv paper, JOSS version); see the usage sketch after this list
- Research Intuitions post
- Structured World Representations in Maze-Solving Transformers: arXiv, Code
- Transformers Use Causal World Models in Maze-Solving Tasks
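For context, here is a minimal sketch of generating a small maze dataset with maze-dataset. It follows the usage pattern from the package's documentation as I understand it; class and parameter names (MazeDatasetConfig, grid_n, maze_ctor, LatticeMazeGenerators.gen_dfs) may have shifted between versions, so treat this as illustrative rather than authoritative.

```python
from maze_dataset import MazeDataset, MazeDatasetConfig
from maze_dataset.generation import LatticeMazeGenerators

# configure a small dataset: 5x5 lattice mazes generated via randomized DFS
cfg = MazeDatasetConfig(
    name="demo",
    grid_n=5,
    n_mazes=4,
    maze_ctor=LatticeMazeGenerators.gen_dfs,
)

# generate (or load from cache) the dataset of solved mazes
dataset = MazeDataset.from_config(cfg)

# each element is a maze paired with its solution path;
# render one in the ASCII output format as a quick sanity check
print(dataset[0].as_ascii())
```

The same dataset object can then be tokenized for autoregressive text models or rasterized for vision models, which is the range of output formats mentioned above.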
Please see github.com/understanding-search and unsearch.org for updates.
Not directly related to mechinterp for transformers, but we also used maze-dataset in some work on implicit networks: arxiv.org/abs/2410.03020