Understanding Search in Transformers
Previously, I was at Conjecture, working with Janus and Nicholas Kees on mechanistic interpretability for LLMs. As part of my thesis work and my role as a research lead for AI Safety Camp and UnSearch, we continued work on toy models trained on spatial tasks (in particular, mazes).
Some relevant outputs:
- maze-dataset, a package for generating and working with maze datasets, providing a wide range of output formats suitable for anything from VLMs to autoregressive text models (arXiv paper, JOSS version); see the usage sketch after this list
- Research Intuitions post
- Structured World Representations in Maze-Solving Transformers: arXiv, Code
- Transformers Use Causal World Models in Maze-Solving Tasks
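For context, here is a minimal sketch of generating a small maze dataset with maze-dataset. It follows the usage pattern from the package's documentation as I understand it; class and parameter names (MazeDatasetConfig, grid_n, maze_ctor, LatticeMazeGenerators.gen_dfs) may have shifted between versions, so treat this as illustrative rather than authoritative.

```python
from maze_dataset import MazeDataset, MazeDatasetConfig
from maze_dataset.generation import LatticeMazeGenerators

# configure a small dataset: 5x5 lattice mazes generated via randomized DFS
cfg = MazeDatasetConfig(
    name="demo",
    grid_n=5,
    n_mazes=4,
    maze_ctor=LatticeMazeGenerators.gen_dfs,
)

# generate (or load from cache) the dataset of solved mazes
dataset = MazeDataset.from_config(cfg)

# each element is a maze paired with its solution path;
# render one in the ASCII output format as a quick sanity check
print(dataset[0].as_ascii())
```

The same dataset object can then be tokenized for autoregressive text models or rasterized for vision models, which is the range of output formats mentioned above.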
Please see github.com/understanding-search and unsearch.org for updates.
Not directly related to mechinterp for transformers, but we also used maze-dataset in some work on implicit networks: arxiv.org/abs/2410.03020