Understanding Search in Transformers
Previously, I was at Conjecture, working with Janus and Nicholas Kees on mechanistic interpretability for LLMs. Now, as part of my thesis and my role as a research lead for AI Safety Camp and UnSearch, I’m continuing this work, focusing instead on toy models trained on spatial tasks. The goal is to find, understand, and re-target the search process implemented internally in transformer networks.
Please see github.com/understanding-search and unsearch.org for the latest updates.
Some work we’ve put out so far:
- (coming soon – ICLR submission)
- Structured World Representations in Maze-Solving Transformers: arXiv, Code
- maze-dataset, our library for generating and working with maze-solving datasets: paper (a minimal usage sketch follows this list)
- Research Intuitions post
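
To give a sense of what maze-dataset provides, here is a minimal sketch of generating a small dataset of mazes. This is based on my recollection of the library’s top-level API (`MazeDatasetConfig`, `LatticeMazeGenerators`, `MazeDataset.from_config`, `as_ascii`); treat the exact names and parameters as assumptions to verify against github.com/understanding-search/maze-dataset.

```python
# Minimal sketch: generate a few DFS-generated mazes with maze-dataset.
# Import paths and parameter names are assumptions; check the repo's README.
from maze_dataset import MazeDataset, MazeDatasetConfig, LatticeMazeGenerators

cfg = MazeDatasetConfig(
    name="demo",                              # dataset name, used for caching
    grid_n=5,                                 # mazes on a 5x5 lattice
    n_mazes=4,                                # number of mazes to generate
    maze_ctor=LatticeMazeGenerators.gen_dfs,  # depth-first-search generation
)

dataset = MazeDataset.from_config(cfg)
for maze in dataset:
    print(maze.as_ascii())                    # ASCII rendering of each maze
```

The config object pins down the generation algorithm and grid size, which is what makes datasets reproducible and cacheable across experiments.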
Though not directly related to mechinterp for transformers, we also did some work using maze-dataset on implicit networks: arxiv.org/abs/2410.03020