Understanding Search in Transformers

Previously, I was at Conjecture, working with Janus and Nicholas Kees on mechanistic interpretability for LLMs. Now, as part of my thesis and my role as a research lead for AI Safety Camp and UnSearch, I'm continuing this work, focusing more on toy models trained on spatial tasks. The goal is to find, understand, and re-target the search process that transformer networks implement internally.
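To make "spatial tasks" concrete, here is an assumed toy example (not our actual training setup): a model trained to output shortest paths in grid mazes must implement something functionally similar to a classical search algorithm internally. The classical version of that computation is breadth-first search:

```python
from collections import deque

def shortest_path(walls, start, goal, n):
    """BFS shortest path on an n x n grid; walls is a set of blocked (row, col) cells."""
    queue = deque([start])
    parent = {start: None}  # also serves as the visited set
    while queue:
        cell = queue.popleft()
        if cell == goal:
            # reconstruct the path by walking parent pointers back to start
            path = []
            while cell is not None:
                path.append(cell)
                cell = parent[cell]
            return path[::-1]
        r, c = cell
        for nr, nc in ((r + 1, c), (r - 1, c), (r, c + 1), (r, c - 1)):
            if 0 <= nr < n and 0 <= nc < n and (nr, nc) not in walls and (nr, nc) not in parent:
                parent[(nr, nc)] = cell
                queue.append((nr, nc))
    return None  # goal unreachable

# example: 3x3 grid with the center cell blocked
print(shortest_path({(1, 1)}, (0, 0), (2, 2), 3))
```

The interpretability question is then whether (and how) a transformer trained on such input/output pairs represents anything like the frontier, visited set, or backtracking step of this algorithm.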

Please see github.com/understanding-search and unsearch.org for the latest updates.

Some work we’ve put out so far:

Not directly related to mechinterp for transformers, but we have also used maze-dataset in some work on implicit networks: arxiv.org/abs/2410.03020