Cyborg Language Models

The “win condition” of mechanistic interpretability research is sometimes described as replacing most or all of a large language model with explicitly programmed components. Yet little to no research in this direction seems to be happening. This project explores, through several motivating examples, why it might finally be time to start.

The core idea: instead of making the model learn everything from scratch, we can provide explicit features as additional embedding “layers” that give the model information it would otherwise have to infer. This makes the model’s job easier and gives us a more interpretable model in the process.
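To make the idea concrete, here is a minimal sketch (not taken from the project itself) of one way an explicit-feature embedding “layer” could be wired in: a separate embedding table for hand-computed per-token features, whose output is simply added to the learned token embeddings before the transformer sees them. The class name `CyborgEmbedding`, the choice of part-of-speech tags as the explicit feature, and all dimensions are hypothetical illustrations.

```python
import torch
import torch.nn as nn

class CyborgEmbedding(nn.Module):
    """Token embeddings augmented with an explicit-feature embedding.

    The explicit features (here, hypothetically, part-of-speech tags or any
    other hand-computed per-token annotation) are embedded separately and
    added to the learned token embeddings, so the model receives that
    information directly instead of having to infer it.
    """

    def __init__(self, vocab_size: int, num_features: int, d_model: int):
        super().__init__()
        self.token_emb = nn.Embedding(vocab_size, d_model)
        self.feature_emb = nn.Embedding(num_features, d_model)

    def forward(self, token_ids: torch.Tensor, feature_ids: torch.Tensor) -> torch.Tensor:
        # token_ids, feature_ids: (batch, seq_len) integer tensors
        return self.token_emb(token_ids) + self.feature_emb(feature_ids)


# Example usage: token ids plus precomputed per-token feature ids.
emb = CyborgEmbedding(vocab_size=50_000, num_features=32, d_model=512)
tokens = torch.randint(0, 50_000, (2, 16))
pos_tags = torch.randint(0, 32, (2, 16))
x = emb(tokens, pos_tags)  # shape (2, 16, 512), fed to the transformer as usual
```

Under this sketch, the rest of the network is unchanged; the explicit features only enter through the summed embedding, which keeps them easy to locate and ablate when interpreting the model.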


This work is ongoing and not yet public; please reach out if you’re interested in collaborating or would like a work-in-progress draft!