Attention Motifs
A lot of mechanistic interpretability work on transformers, particularly classical circuits work, involves finding heads that are causally involved in a given task and then staring at the patterns they produce to try to understand what the heads are doing. Wouldn’t it be nice if there were a systematic way to organize attention heads across many models based on the patterns they produce?
I spent some time trying to develop a method to do exactly that. I found many, many clever approaches that did not work: treating the attention patterns as absorbing Markov chains and studying the dynamics of time to absorption, some topological data analysis methods, and so on. What ended up working was computing a lot of fairly straightforward statistical features of the patterns, their Gram matrices, and a few other things, and then running a PCA.
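To make that pipeline concrete, here is a minimal sketch. The specific features below (entropy, diagonal and previous-token mass, first-token mass, Gram-matrix summaries) are illustrative stand-ins of my choosing, not the exact feature set used in the project, and the random patterns are placeholders for real attention matrices.

```python
import numpy as np
from sklearn.decomposition import PCA

def head_features(attn: np.ndarray) -> np.ndarray:
    """attn: (seq_len, seq_len) row-stochastic attention pattern for one head.
    Returns a small vector of illustrative summary statistics."""
    seq_len = attn.shape[0]
    # Entropy of each query's attention distribution (how diffuse the head is).
    entropy = -(attn * np.log(attn + 1e-12)).sum(axis=-1).mean()
    # Average attention mass on the current token and on the previous token.
    diag_mass = np.trace(attn) / seq_len
    prev_mass = np.trace(attn, offset=-1) / max(seq_len - 1, 1)
    # Mass on the first token (a common "null" position).
    first_mass = attn[:, 0].mean()
    # Gram matrix of the pattern: how similar different queries' rows are.
    gram = attn @ attn.T
    gram_mean = gram.mean()
    gram_top_eig = np.linalg.eigvalsh(gram)[-1]
    return np.array([entropy, diag_mass, prev_mass, first_mass,
                     gram_mean, gram_top_eig])

# Stack features for many heads (random stand-ins for real patterns),
# then project the heads into a low-dimensional space with PCA.
rng = np.random.default_rng(0)
patterns = rng.random((100, 64, 64))
patterns /= patterns.sum(axis=-1, keepdims=True)  # make rows sum to 1
features = np.stack([head_features(p) for p in patterns])
coords = PCA(n_components=2).fit_transform(features)
print(coords.shape)  # (100, 2): one point per head, ready to cluster or plot
```

With features like these in hand, heads that produce similar patterns land near each other in the PCA space, which is what makes organizing them across models possible.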
This work comes with a web tool: attention-motifs.github.io
If you have a head of interest, you can input it there to see the kinds of patterns it produces, heads across models that produce similar patterns, and whether any of those heads belong to known classes.
This work is ongoing – please shoot me an email if you have any questions or ideas, or if you have a known class of heads you’d like to add to the database!