Current
SPD
Stochastic Parameter Decomposition: decomposing a network's parameters into interpretable components
Attention Motifs
Understanding attention heads by organizing them according to the attention patterns they produce
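A minimal sketch of the organizing idea, assuming the simplest version: collect each head's attention pattern on a fixed probe input and cluster heads whose patterns look alike. The toy random-weight heads, the k-means clustering, and the cluster count are illustrative assumptions, not the project's actual method.

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
n_heads, seq_len, d_head = 8, 16, 32

def attention_pattern(q, k):
    """softmax(Q K^T / sqrt(d_head)) for one head: a (seq_len, seq_len) pattern."""
    scores = q @ k.T / np.sqrt(q.shape[-1])
    scores = np.exp(scores - scores.max(axis=-1, keepdims=True))
    return scores / scores.sum(axis=-1, keepdims=True)

# Stand-in for patterns extracted from a real model on a fixed probe prompt.
patterns = np.stack([
    attention_pattern(rng.normal(size=(seq_len, d_head)),
                      rng.normal(size=(seq_len, d_head)))
    for _ in range(n_heads)
])

# Group heads whose patterns look alike (e.g. local, diffuse, previous-token-like).
labels = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(
    patterns.reshape(n_heads, -1))
for head, label in enumerate(labels):
    print(f"head {head}: motif cluster {label}")
```

In practice the patterns would come from a trained model, and a richer similarity measure (for example, matching against known motifs such as previous-token or induction patterns) could replace plain k-means.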
Cyborg Language Models
Replacing LLM components with explicitly programmed features for interpretability
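A rough sketch of the replacement step, assuming the simplest mechanism: a PyTorch forward hook that discards one module's learned activations and substitutes a hand-written feature function. The toy network, the chosen layer, and programmed_feature are hypothetical stand-ins for an actual LLM component.

```python
import torch
import torch.nn as nn

model = nn.Sequential(
    nn.Linear(8, 16),   # stand-in for the upstream part of an LLM block
    nn.ReLU(),
    nn.Linear(16, 8),   # stand-in for the downstream part
)

def programmed_feature(x):
    """Explicit, human-readable replacement: activates when the mean input is positive."""
    return torch.relu(x.mean(dim=-1, keepdim=True)).expand(-1, 16)

def swap_in_program(module, inputs, output):
    # Forward hook: ignore the learned activations and return the programmed ones.
    return programmed_feature(inputs[0])

handle = model[1].register_forward_hook(swap_in_program)
x = torch.randn(4, 8)
print("cyborg output shape:", model(x).shape)   # the rest of the network runs unchanged
handle.remove()
```

The point of the swap is that downstream behaviour can then be compared with and without the explicit program in place, so the programmed feature serves as a testable hypothesis about what the original component computes.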
Past research
Understanding Search in Transformers
Mechanistic interpretability of transformer networks trained on toy spatial tasks (mazes)
Inverse Scaling
Resistance to prompt injection may be an example of a task that exhibits inverse scaling in large transformer networks
CE-learn
Studying learning in the C. elegans nematode
RL-dreams
Value-function robustness for world-model RL, probed by giving the agent weird dreams
BNBP
Training Hodgkin-Huxley neural networks with SGD
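A hedged sketch of the training idea, assuming a much-simplified membrane model: a leak-only differentiable neuron simulated with explicit Euler, whose conductance is fitted to a target voltage trace by gradient descent. The full Hodgkin-Huxley gating variables are omitted, and the constants (time step, injected current, true g_leak) are illustrative.

```python
import torch
import torch.nn.functional as F

dt, steps = 0.1, 200
I_ext = 2.0                                   # constant injected current

def simulate(g_leak, E_leak=-54.0, C=1.0, V0=-65.0):
    """Explicit-Euler rollout of a leak-only membrane equation (differentiable)."""
    V, trace = torch.tensor(V0), []
    for _ in range(steps):
        V = V + dt * (I_ext - g_leak * (V - E_leak)) / C
        trace.append(V)
    return torch.stack(trace)

target = simulate(torch.tensor(0.3)).detach()      # "recorded" voltage trace to fit

raw_g = torch.tensor(-2.25, requires_grad=True)    # softplus(raw_g) is roughly 0.1
opt = torch.optim.Adam([raw_g], lr=0.05)
for step in range(300):
    opt.zero_grad()
    loss = F.mse_loss(simulate(F.softplus(raw_g)), target)
    loss.backward()                                # gradients flow through the ODE rollout
    opt.step()

print(f"fitted g_leak = {F.softplus(raw_g).item():.3f}  (target 0.3)")
```

The softplus reparameterization keeps the conductance positive during optimization; the same backpropagation-through-simulation pattern extends to models with gating dynamics.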
ML for medical imaging
Segmentation of CT images using a U-Net architecture
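A miniature sketch of the architecture, assuming a two-level U-Net with a single skip connection and a one-channel logit head; the channel widths, depth, and the random tensor standing in for a CT slice are illustrative assumptions.

```python
import torch
import torch.nn as nn

def block(c_in, c_out):
    return nn.Sequential(nn.Conv2d(c_in, c_out, 3, padding=1), nn.ReLU(),
                         nn.Conv2d(c_out, c_out, 3, padding=1), nn.ReLU())

class TinyUNet(nn.Module):
    def __init__(self):
        super().__init__()
        self.enc = block(1, 16)                     # encoder at full resolution
        self.down = nn.MaxPool2d(2)
        self.mid = block(16, 32)                    # bottleneck at half resolution
        self.up = nn.ConvTranspose2d(32, 16, 2, stride=2)
        self.dec = block(32, 16)                    # 16 skip channels + 16 upsampled
        self.head = nn.Conv2d(16, 1, 1)             # per-pixel logit for the mask

    def forward(self, x):
        e = self.enc(x)
        m = self.mid(self.down(e))
        u = self.up(m)
        return self.head(self.dec(torch.cat([e, u], dim=1)))

ct_slice = torch.randn(1, 1, 64, 64)                # stand-in for a normalized CT slice
logits = TinyUNet()(ct_slice)
print(logits.shape)                                 # (1, 1, 64, 64) segmentation logits
```

A real pipeline would stack more encoder/decoder levels and train the logits against expert-drawn masks with a cross-entropy or Dice-style loss.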