In more detail, with respect to the concrete points above:
Completing the analysis of phase transitions and associated structure formation in the Toy Models of Superposition (preliminary work reported in the SLT & Alignment summit’s SLT High 4 lecture). See Chen et al. (2023).
Performing a similar analysis for the Induction Heads paper. See Hoogland et al. (2024).
For diverse models that are known to contain structure/circuits, we will attempt to:
detect phase transitions (using a range of metrics, including train and test losses, RLCT and singular fluctuation),
classify weights at each transition into state & control variables,
perform mechanistic interpretability analyses at these transitions,
compare these analyses to MechInterp structures found at the end of training.
Classifying transitions into state & control variables remains to be done in the next few months. We have performed some mechanistic/structural analysis, and more of this kind of analysis is currently underway.
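As a rough illustration of the detection step in the list above, the sketch below shortlists checkpoints where a logged metric (for example test loss, or an RLCT estimate) changes unusually fast. It is a hypothetical heuristic, not the procedure used in the work described here: the function name `candidate_transitions` and the parameters `window` and `z_thresh` are assumptions made for this example, and a flagged checkpoint is only a candidate that would still need the mechanistic analysis described above.

```python
import numpy as np


def candidate_transitions(values, window=5, z_thresh=4.0):
    """Return checkpoint indices where a metric changes unusually fast.

    Illustrative heuristic (an assumption, not the authors' method):
    smooth the series, take step-to-step differences, and flag
    differences that are outliers under a robust z-score.
    """
    values = np.asarray(values, dtype=float)

    # Moving average; mode="valid" keeps only fully covered windows.
    kernel = np.ones(window) / window
    smooth = np.convolve(values, kernel, mode="valid")

    # Step-to-step change of the smoothed metric.
    delta = np.diff(smooth)

    # Robust scale via the median absolute deviation (MAD), so that the
    # rare large jumps do not inflate the scale they are measured against.
    med = np.median(delta)
    mad = np.median(np.abs(delta - med)) + 1e-12
    score = np.abs(delta - med) / (1.4826 * mad)

    # Shift so flagged indices are centred on the original checkpoints.
    offset = window // 2 + 1
    return np.flatnonzero(score > z_thresh) + offset


# Synthetic example: a slowly decaying loss with a sudden drop at step 100.
steps = np.arange(200)
loss = 1.0 - 0.001 * steps - 0.5 * (steps >= 100)
flagged = candidate_transitions(loss)
print(flagged)
```

The same function can be run on each metric separately; checkpoints flagged by several metrics at once are natural places to start looking for structure formation. The threshold and window are tuning choices and would need adjusting for noisy training curves.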