What progress have you made since your last update?
Interpreting "goals" turned out to be out of reach, so I did what I said in the description and pivoted to studying easier LLM phenomena that build towards being able to interpret the hard things. I spent some time researching how grammatical structures are represented, and have since moved on to trying to understand how "intermediate variables" are represented and passed between layers. My current high-level direction is basically "break the big black box down into smaller black boxes, and monitor their communication".
What are your next steps?
I'm currently approaching "inter-layer interpretability" with SAE-based, circuit-style analysis. I basically want to figure out whether it's possible to do IOI-style circuit analysis (in the spirit of the indirect-object-identification work), but with SAE features at different layers as the unit of ablation. I'm also looking into how to do SAE-based ablation well (to make the results less noisy). I'm researching these questions in MATS under Neel Nanda.
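To make "SAE features as the unit of ablation" a bit more concrete, here is a minimal sketch of what zeroing out a single feature in a layer's activations could look like. This is not my actual experimental code: the `ToySAE` class, the dimensions, the feature index, and the choice to preserve the SAE's reconstruction error are all illustrative assumptions.

```python
# Minimal sketch (illustrative, not the actual research code) of ablating one
# SAE feature from a layer's activations before passing them to later layers.
import torch
import torch.nn as nn


class ToySAE(nn.Module):
    """Stand-in for a trained sparse autoencoder over one layer's activations."""

    def __init__(self, d_model: int, d_features: int):
        super().__init__()
        self.enc = nn.Linear(d_model, d_features)
        self.dec = nn.Linear(d_features, d_model)

    def encode(self, x: torch.Tensor) -> torch.Tensor:
        return torch.relu(self.enc(x))

    def decode(self, f: torch.Tensor) -> torch.Tensor:
        return self.dec(f)


def ablate_feature(sae: ToySAE, acts: torch.Tensor, feature_idx: int) -> torch.Tensor:
    """Zero out one SAE feature and return the patched activations.

    The SAE's reconstruction error (acts - recon) is kept, so the only change
    to the activations is the removal of that one feature's contribution.
    """
    feats = sae.encode(acts)
    recon = sae.decode(feats)
    error = acts - recon              # the part the SAE fails to explain; left untouched

    feats_ablated = feats.clone()
    feats_ablated[..., feature_idx] = 0.0
    recon_ablated = sae.decode(feats_ablated)

    return recon_ablated + error      # patched activations to feed into later layers


if __name__ == "__main__":
    sae = ToySAE(d_model=768, d_features=16_384)
    acts = torch.randn(1, 10, 768)    # [batch, seq, d_model] activations from some layer
    patched = ablate_feature(sae, acts, feature_idx=123)
    print(patched.shape)              # torch.Size([1, 10, 768])
```

In a real experiment the patched activations would be spliced back into the forward pass (e.g. via a hook at the relevant layer), and the change in the model's behaviour on an IOI-style task would be attributed to the ablated feature.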
Is there anything others could help you with?
If anyone reading this is interested in the things I described above, I could use collaborators! In particular, if you're somewhat new to alignment and would be interested in a setup where I throw a concrete specification for an experiment at you and you spend an afternoon coding it up, I'd be interested in talking to you.