I would like to scale up experiments in training process transparency in order to better understand the formation of various mechanisms in language models. This work involves training small- to medium-scale transformer models and analyzing gradients via training data attribution throughout the training process. I believe the resulting insights will inform new directions in mechanistic interpretability and in detecting precursors of deception, particularly in instances where this is hard or impossible on fully trained models.
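For concreteness, the sketch below illustrates the general flavor of gradient-based training data attribution across checkpoints: a TracIn-style inner product between per-example training gradients and the gradient of a probe loss, accumulated over checkpoints saved during training. This is a minimal illustration under assumed interfaces, not the project's actual tooling; `load_checkpoint`, `loss_fn`, `probe_batch`, and `candidate_examples` are hypothetical placeholders.

```python
import torch

def example_grad(model, loss_fn, batch):
    """Flattened gradient of the loss on `batch` with respect to model parameters."""
    params = [p for p in model.parameters() if p.requires_grad]
    loss = loss_fn(model, batch)  # assumed to return a differentiable scalar
    grads = torch.autograd.grad(loss, params)
    return torch.cat([g.reshape(-1) for g in grads])

def tracin_scores(checkpoints, loss_fn, probe_batch, candidate_examples, lr=1e-3):
    """Approximate each candidate training example's influence on the probe loss,
    summed over training checkpoints (TracIn-style gradient inner products)."""
    scores = torch.zeros(len(candidate_examples))
    for ckpt in checkpoints:
        model = load_checkpoint(ckpt)  # hypothetical: restore model weights at this step
        probe_grad = example_grad(model, loss_fn, probe_batch)
        for i, example in enumerate(candidate_examples):
            scores[i] += lr * torch.dot(probe_grad, example_grad(model, loss_fn, example))
    return scores
```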
Key Activities:
Research training process transparency and the formation of mechanisms along training trajectories of language models
Release research outputs and open-source tooling for performing similar experiments
Key Reasons:
Mechanistic interpretability (MI) is making decent headway on producing a full picture of model cognition. However, good outcomes for AI alignment are predicated on:
Addressing failure modes such as deception and deceptive alignment, where the structure of the problematic cognition is adversarial to the designer. MI on final model snapshots may overlook relevant mechanisms, particularly if they are subtle internally, hard to access through computational means, and/or only exhibited out of the training distribution. Training process transparency can capture these mechanisms as they form, before they become difficult to detect.
Coverage of detectable mechanisms prior to deployment. Training process transparency can increase coverage of where to look for problematic mechanisms, helping ensure there are no unaccounted-for mechanisms.
Tooling/Open-Source: Producing open-source packages and research outputs will increase innovation speed on these lines of research.
Compute: Given that this research agenda focuses on analyzing shifts throughout the full training process, employing existing trained models and relatively sparse checkpoints is insufficient.
The goals of this project are to better understand how mechanisms form within language models, including:
Induction heads and circuits employing them
I have already reproduced induction head formation and am in the process of analyzing these heads to attribute the training data responsible for their formation (a sketch of one way to score induction behavior appears after this list).
This exercise acts as a good proof-of-concept for the methodology, as induction heads are a mechanism known to reliably occur at particular scales, but their etiology is as yet poorly understood.
IOI or similar complexity circuits
Novel circuits particularly amenable to discovery in light of the training process
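As an illustration of how induction head formation can be tracked along a training run, here is a minimal sketch of an induction score on repeated random token sequences: an induction head at a query position in the second half of a repeated sequence should attend to the token that followed the previous occurrence of the current token. The `get_attention_patterns` helper is a hypothetical hook (any framework that exposes attention patterns via activation caching would do); this is not the project's actual metric code.

```python
import torch

def induction_scores(model, get_attention_patterns, vocab_size, seq_len=50, batch=8):
    """Mean attention mass each head places on the induction target position,
    measured on random token sequences repeated once."""
    first_half = torch.randint(0, vocab_size, (batch, seq_len))
    tokens = torch.cat([first_half, first_half], dim=1)  # [batch, 2 * seq_len]
    # Hypothetical helper: returns attention patterns [batch, n_layers, n_heads, query, key].
    patterns = get_attention_patterns(model, tokens)
    # For a query at position t in the second half, the token after the previous
    # occurrence of the current token sits at position t - (seq_len - 1).
    q_idx = torch.arange(seq_len, 2 * seq_len)
    k_idx = q_idx - (seq_len - 1)
    induction_mass = patterns[:, :, :, q_idx, k_idx]  # [batch, n_layers, n_heads, seq_len]
    return induction_mass.mean(dim=(0, 3))            # [n_layers, n_heads]
```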
The impact of this work will be published research packages and analyses of toy models that improve the state of knowledge around mechanism formation in transformer-based models.
I expect this research to last at most 2-3 months before re-evaluation of the research direction, and am seeking this funding for that time period.
This funding will be used for compute and infrastructure costs associated with the project.
A100 GPU at $3/hr × ~20 hours per average training run = $60 per toy model
+ $20 for average additional post-processing and analysis (ETL on S3 blobs using AWS infra)
E.g., an A4000 GPU at $0.70/hr for 10 hours per day = $7 to analyze a given result set
I may substitute the exact GPU type according to the experiment, but currently have not had a need for higher than A100s.
Standard precautions will be taken to avoid overspending on compute and infrastructure costs
GPU instances will only be active during training runs
GPUs will be sized appropriately to the training task, for example ensuring that the context length and batch size maximally utilize the available GPU memory
This funding will enable me to run 50-60 experiments that require individual training runs. If initial experiments are successful, I may opt to execute larger training runs on multi-GPU instances or multi-node setups, in a way that keeps per-experiment costs down.
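As a back-of-the-envelope check on the figures above (treating the A4000 analysis example as part of the $20 post-processing average), the per-experiment and total costs combine as follows; the 50-60 experiment range is the one quoted above.

```python
# Rough budget arithmetic using only the per-run figures quoted above.
a100_rate_usd_per_hr = 3.0
train_hours_per_run = 20
training_cost = a100_rate_usd_per_hr * train_hours_per_run   # $60 per toy model
postprocessing_cost = 20                                      # average ETL / analysis per run
per_experiment = training_cost + postprocessing_cost          # $80 per experiment

for n in (50, 60):
    print(f"{n} experiments -> ${n * per_experiment:,.0f}")
# 50 experiments -> $4,000; 60 experiments -> $4,800
```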
I am conducting this research as an independent researcher. I completed the SERI MATS program with a focus on how to detect deception, and have published prior results such as this example on detecting various monosemantic and polysemantic MLP neurons using training process transparency:
https://www.alignmentforum.org/posts/DtkA5jysFZGv7W4qP/training-process-transparency-through-gradient
I also have unpublished results related to attention-based mechanisms at the parameter level (e.g. such as parenthesis matching) which I plan to consolidate into the research outputs of this project.
This project can fail due to:
Fundamental limitations of the methodology: Despite existing results on identifying functionality of MLP neurons, the techniques might not be applicable to functionality of attention heads or circuits.
This is unlikely to be the case as I already have some preliminary results on attribution of attention head mechanisms.
Inability to replicate existing circuits: The first step of the analysis is to replicate known mechanisms in a toy model, such as induction heads, IOI, and other behaviors/circuits.
This is unlikely to be the case as I have already reproduced induction head formation, and expect to reproduce more complex known behaviors with additional training time.
Difficulty in discovering novel circuits: While I expect training data attribution to simplify the discovery of novel mechanisms and accompanying circuits, this may not be feasible in the requested timeframe (2-3 months) at this scale of toy language model training runs.
This project does not currently have any other funding.
Austin Chen
12 months ago
Approving this as it falls within our purview of technical AI safety research. Best of luck with your research, Robert!
Robert Krzyzanowski
12 months ago
I have identified an engineering bottleneck in scaling this approach that I am currently working through. I am going to provisionally accept the project, but may return the funds if I am unable to address this bottleneck within the project timeline.
Evan Hubinger
about 1 year ago
I am excited about more work in the realm of training transparency, and I know that Rob is capable of executing here from having mentored him previously.
The main way I could imagine this being a bad idea is if it's not a good use of Rob's time, but I'll defer to his judgement there.
I'd likely be willing to fund more than 5k, but I'll cover the full expenses being requested for now.
Rob was a mentee of mine in the SERI MATS program.