Exploring novel research directions in prosaic AI alignment

Lawrence Chan

CompleteGrant

$30,000raised

Project summary

TL;DR: I am planning to independently explore various technical research directions in AI Safety, which will involve literature reviews, small-scale experiments, and writing on different aspects of the alignment problem. I’m currently working on a few directions in adversarial training. I'll probably work on this for 3 months FTE at a rate of approximately $10,000 per month, totaling around $30,000. I might continue this if it goes well.

Motivation and longer project description:

In my time as an AIS grantmaker, I’ve found that there really isn’t much diversity in the projects people are working on – the vast majority work on some variant of mech interp and LM jailbreaks/redteaming/evals, with the remainder mainly working on a few conceptual or theory projects such as Singular Learning Theory or Boundaries. I’m quite sure that there are other underexplored areas of AIS, and so I’ve decided to leave my position at Evals to work on finding new areas. From Neel’s experiences with the grokking mech interp work (and from working with him on said project), it seems that “think hard on your own about a project while doing a bunch of toy experiments, then turn your preliminary results into a few academic papers” can be a promising path to creating and popularizing new research directions. I’m hoping to recreate the success the community had with mech interp in 2022, but in another area of research.

This month I’ve been focused on looking into adversarial training stuff, because it seems like an area that was of great interest even ~2 years ago but was pretty much dropped after. Since I started ~3 weeks ago, I’ve done a literature review of the Latent Adversarial Training space (annotated bibliography here), have read up on the old “adversarial examples as bugs not features” debate (mainly as a source of toy experiments), have solicited a few semi-formal problems in this area that I’ve been thinking about, and have started writing up thoughts on the theory of adversarial training/examples. I’m currently a bit disappointed about the state of academic AT as it relates to using AT for AGI Safety, so it seems that I’ll probably do less lit reviews and more toy experiments and independent writing/thinking going forward. Here's a few of the research directions I’m currently looking into. Note that these are currently more selected for tractability instead of importance: (I’m doing this because I think it’s valuable to get real-world feedback on related topics while working on more conceptual problems.)

Applying mech interp via sparse autoencoders to datasets from the “adversarial examples as bugs not features” debate (and the subsequent discussion on Distill), to see if they can differentiate robust and non-robust features.
Testing the limits of latent adversarial training in toy examples, to better understand when it makes sense to relax AT by allowing for internal perturbations.
Exploring non-input space, model-internals-based adversarial training methods on Redwood Research’s toy text examples from 2022.

Ultimately, I aim to produce several literature reviews, 2-3 blog posts with my perspectives on AI alignment, some toy problems with preliminary results, and potentially a CFP, depending on the quality of these results. (I think if I don’t have at least 2 literature reviews, a blog post, and a toy problem with preliminary results, I would consider this effort a failure.)

How will this funding be used?

The majority of it will be spent on my salary + various taxes and fees, including paying for access to a coworking space.

Who is on your team and what's your track record on similar projects?

It's just me working on it. I've been involved in many conceptual formalization projects and have been involved with/written several papers, including in the past as a PhD Student/independent researcher or otherwise with minimal outside support or supervision.

What are the most likely causes and outcomes if this project fails? (premortem)

The primary concern is that most independent researchers don’t do super well in terms of productivity. However, I’ve had some luck doing this in late 2022 (when extending Neel’s grokking results and getting it published), I’ve set up accountability schemes with the MATS Scholar Support team, and I will find collaborators at FAR AI, Redwood, and various PhD programs. It also seems plausible that I can get mentorship from my PhD advisors or other senior researchers once I have preliminary results. So I think the risk of a complete burn out/failure is quite low.

That being said, a secondary concern is that the areas I'm looking at or the projects I'm considering may be completely unpromising. I think this might still be an interesting result nonetheless, and I'd be surprised if I couldn't find any projects I'd be excited to work on.

There's a decent chance I'll be working/continue working part time in other roles over the next few months, so the actual project deadline will be in ~6 months, but I intend to spend the equivalent of 3 months FTE on this project or other comparable independent research efforts.

What other funding are you or your project getting?

None: I'm currently funding this effort out of pocket, with the goal of getting retroactive funding assuming things go well. At lower funding amounts I'm likely to apply for retroactive funding for a portion of the work.

Lawrence Chan

about 1 year ago

Final report

Description of subprojects and results, including major changes from the original proposal

See my progress update here: https://manifund.org//projects/exploring-novel-research-directions-in-prosaic-ai-alignment?tab=comments#796ba4d9-f6e7-441f-9858-db4ce03a56a2

Spending breakdown

$30k -- salary + taxes

(My compute spend was provided by Constellation and FAR AI.)

Lawrence Chan

about 1 year ago

Progress update

What progress have you made since your last update?

(Un)fortunately, a lot of the research areas I was interested in exploring have become substantially more mainstream since I wrote the research proposal. For example, Stephen Casper and collaborators have put out their latent adversarial training paper, FAR has completed their work on adversarial example/training scaling laws for transformers, and many at Anthropic and other labs are investing significant amounts of time and resources into adversarial training and related areas.

Instead, I've done work trying to lay the theoretical framework behind ideas in mechanistic interpretability. While there's been a lot of incremental empirical work on improving SAEs or applying SAEs to larger models, there's many theoretical questions in interp that are much more neglected. Specifically, over the course of the grant, I've worked on and completed the following two projects:

Compact Proofs and Mechanistic Interp: There's been a small but steady amount of ongoing discussion on methods to evaluate circuits or mechanistic explanations in general. On one end, we have sampling-based methods like causal scrubbing, and on the other end, we have proofs. So it's natural to explore the question -- can we use proof length/quality to evaluate the degree of mechanistic undertanding? Can we even write nonvacuousproofs about model behavior at all? We've completed preliminary work showing that the answer to both questions is yes on small max-of-K transformers: blog post, paper.
Models of Computation in Superposition: There's an implicit model of computation in superposition that people in mech interp seem to rely on, where models are using superposition to approximately represent a sparse boolean circuit. In contrast, the standard toy models of superposition are Anthropic's TMS, which focuses on representational superposition, and Scherlis et al's model involving quadratic forms, both of which only consider superposition occurring at a single layer. With some collaborators, we've built out a model of superposition that is closer to the implicit model, where superposition both allows for more compact computation of larger circuits, and where superposition can be maintained across many layers (paper).

What are your next steps?

I'm deciding between returning to METR, my PhD, or other job opportunities.

In case I do pursue a job that allows me to do related research, I'll probably follow up to the projects as follows:

Proofs and Mech Interp: I don't believe that formal proofs will scale to frontier models, but working on the project has made me more convinced of the feasibility of building formal-ish systems for doing mech interp. A natural follow up would be ARC Theory's heuristic arguments work (as laid out in Jacob's recent blog post), which does neatly go around one of the main issues with scaling proofs. I'd probably work on empirical work applying heuristic arguments on transformer models.
Models of Superposition: Over the course of this work, I've become convinced that the model of networks as computing sparse boolean circuits is incorrect. Instead, it seems like every sort of interesting sort of computation requires inhibition or competition between features, and sparse boolean circuits do not allow for inhibition. I think building a model of how inhibition works for features in superposition is the natural next step. (Relatedly, see this post by Redwood on how inhibition allows for substantially more representation power in a one-layer attention transformer than "just" skip-trigrams.)

Is there anything others could help you with?

I'm pretty confused what to do career wise -- any advice would be appreciated.

donated $4,800

Marcus Abramovitch

over 1 year ago

I really liked this when I saw it and I flagged it as something I needed to look back at. Now that I have looked into it more in depth and see Neel's endorsement (which has worked out fantastically for my grant to Joseph Bloom), as well as Evan's endorsement, I finished off the required funding. I also met Lawrence briefly and was impressed by him but this is minor.

I also think his reasoning for doing this work is great.

Neel Nanda

over 1 year ago

Lawrence is great, very experienced with alignment, and I trust his judgement, this seems like a great thing to fund! I would donate myself if this was tax deductible in the UK (which I don't think it is?)

Austin Chen

over 1 year ago

@NeelNanda thanks for weighing in! Manifund doesn't have a UK entity set up, unfortunately. One thing that might be possible would be to figure out a donation swap where eg you commit to donating $10k via Givewell UK and some US-based person who was planning on giving to Givewell instead donates $10k to this project, and you both take tax deductions for your respective countries.

Austin Chen

over 1 year ago

Approving this project, as Lawrence's work falls squarely within Manifund's cause of advancing technical AI safety!

donated $25,000

Evan Hubinger

over 1 year ago

Main points in favor of this grant

Normally I'm somewhat skeptical of totally independent alignment work, but Lawrence has a solid track record and I think his project ideas sound quite exciting. I was also recommended this grant specifically by someone I trust, and encouraged Lawrence to put it up here.

Donor's main reservations

Independent alignment work without any mentorship doesn't have a fantastic track record in my opinion, so it's definitely possible that not much of value will come from this other than helping keep Lawrence learning and doing work (though that is still meaningful upside).

Process for deciding amount

I would fund the full amount here, but I'm starting to run out of money in my Manifund pot. I'd appreciate other funders stepping in to top this off.

Conflicts of interest

None.

Joel Becker

over 1 year ago

I'm not going to look into this right now, because (my skim of) your project pattern-matches to things that I think other regrantors would fund in my absence. Please feel free to get in touch if you haven't received an offer from someone else in 2-4 weeks time.

donated $4,800

Marcus Abramovitch

over 1 year ago

This is very interesting. Want to set up a call and talk about it? Ill take notes (and maybe publish those with your permission)