I really liked this when I saw it and flagged it as something I needed to look back at. Now that I've looked into it in more depth and seen Neel's endorsement (which worked out fantastically for my grant to Joseph Bloom), as well as Evan's endorsement, I finished off the required funding. I also met Lawrence briefly and was impressed by him, but this is minor.
I also think his reasoning for doing this work is great.
Exploring novel research directions in prosaic AI alignment
Project summary
TL;DR: I am planning to independently explore various technical research directions in AI Safety, which will involve literature reviews, small-scale experiments, and writing on different aspects of the alignment problem. I’m currently working on a few directions in adversarial training. I'll probably work on this for 3 months FTE at a rate of approximately $10,000 per month, totaling around $30,000. I might continue this if it goes well.
Motivation and longer project description:
In my time as an AIS grantmaker, I’ve found that there really isn’t much diversity in the projects people are working on – the vast majority work on some variant of mech interp and LM jailbreaks/redteaming/evals, with the remainder mainly working on a few conceptual or theory projects such as Singular Learning Theory or Boundaries. I’m quite sure that there are other underexplored areas of AIS, and so I’ve decided to leave my position at Evals to work on finding new areas. From Neel’s experiences with the grokking mech interp work (and from working with him on said project), it seems that “think hard on your own about a project while doing a bunch of toy experiments, then turn your preliminary results into a few academic papers” can be a promising path to creating and popularizing new research directions. I’m hoping to recreate the success the community had with mech interp in 2022, but in another area of research.
This month I’ve been focused on looking into adversarial training, because it seems like an area that was of great interest even ~2 years ago but has been largely dropped since. Since I started ~3 weeks ago, I’ve done a literature review of the Latent Adversarial Training space (annotated bibliography here), read up on the old “adversarial examples as bugs not features” debate (mainly as a source of toy experiments), solicited a few semi-formal problems in this area that I’ve been thinking about, and started writing up thoughts on the theory of adversarial training/examples. I’m currently a bit disappointed by the state of academic AT as it relates to using AT for AGI safety, so I’ll probably do fewer lit reviews and more toy experiments and independent writing/thinking going forward. Here are a few of the research directions I’m currently looking into. Note that these are currently selected more for tractability than for importance (I’m doing this because I think it’s valuable to get real-world feedback on related topics while working on more conceptual problems):
Applying mech interp via sparse autoencoders to datasets from the “adversarial examples as bugs not features” debate (and the subsequent discussion on Distill), to see if they can differentiate robust and non-robust features.
Testing the limits of latent adversarial training in toy examples, to better understand when it makes sense to relax AT by allowing for internal perturbations (see the sketch after this list).
Exploring non-input space, model-internals-based adversarial training methods on Redwood Research’s toy text examples from 2022.
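To make the second direction concrete, here is a minimal sketch of what latent adversarial training looks like in practice: instead of perturbing the input, an adversary perturbs an intermediate activation within a small norm ball, and the model is trained against the worst-case perturbation found. This is an illustrative PyTorch sketch under assumed names (`encoder`, `head`, `epsilon`, `n_steps` are placeholders I've chosen), not the actual experimental setup from the proposal.

```python
# A minimal, illustrative sketch of latent adversarial training (LAT) in PyTorch.
# Assumptions (not from the proposal): the model is split into `encoder`
# (layers up to the perturbed activation) and `head` (the remaining layers),
# and the adversary perturbs the latent within an L2 ball of radius `epsilon`.
import torch


def latent_adversarial_loss(encoder, head, loss_fn, x, y,
                            epsilon=0.1, step_size=0.02, n_steps=5):
    """Return the training loss under a worst-case latent perturbation."""
    h = encoder(x)  # clean intermediate activations
    delta = torch.zeros_like(h, requires_grad=True)

    # Inner maximization: projected gradient ascent on the latent perturbation.
    # (Norms are taken over the whole batch here for simplicity.)
    for _ in range(n_steps):
        inner_loss = loss_fn(head(h.detach() + delta), y)
        (grad,) = torch.autograd.grad(inner_loss, delta)
        with torch.no_grad():
            delta += step_size * grad / (grad.norm() + 1e-8)
            scale = torch.clamp(epsilon / (delta.norm() + 1e-8), max=1.0)
            delta *= scale  # project back into the epsilon-ball

    # Outer minimization: train the model against the fixed perturbation,
    # with gradients flowing through both encoder and head.
    return loss_fn(head(encoder(x) + delta.detach()), y)
```

In a training loop this would replace the standard loss computation (backpropagate and step as usual); in toy settings one could then vary the perturbed layer, epsilon, and the norm constraint to probe when relaxing AT to internal perturbations behaves sensibly.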
Ultimately, I aim to produce several literature reviews, 2-3 blog posts with my perspectives on AI alignment, some toy problems with preliminary results, and potentially a CFP, depending on the quality of these results. (If I don’t produce at least two literature reviews, a blog post, and a toy problem with preliminary results, I would consider this effort a failure.)
How will this funding be used?
The majority of it will be spent on my salary + various taxes and fees, including paying for access to a coworking space.
Who is on your team and what's your track record on similar projects?
It's just me working on it. I've worked on many conceptual formalization projects and have contributed to or written several papers, including work done in the past as a PhD student or independent researcher, or otherwise with minimal outside support or supervision.
What are the most likely causes and outcomes if this project fails? (premortem)
The primary concern is that most independent researchers don’t do super well in terms of productivity. However, I had some luck doing this in late 2022 (when I extended Neel’s grokking results and got them published), I’ve set up accountability schemes with the MATS Scholar Support team, and I will find collaborators at FAR AI, Redwood, and various PhD programs. It also seems plausible that I can get mentorship from my PhD advisors or other senior researchers once I have preliminary results. So I think the risk of complete burnout/failure is quite low.
That being said, a secondary concern is that the areas I'm looking at or the projects I'm considering may be completely unpromising. Even so, I think that would be an interesting result in its own right, and I'd be surprised if I couldn't find any projects I'd be excited to work on.
There's a decent chance I'll continue working part-time in other roles over the next few months, so the actual project deadline will be in ~6 months, but I intend to spend the equivalent of 3 months FTE on this project or other comparable independent research efforts.
What other funding are you or your project getting?
None: I'm currently funding this effort out of pocket, with the goal of getting retroactive funding assuming things go well. At lower funding amounts I'm likely to apply for retroactive funding for a portion of the work.

Marcus Abramovitch
over 1 year ago
Neel Nanda
over 1 year ago
Lawrence is great and very experienced with alignment, and I trust his judgement; this seems like a great thing to fund! I would donate myself if this were tax-deductible in the UK (which I don't think it is?)
Austin Chen
over 1 year ago
@NeelNanda thanks for weighing in! Manifund doesn't have a UK entity set up, unfortunately. One thing that might be possible would be to figure out a donation swap where, e.g., you commit to donating $10k via GiveWell UK and some US-based person who was planning on giving to GiveWell instead donates $10k to this project, and you both take tax deductions in your respective countries.
Austin Chen
over 1 year ago
Approving this project, as Lawrence's work falls squarely within Manifund's cause of advancing technical AI safety!
Evan Hubinger
over 1 year ago
Main points in favor of this grant
Normally I'm somewhat skeptical of totally independent alignment work, but Lawrence has a solid track record and I think his project ideas sound quite exciting. Someone I trust also specifically recommended this grant to me, and I encouraged Lawrence to put it up here.
Donor's main reservations
Independent alignment work without any mentorship doesn't have a fantastic track record in my opinion, so it's definitely possible that not much of value will come from this other than keeping Lawrence learning and doing work (though that is still meaningful upside).
Process for deciding amount
I would fund the full amount here, but I'm starting to run out of money in my Manifund pot. I'd appreciate other funders stepping in to top this off.
Conflicts of interest
None.
Joel Becker
over 1 year ago
I'm not going to look into this right now, because (my skim of) your project pattern-matches to things that I think other regrantors would fund in my absence. Please feel free to get in touch if you haven't received an offer from someone else in 2-4 weeks' time.

Marcus Abramovitch
over 1 year ago
This is very interesting. Want to set up a call and talk about it? I'll take notes (and maybe publish those with your permission).