Project summary
TL;DR: I am planning to independently explore various technical research directions in AI Safety, which will involve literature reviews, small-scale experiments, and writing on different aspects of the alignment problem. I’m currently working on a few directions in adversarial training. I'll probably work on this for 3 months FTE at a rate of approximately $10,000 per month, totaling around $30,000. I might continue this if it goes well.
Motivation and longer project description:
In my time as an AIS grantmaker, I’ve found that there really isn’t much diversity in the projects people are working on – the vast majority work on some variant of mech interp or LM jailbreaks/redteaming/evals, with the remainder mainly working on a few conceptual or theory projects such as Singular Learning Theory or Boundaries. I’m quite sure there are other underexplored areas of AIS, and so I’ve decided to leave my position at Evals to work on finding new ones. From Neel’s experience with the grokking mech interp work (and from working with him on that project), it seems that “think hard on your own about a project while doing a bunch of toy experiments, then turn your preliminary results into a few academic papers” can be a promising path to creating and popularizing new research directions. I’m hoping to recreate the success the community had with mech interp in 2022, but in another area of research.
This month I’ve been focused on adversarial training, because it seems like an area that drew a lot of interest even ~2 years ago but has been largely dropped since. Since starting ~3 weeks ago, I’ve done a literature review of the Latent Adversarial Training space (annotated bibliography here), read up on the old “adversarial examples as bugs not features” debate (mainly as a source of toy experiments), solicited a few semi-formal problems in this area that I’ve been thinking about, and started writing up thoughts on the theory of adversarial training/examples. I’m currently a bit disappointed by the state of academic AT as it relates to AGI safety, so I’ll probably do fewer lit reviews and more toy experiments and independent writing/thinking going forward. Here are a few of the research directions I’m currently looking into. Note that these are currently selected more for tractability than importance, because I think it’s valuable to get real-world feedback on related topics while working on more conceptual problems:
Applying mech interp via sparse autoencoders to datasets from the “adversarial examples as bugs not features” debate (and the subsequent discussion on Distill), to see if they can differentiate robust and non-robust features.
Testing the limits of latent adversarial training in toy examples, to better understand when it makes sense to relax AT by allowing perturbations to a model’s internal activations (a minimal sketch of the idea follows this list).
Exploring non-input-space, model-internals-based adversarial training methods on Redwood Research’s toy text examples from 2022.
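To make the latent adversarial training idea concrete, here’s a minimal PyTorch sketch of a single training step, where the adversary perturbs a hidden activation rather than the input. Everything here is illustrative and my own framing, not a claim about any particular paper’s method: the two-stage model, the split point where the “latent” lives, the L-infinity perturbation set, and the hyperparameters (`epsilon`, `alpha`, `n_pgd_steps`) are all assumptions chosen for exposition.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# Illustrative two-stage model: the "latent" is the activation between the stages.
encoder = nn.Sequential(nn.Linear(784, 256), nn.ReLU())
head = nn.Linear(256, 10)
opt = torch.optim.Adam(list(encoder.parameters()) + list(head.parameters()), lr=1e-3)

# Hypothetical hyperparameters; real values would need tuning per task.
epsilon, alpha, n_pgd_steps = 0.5, 0.1, 5

def lat_step(x, y):
    """One latent adversarial training step: PGD on the hidden activations
    rather than on the input, then a model update on the perturbed loss."""
    hidden = encoder(x)
    delta = torch.zeros_like(hidden, requires_grad=True)

    # Inner loop: find a latent perturbation that maximizes the loss
    # (encoder is detached so only delta is attacked here).
    for _ in range(n_pgd_steps):
        loss = F.cross_entropy(head(hidden.detach() + delta), y)
        grad, = torch.autograd.grad(loss, delta)
        with torch.no_grad():
            delta += alpha * grad.sign()     # gradient *ascent* on delta
            delta.clamp_(-epsilon, epsilon)  # project back into the L-inf ball

    # Outer step: train the whole model on the adversarially perturbed latent.
    opt.zero_grad()
    loss = F.cross_entropy(head(encoder(x) + delta.detach()), y)
    loss.backward()
    opt.step()
    return loss.item()
```

The design question this makes visible (and what I want to probe in toy settings) is the choice of perturbation set: an epsilon-ball in activation space has no obvious correspondence to realistic failures the way input-space balls do, which is exactly why it’s unclear when relaxing AT this way is sensible.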
Ultimately, I aim to produce several literature reviews, 2-3 blog posts with my perspectives on AI alignment, some toy problems with preliminary results, and potentially a CFP, depending on the quality of these results. (I think if I don’t have at least 2 literature reviews, a blog post, and a toy problem with preliminary results, I would consider this effort a failure.)
How will this funding be used?
The majority of it will go toward my salary, plus various taxes and fees; it will also pay for access to a coworking space.
Who is on your team and what's your track record on similar projects?
It's just me working on it. I've worked on many conceptual formalization projects and have written or contributed to several papers, including some done as a PhD student or independent researcher with minimal outside support or supervision.
What are the most likely causes and outcomes if this project fails? (premortem)
The primary concern is that most independent researchers struggle with productivity. However, I’ve had some success with this mode of work in late 2022 (extending Neel’s grokking results and getting the work published), I’ve set up accountability schemes with the MATS Scholar Support team, and I plan to find collaborators at FAR AI, Redwood, and various PhD programs. It also seems plausible that I can get mentorship from my PhD advisors or other senior researchers once I have preliminary results. So I think the risk of complete burnout/failure is quite low.
That being said, a secondary concern is that the areas I'm looking at or the projects I'm considering may be completely unpromising. Even then, establishing that would itself be an interesting result, and I'd be surprised if I couldn't find any projects I'd be excited to work on.
There's a decent chance I'll continue working part-time in other roles over the next few months, so the actual project deadline will be in ~6 months, but I intend to spend the equivalent of 3 months FTE on this project or other comparable independent research efforts.
What other funding are you or your project getting?
None: I'm currently funding this effort out of pocket, with the goal of getting retroactive funding if things go well. At lower funding amounts, I'm likely to apply for retroactive funding for a portion of the work.