Manifund
joseph bloom

@josephbloom

Independent Mechanistic Interpretability Research Engineer

https://www.linkedin.com/in/joseph-bloom1/
$0 total balance
$0 charity balance
$0 cash balance

$0 in pending offers

About Me

I'm an independently funded AI Alignment Research Engineer focussing on mechanistic interpretability in reinforcement learning. I'm one of two current maintainers of TransformerLens, a popular open source package for mechanistic interpretability of transformers.

I recently led the Career Development program at the Alignment Research Engineering Accelerator (ARENA). Prior to working in AI Alignment, I studied computational biology, and worked for 2 years as a data scientist in a proteomics startup.

Projects

Joseph Bloom - Independent AI Safety Research

Comments

Joseph Bloom - Independent AI Safety Research

joseph bloom

over 1 year ago

I just wanted to share my recent post: https://www.lesswrong.com/posts/f9EgfLSurAiqRJySD/open-source-sparse-autoencoders-for-all-residual-stream.

In this post I release a set of Sparse AutoEncoders for GPT2-Small and code for training sparse autoencoders, and discuss the tacit knowledge I've accumulated while learning to train them. Despite a bug with one of the improvements (since fixed, and which didn't affect results), these SAEs constitute both the highest quality SAEs published publicly since the Anthropic paper on the same topic and the best documented (e.g., sharing loss curves + code + dashboards).

SAEs are super exciting, but we've got a lot of work to do in order to understand whether they are capturing all the information we want them to, whether they are systematically misrepresenting any information, and whether the information they capture is useful.

Joseph Bloom - Independent AI Safety Research

joseph bloom

over 1 year ago

Progress update

Note: I’ve tried to make this update relatively accessible, but I’m very happy to give more technical details or clarification in the comments or to have a short call with anyone interested in chatting.

What progress have you made since your last update?

TLDR: Over the last four months, I studied trajectory models using mechanistic interpretability techniques, then shifted focus to Sparse Autoencoders (a significant advancement in the field). In this period, I published a post demonstrating goal representation manipulation in one such model and co-authored another, applying learned principles to language models, notably altering their spelling behavior. This work, though insightful, progressed slower than anticipated (which I attribute to several reasons) and may be redundant in some ways due to other recent progress. After consultation with other researchers, as well as Marcus/Dylan, I've redirected my efforts towards Sparse AutoEncoders / Language Models (which I discuss in detail in Next Steps).

The main goal of the grant is to “help predict, detect and/or prevent AI misalignment” via developing a mechanistic understanding of offline-RL models (a model organism of sorts for language models like GPT3). I think of mechanistic interpretability as a natural science of neural network internals and of this research as attempting to contribute to our understanding of the natural phenomena that underpin alignment-relevant properties (e.g., goal representations).

Therefore, I measure progress in the grant via improvements in methods, theories and techniques that enable us to understand neural network internals. We can decompose this into two components:

  • Algorithm Identification: the process of identifying the algorithms and intermediate structures that mediate the mapping of neural network inputs to neural network outputs. This is the well-known circuit-finding agenda (e.g., discussed here).

  • Ontology Identification: learning how a neural network thinks about the world (i.e., mapping internal variables in the model's computation to variables in its environment). 

In the first three months of this grant, Jay Bailey and I progressed towards this goal in the gridworld context. In October, we published “Features and Adversaries in MemoryDT”, where we identified and manipulated internal representations of the gridworld in a trajectory model. Before this work, we had several negative results associated with applications of circuit-finding techniques, which were complicated by some interesting reasons (like my intuitions about superposition/capacity derived from prior work in the field being somewhat flawed) and less exciting reasons (mapping circuits is challenging due to distributed processing). The picture painted by our results and work on sparse autoencoders clarifies why we had extensive superposition despite lots of capacity. 

I then worked on a collaboration with Matthew Watkins, extending some of my insights to language models. We published “Linear encoding of character-level information in GPT-J token embeddings”. Spelling is interesting because it constitutes a task where humans have direct insight into the underlying structure in reality (in this case, the characters that make up words). However, this is hidden from language models due to details about how we input text. There are several academic publications about the surprising phenomenon that LLMs know which characters are in words. We were able to identify and edit linear representations of character information in tokens. This work had some surprising results. In particular, when we delete letters in their token representations, the model predicts subsequent letters from the same word (proportional to their distance to the front of the word). This demonstrates that even if you identify concepts in a model, knowing what will happen when you manipulate them may be another significant challenge.
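To make "identify and edit linear representations" concrete, here is a toy sketch on synthetic data. It is not the method or code from the GPT-J post: the embeddings, the planted direction standing in for "this token contains a given letter", and the helper names `fit_linear_probe` and `erase_direction` are all assumptions made for illustration. It fits a least-squares linear probe, then removes the probed direction from every embedding.

```python
import numpy as np

def fit_linear_probe(embeddings, labels):
    """Fit a least-squares linear probe; returns (direction, bias)."""
    inputs = np.hstack([embeddings, np.ones((embeddings.shape[0], 1))])
    weights, *_ = np.linalg.lstsq(inputs, labels, rcond=None)
    return weights[:-1], weights[-1]

def erase_direction(embeddings, direction):
    """Project out the component of each embedding along `direction`."""
    unit = direction / np.linalg.norm(direction)
    return embeddings - np.outer(embeddings @ unit, unit)

rng = np.random.default_rng(0)
n_tokens, d_model = 500, 32  # illustrative sizes, not GPT-J's

# Hypothetical planted direction standing in for "token contains the letter"
planted = rng.normal(size=d_model)
planted /= np.linalg.norm(planted)
has_letter = rng.integers(0, 2, size=n_tokens).astype(float)
embeddings = rng.normal(size=(n_tokens, d_model)) + 3.0 * np.outer(has_letter, planted)

# Identify: a linear probe recovers the label from the embeddings
probe_direction, probe_bias = fit_linear_probe(embeddings, has_letter)
predicted = (embeddings @ probe_direction + probe_bias) > 0.5
accuracy = np.mean(predicted == (has_letter > 0.5))

# Edit: after erasing the probed direction, the probe reads the same value everywhere
edited = erase_direction(embeddings, probe_direction)
probe_readout_after = edited @ probe_direction
```

In this toy setting the probe can no longer separate the two groups after the edit. What a real model does downstream of such an edit is a separate question, which is the point made above.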

Neither of these posts was particularly popular on LessWrong, which is reasonable given that there has been considerable progress and publications in the field in the last four months. Feedback from some researchers was positive, but suggested the slow progression was evidence that a pivot might be needed. I don’t feel that I can claim we moved the needle on alignment or mechanistic interpretability much with this work and this is somewhat due to the project betting heavily on mechanistic interpretability being harder than it has turned out to be in language models. Nevertheless, the results in these posts tie in nicely to various phenomena that the research community has begun to understand better and I feel I developed significantly via working on this project. Lastly, I should mention that I am doing research building directly on my codebase / the posts and it seems plausible that for unanticipated reasons both the code / insights may become valuable in the future. 

What are your next steps?

TLDR: I’ve pivoted to training Sparse Autoencoders (SAEs) on small language models to assess how they solve the ontology identification problem, which is a prerequisite for reasoning well about goals/agency within neural networks. I’ve built my own SAE training library and followed up on preliminary experimental results under the supervision of Neel Nanda as part of the MATS program. 

Sparse AutoEncoders are a fascinating new technique that substantially advances our understanding of model internals. This technique enumerates many concepts and identifies which are inferred at run time by a model at a specific position in the network internals. The striking result is that the concepts, called “features,” are often highly human interpretable (e.g., the concept of words that start with the letter “M” or phrases/words associated with Northern England / Scotland). As a computational biologist, I think saying SAEs are to neural networks as DNA sequencing is to cell biology is pretty accurate.
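For readers who haven't seen one, the following is a minimal sketch of a sparse autoencoder forward pass and loss in one common formulation (ReLU encoder, linear decoder, L1 sparsity penalty). It is not the training library mentioned in this update: the dimensions, the random initialisation, the `l1_coefficient` value and the function names are illustrative assumptions, and random noise stands in for real model activations.

```python
import numpy as np

def sae_forward(activations, W_enc, b_enc, W_dec, b_dec):
    """Encode activations into non-negative features, then reconstruct them."""
    # ReLU zeroes out features with negative pre-activations
    features = np.maximum(0.0, (activations - b_dec) @ W_enc + b_enc)
    reconstruction = features @ W_dec + b_dec
    return features, reconstruction

def sae_loss(activations, features, reconstruction, l1_coefficient):
    """Reconstruction error plus an L1 penalty that pushes features towards zero."""
    mse = np.mean(np.sum((activations - reconstruction) ** 2, axis=-1))
    l1 = np.mean(np.sum(np.abs(features), axis=-1))
    return mse + l1_coefficient * l1

rng = np.random.default_rng(0)
d_model, d_sae, batch = 16, 64, 8  # the feature dictionary is wider than the activations

activations = rng.normal(size=(batch, d_model))  # stand-in for model activations
W_enc = rng.normal(scale=0.1, size=(d_model, d_sae))
W_dec = rng.normal(scale=0.1, size=(d_sae, d_model))
b_enc = np.zeros(d_sae)
b_dec = np.zeros(d_model)

features, reconstruction = sae_forward(activations, W_enc, b_enc, W_dec, b_dec)
loss = sae_loss(activations, features, reconstruction, l1_coefficient=1e-3)

# Number of features active at each position (often called L0)
active_per_token = (features > 0).sum(axis=-1)
```

Each column of `features` plays the role of one candidate concept. With untrained weights the features are meaningless; training is what drives the reconstruction error down while keeping the number of active features small.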

For this reason, I reached out to Dylan / Marcus (the two significant funders) to check whether it would be ok if I pivoted to working on this new technique in the language model context (as well as checking with Neel Nanda, who supported the shift). They gave me the go-ahead, so that’s been my direction for the last two months. 

To support this research, I built on a few open-source libraries to make my own SAE training library, which I’ve used to train sparse autoencoders on various models, focusing especially on GPT2 small. Under Neel’s supervision at MATS, I’ve been exploring a few directions that I think address the critical alignment-relevant questions about SAEs:

  1. Are Sparse AutoEncoders capturing all of the information that we want them to? One way to think about this is that if you sequence DNA, print it, and then stitch it back into an organism, then the organism shouldn’t die. Molecular biologists do small versions of this all the time. The way we train sparse autoencoders very much suggests that we should get a similar property (we can replace the internals with our reconstruction), but in practice, that reconstruction does hurt the model performance. I’ve got some preliminary results showing we can better represent more information with more concepts concurrently without having those concepts become uninterpretable. Still, there’s more work to do to measure all the variables we care about here, such as how errors propagate through the model.

  2. Do Sparse AutoEncoders systematically misrepresent any information? To make sure sparse autoencoders come up with features that are interpretable to us, we enumerate many concepts and try to make sure that we don’t have too many features appear at the same time. However, it’s unclear that a biased process will find the “true” underlying concepts. Since AI alignment will likely require that we are very good at estimating the true “ontology” of the model, I’m very interested in trying to find ways of measuring the distance between the “true” ontology and what we are finding along axes that aren’t just how well we recover the model performance. I’ve explored some experiments that may get at this via studying QK circuits, which we may follow up on.

Regarding practical details, I’ll likely settle on a specific direction shortly and pursue that as part of MATS. I’ll write this up in a research plan as part of MATS and share it here. 

Neel expects his mentees to publish their work in academic articles, so I will likely be close to doing that by the time the Manifund grant period ends. Since I’ve received a LightSpeed grant with another six months of funding, I anticipate being able to continue this research for most of this year, by which time I expect to have results that justify further funding.

Is there anything others could help you with?

Whilst I think I’m mostly okay for funding/everything else (not accepting MATS funding or flight reimbursement), it is undoubtedly the case that Sparse Autoencoders are incredibly compute-hungry. So access to a cloud computing cluster, or knowing that there are enough funds if I need to run some big experiments, would be good.

As an estimate, it can cost $3 / hour and take 12 hours to train one SAE on GPT2 small, and we might want to train 12 of these, which would add up to about $430. This is a lower bound, as varying hyperparameters, working on larger models, and analysing features post-hoc will all increase the compute expenditure. It seems plausible that the previous budget of 10k per year is an underestimate by 2-5x.
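The arithmetic behind that estimate can be sketched as follows. The function name and the assumption that cost scales linearly with hours and number of SAEs are illustrative; the input figures are the ones quoted above.

```python
def sae_training_cost(dollars_per_hour, hours_per_sae, num_saes):
    """Total compute cost, assuming cost scales linearly with hours and SAE count."""
    return dollars_per_hour * hours_per_sae * num_saes

# Figures quoted in the estimate: $3/hour, 12 hours per SAE, 12 SAEs
baseline_cost = sae_training_cost(3.0, 12, 12)

# A 2-5x underestimate applied to the previous 10k yearly budget
previous_budget = 10_000
revised_budget_range = [previous_budget * factor for factor in (2, 5)]

print(f"One sweep of 12 SAEs: ${baseline_cost:.0f}")
print(f"Revised yearly budget range: ${revised_budget_range[0]:,} to ${revised_budget_range[1]:,}")
```

A single sweep comes to $432, and the 2-5x multiplier puts the yearly compute budget somewhere between 20k and 50k.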

Since I don’t want the stress of being handed a lot of money to spend on computing, I mildly prefer access to compute clusters (or a line of credit to be used only for computing or something). This isn’t essential/urgent yet as I still have some uncertainty over whether the research is significantly accelerated by training many SAEs or whether it will be essential to work with larger models. 

Joseph Bloom - Independent AI Safety Research

joseph bloom

over 1 year ago

@MarcusAbramovitch Thanks Marcus!

Retroactive funding for Don't Dismiss Simple Alignment Approaches

joseph bloom

over 1 year ago

I wouldn't usually comment on other people's projects but I've been mentioned in the proposal and @Austin's response. Furthermore, I recently published some research which relates to many of the main themes in Chris's post (world models, steering vectors, superposition).

It's not obvious to me that more posts like these will lead to more good work being done. I don't think we are bottlenecked on ambitious, optimistic people and this post is redundant with others in terms of convincing people to be excited about these research outcomes.

I'd be keen on seeing more results of the kind discussed in the post, but my prior on paying people to promote that work on LW being an optimal use of funds is low.

Joseph Bloom - Independent AI Safety Research

joseph bloom

almost 2 years ago

I don't think it's likely I will be hired by DeepMind, as I interviewed for a role recently and they decided not to proceed. I was also told to expect that, if I had joined the team, I would likely have been working on language models.

Joseph Bloom - Independent AI Safety Research

joseph bloom

almost 2 years ago

A few points on this topic:

  • Jay Bailey, a former senior software/devops engineer and SERI-MATS scholar, has been funded to work on this agenda and has begun helping me out. I'm also discussing collaborations with other people from more of a maths / conceptual alignment background, which I hope will be useful.

  • I agree mentorship is useful and plan to make an effort to find a mentor, although I've also been regularly discussing parts of my work with alignment researchers. At least one well respected alignment researcher told me that it's plausible that this kind of work is teaching me more than I'd learn at an org, but I know Neel disagrees.

  • I'm likely to co-work part time in a London AI safety office if one exists in the future.

I think I'm approaching my research with somewhat of a scout mindset here. It seems plausible that independent research for some people is pareto optimal for the community across output from potential mentees/mentors. I am also considering an experiment where I do a small collaboration with an organisation, which may provide evidence in the other direction. If it were true that this was productive and alleviated a mentorship bottleneck, then finding that out might be valuable/inform future funding strategies.

Transactions

For                                             Date                 Type               Amount
Manifund Bank                                   over 1 year ago      withdraw           51400
Joseph Bloom - Independent AI Safety Research   almost 2 years ago   project donation   +250
Joseph Bloom - Independent AI Safety Research   almost 2 years ago   project donation   +25000
Joseph Bloom - Independent AI Safety Research   almost 2 years ago   project donation   +25000
Joseph Bloom - Independent AI Safety Research   almost 2 years ago   project donation   +790
Joseph Bloom - Independent AI Safety Research   almost 2 years ago   project donation   +10
Joseph Bloom - Independent AI Safety Research   almost 2 years ago   project donation   +200
Joseph Bloom - Independent AI Safety Research   almost 2 years ago   project donation   +100
Joseph Bloom - Independent AI Safety Research   almost 2 years ago   project donation   +50