What progress have you made since your last update?
The 6 weeks were mostly spent on three projects:
1. Planning out the literature review mentioned in the grant.
2. Contributing to SLT research related to validating the developmental interpretability research agenda.
3. Learning to use TPUs and PyTorch/XLA, and teaching this to the developmental interpretability research community.
Project (1) is not yet complete. I made some progress recruiting co-authors and planning a literature review. However, we didn't make much progress writing the actual review: all we have is a review sketch. I consider this an overall failure. (I do still think I can complete the review in the near future without additional funding. More below.)
Project (2) was much more successful. Over the 6 weeks I successfully replicated an in-context learning experiment based on prior work, and this went on to become the foundation for a collaborative research project leading to the paper "The Developmental Landscape of In-Context Learning" (under review, arXiv preprint, associated LessWrong post).
Project (3) was also successful. It culminated in me delivering a tutorial for the local alignment research community on how to accelerate research experiments with free TPUs from Google (tutorial, recording). Since then I have built on this experience by learning JAX and I am planning a JAX course for my research group for later this year.
In addition to these outcomes, I kept a weekly research journal for the first few weeks of the project, which contains some more detailed commentary.
Immediately after the 6 weeks, I transitioned to full-time research work at Krueger AI Safety Lab and have been working since then on other projects (plus continuing to collaborate on project (2) with the developmental interpretability research community).
Overall evaluation
My main personal goal for the project was to fill a 6-week gap between paid work opportunities and to contribute in some positive ways to alignment research. I feel that this broader goal was achieved through projects (2) and (3) described above.
This broader motivation was part of my initial pitch to Adam Gleave, the regrantor who awarded my funding. However, Adam was most excited about the literature review aspect of my proposal, and chose to emphasise that aspect of the project in his writeup for this Manifund proposal.
Since I didn't manage to produce a literature review in the 6 weeks and also haven't managed to produce one since then, I consider this project to have failed.
What are your next steps?
However, I still think the project is recoverable. Since setting out to produce the literature review, there have been three publications that partially fill the need for this resource for the SLT/alignment research community.
A summary paper by the founder of SLT: Watanabe, 2022, "Recent Advances in Algebraic Geometry and Bayesian Statistics"
A methods paper including a good technical introduction to different levels of singularity, a central concept for SLT: Lau et al., 2023, "Quantifying Degeneracy in Singular Models via the Learning Coefficient"
Another methods paper including a good technical introduction to different definitions of the learning coefficient, a central quantity in SLT: Furman and Lau, 2024, "Estimating the Local Learning Coefficient at Scale".
These resources are enough to tide the community over, but there is still a need for a comprehensive, accessible technical introduction to the theory. Moreover, the SLT/developmental interpretability community has made some progress establishing the viability of this research direction, and so the need for such a resource within the wider alignment community is still present.
Working with some of the authors of the latter two articles I mentioned above, I still plan to write up the literature review we have planned. To do so I am waiting for an appropriate opportunity. I believe I will have an opportunity in the latter half of 2024 when my KASL project comes to a conclusion, in the few months before and after the start of my DPhil.
Is there anything others could help you with?
No.
In particular, I do not currently require more funding to find time to work on this.
If there is anyone who is interested in collaborating on the literature review, however, feel free to reach out.