Project summary
Six months of support for two people, Damiano and Pietro, to write a paper on (dis)empowerment with Jon Richens, Reuben Adams, Tom Everitt, and Victoria Krakovna. This application was suggested by Evan Hubinger.
Project goals
Produce a paper about agency and (dis)empowerment in the context of multi-agent causal models.
Its ultimate aim is to offer formal and operational notions of (dis)empowerment. For example, an intermediate step would be to provide a continuous formalisation of agency, and to investigate which conditions increase or decrease agency.
These definitions should be:
- Conceptually satisfactory: they should deconfuse the concept and remove ambiguity about what it means;
- Operationally implementable: they should make it possible both to identify agents from causal models and to train models (both RL agents and LLMs) so that they are not disempowering, i.e., so that they do not reduce the agency of other agents (including us).
We anticipate a thorough study of these forms of (dis)empowerment, including a literature review, motivating examples and threat models, mathematical formalisations, sound and complete criteria to characterise (dis)empowerment, and the study of concrete examples.
How will this funding be used?
- 30k funding support for 6 months for 2 people;
- 10k expenses for 3 or 4 trips to London: rent, flights + transport, food, …;
- 20k taxation.
What is the recipient's track record on similar projects?
Damiano is a PhD student in Math. Pietro is an M.Sc. student in Math.
A first draft of the project, with preliminary results, was delivered by Victoria Krakovna to Jon Richens and Tom Everitt as part of the selection for Damiano’s research phase of MATS. Jon and Tom agreed to co-supervise the project and to turn it into a paper. Pietro contributed significantly to the deliverable by proofreading it and by sharing and discussing ideas, and joined the project as a result.
For the preliminary results, the scope of the project was narrowed to offering two counterfactual definitions of disempowerment, providing sound and complete criteria to detect them in causal models, and studying how they relate to each other.
When we say "A is (dis)empowered by a policy π_B of B", we understand that both A and B can be agents, groups of agents, or (parts of) the environment itself (so π_B could be a tuple of policies):
- Disempowerment as loss of decisional power: if B chose a different policy, A could make a decision different from its best response to π_B, and that decision would yield A greater utility;
- Disempowerment as the need to cause certain outcomes: A has an incentive to control a set of variables X and is able to control them, but A's best response to π_B is constrained w.r.t. X, in the sense that if the outcome of X were guaranteed regardless of A's actions, then A could pursue a different policy that would be at least as good.
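To make the first definition concrete, here is a minimal sketch of one possible reading of it in a one-shot game with finitely many actions. It is an illustration only, not the criterion developed in the draft: the payoff table, the action names, and the helper functions `best_response` and `disempowerment_witnesses` are all hypothetical, policies are simplified to single deterministic actions, and no causal model is involved.

```python
from typing import Dict, List, Tuple

# Hypothetical payoff table for A: (a_action, b_action) -> utility of A.
Payoff = Dict[Tuple[str, str], float]


def best_response(u_a: Payoff, a_actions: List[str], b: str) -> str:
    """Return A's utility-maximising action against B's action b."""
    return max(a_actions, key=lambda a: u_a[(a, b)])


def disempowerment_witnesses(
    u_a: Payoff, a_actions: List[str], b_actions: List[str], b: str
) -> List[Tuple[str, str]]:
    """List the pairs (b_alt, a) such that, had B played b_alt instead of b,
    A could have taken an action a different from its best response to b
    and obtained strictly more utility than that best response yields.
    Under this (assumed) reading, A is disempowered by b if the list is
    non-empty."""
    br = best_response(u_a, a_actions, b)
    baseline = u_a[(br, b)]
    witnesses = []
    for b_alt in b_actions:
        if b_alt == b:
            continue
        for a in a_actions:
            if a != br and u_a[(a, b_alt)] > baseline:
                witnesses.append((b_alt, a))
    return witnesses


# Toy example: B can block or open a door, A can stay or move through it.
A_ACTIONS = ["stay", "move"]
B_ACTIONS = ["block", "open"]
U_A: Payoff = {
    ("stay", "block"): 1.0,
    ("move", "block"): 0.0,
    ("stay", "open"): 1.0,
    ("move", "open"): 3.0,
}

print(disempowerment_witnesses(U_A, A_ACTIONS, B_ACTIONS, "block"))
print(disempowerment_witnesses(U_A, A_ACTIONS, B_ACTIONS, "open"))
```

In this toy game, blocking the door leaves A with "stay" as its best response, while opening it would let A move and earn more, so "block" counts as disempowering under this reading and "open" does not. Stochastic policies, groups of agents, and the second definition would need a richer setup than this table.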
How could this project be actively harmful?
Any mathematical formalisation of a notion that is to be avoided can also be used to train models that optimise for that notion. This project shares that risk.
What other funding is this person or project getting?
Pietro is a master's student and receives no other funding. Damiano receives a salary as a PhD student in Barcelona.