You're pledging to donate if the project hits its minimum goal and gets approved. If not, your funds will be returned.
In recent years, the research community has raised concern that the development of AI (AGI, TAI) will significantly increase x-risk, and the threat of an AI takeover represents a significant fraction of the risk (more). If AGI systems are able to adopt steganography to hide their reasoning via plain text, then it will be more difficult to detect and thwart takeover attempts. It currently seems plausible that AI labs will unintentionally introduce steganography capabilities via a combination of pre-training and fine-tuning, especially more complex forms of reinforcement learning (RL), preventing this is likely to reduce the risk from rogue AI and related scenarios (more). It is important to investigate conditions under which steganography might be induced (more). A few researches addressed this in the past (more), and this seems a promising direction now (more). In this vein, I plan to train LLMs via RL with the goal to find those conditions, publish results so that the research community and the leading AI labs would change their RL training procedures and reduce the possibility of inadvertently inducing capabilities that enable effective collusion. (more).
I plan to do the following series of experiments:
Series 1. Fine-tuning a LLM using RL to do successful steganography, i.e. an encoder and a decoder are the same model that transmits in stegotext a message hidden from an overseer. The message might be just one bit or several bytes of text. The encoder/decoder model is LLAMA 2 or similar model available for RL. The overseer might be a LLM, a regex filter, or other. More things to try: different hyperparameters (KL penalty, discount, etc.), different prompting, training time, other. The successful steganography means that the overseer doesn’t detect most hidden messages (≃50%) and the decoder receives correct messages most of the time (>50%).
Series 2. Sycophantic answer but non sycophantic CoT. It follows this paper (Roger and Greenblatt, 2023). Measurements are similar to the first experiments. Other experiments where CoT is rewarded differently with the answer.
After that, if the results are negative (no steganography), I plan to publish a post on LessWrong / EA forum with description of this project, the code that I have and experiments results which might help future research in this direction.
If the results are not negative, I plan to expand this work to find the actual conditions that lead to the steganography and to publish a paper about this. I plan to publish a blog post as well.
I plan to conduct the first experiment before 15th of March (the symposium date). I think I will also create a blog post before that date if the results are negative.
Stipend: $32k for four months of full-time work (40 h/week) at $50/h
Compute: $2k for computing expenses
Travel: $2k for travel expenses to and from London
Office: $4k for office expenses at $920/month in the London Initiative for Safe AI
Total: $40k for four month extension program in London
Me, Artem Karpov. During the MATS Winter 2024 Cohort, Artyom worked with Jeffrey Ladish, the Understanding AI Hacking track, on the investigation of steganography in LLMs under RL(HF). Before that, he coauthored a project on inducing human-like biases in moral reasoning LMs, finished ARENA and MLSS programs, worked as software engineer in high load projects for several years.
Perhaps, also Felix Binder
Training via RL is heavy and might require more effort (man-hours, compute) that I’m planning to invest. So I might not reach conclusive results in reasonable time. This will likely require one or several Nvidia A100 or similar GPUs with 80GB for hours to fine tune LLAMA 70B or the like. In this case, I might ask CAIS for more compute and call for help by posting on LessWrong or other sites.
The results might become irrelevant, e.g. if LLMs aren’t used as the core of AGI or if steganography in LLMs can be cheaply prevented or detected.
It may not be possible to induce steganography RL or it may be hard. This is a negative result that might be hard to reach but still it is valuable to investigate as described above.
Infohazard. The results might be used to intentionally induce steganography into LLMs with some malicious goal (a backdoor, e.g.).
So far none.
The existential risk (x-risk) from unaligned AI is likely the major global problem in this century, 1-10% probability by different authors (Ord, 2020, p. 264; Karnofsky, 2021; Carlsmith, 2022). The deception problem was specified as a key step to rouge unaligned AI takeover scenarios (Bostrom, 2014, chap. 6; Hubinger et al., 2019; Carlsmith, 2022, p. 37; Ngo, Chan and Mindermann, 2022, p. 7). It is probable that the core of AGI will be LLMs if we are in a world with a short timeline (Fabien Roger, 2023) though some oppose this idea (Thane Ruthenis, 2023). LLMs can reason inside the text they autoregressively produce, i.e. chain-of-thought (CoT) or similar constructs like tree-of-thoughts (Roger, 2022; Wei et al., 2023). Models might develop the steganography ability, i.e. encode messages inside CoT because it is possible to train a model via SFT to do steganography (Roger and Greenblatt, 2023; Motwani et al., 2024) which can be used to hide reasoning. Thus it is important to know under what conditions models might develop this ability and how to monitor it. There is a growing concern that the steganography might naturally develop and lead to AI takeover when models are trained using RL or RLHF (Ajeya Cotra, 2022; Roger and Greenblatt, 2023) though early research has not been able to demonstrate this (Roger, 2023). RLHF is the primary way to fine-tune current large language models (Gemini Team et al., 2023; OpenAI, 2023; Llama 2: Open Foundation and Fine-Tuned Chat Models | Research - AI at Meta, 2023), so the AI labs might unintentionally induce stenographic capabilities into their models. Thus it is important to address this threat model from the deception via steganography as currently it is unclear when and/or how models might develop this ability in the RL setup, if they are able at all. Another possibility is that models that already can hide messages might contaminate datasets they produce thus inducing the steganography in future models (see this comment).
Another promising direction is to investigate the steganography ability when several models are in one pipeline and some control or monitor others. This setup is highly likely for the current level of alignment of LLMs as discussed in (Holden Karnofsky, 2022; Greenblatt et al., 2023). Then the threat model for this might be, again, the rogue AI as described above or such configurations might become a risk factor for other threat models like value lock-in, enfeeblement, etc. (Allan Dafoe, 2020; Hendrycks and Mazeika, 2022). The AI Control approach (Greenblatt et al., 2023) involves using trusted weaker models to supervise untrusted stronger models. Collusion via steganography might open the possibility for deception by these models because they might be trained via RL together or share data. Steganographic capabilities might be a risk factor for the enfeeblement threat model because models might be penalized to avoid sycophantic reasoning in CoT or a like by another models (overseers), yet still produce untruthful but sycophantic answer because the rewards come from users (Roger and Greenblatt, 2023).
Steganography is one of the ways agents might collude in multi-agent, multi-LLM, setup: models can collude in overt ways, modify the log of their communication to conceal collusion, hack their monitoring, develop other communication channels, use backdoors, etc. (Liu et al., 2019; Hubinger et al., 2024; Karpov, 2024). Probably, among the above listed ways to collude, the steganography is the most available as it doesn’t require external actions, hacking, modifications of weights, etc. RLHF is used widely and we already see evidence of basic steganographic capabilities in GPT-4 (Motwani et al., 2024). It might be enough to assess models after RL training using evaluation frameworks (Motwani et al., 2024) or modify their output to prevent this threat (Roger and Greenblatt, 2023), but this might prove too difficult or expensive.
It's not clear how tractable it is to induce steganographic capabilities in models via RL. Only a few people have tried on smaller models, but no one has yet demonstrated positive results: (Roger, 2023) and as per my discussions with other researchers (Zygimantas Straznickas, Felix Binder). GPT-4, and probably other models, currently unlikely use the steganography in their CoT reasoning (Felix Binder, 2023). However, we think it's possible because larger models can do steganography to some degree if prompted and they can be fine-tuned to do it better (Roger, 2023; Motwani et al., 2024). Also this project requires fine-tuning larger models with billions of parameters, which requires compute and poses technical difficulties (eg. fitting models into memory, slow training time, etc.).
To my best knowledge, this is quite a neglected problem because only 2 researchers have tried this RL for steganography, with the aim to investigate the conditions, so far in the last months, (Roger, 2023) and Zygimantas Straznickas. A few researchers worked on the steganography in LLMs (de Witt et al., 2023; Roger and Greenblatt, 2023; Motwani et al., 2024).
Category / area: honest AI (outer alignment), limited AI.
Specifying conditions under which LLMs might learn to hide messages (steganography) via plausible RL training environments, is highly likely to be useful for immediate research in AI control methods and oversight, as well as for longer term research into understanding risks from scheming AIs. Also this research is applicable to preventing misaligned behavior in current systems already deployed by the leading organizations. This includes leading AI labs (OpenAI, Anthropic, DeepMind etc.), and researchers interested in alignment of deceitful AIs (Carlsmith, Fabien Roger, others). Successful demonstration of the induced steganography behavior resulting from RLHF would provide empirical evidence to the theoretical concerns already discussed by others (Casper et al., 2023). This might raise more concern for the safety of RLHF in general. Specifically, engineers and researchers at the labs are more likely to work on mitigations to disincentivize the development of steganographic capabilities under RL training regimes. Impeding the ability to hide messages in the plain sight is the necessary condition to analyze models output and thus is likely to reduce the risks from systems that run with LLMs inside, or use such LLMs for AI control. Thus this is likely to reduce the risk from unaligned deceitful LLM based agents (or one rogue AI), or from the collusion of LLMs. As was shown above, this deception is a key step in the AI takeover scenario, so reducing its possibility is likely to reduce this x-risk.
This project aims to discover how RLHF or other RL training methods might induce steganographic capabilities because currently, to the best of my knowledge, we don’t know what rewards, setup, size of a model, etc. could lead to models that use the steganography to hide their reasoning. Importantly, this project doesn’t address the basic possibility of the use of steganography by LLMs because that has already been demonstrated (Roger and Greenblatt, 2023; Motwani et al., 2024). The goal is to show that current conditions under which LLMs are tried are likely to lead to the development of the steganography and to specify those conditions, e.g. this might be the configuration where CoT is penalized while a final answer is rewarded (this is the sycophancy example, see (Roger and Greenblatt, 2023)).
There are no bids on this project.