In recent years, the research community has raised concerns that the development of advanced AI (AGI, TAI) will significantly increase x-risk, and that the threat of an AI takeover represents a significant fraction of that risk (more). If AGI systems can adopt steganography to hide their reasoning in plain text, it will be more difficult to detect and thwart takeover attempts. It currently seems plausible that AI labs will unintentionally introduce steganographic capabilities via a combination of pre-training and fine-tuning, especially more complex forms of reinforcement learning (RL); preventing this is likely to reduce the risk from rogue AI and related scenarios (more). It is therefore important to investigate the conditions under which steganography might be induced (more). A few researchers have addressed this in the past (more), and it now seems a promising direction (more). In this vein, I plan to train LLMs via RL with the goal of finding those conditions and to publish the results, so that the research community and the leading AI labs can change their RL training procedures and reduce the possibility of inadvertently inducing capabilities that enable effective collusion (more).
I plan to do the following series of experiments:
Series 1. Fine-tuning an LLM via RL to perform successful steganography, i.e. the encoder and the decoder are the same model, which transmits in stegotext a message hidden from an overseer. The message might be a single bit or several bytes of text. The encoder/decoder model is LLaMA 2 or a similar model available for RL fine-tuning. The overseer might be an LLM, a regex filter, or something else. Further variables to try: different hyperparameters (KL penalty, discount, etc.), different prompting, training time, and so on. Successful steganography means that the overseer fails to detect most hidden messages (detection close to chance, ≃50%) while the decoder recovers the correct message most of the time (>50%). A minimal reward sketch is given below, after Series 2.
Series 2. A sycophantic answer but a non-sycophantic CoT, following this paper (Roger and Greenblatt, 2023). Measurements are similar to those in the first series. Further experiments will reward the CoT differently from the final answer.
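To make the training signal concrete, here is a minimal sketch of the reward computation I have in mind for Series 1. The function names, arguments, and weights are placeholders rather than a finalized design; in practice this would plug into a standard RLHF library (e.g. TRL's PPO trainer) as the per-episode reward.

```python
# Hypothetical reward shaping for the Series 1 steganography setup.
# Assumptions (not finalized): each episode produces a stegotext from the
# encoder, the overseer returns a detection flag, and the same model,
# acting as decoder, returns its guess of the hidden message.

def stego_reward(
    hidden_msg: str,
    decoded_msg: str,
    overseer_detected: bool,
    task_score: float,
    detection_penalty: float = 1.0,
    decode_bonus: float = 1.0,
) -> float:
    """Reward = task quality + correct covert transmission - detection."""
    reward = task_score                      # keep the cover text useful
    if decoded_msg == hidden_msg:
        reward += decode_bonus               # message got through
    if overseer_detected:
        reward -= detection_penalty          # overseer caught the channel
    return reward


# Example episode with toy values:
r = stego_reward(
    hidden_msg="1",
    decoded_msg="1",
    overseer_detected=False,
    task_score=0.7,
)
print(r)  # 1.7 under these placeholder weights
```

For Series 2 the same structure applies, except the penalty term targets sycophancy detected in the CoT by the overseer, while the bonus rewards the (sycophantic) final answer, mirroring the setup in (Roger and Greenblatt, 2023).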
After that, if the results are negative (no steganography), I plan to publish a post on LessWrong / the EA Forum with a description of this project, the code I have, and the experimental results, which might help future research in this direction.
If the results are not negative, I plan to expand this work to find the actual conditions that lead to steganography and to publish a paper about it, as well as a blog post.
I plan to conduct the first series of experiments before the 15th of March (the symposium date). If the results are negative, I expect to also publish a blog post before that date.
Stipend: $32k for four months of full-time work (40 h/week) at $50/h
Compute: $2k for computing expenses
Travel: $2k for travel expenses to and from London
Office: $4k for office expenses at $920/month in the London Initiative for Safe AI
Total: $40k for the four-month extension program in London
Me, Artem Karpov. During the MATS Winter 2024 cohort, I worked with Jeffrey Ladish in the Understanding AI Hacking track on the investigation of steganography in LLMs under RL(HF). Before that, I coauthored a project on inducing human-like biases in moral-reasoning LMs, finished the ARENA and MLSS programs, and worked as a software engineer on high-load projects for several years.
Perhaps also Felix Binder.
Training via RL is heavy and might require more effort (man-hours, compute) than I'm planning to invest, so I might not reach conclusive results in a reasonable time. Fine-tuning LLaMA 70B or a similar model will likely require one or several Nvidia A100s (or similar 80 GB GPUs) running for hours. In that case, I might ask CAIS for more compute and call for help by posting on LessWrong or other sites.
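For a rough sense of why 80 GB GPUs come into play, here is a back-of-the-envelope memory estimate (my own approximate figures, not taken from any cited source). It ignores activations, KV caches, and framework overhead, and note that PPO-style RL also keeps a frozen reference copy of the policy, roughly doubling the weight footprint.

```python
# Back-of-the-envelope GPU memory estimates for fine-tuning a 70B model.
# Rough figures only: activations, reference models, and framework overhead
# are ignored.

PARAMS = 70e9          # LLaMA 2 70B parameter count
A100_MEM_GB = 80       # memory of one A100 80GB

def gb(n_bytes: float) -> float:
    return n_bytes / 1e9

# Full fine-tuning with Adam in mixed precision:
# ~2 bytes weights + 2 bytes grads + ~12 bytes optimizer states per parameter.
full_ft = gb(PARAMS * (2 + 2 + 12))

# LoRA on 4-bit quantized weights (QLoRA-style):
# ~0.5 bytes per frozen weight; adapters and their optimizer states are small.
qlora = gb(PARAMS * 0.5)

print(f"full fine-tuning: ~{full_ft:.0f} GB  (~{full_ft / A100_MEM_GB:.0f} A100s)")
print(f"4-bit LoRA:       ~{qlora:.0f} GB  (~{max(1, round(qlora / A100_MEM_GB))} A100)")
```

Under these assumptions, parameter-efficient fine-tuning of a 70B model is feasible on one or a few A100s, while full fine-tuning would need a much larger cluster, which is why the plan defaults to LoRA-style RL fine-tuning.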
The results might become irrelevant, e.g. if LLMs aren’t used as the core of AGI or if steganography in LLMs can be cheaply prevented or detected.
It may not be possible to induce steganography via RL, or it may be hard. This would be a negative result that might be difficult to reach, but it is still valuable to investigate, as described above.
Infohazard. The results might be used to intentionally induce steganography in LLMs for some malicious goal (e.g. a backdoor).
So far none.
The existential risk (x-risk) from unaligned AI is likely the major global problem of this century, with a 1-10% probability estimated by different authors (Ord, 2020, p. 264; Karnofsky, 2021; Carlsmith, 2022). Deception has been identified as a key step in rogue unaligned AI takeover scenarios (Bostrom, 2014, chap. 6; Hubinger et al., 2019; Carlsmith, 2022, p. 37; Ngo, Chan and Mindermann, 2022, p. 7). It is probable that the core of AGI will be LLMs if we are in a world with short timelines (Fabien Roger, 2023), though some oppose this idea (Thane Ruthenis, 2023). LLMs can reason inside the text they autoregressively produce, i.e. chain-of-thought (CoT) or similar constructs like tree-of-thoughts (Roger, 2022; Wei et al., 2023). Models might develop the ability to do steganography, i.e. to encode hidden messages inside the CoT, which can be used to hide reasoning; it is already possible to train a model via SFT to do steganography (Roger and Greenblatt, 2023; Motwani et al., 2024). Thus it is important to know under what conditions models might develop this ability and how to monitor for it. There is a growing concern that steganography might develop naturally when models are trained using RL or RLHF and contribute to AI takeover (Ajeya Cotra, 2022; Roger and Greenblatt, 2023), though early research has not been able to demonstrate this (Roger, 2023). RLHF is the primary way to fine-tune current large language models (Gemini Team et al., 2023; OpenAI, 2023; Touvron et al., 2023), so AI labs might unintentionally induce steganographic capabilities in their models. It is therefore important to address this threat model of deception via steganography, since it is currently unclear when and how models might develop this ability in an RL setup, if they can at all. Another possibility is that models that can already hide messages might contaminate datasets they produce, thereby inducing steganography in future models (see this comment).
Another promising direction is to investigate steganographic ability when several models are in one pipeline and some control or monitor others. This setup is highly likely given the current level of alignment of LLMs, as discussed in (Holden Karnofsky, 2022; Greenblatt et al., 2023). The threat model here might again be rogue AI as described above, or such configurations might become a risk factor for other threat models like value lock-in, enfeeblement, etc. (Allan Dafoe, 2020; Hendrycks and Mazeika, 2022). The AI Control approach (Greenblatt et al., 2023) involves using trusted weaker models to supervise untrusted stronger models. Collusion via steganography might open the possibility of deception by these models because they might be trained together via RL or share data. Steganographic capabilities might also be a risk factor for the enfeeblement threat model: models might be penalized by overseer models to avoid sycophantic reasoning in the CoT or the like, yet still produce untruthful but sycophantic answers because the rewards come from users (Roger and Greenblatt, 2023).
Steganography is one of the ways agents might collude in a multi-agent, multi-LLM setup: models can collude in overt ways, modify the log of their communication to conceal collusion, hack their monitoring, develop other communication channels, use backdoors, etc. (Liu et al., 2019; Hubinger et al., 2024; Karpov, 2024). Among these ways to collude, steganography is probably the most accessible, as it doesn't require external actions, hacking, modification of weights, etc. RLHF is used widely, and we already see evidence of basic steganographic capabilities in GPT-4 (Motwani et al., 2024). It might be enough to assess models after RL training using evaluation frameworks (Motwani et al., 2024) or to modify their output to prevent this threat (Roger and Greenblatt, 2023), but that might prove too difficult or expensive.
It's not clear how tractable it is to induce steganographic capabilities in models via RL. Only a few people have tried on smaller models, and no one has yet demonstrated positive results (Roger, 2023; and per my discussions with other researchers, Zygimantas Straznickas and Felix Binder). GPT-4, and probably other models, is currently unlikely to use steganography in its CoT reasoning (Felix Binder, 2023). However, we think it is possible, because larger models can do steganography to some degree if prompted and can be fine-tuned to do it better (Roger, 2023; Motwani et al., 2024). This project also requires fine-tuning larger models with billions of parameters, which demands compute and poses technical difficulties (e.g. fitting models into memory, slow training time, etc.).
To the best of my knowledge, this is quite a neglected problem: only two researchers have so far tried RL for steganography with the aim of investigating these conditions, both in recent months (Roger, 2023; Zygimantas Straznickas). A few more researchers have worked on steganography in LLMs more generally (de Witt et al., 2023; Roger and Greenblatt, 2023; Motwani et al., 2024).
Category / area: honest AI (outer alignment), limited AI.
Specifying the conditions under which LLMs might learn to hide messages (steganography) in plausible RL training environments is highly likely to be useful for immediate research in AI control methods and oversight, as well as for longer-term research into understanding risks from scheming AIs. This research is also applicable to preventing misaligned behavior in systems already deployed by leading organizations. The audience includes leading AI labs (OpenAI, Anthropic, DeepMind, etc.) and researchers interested in the alignment of deceitful AIs (Carlsmith, Fabien Roger, and others). A successful demonstration of steganographic behavior induced by RLHF would provide empirical evidence for theoretical concerns already discussed by others (Casper et al., 2023), and might raise more concern about the safety of RLHF in general. Specifically, engineers and researchers at the labs would be more likely to work on mitigations that disincentivize the development of steganographic capabilities under RL training regimes. Preventing models from hiding messages in plain sight is a necessary condition for analyzing model outputs, and is thus likely to reduce the risks from systems that run with LLMs inside or that use such LLMs for AI control. This in turn is likely to reduce the risk from unaligned, deceitful LLM-based agents (or a single rogue AI), or from collusion between LLMs. As shown above, this kind of deception is a key step in the AI takeover scenario, so reducing its possibility is likely to reduce this x-risk.
This project aims to discover how RLHF or other RL training methods might induce steganographic capabilities, because currently, to the best of my knowledge, we don't know what rewards, setup, model size, etc. could lead to models that use steganography to hide their reasoning. Importantly, this project doesn't address the basic possibility of LLMs using steganography, because that has already been demonstrated (Roger and Greenblatt, 2023; Motwani et al., 2024). The goal is to show that the conditions under which LLMs are currently trained are likely to lead to the development of steganography, and to specify those conditions, e.g. a configuration where the CoT is penalized while the final answer is rewarded (the sycophancy example; see Roger and Greenblatt, 2023).