This is the introduction chapter of my thesis, lightly adapted for the blog. I’m sharing it here because I think it does a pretty good job of explaining why the work I’m doing matters, or at least why I believe it does.

Most of the ways we deal with long inputs in language models involve some form of information loss. RAG retrieves only the parts it thinks are relevant, summarizers compress things away, and the model itself degrades as you fill up its context window. Recursive Language Models take a different approach, as they let the model write code to operate on the input from outside. I find this idea really compelling, but the original work only tested it with prompting and distillation, limited recursion, and ran all sub-calls synchronously. My thesis is about training these models with reinforcement learning, with a particular focus on whether we can elicit robust recursive behavior in small models with very few trainable parameters.

If it works, this would produce the first cost-aware RL-trained recursive language models and a clearer picture of the minimum adaptation budget needed to make recursion emerge. LoRA is central to this project: not only as a practical constraint, but as an experimental lens to study how little parameter movement is sufficient to unlock RLM behavior. If recursive reasoning is the right abstraction for scaling beyond context windows, it shouldn’t stay locked behind proprietary models and prohibitive compute budgets.

Introduction

The impact of Large Language Models (LLMs) across diverse fields cannot be overstated [1, 2]. LLMs have evolved from simple next-token prediction systems into sophisticated assistants for business [3], coding [4], and research [5]. Recently, the agentic paradigm has enabled LLMs to interact continuously with their environment in a loop, allowing models to react to environmental changes and distribute tasks and responsibilities across multiple agents [6]. Despite these advances, LLMs still exhibit several fundamental limitations: they generate false or misleading information [7], reproduce existing human biases [8], and are constrained by limited context windows [9, 10]. The transformer architecture restricts models to learning “in context” through a fixed-size window, without mechanisms for updating internal parameters after deployment. This limitation is compounded by two additional factors: first, performance degrades as more tokens are added to the context, a phenomenon known as “context rot” [11], second, inference costs scale linearly with context [12]. Together, these factors make the processing of large amounts of text both qualitatively and economically prohibitive.

While context window sizes have increased substantially in recent years [10], with some models now supporting millions of tokens [13], current agentic applications often require LLMs to process and interact with billions of tokens over extended periods. Two prominent approaches have emerged to cope with context window constraints: Retrieval Augmented Generation (RAG) [14, 15] and context compaction [16, 17]. RAG enables models to retrieve relevant chunks of information from external data sources at inference time, selectively enriching the prompt with pertinent contextual information rather than requiring all information to fit within the context window. Context compaction techniques, conversely, address the problem by condensing information through summarization or selective filtering, offering efficiency gains at the cost of information loss. While current agentic frameworks and RAG have enabled the development of complex and capable agentic systems, these solutions are not scalable in the long term.

Examining current agentic architectures through the lens of the “bitter lesson” [18, 19] reveals that these approaches may not optimally leverage models and their probable future evolution. Agent frameworks require human intervention to embed models within human-designed loops or graphs, with architectures often schematically organized from a human perspective, dividing agent roles as one would assign tasks to employees [20]. RAG and context compaction, meanwhile, fundamentally assume that portions of the context are less important than others and can be either omitted or summarized away.

Recursive Language Models (RLMs) [21] offer a compelling alternative to these paradigms. RLMs treat the user prompt as part of an external environment that they can control programmatically. Their defining characteristic is the ability to manipulate input through symbolic handles without copying text into their context window, enabling models to recurse on portions of the prompt. In practice, this is implemented by providing the model with a persistent REPL environment where the initial prompt is loaded as a variable, and by instructing or training the model to generate code that facilitates understanding and transformation of chunks of the prompt. The model builds intermediate values and the final response into new variables, potentially invoking sub-RLMs within loops. RLMs have achieved impressive results on long-context benchmarks [21, 22], and their applications in fields such as coding [23] and literature synthesis appear particularly promising. Figure 1.1 illustrates all three approaches.

Three approaches to processing long inputs: RAG selects relevant chunks but discards the rest, compaction compresses all chunks but loses detail, RLMs operate on the full input programmatically from outside the context window, recursing as needed.

However, the true capabilities of RLMs remain understudied. We have identified three key limitations in the original RLM work:

Training methodology: The original paper primarily studies RLMs using prompting rather than explicit training, and when training is employed, it relies on distillation. Both approaches are limited, prompting prevents models from learning from their mistakes, while distillation only enables smaller models to learn from more capable ones without explicitly improving model capabilities beyond the teacher’s knowledge.
Recursion depth constraints: The original paper restricts maximum recursion depth to 1, imposing an artificial human constraint on RLMs. Following the general principle of the approach, it would be beneficial to allow models to determine optimal recursion depth based on task requirements.
Synchronous execution: All calls to sub-RLMs in the original implementation are synchronous, leading to prohibitively long inference times. Allowing models to autonomously choose between synchronous and asynchronous calls could yield substantial efficiency gains.

In light of these limitations, we have devised a research program to study the efficient training and inference of recursive language models, with explicit attention to parameter efficiency. Making RLMs more accessible is crucial for lowering barriers to AI adoption, particularly as interactions with these models become increasingly ubiquitous and costly.

Our approach employs Reinforcement Learning from Verifiable Rewards (RLVR) with the Dr. GRPO [24] algorithm, with Low-Rank Adaptation (LoRA) [25] as the primary adaptation mechanism, to train cost-aware RLMs. The goal is to create models optimized to maximize answer accuracy while minimizing computational cost and wall-clock inference time, and to characterize how few trainable parameters are needed to elicit stable recursive behavior.

This research aims to contribute to two broader contexts in LLM development. First, it advances approaches for enabling models to work effectively over long tasks involving extensive text or extended time periods, allowing agents to operate longer without suffering from context degradation. Second, it participates in the larger effort to make LLM customization more accessible and cost-effective [26], including identifying viable low-parameter adaptation regimes for smaller models. Reinforcement learning with LoRA has garnered considerable recent interest [27, 28], with results suggesting strong performance improvements can emerge from extremely small updates [29], reinforcing our focus on minimal-parameter elicitation of recursive behavior.

References

[1] Brown, T. B., Mann, B., Ryder, N., et al. [Language models are few-shot learners](https://arxiv.org/abs/2005.14165). (2020)

[2] Bubeck, S., Chandrasekaran, V., Eldan, R., et al. [Sparks of artificial general intelligence: Early experiments with GPT-4](https://arxiv.org/abs/2303.12712). (2023)

[3] Noy, S., Zhang, W. [Experimental evidence on the productivity effects of generative artificial intelligence](https://doi.org/10.1126/science.adh2586). Science (2023)

[4] Chen, M., Tworek, J., Jun, H., et al. [Evaluating large language models trained on code](https://arxiv.org/abs/2107.03374). (2021)

[5] Singh, A., Chang, J. C., Anastasiades, C., et al. [Ai2 scholar QA: Organized literature synthesis with attribution](https://arxiv.org/abs/2504.10861). (2025)

[6] Yao, S., Zhao, J., Yu, D., et al. [ReAct: Synergizing reasoning and acting in language models](https://arxiv.org/abs/2210.03629). (2023)

[7] Ji, Z., Lee, N., Frieske, R., et al. [Survey of hallucination in natural language generation](https://doi.org/10.1145/3571730). ACM Computing Surveys, Association for Computing Machinery (ACM) (2023)

[8] Bender, E., Gebru, T., McMillan-Major, A., et al. [On the dangers of stochastic parrots: Can language models be too big?](https://doi.org/10.1145/3442188.3445922) (2021)

[9] Vaswani, A., Shazeer, N., Parmar, N., et al. [Attention is all you need](https://arxiv.org/abs/1706.03762). (2023)

[10] Dai, Z., Yang, Z., Yang, Y., et al. [Transformer-XL: Attentive language models beyond a fixed-length context](https://arxiv.org/abs/1901.02860). (2019)

[11] Hong, K., Troynikov, A., Huber, J. [Context rot: How increasing input tokens impacts LLM performance](https://research.trychroma.com/context-rot). Context Rot: How Increasing Input Tokens Impacts LLM Performance \| Chroma Research (2025)

[12] Kwon, W., Li, Z., Zhuang, S., et al. [Efficient memory management for large language model serving with PagedAttention](https://arxiv.org/abs/2309.06180). (2023)

[13] Pichai, S. [A new era of intelligence with Gemini 3](https://blog.google/products-and-platforms/products/gemini/gemini-3/#note-from-ceo). Google Blog (The Keyword) (2025)

[14] Lewis, P., Perez, E., Piktus, A., et al. [Retrieval-augmented generation for knowledge-intensive NLP tasks](https://arxiv.org/abs/2005.11401). (2021)

[15] Borgeaud, S., Mensch, A., Hoffmann, J., et al. [Improving language models by retrieving from trillions of tokens](https://arxiv.org/abs/2112.04426). (2022)

[16] Rae, J. W., Potapenko, A., Jayakumar, S. M., et al. [Compressive transformers for long-range sequence modelling](https://arxiv.org/abs/1911.05507). (2019)

[17] Ge, T., Hu, J., Wang, L., et al. [In-context autoencoder for context compression in a large language model](https://arxiv.org/abs/2307.06945). (2024)

[18] Sutton, R. The bitter lesson. (2019)

[19] Pham, M. Why most agent harnesses are not bitter lesson pilled [tweet]. (2026)

[20] Hong, S., Zhuge, M., Chen, J., et al. [MetaGPT: Meta programming for a multi-agent collaborative framework](https://arxiv.org/abs/2308.00352). (2024)

[21] Zhang, A. L., Kraska, T., Khattab, O. [Recursive language models](https://arxiv.org/abs/2512.24601). (2026)

[22] Symbolica. [SotA ARC-AGI-2 results with REPL agents](https://www.symbolica.ai/blog/arcgentica). Symbolica AI, Inc. (2026)

[23] Anthropic. [Improved web search with dynamic filtering](https://claude.com/blog/improved-web-search-with-dynamic-filtering). Claude Blog (2026)

[24] Liu, Z., Chen, C., Li, W., et al. [Understanding R1-zero-like training: A critical perspective](https://arxiv.org/abs/2503.20783). (2025)

[25] Hu, E. J., Shen, Y., Wallis, P., et al. [LoRA: Low-rank adaptation of large language models](https://arxiv.org/abs/2106.09685). (2021)

[26] Longpre, S., Akiki, C., Lund, C., et al. [Economies of open intelligence: Tracing power & participation in the model ecosystem](https://arxiv.org/abs/2512.03073). (2025)

[27] Schulman, J., Lab, T. M. [LoRA without regret](https://doi.org/10.64434/tml.20250929). Thinking Machines Lab: Connectionism (2025)

[28] kalomaze. [RL learning with LoRA: A diverse deep dive](https://kalomaze.bearblog.dev/rl-lora-ddd/). kalomaze's kalomazing blog (2025)

[29] Morris, J. X., Mireshghallah, N., Ibrahim, M., et al. [Learning to reason in 13 parameters](https://arxiv.org/abs/2602.04118). (2026)