
What I Read This Week 2026-W06


Introduction

This is the first in a series of posts where I document the papers I like reading while preparing my thesis. The main goal of this is to be an exercise in agency and consistency, and to force myself to engage more deeply with the literature by summarizing key findings and adding my own commentary.

This week focused on reinforcement learning for large language models, with particular attention to parameter-efficient training methods like LoRA. The intersection of RL and LoRA has become increasingly active lately and I feel like it could be an interesting direction for my work.

The papers below are organized thematically. I start with an overview of reinforcement learning techniques for LLMs, then I move to foundational LoRA work to build a technical baseline before reaching the core of this week’s reading: the intersection of RL and parameter-efficient tuning. I conclude with a couple of tangentially related papers that caught my attention. For each paper, I summarize the main contributions and occasionally add brief commentary on what I found notable or questionable.

RL for LLMs in General

A Technical Survey of RL Techniques for LLMs  [1]

There are dozens of surveys on RL for LLMs. I chose this one because it’s recent and reasonably well-cited. The authors provide a broad overview of RL applications to language models, covering RLHF, RLVR, and off-policy methods.

In general RL is a learning paradigm where agents learn through trial and error while interacting with an environment. It’s been successful in robotics and game-playing, but LLMs are quite weird. First, the state space is enormous, essentially all possible conversation contexts. Second, at each step the action space is the entire vocabulary. Compare this to something like Atari Breakout and you see why LLMs are peculiar from an RL perspective.

The useful part of the paper is the overview of RL algorithms applied to LLMs. I’ll focus on the interesting parts and skip the rest.

PPO (Proximal Policy Optimization)

is extremely popular and has been applied successfully across nearly every domain you can think of, including LLMs. OpenAI famously used it for RLHF when training InstructGPT. The RLHF process can be summarized in 3 steps: (1) collect human preferences over response pairs, (2) train a reward model on these preferences, (3) optimize the LLM to maximize reward.

PPO might work well in LLMs because it includes policy constraints that prevent the model from deviating too much from what was learned during pretraining. In general the objective of the algorithm can be mathematically expressed as:

$$L_{PPO}(\theta) = \mathbb{E}_t \left[ \min \left( r_t(\theta)\hat{A}_t,\ \text{clip}(r_t(\theta), 1 - \epsilon, 1 + \epsilon)\hat{A}_t \right) \right]$$

If you’re like me this looks terrifying, but the intuition is straightforward. Standard RL can cause models to collapse after a bad update: they turn into gibberish generators and never recover. PPO has a safety mechanism. The ratio $r_t(\theta)$ measures how much the model’s behavior is changing (new policy probability divided by old policy probability). The advantage $\hat{A}_t$ indicates whether an action was better or worse than average. The clip function enforces a hard limit (typically $\epsilon \approx 0.2$) on how much the policy can change in a single update, preventing catastrophic over-correction. The min operation takes the pessimistic estimate to be conservative.
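To make the clipping mechanics concrete, here’s a toy numpy sketch of the clipped surrogate. This is my own illustration, not code from the survey; the function name and numbers are made up.

```python
import numpy as np

def ppo_surrogate(logp_new, logp_old, advantages, eps=0.2):
    """Clipped PPO surrogate (to be maximized) over a batch of actions.

    logp_new / logp_old: log-probs of the taken actions under the new
    and old policies; advantages: estimated per-action advantages.
    """
    ratio = np.exp(logp_new - logp_old)          # r_t(theta)
    clipped = np.clip(ratio, 1 - eps, 1 + eps)   # hard limit on the update
    # min() takes the pessimistic of the unclipped and clipped objectives.
    return np.mean(np.minimum(ratio * advantages, clipped * advantages))

# A large jump on a good action (ratio = 3) gets capped at (1 + eps) * A:
obj = ppo_surrogate(np.log([0.9]), np.log([0.3]), np.array([1.0]))
```

Even though the new policy tripled the probability of the action, the objective only credits a factor of 1.2, which is exactly the over-correction brake described above.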

GRPO (Group Relative Policy Optimization)

is a more recent algorithm published by DeepSeek’s research team. It’s been successfully applied to improve LLM performance on math and reasoning tasks. The cool thing is that GRPO works similarly to PPO but eliminates the need for a separate reward model.

In PPO, a response is rewarded based on whether it performs better than predicted by a value function (the “critic”). This requires training and running a second large model alongside the policy, which is computationally expensive. DeepSeek’s approach is more efficient: instead of training a critic to predict response quality, the LLM generates multiple responses (typically 4-8) to the same prompt, then rewards those that perform better than the group average and penalizes the rest.

The objective is:

$$L_{GRPO}(\theta) = \mathbb{E}_{(s,\{a_i\})\sim\pi_{\theta_{old}}} \left[ \frac{1}{G} \sum_{i=1}^{G} \min \left( r_i(\theta)\hat{A}_i^{GR},\ \text{clip}(r_i(\theta), 1 - \epsilon, 1 + \epsilon)\hat{A}_i^{GR} \right) \right]$$

The formula is nearly identical to PPO with two key differences. First, it averages over a group of $G$ responses generated for the same prompt. Second, the advantage $\hat{A}_i^{GR}$ is computed relative to the group rather than a learned baseline:

$$\hat{A}_i^{GR} = \frac{r(a_i) - \mu}{\sigma}$$

where $r(a_i)$ is the reward for answer $i$, $\mu$ is the mean reward across the group, and $\sigma$ is the standard deviation. This is essentially a z-score measuring how many standard deviations better or worse each response is compared to the group average.

The practical advantage is twofold. First, eliminating the critic model frees up memory that can be used for longer chain-of-thought generations, which reasoning tasks require. Second, for problems with clear correct answers (like mathematics), comparing responses against each other directly identifies which reasoning paths succeed without needing a separate model to estimate prompt value. The tradeoff is computational: you need to generate multiple responses per prompt, but this is cheaper than maintaining a second large model.
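The group-relative advantage really is just a z-score over the sampled rewards. A minimal sketch of my own (real implementations typically add a small epsilon to the std to avoid dividing by zero on all-equal groups):

```python
import numpy as np

def group_relative_advantages(rewards):
    """Z-score each response's reward against its own group (GRPO-style)."""
    r = np.asarray(rewards, dtype=float)
    return (r - r.mean()) / r.std()

# Four sampled answers to one prompt, binary correctness rewards:
adv = group_relative_advantages([1.0, 0.0, 1.0, 0.0])
```

Correct answers get a positive advantage and incorrect ones a symmetric negative advantage, with no critic model in sight.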

DPO (Direct Preference Optimization)

is a different approach to RLHF that removes the need for reward modeling by directly optimizing the policy to human preferences.

Traditional RLHF requires three stages: supervised fine-tuning on demonstrations, training a reward model on preference data, and then using RL to optimize the policy against that reward model. DPO collapses this into two stages by skipping the reward model entirely.

The intuition is that you can reframe the reward modeling and RL optimization steps into a single classification problem over preference pairs. Given human preferences stating that response $y_w$ is better than $y_l$ for prompt $x$, DPO optimizes:

$$L_{DPO}(\theta) = -\mathbb{E}_{(x,y_w,y_l)} \left[ \log \sigma \left( \beta \log \frac{\pi_\theta(y_w|x)}{\pi_{ref}(y_w|x)} - \beta \log \frac{\pi_\theta(y_l|x)}{\pi_{ref}(y_l|x)} \right) \right]$$

where $\pi_\theta$ is the policy being trained, $\pi_{ref}$ is a reference policy (typically the supervised fine-tuned model), $\beta$ controls the deviation from the reference policy, and $\sigma$ is the sigmoid function.

The objective directly increases the likelihood of preferred responses relative to the dispreferred ones while staying close to the reference policy. The $\beta$ parameter serves a similar role to PPO’s clipping: it prevents the model from drifting too far from the supervised policy.
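For a single preference pair the loss is easy to compute from sequence log-probs. Here’s a toy sketch of mine (function name and numbers are made up, not from the DPO paper):

```python
import math

def dpo_loss(logp_w, logp_l, ref_logp_w, ref_logp_l, beta=0.1):
    """DPO loss for one preference pair.

    logp_*     : total log-prob of the sequence under the trained policy
    ref_logp_* : same sequences under the frozen reference policy
    """
    margin = beta * ((logp_w - ref_logp_w) - (logp_l - ref_logp_l))
    return -math.log(1.0 / (1.0 + math.exp(-margin)))  # -log sigmoid(margin)

# If the policy already prefers y_w more than the reference does,
# the margin is positive and the loss drops below log(2):
loss = dpo_loss(-10.0, -12.0, -11.0, -11.5, beta=0.1)
```

Note how only log-prob differences against the reference matter, which is why no reward model ever has to be instantiated.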

DPO is simpler to implement than PPO (no actor-critic setup, no value function), more stable (no adversarial training dynamics), and more computationally efficient (no reward model to train and run). The tradeoff is that DPO is inherently offline, so it only learns from the fixed preference dataset and can’t explore or improve through interaction. PPO can in principle continue improving through online data collection, though in practice this is rarely done due to the cost of generating new human preferences.

Summary of the most interesting elements. PPO and GRPO are on-policy RL methods (reward signal + policy constraint). DPO is offline preference optimization: it replaces reward modeling + RL with direct training on preference pairs.

Other Stuff and Final Thoughts

The paper also covers off-policy algorithms and various LLM-specific techniques I’ve never heard of before, but parts of it seem AI generated, which made it quite boring to read.

As an introduction to RL for LLMs, it’s pretty good. The coverage is definitely extensive, they touch on practically every algorithm in the space. The issue is depth. Each technique gets a brief treatment that assumes you either already know the details or will look them up elsewhere. More worked examples or intuitive explanations would have been helpful, especially for the less standard methods. The paper works better as a roadmap of what exists rather than an overview on how anything actually works.

The taxonomy and categorization of methods are useful: it’s nice to see all the methods organized in one place. But I found myself wanting either deeper technical detail or clearer intuition, and the paper sits uncomfortably in between. If you’re new to the area, read the algorithm sections and use them as jumping-off points to find better resources on specific methods. If you already know RL, skim it for application-specific techniques you might have missed. A summary of the methods I discussed is displayed in Figure 1.

Agent-R1: Training Powerful LLM Agents with End-to-End RL  [2]

In this paper the authors try to formalize a bit what an LLM-based agent is, how it is different from a normal LLM chat and how to train one using RL. The paper can be broadly divided into 2 parts: first the authors formalize LLM chats and agent behaviour through a Markov Decision Process, then they explain their general framework for training agents with RL.

Personally I didn’t really love the first half; it’s kinda hard to argue that normal chats and agent behavior are that different, as the underlying model is still the same and so is its logic. For instance, have a look at Table 1, where I list the differences they highlighted between chats and agents, along with some comments.

| Component | Static LLM | LLM Agent | My comment |
|---|---|---|---|
| State (S) | Captures only the current text sequence. | Captures the full history of multi-turn interactions and environmental feedback. | But how is this any different from a chat interaction with an LLM? If you imagine the user as part of the environment then the next prompt is the stochastic environmental feedback, so is a chat an agent? mmmm |
| Action (A) | Generating the next token. | Generating tokens that can also function as commands to invoke external tools. | I mean, yeah, I guess, but I don’t think that makes it quite a different system? |
| State Transition (P) | Deterministic: appending a token determines the next state. | Stochastic: the next state depends on non-deterministic feedback from the environment. | This one doesn’t make sense because, again, user interaction is stochastic as well, and I don’t think anyone is calling chatbots agents aaaa |
| Reward (R) | Receives a single, sparse reward at the end of the generation. | Receives dense process rewards for intermediate steps in addition to a final reward. | This one is the biggest difference imo, as you can reward correct tool calling in intermediate steps with agents. |

I guess this might be a useful mental model for some but I didn’t get much value out of it.

On the other hand, I found the second half of the paper quite useful. It details the specifics of the framework they use to train agents, and it makes a lot of sense. The architecture is quite standard: you have a model in a loop that interacts via tools with an external environment. To train the agent they use a two-tiered reward structure that rewards the model both for individual tool calls (e.g. a small reward if the structure of the call is correct) and for the final output of the system. This is quite intuitive, though I do wonder if it incentivizes the model to make useless tool calls(?). They also specify that they implemented an action mask so the model sees exactly which tokens it is getting rewarded for, excluding the ones coming from environmental feedback. A summary of the framework is described in Figure 2.

Standard agent-in-a-loop setup: a model interacts with an external environment via tools. Training uses a two-tier reward (tool-call level + final-output level), with an action mask so only agent tokens receive credit assignment (not environment feedback).
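The action-mask idea is simple to sketch. This is my own toy version, not Agent-R1’s code; `masked_policy_loss` and the numbers are made up:

```python
import numpy as np

def masked_policy_loss(token_losses, is_agent_token):
    """Average per-token loss over agent-generated tokens only.

    Tokens injected by the environment (tool outputs, observations) are
    masked out, so the model is never credited or blamed for text it
    didn't produce.
    """
    mask = np.asarray(is_agent_token, dtype=float)
    losses = np.asarray(token_losses, dtype=float)
    return (losses * mask).sum() / mask.sum()

# Losses over [agent, agent, env, env, agent] tokens; the huge env-token
# losses are ignored entirely:
loss = masked_policy_loss([0.5, 0.3, 9.9, 9.9, 0.1], [1, 1, 0, 0, 1])
```

Without the mask, the optimizer would spend most of its gradient trying to "predict" tool outputs the model has no control over.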

Understanding R1-Zero-Like Training: A Critical Perspective  [3]

This one is really, really good. The paper focuses on two aspects: first, it tries to understand how much of R1’s success was due to post-training versus behavior already present in base models; then, it explains an error in the GRPO algorithm that caused a massive waste of tokens.

A Base Model Is All You Need

The authors take six base models (without post-training) and test their ability to answer questions with or without a chat template. A chat template is a wrapper applied to user prompts that instructs the model on how to interpret the incoming text. For instance, the Qwen Math chat template looks like:

```
<|im_start|>system
Please reason step by step, and put your final answer within \boxed{}.
<|im_end|>
<|im_start|>user
{question}
<|im_end|>
<|im_start|>assistant
```

The study yields two main findings:

  • The Qwen family without any chat template achieves 100% coherence, whereas employing a template reduces performance to 60%. This suggests the models may have been heavily pretrained with the chat template already incorporated.

  • DeepSeek Base V3 exhibits poor chat coherence without a template but improves sharply when one is applied. This makes it a good proxy for studying base models and analyzing whether it already exhibits the famous “Aha” moment observed in R1.

Following these experiments, the authors test the models on a math benchmark to analyze whether the model already shows signs of self-reflection. Interestingly, all examined models (including DeepSeek V3) demonstrate self-reflection without post-training. However, they note that self-reflection is not necessarily correlated with better benchmark performance.

GRPO Is Biased

In the second half, the authors analyze the GRPO algorithm and identify a bias in its formulation that leads to unnecessarily long responses, particularly for incorrect answers.

As a reminder the standard GRPO objective is:

$$J_{\text{GRPO}}(\pi_\theta) = \mathbb{E}_{q \sim p_Q,\ \{o_i\}_{i=1}^G \sim \pi_{\theta_{\text{old}}}(\cdot\mid q)} \left[ \frac{1}{G} \sum_{i=1}^G \frac{1}{|o_i|} \sum_{t=1}^{|o_i|} \min\left(\frac{\pi_\theta(o_{i,t}\mid q, o_{i,<t})}{\pi_{\theta_{\text{old}}}(o_{i,t}\mid q, o_{i,<t})} \hat{A}_{i,t},\ \text{clip}(\cdot, 1-\epsilon, 1+\epsilon)\, \hat{A}_{i,t}\right) \right]$$

where the advantage is computed as:

$$\hat{A}_{i,t} = \frac{R(q, o_i) - \text{mean}(\{R(q, o_1), \ldots, R(q, o_G)\})}{\text{std}(\{R(q, o_1), \ldots, R(q, o_G)\})}$$

The authors identify two problematic normalization terms:

Response-level length bias ($1/|o_i|$): For incorrect responses, dividing by length means longer responses receive smaller penalties per token. This causes the model to favor generating lengthy incorrect responses, as they’re penalized less than shorter incorrect ones.

Question-level difficulty bias (std normalization): Normalizing by the standard deviation across responses gives disproportionate weight to questions that are either too easy or too hard (where all responses tend to be correct or incorrect, leading to low std). This creates an unintended bias toward certain question difficulties during optimization.

The authors note that this length bias also appears in many popular open-source PPO implementations, suggesting it’s a widespread issue that predates GRPO. The proposed solution is GRPO Done Right (Dr. GRPO), which simply removes the problematic parts of the formulation and achieves the same performance without the wasted tokens.
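To see the two terms in isolation, here’s a schematic sketch of mine contrasting the normalizations. It’s deliberately stripped down: real implementations differ in details (epsilon in the std, batching, and Dr. GRPO’s exact aggregation), so treat this as an illustration of the bias, not the paper’s code.

```python
import numpy as np

def grpo_advantages(rewards, drgrpo=True):
    """Group-baseline advantages; Dr. GRPO drops the std normalization."""
    r = np.asarray(rewards, dtype=float)
    centered = r - r.mean()
    return centered if drgrpo else centered / r.std()

def sequence_objective(per_token_terms, drgrpo=True):
    """Aggregate one response's per-token terms; Dr. GRPO drops 1/|o_i|."""
    s = float(np.sum(per_token_terms))
    return s if drgrpo else s / len(per_token_terms)

# The length bias: with the 1/|o_i| division, a wrong answer's total
# penalty does NOT grow with its length, so long wrong answers are
# "cheaper" per token; dropping the division restores the scaling.
short_pen = sequence_objective([-0.5] * 10, drgrpo=False)
long_pen = sequence_objective([-0.5] * 100, drgrpo=False)
short_dr = sequence_objective([-0.5] * 10, drgrpo=True)
long_dr = sequence_objective([-0.5] * 100, drgrpo=True)
```

The same per-token penalty yields an identical response-level penalty at 10 or 100 tokens under plain GRPO, while the Dr. GRPO-style sum makes the long wrong answer 10 times more costly.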

LoRA Fundamentals

LoRA: Low-Rank Adaptation of Large Language Models  [4]

This one is kind of a classic: here they introduce Low-Rank Adaptation (LoRA), a Parameter-Efficient Fine-Tuning (PEFT) method for customizing LLMs. The intuition comes from previous work on fine-tuning showing that learned over-parametrized models have a low intrinsic dimension. The idea behind the method is that fine-tuning can mostly be described as a “low-rank update,” so instead of training the full weights of the model, one can achieve very similar performance by training only the rank-decomposition matrices of the dense layers. In practice, this means the base weights of the model stay intact, and you add a new matrix $\Delta W$ obtained as the product of two smaller matrices $A$ and $B$ with rank $r$. Thus, for $h = W_0 x$, the modified forward pass can be written as:

$$h = W_0 x + \Delta W x = W_0 x + BAx$$

The full process and intuition is summarized in Figure 3.

LoRA parameterizes the weight update as a low-rank factorization $\Delta W = \tfrac{\alpha}{r} BA$ with $A \in \mathbb{R}^{r \times d_{in}}$ and $B \in \mathbb{R}^{d_{out} \times r}$. The bottleneck dimension $r$ controls the maximum rank of the update.
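The forward pass is a few lines of numpy. A minimal sketch assuming the standard $\alpha/r$ scaling and the usual init (random $A$, zero $B$); with $B$ zeroed, the adapted layer reproduces the base layer exactly:

```python
import numpy as np

def lora_forward(x, W0, A, B, alpha=16):
    """h = W0 x + (alpha / r) * B A x, with W0 frozen; only A, B train."""
    r = A.shape[0]
    return W0 @ x + (alpha / r) * (B @ (A @ x))

rng = np.random.default_rng(0)
d, r = 64, 4
W0 = rng.normal(size=(d, d))            # frozen base weight
A = rng.normal(size=(r, d)) / np.sqrt(d)  # small random init
B = np.zeros((d, r))                    # zero init => delta-W starts at 0
x = rng.normal(size=d)

# At initialization the adapted model is exactly the base model:
h = lora_forward(x, W0, A, B)
```

Swapping adapters is just swapping the tiny $A$, $B$ pair, which is why storing one per task is so cheap.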

There are quite a few pros to using LoRA: for instance, the adapters are easy to swap and store, and LoRA does not affect inference speed. Also, we might expect some tasks to require lower or higher ranks depending on complexity, and the cool thing is that LoRA training roughly converges to full fine-tuning as the rank increases. Another question is where to apply these updates: in this paper they recommend applying them to all projections of the self-attention module, but we will see later that this is a point of contention…

LoRA Learns Less and Forgets Less  [5]

This paper digs into whether LoRA actually matches full fine-tuning performance, and the answer is a bit complicated. The authors train Llama-2-7B on code and math datasets with both methods and find that low-rank LoRA (r = 16, 64) clearly underperforms, but high-rank LoRA (r = 256) can get pretty close in instruction fine-tuning settings. For continued pretraining though, even high ranks struggle to catch up.

The really interesting bit is the forgetting analysis. LoRA consistently forgets way less of the base model’s general capabilities compared to full fine-tuning. This makes intuitive sense if you remember we’re constraining the update to be low-rank, so the model can’t drift as far from its original weights, but this forgetting mitigation is actually stronger than what you get from standard regularization techniques like dropout or weight decay. They also show that LoRA maintains more diversity in generated solutions, whereas full fine-tuning tends to collapse onto a smaller set of outputs.

They also do an SVD analysis on the weight updates $\Delta W$ learned during full fine-tuning and find that these perturbations have a rank 10-100× higher than typical LoRA settings. So the original LoRA intuition that fine-tuning is inherently low-rank doesn’t really hold for complex tasks like code and math. The full model is learning something fundamentally higher-dimensional than what LoRA can express.
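You can run the same kind of check on any update matrix. This is my own rough `effective_rank` heuristic with an arbitrary 90%-energy cutoff, not the paper’s exact methodology:

```python
import numpy as np

def effective_rank(delta_w, energy=0.9):
    """Smallest k whose top-k singular values carry `energy` of the
    spectrum's mass: a crude proxy for how low-rank an update really is."""
    s = np.linalg.svd(delta_w, compute_uv=False)
    cum = np.cumsum(s) / s.sum()
    return int(np.searchsorted(cum, energy) + 1)

rng = np.random.default_rng(0)
# A genuinely rank-2 update is detected as (at most) rank 2...
low = rng.normal(size=(256, 2)) @ rng.normal(size=(2, 256))
# ...while a dense random update needs many more directions:
dense = rng.normal(size=(256, 256))

r_low = effective_rank(low)
r_dense = effective_rank(dense)
```

If full fine-tuning’s $\Delta W$ behaves like the dense case, a rank-16 adapter simply can’t represent it, which is consistent with the paper’s finding.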

The learning-forgetting tradeoff curves are fascinating too, because you can basically tune the rank to navigate between “learn the new task well” and “don’t forget the old stuff,” which could be really useful depending on what you’re optimizing for.

RL but LoRA

LoRA Without Regret  [6]

“Ermmm, Lorenzo, actually this is not a paper but a blogpos-”, shut up, this is super cool and I wanted to include it so it counts. This is such a great blogpost that I think anyone who touches LoRA should read it. It came out like two weeks after I had finished a big PEFT project, and I just wish we had had it before then, because it would have helped massively in understanding why what we were doing wasn’t working. The main focus is identifying the scenarios where LoRA works well, so “without regret”, and explaining how different hyperparameters affect learning in these conditions. Furthermore, there’s a great section at the end explaining how RL works with LoRA and why the two work so well together.

The key finding is that LoRA can actually match full fine-tuning performance when you get the details right. They tested this across supervised fine-tuning and RL experiments with Llama 3 and Qwen models, sweeping learning rates and ranks from 1 to 512. The main insight is that there’s a “low-regret regime” where LoRA performs identically to full fine-tuning, and this regime covers most post-training scenarios.

LoRA for SFT

For supervised learning, they find that high-rank LoRA matches full fine-tuning on instruction-following and reasoning datasets until you hit capacity limits. The learning curves basically overlap and loss decreases linearly with log(steps) just like full fine-tuning. Lower ranks eventually fall off this curve when the adapter runs out of capacity, but the threshold correlates pretty cleanly with rank.

One weird thing they discovered is that LoRA is less tolerant of large batch sizes than full fine-tuning. The performance gap grows with batch size independent of rank, meaning it might be an intrinsic characteristic. Though both methods achieve best loss at smaller batches anyway, so maybe it doesn’t matter that much.

If you remember earlier I told you that the OG LoRA paper applied it only to the attention blocks, but it turns out this configuration significantly underperforms, and it’s much better to apply it to all layers, especially the MLP/MoE blocks. In fact, attention-only LoRA doesn’t even improve performance (that much) beyond MLP-only LoRA, even when you match parameter counts by using higher rank. This makes intuitive sense given that MLP layers contain most of the parameters.

LoRA for RL

I think this is the best part of the blog. LoRA fully matches full fine-tuning even at rank 1 for policy gradient algorithms on math reasoning tasks. This is obviously crazy, because at rank one the number of parameters is orders of magnitude smaller than what you’d train with full FT. They explain this with an information-theoretic argument: supervised learning provides O(tokens) bits per episode, but policy gradient methods only provide O(1) bits per episode from the advantage function. When training on 10K problems with 32 samples each, you only need to absorb ~320K bits total. Rank-1 LoRA already has ~3M parameters, almost 10× more capacity than needed. They validated this on larger-scale experiments with Qwen3 on DeepMath and saw the same advanced reasoning behaviors (backtracking, self-verification) emerge in both LoRA and full fine-tuning.

Hyperparameters

In the last section they write some tips and tricks for navigating the sea of hyperparameter options. At first glance, LoRA seems to have four hyperparameters to tune: the scaling factor $\alpha$, learning rates for $A$ and $B$, and the initialization scale of $A$. But due to the training dynamics, only two of these actually matter:

  • Learning rate: They found that the optimal LoRA learning rate is consistently 10× higher than for full fine-tuning, across 14 different models. This holds for both supervised learning and RL, making hyperparameter transfer way easier. For short training runs you should bump this up even more (~15× full fine-tuning), converging to 10× for longer runs.

  • Initialization scale of $A$: They use a uniform distribution with scale $1/\sqrt{d_{\text{in}}}$, following the Huggingface implementation. Matrix $B$ is initialized to zero, which creates an implicit learning rate schedule: since $B$ starts at zero, updates to $A$ have negligible effect at first. As $B$ grows during training, the effective learning rate increases.

  • Effective degrees of freedom: The cool theoretical insight is that you can reduce the four hyperparameters to just two effective ones: $\alpha \cdot \text{init}_A \cdot \text{LR}_B$ (controls the initial update size) and $\text{init}_A/\text{LR}_A$ (controls how fast $A$ evolves). The other combinations don’t affect learning dynamics when using Adam with $\epsilon = 0$.
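The implicit schedule from the zero-initialized $B$ is easy to see numerically. In this toy sketch of mine the “gradient step” on $B$ is just random noise (not a real gradient), and the dimensions are arbitrary:

```python
import numpy as np

rng = np.random.default_rng(0)
d_in, d_out, r, lr = 512, 512, 8, 1e-2

A = rng.uniform(-1, 1, size=(r, d_in)) / np.sqrt(d_in)  # scale ~ 1/sqrt(d_in)
B = np.zeros((d_out, r))                                # zero init

# With B = 0, delta-W = B @ A is exactly zero, so updating A alone
# cannot move the model at all early in training:
delta_w_before = B @ A

# Stand-in for B's first optimizer step (noise, not a real gradient):
B = B + lr * rng.normal(size=B.shape)
delta_w_after = B @ A   # only now does the adapter "switch on"
```

This is the mechanism behind the “implicit learning rate schedule” claim: the effective step size on $\Delta W$ grows as $B$ grows.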

They also calculate that LoRA uses only $\frac{2}{3}$ of the FLOPs of full fine-tuning per training step, so if you plotted performance against compute instead of steps, LoRA would show even clearer advantages.

Learning to Reason in 13 Parameters  [7]

I saw this one on X and it left me flabbergasted: here the authors show that you can train an 8B model to do high-level mathematical reasoning by updating as few as 13 parameters. For context, a standard Llama-3-8B LoRA at rank 1 usually requires at least 3 million parameters. We are talking about a 100,000× reduction in the number of trained weights while still hitting 91% accuracy on GSM8K.

TinyLoRA and Weight Tying

To get the parameter count this low, they introduced a method called TinyLoRA. It builds on the logic of LoRA-XS, which uses the singular value decomposition (SVD) of the base weights. Instead of training a full matrix, or even the rank-decomposition matrices, they use a tiny trainable vector $v$ and project it through fixed random tensors $P$.

The “secret sauce” here is extreme weight tying, where they force the same 13 parameters to be shared across every single layer and every single projection in the entire model. The update rule looks like:

$$W' = W + U\Sigma \left( \sum_{i=1}^{u} v_i P_i \right) V^\top$$

where $U, \Sigma, V$ are frozen from the base-weight SVD, and $v$ is the only thing that moves.
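Here’s how I’d sketch that update rule in numpy. The dimensions, truncation rank, and random-projection scaling are illustrative guesses of mine, not the paper’s values; only the structure of the update follows the formula above:

```python
import numpy as np

rng = np.random.default_rng(0)
d, k, u = 64, 8, 13   # layer dim, SVD truncation rank, trainable params

W = rng.normal(size=(d, d))
U, S, Vt = np.linalg.svd(W)                 # frozen base-weight SVD
U_k, S_k, Vt_k = U[:, :k], np.diag(S[:k]), Vt[:k, :]

P = rng.normal(size=(u, k, k)) / k          # fixed random projection tensors
v = np.zeros(u)                             # the ONLY trainable parameters

def adapted_weight(v):
    """W' = W + U_k S_k (sum_i v_i P_i) V_k^T, the TinyLoRA-style update."""
    inner = np.tensordot(v, P, axes=1)      # (k, k) mixture of random mats
    return W + U_k @ S_k @ inner @ Vt_k

W_zero = adapted_weight(v)                  # v = 0 leaves W untouched
W_tuned = adapted_weight(np.full(u, 0.1))   # 13 numbers steer the update
```

With weight tying, the same 13-entry $v$ would be reused for every layer and projection in the model, which is where the absurdly small parameter count comes from.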

RL vs. SFT

The most important takeaway is that this only works with RL. If you try to do Supervised Fine-Tuning with 13 parameters, the model fails miserably. The authors provide a clean information-theoretic explanation for this:

  • SFT is “noisy”: When you train on a human-written solution, the model tries to absorb everything like style, phrasing, and specific tokens. This is a high-entropy signal that requires many bits to store, so you need more parameters.

  • RL is “clean”: In RL (they use GRPO), the only signal is the reward. The model generates its own reasoning, and the reward signal just says “yes” or “no” to the final result. This is a sparse, verifiable signal that filters out the noise and focuses purely on the underlying logic.

They call this “Signal Separation.” In RL, the reward acts as a filter that helps the model distinguish between noise and information.

Scaling Laws

There is also a very cool scaling effect: larger models are easier to “program” with fewer parameters. As the model size increases, the number of parameters needed to reach a given performance threshold actually decreases.

This suggests that reasoning isn’t being “learned” from scratch during RL but it’s already latent in the pre-trained weights.

One final detail they found is that FP32 precision matters when you’re working with such a tiny budget. Even though it’s twice the bytes of BF16, that extra precision is necessary when those 13 “slots” are the only thing holding the logic for the entire system.

Additional Papers

Recursive Language Models  [8]

Here the authors tackle the problem of getting LLMs to handle contexts way beyond their window limits. The core idea is pretty simple: they treat the prompt itself as part of the external environment and let the model interact with it programmatically.

In practice this is done by loading the huge prompts in a REPL environment and giving the model the ability to peek into it, slice it up, and recursively call itself on chunks.

This is different from just having sub-agents in a few ways. First, the user prompt lives in the environment, not in the model’s context. Second, recursion happens symbolically through code, not verbally. Third, you can do $\Omega(|P|)$ or even $\Omega(|P|^2)$ work on a prompt $P$ by writing loops that launch sub-calls programmatically.

The setup is illustrated in Figure 4. Given a prompt $P$, the RLM initializes a REPL with $P$ as a variable and a function to invoke sub-RLMs. In each iteration, the root model generates code to explore $P$, executes it, and sees metadata about the output (not the full output, to avoid context overflow). The model can call itself recursively on any slice of $P$, and when it sets a Final variable, the process terminates.

RLM workflow: the prompt lives in a REPL as a variable, the root model writes code to explore it and launches recursive sub-calls, and results are stored symbolically. The root only sees metadata, never the full prompt.

This works really well mainly for 3 reasons:

  • Symbolic handle to the prompt. The model gets a variable containing the prompt, so it can manipulate it without copying text into its context window. This sidesteps the fundamental limitation of treating long inputs as token sequences.

  • Programmatic output. Instead of generating the output autoregressively (which is bounded by the model’s output length), the RLM builds up the answer in variables through code and sub-calls. This enables essentially unbounded output lengths.

  • Symbolic recursion. The model can invoke sub-RLMs inside loops, enabling $\Omega(|P|)$ or $\Omega(|P|^2)$ semantic work. Prior sub-agent approaches require verbalizing each delegation, which is impractical for programmatic loops over prompt slices.
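A toy version of the recursive loop, with a trivial counting function standing in for the LLM sub-call. Everything here is made up for illustration (a real RLM writes this exploration code itself inside the REPL, and the leaf is an actual model call):

```python
def leaf_call(lines, query):
    # Hypothetical stand-in for an LLM sub-call: count lines matching query.
    return sum(1 for line in lines if query in line)

def rlm(lines, query, chunk=100, depth=0):
    """Root never reads all of `lines`: it slices the prompt and recurses
    on each piece, combining the partial answers symbolically in code.
    Recursion depth is capped at 1, mirroring the paper's setup."""
    if len(lines) <= chunk or depth >= 1:
        return leaf_call(lines, query)
    partials = [rlm(lines[i:i + chunk], query, chunk, depth + 1)
                for i in range(0, len(lines), chunk)]
    return sum(partials)   # the reduce happens in code, never verbalized

docs = ["filler"] * 998 + ["the needle", "another needle"]
answer = rlm(docs, "needle")
```

The root only ever touches chunk-sized slices, so the "context" it needs is bounded no matter how long `docs` gets; that’s the whole trick.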

Training a Native RLM

The authors also train the first natively recursive language model by fine-tuning Qwen3-8B on 1,000 filtered trajectories from Qwen3-Coder-480B acting as an RLM. This simple recipe yields RLM-Qwen3-8B, which outperforms base Qwen3-8B by 28.3% on average across four long-context tasks, despite being trained on an unrelated domain.

Results

The evaluation covers four tasks of varying complexity: needle-in-a-haystack (constant), linear aggregation, quadratic aggregation, and code understanding. The results are pretty convincing.

On tasks with 6-11M token contexts, RLM(GPT-5) achieves 91% accuracy while the base model scores 0% because it can’t even fit the input. The average cost is $0.99, which is actually cheaper than what you’d pay just to ingest those tokens into a model that could handle them.

The real pattern shows up when you scale context length. As inputs grow from 8K to 250K+ tokens, GPT-5 degrades catastrophically on complex tasks, dropping to near 0% on problems that require processing pairs of entries. Meanwhile, RLM(GPT-5) maintains 58% performance on the same tasks. For simpler problems the gap is smaller, but RLMs consistently win.

Limitations and Stuff

The main limitation is that they only explore synchronous sub-calls with a max recursion depth of 1. Deeper recursion or asynchronous execution could potentially reduce costs and latency significantly. Also, the trajectories suggest that models aren’t yet fully optimized for this paradigm: they make redundant sub-calls and sometimes fail to use answers they’ve already computed.

Still, this feels like a fresh new direction for scaling context. It definitely feels a bit like a gimmick but the performance is real so I’m interested to see future work on this.

Improving Evidence Synthesis with AI  [9]

I know this is supposed to be a list of stuff I mostly liked, but I hated this one, like I truly despised it. I hate that it’s made by a company that wants to sell this as a product, I hate that they’re aura farming with university names, and I hate that they have no methodology for this. Booooooooooo.

Concluding Thoughts

This was way more work to put together than I initially planned. I think trying to write this all in one go at the end of the week isn’t the best idea and I should probably write these right after I read the papers. Anyway it was fun and I aim to keep doing this for the foreseeable future.

Ciaooo.

References

[1] Srivastava, S. S., Aggarwal, V. [A technical survey of reinforcement learning techniques for large language models](https://arxiv.org/abs/2507.04136). (2025)

[2] Cheng, M., Ouyang, J., Yu, S., et al. [Agent-R1: Training powerful LLM agents with end-to-end reinforcement learning](https://arxiv.org/abs/2511.14460). (2025)

[3] Liu, Z., Chen, C., Li, W., et al. [Understanding R1-zero-like training: A critical perspective](https://arxiv.org/abs/2503.20783). (2025)

[4] Hu, E. J., Shen, Y., Wallis, P., et al. [LoRA: Low-rank adaptation of large language models](https://arxiv.org/abs/2106.09685). (2021)

[5] Biderman, D., Portes, J., Ortiz, J. J. G., et al. [LoRA learns less and forgets less](https://arxiv.org/abs/2405.09673). (2024)

[6] Schulman, J., et al. [LoRA without regret](https://doi.org/10.64434/tml.20250929). Thinking Machines Lab: Connectionism (2025)

[7] Morris, J. X., Mireshghallah, N., Ibrahim, M., et al. [Learning to reason in 13 parameters](https://arxiv.org/abs/2602.04118). (2026)

[8] Zhang, A. L., Kraska, T., Khattab, O. [Recursive language models](https://arxiv.org/abs/2512.24601). (2026)

[9] Mehr, A., Howard, J. L., Nouroozi, C., et al. Improving evidence synthesis with artificial intelligence.