Friday, March 14, 2025

New technique helps LLMs rein in CoT lengths, optimizing reasoning without exploding compute costs

Reasoning through chain-of-thought (CoT), the process by which models break problems into manageable “thoughts” before deducing an answer, has become an integral part of the latest generation of frontier large language models (LLMs).

However, the inference costs of reasoning models can quickly stack up as models generate excess CoT tokens. In a new paper, researchers at Carnegie Mellon University propose an LLM training technique that gives developers more control over the length of the CoT.

Called length controlled policy optimization (LCPO), the technique conditions the model to provide correct answers while also keeping its “thoughts” within a predetermined token budget. Experiments show that models trained with LCPO provide a smooth tradeoff between accuracy and cost and can, surprisingly, outperform larger models at equal reasoning lengths. LCPO can help dramatically reduce the cost of inference in enterprise applications by saving thousands of tokens in each round of conversation with an LLM.

LLM performance leads to longer CoTs

Reasoning models such as OpenAI o1 and DeepSeek-R1 are trained through reinforcement learning (RL) to use test-time scaling and generate CoT traces before producing an answer. Empirical evidence shows that when models “think” longer, they tend to perform better on reasoning tasks.

For example, R1 was initially trained on pure RL without human-labeled examples. One of the insights was that as the model’s performance improved, it also learned to generate longer CoT traces.

While long CoT chains generally result in more accurate responses, they also create a compute bottleneck when reasoning models are applied at scale. There is currently very little control over the test-time compute budget, and sequences can easily stretch to tens of thousands of tokens without providing significant gains. There have been some efforts to control the length of reasoning chains, but they usually degrade the model’s performance.

Length controlled policy optimization (LCPO) explained

Classic RL training optimizes LLMs only for the correct response. LCPO changes this paradigm by introducing two training objectives: 1) obtain the correct result and 2) keep the CoT chain bounded within a specific token length. If the model produces the correct response but generates too many CoT tokens, it receives a penalty and is pushed toward a reasoning chain that reaches the same answer within a smaller token budget.

“LCPO-trained models learn to satisfy length constraints while optimizing reasoning performance, rather than relying on hand-engineered heuristics,” the researchers write.

They propose two flavors of LCPO: (1) LCPO-exact, which requires the generated reasoning to be exactly equal to the target length, and (2) LCPO-max, which requires the output to be no longer than the target length.
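To make the two objectives concrete, here is a minimal sketch of how a reward along these lines could be scored during RL training. The article does not reproduce the paper's exact reward formula, so the penalty weight (alpha) and the scoring details below are illustrative assumptions, not the authors' implementation.

```python
def lcpo_reward(is_correct: bool, num_cot_tokens: int, target_tokens: int,
                mode: str = "exact", alpha: float = 0.001) -> float:
    """Hypothetical LCPO-style reward: reward correctness while penalizing
    deviation from the target reasoning length.

    Note: an illustrative sketch, not the paper's exact formulation.
    """
    correctness = 1.0 if is_correct else 0.0
    if mode == "exact":
        # LCPO-exact: penalize any deviation from the target length.
        length_penalty = alpha * abs(num_cot_tokens - target_tokens)
    else:
        # LCPO-max: penalize only reasoning that exceeds the budget.
        length_penalty = alpha * max(0, num_cot_tokens - target_tokens)
    return correctness - length_penalty


# Example: a correct answer that overshoots a 1,000-token budget by 500 tokens
print(lcpo_reward(True, 1500, 1000, mode="max"))    # 0.5
print(lcpo_reward(True, 1500, 1000, mode="exact"))  # 0.5
```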

To test the technique, the researchers fine-tuned a 1.5B-parameter reasoning model (Qwen-Distilled-R1-1.5B) on the two proposed LCPO schemes to create the L1-max and L1-exact models. Training was based on mathematical problems with distinct and verifiable results. However, the evaluation included math problems as well as out-of-distribution tasks such as the Massive Multitask Language Understanding (MMLU) benchmark and the Graduate-Level Google-Proof Q&A (GPQA) benchmark.

Their findings show that L1 models can precisely balance token budget and reasoning performance, smoothly interpolating between short, efficient reasoning and longer, more accurate reasoning by prompting the model with different length constraints. Importantly, on some tasks, the L1 models can reproduce the performance of the original reasoning model at a lower token budget.
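The article does not quote the exact prompt format the L1 models expect, but the idea of steering the reasoning budget at inference time can be sketched as follows; the instruction wording, question, and budgets here are hypothetical.

```python
# Hypothetical prompt template for budget-conditioned reasoning.
# The actual instruction format used by the L1 models may differ.
def build_prompt(question: str, token_budget: int) -> str:
    return f"{question}\n\nThink for up to {token_budget} tokens before answering."

question = "What is the sum of the first 100 positive integers?"
for budget in (512, 1024, 4096):
    prompt = build_prompt(question, budget)
    # A model behind any standard text-generation API could then be called
    # with this prompt, e.g. model.generate(prompt, max_new_tokens=budget).
    print(prompt)
```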

L1 models outperform S1 and base models on a cost-accuracy basis (source: arXiv)

Compared to S1, the only other method that constrains the length of CoT, L1 models show up to 150% performance gains across different token budgets.

“This substantial difference can be attributed to two key factors,” the researchers write. “(1) L1 intelligently adapts its CoT to fit within specified length constraints without disrupting the reasoning process, while S1 often truncates mid-reasoning; and (2) L1 is explicitly trained to generate high-quality reasoning chains of varying lengths, effectively distilling reasoning patterns from longer chains to shorter ones.”

L1 also outperforms its non-reasoning counterpart by 5% and GPT-4o by 2% at equal generation length. “To the best of our knowledge, this is the first demonstration that a 1.5B model can outperform frontier models such as GPT-4o, despite using the same generation length,” the researchers write.

Interestingly, the model’s CoT shows that it learns to adjust its reasoning process based on its token budget. For example, on longer budgets, the model is more likely to generate tokens associated with self-correction and verification (such as “but” and “wait”) and conclusion drawing (“therefore” and “so”).
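As a rough way to observe this behavior, one could count those marker words in generated traces and compare runs under short and long budgets. The snippet below is a hypothetical analysis sketch, not part of the paper's released code.

```python
import re
from collections import Counter

# Hypothetical analysis: count reasoning-style markers in a generated CoT
# to compare traces produced under different token budgets.
SELF_CORRECTION = {"but", "wait"}
CONCLUSION = {"therefore", "so"}

def marker_counts(cot_text: str) -> dict:
    words = Counter(re.findall(r"[a-z']+", cot_text.lower()))
    return {
        "self_correction": sum(words[w] for w in SELF_CORRECTION),
        "conclusion": sum(words[w] for w in CONCLUSION),
    }

print(marker_counts("Wait, that seems off. But if we recheck, so the answer is 42; therefore..."))
# {'self_correction': 2, 'conclusion': 2}
```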

Models trained on LCPO adjust their reasoning chain based on their token budget (source: arXiv)

Beyond improved length control in the standard math reasoning setting, the L1 models generalize surprisingly well to out-of-distribution tasks, including GPQA and MMLU.

This new line of research into models that can adjust their reasoning budget could have important real-world uses, giving enterprises the ability to scale reasoning models without runaway expenses. It is a powerful alternative to simply deploying larger, more expensive models, and could be a crucial factor in making AI more economically viable for high-volume applications.

The researchers have open-sourced the LCPO code and the weights of the L1 models.


