Top News

- Google researchers show SRL gives 7B models better structured reasoning by rewarding partial step correctness
- OpenRouter releases Sonoma Dusk Alpha with 2M token context

For years, small open-source models have hit a wall on hard reasoning tasks. They could memorize solutions or copy examples, but they couldn't reason.
Now, a research team at Google Cloud AI Research has introduced Supervised Reinforcement Learning (SRL), a training method that finally helps models learn to think step by step instead of guessing the right answer at the end.
The Problem
Fine-tuning and reinforcement learning both struggled on complex problems like the AIME and AMC math benchmarks. Supervised Fine-Tuning (SFT) forced models to imitate human demonstrations token by token, leading to overfitting. Reinforcement Learning with Verifiable Rewards (RLVR) only rewarded correct final answers, offering no signal when every attempt failed. Smaller models, such as 7B-parameter LLMs, often produced near-random reasoning paths and learned nothing from failure.

The Insight
The researchers reframed reasoning as a sequence of actions, not just text generation. Each "action" represents one decision or logical move in a solution. Before each action, the model generates a short reasoning trace, an inner monologue. The system compares the model's action to expert examples and gives a smooth similarity reward at every step.

The Breakthrough
When they trained Qwen2.5-7B-Instruct on the s1k dataset, SRL changed the game:
- AIME24 greedy accuracy rose from 13.3% to 16.7%.
- AMC23 greedy accuracy reached 57.5% with an SRL → RLVR pipeline.
- Models showed better-structured reasoning, not just longer outputs.

The Impact
SRL also worked on agentic software engineering data (5,000 trajectories and 134,000 reasoning steps). It improved planning and verification behaviors, trained effectively on 7B models with moderate compute, and produced stable reasoning without overfitting to token sequences.

You can use SRL by breaking expert solutions into step-level "actions," training a model to predict and reflect at each step, and rewarding it for partial correctness; a rough sketch of that step-level reward follows. A simple shift that teaches models how to think, not just answer.
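For intuition, here is a minimal, hypothetical sketch of a step-level similarity reward in the spirit described above. The function names, the use of difflib as the similarity metric, and the toy example are assumptions for illustration, not the paper's exact implementation.

```python
# Hypothetical sketch of SRL-style step rewards; names and metric are assumptions.
import difflib

def step_reward(predicted_action: str, expert_action: str) -> float:
    """Smooth similarity in [0, 1] between the model's action for one
    reasoning step and the expert demonstration's action for that step."""
    return difflib.SequenceMatcher(None, predicted_action, expert_action).ratio()

def trajectory_rewards(predicted_steps: list[str], expert_steps: list[str]) -> list[float]:
    """Reward every step independently, so the model still receives signal
    even when the final answer is wrong."""
    return [step_reward(p, e) for p, e in zip(predicted_steps, expert_steps)]

# Toy example: partial credit for a partly correct solution.
expert = ["factor n**2 - 1 as (n - 1)(n + 1)", "substitute n = 10", "answer: 99"]
model  = ["factor n**2 - 1 as (n - 1)(n + 1)", "substitute n = 12", "answer: 143"]
print(trajectory_rewards(model, expert))  # first step ~1.0, later steps lower
```

Because the reward is dense and smooth, partially correct trajectories still produce training signal, which is exactly where a final-answer-only reward goes silent.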
LEARN MORE
Stop stitching tools. Build Voice AI with one API.
AssemblyAI combines transcription, speaker detection, PII redaction, topic detection, and LLM integration into one platform. Build intelligent voice products in minutes without juggling five different vendors.
Test faster, scale more easily, and ship with higher accuracy out of the box.
See how you can:
- Use one API for transcription + understanding + intelligence
- Deploy globally across 99+ languages
- Power voice features behind apps like Granola, Dovetail, and Ashby
- Replace multiple point solutions with one production-ready Voice AI stack

SIGN UP FREE | partner with us
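For orientation, a minimal sketch of what a single-API call might look like with AssemblyAI's Python SDK; the audio URL is a placeholder, and you should check the official docs for exact parameter names and the other features listed above.

```python
# Minimal sketch assuming AssemblyAI's Python SDK; verify parameters against the docs.
import assemblyai as aai

aai.settings.api_key = "YOUR_API_KEY"  # placeholder credential

# One transcription request with speaker detection enabled.
config = aai.TranscriptionConfig(speaker_labels=True)
transcript = aai.Transcriber().transcribe("https://example.com/meeting.mp3", config)

print(transcript.text)                    # full transcript
for utterance in transcript.utterances:   # per-speaker segments
    print(f"Speaker {utterance.speaker}: {utterance.text}")
```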
How Meta, OpenAI, and Google Put Agents Into Production
Join the free virtual event Agents in Production on November 18. Hear how top teams move beyond prototypes to deploy agents that personalize, moderate, and collaborate at scale.
Talks cover multi-agent systems for trust + safety, enterprise orchestration, and lessons from launching thousands of agents in live systems.
Register for free now ↗️
Reads

- 1,329 Likes: This blog shows how OpenAI built OWL to detach Chromium from the main process. You learn how Atlas boots instantly, stays responsive with many tabs, and becomes a real agent platform.
- 1,893 Likes: Torvalds explains how Git became powerful because it was optimized for real workflow constraints, not theory. You learn what fundamentally mattered in the original architecture and which tradeoffs still matter today.
- 4,328 Likes: This case shows how stateful containers break assumptions. You see how missing guardrails, SSH settings, and log rotation caused 800 GB of waste. You learn prevention rules and exact remediation steps.
Top Lecture
Rich Sutton, pioneer of RL, explains temporal difference learning

"If you want to understand reinforcement learning, understand Temporal Difference."
This is Sutton explaining temporal-difference learning from first principles: why it scales, why prediction matters, and why simple supervised learning breaks on multi-step credit assignment.
Core ideas covered:
- Why scalable AI must hinge on computation, not hand-design
- Why multi-step prediction needs TD, not single-step supervised learning
- How TD uses bootstrapping: learning from later predictions, not from labels (see the sketch below)
- How this powered TD-Gammon and Atari RL systems
- Why TD is not a niche technique but a core concept in RL itself

WATCH NOW
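To make the bootstrapping idea concrete, here is a minimal TD(0) sketch on a toy random-walk task. The environment and hyperparameters are assumptions chosen for illustration, not taken from the lecture.

```python
# Minimal TD(0) on a toy random walk (an assumed example, not from the lecture).
import random

def td0_random_walk(num_states=5, episodes=2000, alpha=0.1, gamma=1.0):
    # States 1..num_states, with terminal states 0 and num_states + 1.
    # Reaching the right terminal pays reward 1; the left terminal pays 0.
    V = [0.0] * (num_states + 2)              # value estimates (terminals stay 0)
    for _ in range(episodes):
        s = (num_states + 1) // 2             # start each episode in the middle
        while 1 <= s <= num_states:
            s_next = s + random.choice([-1, 1])
            r = 1.0 if s_next == num_states + 1 else 0.0
            target = r + gamma * V[s_next]    # bootstrap from the NEXT prediction
            V[s] += alpha * (target - V[s])   # move toward the TD target, not a label
            s = s_next
    return V[1:num_states + 1]

print([round(v, 2) for v in td0_random_walk()])  # roughly [0.17, 0.33, 0.5, 0.67, 0.83]
```

The update never waits for the episode's final outcome: each state's value moves toward the reward plus the current estimate of the next state, which is the multi-step credit assignment that single-step supervised learning cannot provide.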