🏆 Anthropic releases Opus 4.5 leading SWE-bench and BrowseComp

Andrew Ng's Agentic Paper Reviewer, Anthropic on task duration, ChatGPT gets Voice mode and Shopping, Tencent's Open OCR
Stay updated with today's top AI news, papers, and repos.
Signup | Work With Us | Follow on X | Read on Web

Hey James

Welcome to AlphaSignal, the most-read source of news for AI engineers and researchers.

Every day, we identify and summarize the top 1% of news, papers, models, and repos, so you're always up to date.

Here's today's roundup:

Summary

Read time: 3 min 25 sec

Top News

▸ Anthropic launches Opus 4.5 with top SWE-bench engineering performance

Top News

▸ Andrew Ng introduces tool that accelerates paper revisions with rapid feedback

Top Paper

▸ Anthropic studies 100k Claude chats to see how much work AI handles

Signals

▸ OpenAI adds shopping research to ChatGPT for fast product comparisons
▸ Tencent presents HunyuanOCR, an open-source 1B end-to-end OCR model
▸ OpenAI rolls out real-time ChatGPT Voice directly inside chat on all platforms
▸ Black Forest Labs releases FLUX.2, an open-weight multi-reference image model
▸ Google ships interactive images in Gemini for deeper, visual academic learning
Top News
Anthropic introduces Opus 4.5, adding effort control for precise reasoning across complex tasks
17,493 Likes

Claude Opus 4.5 arrives as the model that finally treats software engineering like a real system, not a guessing game.

The setup is familiar: frontier models improve fast, yet real bugs, multi-step tasks, and ambiguous specs still push them off course. The problem shows up when a model floods you with tokens but never reaches a working fix.

Opus 4.5 introduces a clear insight: control how much the model thinks. The effort parameter acts like a compute dial, letting you choose fast responses or deeper reasoning. The breakthrough comes from how well this works in practice.

Key Results

  • Uses 76% fewer tokens than Sonnet 4.5 at matching performance.
  • Beats Sonnet 4.5 by 4.3 points at high effort.
  • Leads Aider Polyglot, BrowseComp-Plus, and SWE-bench Multilingual.

How to Use It

  • Call claude-opus-4-5-20251101 with the Claude API (see the sketch below).
  • Adjust effort for shallow or deep reasoning.
  • Use Claude Code planning for structured, multi-step fixes.
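
Here's a minimal sketch of that call using the Anthropic Python SDK. The model id comes from the announcement above; how the effort dial is passed (shown here as an extra request field named "effort") is an assumption, so check the Claude API docs before copying this.

# Minimal sketch: calling Claude Opus 4.5 via the Anthropic Python SDK.
# Assumption: the effort dial is sent as an extra request field named "effort";
# the real parameter name, placement, and values may differ (see Anthropic's docs).
import anthropic

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment

response = client.messages.create(
    model="claude-opus-4-5-20251101",   # model id from the announcement
    max_tokens=2048,
    extra_body={"effort": "high"},      # hypothetical: lower for fast answers, higher for deeper reasoning
    messages=[
        {"role": "user", "content": "Find and fix the off-by-one bug in this loop, then explain the fix."}
    ],
)

print(response.content[0].text)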

TRY NOW
Top News
Andrew Ng builds tool that compresses six-month research feedback cycles into near-instant iterations
5,529 Likes

Andrew Ng just introduced an agentic paper reviewer that reads your research, searches arXiv, and returns grounded comments in minutes. It started as a weekend idea, sparked by a student who spent three years trapped in six-month review loops. Now it aims to cut that wait to near-zero.

The idea was to let an agent read your paper, pull the right prior work, and judge it across clear criteria. The surprise came when the system, trained on ICLR 2025 reviews, matched human-level agreement with a 0.42 AI-human correlation compared to the 0.41 human-human baseline.

Researchers can upload a PDF, choose a venue, and iterate immediately. The takeaway: reviewing no longer needs a six-month feedback cycle.

TRY NOW
Top Paper
Anthropic uses 100k Claude transcripts to estimate human-only versus AI-assisted task duration
1,149 Likes

The story starts with a basic question: how much real work do people hand to Claude? Anthropic sampled 100,000 anonymized chats and asked Claude to estimate how long each task would take without AI. This gave them a direct way to measure the size and difficulty of everyday workloads.

They found a consistent pattern. Claude estimated that many tasks would take about 90 minutes for a human, yet users finished them in a fraction of that time. This gap revealed how much speed AI already adds in practice.

Anthropic then linked each chat to O*NET tasks and wage data. This showed which kinds of work see big accelerations and which tasks remain slow.
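
As a rough illustration of that comparison (not Anthropic's actual pipeline), a time-reduction figure like the one reported below is just the average fraction of time saved across tasks; the sample durations here are made up.

# Illustrative only: computing an average time-reduction figure from paired
# duration estimates. The sample durations are invented, not Anthropic's data.
tasks = [
    {"human_only_min": 90,  "with_ai_min": 15},  # e.g. drafting a report
    {"human_only_min": 120, "with_ai_min": 30},  # e.g. reviewing a contract
    {"human_only_min": 45,  "with_ai_min": 10},  # e.g. writing a small code fix
]

reductions = [1 - t["with_ai_min"] / t["human_only_min"] for t in tasks]  # fraction of time saved per task
avg_reduction = sum(reductions) / len(reductions)

print(f"Average time reduction: {avg_reduction:.0%}")  # prints "79%" for these sample values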

Key Findings

  • 80% average time reduction across the dataset.
  • Management tasks estimated at 2.0 hours drop to short sessions.
  • Large variation across fields like legal, education, and healthcare.

READ MORE
Signals
1 OpenAI adds shopping research to ChatGPT for interactive filtering, side-by-side comparisons, and tailored picks 6,028 Likes
2 Tencent unveils HunyuanOCR, a lightweight end-to-end OCR model covering detection, recognition, and complex documents 1,142 Likes
3 OpenAI ships unified Voice mode on web and mobile with optional classic mode 4,836 Likes
4 Black Forest Labs announces open-weight FLUX.2 with 10-reference support, sharp text, and photoreal editing 2,482 Likes
5 Google introduces dynamic educational visuals in Gemini to help users study scientific systems interactively 4,639 Likes
Looking to promote your company, product, service, or event to 250,000+ AI developers?
WORK WITH US
unsubscribe_me(): return True
{"AlphaSignal": "214 Barton Springs Rd, Austin, USA"}
