Saturday, May 2, 2026

9 articles: 🔴 2 critical, 🟡 4 important, 🟢 3 interesting


🤖 Models (4)

🟡 🤖 Models May 2, 2026 · 3 min read

Latent-GRPO: Stable RL Optimization for Latent Reasoning, +7.86 Points on GSM8K-Aug and +4.27 Points on AIME With 3-4× Shorter Reasoning Chains

Editorial illustration: compression of a reasoning network into a condensed latent space

Researchers introduce Latent-GRPO, a stabilized RL method for latent reasoning, in which reasoning steps are compressed into continuous representations. They identify three fundamental problems with applying GRPO directly in latent space (invalid latent states, misalignment between the reward signal and token updates, and invalid averaged states) and address them with a combination of invalid-sample advantage masking, one-sided noise sampling, and optimal correct-path first-token selection. Results: +7.86 Pass@1 points on GSM8K-Aug and +4.27 points on AIME, with 3-4× shorter reasoning chains.
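To make "invalid-sample advantage masking" concrete, here is a minimal sketch of group-relative advantages in which flagged samples contribute nothing to the update. It is an illustration under stated assumptions, not the paper's implementation: the function name `masked_group_advantages`, the `valid` flags, and the choice to exclude invalid samples from the group mean and standard deviation are all hypothetical.

```python
from statistics import mean, pstdev


def masked_group_advantages(rewards, valid, eps=1e-6):
    """Group-relative advantages with invalid samples masked to zero.

    Assumption: invalid samples are dropped from the group statistics
    and receive zero advantage, so they do not affect the policy update.
    """
    kept = [r for r, ok in zip(rewards, valid) if ok]
    if len(kept) < 2:
        # Too few valid samples to normalize: skip the whole group.
        return [0.0] * len(rewards)
    mu = mean(kept)
    sigma = pstdev(kept)
    return [
        (r - mu) / (sigma + eps) if ok else 0.0
        for r, ok in zip(rewards, valid)
    ]


# One group of four sampled rollouts; the third is flagged as invalid.
rewards = [1.0, 0.0, 1.0, 0.0]
valid = [True, True, False, True]
advantages = masked_group_advantages(rewards, valid)
print(advantages)
```

In this example the third rollout gets zero advantage even though its reward is 1.0, while the remaining three are normalized against each other only.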

🟡 🤖 Models May 2, 2026 · 2 min read

GitHub is retiring GPT-5.2 and GPT-5.2-Codex from Copilot on June 1, 2026 — migration to GPT-5.5 and GPT-5.3-Codex

Editorial illustration: GitHub Copilot dashboard with a new model replacing the old one

GitHub announces the retirement of GPT-5.2 and GPT-5.2-Codex from all Copilot experiences on June 1, 2026. Chat, inline edit, ask and agent mode, and code completion users will move to GPT-5.5, while Codex users will move to GPT-5.3-Codex. The exception is Copilot Code Review, where GPT-5.2-Codex remains available. Enterprise administrators must manually enable the new models in model policies before the deadline.

🟡 🤖 Models May 2, 2026 · 3 min read

NIST CAISI evaluation of DeepSeek V4 Pro: 8-month lag behind frontier US models across 9 benchmarks in 5 domains

Editorial illustration: a scale comparing AI models above a geopolitical map

The Center for AI Standards and Innovation at NIST (CAISI) has published an independent evaluation of the Chinese model DeepSeek V4 Pro across 9 benchmarks in 5 domains (cybersecurity, software engineering, natural sciences, abstract reasoning, mathematics). Key finding: V4 Pro lags 8 months behind frontier US models, particularly on reasoning and agentic tasks that DeepSeek did not include in its own technical report. Its cost of use is lower than that of GPT-5.4 mini in 5 of 7 tests.

🟢 🤖 Models May 2, 2026 · 2 min read

KellyBench: AI agents managing a betting bankroll through the Premier League season — all leading models lost money

Editorial illustration: a football stadium with digital odds analysis

KellyBench is a new benchmark for testing sequential decision-making: AI agents manage a betting bankroll through the entire 2023/24 Premier League season, using statistics, lineups, and market odds. All leading models tested lost money, and Claude Opus 4.6 scored 26.5% on the expert rubric for strategy sophistication.
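The benchmark's name presumably alludes to the Kelly criterion, the classic rule for sizing a bet as a fraction of the bankroll. The summary does not say how agents are expected to size bets, so the sketch below only illustrates the standard formula f* = (b·p − q) / b, where b is the net odds, p the estimated win probability, and q = 1 − p. The function name `kelly_fraction` and the example numbers are hypothetical.

```python
def kelly_fraction(p_win, decimal_odds):
    """Kelly stake as a fraction of bankroll for a single binary bet.

    p_win: estimated probability that the bet wins (0..1).
    decimal_odds: bookmaker decimal odds, e.g. 2.0 for even money.
    Returns 0.0 when the bet has no positive edge.
    """
    b = decimal_odds - 1.0  # net profit per unit staked on a win
    if b <= 0:
        raise ValueError("decimal odds must be greater than 1.0")
    f = (b * p_win - (1.0 - p_win)) / b
    return max(f, 0.0)


# Hypothetical numbers: a 55% win estimate at even-money odds.
stake_with_edge = kelly_fraction(0.55, 2.0)
# With a 40% estimate at the same odds there is no edge, so no bet.
stake_without_edge = kelly_fraction(0.40, 2.0)
print(stake_with_edge, stake_without_edge)
```

The formula is only as good as the probability estimate: an overconfident `p_win` turns a nominally optimal stake into systematic overbetting.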

🤝 Agents (2)

💬 Community (1)

🛡️ Security (2)
