🤖 Models

92 articles

🟡 🤖 Models May 22, 2026 · 3 min read

arXiv:2605.21006: Off-the-shelf persona vectors achieve 68-98% of the effectiveness of targeted sycophancy steering in LLMs

Editorial illustration: arXiv:2605.21006: off-the-shelf persona vectors achieve 68-98% of the effectiveness of targeted sycophancy steering in LLMs

Researchers published a paper on arXiv on 21 May 2026 titled 'Playing Devil's Advocate' showing that existing persona vectors developed for roleplay tasks can reduce sycophancy (the model's tendency to agree with the user even when the user is wrong) with 68-98% of the effectiveness of specialised Contrastive Activation Addition (CAA), without training on sycophancy-specific data. Geometric analysis reveals that sycophancy is a persona-level property rather than a single steerable direction in activation space, which opens easier pathways for alignment.

🟢 🤖 Models May 22, 2026 · 3 min read

Black Forest Labs: FLUX Erase outperforms GPT Image-2 (68.5%) and Finegrain (63.2%) in prompt-free object removal

Editorial illustration: FLUX Erase outperforms GPT Image-2 (68.5%) and Finegrain (63.2%) in prompt-free object removal

Black Forest Labs launched FLUX Erase on 21 May 2026 — an inpainting tool that uses a binary mask to remove objects, shadows, watermarks, and text from images and reconstructs the background without any textual prompt. A benchmark on 198 test images demonstrates superiority over GPT Image-2 (68.5%) and Finegrain Eraser Standard (63.2%). The tool is available through the BFL API and a public demo at flux-tools.bfl.ai/erase, positioning BFL as a specialist in professional creative workflow tools.

🔴 🤖 Models May 21, 2026 · 2 min read

OpenAI: AI model disproves 80-year-old conjecture in discrete geometry

Editorial illustration: OpenAI AI model disproves 80-year-old unit distance conjecture in discrete geometry

OpenAI announced that its AI model solved the open unit distance problem — a central conjecture in discrete geometry posed over 80 years ago. The company describes the result as a milestone in AI-driven mathematics, because the model did not merely verify an existing thesis but disproved it by constructing an original counterexample.

🟢 🤖 Models May 21, 2026 · 2 min read

arXiv:2605.19762: ICML 2026 paper claims code does not improve LLM mathematical reasoning

Editorial illustration: ICML 2026 paper shows structured reasoning signals outperform pure code for LLM mathematical reasoning

An arXiv preprint accepted at ICML 2026 shows through controlled pre-training experiments that executable code does not by itself improve the general reasoning capabilities of LLMs: code strongly improves programming but competes with mathematical tasks in standard mode. Real progress in mathematics comes from cross-domain structured reasoning traces (code-text and math-text mixtures), and mechanistic analysis of Mixture-of-Experts models reveals these interactions in expert activation patterns.

🔴 🤖 Models May 20, 2026 · 3 min read

Google: Gemini 3.5 Flash and Pro — the fastest frontier models yet

Editorial illustration: Google unveiled Gemini 3.5 Flash and Pro at Google I/O 2026 — frontier models 4× faster than

Google unveiled Gemini 3.5 Flash and Pro at Google I/O 2026 — frontier models that are 4× faster than the competition, with a special focus on agentic tasks, the new Antigravity 2.0 developer platform, and Gemini Spark, a personal AI agent available 24/7.

🔴 🤖 Models May 20, 2026 · 3 min read

Google: Gemini Omni Flash brings native video generation from mixed inputs

Editorial illustration: Google unveiled Gemini Omni Flash at I/O 2026 — a new multimodal model generating and editing

Google unveiled Gemini Omni Flash at I/O 2026 — a new multimodal model that generates and edits video from a combination of images, audio, video, and text. Available immediately on YouTube Shorts, with mandatory SynthID digital watermarks on every generated clip.

🟡 🤖 Models May 20, 2026 · 2 min read

Google: ERA — AI system that automates scientific code writing


Google published ERA (Empirical Research Assistance) in Nature — a Gemini-powered system that uses tree search to evaluate thousands of computational approaches and automates the writing of expert scientific software. The Computational Discovery platform is already available to researchers.

🟢 🤖 Models May 20, 2026 · 2 min read

arXiv:2605.19660: OScaR — INT2 KV Cache Quantization Delivers 3× Faster Decoding

Editorial illustration: Researchers publish OScaR, a method solving the fundamental problem of KV cache quantization in large language models

Researchers have published OScaR, a method that solves the fundamental problem of KV cache quantization in large language models. Using INT2 precision — just 2 bits per value — it achieves near-lossless accuracy, 3× faster decoding, 5.3× less memory, and 4.1× higher throughput compared to BF16 FlashDecoding-v2.
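To make "2 bits per value" concrete, here is a minimal sketch of generic group-wise INT2 quantization. It is not OScaR's method, which the summary does not describe; the function names, the group size of 32, and the random toy tensor are all assumptions for illustration.

```python
import numpy as np

def quantize_int2(x, group_size=32):
    """Asymmetric 2-bit quantization over contiguous groups of values."""
    groups = x.reshape(-1, group_size).astype(np.float32)
    lo = groups.min(axis=1, keepdims=True)
    hi = groups.max(axis=1, keepdims=True)
    # 2 bits give 4 levels (codes 0..3), so one step is a third of the range.
    scale = np.maximum(hi - lo, 1e-8) / 3.0
    codes = np.clip(np.round((groups - lo) / scale), 0, 3).astype(np.uint8)
    # Codes are kept one per byte here; a real kernel would pack four per byte.
    return codes, scale, lo

def dequantize_int2(codes, scale, lo, shape):
    """Map 2-bit codes back to approximate float values."""
    return (codes.astype(np.float32) * scale + lo).reshape(shape)

rng = np.random.default_rng(0)
kv = rng.standard_normal((4, 64)).astype(np.float32)  # toy stand-in for a KV cache slice
codes, scale, lo = quantize_int2(kv)
restored = dequantize_int2(codes, scale, lo, kv.shape)
max_err = float(np.abs(kv - restored).max())
print(f"max reconstruction error: {max_err:.3f}")
```

With only four levels per group, the rounding error is bounded by half a step, which is why naive INT2 tends to lose accuracy and why near-lossless results at this precision are notable.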

🔴 🤖 Models May 19, 2026 · 4 min read

arXiv:2605.15514: RoPE mathematically cannot distinguish positions or tokens in long contexts — theoretical proof of a fundamental limitation

Editorial illustration: arXiv paper 2605.15514 provides a mathematical proof that Rotary Positional Embeddings (RoPE) loses ability to distinguish positions in long contexts

arXiv paper 2605.15514 provides a mathematical proof that Rotary Positional Embeddings (RoPE), the positional mechanism used by nearly all modern large language models including Llama, Mistral, Qwen and GPT-NeoX, loses the ability to distinguish positions and tokens in long contexts. The authors conclude that fundamentally new architectural mechanisms are needed.
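For readers unfamiliar with the mechanism, the following is a minimal sketch of standard RoPE, assuming the common base of 10000 and an interleaved pairing of dimensions. It only illustrates how positions enter the attention score; it does not reproduce the paper's proof.

```python
import numpy as np

def rope(x, position, base=10000.0):
    """Rotate consecutive pairs of x by position-dependent angles."""
    d = x.shape[-1]
    freqs = base ** (-np.arange(0, d, 2) / d)  # one frequency per pair
    angles = position * freqs
    cos, sin = np.cos(angles), np.sin(angles)
    x_even, x_odd = x[..., 0::2], x[..., 1::2]
    out = np.empty_like(x)
    out[..., 0::2] = x_even * cos - x_odd * sin
    out[..., 1::2] = x_even * sin + x_odd * cos
    return out

rng = np.random.default_rng(0)
q, k = rng.standard_normal(8), rng.standard_normal(8)
score_a = rope(q, 5) @ rope(k, 2)      # query at 5, key at 2: offset 3
score_b = rope(q, 105) @ rope(k, 102)  # shifted by 100, same offset 3
print(score_a, score_b)
```

The two scores agree because the rotations cancel down to the relative offset, so position information reaches attention only through these bounded sinusoidal terms.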

🟡 🤖 Models May 19, 2026 · 2 min read

Anthropic: Claude API web search tool now returns enriched data from SEC filings

Editorial illustration: Anthropic updated the web search tool in the Claude API to return richer, structured data from SEC filings

On May 18, 2026, Anthropic updated the web search tool in the Claude API to return richer and more structured data from SEC filings, including 10-K, 10-Q and 8-K documents. The upgrade makes it easier to build financial agents for earnings analysis, due diligence and research with referenced primary sources.

🟢 🤖 Models May 19, 2026 · 2 min read

arXiv:2605.18732: Scaling Law for Hallucinations — Larger Model Does Not Always Mean Fewer Errors

Editorial illustration: Scaling law for LLM hallucinations — sigmoid curve for factual recall

Researchers tested 38 models on 8,900+ references and showed that LLM factual recall follows a sigmoid curve: the combination of parameter count and topic prevalence in training data explains 60–94% of variance. Hallucinations are not random — they are predictable and measurable.

🟡 🤖 Models May 18, 2026 · 4 min read

GitHub Copilot: GPT-5.3-Codex becomes base model for Business and Enterprise with 12-month LTS guarantee

Editorial illustration: GitHub Copilot logo with GPT-5.3-Codex badge and LTS support stamp.

On May 17, 2026, GitHub announced that GPT-5.3-Codex replaces GPT-4.1 as the base model for Copilot Business and Enterprise. The change applies only to enterprise tiers (not Copilot Pro, Pro+, or Free). GPT-5.3-Codex is the first LTS (long-term support) model — guaranteed availability for 12 months from February 5, 2026 to February 4, 2027. Pricing: 1× premium request multiplier; GPT-4.1 remains force-enabled at 0× multiplier until deprecation on June 1, 2026.

🟡 🤖 Models May 16, 2026 · 3 min read

Black Forest Labs: FLUX Outpainting extends images in any direction while preserving light, texture, and composition

Editorial illustration: an image expanding beyond its frame with preserved light and texture.

FLUX Outpainting is a new Black Forest Labs image generation feature announced on May 14, 2026, that extends images in any direction through a purpose-built expansion endpoint. The user specifies target canvas dimensions and placement coordinates — the model preserves lighting, texture, depth, and composition across extension regions without text prompts. Up to 4MP output, available via the BFL API, with a public demo at flux-tools.bfl.ai/outpainting.

🟡 🤖 Models May 15, 2026 · 2 min read

Amazon Nova 2 Sonic: Speech-to-Speech Foundation Model with End-to-End Latency Below 500ms and 30ms Audio Latency

Editorial illustration: voice agent with sound waves and edge network graphic.

Amazon Nova 2 Sonic is a new-generation speech-to-speech foundation model announced on May 14, 2026, through Amazon Bedrock. It eliminates the need for separate speech-to-text and text-to-speech services, offering end-to-end latency below 500ms, audio latency below 30ms via the Stream edge network, native turn detection, barge-in support, and function calling during conversation. The Stream Vision Agents framework abstracts bidirectional audio stream management.

🟡 🤖 Models May 15, 2026 · 3 min read

arXiv:2605.15177: OpenDeepThink — parallel reasoning via Bradley-Terry aggregation lifts Gemini 3.1 Pro by +405 Elo on Codeforces

Editorial illustration: parallel reasoning branches with pairwise judging symbols and Elo rating.

OpenDeepThink is a new population-based test-time compute scaling methodology published May 14, 2026 on arXiv by Shang Zhou and collaborators. The framework samples multiple reasoning candidates in parallel and selects the best through pairwise Bradley-Terry comparisons, instead of pointwise LLM judging. Result: Gemini 3.1 Pro gains +405 Elo on Codeforces benchmarks across eight sequential LLM-call rounds (~27 minutes). The team also released the CF-73 dataset with 73 expert-rated Codeforces problems.
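As an illustration of pairwise aggregation, here is a minimal sketch that fits Bradley-Terry strengths from a win matrix using the classic minorization-maximization update. The win counts are invented, and the summary does not say how OpenDeepThink fits or uses the model, so treat this as the textbook technique only.

```python
import numpy as np

def bradley_terry(wins, iters=200):
    """wins[i, j] = number of times candidate i beat candidate j."""
    n = wins.shape[0]
    games = wins + wins.T          # comparisons played between each pair
    total_wins = wins.sum(axis=1)
    p = np.ones(n) / n             # start from equal strengths
    for _ in range(iters):
        denom = (games / (p[:, None] + p[None, :])).sum(axis=1)
        p = total_wins / denom
        p /= p.sum()               # strengths are only defined up to scale
    return p

# Hypothetical judge outcomes for three candidate solutions.
wins = np.array([[0, 3, 4],
                 [1, 0, 3],
                 [0, 1, 0]])
strengths = bradley_terry(wins)
best = int(np.argmax(strengths))
print(strengths, best)
```

The sketch assumes every candidate wins at least once; otherwise its strength collapses to zero and the fit needs regularization.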

🟡 🤖 Models May 14, 2026 · 2 min read

arXiv:2605.13301: SU-01 — 30B model reaches gold-medal level at IMO 2025, USAMO 2026, and IPhO through three-phase training

Editorial illustration: medal podiums with mathematical formulas and AI reasoning trees.

SU-01 is a new reasoning training methodology published on May 14, 2026 on arXiv (Yafu Li and 27 co-authors, corresponding author Runzhe Zhan). A 30B parameter A3B backbone reaches gold-medal performance on the International Mathematical Olympiad 2025, USAMO 2026, and International Physics Olympiad 2024-2025 through three sequential phases: reverse-perplexity curriculum SFT on 340K trajectories, two-stage RL, and test-time scaling. Reasoning chains reach 100K+ tokens.

🟢 🤖 Models May 14, 2026 · 2 min read

Allen Institute: AIMIP benchmark — AI climate models 2× better on historical data but fail to generalize to long-term warming

Editorial illustration: climate time-series graphs with AI model lines versus historical data.

AIMIP (AI Model Intercomparison Project) is a new community benchmark for AI weather and climate models published on May 13, 2026 by the Allen Institute together with NVIDIA, Google Research, University of Washington, University of Maryland and the ArchesWeather group. Phase 1 evaluation of eight AI model simulations showed a twofold reduction in error on historical data — but also a serious inability to generalize to long-term warming trends.

🟢 🤖 Models May 14, 2026 · 2 min read

Microsoft Research GridSFM: foundation model solves AC optimal power flow 100× faster than DC approximation

Editorial illustration: electric power grid with an AI foundation model and optimization graph.

GridSFM is a new Microsoft Research small foundation model for electric power grids published on May 13, 2026. It approximates AC optimal power flow in milliseconds on grids of 500 to 80,000 nodes — 100× faster than DC approximation and 1,000× faster than full AC solvers. Median cost gap is 2.23%, feasibility detection achieves 94.5%/96.1%, and the model projects potential savings of $20 billion annually in congestion costs.

🟡 🤖 Models May 13, 2026 · 2 min read

Anthropic: Claude Opus 4.7 Fast Mode enters research preview — premium speed for the flagship model

Editorial illustration: fast token streams through neural architecture under a premium signal.

Claude Opus 4.7 Fast Mode is a new Anthropic API research preview feature released on May 12, 2026, enabling significantly faster output token generation for Anthropic's most powerful model at a premium price. Developers activate the mode with the speed="fast" parameter, model claude-opus-4-7, and the beta header fast-mode-2026-02-01. Access, rate limits and pricing are identical to the Opus 4.6 Fast Mode variant.

🟢 🤖 Models May 13, 2026 · 2 min read

Microsoft Research: MatterSim experimentally synthesized TaP at 152 W/m/K, MatterSim-MT extends output beyond PES

Editorial illustration: crystalline material structure with a thermal conductivity display.

MatterSim is a new Microsoft Research foundation model for materials science whose results were published on May 12, 2026. The model predicted tetragonal TaP, which was experimentally synthesized and measured at 152 W/m/K — close to silicon. MatterSim-v1 inference is accelerated 3–5×, and the new MatterSim-MT multi-task model adds stress tensors, magnetic moments, Born effective charges, and dielectric matrices.

🟡 🤖 Models May 12, 2026 · 2 min read

vLLM: open-source inference engine takes first place on the Artificial Analysis leaderboard

Editorial illustration: open-source inference engine takes first place on the Artificial Analysis leaderboard

vLLM is an open-source inference engine that claimed first place on the Artificial Analysis leaderboard for three frontier models — DeepSeek V3.2, MiniMax-M2.5, and Qwen 3.5 397B — through aggressive kernel fusion (33→10 launches per layer, 1.28× speedup), a custom EAGLE3 draft model for speculative decoding, and linear attention path optimizations.

🟢 🤖 Models May 12, 2026 · 2 min read

arXiv:2605.07776: tracking uncertainty in LLM reasoning traces — errors predictable from the first 100 tokens

Editorial illustration: 2605.07776: tracking uncertainty in LLM reasoning traces — errors predictable from the first 100 tokens

arXiv:2605.07776 is a study on uncertainty tracking in the reasoning traces of large language models. The authors (Grünefeld, Højer, Mondorf, Plank, Rogers, and collaborators) developed an 'uncertainty trace profile', a compact feature set that predicts whether the outcome is correct with AUROC 0.807 and already reaches AUROC 0.801 from the first few hundred tokens.
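For context on the metric, AUROC is the probability that a randomly chosen correct trace receives a higher score than a randomly chosen incorrect one. A minimal sketch with made-up confidence scores (not data from the paper):

```python
import numpy as np

def auroc(scores, labels):
    """Chance that a random positive outranks a random negative (ties count half)."""
    scores = np.asarray(scores, dtype=float)
    labels = np.asarray(labels, dtype=bool)
    pos, neg = scores[labels], scores[~labels]
    diff = pos[:, None] - neg[None, :]  # every positive against every negative
    return float(((diff > 0).sum() + 0.5 * (diff == 0).sum()) / diff.size)

confidence = [0.9, 0.8, 0.35, 0.6, 0.2, 0.1]  # hypothetical predictor outputs
correct = [1, 1, 1, 0, 0, 0]                  # 1 = the trace ended in a correct answer
value = auroc(confidence, correct)
print(round(value, 3))
```

A value of 0.5 corresponds to random guessing and 1.0 to perfect separation, so scores around 0.8 indicate a useful but imperfect early-warning signal.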

🟡 🤖 Models May 11, 2026 · 2 min read

arXiv:2605.06635: LLM agents cite but don't verify — links valid 94%+, accuracy only 39–77%

Editorial illustration: 2605.06635: LLM agents cite but don't verify — links valid 94%+, accuracy only 39–77%

New research tested 14 LLMs on deep research tasks and uncovered a major gap: links are valid in 94%+ of cases, but the factual accuracy of citations is only 39–77%. The key finding: citation accuracy drops by 42% when the number of tools increases from 2 to 150, overturning the assumption that more retrieval means better quality.

🟡 🤖 Models May 11, 2026 · 2 min read

arXiv:2605.07990: LLM tool-calling linearly represented — mean-difference vector changes selection 77-100%

Editorial illustration: 2605.07990: LLM tool-calling linearly represented — mean-difference vector changes selection 77-100%

Researchers from UCL, Holistic AI and Imperial College discovered that LLMs internally represent tool selection linearly. The mean-difference vector — the difference of average activations between two tools — added to activations changes selection with 77-100% accuracy on 12 tested models (270M-27B parameters), without any fine-tuning.
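The mean-difference construction can be sketched on synthetic data as follows. The activations here are random clusters and the nearest-mean readout is a toy stand-in for the model's tool choice; both are assumptions, as are the function names.

```python
import numpy as np

def mean_difference_vector(acts_a, acts_b):
    """Direction pointing from tool A's average activation to tool B's."""
    return acts_b.mean(axis=0) - acts_a.mean(axis=0)

def steer(hidden, direction, alpha=1.0):
    """Add the steering direction to a hidden state."""
    return hidden + alpha * direction

rng = np.random.default_rng(0)
dim = 16
center_a, center_b = rng.standard_normal(dim), rng.standard_normal(dim)
acts_a = center_a + 0.1 * rng.standard_normal((50, dim))  # synthetic "tool A" activations
acts_b = center_b + 0.1 * rng.standard_normal((50, dim))  # synthetic "tool B" activations
direction = mean_difference_vector(acts_a, acts_b)

def closest_tool(hidden):
    """Toy readout: whichever class mean the hidden state is nearer to."""
    dist_a = np.linalg.norm(hidden - acts_a.mean(axis=0))
    dist_b = np.linalg.norm(hidden - acts_b.mean(axis=0))
    return "A" if dist_a < dist_b else "B"

sample = acts_a[0]
before = closest_tool(sample)
after = closest_tool(steer(sample, direction))
print(before, "->", after)
```

In a real model the vector would be computed from residual-stream activations at a chosen layer and added during the forward pass; the layer choice and scaling factor are left open here.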

🟢 🤖 Models May 11, 2026 · 2 min read

arXiv:2605.06660: VHG — verifier-backed framework for generating hard mathematical problems

Editorial illustration: 2605.06660: VHG — verifier-backed framework for generating hard mathematical problems

The VHG (Verifier-backed Hard problem Generation) framework addresses the problem of creating valid, hard, and original mathematical problems for LLM training. It introduces an independent verifier into the setter-solver duality — three-party self-play guarantees both validity and difficulty. Tested on integral calculus, VHG significantly outperforms all baseline methods.

🟢 🤖 Models May 11, 2026 · 1 min read

arXiv:2605.07925: Value induction in LLMs — all values increase sycophancy, even positive ones

Editorial illustration: 2605.07925: Value induction in LLMs — all values increase sycophancy, even positive ones

Value induction is a post-training technique that emphasizes specific values (helpfulness, harmlessness, honesty). A study in Findings of ACL 2026 shows that induction of positive values improves safety, but all tested values increase anthropomorphic language and make models 'validating and sycophantic' regardless of which value is emphasized.

🟡 🤖 Models May 9, 2026 · 2 min read

Allen Institute: EMO — MoE language model with natural semantic modularity from data

Editorial illustration: MoE language model diagram with experts grouped by semantic domains

EMO is a new MoE language model from the Allen Institute with 1B active and 14B total parameters, trained on 1 trillion tokens. Experts self-organize into semantic domains — with 25% of active experts the performance loss is just 1%.

🟡 🤖 Models May 9, 2026 · 2 min read

arXiv:2605.06638: ScaleLogic — RL compute follows a power law in reasoning depth

Editorial illustration: log-log scale graph with a line connecting compute and reasoning depth

ScaleLogic is a synthetic framework showing that the reinforcement learning compute required for long-horizon reasoning follows a power law with depth: T ∝ D^γ (R² > 0.99). The exponent γ ranges from 1.04 to 2.60 depending on logical expressiveness, and more expressive training yields up to +10.66 points better downstream results.
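A power law of the form T ∝ D^γ is a straight line in log-log space, so γ can be estimated by linear regression on the logarithms. A minimal sketch on synthetic data (the constants 3.0 and 1.5 are invented, not taken from the paper):

```python
import numpy as np

def fit_power_law(depth, compute):
    """Fit compute = c * depth**gamma by least squares in log-log space."""
    log_d, log_t = np.log(depth), np.log(compute)
    gamma, log_c = np.polyfit(log_d, log_t, 1)  # slope is the exponent
    pred = gamma * log_d + log_c
    ss_res = np.sum((log_t - pred) ** 2)
    ss_tot = np.sum((log_t - log_t.mean()) ** 2)
    return gamma, np.exp(log_c), 1.0 - ss_res / ss_tot

depth = np.array([2.0, 4.0, 8.0, 16.0, 32.0])
compute = 3.0 * depth ** 1.5  # synthetic measurements with a known exponent
gamma, c, r2 = fit_power_law(depth, compute)
print(f"gamma={gamma:.2f}, c={c:.2f}, R^2={r2:.4f}")
```

The R² reported by such a fit is computed on the log-transformed values, which is the usual convention when a power law is claimed.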

🔴 🤖 Models May 8, 2026 · 2 min read

OpenAI: three new realtime voice models in the API with reasoning and translation

Editorial illustration: three new realtime voice models in the API with reasoning and translation

On May 7, 2026, OpenAI introduced three new realtime voice models in the API: GPT-Realtime-2 with GPT-5-class reasoning and a 128,000-token context, GPT-Realtime-Translate that translates from 70+ input languages into 13 output languages, and GPT-Realtime-Whisper for live speech transcription.

🟡 🤖 Models May 8, 2026 · 2 min read

Google: Gemini 3.1 Flash-Lite enters general availability

Editorial illustration: Gemini 3.1 Flash-Lite enters general availability

Gemini 3.1 Flash-Lite has been generally available through the Gemini API since May 7, 2026, as a stable production endpoint. The model is optimized for speed, scale, and cost efficiency, and the preview version will be discontinued on May 25, 2026.

🟡 🤖 Models May 7, 2026 · 2 min read

arXiv:2605.03195: Terminus-4B — 4 billion parameters for terminal execution matches Claude Opus and GPT-5.3-Codex on SWE-Bench Pro with ~30% fewer main agent tokens

Editorial illustration: two concentric circles — smaller 4B model for terminal and larger frontier model for planning connected by a task delegation arrow

Terminus-4B is a 4-billion-parameter Qwen3 fine-tune specialized for terminal execution in agentic systems — on the SWE-Bench Pro benchmark it matches and sometimes outperforms Claude Sonnet/Opus and GPT-5.3-Codex baselines, while reducing main agent token consumption by approximately 30% by isolating verbose build/test logs in a subagent context.

🟡 🤖 Models May 7, 2026 · 2 min read

arXiv:2605.04908: Gosset with Curated Pharma Index Outperforms Frontier LLMs by 3.2x

Editorial illustration: arXiv:2605.04908: Gosset with curated pharma database outperforms frontier LLMs 3.2x

Gosset is a specialized AI platform with curated pharmaceutical data that returned 3.2 times more verified drugs per query compared to four frontier systems, achieving 100% precision and full recall across ten niche oncology and immunology targets.

🟡 🤖 Models May 7, 2026 · 2 min read

Google: Gemini API Gets Multimodal File Search for Images and Breaking Change in Interactions API

Editorial illustration: Gemini API gains multimodal File Search and breaking change in Interactions API

Google has expanded Gemini File Search to multimodal image search using the gemini-embedding-2 model, with media_id in grounding metadata for visual citations. Simultaneously, Google announced a breaking change in the Interactions API where outputs becomes steps, with the new default taking effect on May 20, 2026 and the old schema removed on June 6, 2026.

🔴 🤖 Models May 6, 2026 · 2 min read

OpenAI: GPT-5.5 Instant becomes the new default ChatGPT model with fewer hallucinations

Editorial illustration: ChatGPT interface labeled GPT-5.5 Instant as the new default model on a blue background

GPT-5.5 Instant is the new default ChatGPT model introduced by OpenAI on May 5, 2026. The model delivers smarter, more precise responses, a reduced hallucination rate, and improved personalization — accompanied by a new system card.

🟡 🤖 Models May 6, 2026 · 2 min read

arXiv:2605.03871: EvoLM — language models that improve themselves without external supervision

Editorial illustration: two language models in a feedback loop exchanging scores and improvements without an external supervisor

EvoLM is a post-training method that eliminates external supervision — a Qwen3-8B rubric generator outperforms GPT-4.1 on RewardBench-2 by 25.7% and SkyWork-RM by 16%, while the trained policy reaches 69.3% on the OLMo3-Adapt benchmark.

🟡 🤖 Models May 6, 2026 · 2 min read

Google: Gemini API File Search expanded to multimodal image and text search

Editorial illustration: Gemini API combining images and text in a shared semantic search through an embedding model.

Google expanded File Search in the Gemini API to multimodal search, enabling native embedding and retrieval of images alongside text documents through the gemini-embedding-2 model. Two new grounding fields and event-driven webhook support for the Batch API were also added.

🟡 🤖 Models May 6, 2026 · 2 min read

Microsoft Research: DroidSpeak shares KV cache across fine-tuned LLM variants for 4× higher throughput

Editorial illustration: diagram of KV cache sharing across multiple fine-tuned variants of the same base LLM in a data center.

Microsoft Research presented DroidSpeak at NSDI 2026 — a system that shares KV cache across architecturally identical fine-tuned LLM variants, achieving up to 4× higher throughput with minimal quality loss in enterprise scenarios running dozens of domain-specific models.

🟡 🤖 Models May 5, 2026 · 3 min read

arXiv AgentFloor: small open-weight models (0.27B–32B) are sufficient for short-horizon agent tasks; GPT-5 retains advantage only in long-horizon planning

Editorial illustration: capability ladder with models of different sizes on different rungs, symbolizing tool-use evaluation

Ranit Karmakar and Jayita Chatterjee presented AgentFloor — a deterministic network of 30 tasks organized across six capability levels, on which they evaluated 16 open-weight models ranging from 0.27 to 32 billion parameters plus GPT-5. Conclusion: smaller models are sufficient for short-horizon, structured agent tasks, while frontier models retain a clear advantage only in long-horizon planning under persistent constraints.

🟡 🤖 Models May 5, 2026 · 3 min read

arXiv Token Arena: continuous benchmark unifying energy and cognition reveals 6.2× difference in joules per correct answer across endpoints

Editorial illustration: scale measuring energy and cognition of AI inference endpoints, symbolizing multi-dimensional benchmarking

On May 1, 2026, Yuxuan Gao, Megan Wang, and Yi Ling Yu published Token Arena, a continuous benchmarking platform that evaluates AI inference at the endpoint level (78 endpoints, 12 model families). They find that the same model across different endpoints can vary by up to 12.5 points on math/code benchmarks, by up to an order of magnitude in tail latency, and by a factor of 6.2 in joules per correct answer. Results are published under CC BY 4.0.

🟡 🤖 Models May 5, 2026 · 2 min read

NIST CAISI: DeepSeek V4 Pro is the most capable Chinese AI model to date, but trails US frontier by 8 months

Editorial illustration: AI model on a timeline marking an 8-month gap, symbolizing an independent evaluation

On May 1, 2026, the US Center for AI Standards and Innovation (CAISI) at NIST published an independent evaluation of the DeepSeek V4 Pro model. Conclusion: it is the most capable evaluated PRC AI model to date, but lags behind the US frontier by approximately 8 months in aggregate capabilities. The evaluation used non-public benchmarks across five domains: cybersecurity, software engineering, natural sciences, abstract reasoning, and mathematics.

🟢 🤖 Models May 5, 2026 · 3 min read

arXiv:2605.02572: Long Horizons Destabilize LLM Training — ICML 2026 Paper Offers 'Horizon Generalization' as a Solution

Editorial illustration: a cracked horizontal line with neural nodes and data flows converging

An ICML 2026 accepted paper empirically demonstrates that increasing task horizon length causes serious LLM training instability due to exploration and credit assignment problems. The proposed solution: shortening the horizon during training with an explicit 'horizon generalization' mechanism at inference. The paper establishes the first empirical scaling rules for task horizon in frontier model training.

🟢 🤖 Models May 4, 2026 · 2 min read

AdaMeZO: Adam-style LLM fine-tuning without storing gradient moments in GPU memory

Editorial illustration: AdaMeZO: Adam-style LLM fine-tuning without storing gradient moments in GPU memory

AdaMeZO is a zeroth-order optimizer that combines the advantages of the Adam algorithm with the memory efficiency of the MeZO approach for fine-tuning large language models. It uses only forward passes and achieves up to 70% fewer passes compared to MeZO, with improved convergence.
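To show what "only forward passes" means, here is a minimal sketch of a plain zeroth-order (SPSA-style) gradient estimate of the kind MeZO builds on. It does not include AdaMeZO's Adam-style moment handling, which the summary does not detail; the quadratic loss, learning rate, and step count are assumptions.

```python
import numpy as np

def zo_gradient(loss_fn, params, eps, seed):
    """Estimate the gradient from two loss evaluations along one random direction."""
    # MeZO-style methods regenerate z from the seed instead of keeping it in
    # memory; this toy version simply materializes it.
    z = np.random.default_rng(seed).standard_normal(params.shape)
    loss_plus = loss_fn(params + eps * z)
    loss_minus = loss_fn(params - eps * z)
    return (loss_plus - loss_minus) / (2.0 * eps) * z

def loss_fn(w):
    """Toy quadratic loss standing in for a model's forward pass."""
    target = np.array([1.0, -2.0, 0.5])
    return float(np.sum((w - target) ** 2))

params = np.zeros(3)
initial_loss = loss_fn(params)
for step in range(2000):
    params = params - 0.01 * zo_gradient(loss_fn, params, eps=1e-3, seed=step)
final_loss = loss_fn(params)
print(initial_loss, final_loss)
```

No backward pass or stored activations are needed, which is the source of the memory savings; the cost is a noisy estimate that typically needs many more steps than first-order training.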

🟢 🤖 Models May 4, 2026 · 2 min read

BWLA: 1-bit LLM quantization with 3.26× speedup and 70% better results (ACL 2026)

Editorial illustration: BWLA: 1-bit LLM quantization with 3.26× speedup and 70% better results (ACL 2026)

BWLA is a new post-training quantization framework for large language models that for the first time achieves simultaneous 1-bit weight precision and low-bit activations without significant accuracy loss. On the Qwen3-32B model it reaches a perplexity of 11.92 and a 3.26× speedup compared to previous methods.

🟡 🤖 Models May 2, 2026 · 3 min read

Latent-GRPO: Stable RL Optimization for Latent Reasoning — 7.86 Points on GSM8K-Aug and 4.27 Points on AIME With 3-4× Shorter Reasoning Chains

Editorial illustration: compression of a reasoning network into a condensed latent space

Researchers introduce Latent-GRPO, a stabilized RL approach for latent reasoning in which reasoning steps are compressed into continuous representations. They identify three fundamental problems with directly applying GRPO in latent space — invalid latent states, misalignment between the reward signal and token updates, and invalid averaged states — and address them through a combination of invalid-sample advantage masking, one-sided noise sampling and optimal correct-path first-token selection. Results: +7.86 Pass@1 on GSM8K-Aug and +4.27 points on AIME, with 3-4× shorter reasoning chains.

🟡 🤖 Models May 2, 2026 · 2 min read

GitHub is retiring GPT-5.2 and GPT-5.2-Codex from Copilot on June 1, 2026 — migration to GPT-5.5 and GPT-5.3-Codex

Editorial illustration: GitHub Copilot dashboard with a new model replacing the old one

GitHub announces the retirement of GPT-5.2 and GPT-5.2-Codex from all Copilot experiences on June 1, 2026. Chat, inline edit, ask and agent mode, and code completion users will move to GPT-5.5, while Codex users will move to GPT-5.3-Codex. The exception is Copilot Code Review, where GPT-5.2-Codex remains available. Enterprise administrators must manually enable the new models in model policies before the deadline.

🟡 🤖 Models May 2, 2026 · 3 min read

NIST CAISI evaluation of DeepSeek V4 Pro: 8-month lag behind frontier US models across 9 benchmarks in 5 domains

Editorial illustration: a scale comparing AI models above a geopolitical map

The Center for AI Standards and Innovation at NIST (CAISI) has published an independent evaluation of the Chinese model DeepSeek V4 Pro across 9 benchmarks in 5 domains (cybersecurity, software engineering, natural sciences, abstract reasoning, mathematics). Key finding: V4 lags 8 months behind frontier US models, particularly on reasoning and agentic tasks that DeepSeek did not include in its own technical report. Cost of use is lower than GPT-5.4 mini in 5 of 7 tests.

🟢 🤖 Models May 2, 2026 · 2 min read

KellyBench: AI agents managing a betting bankroll through the Premier League season — all leading models lost money

Editorial illustration: football stadium with digital odds analysis

KellyBench is a new benchmark for testing sequential decision-making: AI agents manage a betting bankroll through the entire 2023/24 Premier League season, using statistics, lineups, and market odds. All leading models tested lost money, and Claude Opus 4.6 scored 26.5% on the expert rubric for strategy sophistication.

🔴 🤖 Models May 1, 2026 · 3 min read

PyTorch SMG: CPU-GPU disaggregation in LLM serving delivers 3.5× output throughput for Llama 3.3 70B FP8, already in production on Google Cloud, Oracle, and Alibaba

Editorial illustration: server rack with GPUs and a separate CPU gateway layer connecting them via gRPC network

LightSeek Foundation presented Shepherd Model Gateway (SMG) on the PyTorch blog on April 30, 2026 — a Rust gateway that moves CPU-bound tasks (tokenization, MCP orchestration, chat history, multimodal preprocessing) out of the GPU process into a separate gRPC layer. Llama 3.3 70B FP8 achieves 1,150 vs 327 output tokens/s (3.5× throughput), and the solution is already in production on Google Cloud, Oracle Cloud, Alibaba Cloud, and TogetherAI.

🟡 🤖 Models May 1, 2026 · 2 min read

AstaBench Spring 2026: Claude Opus 4.7 leads with 58% in scientific AI benchmark, GPT-5.5 half the cost

Editorial illustration: leaderboard table with AI model performance graphs on scientific tasks, neutral laboratory aesthetic

The Allen Institute published the updated AstaBench leaderboard with 2,400 problems for AI agents in science. Claude Opus 4.7 leads with 58.0%, while GPT-5.5 achieves 52.9% at half the cost per problem. Key finding: strong results on individual tasks do not automatically translate to robust end-to-end scientific work.

🟢 🤖 Models May 1, 2026 · 2 min read

Anthropic closes 1M context beta for Sonnet 4.5 and Sonnet 4 — migration to 4.6 required

Editorial illustration: migration arrow between two API version blocks, minimalist technical aesthetic

Anthropic closed the beta header for the million-token context window on Claude Sonnet 4.5 and Sonnet 4 on April 30, 2026. Requests exceeding 200,000 tokens now return an error. Users must migrate to Sonnet 4.6 or Opus 4.6, where the 1M context window is available without a beta header.
