🔧 Hardware

19 articles

🟢 🔧 Hardware May 23, 2026 · 4 min read

AMD: Gluon block-level model enables GEMM kernels with 5,255 TFLOPS MXFP4 on Instinct MI355

Editorial illustration: GPU accelerator with matrix unit layout and pipeline flows

The AMD ROCm team published a tutorial on writing high-performance GEMM kernels in the Gluon programming model for the MI355 GPU. An optimized FP16 kernel achieves 1,489 TFLOPS at 98.75 percent MFMA efficiency, while extensions to BF8 (3,257 TFLOPS) and MXFP4 (5,255 TFLOPS) demonstrate relevance for modern AI workloads. The tutorial also covers workgroup remapping and swizzling, which together reduce L2 cache misses from 5.3 M to 4.1 M.
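To give a feel for why workgroup remapping can cut cache misses, here is a minimal Python sketch of grouped tile ordering, a remapping scheme commonly used in tiled GEMM kernels. It is an illustration under stated assumptions, not the code from the AMD tutorial: the function names, the 8×8 tile grid, the group size of 4, and the wave of 16 workgroups are all hypothetical.

```python
from typing import Iterable, Tuple

Tile = Tuple[int, int]


def row_major(pid: int, num_tiles_n: int) -> Tile:
    """Default mapping: consecutive workgroups walk along one output row."""
    return pid // num_tiles_n, pid % num_tiles_n


def grouped(pid: int, num_tiles_m: int, num_tiles_n: int, group_m: int) -> Tile:
    """Grouped mapping: consecutive workgroups cover a band of group_m rows,
    walking down the band before moving to the next column."""
    tiles_per_group = group_m * num_tiles_n
    group_id = pid // tiles_per_group
    first_row = group_id * group_m
    # The last band may be shorter when num_tiles_m is not a multiple of group_m.
    rows_in_group = min(num_tiles_m - first_row, group_m)
    local = pid % tiles_per_group
    return first_row + local % rows_in_group, local // rows_in_group


def footprint(tiles: Iterable[Tile]) -> int:
    """Distinct A row-panels plus B column-panels a set of tiles must load."""
    tiles = list(tiles)
    return len({m for m, _ in tiles}) + len({n for _, n in tiles})


# Hypothetical problem: 8x8 output tiles, 16 workgroups resident at once.
NUM_TILES_M, NUM_TILES_N, GROUP_M, WAVE = 8, 8, 4, 16

naive_wave = [row_major(p, NUM_TILES_N) for p in range(WAVE)]
grouped_wave = [grouped(p, NUM_TILES_M, NUM_TILES_N, GROUP_M) for p in range(WAVE)]

naive_footprint = footprint(naive_wave)
grouped_footprint = footprint(grouped_wave)
print("row-major panels loaded:", naive_footprint)
print("grouped panels loaded:  ", grouped_footprint)
```

In this toy setup the row-major wave touches 10 distinct input panels and the grouped wave touches 8, so more of the data that concurrent workgroups need is shared and can stay resident in L2. The actual miss counts quoted in the teaser depend on the real kernel and hardware.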

🟡 🔧 Hardware May 21, 2026 · 2 min read

AMD: ROCm 7.13 brings MI350P GPU, multi-VF virtualisation and TheRock packaging

Editorial illustration: AMD ROCm 7.13 with MI350P GPU, multi-VF virtualisation and TheRock modular packaging

AMD released ROCm 7.13 on 20 May 2026 — a new version of its open-source AI compute stack that introduces support for the MI350P GPU, virtualisation of up to 8 isolated vGPUs per MI300X accelerator, an open-source ROCprof Trace decoder for transparent performance analysis, and modular TheRock packaging with domain-specific SDKs. The release is validated on Ubuntu 26.04 and RHEL 9.6, and includes VMware ESXi 9.1 support for MI350X and MI355X.

🟢 🔧 Hardware May 16, 2026 · 3 min read

AMD ROCm: BubbleFence partitions video streams using Vision Foundation model embeddings instead of metadata heuristics

Editorial illustration: video frames with embedding bubble visualization in 2D space.

BubbleFence is a new AMD ROCm AI tool, announced on May 15, 2026, that addresses a fundamental ML problem: splitting video streams into train/validation/test sets without semantic leakage. Instead of classic metadata-based heuristics, BubbleFence uses vision foundation model embeddings (CLIP) and adaptive bubbles with LID weighting for partitioning. It was demonstrated on autonomous driving (Zenseact Open Dataset) and Minecraft gameplay scenarios without configuration changes.

🟢 🔧 Hardware May 15, 2026 · 3 min read

AMD ROCm: Kimi-K2.5 W4A8 and W8A8 quantization on MI325X via Quark + FlyDSL + AITER inference stack

Editorial illustration: AMD MI325X GPU with W4A8 quantization layer and inference acceleration icons.

AMD ROCm Kimi-K2.5 quantization for MI325X is a new inference acceleration blueprint published May 14, 2026. It combines the AMD Quark quantization toolkit for converting Kimi-K2.5 models to W4A8 and W8A8 precision formats, the FlyDSL inference serving layer, and the AITER optimization stack. The approach offers a non-NVIDIA inference path for Chinese frontier models and reflects AMD's strategy to establish the MI325X as a viable alternative to H100/H200 for open-source LLM serving.

🟡 🔧 Hardware May 12, 2026 · 2 min read

AMD: Instinct MI355X outperforms NVIDIA B200 on ComfyUI workflows with PyTorch optimizations in ROCm 7.2.0

AMD Instinct MI355X is a data center GPU that outperforms NVIDIA B200 in published benchmarks across three ComfyUI generative workflows — text-to-video Wan2.2 (1.44×), text-to-image FLUX.1-dev (1.42×), and 3D Hunyuan3D v2.1 (1.20×) — thanks to AOTriton gfx950 kernels, hipBLASLt GEMM tuning, and other ROCm 7.2.0 optimizations.

🟡 🔧 Hardware May 12, 2026 · 2 min read

NVIDIA: Fleet Intelligence — managed monitoring of large GPU fleets with cryptographic integrity verification

NVIDIA Fleet Intelligence is a managed service that monitors large fleets of NVIDIA data center GPUs in real time — power, temperature, performance, and ECC errors — with cryptographic GPU authenticity verification through the NVIDIA Remote Attestation Service. The service is free for owners of Vera Rubin, Blackwell, and Hopper GPUs.

🟡 🔧 Hardware May 11, 2026 · 2 min read

vLLM: TurboQuant study shows FP8 remains superior for KV-cache — 3bit-nc drops ~20 pp

TurboQuant is an aggressive KV-cache quantization method at 3-4 bits that the Red Hat AI team systematically compared against the FP8 standard. Results show FP8 retains throughput and accuracy, while 3bit-nc variants lose approximately 20 percentage points on demanding reasoning benchmarks like AIME25.

🔴 🔧 Hardware May 7, 2026 · 3 min read

NVIDIA: Spectrum-X Multipath Reliable Connection becomes OCP open standard for gigascale AI networks

Editorial illustration: parallel fiber optic paths between AI racks with MRC, Spectrum-X and OCP open standard labels

NVIDIA Spectrum-X Multipath Reliable Connection (MRC) is an RDMA transport protocol that distributes a single connection across multiple network paths and has now been published as an open specification through the Open Compute Project. MRC is already in production at OpenAI, Microsoft's Fairwater data center and Oracle's Abilene data center, and was developed in collaboration with AMD, Broadcom, Intel and Microsoft.

🟡 🔧 Hardware May 6, 2026 · 2 min read

AMD: FarSkip-Collective speeds up MoE inference by 18–34% on AMD GPUs

Editorial illustration: parallel data flows between AMD GPUs during MoE inference with no idle blocks.

The AMD ROCm team introduced FarSkip-Collective, a modified MoE architecture that eliminates GPU idle time during Expert Parallelism communication. Results: 18% lower TTFT for Llama-4 Scout, up to 1.34× speedup for DeepSeek-V3, and 11% faster Moonlight pre-training.

🟡 🔧 Hardware May 5, 2026 · 3 min read

ArXiv SAGA: workflow-atomic GPU scheduling for AI agents achieves 1.64× faster task completion on a 64-GPU cluster, accepted at HPDC 2026

Editorial illustration: GPU cluster with connected agent workflows as atomic units, symbolizing scheduling

The team of Dongxin Guo, Jikun Wu, and Siu Ming Yiu presented SAGA on May 1, 2026. SAGA is a workflow-atomic scheduler for AI agents on GPU clusters that treats the entire agent workflow as a single schedulable unit instead of scheduling individual LLM calls. The system achieves a 1.64× geometric mean reduction in task completion time on a 64-GPU cluster and 99.2% SLO attainment under multi-tenant load. The paper was accepted at HPDC 2026 in Cleveland (July 13–16, 2026).
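For readers unfamiliar with the metric, a geometric mean reduction is typically computed from per-workload ratios of baseline to improved completion time. The sketch below shows that calculation in Python; the three workloads and their timings are hypothetical and are not taken from the SAGA paper.

```python
import math
from typing import Sequence


def geometric_mean(ratios: Sequence[float]) -> float:
    """Geometric mean of positive ratios, computed in log space."""
    if not ratios or any(r <= 0 for r in ratios):
        raise ValueError("ratios must be a non-empty sequence of positive numbers")
    return math.exp(sum(math.log(r) for r in ratios) / len(ratios))


# Hypothetical task completion times in seconds for three agent workflows.
baseline_seconds = [120.0, 300.0, 45.0]
scheduled_seconds = [60.0, 200.0, 30.0]

# Per-workflow reduction factor: baseline time divided by improved time.
speedups = [b / s for b, s in zip(baseline_seconds, scheduled_seconds)]
gm_speedup = geometric_mean(speedups)
arithmetic_mean = sum(speedups) / len(speedups)

print("per-workflow speedups:", speedups)
print(f"geometric mean: {gm_speedup:.3f}, arithmetic mean: {arithmetic_mean:.3f}")
```

The geometric mean is the usual choice for averaging ratios because a single outlier workload pulls it up less than it would an arithmetic mean.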

🟢 🔧 Hardware April 25, 2026 · 3 min read

AMD Primus Projection: Tool for Predicting LLM Training Memory and Speed Before Running on Instinct GPU Clusters

Editorial illustration: AMD Primus Projection — LLM training prediction

AMD Primus Projection is a tool that predicts memory requirements and throughput for LLM training on Instinct GPU clusters before a run begins. It uses analytical formulas combined with real GPU benchmarking, and projections fall within ~10% of measured results on MI325X and MI355X accelerators for Llama and Mixtral models.

🟢 🔧 Hardware April 24, 2026 · 3 min read

Google at Cloud Next '26 unveils TPU 8i and TPU 8t: specialized chips for agentic AI computing

Editorial illustration: Google TPU 8i and 8t — specialized AI chips

At Cloud Next '26, Google unveiled two new TPU chips: TPU 8i for AI agent inference and TPU 8t for training the most complex models. The move formalizes the split of Google's TPU line into two specialized branches for what the company calls the 'agentic era' of computing.

🟡 🔧 Hardware April 23, 2026 · 2 min read

NVIDIA and Google Cloud announce collaboration for agentic AI and physical AI on shared infrastructure

Editorial illustration: AI chip — hardware

NVIDIA and Google Cloud announced a joint collaboration to accelerate agentic AI and physical AI workloads, combining NVIDIA GPU infrastructure with the Google Cloud platform for robotics, autonomous systems, and agents.

🟢 🔧 Hardware April 23, 2026 · 3 min read

Gemma 4 runs as a Vision Language Agent locally on Jetson Orin Nano Super

Editorial illustration: AI chip — hardware

NVIDIA and HuggingFace demonstrated Gemma 4 as a Vision Language Agent that autonomously decides when to use the camera and runs the entire pipeline, including speech-to-text and TTS, locally on an NVIDIA Jetson Orin Nano Super with 8 GB of memory, with no cloud dependency.

🔴 🔧 Hardware April 22, 2026 · 3 min read

Google unveils 8th-generation TPU chips: two specialized variants for the agentic AI era

Editorial illustration: Two specialized 8th-generation TPU chips for training and inference of agentic AI workloads

At Cloud Next '26, Google introduced the eighth generation of its TPU chips in two specialized variants — TPU 8t for model training and TPU 8i for agentic inference. This is the first generation purpose-built for autonomous AI agents and multi-step reasoning.

🟡 🔧 Hardware April 21, 2026 · 3 min read

AWS G7e Blackwell Instances: Qwen3-32B on SageMaker for $0.41 per Million Tokens — 4× Cheaper Inference

Editorial illustration of a data center with NVIDIA Blackwell GPUs and GDDR7 memory modules

AWS G7e instances are new SageMaker GPU instances with the NVIDIA RTX PRO 6000 Blackwell chip and 96 GB GDDR7 memory, delivering up to 2.3× better inference performance than G6e. The cost for Qwen3-32B drops from $2.06 to $0.79 per million output tokens, and to $0.41 with EAGLE speculative decoding.

🟡 🔧 Hardware April 16, 2026 · 2 min read

AWS: Speculative Decoding on Trainium Chips Accelerates LLM Inference Up to 3x

Amazon Web Services has published a detailed implementation of speculative decoding on AWS Trainium chips in combination with the vLLM framework, achieving up to 3x faster token generation for decode-heavy workloads. The technique uses a smaller draft model to predict the next N tokens, with a larger target model verifying them in a single pass, which reduces the bottleneck of strictly sequential generation.
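The draft-then-verify loop described above can be sketched in a few lines of Python. This is a toy, greedy-only illustration under stated assumptions: `target_next` and `draft_next` are hypothetical stand-ins for real models, nothing here uses the vLLM or Trainium APIs, and production systems that sample tokens use a probabilistic acceptance rule rather than exact matching.

```python
from typing import Callable, List, Tuple

NextToken = Callable[[List[int]], int]


def target_next(ctx: List[int]) -> int:
    """Hypothetical 'large' model: deterministic next token from the context."""
    return (ctx[-1] * 7 + len(ctx)) % 50


def draft_next(ctx: List[int]) -> int:
    """Hypothetical 'small' model: agrees with the target except at every 4th position."""
    guess = target_next(ctx)
    return guess if len(ctx) % 4 else (guess + 1) % 50


def greedy_decode(model: NextToken, prompt: List[int], n_new: int) -> List[int]:
    """Plain sequential decoding: one model call per generated token."""
    out = list(prompt)
    for _ in range(n_new):
        out.append(model(out))
    return out


def speculative_decode(target: NextToken, draft: NextToken, prompt: List[int],
                       n_new: int, k: int) -> Tuple[List[int], int]:
    """Greedy speculative decoding; returns the tokens and the number of verify passes."""
    out = list(prompt)
    goal = len(prompt) + n_new
    target_passes = 0
    while len(out) < goal:
        # 1. The draft model proposes k tokens, one after another.
        proposal: List[int] = []
        for _ in range(k):
            proposal.append(draft(out + proposal))
        # 2. The target checks every proposed position. A real system scores all
        #    positions in one batched forward pass; this loop only simulates that.
        target_passes += 1
        accepted: List[int] = []
        for tok in proposal:
            expected = target(out + accepted)
            if tok != expected:
                accepted.append(expected)  # replace the first wrong token and stop
                break
            accepted.append(tok)
        else:
            accepted.append(target(out + accepted))  # all k accepted: one bonus token
        out.extend(accepted)
    return out[:goal], target_passes


prompt = [1, 2, 3]
reference = greedy_decode(target_next, prompt, n_new=32)
speculative, passes = speculative_decode(target_next, draft_next, prompt, n_new=32, k=4)
print("outputs match:", speculative == reference)
print("target passes:", passes, "vs sequential target calls:", 32)
```

Because every emitted token is the one the target would have chosen, the output is identical to ordinary greedy decoding with the target model. The speedup comes from needing fewer target passes, and it shrinks as the draft model's agreement rate drops.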

🟢 🔧 Hardware April 16, 2026 · 2 min read

NVIDIA: Blackwell Generates Tokens 35x Cheaper Than Hopper — Cost per Token Is the Only Metric

NVIDIA has published an analysis arguing that cost per token is the only relevant metric for AI infrastructure. A comparison of the Blackwell and Hopper generations shows that Blackwell costs twice as much per GPU hour but generates 65x more tokens per second, resulting in a 35x lower cost per million tokens — $0.12 versus $4.20 for Hopper.
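The arithmetic behind the cost-per-token claim is straightforward to reproduce. The Python sketch below assumes cost per million tokens equals the GPU-hour price divided by tokens generated per hour; the absolute Hopper price and throughput are hypothetical placeholders, and only the 2× price ratio, the 65× throughput ratio, and the two quoted dollar figures come from the teaser.

```python
def cost_per_million_tokens(price_per_gpu_hour: float, tokens_per_second: float) -> float:
    """Dollars per one million generated tokens for a single GPU."""
    tokens_per_hour = tokens_per_second * 3600
    return price_per_gpu_hour / tokens_per_hour * 1_000_000


# Hypothetical baseline values; only the ratios below are taken from the text.
hopper_price = 2.00        # $/GPU-hour (assumed)
hopper_tps = 100.0         # tokens/second (assumed)

blackwell_price = 2 * hopper_price   # "costs twice as much per GPU hour"
blackwell_tps = 65 * hopper_tps      # "generates 65x more tokens per second"

hopper_cost = cost_per_million_tokens(hopper_price, hopper_tps)
blackwell_cost = cost_per_million_tokens(blackwell_price, blackwell_tps)

improvement = hopper_cost / blackwell_cost   # depends only on the two ratios
quoted_ratio = 4.20 / 0.12                   # from the quoted dollar figures

print(f"improvement from rounded ratios: {improvement:.1f}x")
print(f"improvement from quoted prices:  {quoted_ratio:.1f}x")
```

With the rounded multipliers the improvement works out to 65 / 2 = 32.5×, while the quoted $4.20 and $0.12 give 35×. The gap presumably reflects rounding in the headline multipliers rather than a different formula.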

🟡 🔧 Hardware April 10, 2026 · 2 min read

NVIDIA unveils RoboLab benchmark and a new wave of physical AI projects at National Robotics Week

As part of National Robotics Week 2026, NVIDIA has presented a series of new physical AI projects, including RoboLab — a benchmark for simulation-to-reality transfer, collaborations with Toyota Research Institute, Mimic Robotics and Doosan Robotics, and open resources for robot policy evaluation such as Isaac Lab-Arena.