HuggingFace releases Ecom-RLVE-Gym: 8 environments and a 12-axis curriculum for training e-commerce agents with reinforcement learning
Why it matters
On April 16, 2026, the Owlgebra AI team published Ecom-RLVE-Gym on the HuggingFace blog: an open framework with 8 verifiable environments for e-commerce conversational agents, scored by algorithmic rewards rather than an LLM judge. The system combines a catalog of 2 million products, the Qwen 3 8B model, and a 12-axis adaptive curriculum that incrementally raises task difficulty, in response to the limitations of supervised fine-tuning in complex multi-step workflows.
The project's full title is Ecom-RLVE: Adaptive Verifiable Environments for E-Commerce Conversational Agents. The work originated at the PyTorch OpenEnv Hackathon in Cerebral Valley and is authored by Rahul Bajaj, Jaya Nupur, Anuj Garg, Ben Burtenshaw, and seven other contributors.
What problem does it solve?
The authors start from the observation that language fluency does not guarantee task success: an agent can hold a convincing conversation while still missing the purchase goal. Supervised fine-tuning (SFT) cannot cover the combinatorial explosion of constraints and multi-step interactions in real e-commerce: product variants, unavailable quantities, clarification requests, returns, store policies. Their answer is RLVR (Reinforcement Learning with Verifiable Rewards), in which the reward is not an LLM judge's rating but a deterministic check of the final cart state against the expected one.
How does Ecom-RLVE-Gym work?
The gym contains 8 verifiable environments: product discovery, substitution, cart assembly, returns, order tracking, policy questions, bundle planning, and multi-intent sessions. The difficulty of each scenario is controlled along 12 independent axes, including constraints, user omissions, catalog distractors, stock exhaustion, token budget, input noise, context interruptions, search depth, order history, and policy complexity. The adaptive curriculum grows difficulty progressively (C1 ⊂ C2 ⊂ C4 ⊂ C8), avoiding both saturation and starvation.
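The nested-levels idea can be sketched as follows. The axis names come from the article, but the sampling logic and the mapping from levels to axes are illustrative assumptions, not the authors' implementation:

```python
import random

# Illustrative sketch of a nested adaptive curriculum. Each level Ck
# activates the first k perturbation axes, so C1 ⊂ C2 ⊂ C4 ⊂ C8:
# harder levels keep every earlier perturbation and add new ones.
AXES = ["constraints", "user_omissions", "catalog_distractors",
        "stock_exhaustion", "token_budget", "input_noise"]  # 6 of the 12

def sample_scenario(level: int, rng: random.Random) -> dict:
    """Enable the first `level` axes; each contributes an independent knob."""
    return {axis: rng.random() for axis in AXES[:level]}

rng = random.Random(0)
easy = sample_scenario(1, rng)   # only "constraints" is active
hard = sample_scenario(4, rng)   # four axes active
assert set(easy) <= set(hard)    # nesting: the easy level is a subset
```

Nesting the levels this way is what lets a scheduler raise difficulty without discarding skills learned at earlier stages.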
The key innovation is algorithmic reward verification. Instead of an LLM scoring the outcome, the system compares the actual cart state against the expected one on the composite key (product_id, variant_id, qty). The reward function combines an F1 score over cart items, an efficiency term, and a hallucination penalty.
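A minimal sketch of such a verifier, assuming hypothetical product IDs and weights (the article only states that F1, efficiency, and a hallucination penalty are combined):

```python
def cart_f1(pred: dict, gold: dict) -> float:
    """F1 over cart lines keyed by (product_id, variant_id) -> qty."""
    tp = sum(min(q, gold.get(k, 0)) for k, q in pred.items())
    pred_total, gold_total = sum(pred.values()), sum(gold.values())
    if tp == 0 or pred_total == 0 or gold_total == 0:
        return 0.0
    precision, recall = tp / pred_total, tp / gold_total
    return 2 * precision * recall / (precision + recall)

def reward(pred: dict, gold: dict, n_turns: int,
           max_turns: int = 20, hallucinated: int = 0) -> float:
    # Weights are illustrative assumptions, not the authors' values.
    efficiency = max(0.0, 1 - n_turns / max_turns)
    return cart_f1(pred, gold) + 0.2 * efficiency - 0.5 * hallucinated

gold = {("B001", "red-M"): 2, ("B002", "std"): 1}   # hypothetical IDs
pred = {("B002", "std"): 1, ("B001", "red-M"): 2}   # same cart, any order
print(round(reward(pred, gold, n_turns=5), 2))      # → 1.15
```

Because the check is a pure function of the final cart, two runs with the same outcome always receive the same reward, which removes the variance and bias an LLM judge would introduce.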
Technical training details
Training was conducted on the Qwen 3 8B model using the DAPO algorithm with G=8 rollouts and a learning rate of 1e-5. The user simulator is Qwen 3.5 (9.7B), which strategically omits parts of its queries to force the agent to ask clarifying questions. The catalog contains 2 million products indexed with FAISS using Alibaba-NLP/gte-modernbert-base embeddings (768 dimensions). After 300 training steps, the authors report progressive improvement across difficulty levels, supporting the thesis that scaling environments, not just models, produces measurable gains on specialized tasks.
Everything is public: the code is on GitHub (owlgebra-ai/EcomRLVE-Gym), the dataset on HuggingFace (owlgebra-ai/Amazebay-catalog-2M), and an interactive demo is available for browser-based testing. The work is currently the most comprehensive open benchmark for RL training of e-commerce conversational agents.
This article was generated using artificial intelligence from primary sources.