HuggingFace releases Ecom-RLVE-Gym: 8 environments and a 12-axis curriculum for training e-commerce agents with reinforcement learning
Why it matters
On April 16, 2026, the Owlgebra AI team published Ecom-RLVE-Gym on the HuggingFace blog: an open framework with 8 verifiable environments for e-commerce conversational agents, scored by algorithmic rewards rather than an LLM judge. The system combines a catalog of 2 million products, the Qwen 3 8B model, and a 12-axis adaptive curriculum that incrementally raises task difficulty, in response to the limitations of supervised fine-tuning in complex multi-step workflows.
The project's full title is Ecom-RLVE: Adaptive Verifiable Environments for E-Commerce Conversational Agents. The work originated at the PyTorch OpenEnv Hackathon in Cerebral Valley and is authored by Rahul Bajaj, Jaya Nupur, Anuj Garg, Ben Burtenshaw, and seven other contributors.
What problem does it solve?
The authors start from the observation that language fluency does not guarantee task success: an agent can hold a convincing conversation while still missing the purchase goal. Supervised fine-tuning (SFT) cannot cover the combinatorial explosion of constraints and multi-step interactions in real e-commerce: product variants, unavailable quantities, clarification requests, returns, store policies. Their answer is RLVR (Reinforcement Learning with Verifiable Rewards), in which the reward is not an LLM judge's rating but a deterministic check of the final cart state against the expected one.
How does Ecom-RLVE-Gym work?
The gym contains 8 verifiable environments: product discovery, substitution, cart assembly, returns, order tracking, policy questions, bundle planning, and multi-intent sessions. The difficulty of each scenario is controlled along 12 independent axes, including constraints, user omissions, catalog distractors, stock exhaustion, token budget, input noise, context interruptions, search depth, order history, and policy complexity. The adaptive curriculum grows difficulty progressively (C1 ⊂ C2 ⊂ C4 ⊂ C8), avoiding both saturation and starvation.
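The nested-levels idea can be sketched as follows. The axis names come from the article, but the sampling logic and the mapping from levels to axes are illustrative assumptions, not the authors' implementation:

```python
import random

# Illustrative sketch of a nested adaptive curriculum. Each level Ck
# activates the first k perturbation axes, so C1 ⊂ C2 ⊂ C4 ⊂ C8:
# harder levels keep every earlier perturbation and add new ones.
AXES = ["constraints", "user_omissions", "catalog_distractors",
        "stock_exhaustion", "token_budget", "input_noise"]  # 6 of the 12

def sample_scenario(level: int, rng: random.Random) -> dict:
    """Enable the first `level` axes; each contributes an independent knob."""
    return {axis: rng.random() for axis in AXES[:level]}

rng = random.Random(0)
easy = sample_scenario(1, rng)   # only "constraints" is active
hard = sample_scenario(4, rng)   # four axes active
assert set(easy) <= set(hard)    # nesting: the easy level is a subset
```

Nesting the levels this way is what lets a scheduler raise difficulty without discarding skills learned at earlier stages.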
The key innovation is algorithmic reward verification. Instead of an LLM scoring the outcome, the system compares the actual cart state against the expected one on the composite key (product_id, variant_id, qty). The reward function combines an F1 score over cart items, an efficiency term, and a hallucination penalty.
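A minimal sketch of such a verifier, assuming hypothetical product IDs and weights (the article only states that F1, efficiency, and a hallucination penalty are combined):

```python
def cart_f1(pred: dict, gold: dict) -> float:
    """F1 over cart lines keyed by (product_id, variant_id) -> qty."""
    tp = sum(min(q, gold.get(k, 0)) for k, q in pred.items())
    pred_total, gold_total = sum(pred.values()), sum(gold.values())
    if tp == 0 or pred_total == 0 or gold_total == 0:
        return 0.0
    precision, recall = tp / pred_total, tp / gold_total
    return 2 * precision * recall / (precision + recall)

def reward(pred: dict, gold: dict, n_turns: int,
           max_turns: int = 20, hallucinated: int = 0) -> float:
    # Weights are illustrative assumptions, not the authors' values.
    efficiency = max(0.0, 1 - n_turns / max_turns)
    return cart_f1(pred, gold) + 0.2 * efficiency - 0.5 * hallucinated

gold = {("B001", "red-M"): 2, ("B002", "std"): 1}   # hypothetical IDs
pred = {("B002", "std"): 1, ("B001", "red-M"): 2}   # same cart, any order
print(round(reward(pred, gold, n_turns=5), 2))      # → 1.15
```

Because the check is a pure function of the final cart, two runs with the same outcome always receive the same reward, which removes the variance and bias an LLM judge would introduce.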
Technical training details
Training was conducted on the Qwen 3 8B model using the DAPO algorithm with G=8 rollouts and a learning rate of 1e-5. The user simulator is Qwen 3.5 (9.7B), which strategically omits parts of its queries to force the agent to ask clarifying questions. The catalog contains 2 million products indexed with FAISS using Alibaba-NLP/gte-modernbert-base embeddings (768 dimensions). After 300 training steps, the authors report progressive improvement across difficulty levels, supporting the thesis that scaling environments, not just models, produces measurable gains on specialized tasks.
Everything is public: the code is on GitHub (owlgebra-ai/EcomRLVE-Gym), the dataset on HuggingFace (owlgebra-ai/Amazebay-catalog-2M), and an interactive demo is available for browser-based testing. The work is currently the most comprehensive open benchmark for RL training of e-commerce conversational agents.
This article was generated using artificial intelligence from primary sources.