Amazon Nova 2 Lite with Reinforcement Fine-Tuning achieves 4.33/5.0 and outperforms Claude Sonnet 4.5 on automated legal contract review
Reinforcement Fine-Tuning (RFT) is a training method in which a language model acts as a judge (LLM-as-Judge) and provides feedback in place of expensive manual labeling. Amazon Nova 2 Lite achieved an aggregate score of 4.33/5.0 and a perfect JSON validation score of 1.00/1.00, outperforming Claude Sonnet 4.5 and Claude Haiku 4.5 on automated legal contract review.
On April 30, 2026, AWS published a detailed guide demonstrating how Reinforcement Fine-Tuning (RFT) via the Nova Forge SDK can align a specialized model with domain requirements without costly manual labeling. The demonstrated use case — automated legal contract review with a generated list of risks, comment types, and recommended actions in strictly structured JSON — places Amazon Nova 2 Lite ahead of larger Anthropic models in the same evaluation.
What is RFT and how does it differ from classic RLHF?
RFT (Reinforcement Fine-Tuning) is a form of Reinforcement Learning with AI Feedback (RLAIF) in which the reward signal comes from another LLM acting as a judge. Instead of humans manually labeling thousands of “better/worse” response pairs, the judge model scores each output along several dimensions of a predefined rubric, and the trained model learns to maximize those scores. The AWS implementation uses an off_policy_async rollout strategy with 8 generations per sample, up to 16,000 output tokens, a global batch size of 64, and 516 training steps in total.
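The reported hyperparameters can be collected into a single configuration sketch. The structure and key names below are illustrative assumptions, not the actual Nova Forge SDK API; only the numeric values come from the article.

```python
# Hypothetical RFT training configuration mirroring the values reported by AWS.
# Key names are invented for illustration; the real Nova Forge SDK may differ.
rft_config = {
    "rollout": {
        "strategy": "off_policy_async",   # asynchronous off-policy rollouts
        "generations_per_sample": 8,      # 8 candidate outputs per prompt
        "max_output_tokens": 16_000,      # generation length cap
    },
    "training": {
        "global_batch_size": 64,
        "total_steps": 516,
        "checkpoint_interval": 32,        # checkpoints saved every 32 steps
    },
}
```

With 8 generations per sample, each training step scores a group of candidate outputs against the rubric, so the model receives a relative learning signal without any human-labeled preference pairs.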
Why does LLM-as-Judge outperform larger baseline models?
On a strictly structured legal comment extraction task, large generalist models tend to vary their output format, while a smaller model fine-tuned against a targeted rubric learns to produce output that reliably passes schema validation. AWS reports that Nova 2 Lite scored 1.00/1.00 on JSON schema validation and 4.33/5.0 aggregate across three dimensions: TargetDocument_Grounding, Reference_Consistency, and Actionability. Claude Sonnet 4.5 and Claude Haiku 4.5 both fell below that level, suggesting that the precision of the judge's rubric can matter more than the size of the baseline model.
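The two-part evaluation described above can be sketched as a strict JSON gate plus an average over the three rubric dimensions. The required keys and the equal weighting are assumptions for illustration; only the dimension names come from the AWS report.

```python
import json

# Illustrative scoring sketch: JSON schema validation as a binary gate,
# plus three rubric dimensions (names from the AWS evaluation) averaged
# into an aggregate. The schema keys and weighting are assumed.
REQUIRED_KEYS = {"risks", "comment_type", "recommended_action"}  # hypothetical schema

def score_output(raw: str, dims: dict) -> dict:
    """Return a binary JSON-validity score and the mean rubric score."""
    try:
        obj = json.loads(raw)
        valid = isinstance(obj, dict) and REQUIRED_KEYS <= obj.keys()
    except json.JSONDecodeError:
        valid = False
    aggregate = sum(dims.values()) / len(dims)  # each dimension on a 1-5 scale
    return {"json_valid": 1.0 if valid else 0.0, "aggregate": round(aggregate, 2)}

sample = '{"risks": ["indemnity cap"], "comment_type": "risk", "recommended_action": "revise"}'
dims = {"TargetDocument_Grounding": 4.5, "Reference_Consistency": 4.0, "Actionability": 4.5}
print(score_output(sample, dims))  # → {'json_valid': 1.0, 'aggregate': 4.33}
```

A malformed or incomplete response scores 0.0 on the validity gate regardless of its rubric scores, which is exactly the behavior a format-sensitive extraction pipeline needs.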
Training configuration and infrastructure
The system runs in a serverless environment: judge and rollout calls are handled by Lambda with a 15-minute timeout and provisioned concurrency of 100, and checkpoints are saved every 32 steps. The authors (Hemanth Kumar Jayakumar, Ajit Kumar K.P., Bharathan Balaji, and Daniel Suarez) explicitly note that Boolean scoring of individual dimensions is more reliable than a 1–10 scale because it reduces judgment variance.
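The authors' preference for Boolean scoring can be illustrated with a minimal sketch: each dimension gets a yes/no verdict rather than a 1–10 score, narrowing the judge's decision space. The verdict names below are invented paraphrases of the article's dimensions.

```python
# Sketch of Boolean per-dimension judging, as the authors recommend.
# Each dimension is a yes/no verdict, which reduces variance compared to
# asking the judge for a fine-grained 1-10 score. Verdict names are illustrative.
def boolean_reward(verdicts: dict) -> float:
    """Map Boolean dimension verdicts to a scalar reward in [0, 1]."""
    return sum(verdicts.values()) / len(verdicts)

verdicts = {
    "grounded_in_target_document": True,
    "references_consistent": True,
    "action_is_concrete": False,
}
```

Here two of three verdicts pass, giving a reward of 2/3; the judge only ever decides "yes" or "no" per dimension, so repeated judgments of the same output are far more likely to agree.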
Implications for enterprise deployment
RFT with LLM-as-Judge enables teams without a budget for manual labeling to specialize smaller (and cheaper) models for narrowly defined domains such as legal, financial, or medical extraction. If the results are reproducible across other verticals, this signals that fine-tuning workflows are entering a phase where small specialist models can routinely outperform frontier baselines on targeted tasks.
Frequently Asked Questions
- What is Reinforcement Fine-Tuning (RFT) in the Nova Forge SDK?
- RFT is a form of Reinforcement Learning with AI Feedback (RLAIF) where an LLM judge assigns multi-dimensional scores to generated outputs, and the model learns to maximize those scores without the need for manually labeled data.
- How did Nova 2 Lite compare to Claude models in this evaluation?
- On the legal contract review task, Nova 2 Lite scored 4.33/5.0, outperforming both Claude Sonnet 4.5 and Claude Haiku 4.5, achieving the highest overall performance of all evaluated models.
- Which judge model was used during training?
- GPT OSS 120B was used as the judge model for training rollouts, while evaluation allows for a heavyweight tier (Nova Pro, Claude Opus, Claude Sonnet) or a lightweight tier (Nova 2 Lite, Claude Haiku).
This article was generated using artificial intelligence from primary sources.