arXiv:2606.20561: TimeProVe Reduces Long-Video Reasoning Inference Costs by 93%
TimeProVe is a framework that speeds up vision-language model (VLM) inference over long videos with a two-stage propose-then-verify approach. It cuts calls to expensive models by 75% and total inference cost by 93%, while outperforming the strongest competing baseline by 7.3 percentage points on the new OpenTSUBench benchmark.
This article was generated using artificial intelligence from primary sources.
Expensive Video QA Models Now Called 4× Less Frequently
Researchers Arkaprava Sinha, Dominick Reilly, Siddharth Krishnan, Hieu Le, and Srijan Das submitted a paper on June 18, 2026 introducing TimeProVe, a framework for efficient temporal reasoning over long videos. The paper was published on arXiv (2606.20561) on June 19.
How the Two-Stage “Propose Then Verify” Approach Works
TimeProVe splits the classic Video QA task into two stages. A lightweight module first generates answer hypotheses without calling the expensive model. The ACE (Action-based Candidate Evidence) module then selects relevant evidence and forwards it to the expensive vision-language model (VLM) exclusively for the verification phase. This approach reduces the number of VLM calls by 75% and total inference cost by 93% compared to methods that invoke the expensive model at every step.
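The control flow described above can be illustrated with a minimal Python sketch. It is not the authors' implementation: the `Segment` and `Hypothesis` types, the keyword-overlap evidence selection (a crude stand-in for the ACE module), and the `stub_vlm` verifier are all hypothetical names and simplifications introduced here for illustration. The point is only that the expensive model is invoked once per surviving hypothesis rather than once per segment.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional, Sequence, Tuple


@dataclass
class Segment:
    start_s: float
    end_s: float
    caption: str  # assumed cheap per-segment description (hypothetical)


@dataclass
class Hypothesis:
    answer: str
    evidence: List[Segment]


def propose(segments: Sequence[Segment], candidates: Sequence[str]) -> List[Hypothesis]:
    """Lightweight stage: attach evidence to each candidate answer.

    Keyword overlap is a placeholder for evidence selection, not the ACE module.
    """
    hypotheses = []
    for answer in candidates:
        words = set(answer.lower().split())
        evidence = [s for s in segments if words & set(s.caption.lower().split())]
        if evidence:  # candidates with no supporting segment are dropped
            hypotheses.append(Hypothesis(answer, evidence))
    return hypotheses


def verify(
    question: str,
    hypotheses: Sequence[Hypothesis],
    expensive_vlm: Callable[[str, str, List[Segment]], bool],
) -> Tuple[Optional[str], int]:
    """Expensive stage: call the VLM only on proposed hypotheses, best-supported first."""
    calls = 0
    for hyp in sorted(hypotheses, key=lambda h: len(h.evidence), reverse=True):
        calls += 1
        if expensive_vlm(question, hyp.answer, hyp.evidence):
            return hyp.answer, calls
    return None, calls


def stub_vlm(question: str, answer: str, evidence: List[Segment]) -> bool:
    """Stand-in for a real VLM call; accepts one fixed answer."""
    return answer == "pours water"


segments = [
    Segment(0, 30, "person opens fridge"),
    Segment(30, 60, "person pours water into glass"),
    Segment(60, 90, "person drinks water"),
]
candidates = ["pours water", "washes dishes", "opens fridge"]

hypotheses = propose(segments, candidates)
answer, calls = verify("What does the person do after opening the fridge?", hypotheses, stub_vlm)
print(answer, calls)
```

In this toy run the expensive verifier is called once, whereas a per-segment approach would call it three times. The ratio here is an artifact of the example and says nothing about the savings reported in the paper.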
OpenTSUBench: A New Benchmark for Everyday Activities
The authors also introduce OpenTSUBench, a benchmark for temporally grounded reasoning over Activities of Daily Living (ADL). On this benchmark, TimeProVe surpasses the previously strongest baseline by 7.3 percentage points, indicating that the cost reduction does not come at the expense of accuracy.
Why It Matters
Previous VLM approaches for long videos were either expensive (calling the model for every frame/segment) or sacrificed accuracy through coarse sampling. TimeProVe demonstrates that these two goals are not mutually exclusive: by intelligently dividing work between a lightweight and an expensive model, it is possible to achieve both better accuracy and dramatically lower costs, paving the way for practical VLM application over hour-long videos in real-world systems.
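To put the reported percentages in multiplicative terms, a reduction of r corresponds to a factor of 1 / (1 - r). The short Python sketch below applies that conversion to the two figures quoted in this article; the helper name `reduction_to_factor` is introduced here for illustration only.

```python
def reduction_to_factor(reduction: float) -> float:
    """Convert a fractional reduction (0.75 means 75% fewer) into an 'N times fewer' factor."""
    if not 0.0 <= reduction < 1.0:
        raise ValueError("reduction must be in [0, 1)")
    return 1.0 / (1.0 - reduction)


call_factor = reduction_to_factor(0.75)  # 75% fewer expensive-model calls
cost_factor = reduction_to_factor(0.93)  # 93% lower total inference cost

print(f"{call_factor:.1f}x fewer expensive-model calls")
print(f"{cost_factor:.1f}x lower total inference cost")
```

This prints 4.0x for calls and 14.3x for cost, so the 93% figure corresponds to roughly a fourteenfold reduction in total inference cost.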
Frequently Asked Questions
- What is TimeProVe and how does it work?
- TimeProVe is a framework in which a lightweight module generates answer hypotheses, the ACE (Action-based Candidate Evidence) module selects the relevant evidence, and the expensive VLM is called only to verify those hypotheses, dramatically reducing the number of costly model calls.
- What is OpenTSUBench?
- OpenTSUBench is a new benchmark for temporally grounded reasoning over Activities of Daily Living (ADL), introduced by the authors alongside the TimeProVe method.
- By how much does TimeProVe outperform the previous best approach?
- TimeProVe achieves a 7.3 percentage point improvement over the strongest baseline on OpenTSUBench, while simultaneously reducing inference cost by 93%.