CVPR 2026
ReViSe — Reasoning with Video Sparsity
We were recently notified by email of a potential issue with our previously reported VideoEspresso results, which we are currently investigating. We are actively working on this and anticipate a corrected evaluation by July 2026; the full code and the updated results table will be released afterward. In the meantime, for any questions regarding reproducing our results, please email cxu@u.northwestern.edu.
We present ReViSe (Reasoning with Video Sparsity), a multi-round agent for video question answering (VQA). Instead of uniformly sampling frames, ReViSe selects a small set of informative frames, maintains a summary-as-state across rounds, and stops early when confident. It supports proprietary vision-language models (VLMs) in a “plug-and-play” setting and enables reinforcement fine-tuning for open-source models. For fine-tuning, we introduce EAGER (Evidence-Adjusted Gain for Efficient Reasoning), an annotation-free reward with three terms: (1) Confidence gain: after new frames are added, we reward the increase in the log-odds gap between the correct option and the strongest alternative; (2) Summary sufficiency: at answer time we re-ask using only the last committed summary and reward success; (3) Correct-and-early stop: answering correctly within a small turn budget is rewarded. Across multiple VQA benchmarks, ReViSe improves accuracy while reducing frames, rounds, and prompt tokens, demonstrating practical sparse video reasoning.
ReViSe operates analogously to a recurrent neural network: it maintains a compact state that propagates information from previous rounds to the VLM, without re-admitting raw frames or conversation history. This is the load-bearing design choice that keeps the context compact even as evidence accumulates.
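The recurrent analogy can be made concrete with a minimal sketch. This is not the released implementation: `StepResult`, `VLMStep`, `run_agent`, and the `max_rounds` default are hypothetical names and values chosen for illustration. The only property it is meant to show is that each round receives the question, the carried summary, and the current frames, and nothing else.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional, Tuple


@dataclass
class StepResult:
    """Output of one round (hypothetical container, not from the paper)."""
    summary: Optional[str] = None            # new committed summary, if any
    next_frames: Optional[List[int]] = None  # frame indices requested next
    answer: Optional[str] = None             # set only when the agent stops


# Hypothetical VLM call: (question, summary, frame indices) -> StepResult.
VLMStep = Callable[[str, str, List[int]], StepResult]


def run_agent(
    step: VLMStep,
    question: str,
    first_frames: List[int],
    max_rounds: int = 4,  # assumed budget, for illustration only
) -> Tuple[Optional[str], int, str]:
    """Run rounds until an answer is given or the round budget is spent."""
    summary = ""                 # the only state carried between rounds
    frames = list(first_frames)  # frames shown in the current round only
    for round_idx in range(1, max_rounds + 1):
        # No earlier frames and no chat history are passed in.
        result = step(question, summary, frames)
        if result.summary is not None:
            summary = result.summary
        if result.answer is not None:
            return result.answer, round_idx, summary
        frames = result.next_frames or []
    return None, max_rounds, summary
```

Because `step` sees only `summary` rather than a growing transcript, the prompt size in this sketch depends on the summary length and the per-round frame count, not on the number of rounds already taken.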
Each round, the agent receives sampled video frames, the question, and its current summary
state. Every response begins with a transient <think> reasoning trace,
then commits a structured summary in the POHR format inside
<summarize>:
| Field | Content |
|---|---|
| P (Previously seen) | What frames have been inspected so far |
| O (Observations) | What was just observed in the current frames |
| H (Hypotheses) | How observations update the current belief |
| U (Uncertainties) | What remains unclear |
| R (Reasons) | Why specific new frames are needed, or why the question is now answerable |
On a non-final round the agent appends a <select> request for new frames;
on the final round it emits only <think> and <answer>,
reusing the last committed summary. Only the <summarize> state persists
between rounds — the <think> trace is transient and raw frames are
never re-admitted — keeping the context compact.
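As a rough illustration of the response protocol described above, the sketch below splits a response into its tagged parts and keeps only the summary as state. Several details are assumptions rather than facts from the page: that each tag has a matching XML-style closing tag (for example `</summarize>`), that `<select>` lists frame indices as integers, and the helper names `parse_response`, `parse_select`, and `carry_state`.

```python
import re
from typing import Dict, List

# Assumption: every tag is closed XML-style, e.g. <think>...</think>.
TAG_RE = re.compile(r"<(think|summarize|select|answer)>(.*?)</\1>", re.DOTALL)


def parse_response(text: str) -> Dict[str, str]:
    """Map each tag name found in the response to its stripped body."""
    return {m.group(1): m.group(2).strip() for m in TAG_RE.finditer(text)}


def parse_select(body: str) -> List[int]:
    """Read requested frame indices (assumed to be plain integers)."""
    return [int(tok) for tok in re.findall(r"\d+", body)]


def carry_state(parsed: Dict[str, str], previous_summary: str) -> str:
    """Keep only the summary; the <think> trace is deliberately dropped.

    On a final round there is no <summarize>, so the last committed
    summary is reused unchanged.
    """
    return parsed.get("summarize", previous_summary)
```

A non-final response would then yield a new summary plus a frame request, while a final response yields an answer and leaves the summary untouched.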
The ReViSe agent loop: sample frames → update the POHR summary → request targeted frames or commit an answer.
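Plug-and-play: wraps any VLM, including proprietary APIs such as GPT-4o, as a frozen black box. No parameter updates are needed.
Reinforcement fine-tuning: GRPO with the EAGER reward, which combines three annotation-free terms: confidence gain, summary sufficiency, and correct-and-early stop.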
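The three EAGER terms can be sketched as a single scalar reward. This is one possible reading under stated assumptions, not the authors' formula: the gap is taken as a difference of log-probabilities, only increases in the gap are rewarded, the terms are summed with weights, and the names `log_odds_gap`, `eager_reward`, `turn_budget`, and the `w_*` weights are all hypothetical.

```python
import math
from typing import Dict, Sequence


def log_odds_gap(probs: Dict[str, float], correct: str) -> float:
    """Log-probability of the correct option minus the strongest alternative."""
    eps = 1e-9  # avoids log(0) for options with zero probability
    best_other = max(p for opt, p in probs.items() if opt != correct)
    return math.log(probs[correct] + eps) - math.log(best_other + eps)


def eager_reward(
    round_probs: Sequence[Dict[str, float]],  # option probabilities per round
    correct: str,
    summary_only_correct: bool,  # re-ask with only the last summary succeeded
    final_correct: bool,
    rounds_used: int,
    turn_budget: int = 3,        # assumed value
    w_gain: float = 1.0,         # assumed weights
    w_suff: float = 1.0,
    w_early: float = 1.0,
) -> float:
    gaps = [log_odds_gap(p, correct) for p in round_probs]
    # (1) Confidence gain: sum of positive gap increases between rounds
    #     (clipping at zero is an assumption of this sketch).
    gain = sum(max(0.0, after - before) for before, after in zip(gaps, gaps[1:]))
    # (2) Summary sufficiency: the summary alone supports the right answer.
    sufficiency = 1.0 if summary_only_correct else 0.0
    # (3) Correct-and-early stop: right answer within the turn budget.
    early = 1.0 if (final_correct and rounds_used <= turn_budget) else 0.0
    return w_gain * gain + w_suff * sufficiency + w_early * early
```

None of the three inputs needs frame-level or rationale annotations beyond the ground-truth option, which is consistent with the annotation-free description; how the per-round option probabilities are obtained from the model is left open here.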
ReViSe trajectories on NExT-QA clips. Each round the agent sees a few frames plus
its carried summary, then either updates the summary and requests more frames, or — on the
final round — gives a brief <think> and the <answer>.
Example videos are from the NExT-QA dataset — J. Xiao, X. Shang, A. Yao, T.-S. Chua, “NExT-QA: Next Phase of Question-Answering to Explaining Temporal Actions,” CVPR 2021.
ReViSe matches or beats prior methods while using far fewer frames, and RL fine-tuning further improves both accuracy and efficiency. Notably, it surpasses the caption-based agent VideoAgent on EgoSchema (60.6 vs 60.2) and is best overall on Video-MME and LVBench, all without a captioner.
| Method | Backbone | Acc. (%) | Frames |
|---|---|---|---|
| VideoTree | GPT-4 | 73.5 | 56 |
| VideoAgent | GPT-4 | 71.3 | 8.2 |
| LLoVi | GPT-3.5 | 67.7 | 90 |
| ProViQ | GPT-3.5 | 64.6 | 60 |
| SeViLA | BLIP-2 | 63.6 | 32 |
| LVNet | GPT-4o | 61.1 | 12 |
| ReViSe | GPT-4o | 63.8 | 8.4 |
Caption-based methods (VideoTree, VideoAgent, LLoVi) reason over LLM-generated captions; ReViSe reasons directly on the frames with a single frozen VLM and no captioner, matching SeViLA (63.8 vs 63.6) and beating LVNet (61.1) with fewer frames (8.4 vs 32 and 12).
| Setting | Qwen2-VL-7B: D | T | C | Avg | Tok/sample | GPT-4o: D | T | C | Avg | Tok/sample |
|---|---|---|---|---|---|---|---|---|---|---|
| VideoAgent (captions) | 75.4 | 60.2 | 59.8 | 65.1 | 4810+3749 | 79.9 | 69.6 | 71.5 | 73.7 | 4025+3749 |
| ReViSe (frames) | 75.7 | 53.9 | 63.7 | 64.4 | 6647 | 78.0 | 67.8 | 76.6 | 74.1 | 3089 |
| ReViSe (frames + caption) | 75.5 | 56.7 | 61.3 | 64.5 | 12421+3749 | 81.8 | 68.7 | 75.8 | 75.4 | 4433+3749 |
| ReViSe (caption-only) | 69.5 | 50.7 | 55.8 | 58.7 | 6441+3749 | 76.7 | 61.7 | 69.7 | 69.4 | 2654+3749 |
D / T / C = Descriptive / Temporal / Causal questions; Tok/sample = text tokens per sample (the “+caption” term is the extra captioning cost). With GPT-4o, ReViSe (frames) reaches 74.1 Avg, above caption-based VideoAgent (73.7), while using far fewer tokens (3089 vs 4025+3749) and no captioner. Adding captions helps only slightly (75.4) at a much higher token cost, and caption-only drops to 69.4: the frames carry the signal, and captions are not the bottleneck.
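The token comparison can be checked with a few lines of arithmetic. The sketch below uses only the GPT-4o Tok/sample entries from the table and assumes that the two parts of an entry such as 4025+3749 can simply be added to give a total per-sample cost; the helper name `total_tokens` is hypothetical.

```python
def total_tokens(cost: str) -> int:
    """Sum the parts of a Tok/sample entry such as '4025+3749'."""
    return sum(int(part) for part in cost.split("+"))


# GPT-4o Tok/sample column, copied from the table above.
gpt4o_costs = {
    "VideoAgent (captions)": "4025+3749",
    "ReViSe (frames)": "3089",
    "ReViSe (frames + caption)": "4433+3749",
    "ReViSe (caption-only)": "2654+3749",
}

totals = {name: total_tokens(cost) for name, cost in gpt4o_costs.items()}

# Fraction of tokens saved by ReViSe (frames) relative to VideoAgent.
saving = 1 - totals["ReViSe (frames)"] / totals["VideoAgent (captions)"]

for name, tokens in totals.items():
    print(f"{name}: {tokens}")
print(f"saving vs VideoAgent: {saving:.1%}")
```

Under that additive assumption, VideoAgent costs 7774 tokens per sample against 3089 for ReViSe (frames), a reduction of roughly 60%.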
| Method | Backbone | Acc. (%) | Frames |
|---|---|---|---|
| LVNet | GPT-4o | 68.2 | 12 |
| VideoTree | GPT-4 | 66.2 | 62.4 |
| MC-ViT-L | ViT-L | 62.6 | 128+ |
| VideoAgent | GPT-4 | 60.2 | 8.4 |
| LLoVi | GPT-3.5 | 57.6 | 180 |
| ReViSe | GPT-4o | 60.6 | 9.8 |
ReViSe surpasses VideoAgent (60.6 vs 60.2) without a captioner, at a fraction of the frames used by caption-/memory-heavy methods (128–180 for MC-ViT-L and LLoVi).
| Method | Video-MME (Acc / Frames) | LVBench (Acc / Frames) |
|---|---|---|
| Adaptive Keyframes (CVPR 25) | 58.4 / 32 | 59.3 / 32 |
| MDP3 (ICCV 25) | 59.6 / 8 | 59.0 / 8 |
| ReViSe | 60.7 / 7.38 | 61.5 / 5.62 |
Under the same backbone, ReViSe is best on both benchmarks while using the fewest frames.
| Method | Acc. (%) | Frames | Rounds | Time (s) |
|---|---|---|---|---|
| Direct Reasoning | 23.6 | 8.0 | 1.00 | 0.88 |
| Plug-and-Play | 31.7 | 5.3 | 1.74 | 1.22 |
| Supervised Format FT | 27.3 | 5.1 | 1.65 | 1.13 |
| Reinforced FT (ReViSe) | 51.3 | 3.9 | 1.32 | 0.62 |
RL fine-tuning yields +19.6pp over plug-and-play with fewer frames, fewer rounds, and ~2× faster inference (0.62 s vs 1.22 s).
VideoEspresso comparisons are being updated — see the note at the top of the page.
More rounds yield better accuracy at lower average frame budgets — the agent learns to stop early when confident.
@article{xu2026towards,
title={Towards Sparse Video Understanding and Reasoning},
author={Xu, Chenwei and Ye, Zhen and Wu, Shang and Li, Weijian and Wang, Zihan and Xia, Zhuofan and Lu, Lie and Maneriker, Pranav and Du, Fan and Li, Manling and others},
journal={arXiv preprint arXiv:2602.13602},
year={2026}
}