Towards Sparse Video Understanding and Reasoning

CVPR 2026

ReViSe: Reasoning with Video Sparsity

1 Northwestern University    2 Johns Hopkins University    3 Dolby Laboratories
⚠ Reproducibility note

We were recently notified by email of a potential issue with our previously reported VideoEspresso results, which we are currently investigating. We are actively working on this and anticipate a corrected evaluation by July 2026; the full code and the updated results table will be released afterward. In the meantime, for any questions regarding reproducing our results, please email cxu@u.northwestern.edu.

Abstract

We present ReViSe (Reasoning with Video Sparsity), a multi-round agent for video question answering (VQA). Instead of uniformly sampling frames, ReViSe selects a small set of informative frames, maintains a summary-as-state across rounds, and stops early when confident. It supports proprietary vision-language models (VLMs) in a “plug-and-play” setting and enables reinforcement fine-tuning for open-source models. For fine-tuning, we introduce EAGER (Evidence-Adjusted Gain for Efficient Reasoning), an annotation-free reward with three terms: (1) Confidence gain: after new frames are added, we reward the increase in the log-odds gap between the correct option and the strongest alternative; (2) Summary sufficiency: at answer time we re-ask using only the last committed summary and reward success; (3) Correct-and-early stop: answering correctly within a small turn budget is rewarded. Across multiple VQA benchmarks, ReViSe improves accuracy while reducing frames, rounds, and prompt tokens, demonstrating practical sparse video reasoning.

Method

Summary-as-State

ReViSe operates analogously to a recurrent neural network: it maintains a compact state that propagates information from previous rounds to the VLM, without re-admitting raw frames or conversation history. This is the load-bearing design choice that keeps the context compact even as evidence accumulates.

Summary-as-state recurrence

Each round, the agent receives sampled video frames, the question, and its current summary state. Every response begins with a transient <think> reasoning trace, then commits a structured summary in the POHR format inside <summarize>:

  • P (Previously seen): what frames have been inspected so far
  • O (Observations): what was just observed in the current frames
  • H (Hypotheses): how observations update the current belief
  • U (Uncertainties): what remains unclear
  • R (Reasons): why specific new frames are needed, or why the question is now answerable

On a non-final round the agent appends a <select> request for new frames; on the final round it emits only <think> and <answer>, reusing the last committed summary. Only the <summarize> state persists between rounds — the <think> trace is transient and raw frames are never re-admitted — keeping the context compact.
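The response format above can be parsed mechanically. The following is a minimal sketch, not the authors' parser: it assumes every tag has a matching closing tag and that <select> lists integer frame indices, neither of which is stated on this page. The names `parse_turn` and `ParsedTurn` are hypothetical.

```python
import re
from dataclasses import dataclass
from typing import List, Optional

# Assumption: each tag is closed as </tag>; only these four tags are recognized.
TAG = re.compile(r"<(think|summarize|select|answer)>(.*?)</\1>", re.DOTALL)


@dataclass
class ParsedTurn:
    summary: Optional[str]       # the only field carried to the next round
    select: Optional[List[int]]  # requested frame indices (non-final rounds)
    answer: Optional[str]        # present only on the final round


def parse_turn(response: str) -> ParsedTurn:
    """Split one model response into its tagged parts."""
    fields = {name: body.strip() for name, body in TAG.findall(response)}
    # The <think> trace is transient, so it is matched but deliberately dropped.
    select = None
    if "select" in fields:
        # Assumption: frame requests are written as integers, e.g. "12, 20".
        select = [int(tok) for tok in re.findall(r"\d+", fields["select"])]
    return ParsedTurn(
        summary=fields.get("summarize"),
        select=select,
        answer=fields.get("answer"),
    )
```

Dropping the <think> body at parse time is one simple way to guarantee that only the summary can leak into the next round's context.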

ReViSe multi-round pipeline overview

The ReViSe agent loop: sample frames → update the POHR summary → request targeted frames or commit an answer.
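The loop in the caption can be sketched as follows. This is an illustration under stated assumptions rather than the released code: the `vlm` callable, its `must_answer` flag, the uniform initial sample, and the `init_k` and `max_rounds` defaults are all hypothetical.

```python
from typing import Callable, Dict, List

# Hypothetical interface: vlm(frames, question, summary, must_answer) returns a
# dict with optional keys "summary", "select" (frame indices), and "answer".
VLM = Callable[[List[int], str, str, bool], Dict[str, object]]


def run_agent(vlm: VLM, question: str, num_frames: int,
              init_k: int = 4, max_rounds: int = 4) -> Dict[str, object]:
    """Multi-round loop in which only the summary persists between rounds."""
    step = max(1, num_frames // init_k)
    frames = list(range(0, num_frames, step))[:init_k]  # assumed uniform start
    summary: str = ""
    seen: List[int] = []
    for round_idx in range(1, max_rounds + 1):
        # Force an answer when the budget is spent or nothing new was requested.
        must_answer = round_idx == max_rounds or not frames
        out = vlm(frames, question, summary, must_answer)
        seen.extend(frames)
        if out.get("answer") is not None:
            return {"answer": out["answer"], "rounds": round_idx,
                    "frames_used": len(seen), "summary": summary}
        # Carry the committed summary forward; raw frames are not re-sent.
        summary = str(out.get("summary", summary))
        requested = out.get("select") or []
        frames = [i for i in requested if 0 <= i < num_frames and i not in seen]
    return {"answer": None, "rounds": max_rounds,
            "frames_used": len(seen), "summary": summary}
```

Note that each call receives only the current frames and the summary string, so the prompt size does not grow with the number of rounds.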

1 · Plug-and-play

Wraps any VLM — including proprietary APIs like GPT-4o — as a frozen black box. No parameter updates needed.

2 · RL fine-tuning

GRPO with the EAGER reward, combining three annotation-free terms:

  • Confidence gain: after new frames are added, reward the increase in the log-odds gap between the correct option and the strongest alternative.
  • Summary sufficiency: at answer time, re-ask using only the last committed summary and reward success.
  • Correct-and-early stop: reward answering correctly within a small turn budget.
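The three terms above can be sketched in a few lines. This is a hedged illustration, not the paper's exact formulation: the unit weights, the additive combination, the `turn_budget` default, and the use of per-option logits to measure the gap are assumptions. `group_advantages` shows the standard group-relative normalization used by GRPO; the choice of population standard deviation is also an assumption.

```python
from statistics import mean, pstdev
from typing import Dict, List, Sequence


def margin(logits: Sequence[float], correct: int) -> float:
    """Score gap between the correct option and the strongest alternative."""
    best_other = max(z for i, z in enumerate(logits) if i != correct)
    return logits[correct] - best_other


def eager_reward(logits_per_round: List[Sequence[float]], correct: int,
                 final_answer: int, summary_only_answer: int,
                 turn_budget: int = 3,
                 weights: Sequence[float] = (1.0, 1.0, 1.0)) -> Dict[str, float]:
    """Combine the three annotation-free terms (weights are hypothetical)."""
    margins = [margin(z, correct) for z in logits_per_round]
    # (1) Confidence gain: margin increase after each batch of new frames.
    gain = sum(after - before for before, after in zip(margins, margins[1:]))
    # (2) Summary sufficiency: the summary-only re-ask must also be correct.
    sufficiency = float(summary_only_answer == correct)
    # (3) Correct-and-early stop: correct answer within the turn budget.
    early = float(final_answer == correct and len(logits_per_round) <= turn_budget)
    w_gain, w_suff, w_stop = weights
    total = w_gain * gain + w_suff * sufficiency + w_stop * early
    return {"gain": gain, "sufficiency": sufficiency, "early": early, "total": total}


def group_advantages(rewards: Sequence[float], eps: float = 1e-6) -> List[float]:
    """GRPO-style advantages: normalize rewards within one sampled group."""
    mu, sigma = mean(rewards), pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]
```

All three terms need only the ground-truth option index and the model's own outputs, which is what makes the reward annotation-free with respect to frame-level labels.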

Qualitative Example

ReViSe trajectories on NExT-QA clips. Each round the agent sees a few frames plus its carried summary, then either updates the summary and requests more frames or, on the final round, gives a brief <think> and the <answer>. Only the summary is carried between rounds.

Example videos are from the NExT-QA dataset — J. Xiao, X. Shang, A. Yao, T.-S. Chua, “NExT-QA: Next Phase of Question-Answering to Explaining Temporal Actions,” CVPR 2021.

Results

ReViSe matches or beats prior methods while using far fewer frames, and RL fine-tuning further improves both accuracy and efficiency. Notably, it surpasses the caption-based agent VideoAgent on EgoSchema (60.6 vs 60.2) and is best overall on Video-MME and LVBench, all without a captioner.

NExT-QA · full comparison + caption-effect analysis · ReViSe (frames) beats VideoAgent on GPT-4o

Full comparison

| Method | Backbone | Acc. (%) | Frames |
|---|---|---|---|
| VideoTree | GPT-4 | 73.5 | 56 |
| VideoAgent | GPT-4 | 71.3 | 8.2 |
| LLoVi | GPT-3.5 | 67.7 | 90 |
| ProViQ | GPT-3.5 | 64.6 | 60 |
| SeViLA | BLIP-2 | 63.6 | 32 |
| LVNet | GPT-4o | 61.1 | 12 |
| ReViSe | GPT-4o | 63.8 | 8.4 |

Caption-based methods (VideoTree, VideoAgent, LLoVi) reason over LLM-generated captions; ReViSe reasons directly on the frames with a single frozen VLM, captioner-free, matching SeViLA while beating LVNet at a far smaller frame budget.

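Caption effect analysis: same backbone, ReViSe vs. VideoAgent on NExT-QA question types

Backbone: Qwen2-VL-7B

| Setting | D | T | C | Avg | Tok/sample |
|---|---|---|---|---|---|
| VideoAgent (captions) | 75.4 | 60.2 | 59.8 | 65.1 | 4810+3749 |
| ReViSe (frames) | 75.7 | 53.9 | 63.7 | 64.4 | 6647 |
| ReViSe (frames + caption) | 75.5 | 56.7 | 61.3 | 64.5 | 12421+3749 |
| ReViSe (caption-only) | 69.5 | 50.7 | 55.8 | 58.7 | 6441+3749 |

Backbone: GPT-4o

| Setting | D | T | C | Avg | Tok/sample |
|---|---|---|---|---|---|
| VideoAgent (captions) | 79.9 | 69.6 | 71.5 | 73.7 | 4025+3749 |
| ReViSe (frames) | 78.0 | 67.8 | 76.6 | 74.1 | 3089 |
| ReViSe (frames + caption) | 81.8 | 68.7 | 75.8 | 75.4 | 4433+3749 |
| ReViSe (caption-only) | 76.7 | 61.7 | 69.7 | 69.4 | 2654+3749 |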

D / T / C = Descriptive / Temporal / Causal questions; Tok/sample = text tokens per sample (the “+caption” term is the extra captioning cost). With GPT-4o, ReViSe (frames) reaches 74.1 Avg — above caption-based VideoAgent (73.7) — using far fewer tokens (3089 vs 4025+3749) and no captioner. Adding captions barely helps (75.4) at much higher token cost, and caption-only drops to 69.4: the frames carry the signal, captions are not the bottleneck.
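The token comparison above is simple arithmetic. A small sketch using the GPT-4o column of the table, assuming the "+3749" captioning term simply adds to the reasoning tokens (the names `CAPTION_COST` and `total_tokens` are illustrative):

```python
# Tok/sample values from the GPT-4o columns of the caption-effect table.
# Assumption: the "+3749" captioning cost is added to the reasoning tokens.
CAPTION_COST = 3749

settings = {
    "VideoAgent (captions)": (4025, True),
    "ReViSe (frames)": (3089, False),
    "ReViSe (frames + caption)": (4433, True),
    "ReViSe (caption-only)": (2654, True),
}


def total_tokens(reasoning: int, uses_captioner: bool) -> int:
    """Total text tokens per sample, including captioning when it is used."""
    return reasoning + (CAPTION_COST if uses_captioner else 0)


totals = {name: total_tokens(*value) for name, value in settings.items()}
ratio = totals["ReViSe (frames)"] / totals["VideoAgent (captions)"]

for name, tokens in totals.items():
    print(f"{name}: {tokens}")
print(f"ReViSe (frames) / VideoAgent (captions): {ratio:.2f}")
```

Under that additive reading, ReViSe (frames) uses 3089 tokens against 7774 for VideoAgent, or roughly 40% of the total.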

EgoSchema · ReViSe 60.6% @ 9.8 frames · surpasses VideoAgent (60.2), captioner-free

| Method | Backbone | Acc. (%) | Frames |
|---|---|---|---|
| LVNet | GPT-4o | 68.2 | 12 |
| VideoTree | GPT-4 | 66.2 | 62.4 |
| MC-ViT-L | ViT-L | 62.6 | 128+ |
| VideoAgent | GPT-4 | 60.2 | 8.4 |
| LLoVi | GPT-3.5 | 57.6 | 180 |
| ReViSe | GPT-4o | 60.6 | 9.8 |

ReViSe surpasses VideoAgent (60.6 vs 60.2) without a captioner, at a fraction of the frames used by caption-/memory-heavy methods (128–180 for MC-ViT-L and LLoVi).

Video-MME & LVBench · best accuracy and fewest frames on both (LLaVA-OV-7B)

| Method | Video-MME (Acc / Frames) | LVBench (Acc / Frames) |
|---|---|---|
| Adaptive Keyframes (CVPR 25) | 58.4 / 32 | 59.3 / 32 |
| MDP3 (ICCV 25) | 59.6 / 8 | 59.0 / 8 |
| ReViSe | 60.7 / 7.38 | 61.5 / 5.62 |

Under the same backbone, ReViSe is best on both benchmarks while using the fewest frames.

RL fine-tuning · NExT-QA +19.6pp over plug-and-play, ~2× faster (Qwen2.5-VL-3B)

| Method | Acc. (%) | Frames | Rounds | Time (s) |
|---|---|---|---|---|
| Direct Reasoning | 23.6 | 8.0 | 1.00 | 0.88 |
| Plug-and-Play | 31.7 | 5.3 | 1.74 | 1.22 |
| Supervised Format FT | 27.3 | 5.1 | 1.65 | 1.13 |
| Reinforced FT (ReViSe) | 51.3 | 3.9 | 1.32 | 0.62 |

RL fine-tuning yields +19.6pp over plug-and-play with fewer frames, fewer rounds, and ~2× faster inference (0.62 s vs 1.22 s).

VideoEspresso comparisons are being updated — see the note at the top of the page.

Accuracy vs. average frame budget Pareto frontier

More rounds yield better accuracy at lower average frame budgets — the agent learns to stop early when confident.

BibTeX

@article{xu2026towards,
  title={Towards Sparse Video Understanding and Reasoning},
  author={Xu, Chenwei and Ye, Zhen and Wu, Shang and Li, Weijian and Wang, Zihan and Xia, Zhuofan and Lu, Lie and Maneriker, Pranav and Du, Fan and Li, Manling and others},
  journal={arXiv preprint arXiv:2602.13602},
  year={2026}
}