Towards Sparse Video Understanding and Reasoning

Xu, Chenwei; Ye, Zhen; Wu, Shang; Li, Weijian; Wang, Zihan; Xia, Zhuofan; Lu, Lie; Maneriker, Pranav; Du, Fan; Li, Manling; Liu, Han

CVPR 2026

ReViSe — Reasoning with Video Sparsity

Chenwei Xu¹, Zhen Ye², Shang Wu¹, Weijian Li¹, Zihan Wang¹, Zhuofan Xia¹,
Lie Lu³, Pranav Maneriker³, Fan Du³, Manling Li¹, Han Liu¹

¹ Northwestern University, ² Johns Hopkins University, ³ Dolby Laboratories

Abstract

We present ReViSe (Reasoning with Video Sparsity), a multi-round agent for video question answering (VQA). Instead of uniformly sampling frames, ReViSe selects a small set of informative frames, maintains a summary-as-state across rounds, and stops early when confident. It supports proprietary vision-language models (VLMs) in a “plug-and-play” setting and enables reinforcement fine-tuning for open-source models. For fine-tuning, we introduce EAGER (Evidence-Adjusted Gain for Efficient Reasoning), an annotation-free reward with three terms: (1) Confidence gain: after new frames are added, we reward the increase in the log-odds gap between the correct option and the strongest alternative; (2) Summary sufficiency: at answer time we re-ask using only the last committed summary and reward success; (3) Correct-and-early stop: answering correctly within a small turn budget is rewarded. Across multiple VQA benchmarks, ReViSe improves accuracy while reducing frames, rounds, and prompt tokens, demonstrating practical sparse video reasoning.

Method

Summary-as-State

ReViSe operates analogously to a recurrent neural network: it maintains a compact state that propagates information from previous rounds to the VLM, without re-admitting raw frames or conversation history. This is the load-bearing design choice that keeps the context compact even as evidence accumulates.

Each round, the agent receives sampled video frames, the question, and its current summary state. Every response begins with a transient <think> reasoning trace, then commits a structured summary in the POHUR format inside <summarize>:

P	Previously seen	What frames have been inspected so far
O	Observations	What was just observed in the current frames
H	Hypotheses	How observations update the current belief
U	Uncertainties	What remains unclear
R	Reasons	Why specific new frames are needed, or why the question is now answerable
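
The five POHUR fields map naturally onto a small record type. The following is a minimal sketch, assuming each field is free text and that the state is serialized back into a <summarize> block with one line per field. The class name `PohurSummary`, the field names, and the `X: text` line layout are illustrative assumptions, not the authors' implementation.

```python
from dataclasses import dataclass, fields


@dataclass
class PohurSummary:
    """Hypothetical container for the five POHUR fields (names are assumptions)."""
    previously_seen: str = ""
    observations: str = ""
    hypotheses: str = ""
    uncertainties: str = ""
    reasons: str = ""

    def render(self) -> str:
        """Serialize as a <summarize> block, one 'X: text' line per field."""
        lines = []
        for f in fields(self):
            letter = f.name[0].upper()  # P, O, H, U, R in declaration order
            lines.append(f"{letter}: {getattr(self, f.name)}")
        return "<summarize>\n" + "\n".join(lines) + "\n</summarize>"


# Invented example content, for illustration only.
state = PohurSummary(
    previously_seen="frames 0, 40, 80",
    observations="a child picks up a ball",
    hypotheses="the child is about to throw it",
    uncertainties="who receives the ball",
    reasons="need frames after 80 to see the throw",
)
rendered = state.render()
print(rendered)
```

Keeping the state as a fixed set of short text fields is what lets it stand in for the full conversation history on the next round.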

On a non-final round the agent appends a <select> request for new frames; on the final round it emits only <think> and <answer>, reusing the last committed summary. Only the <summarize> state persists between rounds: the <think> trace is discarded and raw frames are never re-admitted.
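
The tag protocol described above can be handled with a small parser. This is a sketch under stated assumptions: the tags are closed with matching `</...>` tags, and a <select> body lists frame indices as integers. Neither detail is specified in this page, and the function names are hypothetical.

```python
import re
from typing import Dict, List, Optional

TAGS = ("think", "summarize", "select", "answer")


def parse_response(text: str) -> Dict[str, Optional[str]]:
    """Extract the body of each known tag; tags that are absent map to None."""
    parsed: Dict[str, Optional[str]] = {}
    for tag in TAGS:
        match = re.search(rf"<{tag}>(.*?)</{tag}>", text, flags=re.DOTALL)
        parsed[tag] = match.group(1).strip() if match else None
    return parsed


def next_state(parsed: Dict[str, Optional[str]], previous_summary: str) -> str:
    """Carry over only the committed summary; the <think> trace is dropped."""
    if parsed["summarize"] is not None:
        return parsed["summarize"]
    return previous_summary  # final round: reuse the last committed summary


def requested_frames(parsed: Dict[str, Optional[str]]) -> List[int]:
    """Read integer frame indices from the <select> body (assumed format)."""
    return [int(x) for x in re.findall(r"\d+", parsed["select"] or "")]


non_final = ("<think>need later frames</think>"
             "<summarize>P: 0, 40, 80</summarize><select>12, 48, 96</select>")
final = "<think>clear now</think><answer>B</answer>"

p1, p2 = parse_response(non_final), parse_response(final)
print(requested_frames(p1), next_state(p2, p1["summarize"]), p2["answer"])
```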

The ReViSe agent loop: sample frames → update the POHUR summary → request targeted frames or commit an answer.
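
The loop can be sketched in a few lines. Everything here is an assumption made for illustration: the `VLM` callable signature, the uniform first-round sample, the per-round frame cap `k`, and the rule that forces an answer on the last round or when no frames are requested. The toy model stands in for a real VLM so the sketch runs on its own.

```python
import re
from typing import Callable, List, Optional, Tuple

# Hypothetical interface: (frame_indices, question, summary, force_answer) -> raw text.
VLM = Callable[[List[int], str, str, bool], str]


def _tag(text: str, tag: str) -> Optional[str]:
    m = re.search(rf"<{tag}>(.*?)</{tag}>", text, flags=re.DOTALL)
    return m.group(1).strip() if m else None


def run_agent(vlm: VLM, question: str, num_frames: int,
              max_rounds: int = 4, k: int = 3) -> Tuple[Optional[str], int, int]:
    """Return (answer, rounds_used, frames_used). Only `summary` crosses rounds."""
    step = max(1, num_frames // k)
    frames = list(range(0, num_frames, step))[:k]  # assumed sparse initial sample
    summary = ""
    frames_used = 0
    for round_idx in range(1, max_rounds + 1):
        force_answer = round_idx == max_rounds or not frames
        frames_used += len(frames)
        reply = vlm(frames, question, summary, force_answer)
        answer = _tag(reply, "answer")
        if answer is not None:
            return answer, round_idx, frames_used  # early stop
        # Keep the committed summary only; the trace and the frames are dropped.
        summary = _tag(reply, "summarize") or summary
        wanted = re.findall(r"\d+", _tag(reply, "select") or "")
        frames = [int(i) for i in wanted if int(i) < num_frames][:k]
    return None, max_rounds, frames_used


def toy_vlm(frames: List[int], question: str, summary: str, force_answer: bool) -> str:
    """Stand-in model: answers once frame 90 is visible, else asks for later frames."""
    if force_answer or 90 in frames:
        return "<think>enough evidence</think><answer>B</answer>"
    seen = ", ".join(map(str, frames))
    return (f"<think>not yet</think><summarize>P: {seen}</summarize>"
            "<select>60, 90</select>")


result = run_agent(toy_vlm, "What happens after the throw?", num_frames=120)
print(result)  # ('B', 2, 5)
```

In this toy run the agent inspects three frames, requests two more, and stops in the second round after five frames in total.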

1 · Plug-and-play

Wraps any VLM — including proprietary APIs like GPT-4o — as a frozen black box. No parameter updates needed.

2 · RL fine-tuning

GRPO with the EAGER reward, combining three annotation-free terms:

Confidence gain: reward the increase in the log-odds gap between the correct option and the strongest alternative after new frames are added.
Summary sufficiency: at answer time, re-ask using only the last committed summary and reward success.
Correct-and-early stop: reward answering correctly within a small turn budget.
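
The three terms can be combined into a scalar reward as sketched below. This is not the paper's formula: the unit weights, the clipping of negative gains at zero, the use of per-option log-probabilities as the scores behind the gap, and the default turn budget are all assumptions. The last function shows the usual group-relative normalization GRPO applies to rewards within one group of sampled rollouts.

```python
from typing import Dict, List, Sequence


def margin(option_logps: Dict[str, float], correct: str) -> float:
    """Gap between the correct option and the strongest alternative."""
    best_other = max(v for k, v in option_logps.items() if k != correct)
    return option_logps[correct] - best_other


def eager_reward(logps_per_round: Sequence[Dict[str, float]], correct: str,
                 final_answer: str, summary_only_answer: str, rounds_used: int,
                 turn_budget: int = 2, w_gain: float = 1.0,
                 w_suff: float = 1.0, w_early: float = 1.0) -> float:
    """Hypothetical weighted sum of the three EAGER terms."""
    margins = [margin(lp, correct) for lp in logps_per_round]
    # (1) Confidence gain: sum of positive margin increases between rounds.
    gain = sum(max(0.0, after - before)
               for before, after in zip(margins, margins[1:]))
    # (2) Summary sufficiency: the summary alone must yield the correct answer.
    suff = 1.0 if summary_only_answer == correct else 0.0
    # (3) Correct-and-early stop: correct answer within the turn budget.
    early = 1.0 if final_answer == correct and rounds_used <= turn_budget else 0.0
    return w_gain * gain + w_suff * suff + w_early * early


def group_relative_advantages(rewards: List[float], eps: float = 1e-6) -> List[float]:
    """Standardize rewards within one group of rollouts, as GRPO does."""
    mean = sum(rewards) / len(rewards)
    std = (sum((r - mean) ** 2 for r in rewards) / len(rewards)) ** 0.5
    return [(r - mean) / (std + eps) for r in rewards]


# Invented numbers: the margin for option B grows from 0.5 to 2.5 after round 2.
rounds = [{"A": -1.5, "B": -1.0, "C": -2.0},
          {"A": -2.75, "B": -0.25, "C": -3.0}]
reward = eager_reward(rounds, correct="B", final_answer="B",
                      summary_only_answer="B", rounds_used=2)
print(reward)  # 4.0
print(group_relative_advantages([reward, 1.0, 0.0, 1.0]))
```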

Qualitative Example

ReViSe trajectories on NExT-QA clips. Each round the agent sees a few frames plus its carried summary, then either updates the summary and requests more frames or, on the final round, gives a brief <think> and the <answer>. Only the summary is carried between rounds.

Example videos are from the NExT-QA dataset — J. Xiao, X. Shang, A. Yao, T.-S. Chua, “NExT-QA: Next Phase of Question-Answering to Explaining Temporal Actions,” CVPR 2021.

Results

ReViSe matches or beats several prior methods while using far fewer frames than most of them, and RL fine-tuning further improves both accuracy and efficiency. Notably, it surpasses the caption-based agent VideoAgent on EgoSchema (60.6 vs 60.2) and is the best of the compared methods on Video-MME and LVBench, all without a captioner.

NExT-QA · full comparison + caption-effect analysis · ReViSe (frames) beats VideoAgent on GPT-4o

Full comparison

Method	Backbone	Acc. (%)	Frames
VideoTree	GPT-4	73.5	56
VideoAgent	GPT-4	71.3	8.2
LLoVi	GPT-3.5	67.7	90
ProViQ	GPT-3.5	64.6	60
SeViLA	BLIP-2	63.6	32
LVNet	GPT-4o	61.1	12
ReViSe	GPT-4o	63.8	8.4

Caption-based methods (VideoTree, VideoAgent, LLoVi) reason over LLM-generated captions; ReViSe reasons directly on the frames with a single frozen VLM and no captioner. At 8.4 frames it matches SeViLA (63.8 vs 63.6 at 32 frames) and beats LVNet (61.1 at 12 frames).

Caption effect analysis — same backbone, ReViSe vs. VideoAgent on NExT-QA question types

Setting	Qwen2-VL-7B: D	Qwen2-VL-7B: T	Qwen2-VL-7B: C	Qwen2-VL-7B: Avg	Qwen2-VL-7B: Tok/sample	GPT-4o: D	GPT-4o: T	GPT-4o: C	GPT-4o: Avg	GPT-4o: Tok/sample
VideoAgent (captions)	75.4	60.2	59.8	65.1	4810+3749	79.9	69.6	71.5	73.7	4025+3749
ReViSe (frames)	75.7	53.9	63.7	64.4	6647	78.0	67.8	76.6	74.1	3089
ReViSe (frames + caption)	75.5	56.7	61.3	64.5	12421+3749	81.8	68.7	75.8	75.4	4433+3749
ReViSe (caption-only)	69.5	50.7	55.8	58.7	6441+3749	76.7	61.7	69.7	69.4	2654+3749

D / T / C = Descriptive / Temporal / Causal questions; Tok/sample = text tokens per sample (the “+3749” term is the extra captioning cost). With GPT-4o, ReViSe (frames) reaches 74.1 Avg, above caption-based VideoAgent (73.7), while using far fewer tokens (3089 vs 4025+3749) and no captioner. Adding captions helps only slightly (75.4) at a much higher token cost, and caption-only drops to 69.4: the frames carry the signal, and captions are not the bottleneck.
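
The token comparison can be made concrete with a short calculation. The numbers are copied from the Tok/sample columns above; treating the reasoning and captioning terms as simply additive is an assumption of this sketch, and the function name is hypothetical.

```python
def token_saving(ours: int, baseline_reasoning: int, caption_cost: int) -> float:
    """Fraction of tokens saved relative to a captioned baseline."""
    baseline_total = baseline_reasoning + caption_cost
    return 1 - ours / baseline_total


CAPTION_COST = 3749  # the "+3749" captioning term from the table

# (ReViSe frames-only tokens, VideoAgent reasoning tokens) per backbone.
rows = {
    "Qwen2-VL-7B": (6647, 4810),
    "GPT-4o": (3089, 4025),
}

savings = {}
for backbone, (revise_tokens, videoagent_tokens) in rows.items():
    total = videoagent_tokens + CAPTION_COST
    savings[backbone] = token_saving(revise_tokens, videoagent_tokens, CAPTION_COST)
    print(f"{backbone}: {revise_tokens} vs {total} tokens, "
          f"saving {savings[backbone]:.1%}")
```

With GPT-4o the captioned baseline totals 7774 tokens per sample, so the frames-only setting uses roughly 60% fewer tokens.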

EgoSchema · ReViSe 60.6% @ 9.8 frames · surpasses VideoAgent (60.2), captioner-free

Method	Backbone	Acc. (%)	Frames
LVNet	GPT-4o	68.2	12
VideoTree	GPT-4	66.2	62.4
MC-ViT-L	ViT-L	62.6	128+
VideoAgent	GPT-4	60.2	8.4
LLoVi	GPT-3.5	57.6	180
ReViSe	GPT-4o	60.6	9.8

ReViSe surpasses VideoAgent (60.6 vs 60.2) without a captioner, at a fraction of the frames used by caption-/memory-heavy methods (128–180 for MC-ViT-L and LLoVi).

Video-MME & LVBench · best accuracy and fewest frames on both (LLaVA-OV-7B)

Method	Video-MME (Acc / Frames)	LVBench (Acc / Frames)
Adaptive Keyframes (CVPR 25)	58.4 / 32	59.3 / 32
MDP3 (ICCV 25)	59.6 / 8	59.0 / 8
ReViSe	60.7 / 7.38	61.5 / 5.62

Under the same backbone, ReViSe is best on both benchmarks while using the fewest frames.
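
"Best accuracy with the fewest frames" is a Pareto-dominance statement, which can be checked directly against the table. The sketch below uses the (accuracy, frames) pairs copied from the rows above; the helper names are hypothetical.

```python
from typing import Dict, List, Tuple

Point = Tuple[float, float]  # (accuracy %, average frames)

RESULTS: Dict[str, Dict[str, Point]] = {
    "Video-MME": {"Adaptive Keyframes": (58.4, 32), "MDP3": (59.6, 8),
                  "ReViSe": (60.7, 7.38)},
    "LVBench": {"Adaptive Keyframes": (59.3, 32), "MDP3": (59.0, 8),
                "ReViSe": (61.5, 5.62)},
}


def dominates(a: Point, b: Point) -> bool:
    """a is no worse on both axes and strictly better on at least one."""
    (acc_a, fr_a), (acc_b, fr_b) = a, b
    return acc_a >= acc_b and fr_a <= fr_b and (acc_a > acc_b or fr_a < fr_b)


def pareto_front(rows: Dict[str, Point]) -> List[str]:
    """Methods not dominated by any other method in the same table."""
    return sorted(
        name for name, p in rows.items()
        if not any(dominates(q, p) for other, q in rows.items() if other != name)
    )


fronts = {bench: pareto_front(rows) for bench, rows in RESULTS.items()}
print(fronts)  # {'Video-MME': ['ReViSe'], 'LVBench': ['ReViSe']}
```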

RL fine-tuning · NExT-QA +19.6pp over plug-and-play, ~2× faster (Qwen2.5-VL-3B)

Method	Acc. (%)	Frames	Rounds	Time (s)
Direct Reasoning	23.6	8.0	1.00	0.88
Plug-and-Play	31.7	5.3	1.74	1.22
Supervised Format FT	27.3	5.1	1.65	1.13
Reinforced FT (ReViSe)	51.3	3.9	1.32	0.62

RL fine-tuning yields +19.6pp over plug-and-play with fewer frames, fewer rounds, and ~2× faster inference (0.62 s vs 1.22 s).
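
The headline numbers follow from the table rows. This short check recomputes them, and also derives the relative reductions in frames and rounds, which the text states only qualitatively; the variable names are illustrative.

```python
# (accuracy %, frames, rounds, time in seconds) copied from the table above.
rows = {
    "Direct Reasoning": (23.6, 8.0, 1.00, 0.88),
    "Plug-and-Play": (31.7, 5.3, 1.74, 1.22),
    "Supervised Format FT": (27.3, 5.1, 1.65, 1.13),
    "Reinforced FT (ReViSe)": (51.3, 3.9, 1.32, 0.62),
}

base = rows["Plug-and-Play"]
rl = rows["Reinforced FT (ReViSe)"]

acc_gain = round(rl[0] - base[0], 1)   # percentage points
speedup = base[3] / rl[3]              # ratio of inference times
frame_cut = 1 - rl[1] / base[1]        # relative reduction in frames
round_cut = 1 - rl[2] / base[2]        # relative reduction in rounds

print(f"+{acc_gain}pp accuracy, {speedup:.2f}x faster, "
      f"{frame_cut:.1%} fewer frames, {round_cut:.1%} fewer rounds")
```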

VideoEspresso comparisons are being updated — see the reproducibility note in our GitHub repository.

Accuracy vs. average frame budget Pareto frontier

More rounds yield better accuracy at lower average frame budgets — the agent learns to stop early when confident.

BibTeX

@InProceedings{Xu_2026_CVPR,
    author    = {Xu, Chenwei and Ye, Zhen and Wu, Shang and Li, Weijian and Wang, Zihan and Xia, Zhuofan and Lu, Lie and Maneriker, Pranav and Du, Fan and Li, Manling and Liu, Han},
    title     = {Towards Sparse Video Understanding and Reasoning},
    booktitle = {Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
    month     = {June},
    year      = {2026},
    pages     = {11357-11368}
}