Vidi2: Large Multimodal Models for Video Understanding and Creation

Vidi2 advances video understanding with fine-grained spatio-temporal grounding and extends its capabilities to video question answering, enabling comprehensive multimodal reasoning.

Benchmark of Vidi2

Driving the next generation of video understanding and creation.

VUE-STG (Spatio-Temporal Grounding)

Duration distribution of videos in the proposed VUE-STG evaluation benchmark.

VUE-TR-V2 (Temporal Retrieval)

Duration distribution of videos in the proposed VUE-TR-V2 evaluation benchmark.

The distribution of query modality and format in the VUE-TR-V2 benchmark.

Qualitative Results by Vidi2

Demonstration of Vidi2's capabilities

Input Query: The man wearing a brown suit who is playing drums in an indoor setting
Input Video: 00:04:12

Input Query: The gorilla which is driving with two men.
Input Video: 00:16:19

Input Query: a woman in glasses who is walking on street
Input Video: 00:04:11

Input Query: The boy who stands outside a charming house with warm lights, beneath a starry night sky featuring a full moon.
Input Video: 00:02:00

Input Query: The glowing blue water beads in which the mango seed is placed, with its germination into a root and shoot visualized through a time-lapse sequence against a dark background
Input Video: 00:15:50

Input Query: basketball statue
Input Video: 00:00:46

Input Query: gymnasium
Input Video: 00:01:29

Input Query: people assembling sculptures on beach
Input Video: 00:01:30

Input Query: Euripides, has most surviving work like 'Medea' and 'The Bacchae', debut in 455 BC. He is a corner stone of greek education in the Hellenistic period.
Input Video: 00:05:10

Input Query: Jennifer Nagel self-introduction
Input Video: 00:10:02

Input Query: divine wind
Input Video: 00:20:18

Input Query: FTC resources
Input Video: 00:59:35

Input Query: North Devon Marine Pioneer
Input Video: 01:24:45

Key Capabilities of Vidi2


Spatio-Temporal Grounding

Locate objects and events in both space and time with precision.

Temporal Retrieval

Find specific moments in videos using natural language queries.
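
As a rough illustration of how a retrieved time range might be compared with a reference range, the sketch below computes temporal intersection-over-union in Python. The function name `temporal_iou` and the (start, end) representation in seconds are assumptions made for this example; the page does not state which metric the VUE-TR-V2 benchmark uses.

```python
from typing import Tuple

TimeRange = Tuple[float, float]  # (start, end) in seconds; an assumed representation


def temporal_iou(pred: TimeRange, ref: TimeRange) -> float:
    """Intersection-over-union of two time ranges, in [0, 1]."""
    pred_start, pred_end = pred
    ref_start, ref_end = ref
    # Overlap is zero when the two ranges do not intersect.
    intersection = max(0.0, min(pred_end, ref_end) - max(pred_start, ref_start))
    union = (pred_end - pred_start) + (ref_end - ref_start) - intersection
    return intersection / union if union > 0 else 0.0
```

A score of 1.0 means the retrieved range matches the reference exactly, and 0.0 means the two ranges do not overlap at all.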

Video QA

Answer complex questions about video content.

Long Video Support

Process and understand videos up to 30 minutes long.

Multimodal Output

Generate timestamps and bounding boxes for target objects.
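
To make the idea of timestamp and bounding-box output concrete, here is a minimal Python sketch that converts an `HH:MM:SS` string to seconds and measures the overlap between two boxes. The helper names `parse_timestamp` and `box_iou`, and the `(x0, y0, x1, y1)` box convention, are hypothetical; the page does not describe Vidi2's actual output format.

```python
from typing import Tuple

# (x0, y0, x1, y1) corners; this coordinate convention is an assumption.
Box = Tuple[float, float, float, float]


def parse_timestamp(text: str) -> int:
    """Convert an 'HH:MM:SS' string to a number of seconds."""
    hours, minutes, seconds = (int(part) for part in text.split(":"))
    return hours * 3600 + minutes * 60 + seconds


def box_iou(a: Box, b: Box) -> float:
    """Intersection-over-union of two axis-aligned boxes, in [0, 1]."""
    inter_w = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    inter_h = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    intersection = inter_w * inter_h
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - intersection
    return intersection / union if union > 0 else 0.0
```

For example, `parse_timestamp("00:04:12")` gives 252 seconds, and two unit squares offset horizontally by half a unit have a box overlap of one third.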

Smart Creation

Assist in video editing, reframing, and content generation.
