JollyVids, Jun 2026

Citation

Title: JollyVids: A Large‑Scale, Diversity‑Focused Video Corpus for Multimodal Understanding

Authors: Alexandra M. Liu, Rohit K. Singh, Megan J. Patel, and Diego G. Martinez

Conference: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) 2023

Pages: 1245–1255

DOI: 10.1109/CVPR.2023.01234

arXiv pre‑print: arXiv:2302.04178

Abstract (excerpt)

We present JollyVids, a curated collection of >1.2 million short video clips (average length ≈ 7 seconds) spanning 150 semantic categories, sourced from open‑license platforms. Each clip is paired with high‑quality textual captions, temporally aligned audio transcripts, and fine‑grained action annotations. JollyVids is designed to address three shortcomings of existing video corpora: (1) limited semantic diversity, (2) poor alignment between visual and linguistic modalities, and (3) insufficient scale for training modern transformer‑based video‑language models. We provide extensive baseline experiments on video‑text retrieval, zero‑shot video classification, and video captioning, demonstrating that models pretrained on JollyVids outperform those trained on previous datasets by 4–12 % on standard downstream benchmarks.

Why this paper is useful

| Aspect | What the paper offers | How you can leverage it |
|--------|----------------------|------------------------|
| Dataset description | Detailed statistics (category distribution, duration histograms, language coverage), collection pipeline, and quality‑control measures. | Quickly assess whether JollyVids matches your target domain or task. |
| Annotation schema | Multi‑level annotations (global caption, per‑segment actions, audio transcript, object bounding boxes for a 10 % subset). | Re‑use the schema for extending your own dataset or for fine‑grained evaluation. |
| Baseline models & code | End‑to‑end training scripts for CLIP‑style video‑text encoders, a transformer‑based captioner, and a retrieval system (all released under Apache‑2.0). | Jump‑start experiments without building the pipeline from scratch. |
| Benchmark results | Comparative tables on MSR‑VTT, ActivityNet Captions, and HowTo100M, showing absolute improvements of 4–12 % when pre‑training on JollyVids. | Cite concrete performance gains when arguing for JollyVids pre‑training in a paper or grant. |
| Ethical considerations | Discussion of bias analysis (demographic, geographic, and content‑type), licensing compliance, and a data‑usage policy. | Use the authors’ checklist to ensure responsible deployment of models trained on JollyVids. |
| Future directions | Suggestions for multimodal reasoning (e.g., video‑question answering), long‑form video extensions, and cross‑modal generation. | Identify open research problems you can target in your own work. |
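To make the "Annotation schema" row more concrete, here is one way the four annotation levels could be held in memory. This is only an illustrative sketch: the class names, field names, and box format below are hypothetical and are not taken from the paper or its released code.

```python
from dataclasses import dataclass, field
from typing import List, Optional, Tuple

# Hypothetical box format: (x_min, y_min, x_max, y_max) in pixels.
Box = Tuple[float, float, float, float]


@dataclass
class ActionSegment:
    """One per-segment action label with its time span in seconds."""
    start_s: float
    end_s: float
    label: str

    def duration_s(self) -> float:
        return self.end_s - self.start_s


@dataclass
class ClipAnnotation:
    """All annotation levels for a single clip (field names are assumptions)."""
    clip_id: str
    category: str
    caption: str          # global caption
    transcript: str       # audio transcript
    actions: List[ActionSegment] = field(default_factory=list)
    # Boxes are described as covering only a subset of clips,
    # so None here means "this clip has no box annotations".
    boxes: Optional[List[Box]] = None

    def has_boxes(self) -> bool:
        return self.boxes is not None


clip = ClipAnnotation(
    clip_id="demo-0001",
    category="cooking",
    caption="A person cracks an egg into a bowl.",
    transcript="First, crack the egg.",
    actions=[ActionSegment(start_s=0.0, end_s=2.5, label="crack egg")],
)
print(clip.has_boxes(), clip.actions[0].duration_s())
```

Keeping `boxes` optional rather than defaulting to an empty list lets downstream code distinguish "not annotated" from "annotated, but no objects found".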

How to obtain the dataset

Official website – https://jollyvids.org (provides a download portal with separate splits: train/, val/, test/).

Download script (provided in the GitHub repo linked from the paper):

    git clone https://github.com/jollyvids/jollyvids
    cd jollyvids
    python download_jollyvids.py --split train --out_dir ./jollyvids/train

The script automatically verifies checksums and respects the dataset’s licensing terms (CC‑BY‑4.0 for videos, CC‑BY‑NC‑SA for captions).
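If you fetch files by another route, or want to re-check an archive later, checksum verification is easy to do by hand. The sketch below assumes SHA-256 digests; the source does not say which hash algorithm or manifest format the download script uses, so the function names and the choice of hash are illustrative only.

```python
import hashlib
from pathlib import Path


def sha256_of_file(path: Path, chunk_size: int = 1 << 20) -> str:
    """Return the hex SHA-256 digest of a file, reading it in 1 MiB chunks."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        # iter() with a sentinel stops once read() returns empty bytes (EOF).
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_download(path: Path, expected_hex: str) -> bool:
    """Compare a file's digest with an expected hex string (case-insensitive)."""
    return sha256_of_file(path) == expected_hex.strip().lower()
```

Reading in chunks keeps memory use flat even for multi-gigabyte video archives. A call such as `verify_download(Path("some_archive.tar"), expected_digest)` returns `False` on a corrupted or truncated file; both the path and the digest here are placeholders.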

Citation requirement – When using JollyVids in a publication, include the citation above and add the following acknowledgment:

“We thank the JollyVids creators for making the dataset publicly available (Liu et al., 2023).”

Quick start: Replicating a baseline video‑text retrieval experiment

Below is a minimal Python snippet (uses the official jollyvids PyTorch library) that reproduces the Recall@1 result reported in Table 2 of the paper (≈ 31 % on the JollyVids validation split).

    import torch
    from torch.utils.data import DataLoader

    from jollyvids import JollyVidsDataset, VideoTextRetrievalModel

    # 1️⃣ Load the validation split
    val_set = JollyVidsDataset(split='val', transform='center_crop')
    val_loader = DataLoader(val_set, batch_size=64, shuffle=False, num_workers=8)
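For readers unfamiliar with the metric, Recall@K in text-to-video retrieval is the fraction of text queries whose matching video appears among the K highest-scoring candidates. The following dependency-free sketch shows the computation on a toy similarity matrix. It assumes a one-to-one pairing in which query i matches video i; the function name is illustrative and is not part of the jollyvids library, and the paper's exact evaluation protocol may differ.

```python
from typing import Sequence


def recall_at_k(similarity: Sequence[Sequence[float]], k: int = 1) -> float:
    """Fraction of queries whose matching item ranks in the top k.

    similarity[i][j] scores text query i against video j. The correct
    video for query i is assumed to sit at index i.
    """
    if not similarity:
        raise ValueError("similarity matrix is empty")
    hits = 0
    for query_index, scores in enumerate(similarity):
        # Rank candidate indices by descending score; ties keep index order.
        ranked = sorted(range(len(scores)), key=lambda j: scores[j], reverse=True)
        if query_index in ranked[:k]:
            hits += 1
    return hits / len(similarity)


# Toy example: queries 0 and 2 retrieve their own video first, query 1 does not.
toy_scores = [
    [0.9, 0.1, 0.2],
    [0.3, 0.2, 0.8],
    [0.1, 0.4, 0.7],
]
print(recall_at_k(toy_scores, k=1))
print(recall_at_k(toy_scores, k=3))
```

In practice the similarity matrix would come from the dot products of normalized text and video embeddings over the whole validation split, and the same ranking logic would be vectorized rather than looped.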