DL3DV-10K

• 10,510 videos with a consistent capture standard.
• A comprehensive new-view synthesis benchmark.
• Each video has human-created labels for scene POI and complexity (lighting, surface materials, etc.).
• Each video has calibrated camera poses.

• Generalizable NeRF research.
• Scene-level view consistency tracking.
• Vision language model (VLM) research.
• and more!

Dataset Statistics

10,510

Video

10,510 different scenes with a consistent capture standard

4K/60 FPS

Quality

High-quality, high-frame-rate videos for high-quality new-view synthesis

65

POI

6 primary point-of-interest (POI) categories and 65 secondary POIs to cover diverse real-world scenarios

96

Complexity

96 complexity categories to cover real-world complexities (environment, materials, lighting, etc.)

**Fig. 1.** Scene distribution by POI category. The angle of each class denotes its data proportion. Interior: primary POI category. Exterior: secondary POI category.

**Fig. 2.** Number of scenes within each secondary POI category. The legend contains the mapping between the primary and secondary POI categories. We observe that *schools-universities* and *residential-area* are the predominant scenes in our DL3DV-10K dataset. In contrast, locations such as government and civic service facilities (e.g., *post office*, *police station*, *court house*, and *city hall*) are less frequently captured because these areas are difficult to access for detailed video recording.

**Fig. 3.** We show the distribution of scene categories (the primary POI locations) by complexity indices, including environmental setting, light conditions, reflective surfaces, and transparent materials. Attributes in light conditions include natural light (*nlight*), artificial light (*alight*), and a combination of both (*mlight*). The reflection class includes *more*, *medium*, *less*, and *none*; the transparency class likewise.

**Fig. 4.** We show the distribution of video duration and frequency metric across the 10,510 videos. The minimum duration for video shooting with consumer mobile devices is set at 60 seconds, while for drone cameras it is at least 45 seconds. In our dataset, the median video duration is 69.5 seconds. Furthermore, the median value of the frequency metric, determined by the average image intensity, stands at 2.6e-06. Based on this median value, we categorize scenes into high frequency (*high freq*) and low frequency (*low freq*) classes.
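The median-based split described in the Fig. 4 caption can be sketched as follows. This is a minimal illustration, not the dataset's actual processing code: the function names, the scene identifiers, the device keys, and the metric values are all hypothetical, and sending values exactly equal to the median to *low freq* is an assumption, since the caption does not say how ties are handled.

```python
from statistics import median

# Minimum capture durations in seconds, as stated in the Fig. 4 caption.
# The device keys ("mobile", "drone") are hypothetical names for this sketch.
MIN_DURATION_SECS = {"mobile": 60, "drone": 45}


def meets_min_duration(device, duration_secs):
    """Return True if a video is at least as long as the minimum for its device."""
    return duration_secs >= MIN_DURATION_SECS[device]


def split_by_median(freq_by_scene):
    """Label each scene 'high freq' or 'low freq' relative to the median metric.

    Assumption: a value equal to the median is labeled 'low freq'.
    """
    threshold = median(freq_by_scene.values())
    labels = {
        scene: "high freq" if value > threshold else "low freq"
        for scene, value in freq_by_scene.items()
    }
    return threshold, labels


# Hypothetical metric values for illustration only; not taken from the dataset.
example = {
    "scene_a": 1.0e-06,
    "scene_b": 2.0e-06,
    "scene_c": 3.0e-06,
    "scene_d": 5.0e-06,
}
threshold, labels = split_by_median(example)
```

With an even number of scenes, `statistics.median` averages the two middle values, so the threshold does not have to coincide with any scene's own metric.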

All Team Members

Contribution & Acknowledgement

Dataset Contribution

Lu Ling: proposed and led the project. Designed the pipeline of the dataset, including data acquisition and data processing.

Yichen Sheng: worked on the data processing.

Yichen Sheng, Lu Ling, Wentian Zhao, Kun Wan, Cheng Xin, Zixun Yu, Zhi Tu, Qianyu Guo, Yawen Lu, Xuanmao Li, Aniruddha Mukherjee, Rohan Ashok, Xingpeng Sun, Xiangrui Kong: collected and labeled the data.

Paper Contribution

Lu Ling: worked on paper writing and conducted part of the experiments.

Yichen Sheng: conducted part of the experiments.

Lantao Yu, Zixun Yu, Yawen Lu, Qianyu Guo, Kun Wan, Cheng Xin: worked on proofreading.

Bedrich Benes, Gang Hua, Aniket Bera, Hao Kang, Tianyi Zhang: provided advisory input on the research framing and manuscript development.

Acknowledgement

We extend our heartfelt gratitude to our esteemed colleagues: Zhaopeng Wang, Jinghua Wu, Yueting Zhao, Haomeng Zhang, Aaditya Kharel, Izel Avila, Rahul Nahar, Mayesha Monjur, and Neel Acharya. Your invaluable contributions were instrumental in our endeavor to compile the DL3DV-10K dataset.