DL3DV-10K: A Large-Scale Scene Dataset for Deep Learning-based 3D Vision

  • 10,510 videos with a consistent capture standard.
  • A comprehensive new-view synthesis benchmark.
  • Each video has human-created labels for scene POI and complexity (lighting, surface materials, etc.).
  • Each video has calibrated camera poses.
  • Generalizable NeRF research.
  • Scene-level view consistency tracking.
  • Vision language model (VLM) research.
  • and more!

Dataset Statistics


10510

Video

10,510 different scenes with a consistent capture standard

4K/60 FPS

Quality

High-quality, high-frame-rate videos for high-quality new-view synthesis

65

POI

6 primary point-of-interest (POI) categories and 65 secondary POI categories to cover diverse real-world scenarios

96

Complexity

96 complexity categories to cover real-world complexities (environment, materials, lighting, etc.)

Fig. 1. Scene distribution by POI category. The angle of each class denotes its data proportion. Interior: primary POI category. Exterior: secondary POI category.
...
Fig. 2. Number of scenes within each secondary POI category. The legend contains the mapping between the primary and secondary POI categories. We observe that schools-universities and residential-area are the predominant scenes in our DL3DV-10K dataset. In contrast, locations such as government and civic service facilities (e.g., post office, police station, courthouse, and city hall) are less frequently captured because these areas are difficult to access for detailed video recording.
...
Fig. 3. We show the distribution of scene categories (the primary POI locations) by complexity indices, including environmental setting, light conditions, reflective surfaces, and transparent materials. Attributes in light conditions include natural light (nlight), artificial light (alight), and a combination of both (mlight). The reflection class includes more, medium, less, and none; the transparency class likewise.
Fig. 4. We show the distribution of video duration and frequency metric across the 10,510 videos. The minimum duration for videos shot with consumer mobile devices is set at 60 secs, while for drone cameras it is at least 45 secs. In our dataset, the median video duration is 69.5 secs. Furthermore, the median value of the frequency metric, determined by the average image intensity, stands at 2.6e-06. Based on this median value, we categorize scenes into high frequency (high freq) and low frequency (low freq) classes.
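The median-based split described in Fig. 4 can be sketched as follows. This is a minimal illustration and not the dataset's own tooling: the scene names and metric values are hypothetical, the frequency metric is taken as a precomputed number per scene, and treating a value equal to the median as low freq is an assumption.

```python
from statistics import median

# Hypothetical per-scene frequency-metric values (illustrative only,
# not taken from DL3DV-10K).
freq_metric = {
    "scene_a": 1.2e-06,
    "scene_b": 2.6e-06,
    "scene_c": 4.0e-06,
    "scene_d": 9.1e-07,
    "scene_e": 3.3e-06,
}


def split_by_median(metrics):
    """Label each scene 'high freq' or 'low freq' relative to the median.

    Assumption: a scene whose metric equals the median is labeled 'low freq'.
    """
    threshold = median(metrics.values())
    labels = {
        name: "high freq" if value > threshold else "low freq"
        for name, value in metrics.items()
    }
    return threshold, labels


threshold, labels = split_by_median(freq_metric)
for name, label in sorted(labels.items()):
    print(f"{name}: {label}")
```

With real data, `freq_metric` would hold one entry per video and the threshold would be the dataset-wide median reported in the caption.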

Team

Acknowledgement

We extend our heartfelt gratitude to our esteemed colleagues: Zhaopeng Wang, Jinghua Wu, Yueting Zhao, Haomeng Zhang, Aaditya Kharel, Izel Avila, Rahul Nahar, Mayesha Monjur, and Neel Acharya. Your invaluable contributions were instrumental in our endeavor to compile the NeRF-Verse dataset.