Computer Vision for Autonomous Vehicles:
Problems, Datasets and State-of-the-Art

Datasets & Benchmarks

  • Datasets have played a key role in the progress of many research fields by providing problem-specific examples with ground truth. They allow quantitative evaluation of approaches, providing key insights about their capacities and limitations. In particular, several of these datasets (Geiger et al. (2012b); Scharstein & Szeliski (2002); Baker et al. (2011); Everingham et al. (2010); Cordts et al. (2016)) also provide online evaluation servers which allow for a fair comparison on held-out test sets and give researchers in the field an up-to-date overview of the state of the art. This way, current progress and remaining challenges can easily be identified by the research community. In the context of autonomous vehicles, the KITTI dataset (Geiger et al. (2012b)) and the Cityscapes dataset (Cordts et al. (2016)) have introduced challenging benchmarks for reconstruction, motion estimation and recognition tasks, and contributed to closing the gap between laboratory settings and challenging real-world situations. Only a few years ago, datasets with a few hundred annotated examples were considered sufficient for many problems. The introduction of datasets with many hundreds to thousands of labeled examples, however, has led to spectacular breakthroughs in many computer vision disciplines by allowing high-capacity deep models to be trained in a supervised fashion. However, collecting a large amount of annotated data is not an easy endeavor, in particular for tasks such as optical flow or semantic segmentation. This initiated a collective effort to produce that kind of data in several areas by searching for ways to automate the process as much as possible, e.g., through semi-supervised learning or synthesis.

Real-World Datasets

  • While several algorithmic aspects can be inspected using synthetic data, real-world datasets are necessary to guarantee the performance of algorithms in real situations. For example, algorithms employed in practice need to handle complex objects and environments while facing challenging environmental conditions such as direct lighting, reflections from specular surfaces, fog or rain. The acquisition of ground truth is often labor intensive because very often this kind of information cannot be directly obtained with a sensor but requires tedious manual annotation. For example, Scharstein & Szeliski (2002) and Baker et al. (2011) acquire dense pixel-level annotations in a controlled lab environment, whereas Geiger et al. (2012b) and Kondermann et al. (2016) provide sparse pixel-level annotations of real street scenes using a LiDAR laser scanner.
  • Recently, crowdsourcing with Amazon's Mechanical Turk has become very popular for creating annotations for large-scale datasets, e.g., Deng et al. (2009); Lin et al. (2014); Leal-Taixé et al. (2015); Milan et al. (2016). However, the annotation quality obtained via Mechanical Turk is often not sufficient to be considered as reference, and significant efforts in post-processing and cleaning up the obtained labels are typically required. In the following, we will first discuss the most popular computer vision datasets and benchmarks addressing tasks relevant to autonomous vision. Thereafter, we will focus on datasets particularly dedicated to autonomous vehicle applications.

  • Stereo and 3D Reconstruction: The Middlebury stereo benchmark introduced by Scharstein & Szeliski (2002) provides several multi-frame stereo datasets for comparing the performance of stereo matching algorithms. Pixel-level ground truth is obtained by hand labeling and reconstructing planar components in piecewise planar scenes. Scharstein & Szeliski (2002) further provide a taxonomy of stereo algorithms that allows the comparison of design decisions as well as a test bed for quantitative evaluation. Approaches submitted to their benchmark website are evaluated using the root mean squared error and the percentage of bad pixels between the estimated and ground-truth disparity maps, as sketched in the code below.
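As a reference, a minimal NumPy sketch of these two stereo measures could look as follows; the array shapes, the invalid-pixel handling and the 1-pixel badness threshold are assumptions for illustration, not the benchmark's exact protocol.

```python
import numpy as np

def disparity_errors(d_est, d_gt, valid_mask=None, bad_thresh=1.0):
    """Root-mean-squared error and percentage of 'bad' pixels between an
    estimated and a ground-truth disparity map (both H x W arrays)."""
    if valid_mask is None:
        valid_mask = np.isfinite(d_gt)            # pixels with known ground truth
    diff = np.abs(d_est[valid_mask] - d_gt[valid_mask])
    rmse = np.sqrt(np.mean(diff ** 2))            # root mean squared error
    bad_pct = 100.0 * np.mean(diff > bad_thresh)  # percentage of bad pixels
    return rmse, bad_pct

# Toy usage with random maps standing in for real benchmark data.
rng = np.random.default_rng(0)
d_gt = rng.uniform(0, 64, size=(8, 8))
d_est = d_gt + rng.normal(0, 0.5, size=(8, 8))
print(disparity_errors(d_est, d_gt))
```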

  • Scharstein & Szeliski (2003) and Scharstein et al. (2014) introduced novel datasets to the Middlebury benchmark comprising more complex scenes and including ordinary objects like chairs, tables and plants. In both works, a structured lighting system was used to create ground truth. For the latest version, Middlebury v3, Scharstein et al. (2014) generate highly accurate ground truth for high-resolution stereo images with a novel technique for 2D subpixel correspondence search and self-calibration of cameras as well as projectors. This new version achieves significantly higher disparity and rectification accuracy than existing datasets and allows a more precise evaluation. An example depth map from the dataset is illustrated in Figure 1.
  • The Middlebury multi-view stereo (MVS) benchmark by Seitz et al. (2006) is a calibrated multi-view image dataset with registered ground-truth 3D models for the comparison of MVS approaches. The benchmark played a key role in the advances of MVS approaches but is relatively small in size with only two scenes. In contrast, the DTU MVS dataset by Jensen et al. (2014) provides 124 different scenes that were also recorded in a controlled laboratory environment. Reference data is obtained by combining structured light scans from each camera position, and the resulting scans are very dense, each containing 13.4 million points on average. For 44 scenes, the full 360 degree model was obtained by rotating and scanning four times at 90 degree intervals. In contrast to the datasets so far, Schöps et al. (2017) provide scenes that are not carefully staged in a controlled laboratory environment and thus represent real-world challenges. Schöps et al. (2017) recorded high-resolution DSLR imagery as well as synchronized low-resolution stereo videos in a variety of indoor and outdoor scenes. A high-precision laser scanner is used to register all images with a robust method. The high-resolution images enable the evaluation of detailed 3D reconstruction while the low-resolution stereo images are provided to compare approaches for mobile devices.

  • Optical Flow: The Middlebury flow benchmark by Baker et al. (2011) provides sequences with non-rigid motion, synthetic sequences and a subset of the Middlebury stereo benchmark sequences (static scenes) for the evaluation of optical flow methods. For all non-rigid sequences, ground truth flow is obtained by tracking hidden fluorescent textures sprayed onto the objects using a toothbrush. The dataset comprises eight different sequences with eight frames each. Ground truth is provided for one pair of frames per sequence.

  • Besides the limited size, real-world challenges like complex structures, lighting variation and shadows are missing, as the dataset necessitates laboratory conditions which allow for manipulating the light source between individual captures. In addition, it only comprises very small motions of up to twelve pixels, which does not permit investigating the challenges posed by fast motions. Compared to other datasets, however, the Middlebury dataset allows evaluating sub-pixel precision since it provides very accurate and dense ground truth. Performance is measured using the average angular error (AAE) and the endpoint error (EPE) between the estimated flow and the ground truth; both measures are sketched in the code below.
  • Janai et al. (2017) present a novel optical flow dataset comprising complex real-world scenes, in contrast to the laboratory setting of Middlebury. High-speed video cameras are used to create accurate reference data by tracking pixels through densely sampled space-time volumes. This method allows acquiring optical flow ground truth in challenging everyday scenes in an automatic fashion and augmenting realistic effects such as motion blur to compare methods under varying conditions. Janai et al. (2017) provide 160 diverse real-world sequences of dynamic scenes with a significantly larger resolution (1280×1024 pixels) than previous optical flow datasets and compare several state-of-the-art optical flow techniques on this data.
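As a reference, both flow measures can be written down in a few lines of NumPy; the sketch below assumes flow fields stored as H x W x 2 arrays (dx, dy) and uses the common convention of measuring the angular error between the space-time vectors (u, v, 1).

```python
import numpy as np

def flow_errors(flow_est, flow_gt):
    """Average angular error (AAE, degrees) and average endpoint error
    (EPE, pixels) between two flow fields of shape (H, W, 2)."""
    u, v = flow_est[..., 0], flow_est[..., 1]
    ug, vg = flow_gt[..., 0], flow_gt[..., 1]
    # Angular error between the 3D vectors (u, v, 1) and (ug, vg, 1).
    num = u * ug + v * vg + 1.0
    den = np.sqrt(u ** 2 + v ** 2 + 1.0) * np.sqrt(ug ** 2 + vg ** 2 + 1.0)
    aae = np.degrees(np.arccos(np.clip(num / den, -1.0, 1.0))).mean()
    # Endpoint error: Euclidean distance between the 2D flow vectors.
    epe = np.sqrt((u - ug) ** 2 + (v - vg) ** 2).mean()
    return aae, epe
```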

  • Object Recognition and Segmentation: The availability of large-scale, publicly available datasets such as ImageNet (Deng et al. (2009)), PASCAL VOC (Everingham et al. (2010)), Microsoft COCO (Lin et al. (2014)), Cityscapes (Cordts et al. (2016)) and TorontoCity (Wang et al. (2016)) has had a major impact on the success of deep learning in object classification, detection, and semantic segmentation tasks.
  • The PASCAL Visual Object Classes (VOC) challenge by Everingham et al. (2010) is a benchmark for object classification, object detection, object segmentation and action recognition. It consists of challenging consumer photographs collected from Flickr with high-quality annotations and contains large variability in pose, illumination and occlusion. Since its introduction, the VOC challenge has been very popular and was updated yearly and adapted to the needs of the community until the end of the program in 2012. Whereas the first challenge in 2005 had only 4 different classes, 20 different object classes were introduced in 2007. Over the years, the benchmark grew in size, reaching a total of 11,530 images with 27,450 ROI-annotated objects in 2012.
  • In 2014, Lin et al. (2014) introduced the Microsoft COCO dataset for object detection, instance segmentation and contextual reasoning. They provide images of complex everyday scenes containing common objects in their natural context. The dataset comprises 91 object classes, 2.5 million annotated instances and 328k images in total. Microsoft COCO is significantly larger in the number of instances per class than the PASCAL VOC object segmentation benchmark. All objects are annotated with per-instance segmentations in an extensive crowd-worker effort. Similar to PASCAL VOC, the intersection-over-union metric is used for evaluation; a minimal sketch of this criterion for bounding boxes is given below.
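The sketch below shows the intersection-over-union criterion for two axis-aligned bounding boxes; the (x1, y1, x2, y2) coordinate convention is an assumption for illustration.

```python
def box_iou(a, b):
    """Intersection-over-union of two axis-aligned boxes (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])   # intersection rectangle
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter)

# In PASCAL VOC, a detection typically counts as correct if IoU >= 0.5.
print(box_iou((0, 0, 10, 10), (5, 5, 15, 15)))    # ~0.143
```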

  • Tracking: Leal-Taixé et al. (2015) and Milan et al. (2016) present the MOTChallenge, which addresses the lack of a centralized benchmark for multi-object tracking. The benchmark contains 14 challenging video sequences in unconstrained environments filmed with static and moving cameras and subsumes many existing multi-object tracking benchmarks such as PETS (Ferryman & Shahrokni (2009)) and KITTI (Geiger et al. (2012b)). Annotations are provided for three object classes: moving or standing pedestrians, people that are not in an upright position, and others. For evaluation they use the two popular tracking measures introduced by Stiefelhagen et al. (2007), Multiple Object Tracking Accuracy (MOTA) and Multiple Object Tracking Precision (MOTP); a sketch of both measures is given after this list. Detection ground truth provided by the authors makes it possible to analyze the performance of tracking systems independently of a detection system. Methods using a detector and methods using the detection ground truth can be compared separately on their website.
  • Aerial Image Datasets: The ISPRS benchmark (Rottensteiner et al. (2013, 2014)) provides data acquired by airborne sensors for urban object detection as well as 3D building reconstruction and segmentation. It consists of two datasets: Vaihingen and Downtown Toronto. The object classes considered in the object detection task are building, road, tree, ground, and car. The Vaihingen dataset provides three areas with various object classes and a large test site for road detection algorithms. The Downtown Toronto dataset covers an area of about 1.45 km² in the central area of Toronto, Canada. Similarly to Vaihingen, there are two smaller areas for object extraction and building reconstruction, as well as one large area for road detection. For each test area, aerial images with orientation parameters, a digital surface model (DSM), an orthophoto mosaic and airborne laser scans are provided. The quality of the approaches is assessed using several metrics for detection and reconstruction. In both cases, completeness, correctness and quality are assessed on a per-area level and a per-object level.
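Once hypotheses have been matched to ground-truth objects in every frame (the matching itself, typically solved with the Hungarian algorithm, is omitted here), the two CLEAR MOT measures used on MOTChallenge reduce to simple ratios; the sketch below uses toy counts to illustrate them.

```python
def clear_mot(fn, fp, id_switches, num_gt, total_dist, num_matches):
    """CLEAR MOT measures given per-sequence accumulated counts.

    fn, fp, id_switches, num_gt -- misses, false positives, identity
                                   switches and ground-truth objects
    total_dist, num_matches     -- summed distance (or 1 - IoU) over all
                                   matched pairs and their number
    """
    mota = 1.0 - (fn + fp + id_switches) / num_gt   # tracking accuracy
    motp = total_dist / num_matches                 # tracking precision
    return mota, motp

# Toy counts, not taken from any real benchmark submission.
print(clear_mot(fn=120, fp=80, id_switches=5, num_gt=2000,
                total_dist=310.0, num_matches=1795))
```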

  • Autonomous Driving: In 2012, Geiger et al. (2012b, 2013) introduced the KITTI Vision Benchmark for stereo, optical flow, visual odometry/SLAM and 3D object detection (Figure). The dataset has been captured from an autonomous driving platform and comprises six hours of recordings using high-resolution color and grayscale stereo cameras, a Velodyne 3D laser scanner and a high-precision GPS/IMU inertial navigation system. The stereo and optical flow benchmarks derived from this dataset comprise 194 training and 195 test image pairs at a resolution of 1280 × 376 pixels, with sparse ground truth obtained by projecting accumulated 3D laser point clouds onto the image; a sketch of this projection step is given below. Due to the limitations of the rotating laser scanner used as reference sensor, the stereo and optical flow benchmark is restricted to static scenes with camera motion.
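A minimal sketch of the projection step is shown below; the matrix names follow the conventions of the KITTI calibration files (Tr_velo_to_cam, R0_rect, P2), but the code is only an illustration of the general idea, not KITTI's exact tooling.

```python
import numpy as np

def project_velodyne_to_image(points_velo, Tr_velo_to_cam, R0_rect, P2):
    """Project N x 3 Velodyne points into the image plane.

    Tr_velo_to_cam : 3x4 rigid transform from LiDAR to camera coordinates
    R0_rect        : 3x3 rectifying rotation of the reference camera
    P2             : 3x4 projection matrix of the left color camera
    Returns N x 2 pixel coordinates and the corresponding depths.
    """
    n = points_velo.shape[0]
    pts_h = np.hstack([points_velo, np.ones((n, 1))])   # homogeneous coordinates
    cam = R0_rect @ (Tr_velo_to_cam @ pts_h.T)           # 3 x N, camera frame
    img = P2 @ np.vstack([cam, np.ones((1, n))])         # 3 x N, image plane
    depth = img[2]
    uv = (img[:2] / depth).T                              # N x 2 pixel positions
    # In practice, points behind the camera (depth <= 0) and projections
    # falling outside the image boundaries have to be discarded.
    return uv, depth
```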

  • To provide ground truth motion fields for dynamic scenes, Menze & Geiger (2015) have annotated 400 dynamic scenes, fitting accurate 3D CAD models to all vehicles in motion in order to obtain flow and stereo ground truth for these objects. The KITTI flow and stereo benchmarks use the percentage of erroneous (bad) pixels to assess the performance of the submitted methods. Additionally, Menze & Geiger (2015) combined the stereo and flow ground truth to form a novel 3D scene flow benchmark. For evaluating scene flow, they combine classical stereo and optical flow measures.
  • The visual odometry / SLAM challenge consists of 22 stereo sequences with a total length of 39.2 km. The ground truth pose is obtained using a GPS/IMU localization unit which was fed with RTK correction signals. The translational and rotational error averaged over a particular trajectory length is considered for evaluation; a simplified sketch of this kind of measure is given after this list.
  • For the KITTI object detection challenge, a special 3D labeling tool has been developed to annotate all 3D objects with 3D bounding boxes for 7,481 training and 7,518 test images. The benchmark for the object detection task was separated into vehicle, pedestrian and cyclist detection tasks, allowing the analysis to focus on the most important problems in the context of autonomous vehicles. Following PASCAL VOC (Everingham et al. (2010)), the intersection-over-union (IoU) metric is used for evaluation. For an additional evaluation, this metric has been extended to capture both 2D detection and 3D orientation estimation performance. A true 3D evaluation is planned to be released shortly.
  • The KITTI benchmark was extended by Fritsch et al. (2013) to the task of road/lane detection. In total, 600 diverse training and test images have been selected for manual annotation of road and lane areas. Mattyus et al. (2016) used aerial images to enhance the KITTI dataset with fine-grained segmentation categories such as parking spots and sidewalks as well as the number and location of road lanes. The KITTI dataset has established itself as one of the standard benchmarks for all of the aforementioned tasks, in particular in the context of autonomous driving applications.
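The odometry drift measure mentioned above can be sketched in a strongly simplified form: instead of KITTI's sub-sequences of 100 to 800 meters, the code below averages the residual translation and rotation over segments of a fixed frame offset, with poses given as 4x4 homogeneous matrices; this is a simplification for illustration only.

```python
import numpy as np

def segment_errors(poses_gt, poses_est, step=10):
    """Simplified translational / rotational drift between two trajectories.

    poses_gt, poses_est : lists of 4x4 camera-to-world pose matrices
    step                : frame offset defining each evaluation segment
    """
    t_errs, r_errs = [], []
    for i in range(len(poses_gt) - step):
        # Relative motion over the segment, for ground truth and estimate.
        rel_gt = np.linalg.inv(poses_gt[i]) @ poses_gt[i + step]
        rel_est = np.linalg.inv(poses_est[i]) @ poses_est[i + step]
        err = np.linalg.inv(rel_gt) @ rel_est
        t_errs.append(np.linalg.norm(err[:3, 3]))          # translation residual
        # Rotation angle of the residual rotation matrix.
        cos_angle = np.clip((np.trace(err[:3, :3]) - 1.0) / 2.0, -1.0, 1.0)
        r_errs.append(np.arccos(cos_angle))
    return float(np.mean(t_errs)), float(np.mean(r_errs))
```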

  • Complementary to other datasets, the HCI benchmark proposed by Kondermann et al. (2016) specifically includes realistic, systematically varied radiometric and geometric challenges. Overall, a total of 28,504 stereo pairs with stereo and flow ground truth are provided. In contrast to previous datasets, ground truth uncertainties have been estimated for all static regions. The uncertainty estimate is derived from pixel-wise error distributions for each frame which are computed based on Monte Carlo sampling. Dynamic regions are manually masked out and annotated with approximate ground truth for 3,500 image pairs.

  • The major limitation of this dataset is that all sequences were recorded in a single street section, thus lacking diversity. On the other hand, this enabled better control over the content and environmental conditions. In contrast to the mobile laser scanning solution of KITTI, the static scene is scanned only once using a high-precision laser scanner in order to obtain a dense and highly accurate ground truth of all static parts. Besides the metrics used in KITTI and Middlebury, they use semantically meaningful performance metrics such as edge fattening and surface smoothness for evaluation (Honauer et al. (2015)). The HCI benchmark is rather new and not established yet, but the controlled environment allows simulating rarely occurring events such as accidents, which are of great interest in the evaluation of autonomous driving systems.
  • The Caltech Pedestrian Detection Benchmark proposed by Dollar et al. (2009) provides 250,000 frames of sequences recorded by a vehicle while driving through regular traffic in an urban environment. 350,000 bounding boxes and 2,300 unique pedestrians were annotated, including temporal correspondence between bounding boxes and detailed occlusion labels. Methods are evaluated by plotting the miss rate against false positives and varying the threshold on detection confidence; a simple sketch of this curve is given after this list.
  • The Cityscapes Dataset by Cordts et al. (2016) provides a benchmark and large-scale dataset for pixel-level and instance-level semantic labeling that captures the complexity of real-world urban scenes. It consists of a large, diverse set of stereo video sequences recorded in the streets of different cities. High-quality pixel-level annotations are provided for 5,000 images, while 20,000 additional images have been annotated with coarse labels obtained using a novel crowdsourcing platform. For two semantic granularities, i.e., classes and categories, they report mean performance scores and evaluate the intersection-over-union metric at instance level to assess how well individual instances are represented in the labeling.
  • The TorontoCity benchmark presented by Wang et al. (2016) covers the greater Toronto area with 712 km² of land, 8,439 km of road and around 400,000 buildings. The benchmark covers a large variety of tasks including building height estimation (reconstruction), road centerline and curb extraction, building instance segmentation, building contour extraction, semantic labeling and scene type classification. The dataset was captured from airplanes, drones, and cars driving around the city to provide different perspectives.
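The miss-rate versus false-positives-per-image curve used by the Caltech benchmark above can be traced by sweeping the confidence threshold, assuming each detection has already been matched against the ground truth; the toy sketch below illustrates this with made-up detections.

```python
import numpy as np

def miss_rate_vs_fppi(scores, is_true_positive, num_gt, num_images):
    """Miss rate / false positives per image, sweeping the score threshold."""
    order = np.argsort(-scores)                # most confident detections first
    tp = np.cumsum(is_true_positive[order])
    fp = np.cumsum(~is_true_positive[order])
    fppi = fp / num_images                     # false positives per image
    miss_rate = 1.0 - tp / num_gt              # fraction of missed pedestrians
    return fppi, miss_rate

# Toy detections, not taken from any real benchmark.
scores = np.array([0.9, 0.8, 0.7, 0.6, 0.3])
is_tp = np.array([True, True, False, True, False])
print(miss_rate_vs_fppi(scores, is_tp, num_gt=5, num_images=2))
```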

  • Long-Term Autonomy: Several datasets such as KITTI or Cityscapes focus on the development of algorithmic competences for autonomous driving but do not address the challenges of long-term autonomy, for example environmental changes over time. To address this problem, a novel dataset for autonomous driving has been presented by Maddern et al. (2016). They collected images, LiDAR and GPS data while traversing 1,000 km in central Oxford in the UK during one year. This allowed them to capture large variations in scene appearance due to illumination, weather and seasonal changes, dynamic objects, and constructions. Such long-term datasets allow for in-depth investigation of problems that hinder the realization of autonomous vehicles, such as localization at different times of the year.

Synthetic Data

  • The generation of ground truth for real examples is very labor intensive and often not even possible at large scale when pixel-level annotations are required. On the other hand, pixel-level ground truth for large-scale synthetic datasets can be acquired easily. However, the creation of realistic virtual worlds is time-consuming. The popularity of movies and video games has led to an industry creating very realistic 3D content, which nourishes the hope of replacing real data completely with synthetic datasets. Consequently, several synthetic datasets have been proposed recently, but it remains an open question whether the realism and variety attained are sufficient to replace real-world datasets. Besides, creating realistic virtual content is a time-consuming and expensive process itself, and the trade-off between real and synthetic (or augmented) data is not clear yet.

  • MPI Sintel: The MPI Sintel Flow benchmark presented by Butler et al. (2012) takes advantage of the open source movie Sintel, a short animated film, to render scenes of varying complexity with optical flow ground truth. In total, Sintel comprises 1,628 frames. Different datasets obtained using different passes of the rendering pipeline vary in complexity, as shown in Figure 3. The albedo pass has roughly piecewise constant colors without illumination effects while the clean pass introduces illumination of various kinds. The final pass adds atmospheric effects, blur, color correction and vignetting. In addition to the average endpoint error, the benchmark website provides different rankings of the methods based on speed, occlusion boundaries, and disocclusions.

  • Flying Chairs and Flying Things: The limited size of optical flow datasets hampered the training of deep high-capacity models. To train a convolutional neural network, Dosovitskiy et al. (2015) thus introduced a simple synthetic 2D dataset of flying chairs rendered on top of random background images from Flickr. As the limited realism and size of this dataset proved insufficient to learn highly accurate models, Mayer et al. (2016) presented another large-scale dataset consisting of three synthetic stereo video datasets: FlyingThings3D, Monkaa and Driving. FlyingThings3D provides everyday 3D objects flying along randomized 3D trajectories in a randomly created scene. Inspired by the KITTI dataset, a driving dataset has been created which uses car models from the same pool as FlyingThings3D and additionally highly detailed tree and building models from 3D Warehouse. Monkaa is an animated short movie, similar to the Sintel movie used in the MPI Sintel benchmark. A toy sketch of the Flying-Chairs-style data generation idea is given after this list.
  • Game Engines: Unfortunately, data from animated movies is very limited since the content is hard to change and such movies are rarely open source. In contrast, game engines allow for creating an infinite amount of data. One way to create virtual worlds using a game engine is presented by Gaidon et al. (2016), who introduce the Virtual KITTI dataset. They present an efficient real-to-virtual world cloning method to create realistic proxy worlds. A cloned virtual world makes it possible to vary conditions such as weather or illumination and to use different camera settings. This way, the proxy world can be used for virtual data augmentation to train deep networks. Virtual KITTI contains 35 photo-realistic synthetic videos with a total of 17,000 high-resolution frames. They provide ground truth for object detection, tracking, scene and instance segmentation, depth and optical flow.
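To make the Flying-Chairs-style data generation idea concrete, the toy sketch below composites a foreground sprite onto a background image, applies random translations to both layers and records the induced ground-truth flow; the real dataset uses rendered chair images and full affine motions, so everything here (shapes, motion ranges, wrap-around at the borders) is a simplification for illustration only.

```python
import numpy as np

def make_synthetic_flow_pair(background, sprite, rng=np.random.default_rng(0)):
    """Toy Flying-Chairs-style frame pair with ground-truth flow.

    background : H x W x 3 image,  sprite : h x w x 3 image with h < H, w < W
    Returns (frame1, frame2, flow) with flow of shape H x W x 2 as (dx, dy).
    """
    H, W, _ = background.shape
    h, w, _ = sprite.shape
    bg_shift = rng.integers(-3, 4, size=2)      # (dy, dx) background motion
    fg_shift = rng.integers(-8, 9, size=2)      # (dy, dx) sprite motion
    y, x = int(rng.integers(0, H - h)), int(rng.integers(0, W - w))

    frame1 = background.copy()
    frame1[y:y + h, x:x + w] = sprite           # paste sprite into frame 1

    frame2 = np.roll(background, bg_shift, axis=(0, 1))   # moved background
    y2 = int(np.clip(y + fg_shift[0], 0, H - h))
    x2 = int(np.clip(x + fg_shift[1], 0, W - w))
    frame2[y2:y2 + h, x2:x2 + w] = sprite       # sprite at its new position

    flow = np.zeros((H, W, 2), dtype=np.float32)
    flow[...] = bg_shift[::-1]                  # background flow (dx, dy)
    flow[y:y + h, x:x + w] = (x2 - x, y2 - y)   # sprite region overrides it
    return frame1, frame2, flow
```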

  • In concurrent work, Ros et al. (2016) created SYNTHIA, a synthetic collection of Imagery and Annotations of urban scenarios for semantic segmentation. They rendered a virtual city with the Unity Engine. The dataset consists of 13,400 randomly taken virtual images from the city and four video sequences with 200,000 frames in total. Pixel-level semantic annotations are provided for 13 classes.

  • Richter et al. (2016) have extracted pixel-accurate semantic label maps for images from the commercial video game Grand Theft Auto V. Towards this goal, they developed a wrapper which operates between the game and the graphics hardware to obtain pixel-accurate object signatures across time and instances. The wrapper allows them to produce dense semantic annotations for 25 thousand images synthesized by the photorealistic open-world computer game with minimal human supervision. However, for legal reasons, the extracted 3D geometry cannot be made publicly available. Similarly, Qiu & Yuille (2016) provide an open-source tool to create virtual worlds by accessing and modifying the internal data structure of Unreal Engine 4. They show how virtual worlds can be used to test deep learning algorithms by linking them with the deep learning framework Caffe (Jia et al. (2014)).
