Semi-supervised Monocular 3D Object Detection by Multi-view Consistency


The success of monocular 3D object detection highly relies on considerable labeled data, which is costly to obtain. To alleviate the annotation effort, we propose MVC-MonoDet, the first semi-supervised training framework that improves Monocular 3D object detection by enforcing multi-view consistency. In particular, a box-level regularization and an object-level regularization are designed to enforce the consistency of 3D bounding box predictions of the detection model across unlabeled multi-view data (stereo or video). The box-level regularizer requires the model to consistently estimate 3D boxes in different views so that the model can learn cross-view invariant features for 3D detection. The object-level regularizer employs an object-wise photometric consistency loss that mitigates 3D box estimation error through structure-from-motion (SFM). A key innovation in our approach to effectively utilize these consistency losses from multi-view data is a novel relative depth module that replaces the standard depth module in vanilla SFM. This technique allows the depth estimation to be coupled with the estimated 3D bounding boxes, so that the derivative of consistency regularization can be used to directly optimize the estimated 3D bounding boxes using unlabeled data. We show that the proposed semi-supervised learning techniques effectively improve the performance of 3D detection on the KITTI and nuScenes datasets. We also demonstrate that the framework is flexible and can be adapted to both stereo and video data.

Proceedings of the European conference on computer vision (ECCV)
Ying-Cong Chen
Ying-Cong Chen
Assistant Professor

Ying-Cong Chen is an Assistant Professor at AI Thrust, Information Hub of Hong Kong University of Science and Technology (Guangzhou Campus). He obtained his Ph.D. degree from the Chinese University of Hong Kong. His research lies in the broad area of computer vision and machine learning, aiming for empowering machine with the capacity to understand human appearance, physiology and psychology. His works contribute to a wide range of applications, including contactless health monitoring, semantic photo synthesis, and intelligent video surveillance.