The success of monocular 3D object detection relies heavily on large amounts of labeled data, which are costly to obtain. To reduce the annotation effort, we propose MVC-MonoDet, the first semi-supervised training framework that improves monocular 3D object detection by enforcing multi-view consistency. In particular, we design a box-level regularizer and an object-level regularizer that enforce consistency of the detection model's 3D bounding-box predictions across unlabeled multi-view data (stereo or video). The box-level regularizer requires the model to estimate consistent 3D boxes across views, so that it learns cross-view-invariant features for 3D detection. The object-level regularizer applies an object-wise photometric consistency loss that reduces 3D box estimation error through structure-from-motion (SfM). A key innovation that allows us to effectively exploit these consistency losses on multi-view data is a novel relative-depth module that replaces the standard depth module in vanilla SfM. This couples depth estimation to the estimated 3D bounding boxes, so that the gradient of the consistency regularization directly optimizes the estimated 3D boxes on unlabeled data. We show that the proposed semi-supervised learning techniques effectively improve 3D detection performance on the KITTI and nuScenes datasets, and that the framework is flexible enough to be adapted to both stereo and video data.
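The box-level regularizer described above can be sketched as follows. This is a minimal, hypothetical illustration (not the paper's implementation): boxes predicted in two views are compared after mapping one prediction into the other camera's frame using the known relative pose, with names such as `box_consistency_loss` chosen here for exposition.

```python
import numpy as np

def transform_center(center, R, t):
    # Map a 3D box center from view A's camera frame into view B's frame
    # using the known relative pose (R, t) between the two cameras.
    return R @ center + t

def box_consistency_loss(box_a, box_b, R, t):
    """Hypothetical box-level consistency loss: predictions for the same
    object in two views should agree once expressed in a common frame.
    Each box is (center[3], size[3], yaw). For rectified stereo the
    relative pose is a pure translation, so yaw is unchanged across
    views; a general pose would also require rotating the yaw."""
    ca, sa, ya = box_a
    cb, sb, yb = box_b
    center_err = np.abs(transform_center(ca, R, t) - cb).sum()
    size_err = np.abs(sa - sb).sum()  # object size is view-invariant
    yaw_err = abs(np.arctan2(np.sin(ya - yb), np.cos(ya - yb)))
    return center_err + size_err + yaw_err
```

For a rectified stereo pair with baseline b, the right camera sees the same object shifted by -b along x, so consistent predictions in the two views yield zero loss; on unlabeled pairs, minimizing this loss pushes the monocular detector toward cross-view-agreeing boxes.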