We have successfully performed Weakly supervised Lidar object detection on the hard KITTI dataset. Although we are not beating VS3D, our results are quite satisfactory. VS3D uses online proposal generation, while our whole proposal generation pipeline is offline.

Advantages

  1. Fast convergence: Our model converges really fast in around 5-10 epochs using the pre-trained Imagenet weights
  2. Geometrically grounded proposal generation: Density-based clustering is a solid technique for object detection. Barring the inability to produce clusters far away from the sensor, most of the clusters proposed are of high quality
  3. No Requirement for Annotations: Our network can be trained on any dataset with just classification labels which are easy to collect in driving scenes. Annotating in 3D is a herculean task compared to images.
  4. Faster Proposal generation: Due to the availability of fast clustering methods in C++, we can achieve a higher inference speed compared to the online proposal generation method.

Drawbacks

  1. Less quality of proposals in 3D: When projected to the front camera the proposals are of high quality, but in 3D scenes, most of the time the proposals are smaller than the ground truth. This mainly occurs in they-direction we saw in negative results
  2. Bottlenecked loss: As the network is trained on Focal/ BCE loss, there is a limit to which the network can learn just from the labels. This disadvantage comes with the advantage of training without ground truth bounding boxes

Future Work

  1. Augment the already obtained bounding boxes in a way that we get more proposals. As most of the proposals even due to occlusion roughly gives a position of a probable object augmenting the boxes can produce better results.
  2. Training with bounding box length regressor. The length (y-axis) is where most of the occlusion occurs, so training with a bounding box regressor for the y-axis can improve the results. That does not require full annotation, only a single axis annotation is required.