As noted in Approach, our first model was a failure. We trained our BEV-based weakly supervised detection model on the data and got mAP values close to zero.
| IoU | 2D BEV car AP |
| --- | --- |
| 0.1 | 5.11 |
| 0.2 | 2.1 |
| 0.3 | 0.79 |
| 0.4 | 0.32 |
Figure: predicted boxes compared with ground-truth car boxes.
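To make the IoU thresholds reported above concrete, here is a minimal sketch of how a predicted BEV box can be tested against ground truth at a given threshold. It assumes axis-aligned boxes given as `(x_min, y_min, x_max, y_max)` in metres, which is a simplification since real BEV car boxes are usually rotated. The helper names `bev_iou` and `is_true_positive` are hypothetical and not taken from the project code.

```python
def bev_iou(box_a, box_b):
    """IoU of two axis-aligned BEV boxes given as (x_min, y_min, x_max, y_max)."""
    ax0, ay0, ax1, ay1 = box_a
    bx0, by0, bx1, by1 = box_b
    # Overlap extent along each axis, clamped to zero when the boxes are disjoint.
    inter_w = max(0.0, min(ax1, bx1) - max(ax0, bx0))
    inter_h = max(0.0, min(ay1, by1) - max(ay0, by0))
    inter = inter_w * inter_h
    union = (ax1 - ax0) * (ay1 - ay0) + (bx1 - bx0) * (by1 - by0) - inter
    return inter / union if union > 0 else 0.0


def is_true_positive(pred_box, gt_boxes, iou_threshold):
    """A prediction counts as a hit if it overlaps any ground-truth box enough."""
    return any(bev_iou(pred_box, gt) >= iou_threshold for gt in gt_boxes)


# Illustrative numbers only: a 2 m x 4 m ground-truth car, and a prediction
# that is shifted by 0.5 m and covers only half of the car's length.
gt_car = (0.0, 0.0, 2.0, 4.0)
pred = (0.5, 0.0, 2.5, 2.0)

iou = bev_iou(pred, gt_car)                       # 3 / 9, about 0.333
hit_at_03 = is_true_positive(pred, [gt_car], 0.3)  # True
hit_at_04 = is_true_positive(pred, [gt_car], 0.4)  # False
```

A full AP computation would additionally sort predictions by confidence and match each ground-truth box at most once; this sketch only shows the per-box threshold test. It also shows why a box with a badly estimated length drops out quickly as the IoU threshold rises.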
Reasons for failure
- No high-quality starting weights: Unlike standard RGB-based object detection, we cannot use ImageNet pre-trained weights on BEV images, because BEV images are very sparse compared to RGB images and have more channels. For instance, to avoid losing much information, our BEV representation contains 36 channels.
- Occlusion of points: Assuming the car faces in the y-direction, the back side of most objects of interest (other cars, pedestrians, etc.) is occluded, so the length of a proposal will be very inaccurate. In BEV, this occlusion strongly affects ROI pooling, so the network fails to learn from ROIs.
- Weak feature extractor (PIXOR): For our BEV-based object detection we used PIXOR’s backbone as the feature extractor. PIXOR is trained for supervised object detection and contains a bounding box regressor. It is essentially a single-shot detector, so ROI pooling from its encoded feature maps produced weak features.
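The sparsity mentioned in the first point can be illustrated with a small sketch that bins LiDAR points into a multi-channel BEV occupancy grid. The grid extents, the 0.5 m cell size, the choice to treat all 36 channels as height slices, and the function name `rasterize_bev` are assumptions made for illustration; the BEV encoding actually used in the project may differ.

```python
def rasterize_bev(points, x_range, y_range, z_range, cell_size, num_height_bins):
    """Bin (x, y, z) points into a BEV grid.

    Returns the set of occupied (height_bin, row, col) cells and the grid shape.
    Only occupied cells are stored, which is natural for a sparse grid.
    """
    x_min, x_max = x_range
    y_min, y_max = y_range
    z_min, z_max = z_range
    cols = int(round((x_max - x_min) / cell_size))
    rows = int(round((y_max - y_min) / cell_size))
    bin_height = (z_max - z_min) / num_height_bins

    occupied = set()
    for x, y, z in points:
        # Drop points outside the region of interest.
        if not (x_min <= x < x_max and y_min <= y < y_max and z_min <= z < z_max):
            continue
        # Clamp indices to guard against floating-point rounding at the upper edge.
        col = min(int((x - x_min) / cell_size), cols - 1)
        row = min(int((y - y_min) / cell_size), rows - 1)
        height_bin = min(int((z - z_min) / bin_height), num_height_bins - 1)
        occupied.add((height_bin, row, col))
    return occupied, (num_height_bins, rows, cols)


# Toy point cloud (made-up values): two nearby points, one distant point,
# and one point outside the grid.
points = [
    (1.2, 2.3, 0.05),
    (1.3, 2.4, 0.07),
    (5.1, 6.2, -1.45),
    (50.0, 0.0, 0.0),
]
occupied, shape = rasterize_bev(
    points,
    x_range=(0.0, 8.0),
    y_range=(0.0, 8.0),
    z_range=(-2.0, 1.6),
    cell_size=0.5,
    num_height_bins=36,
)
total_cells = shape[0] * shape[1] * shape[2]
occupancy = len(occupied) / total_cells  # fraction of non-empty cells
```

Even with a realistic number of points, most cells of such a grid stay empty, and the channel count does not match the three channels that ImageNet-trained first layers expect, which is consistent with the difficulty of reusing RGB pre-trained weights.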