Fast Vehicle Detection and Tracking on Fisheye Traffic Monitoring Video Using CNN and Bounding Box Propagation

1. Video visualization

Green box: ground-truth
Blue box: YOLOv5 detection
Yellow box: bounding box propagation

fisheye-day-test-30072020_01_fisheye_day_test

fisheye-day-test-30072020_02_fisheye_day_test

fisheye-day-test-30072020_03_fisheye_day_test

fisheye-night-test-30072020_CLIP_20200628-210253

fisheye-night-test-30072020_CLIP_20200628-210808

Resources

Citation: Ardianto, Sandy, Hsueh-Ming Hang, and Wen-Huang Cheng. "Fast Vehicle Detection and Tracking on Fisheye Traffic Monitoring Video Using CNN and Bounding Box Propagation." In 2022 IEEE International Conference on Image Processing (ICIP), pp. 1891-1895. IEEE, 2022.

Bibtex:

@inproceedings{ardianto2022fast,
  title={Fast Vehicle Detection and Tracking on Fisheye Traffic Monitoring Video Using CNN and Bounding Box Propagation},
  author={Ardianto, Sandy and Hang, Hsueh-Ming and Cheng, Wen-Huang},
  booktitle={Proceedings of the International Conference on Image Processing (ICIP)},
  pages={1891--1895},
  year={2022},
  organization={IEEE}
}

Dataset Download:

2. Experiments

In our experiments, each image is resized to 640x640 and fed into the YOLOv5 object detector. The initial model was pretrained on the COCO dataset. Speed is measured on a PC with an Intel i9-9900X CPU and a single NVIDIA 1080Ti GPU. Unless stated otherwise, the experiment settings are as follows:

Test data: Night video + Day video (both)
Modality: Keyframe: RGB, Intermediate frame: frame difference
Segment size: 3 (KF: 2, IF: 1)
High confidence car propagation: threshold: 0.8, and success & fail limit: 3
Car validation network: ResNet18, input size: 128x128
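
For reference, the default settings listed above can be collected in one place. The sketch below is only an illustration: the class name `ExperimentSettings` and all field names are hypothetical and do not come from the authors' code; only the values are taken from the list.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass(frozen=True)
class ExperimentSettings:
    """Default experiment settings (field names are hypothetical)."""
    detector_input_size: Tuple[int, int] = (640, 640)
    keyframe_modality: str = "RGB"
    intermediate_modality: str = "frame difference"
    segment_size: int = 3
    keyframes_per_segment: int = 2
    intermediate_frames_per_segment: int = 1
    hcc_threshold: float = 0.8
    hcc_success_limit: int = 3
    hcc_fail_limit: int = 3
    validation_model: str = "ResNet18"
    validation_input_size: Tuple[int, int] = (128, 128)

    def __post_init__(self):
        # A segment is made up of keyframes and intermediate frames only.
        total = self.keyframes_per_segment + self.intermediate_frames_per_segment
        if total != self.segment_size:
            raise ValueError("keyframes + intermediate frames must equal segment size")


DEFAULT_SETTINGS = ExperimentSettings()
```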

2.1. Detection rates in 3 regions for YOLOv5 and our scheme

Table 1 shows that the detection rates in the outer ring are much lower than the overall average, particularly on the Night video.

Table 1: AP50 in each region for YOLOv5 alone and with bounding box propagation (+BBox Prop)

AP50	Night (YOLOv5)	Night (+BBox Prop)	Day (YOLOv5)	Day (+BBox Prop)	Both (YOLOv5)	Both (+BBox Prop)
Inner circle	0.892	0.886	0.917	0.911	0.864	0.875
Middle ring	0.969	0.962	0.942	0.934	0.965	0.955
Outer ring	0.425	0.655	0.723	0.878	0.695	0.842
Overall	0.541	0.720 (+17.9 pp)	0.820	0.882 (+6.2 pp)	0.772	0.854 (+8.2 pp)
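
Table 1 groups detections into an inner circle, a middle ring, and an outer ring of the fisheye image. The region boundaries are not stated here, so the following is only a minimal sketch under the assumption that the regions are concentric around the image center and are selected by the radial distance of the box center. The function name `region_of` and the radii used in the example are hypothetical.

```python
import math


def region_of(cx, cy, center, inner_radius, middle_radius):
    """Return the region name for a box center (cx, cy).

    Assumes concentric regions around `center`; the two radii are
    hypothetical parameters, not values from the experiments.
    """
    r = math.hypot(cx - center[0], cy - center[1])
    if r <= inner_radius:
        return "inner circle"
    if r <= middle_radius:
        return "middle ring"
    return "outer ring"


# Example with made-up radii on a 640x640 image.
center = (320, 320)
for point in [(320, 320), (470, 320), (600, 320)]:
    print(point, region_of(*point, center, inner_radius=100, middle_radius=200))
```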

2.2. Bounding box size filtering results

The bounding box size filtering process improves AP50 by about 2 percentage points (pp) on the Night video (0.541 to 0.563) and by about 1 pp on the combined Night and Day data (0.772 to 0.781).

Table 2: Bounding box size filtering results

Inputs	Inner Ring Min. Size	Inner Ring Max. Size	Middle Ring Min. Size	Middle Ring Max. Size	Outer Ring Min. Size	Outer Ring Max. Size	# Cars (Ground Truth)	# Cars (Before)	# Cars (After)	AP50 (Before)	AP50 (After)
Night	113	162	92	132	53	76	19,533	21,423	20,352	0.541	0.563
Day	217	311	166	238	94	135	22,973	24,195	23,846	0.820	0.826
Both	211	302	133	191	88	127	42,506	45,115	42,702	0.772	0.781
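
The filtering step can be pictured as a per-region range check. The sketch below uses the "Both" limits from Table 2, but it rests on assumptions: how the box size is measured is not stated here, so the longer box side is used as a stand-in, and detections outside the [min, max] range of their region are simply dropped. The names `box_size` and `filter_by_size` are hypothetical.

```python
# (min size, max size) per region, taken from the "Both" row of Table 2.
SIZE_LIMITS_BOTH = {
    "inner": (211, 302),
    "middle": (133, 191),
    "outer": (88, 127),
}


def box_size(box):
    """Size of an (x1, y1, x2, y2) box; the longer side is an assumption."""
    x1, y1, x2, y2 = box
    return max(x2 - x1, y2 - y1)


def filter_by_size(detections, limits):
    """Keep (region, box) pairs whose size lies within the region's limits."""
    kept = []
    for region, box in detections:
        min_size, max_size = limits[region]
        if min_size <= box_size(box) <= max_size:
            kept.append((region, box))
    return kept


detections = [
    ("inner", (0, 0, 250, 200)),  # size 250: kept
    ("inner", (0, 0, 400, 100)),  # size 400: too large
    ("outer", (0, 0, 50, 40)),    # size 50: too small
    ("outer", (0, 0, 100, 90)),   # size 100: kept
]
print(filter_by_size(detections, SIZE_LIMITS_BOTH))
```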

2.3. Modality variation

Table 3 shows the results for different input modalities in our bounding box propagation scheme, where FD denotes frame difference. If every frame (both keyframes and intermediate frames) is an RGB image processed by YOLO detection, AP50 is about 1 pp higher (0.863 vs. 0.854), but the speed is much lower (30 fps vs. 58 fps).

Table 3. Modality variation results

Keyframe	Intermediate Frame	AP50	Speed (fps)
RGB	RGB	0.863	30
RGB	FD	0.854	58
FD	RGB	0.783	46
FD	FD	0.676	84
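
How the frame difference (FD) input is computed is not specified here. As an illustration only, the sketch below assumes the simplest form: the absolute per-pixel difference between two consecutive grayscale frames, represented as nested lists. The function name `frame_difference` is hypothetical, and the scheme used in the experiments may differ (for example in color handling or normalization).

```python
def frame_difference(prev_frame, curr_frame):
    """Absolute per-pixel difference of two equally sized grayscale frames.

    Frames are lists of rows of pixel intensities. This is an assumed
    definition of "frame difference", not one confirmed by the source.
    """
    same_height = len(prev_frame) == len(curr_frame)
    same_widths = all(len(p) == len(c) for p, c in zip(prev_frame, curr_frame))
    if not (same_height and same_widths):
        raise ValueError("frames must have the same shape")
    return [
        [abs(c - p) for p, c in zip(prev_row, curr_row)]
        for prev_row, curr_row in zip(prev_frame, curr_frame)
    ]


previous = [[0, 10], [20, 30]]
current = [[5, 10], [0, 255]]
print(frame_difference(previous, current))
```

Static background pixels cancel out in such a difference image, so mostly moving objects remain visible.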

2.4. Ablation study

We propose three bounding box propagation algorithms: (a) Keyframe to Intermediate-frame (K → I), (b) Intermediate-frame to Keyframe (I → K), and (c) High Confidence Car (HCC) propagation. Table 4 reports the performance of every combination of these three algorithms. A single propagation algorithm used alone may not be effective (I → K alone and HCC alone even fall below the 0.759 baseline), but using all three together gives the best AP50 (0.854).

Table 4. Evaluation result with various features

K → I	I → K	HCC	AP50	Speed (fps)
			0.759	74
•			0.771	68
	•		0.728	69
		•	0.753	66
•	•		0.778	65
•		•	0.802	62
	•	•	0.757	62
•	•	•	0.854	58
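
The interaction between the three algorithms can be made explicit by computing each configuration's AP50 gain over the baseline row of Table 4. The snippet below only re-tabulates the numbers from Table 4; the names `ABLATION` and `gain_pp` are hypothetical.

```python
# (K->I, I->K, HCC) enabled flags -> (AP50, speed in fps), from Table 4.
ABLATION = {
    (False, False, False): (0.759, 74),
    (True, False, False): (0.771, 68),
    (False, True, False): (0.728, 69),
    (False, False, True): (0.753, 66),
    (True, True, False): (0.778, 65),
    (True, False, True): (0.802, 62),
    (False, True, True): (0.757, 62),
    (True, True, True): (0.854, 58),
}
BASELINE = (False, False, False)


def gain_pp(config):
    """AP50 gain of a configuration over the baseline, in percentage points."""
    return round((ABLATION[config][0] - ABLATION[BASELINE][0]) * 100, 1)


singles = [(True, False, False), (False, True, False), (False, False, True)]
for config in singles:
    print(config, f"{gain_pp(config):+.1f} pp")
print("sum of single gains:", f"{sum(gain_pp(c) for c in singles):+.1f} pp")
print("all three combined: ", f"{gain_pp((True, True, True)):+.1f} pp")
```

The single-algorithm gains are +1.2, -3.1, and -0.6 pp, while the full combination gains +9.5 pp, so the benefit comes from using the algorithms together rather than from any one of them.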

2.5. Car validation check network

We try different models and input sizes for the neural network used in the car validation check.

Table 5. Car validation network results on various models and input sizes

Model	Inputs	Network Accuracy (%)	AP50	Speed (fps)
ResNet18	32x32	96.8	0.848	62
ResNet18	64x64	97.2	0.853	60
ResNet18	128x128	98.9	0.854	58
ResNet50	32x32	99.0	0.856	54
ResNet50	64x64	99.2	0.857	53
ResNet50	128x128	99.5	0.862	52

2.6. Parameters of the High Confidence Car (HCC) propagation

The HCC propagation algorithm has three parameters: the confidence threshold used to select high confidence cars, and two upper limits in the stopping rule, one on the existing car attempts and one on the validation failure attempts. Table 6 shows that a confidence threshold of about 0.8 gives the best result. Tables 7 and 8 show that accuracy improves by nearly 1 pp when the validation failure limit increases from one to two. The existing car checking limit, on the other hand, provides only a small improvement, particularly when its value is higher than 2. Fig. 1 indicates a trade-off between accuracy and speed as the attempt limits increase: accuracy appears to saturate once the limit exceeds 3, while speed keeps dropping. Hence, we choose 3 as the upper limit for both the existing car and validation failure attempts.
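
To make the role of the three parameters concrete, here is a minimal sketch of a propagation loop with such a stopping rule. It is not the authors' implementation. It assumes that a seed box is propagated frame by frame, that both counters are cumulative, and that propagation stops as soon as either counter reaches its limit. The function name and the three callables (`predict_box`, `overlaps_existing`, `is_valid_car`) are hypothetical placeholders for the motion step, the existing car check, and the car validation network.

```python
def propagate_high_confidence_car(seed_box, seed_conf, frames, predict_box,
                                  overlaps_existing, is_valid_car,
                                  threshold=0.8, existing_limit=3, fail_limit=3):
    """Propagate one high confidence box through `frames` (assumed logic).

    Returns the (frame, box) pairs added as new boxes.
    """
    if seed_conf < threshold:
        return []  # only high confidence cars are propagated
    new_boxes = []
    existing_count = 0
    fail_count = 0
    box = seed_box
    for frame in frames:
        box = predict_box(box, frame)
        if overlaps_existing(box, frame):
            # The car is already detected in this frame: nothing to add.
            existing_count += 1
            if existing_count >= existing_limit:
                break
            continue
        if not is_valid_car(box, frame):
            # The validation network rejected the propagated box.
            fail_count += 1
            if fail_count >= fail_limit:
                break
            continue
        new_boxes.append((frame, box))
    return new_boxes
```

With stand-in callables, for example a validator that accepts frames 0 to 3 and rejects all later ones, the loop adds four boxes and then stops after the third rejection.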

Table 6. Threshold variation on high confidence car propagation

Threshold	AP50
0.5	0.684
0.6	0.795
0.7	0.837
0.8	0.854
0.9	0.853

Table 7. Existing and failure attempts limit - AP50

Existing \ Fail	1	2	3	4	5
1	0.831	0.839	0.845	0.849	0.852
2	0.839	0.847	0.851	0.854	0.856
3	0.845	0.851	0.854	0.856	0.858
4	0.849	0.854	0.856	0.858	0.859
5	0.852	0.856	0.858	0.859	0.859

Table 8. Existing and failure attempts limit - Speed (fps)

Existing \ Fail	1	2	3	4	5
1	64	63	62	61	59
2	63	62	61	59	57
3	62	61	59	57	55
4	61	59	57	54	53
5	59	57	55	53	51

Fig. 1. High confidence car performance and speed trade-off
(Same number of existing and fail attempts)
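
The saturation can also be read off numerically from the diagonal entries of Tables 7 and 8, which correspond to equal existing and fail limits as in Fig. 1. The snippet below computes the change in AP50 and speed for each step of the limit; the names `DIAGONAL` and `marginal_changes` are hypothetical.

```python
# Diagonal of Tables 7 and 8: limit -> (AP50, speed in fps).
DIAGONAL = {
    1: (0.831, 64),
    2: (0.847, 62),
    3: (0.854, 59),
    4: (0.858, 54),
    5: (0.859, 51),
}


def marginal_changes(points):
    """Return (limit, AP50 change in pp, fps change) for each limit increase."""
    limits = sorted(points)
    changes = []
    for prev, curr in zip(limits, limits[1:]):
        ap_gain_pp = round((points[curr][0] - points[prev][0]) * 100, 1)
        fps_change = points[curr][1] - points[prev][1]
        changes.append((curr, ap_gain_pp, fps_change))
    return changes


for limit, ap_gain_pp, fps_change in marginal_changes(DIAGONAL):
    print(f"limit {limit}: {ap_gain_pp:+.1f} pp AP50, {fps_change:+d} fps")
```

Raising the limit from 3 to 4 adds only 0.4 pp AP50 while costing 5 fps, which is consistent with choosing 3 for both limits.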

3. Visualization results

Box color scheme:

Green box: ground-truth, blue box: YOLOv5 detection, yellow box: new box produced by propagation

(Top) YOLOv5 detection and ground-truth

(Middle) YOLOv5 detection, ground-truth, and new bboxes produced by the propagation process

(Bottom) The entire fisheye image: the red window indicates the event of interest.

Fig. 2. Cars occluded by a tree are often undetected by YOLO (red window). They can be recovered by a local propagation pass.

Fig. 3. Cars waiting close together at a traffic light (in the red window) are not detected by YOLO. The missing cars are recovered by a local propagation pass.

Fig. 4. Cars are not detected by YOLO (in the red window) due to fisheye distortion. They are recovered by the High Confidence Car propagation.

Fig. 5. Cars are not detected by YOLO due to their bright headlights (in the red window). They are recovered by the backward pass of the High Confidence Car propagation.