Multi-object tracking (MOT) in videos involves several steps: detecting large numbers of objects, identifying inter-object occlusions, devising data association methods, and finding fast, efficient algorithms to determine object trajectories. This dissertation deals with the detection and tracking of multiple objects (e.g., pedestrians) using a tracking-by-detection approach. For detection, we apply transfer learning to a single shot detector (SSD), a deep-network-based object detector that detects objects in a single forward pass of the network. The detector feeds a data association system, which associates detections frame by frame using appearance-based and location-based features. We model the problem of finding optimal trajectories as a min-cost flow problem. Mutual object occlusion is added explicitly to the constraint set, alongside the usual flow constraints. We then derive a Lagrangian relaxation of the problem, which is considerably more general than the relaxation formulations available in the literature. Based on this relaxation, we develop an algorithm that finds the optimal trajectories, or shortest paths, in pseudo-polynomial time. Our tracker performs close to the state of the art on many standard MOT evaluation metrics.
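To make the min-cost flow view of data association concrete, the following is a minimal sketch of the generic textbook formulation, not the algorithm developed in this dissertation: it omits the occlusion constraints and the Lagrangian relaxation, and it uses location only (no appearance features). Each detection becomes a unit-capacity edge with a negative cost, links between consecutive frames cost a normalised distance, and tracks are found by successive shortest paths. The function name `track_min_cost_flow` and all cost values (`det_cost`, `entry_cost`, `exit_cost`, `max_dist`) are assumptions chosen for illustration.

```python
from math import hypot, inf


def track_min_cost_flow(frames, det_cost=-1.0, entry_cost=0.4,
                        exit_cost=0.4, max_dist=50.0):
    """Link detections into tracks with a successive-shortest-path solver.

    frames: list of per-frame lists of (x, y) detection centres.
    Returns a list of tracks, each a list of (frame, index) pairs.
    All cost values are illustrative assumptions.
    """
    SRC, SNK = 0, 1
    ids = {}  # (frame, index) -> in-node id; the out-node is in-node + 1
    for t, dets in enumerate(frames):
        for i in range(len(dets)):
            ids[(t, i)] = 2 + 2 * len(ids)
    n = 2 + 2 * len(ids)
    adj = [[] for _ in range(n)]
    edges = []  # [head, residual capacity, cost]; edge k ^ 1 reverses edge k

    def add_edge(u, v, cost):
        adj[u].append(len(edges))
        edges.append([v, 1, cost])
        adj[v].append(len(edges))
        edges.append([u, 0, -cost])

    entry = {}  # detection -> index of its source-to-detection edge
    for (t, i), u_in in ids.items():
        entry[(t, i)] = len(edges)
        add_edge(SRC, u_in, entry_cost)      # a track may start here
        add_edge(u_in, u_in + 1, det_cost)   # reward for using the detection
        add_edge(u_in + 1, SNK, exit_cost)   # a track may end here
        if t + 1 < len(frames):
            x, y = frames[t][i]
            for j, (x2, y2) in enumerate(frames[t + 1]):
                d = hypot(x2 - x, y2 - y)
                if d <= max_dist:            # gate implausible links
                    add_edge(u_in + 1, ids[(t + 1, j)], d / max_dist)

    while True:
        # Bellman-Ford on the residual graph (costs can be negative).
        dist, prev = [inf] * n, [-1] * n
        dist[SRC] = 0.0
        for _ in range(max(n - 1, 1)):
            changed = False
            for u in range(n):
                if dist[u] == inf:
                    continue
                for k in adj[u]:
                    v, cap, cost = edges[k]
                    if cap > 0 and dist[u] + cost < dist[v] - 1e-12:
                        dist[v], prev[v], changed = dist[u] + cost, k, True
            if not changed:
                break
        if dist[SNK] >= 0:  # another track would not lower the total cost
            break
        v = SNK
        while v != SRC:     # push one unit of flow along the path
            k = prev[v]
            edges[k][1] -= 1
            edges[k ^ 1][1] += 1
            v = edges[k ^ 1][0]

    # Read the tracks back by following saturated forward (even) edges.
    det_of_in = {u_in: key for key, u_in in ids.items()}
    tracks = []
    for key, u_in in ids.items():
        if edges[entry[key]][1] != 0:
            continue        # no track starts at this detection
        track, node = [], u_in
        while node != SNK:
            track.append(det_of_in[node])
            node = next(edges[k][0] for k in adj[node + 1]
                        if k % 2 == 0 and edges[k][1] == 0)
        tracks.append(track)
    return tracks
```

For example, calling the function on three frames that each contain two well-separated detections moving slowly to the right should yield two tracks, one per object. Because the solver stops as soon as the cheapest remaining path has non-negative cost, the number of tracks is determined by the costs rather than fixed in advance.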
In this thesis, we have outlined a tracking methodology that has the potential to track objects even while they are occluded. Toward this end, we have proposed an end-to-end tracking system comprising two important computer vision components. The first is a robust detector that can handle the variations in appearance caused by pedestrians' movement. The second is a tracker that can solve the data association problem even when people are occluded for a reasonable amount of time. We have used a state-of-the-art single-shot detector based on a convolutional neural network to produce high-quality detections in the video. In the tracking stage, we studied a recently proposed batch tracking method, based on min-cost flow for graph optimization, and proposed a novel Lagrangian relaxation to improve it further. We have built a fairly large dataset (Figure 1) consisting of multiple videos that are widely used by the computer vision community for benchmarking. We carried out an extensive evaluation to investigate how well the proposed tracker performs in comparison with the existing state of the art. Lastly, we have outlined future directions for investigation that this thesis may inspire.