A team of Microsoft and Huazhong University researchers this week open-sourced an AI object tracker — Fair Multi-Object Tracking (FairMOT) — that they claim outperforms state-of-the-art models on public data sets at 30 frames per second. If productized, it could benefit industries ranging from elder care to security, and perhaps be used to track the spread of illnesses like COVID-19.
As the team explains, most existing methods employ multiple models to track objects: (1) a detection model that localizes objects of interest and (2) an association model that extracts features used to reidentify briefly obscured objects. By contrast, FairMOT adopts an anchor-free approach that estimates object centers on a high-resolution feature map, which allows the reidentification features to better align with those centers. A parallel branch estimates the features used to predict the objects’ identities, while a “backbone” module fuses multi-scale features to handle objects of different sizes.
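The center-alignment idea can be illustrated with a minimal sketch: find local-maximum peaks on a center heatmap (anchor-free detection), then read the identity embedding at the same pixel, so detection and re-identification stay aligned. This is a hypothetical helper for illustration, not FairMOT's actual code, and the function name, shapes, and threshold are assumptions.

```python
import numpy as np

def extract_center_embeddings(center_heatmap, id_features, score_thresh=0.5):
    """Hypothetical sketch of anchor-free center detection.

    center_heatmap: (H, W) array of per-pixel object-center scores.
    id_features:    (C, H, W) array of re-identification embeddings.

    Each 3x3 local maximum above the threshold is treated as one object
    center (no anchor boxes), and its identity embedding is read at the
    SAME pixel, keeping the two tasks aligned.
    """
    H, W = center_heatmap.shape
    detections = []
    for y in range(H):
        for x in range(W):
            score = center_heatmap[y, x]
            if score < score_thresh:
                continue
            # A peak is a pixel that dominates its 3x3 neighborhood.
            patch = center_heatmap[max(0, y - 1):y + 2, max(0, x - 1):x + 2]
            if score >= patch.max():
                detections.append(((y, x), score, id_features[:, y, x]))
    return detections
```

By contrast, an anchor-based detector may fire several overlapping anchors for one person, each pulling the identity embedding toward a slightly different location — the ambiguity the authors cite as the main cause of degraded results.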
The researchers tested FairMOT on a training data set compiled from six public corpora for human detection and search: ETH, CityPerson, CalTech, MOT17, CUHK-SYSU, and PRW. (Training took 30 hours on two NVIDIA RTX 2080 graphics cards.) After removing duplicate clips, they tested the trained model against benchmarks including 2DMOT15, MOT16, and MOT17. All came from the MOT Challenge, a framework for validating people-tracking algorithms that ships with data sets, an evaluation tool providing several metrics, and tests for tasks like surveillance and sports analysis.
Compared with the only two published works that jointly perform object detection and identity feature embedding — TrackRCNN and JDE — the team reports that FairMOT outperformed both on the MOT16 data set with an inference speed “near video rate.”
“There has been remarkable progress on object detection and re-identification in recent years, which are the core components for multi-object tracking. However, little attention has been focused on accomplishing the two tasks in a single network to improve the inference speed. The initial attempts along this path ended up with degraded results mainly because the re-identification branch is not appropriately learned,” concluded the researchers in a paper describing FairMOT. “We find that the use of anchors in object detection and identity embedding is the main reason for the degraded results. In particular, multiple nearby anchors, which correspond to different parts of an object, may be responsible for estimating the same identity which causes ambiguities for network training.”
In addition to FairMOT’s source code, the research team made available several pretrained models that can be run on live or recorded video.