Perception for world modelling

About myself

I am a second-year Automotive Technology master's student at TU Eindhoven, specializing in signal processing systems. I'm particularly passionate about studying and applying machine learning techniques to computer vision applications. Currently I'm working towards my graduation project at TomTom on traffic-sign landmarking for localization. I joined ATeam at the beginning of 2018, initially working on control-related assignments, and gradually shifted to signal processing projects. When I haven't drenched myself in work, I can usually be found hanging out with friends, travelling to new places or bingeing on Netflix.

“I visualize a time when we will be to robots what dogs are to humans, and I’m rooting for the machines.”

Claude Shannon, father of information theory

My project

For any respectable self-driving vehicle, assessing its immediate surroundings is critical for making confident autonomous-driving decisions. My current assignment with ATeam involves the visual perception part of the 360° surround-view application (my tasks sit at the beginning of the sense-think-act pipeline). The application is designed around three monocular cameras, one stereo camera, three short-range RADARs, one long-range RADAR and two LEDDARs. My work focuses mainly on image processing of the camera feeds. Below is a basic visualization of the camera placements relative to the car and the kind of perspectives we deal with.


Camera placements and image perspectives within the ATeam 360° surround-view application

For now, a neural-network-based object detector is set up to detect relevant objects in the visual field (cars, trucks, pedestrians, traffic signs, etc.) and localize them on the image plane (bounding boxes enclosing the objects in the image). Once objects are found on the image plane, we project the detections into 3D world coordinates so that the data can be used by the vehicle-control modules (in other words, we estimate the relative distance between our car and the detected objects).
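To make the projection step concrete, here is a minimal sketch of one common way to lift a bounding box onto the road surface, assuming a calibrated pinhole camera and a flat ground plane. The intrinsics, camera height and function name are illustrative assumptions, not the actual ATeam calibration or implementation.

```python
import numpy as np

# Illustrative camera parameters (assumed, not the real calibration)
FX, FY = 800.0, 800.0      # focal lengths in pixels
CX, CY = 640.0, 360.0      # principal point in pixels
CAM_HEIGHT = 1.4           # camera height above the road in metres

def bbox_to_ground_position(bbox):
    """Project the bottom-centre of a bounding box onto the road plane.

    bbox = (xmin, ymin, xmax, ymax) in pixel coordinates.
    Returns (forward, lateral) distance in metres in the camera frame.
    """
    xmin, ymin, xmax, ymax = bbox
    u = 0.5 * (xmin + xmax)   # horizontal centre of the box
    v = ymax                  # bottom edge, assumed to touch the road
    # Ray through the pixel in normalised camera coordinates
    x_n = (u - CX) / FX
    y_n = (v - CY) / FY
    if y_n <= 0:
        raise ValueError("Bounding box bottom lies above the horizon")
    # Intersect the ray with the road plane CAM_HEIGHT below the camera
    forward = CAM_HEIGHT / y_n    # distance along the optical axis
    lateral = x_n * forward       # sideways offset
    return forward, lateral

# A box whose bottom edge sits 60 px below the principal point
print(bbox_to_ground_position((600, 300, 700, 420)))   # ~ (18.7 m, 0.2 m)
```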

After this processing, we know which objects surround us and their positions relative to our own ego-vehicle. By strategically fusing data from all three sensor types, we then build a 2D ego world-model grid (shown below). A known slice of real-world space around the vehicle is mapped out as a layered grid of cells, and the occupancy state of each cell is updated based on the fused data from cameras, RADARs and LEDDARs.


Rudimentary visualization of the 2D ego world-model grid, built from the fused output of all onboard sensors
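As a rough illustration of the grid idea, the sketch below maps ego-frame positions to cells of a layered 2D grid, one layer per sensor type. The grid size, resolution, layer names and helper functions are assumptions for the example, not the team's actual configuration.

```python
import numpy as np

GRID_SIZE_M = 40.0                      # assumed 40 m x 40 m patch around the car
CELL_SIZE_M = 0.5                       # assumed 0.5 m resolution
N_CELLS = int(GRID_SIZE_M / CELL_SIZE_M)
LAYERS = ("camera", "radar", "leddar")  # one occupancy layer per sensor type

grid = {layer: np.zeros((N_CELLS, N_CELLS)) for layer in LAYERS}

def world_to_cell(forward, lateral):
    """Map an ego-frame position (metres) to grid indices, ego at the centre."""
    row = int((forward + GRID_SIZE_M / 2) / CELL_SIZE_M)
    col = int((lateral + GRID_SIZE_M / 2) / CELL_SIZE_M)
    return row, col

def update_cell(layer, forward, lateral, confidence):
    """Raise the occupancy of the cell containing a detection on one layer."""
    row, col = world_to_cell(forward, lateral)
    if 0 <= row < N_CELLS and 0 <= col < N_CELLS:
        grid[layer][row, col] = max(grid[layer][row, col], confidence)

# e.g. a camera detection 18.7 m ahead and 0.2 m to the left of the ego-vehicle
update_cell("camera", 18.7, -0.2, confidence=0.85)
```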

In addition to providing a good visualization of what the vehicle is actually perceiving, this kind of local world model is a reliable input for obstacle avoidance, path planning and other vehicle-control applications.

A technical explanation for the engineers

The perception-system workflow for the 360° surround-view application is visualized below. Surrounding objects are detected and localized by a combination of onboard sensors, and the resulting data is fused under supervision, together with vehicle odometry, to continuously maintain a reliable 2D world view.

Overall architecture of the perception framework employed

Image processing is the most reliable source for object classification. Currently, a Single Shot MultiBox Detector (SSD) with MobileNet v2 as the feature extractor is used for real-time object detection. This network architecture was chosen for its fast inference while maintaining reasonable mAP. The network is pre-trained on COCO images and fine-tuned on KITTI images for detecting traffic-relevant objects. Camera images are passed to the TensorFlow model as ROS messages, and the resulting detection bounding boxes and classes are published back as ROS messages.
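A minimal sketch of what such a detection node could look like is given below, assuming a TF1-style frozen graph exported with the TensorFlow Object Detection API. The topic names, model path, score threshold and JSON output format are assumptions for illustration, not the actual ATeam interfaces.

```python
import json

import numpy as np
import rospy
import tensorflow as tf
from cv_bridge import CvBridge
from sensor_msgs.msg import Image
from std_msgs.msg import String

MODEL_PATH = "ssd_mobilenet_v2_kitti_frozen.pb"  # assumed file name
SCORE_THRESHOLD = 0.5                            # assumed confidence cut-off

class DetectorNode(object):
    def __init__(self):
        # Load the frozen SSD-MobileNetV2 graph once at start-up
        graph_def = tf.GraphDef()
        with tf.gfile.GFile(MODEL_PATH, "rb") as f:
            graph_def.ParseFromString(f.read())
        self.graph = tf.Graph()
        with self.graph.as_default():
            tf.import_graph_def(graph_def, name="")
        self.session = tf.Session(graph=self.graph)

        self.bridge = CvBridge()
        self.pub = rospy.Publisher("/perception/detections", String, queue_size=1)
        rospy.Subscriber("/camera/front/image_raw", Image, self.on_image, queue_size=1)

    def on_image(self, msg):
        # Convert the ROS image to a numpy array and run one inference pass
        frame = self.bridge.imgmsg_to_cv2(msg, desired_encoding="bgr8")
        boxes, scores, classes = self.session.run(
            [self.graph.get_tensor_by_name("detection_boxes:0"),
             self.graph.get_tensor_by_name("detection_scores:0"),
             self.graph.get_tensor_by_name("detection_classes:0")],
            feed_dict={self.graph.get_tensor_by_name("image_tensor:0"):
                       np.expand_dims(frame, axis=0)})
        # Keep only confident detections and publish them as a JSON string
        detections = [
            {"box": boxes[0][i].tolist(), "class": int(classes[0][i]),
             "score": float(scores[0][i])}
            for i in range(len(scores[0])) if scores[0][i] >= SCORE_THRESHOLD
        ]
        self.pub.publish(String(data=json.dumps(detections)))

if __name__ == "__main__":
    rospy.init_node("ssd_detector")
    DetectorNode()
    rospy.spin()
```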

For depth estimation of objects, RADARs and LEDDARs are used primarily, but the disparity-based depth map from the stereo camera is also a good source for estimating the distance to an object. Additionally, depth can be estimated from monocular images using triangulation techniques (and, although computationally expensive, deep CNNs can be employed for monocular depth prediction as well if needed).
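For the stereo path, the depth of a matched pixel follows directly from its disparity via Z = f·B/d for a rectified pair. A minimal sketch, with an assumed focal length and baseline rather than the actual rig parameters:

```python
FOCAL_PX = 800.0       # focal length in pixels (assumed)
BASELINE_M = 0.12      # distance between the stereo lenses in metres (assumed)

def disparity_to_depth(disparity_px):
    """Depth Z = f * B / d for a rectified stereo pair, in metres."""
    if disparity_px <= 0:
        return float("inf")   # zero disparity corresponds to a point at infinity
    return FOCAL_PX * BASELINE_M / disparity_px

# A feature that shifts 6 px between the left and right image:
print(disparity_to_depth(6.0))   # -> 16.0 m
```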

The primary function of the supervisor block is to monitor the sensor weights used in the data fusion process. Considering accuracy and noise, the weights of the individual sensors are set before runtime. In special cases, however, extra attention needs to be given to particular sensors. For example, if the vehicle is making a left lane-change maneuver, the left-side sensors are sampled more frequently than the right-side sensors.
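A minimal sketch of this supervision idea follows: fixed per-sensor weights defined before runtime, plus a per-side attention factor that depends on the current maneuver. The weight values, maneuver names and scaling factor are illustrative assumptions.

```python
# Fixed fusion weights, chosen before runtime (values are assumptions)
BASE_WEIGHTS = {"camera": 0.5, "radar": 0.3, "leddar": 0.2}

def supervise(maneuver):
    """Return per-side attention factors for the current maneuver."""
    attention = {"left": 1.0, "right": 1.0}
    if maneuver == "lane_change_left":
        attention["left"] = 2.0     # sample left-side sensors more frequently
    elif maneuver == "lane_change_right":
        attention["right"] = 2.0
    return attention

print(supervise("lane_change_left"))   # {'left': 2.0, 'right': 1.0}
```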

The data fusion block itself is kept simple for now, without any explicit filtering operations. Based on the perceived world, a multi-layer grid is constructed (one layer per sensor type) which holds cell-level information about the object class occupying each grid position (simultaneously capturing object class and relative position). The overall objectness confidence of each cell is then computed as a weighted average across the layers.
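A minimal sketch of that final fusion step is shown below: the per-sensor occupancy layers are combined into a single objectness map as a weighted average. The layer names and weights are illustrative assumptions, not the tuned values used on the vehicle.

```python
import numpy as np

weights = {"camera": 0.5, "radar": 0.3, "leddar": 0.2}   # assumed weights

def fuse_layers(grid):
    """grid: dict mapping layer name -> 2D occupancy array with values in [0, 1]."""
    fused = np.zeros_like(next(iter(grid.values())))
    for layer, occupancy in grid.items():
        fused += weights[layer] * occupancy
    return fused / sum(weights.values())

# Example: three 4x4 layers, with one cell seen by both camera and radar
layers = {name: np.zeros((4, 4)) for name in weights}
layers["camera"][2, 1] = 0.9
layers["radar"][2, 1] = 0.6
print(fuse_layers(layers)[2, 1])   # 0.5*0.9 + 0.3*0.6 = 0.63
```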