I am a second-year master's student in Automotive Technology at TU Eindhoven, specializing in signal processing systems. I'm particularly passionate about studying and applying machine learning techniques to computer vision applications.
“I visualize a time when we will be to robots what dogs are to humans, and I’m rooting for the machines.” – Claude Shannon, father of information theory
For any respectable self-driving vehicle, accurately assessing its immediate surroundings is critical for making confident autonomous-driving decisions. My current assignment with ATeam involves the visual-perception part of the 360° surround-view application (my tasks sit at the beginning of the sense-think-act pipeline). The application has been devised around three monocular cameras, one stereo camera, and a complement of RADAR and LEDDAR sensors.
For now, a neural-network-based object detector is set up to detect relevant objects in the visual field (cars, trucks, pedestrians, traffic signs, etc.) and localize them on the image plane (bounding boxes enclosing the objects in the image). Once objects are found on the image plane, to make this data usable for vehicle-control modules, we project the detections into 3D world coordinates (in other words, we estimate the relative distance between our car and the detected objects).
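To make the projection step concrete, here is a minimal sketch of the back-projection idea, assuming a simple pinhole camera model; the intrinsics and the depth value below are illustrative placeholders, not our actual calibration:

```python
import numpy as np

def backproject_to_3d(u, v, depth, fx, fy, cx, cy):
    """Back-project a pixel (u, v) with an estimated depth (metres)
    into camera-frame 3D coordinates, assuming a pinhole camera model."""
    x = (u - cx) * depth / fx
    y = (v - cy) * depth / fy
    return np.array([x, y, depth])

# e.g. the centre of a detected bounding box, estimated at 12 m ahead,
# with made-up intrinsics for a 1280x720 camera
point = backproject_to_3d(u=640, v=360, depth=12.0,
                          fx=700.0, fy=700.0, cx=640.0, cy=360.0)
```

The depth term here is exactly what the RADAR/LEDDAR/stereo measurements supply; given depth, recovering the lateral offsets is just a rescaling by the camera intrinsics.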
After this processing, we know what objects surround us and their positions relative to our ego-vehicle. By strategically fusing data from all three sensor types, we proceed to build a 2D ego world-model grid (shown below). A known slice of real-world space around the vehicle is mapped out as a layered grid of voxel cells, and the occupancy state of each cell is updated based on the fused data from the cameras, RADARs and LEDDARs.
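As an illustration of the grid bookkeeping, the sketch below maps an ego-relative position to a cell index; the 40 m extent and 0.5 m resolution are assumptions for the example, not our actual grid parameters:

```python
# Assumed ego-centric grid: 40 m x 40 m around the vehicle at 0.5 m resolution
GRID_SIZE_M = 40.0
CELL_SIZE_M = 0.5
N_CELLS = int(GRID_SIZE_M / CELL_SIZE_M)  # 80 x 80 cells, ego at the centre

def world_to_cell(x_rel, y_rel):
    """Map an ego-relative position (metres) to (row, col) grid indices,
    or None if the position falls outside the mapped slice of space."""
    col = int((x_rel + GRID_SIZE_M / 2) / CELL_SIZE_M)
    row = int((y_rel + GRID_SIZE_M / 2) / CELL_SIZE_M)
    if 0 <= row < N_CELLS and 0 <= col < N_CELLS:
        return row, col
    return None
```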
In addition to providing a good visualization of what the vehicle is actually perceiving, such a local world-model serves as a reliable input for obstacle avoidance, path planning and other vehicle-control applications.
A technical explanation for the engineers
The perception-system workflow for the 360° surround-view application is visualized below. Surrounding objects are detected and localized by a combination of onboard sensors, and the resulting data is fused, under supervision, along with vehicle odometry data to continuously maintain a reliable 2D world view.
Image processing is the most reliable source for object classification. Currently, a Single Shot MultiBox Detector (SSD) with MobileNet v2 as the feature extractor is used for real-time object detection. This architecture was chosen for its swift inference while maintaining reasonable mAP (mean average precision). The network is pre-trained on COCO images and fine-tuned on KITTI images for traffic-relevant object detection. Camera images are streamed to this network, which runs inference on each incoming frame.
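For a flavour of how such a detector is driven, here is a hedged sketch assuming a model exported with the TensorFlow Object Detection API; the model path and the 0.5 score threshold are placeholders, not our deployed configuration:

```python
import numpy as np
import tensorflow as tf

# Load an exported SSD-MobileNetV2 detection model (path is a placeholder)
detect_fn = tf.saved_model.load("ssd_mobilenet_v2/saved_model")

def detect(frame: np.ndarray, threshold: float = 0.5):
    """Run one camera frame (HxWx3, uint8) through the detector
    and keep only the confident detections."""
    input_tensor = tf.convert_to_tensor(frame)[tf.newaxis, ...]
    out = detect_fn(input_tensor)
    boxes = out["detection_boxes"][0].numpy()    # normalized [ymin, xmin, ymax, xmax]
    scores = out["detection_scores"][0].numpy()
    classes = out["detection_classes"][0].numpy().astype(int)
    keep = scores >= threshold
    return boxes[keep], classes[keep], scores[keep]
```

The returned boxes and class labels are exactly the image-plane detections that the projection step described earlier lifts into 3D world coordinates.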
For depth estimation of objects, although RADARs and LEDDARs are used primarily, the disparity-based depth map from the stereo camera is also a good source for estimating distance to objects. Additionally, depth can be estimated from monocular images using triangulation techniques (and, although computationally expensive, deep CNNs can be employed for monocular depth prediction as well if needed).
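The stereo route boils down to the classic relation Z = f·B/d (depth from focal length, baseline and disparity). A minimal sketch with OpenCV's semi-global matcher, where the focal length and baseline are placeholder values rather than our rig's calibration:

```python
import cv2
import numpy as np

# Assumed stereo rig parameters: focal length in pixels, baseline in metres
FOCAL_PX = 700.0
BASELINE_M = 0.12

left = cv2.imread("left.png", cv2.IMREAD_GRAYSCALE)
right = cv2.imread("right.png", cv2.IMREAD_GRAYSCALE)

# Semi-global block matching; numDisparities must be divisible by 16
matcher = cv2.StereoSGBM_create(minDisparity=0, numDisparities=96, blockSize=7)
disparity = matcher.compute(left, right).astype(np.float32) / 16.0  # fixed-point -> pixels

# Depth (metres) per pixel: Z = f * B / d, valid only where disparity > 0
valid = disparity > 0
depth = np.zeros_like(disparity)
depth[valid] = FOCAL_PX * BASELINE_M / disparity[valid]
```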
The primary function of the supervisor block is to monitor sensor weights in the data-fusion process. Considering accuracy and noise characteristics, the weights of the individual sensors are designated before runtime. In special cases, however, extra attention needs to be assigned to particular sensors: for example, if the vehicle is in the process of making a left lane-change maneuver, the left-side sensors are sampled more frequently than the right-side sensors.
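A toy version of that supervisor logic might look as follows; the sensor names, weights and sampling rates are purely illustrative assumptions, not the team's actual configuration:

```python
# Static per-sensor fusion weights, fixed before runtime (illustrative values)
BASE_WEIGHTS = {"camera": 0.5, "radar": 0.3, "leddar": 0.2}
BASE_RATE_HZ = {"left": 10.0, "right": 10.0}

def supervise(maneuver: str) -> dict:
    """Return per-side sampling rates given the current maneuver:
    the side of an ongoing lane change is sampled more frequently."""
    rates = dict(BASE_RATE_HZ)
    if maneuver == "lane_change_left":
        rates["left"] *= 2.0
    elif maneuver == "lane_change_right":
        rates["right"] *= 2.0
    return rates
```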
The data-fusion block itself is kept simple for now, without any explicit filtering operations. Based on the perceived world, a multi-layer grid is constructed (one layer per sensor type) which holds cell-level information about the object class occupying each grid position (simultaneously capturing object class and relative position). The overall objectness confidence of each cell is then computed as a weighted average over the layers.
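In code, that weighted average is essentially a one-liner over the per-sensor layers; the grid size and weights below are the same illustrative assumptions used earlier:

```python
import numpy as np

# Per-sensor confidence layers over the same 80x80 ego grid (values in [0, 1])
layers = {
    "camera": np.zeros((80, 80)),
    "radar": np.zeros((80, 80)),
    "leddar": np.zeros((80, 80)),
}
weights = {"camera": 0.5, "radar": 0.3, "leddar": 0.2}

# Overall objectness confidence per cell: weighted average of the sensor layers
objectness = sum(weights[s] * layers[s] for s in layers) / sum(weights.values())
```

Keeping the fusion this simple means each layer can be updated independently as its sensor reports arrive, with the combined view recomputed cheaply on demand.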