A perception pipeline converts raw sensor data into actionable, high-level scene understanding: detecting objects, estimating their positions, classifying their types, and tracking them over time. Autonomous vehicles and robots use multiple sensor modalities (cameras, lidar, radar) because each has complementary strengths and failure modes. A camera excels at semantic classification (what is that object?) and works well in daylight but struggles at night and in fog. Lidar provides accurate 3D structure and range but degrades in rain and fog. Radar penetrates adverse weather and measures radial velocity directly but has poor angular resolution. The pipeline must fuse these diverse signals while handling sensor noise, missing data, and partial occlusions. Each detection must be accompanied by a confidence estimate: a detection of a car with 95% confidence is treated differently from one with 60% confidence. The pipeline runs under real-time constraints (typically 10-50 Hz) on embedded hardware, which requires careful optimization of both algorithm and implementation.
A perception pipeline must solve several related problems. First, detection: identify what objects are present in the sensor data and estimate their approximate positions. Second, classification: determine the type of each object (car, pedestrian, cyclist, traffic sign). Third, localization: precisely estimate 3D position and orientation. Fourth, tracking: maintain object identities across frames and estimate velocity and acceleration. Each stage builds on the previous one, but later stages can also feed back to correct earlier estimates.
Camera-based detection uses deep convolutional neural networks trained on large labeled datasets. A network like YOLO (You Only Look Once) or Faster R-CNN takes an image and outputs bounding boxes with class labels and confidence scores. Camera detection excels at semantic classification, because the network can recognize very subtle appearance cues, but it struggles with ambiguous cases (is that a motorcycle or a small car?) and at night. Modern detectors are trained on diverse lighting and weather conditions, with data augmentation (synthetic shadows, rain streaks, glare) to improve generalization. A single camera also provides limited depth information: depth must be inferred from appearance cues (closer objects appear larger, occlusion relationships, focus), and this inference is unreliable for distant or small objects.
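The "closer objects appear larger" cue can be made concrete with the pinhole camera model, where range is Z = f * H / h for focal length f (in pixels), real object height H, and bounding-box height h (in pixels). The sketch below is illustrative only: it assumes the object's real height is known, which a real system would have to guess from the class label, and the function names are hypothetical. It shows why the cue becomes unreliable at long range, since a one-pixel error in the box height shifts the estimate far more for a small box than for a large one.

```python
def depth_from_pixel_height(focal_length_px, object_height_m, pixel_height):
    """Pinhole-model range estimate: Z = f * H / h.

    Assumes the real object height H is known (an assumption made for
    this illustration, not something a camera measures directly).
    """
    if pixel_height <= 0:
        raise ValueError("pixel_height must be positive")
    return focal_length_px * object_height_m / pixel_height


def depth_shift_for_one_pixel_error(focal_length_px, object_height_m, pixel_height):
    """How far the range estimate moves if the box is measured 1 px too short."""
    if pixel_height <= 1:
        raise ValueError("pixel_height must be greater than 1")
    measured = depth_from_pixel_height(focal_length_px, object_height_m, pixel_height)
    perturbed = depth_from_pixel_height(focal_length_px, object_height_m, pixel_height - 1)
    return perturbed - measured


if __name__ == "__main__":
    f_px, height_m = 1000.0, 1.5  # hypothetical camera and object
    for h_px in (100, 10):
        z = depth_from_pixel_height(f_px, height_m, h_px)
        dz = depth_shift_for_one_pixel_error(f_px, height_m, h_px)
        print(f"box height {h_px:>3} px: range {z:6.1f} m, 1 px error shifts it by {dz:5.2f} m")
```

With these hypothetical numbers, a 100-pixel box corresponds to 15 m and a one-pixel error moves the estimate by about 0.15 m, while a 10-pixel box corresponds to 150 m and the same error moves it by more than 16 m.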
Lidar-based detection processes 3D point clouds. A lidar sweeps a laser around the environment, producing a point cloud of reflections. Detection can be done by voxelizing the point cloud (dividing 3D space into regular grid cells), treating the voxel grid as a 3D image, and running a 3D CNN, or by processing points directly with networks like PointNet that operate on unordered point sets. Lidar provides precise depth and 3D structure but is degraded by rain and fog. Lidar point clouds can also be quite sparse, especially for distant objects, so detectors must handle sparsity and occlusion carefully.
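The voxelization step described above can be sketched in a few lines. This is a minimal illustration under stated assumptions rather than a production implementation: points are plain (x, y, z) tuples in meters, the grid is axis-aligned with its origin at (0, 0, 0), and the names `voxelize` and `voxel_occupancy` are hypothetical. Only occupied cells are stored, which suits sparse lidar data.

```python
import math
from collections import defaultdict


def voxelize(points, voxel_size):
    """Group 3D points by the grid cell (voxel) that contains them.

    Returns a dict mapping integer voxel indices (i, j, k) to the list of
    points inside that voxel. Empty voxels are never stored.
    """
    if voxel_size <= 0:
        raise ValueError("voxel_size must be positive")
    voxels = defaultdict(list)
    for x, y, z in points:
        # floor() rather than int() so that negative coordinates land in
        # the correct cell instead of being rounded toward zero.
        key = (
            math.floor(x / voxel_size),
            math.floor(y / voxel_size),
            math.floor(z / voxel_size),
        )
        voxels[key].append((x, y, z))
    return dict(voxels)


def voxel_occupancy(voxels):
    """Per-voxel point count, one simple feature a 3D CNN could consume."""
    return {key: len(pts) for key, pts in voxels.items()}


if __name__ == "__main__":
    cloud = [(0.1, 0.2, 0.3), (0.4, 0.4, 0.4), (1.2, 0.0, -0.1)]
    print(voxel_occupancy(voxelize(cloud, voxel_size=0.5)))
```

A real detector would typically compute richer per-voxel features (mean position, intensity statistics) and pack them into a dense or sparse tensor; the point count here only stands in for that step.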
Radar-based detection measures range, radial velocity, and angle to reflective objects. Radar penetrates rain and fog where camera and lidar degrade, making it invaluable in adverse weather. Radar's weakness is poor angular resolution: two nearby objects might appear as a single blob. Modern approaches fuse radar with camera and lidar to obtain the benefits of all three.
Sensor fusion combines detections from multiple sensors. A simple approach is voting: if camera and lidar both detect a car at roughly the same location, confidence is higher than it would be from either sensor alone. More sophisticated approaches use probabilistic fusion: each detector reports its estimate together with an uncertainty (a covariance matrix), and a fusion filter (extended Kalman filter, particle filter, or learned model) combines these uncertain estimates, weighting lower-uncertainty sources more heavily. When one sensor disagrees strongly with the others, its estimate is down-weighted or the sensor is flagged as potentially failed.
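The idea of weighting sources by their uncertainty can be illustrated with inverse-variance fusion, which is the static special case of a Kalman update when every sensor measures the position directly. The sketch below makes simplifying assumptions that a real system would not: sensor errors are independent, each covariance is diagonal (one variance per axis), and outlier rejection is omitted. The function name and the example numbers are hypothetical.

```python
def fuse_estimates(estimates):
    """Fuse independent position estimates by inverse-variance weighting.

    estimates: list of (position, variance) pairs, where both are tuples
    with one entry per axis (a diagonal covariance is assumed).
    Returns (fused_position, fused_variance) as tuples.
    """
    if not estimates:
        raise ValueError("need at least one estimate")
    dim = len(estimates[0][0])
    fused_position, fused_variance = [], []
    for axis in range(dim):
        # Each source is weighted by 1 / variance: the more certain a
        # sensor is along this axis, the more it pulls the fused value.
        weights = [1.0 / variance[axis] for _, variance in estimates]
        total_weight = sum(weights)
        weighted_sum = sum(
            w * position[axis] for w, (position, _) in zip(weights, estimates)
        )
        fused_position.append(weighted_sum / total_weight)
        # The fused variance is never larger than the smallest input variance.
        fused_variance.append(1.0 / total_weight)
    return tuple(fused_position), tuple(fused_variance)


if __name__ == "__main__":
    # Hypothetical numbers: the camera is uncertain in range (x) but good
    # laterally (y); the lidar is the other way around.
    camera = ((10.0, 2.0), (4.0, 0.25))
    lidar = ((12.0, 2.2), (1.0, 1.0))
    position, variance = fuse_estimates([camera, lidar])
    print(position, variance)
```

In this example the fused range lands closer to the lidar value and the fused lateral position closer to the camera value, and both fused variances are smaller than those of either input.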
Tracking maintains object identities across time. A tracking algorithm takes detections from the current frame and matches them to tracked objects from previous frames using distance metrics (Euclidean distance, Mahalanobis distance) or learned similarity measures. Matched detections update the tracked object's position and velocity; unmatched detections initiate new tracks; unmatched existing tracks are allowed to coast (propagated forward using their velocity estimate) or are terminated if they go undetected for too long. Tracking provides velocity estimates and smooths noisy detections through temporal filtering.
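The match, update, spawn, coast, and terminate cycle described above can be sketched as follows. This is a deliberately simplified illustration: it uses greedy nearest-neighbor matching on Euclidean distance with a fixed gate, and it estimates velocity by finite differences. A production tracker would more likely use a Kalman filter per track and an optimal assignment method. The class names and the `gate` and `max_missed` parameters are hypothetical.

```python
import math
from dataclasses import dataclass


@dataclass
class Track:
    track_id: int
    position: tuple               # (x, y) in meters
    velocity: tuple = (0.0, 0.0)  # (vx, vy) in meters per second
    missed: int = 0               # consecutive frames without a match


class GreedyTracker:
    def __init__(self, gate=2.0, max_missed=3):
        self.gate = gate              # maximum match distance in meters
        self.max_missed = max_missed  # coasting frames allowed before termination
        self.tracks = []
        self._next_id = 0

    def step(self, detections, dt):
        """Advance one frame. detections is a list of (x, y) tuples."""
        # Predict where each track should be now (constant-velocity model).
        predicted = {
            t.track_id: (t.position[0] + t.velocity[0] * dt,
                         t.position[1] + t.velocity[1] * dt)
            for t in self.tracks
        }
        # Collect all track/detection pairs inside the gate, closest first.
        pairs = sorted(
            (math.dist(predicted[t.track_id], det), t.track_id, j)
            for t in self.tracks
            for j, det in enumerate(detections)
            if math.dist(predicted[t.track_id], det) <= self.gate
        )
        match, used = {}, set()
        for _, track_id, j in pairs:
            if track_id not in match and j not in used:
                match[track_id] = j
                used.add(j)

        survivors = []
        for t in self.tracks:
            if t.track_id in match:
                # Matched: update position and finite-difference velocity.
                new_pos = detections[match[t.track_id]]
                t.velocity = ((new_pos[0] - t.position[0]) / dt,
                              (new_pos[1] - t.position[1]) / dt)
                t.position, t.missed = new_pos, 0
            else:
                # Unmatched: coast along the predicted path.
                t.position = predicted[t.track_id]
                t.missed += 1
            if t.missed <= self.max_missed:
                survivors.append(t)
        # Unmatched detections start new tracks.
        for j, det in enumerate(detections):
            if j not in used:
                survivors.append(Track(self._next_id, det))
                self._next_id += 1
        self.tracks = survivors
        return self.tracks
```

Calling `step` once per frame with that frame's detections returns the current track list. Replacing the Euclidean distance with a Mahalanobis distance would let the gate adapt to each track's uncertainty, at the cost of maintaining a covariance per track.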
The full pipeline thus produces, for each detected object: (1) position and orientation, (2) velocity, (3) classification (car, pedestrian, etc.), (4) confidence in each of these estimates, and (5) a consistent identity across frames. This structured output is what the planning module needs to predict collisions and plan safe trajectories.
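One way to picture this per-object output is as a small record type. The sketch below is only an illustration of the five items listed above; the field names, units, and the choice of a single standard deviation per quantity are assumptions, not a standard interface.

```python
import math
from dataclasses import dataclass


@dataclass(frozen=True)
class PerceivedObject:
    track_id: int             # (5) identity that stays consistent across frames
    position_m: tuple         # (1) (x, y, z) position in meters
    yaw_rad: float            # (1) heading angle in radians
    velocity_mps: tuple       # (2) (vx, vy, vz) in meters per second
    label: str                # (3) e.g. "car", "pedestrian"
    label_confidence: float   # (4) classifier confidence in [0, 1]
    position_std_m: float     # (4) position uncertainty, one standard deviation
    velocity_std_mps: float   # (4) velocity uncertainty, one standard deviation

    @property
    def speed_mps(self):
        """Magnitude of the velocity vector."""
        return math.hypot(*self.velocity_mps)


if __name__ == "__main__":
    # Hypothetical values for a single tracked car.
    car = PerceivedObject(
        track_id=7,
        position_m=(12.0, -1.5, 0.0),
        yaw_rad=0.05,
        velocity_mps=(3.0, 4.0, 0.0),
        label="car",
        label_confidence=0.95,
        position_std_m=0.2,
        velocity_std_mps=0.5,
    )
    print(car.track_id, car.label, car.speed_mps)
```

Keeping the uncertainties alongside the estimates lets a planner treat a confident detection differently from a doubtful one instead of receiving a bare position.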