Understanding Safety-Critical Driving Scenes on Rural Roads

Although rural areas typically have lower traffic volumes and fewer drivers than urban settings, the traffic-crash fatality rate in rural regions is higher, and the severity and consequences of such crashes tend to be more pronounced. While autonomous vehicles have the potential to improve accessibility and mobility for rural communities, they face significant limitations in these areas. For example, degraded or absent lane markings can impair an autonomous vehicle’s perception system, leading to inaccurate identification of drivable areas and an increased risk of collisions. Enhancing the visual cognitive abilities of autonomous vehicles to perceive, describe, reason, and make decisions in rural crash scenarios will improve safety across these widely distributed regions.

The objective of this project is to leverage Multimodal Large Language Models (MLLMs) to advance video understanding in rural driving scenarios, particularly in the context of crash events. The project will first develop a benchmark dataset annotated with crash temporal localization, risky agent spatial grounding, driving scene identification, crash event description, crash reasoning, and avoidance or mitigation measures. This dataset will include both rural crash instances and corresponding comparative scenarios, enabling a comprehensive analysis of challenges in rural crash video understanding. Next, state-of-the-art video understanding models will be evaluated on the dataset to identify performance gaps specific to rural contexts. Finally, an efficient method will be developed to adapt video LLMs for rural driving applications, enhancing the operability and safety of autonomous vehicles on rural roads.
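The six annotation dimensions described above could be captured in a per-clip record. Below is a minimal sketch of one possible schema as a Python dataclass; all field names and values are illustrative assumptions, not the project's actual annotation format.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Tuple

@dataclass
class CrashAnnotation:
    """Hypothetical per-clip annotation record for the benchmark (field
    names are illustrative, not the project's actual schema)."""
    video_id: str
    # Crash temporal localization: (start_sec, end_sec) of the crash event
    crash_window: Tuple[float, float]
    # Risky agent spatial grounding: frame index -> [x1, y1, x2, y2] box
    risky_agent_boxes: Dict[int, List[float]] = field(default_factory=dict)
    # Driving scene identification, e.g. "rural two-lane, unmarked"
    scene_type: str = ""
    # Crash event description (free text)
    description: str = ""
    # Crash reasoning (causal explanation)
    reasoning: str = ""
    # Avoidance or mitigation measure
    mitigation: str = ""
    # Rural crash instance vs. comparative (e.g. urban) scenario
    is_rural: bool = True

ann = CrashAnnotation(
    video_id="clip_0001",
    crash_window=(12.4, 15.0),
    risky_agent_boxes={310: [420.0, 188.0, 512.0, 260.0]},
    scene_type="rural two-lane, unmarked",
    description="Oncoming pickup drifts across the centerline.",
    reasoning="Faded markings and a blind crest reduce reaction time.",
    mitigation="Slow before the crest and bias toward the shoulder.",
)
print(ann.video_id, ann.crash_window)
```

A flat record like this keeps all six tasks aligned to the same clip, which simplifies both evaluation (each task reads one field) and the rural-versus-comparative split via the `is_rural` flag.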

Exhibit D