The IRL-DAL framework integrates four core components: a hybrid IL-IRL-RL training curriculum, an energy-guided diffusion planner, a learnable adaptive mask for contextual perception, and a safety-aware experience correction mechanism.
This study introduces IRL-DAL (Inverse Reinforcement Learning with a Diffusion-based Adaptive Lookahead planner), an integrated framework that unites reinforcement learning with generative trajectory planning for autonomous driving. At its foundation lies the Diffusion-based Adaptive Lookahead (DAL) planner — a conditional diffusion model designed to generate safe and dynamically feasible motion trajectories governing both steering and velocity control.
Unlike conventional sampling approaches, DAL operates in a classifier-free manner, with sampling guided by a multi-objective energy function. This function automatically balances lane keeping, obstacle avoidance, and control stability in response to real-time perceptual signals, allowing the agent to prioritize safety under challenging conditions without compromising efficiency.
To address sample inefficiency and promote stable policy convergence, training proceeds through a two-stage curriculum implemented in the Webots simulator. The process begins with Behavioral Cloning (BC) for policy initialization and transitions to Proximal Policy Optimization (PPO) for refinement. During the PPO stage, a hybrid reward formulation merges sparse environmental feedback with a dense, learned reward obtained via Inverse Reinforcement Learning (IRL), thereby encouraging expert-level driving behavior.
Comprehensive simulation results show that IRL-DAL consistently surpasses standard baselines in terms of safety, control smoothness, and task completion. These findings validate the proposed framework's capability to produce robust, precise, and safety-compliant driving policies suitable for deployment in safety-critical autonomous systems.
| Model | Mean Reward ↑ | Coll./1k Steps ↓ | Success (%) ↑ | BC Loss (×10⁻²) ↓ | Action Sim. (%) ↑ | ADE (m) ↓ | FDE (m) ↓ |
|---|---|---|---|---|---|---|---|
| PPO + Uniform Sampling | 85.2 ± 4.1 | 0.63 ± 0.12 | 78.1 ± 3.2 | 17.1 ± 1.4 | 65.3 ± 4.1 | 5.25 ± 0.31 | 11.8 ± 0.65 |
| + FSM Replay | 120.4 ± 3.8 (+41%) | 0.30 ± 0.08 | 88.4 ± 2.1 | 12.3 ± 1.1 | 75.1 ± 3.5 | 4.10 ± 0.27 | 9.5 ± 0.58 |
| + Diffusion Planner | 155.1 ± 3.2 (+29%) | 0.15 ± 0.05 | 92.0 ± 1.8 | 13.0 ± 1.0 | 80.2 ± 3.0 | 3.15 ± 0.22 | 7.2 ± 0.49 |
| + LAM + SAEC (Ours) | **180.7 ± 2.9** (+16%) | **0.05 ± 0.03** | **96.3 ± 1.2** | **7.4 ± 0.8** | **85.7 ± 2.4** | **2.45 ± 0.18** | **5.1 ± 0.41** |
Quantitative performance across architectural variants (10 seeds, mean ± std). Mean reward is normalized to [0, 200]. ADE/FDE denote the average and final displacement error from rollout evaluation. Percentages in parentheses give the relative gain in mean reward over the preceding row; arrows indicate the direction of improvement; bold denotes the best value per column.
To ensure stable convergence and sample efficiency, our agent is trained via a two-phase curriculum. Phase 1 (Behavioral Cloning) involves pre-training the policy for 20,000 steps using an expert dataset generated by a Finite State Machine (FSM). This establishes a robust behavioral prior, teaching the agent fundamental driving skills. Phase 2 (IRL-PPO) fine-tunes the policy for 30,000 steps using a hybrid reward signal. This signal combines sparse task rewards with a dense, GAIL-based imitation reward, encouraging the agent to explore complex scenarios while adhering to expert-like decision-making.
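As a rough illustration of how the Phase 2 reward is assembled, the sketch below combines a sparse task reward with a GAIL-style dense imitation term derived from a discriminator's expert probability. The mixing weight `lambda_imit`, the log form of the imitation term, and the function names are illustrative assumptions, not the paper's exact formulation.

```python
import math

def hybrid_reward(sparse_task_reward: float,
                  discriminator_prob: float,
                  lambda_imit: float = 0.5,
                  eps: float = 1e-8) -> float:
    """Combine the sparse environment reward with a dense, GAIL-style
    imitation reward derived from a discriminator's output.

    discriminator_prob: hypothetical D(s, a) in (0, 1), the probability that
    the transition came from the FSM expert rather than the current policy.
    """
    # Dense imitation term: grows as the discriminator judges the
    # transition to be more expert-like.
    imitation_reward = -math.log(1.0 - discriminator_prob + eps)
    return sparse_task_reward + lambda_imit * imitation_reward


if __name__ == "__main__":
    # A step with zero task reward but expert-like behaviour still
    # receives a dense learning signal.
    print(hybrid_reward(sparse_task_reward=0.0, discriminator_prob=0.9))
```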
The Diffusion Planner acts as a critical, on-demand safety module. It is activated only in high-risk states flagged by the base policy, which keeps its computational cost low. Upon activation, it generates a set of safe, short-horizon candidate trajectories, and the optimal one is selected via a multi-objective energy function that balances three objectives: lane adherence to maintain course, obstacle avoidance to prevent collisions, and control smoothness (jerk minimization) for passenger comfort and stability.
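To make the selection step concrete, the following sketch scores candidate waypoint trajectories with a weighted three-term energy (lane deviation, obstacle proximity, jerk) and returns the minimum-energy candidate. It covers only the post-sampling selection, not the diffusion model itself, and the term definitions and weights are placeholder assumptions standing in for the trained energy function described above.

```python
import numpy as np

def trajectory_energy(traj: np.ndarray,
                      lane_center: np.ndarray,
                      obstacles: np.ndarray,
                      w_lane: float = 1.0,
                      w_obs: float = 5.0,
                      w_jerk: float = 0.1) -> float:
    """Score one candidate trajectory (T x 2 array of x, y waypoints).

    The three terms mirror the objectives named in the text: lane adherence,
    obstacle avoidance, and control smoothness. Weights are illustrative.
    """
    # Lane adherence: mean squared deviation from the lane centreline.
    lane_term = np.mean(np.sum((traj - lane_center) ** 2, axis=1))

    # Obstacle avoidance: inverse distance to the closest obstacle per waypoint.
    dists = np.linalg.norm(traj[:, None, :] - obstacles[None, :, :], axis=2)
    obs_term = np.mean(1.0 / (np.min(dists, axis=1) + 1e-3))

    # Smoothness: squared third difference of positions approximates jerk.
    jerk = np.diff(traj, n=3, axis=0)
    jerk_term = np.mean(np.sum(jerk ** 2, axis=1)) if len(jerk) else 0.0

    return w_lane * lane_term + w_obs * obs_term + w_jerk * jerk_term


def select_trajectory(candidates, lane_center, obstacles):
    """Return the candidate with minimum total energy."""
    energies = [trajectory_energy(c, lane_center, obstacles) for c in candidates]
    return candidates[int(np.argmin(energies))]


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    lane = np.stack([np.linspace(0, 10, 20), np.zeros(20)], axis=1)
    cands = [lane + rng.normal(scale=0.2, size=lane.shape) for _ in range(8)]
    best = select_trajectory(cands, lane, obstacles=np.array([[5.0, 1.0]]))
    print(best.shape)  # (20, 2)
```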
The Learnable Adaptive Mask (LAM) is a state-aware perception module designed for intelligent allocation of visual attention. It dynamically adjusts the agent's perceptual focus based on real-time driving context. At high speeds, rather than expanding to the horizon, LAM actively amplifies the lower visual field (near-field road features) to ensure precise lane tracking and lateral stability. Conversely, when LiDAR detects nearby hazards or in dense traffic, the mask intensifies focus on the immediate surroundings to facilitate rapid collision avoidance, offering an interpretable and efficient alternative to heavy self-attention layers.
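A minimal sketch of the masking idea follows, assuming the mask reduces to a per-row weighting of the camera image conditioned on ego speed and the nearest LiDAR range. The sigmoid gating and its thresholds are a fixed heuristic chosen for illustration; in the actual module the weighting is learned.

```python
import numpy as np

def adaptive_mask(height: int,
                  speed_mps: float,
                  nearest_obstacle_m: float,
                  v_ref: float = 10.0,
                  d_safe: float = 5.0) -> np.ndarray:
    """Build a per-row attention mask for an image with `height` rows.

    Rows near the bottom of the image correspond to the near-field road.
    Their weight grows with speed (for tight lane tracking) and when an
    obstacle is closer than `d_safe` metres (for rapid avoidance).
    All thresholds and gains are illustrative placeholders.
    """
    rows = np.linspace(0.0, 1.0, height)        # 0 = top row, 1 = bottom row
    base = 0.5 + 0.5 * rows                     # mild default bias toward near field

    # Sigmoid gates: approach 1 at high speed / short obstacle distance.
    speed_gain = 1.0 / (1.0 + np.exp(-(speed_mps - v_ref)))
    hazard_gain = 1.0 / (1.0 + np.exp(nearest_obstacle_m - d_safe))

    # Amplify the lower visual field proportionally to both gates.
    mask = base * (1.0 + speed_gain * rows + hazard_gain * rows)
    return mask / mask.max()                    # normalise to [0, 1]
```

The resulting vector can be broadcast over image rows (e.g., `image * mask[:, None]`) before the frame is passed to the policy encoder.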
Our work introduces several novel contributions to the field of autonomous driving: