DiWA is an algorithmic framework for fine-tuning diffusion-based policies with reinforcement learning entirely inside a frozen world model learned from large-scale robot play data.
Overview of our approach
(a) Standard diffusion policies trained via imitation learning are limited by offline data. (b) DPPO fine-tunes diffusion policies using online interactions, which are expensive and require access to real or simulated environments. (c) DiWA fine-tunes diffusion policies entirely offline through imagined rollouts in a learned world model, enabling safe and efficient policy improvement without additional physical interaction.

Abstract

Fine-tuning diffusion policies with reinforcement learning (RL) presents significant challenges. The long denoising sequence for each action prediction impedes effective reward propagation. Moreover, standard RL methods require millions of real-world interactions, posing a major bottleneck for practical fine-tuning. Although prior work frames the denoising process in diffusion policies as a Markov Decision Process to enable RL-based updates, its strong dependence on environment interaction makes it highly sample-inefficient. To bridge this gap, we introduce DiWA, a novel framework that leverages a world model for fine-tuning diffusion-based robotic skills entirely offline with reinforcement learning. Unlike model-free approaches that require millions of environment interactions to fine-tune a repertoire of robot skills, DiWA achieves effective adaptation using a world model trained once on a few hundred thousand offline play interactions. This results in dramatically improved sample efficiency, making the approach significantly more practical and safer for real-world robot learning. On the challenging CALVIN benchmark, DiWA improves performance across eight tasks using only offline adaptation, while requiring orders of magnitude fewer physical interactions than model-free baselines. To our knowledge, this is the first demonstration of fine-tuning diffusion policies for real-world robotic skills using an offline world model.
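The denoising-as-MDP framing mentioned above can be made concrete with a short sketch. The snippet below is illustrative only and not our released implementation: the module names, dimensions, and fixed noise scale are assumptions, and a plain REINFORCE-style gradient stands in for the PPO-based updates used in practice. It shows how a terminal task reward can be credited to every step of the denoising chain once each stochastic denoising step is treated as an MDP transition.

```python
import torch
import torch.nn as nn

# Illustrative sketch only: a diffusion policy denoises a_K -> ... -> a_0
# conditioned on a state s. Treating each stochastic denoising step as an MDP
# transition lets a terminal reward be propagated through the whole chain via
# a policy gradient. All names, dimensions, and the noise scale are assumptions.

STATE_DIM, ACTION_DIM, K, SIGMA = 32, 7, 10, 0.1

class EpsModel(nn.Module):
    """Noise-prediction network conditioned on state and denoising step index."""
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(STATE_DIM + ACTION_DIM + 1, 128),
            nn.ReLU(),
            nn.Linear(128, ACTION_DIM),
        )

    def forward(self, a_k, s, k):
        k_feat = torch.full_like(a_k[:, :1], k / K)   # normalized step index
        return self.net(torch.cat([a_k, s, k_feat], dim=-1))

def denoising_chain_loss(eps_model, s, reward):
    """REINFORCE loss over the full denoising chain for a given task reward."""
    a = torch.randn(s.shape[0], ACTION_DIM)           # start from pure noise
    log_probs = []
    for k in reversed(range(K)):
        mean = a - eps_model(a, s, k)                 # simplified denoised mean
        dist = torch.distributions.Normal(mean, SIGMA)
        a = dist.sample()                             # next (less noisy) action
        log_probs.append(dist.log_prob(a).sum(-1))
    # The reward of the finally executed action a_0 is credited to every step.
    return -(torch.stack(log_probs).sum(0) * reward).mean()

eps_model = EpsModel()
s = torch.randn(8, STATE_DIM)                         # batch of (latent) states
reward = torch.rand(8)                                # placeholder task rewards
loss = denoising_chain_loss(eps_model, s, reward)
loss.backward()
```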

Overview

DiWA framework: (1) A world model is trained on robot play data to learn latent dynamics. (2) A diffusion policy is pre-trained on expert demonstrations using latent representations. (3) A success classifier is trained on expert rollouts to estimate task rewards. (4) The diffusion policy is fine-tuned entirely offline via imagined rollouts within the Dream Diffusion MDP, using policy gradients and classifier-based rewards.
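To make step (4) concrete, the following is a minimal, hedged sketch of offline fine-tuning through imagined rollouts. The WorldModel, DiffusionPolicy, and SuccessClassifier classes are stand-ins for the frozen components from steps (1)-(3): their interfaces and dimensions are assumptions for illustration rather than the released API, the Gaussian policy head replaces a full denoising chain (illustrated in the earlier sketch), and a plain REINFORCE update stands in for the policy-gradient machinery of the Dream Diffusion MDP.

```python
import torch
import torch.nn as nn

LATENT_DIM, ACTION_DIM, HORIZON = 64, 7, 16

class WorldModel(nn.Module):
    """Stand-in for the frozen latent dynamics model: z' = f(z, a)."""
    def __init__(self):
        super().__init__()
        self.f = nn.Linear(LATENT_DIM + ACTION_DIM, LATENT_DIM)

    def step(self, z, a):
        return self.f(torch.cat([z, a], dim=-1))

class DiffusionPolicy(nn.Module):
    """Stand-in policy with a Gaussian head instead of a full denoising chain."""
    def __init__(self):
        super().__init__()
        self.mu = nn.Linear(LATENT_DIM, ACTION_DIM)

    def dist(self, z):
        return torch.distributions.Normal(self.mu(z), 0.1)

    def sample(self, z):
        return self.dist(z).sample()

    def log_prob(self, z, a):
        return self.dist(z).log_prob(a).sum(-1)

class SuccessClassifier(nn.Module):
    """Stand-in reward model: P(success | final latent state)."""
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(LATENT_DIM, 1), nn.Sigmoid())

    def forward(self, z):
        return self.net(z)

@torch.no_grad()
def imagine_rollout(world_model, policy, z0, horizon=HORIZON):
    """Roll the policy forward purely inside the frozen world model."""
    z, traj = z0, []
    for _ in range(horizon):
        a = policy.sample(z)
        traj.append((z, a))
        z = world_model.step(z, a)       # imagined latent transition, no robot
    return traj, z

def finetune_step(world_model, policy, classifier, optimizer, z0):
    traj, z_final = imagine_rollout(world_model, policy, z0)
    reward = classifier(z_final).squeeze(-1)          # classifier-based reward
    # Policy-gradient update on the imagined trajectory (REINFORCE here for
    # brevity; only the policy parameters receive gradients).
    log_prob = sum(policy.log_prob(z, a) for z, a in traj)
    loss = -(log_prob * reward).mean()
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()

world_model, policy, classifier = WorldModel(), DiffusionPolicy(), SuccessClassifier()
for p in list(world_model.parameters()) + list(classifier.parameters()):
    p.requires_grad_(False)                           # only the policy is updated
optimizer = torch.optim.Adam(policy.parameters(), lr=3e-4)
z0 = torch.randn(8, LATENT_DIM)                       # batch of initial latent states
finetune_step(world_model, policy, classifier, optimizer, z0)
```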

Summary Video

Policy Improvement Examples: Real-World

On the left, we see the base policies struggling to perform tasks.
In the middle, the policies find better actions while dreaming.
On the right, the fine-tuned policies successfully perform the tasks in the real world.


Open Drawer

Close Drawer

Push Slider Right

Policy Improvement Examples: CALVIN

Code

For academic usage, a software implementation of this project based on PyTorch can be found in our GitHub repository; it is released under the GPLv3 license. For any commercial purpose, please contact the authors. You can download the pretrained models and collected play data below.

Publications

If you find our work useful, please consider citing our paper:

Akshay L Chandra, Iman Nematollahi, Chenguang Huang, Tim Welschehold, Wolfram Burgard, Abhinav Valada
DiWA: Diffusion Policy Adaptation with World Models
Conference on Robot Learning (CoRL), 2025.

(PDF) (Code) (BibTeX)

Authors

Akshay L Chandra*

University of Freiburg

Iman Nematollahi*

University of Freiburg

Chenguang Huang

University of Technology Nuremberg

Tim Welschehold

University of Freiburg

Wolfram Burgard

University of Technology Nuremberg

Abhinav Valada

University of Freiburg

Acknowledgment

This work was supported by the BrainWorlds initiative of the BrainLinks-BrainTools center at the University of Freiburg.