DreamWaltz-G: Expressive 3D Gaussian Avatars from Skeleton-Guided 2D Diffusion

Extended Version of DreamWaltz
 
Yukun Huang 1     Jianan Wang 2     Ailing Zeng 3     Zheng-Jun Zha 4     Lei Zhang 5     Xihui Liu 1
1 The University of Hong Kong     2 Astribot     3 Tencent     4 University of Science and Technology of China     5 International Digital Economy Academy (IDEA)

DreamWaltz-G uses skeleton-guided 2D diffusion for text-to-3D avatar generation and expressive whole-body animation, supporting diverse applications such as shape control & editing, 2D human video reenactment, and 3D scene composition.

Abstract

In this work, we introduce DreamWaltz-G, a novel learning framework for text-driven 3D avatar creation and expressive whole-body animation.

The core of this framework lies in the proposed Skeleton-guided Score Distillation and Hybrid 3D Gaussian Avatar Representation. Specifically, the proposed skeleton-guided score distillation integrates skeleton controls from 3D human templates into 2D diffusion models, enhancing the consistency of SDS supervision with respect to viewpoint and human pose. This facilitates the generation of high-quality avatars, mitigating issues such as multiple faces, extra limbs, and blurring. The proposed hybrid 3D Gaussian avatar representation builds on efficient 3D Gaussian Splatting, combining it with neural implicit fields and parameterized 3D meshes to enable real-time rendering, stable SDS optimization, and expressive animation.
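The skeleton-guided score distillation described above can be sketched as follows: the avatar rendering is noised, a skeleton-conditioned diffusion model predicts the noise, and the residual serves as a gradient on the rendering. This is a minimal numpy sketch; `eps_pred_fn` is a hypothetical stand-in for a ControlNet-style diffusion model, and the `1 - alpha_t` weighting is one common SDS choice, not necessarily the paper's exact schedule.

```python
import numpy as np

rng = np.random.default_rng(0)

def sds_gradient(rendered, t, skeleton, eps_pred_fn, alphas_cumprod):
    """Skeleton-guided SDS sketch (simplified).
    rendered:       avatar rendering from a sampled camera view
    skeleton:       SMPL-X skeleton map rendered from the SAME view,
                    used as the conditioning signal
    eps_pred_fn:    hypothetical skeleton-conditioned diffusion model
                    interface: (noisy_image, t, skeleton) -> predicted noise
    alphas_cumprod: diffusion noise schedule (cumulative alphas)
    """
    a_t = alphas_cumprod[t]
    eps = rng.standard_normal(rendered.shape)           # injected Gaussian noise
    noisy = np.sqrt(a_t) * rendered + np.sqrt(1 - a_t) * eps
    eps_pred = eps_pred_fn(noisy, t, skeleton)          # skeleton-conditioned denoising
    w = 1.0 - a_t                                       # a common SDS weighting
    # Gradient of the SDS loss w.r.t. the rendering (omits the renderer Jacobian)
    return w * (eps_pred - eps)
```

Because the skeleton is rendered from the same camera as the avatar, the diffusion guidance stays consistent across views and poses, which is what suppresses multi-face and extra-limb artifacts.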

Extensive experiments demonstrate that DreamWaltz-G is highly effective in generating and animating 3D avatars, outperforming existing methods in both visual quality and animation expressiveness. Our framework further supports diverse applications, including human video reenactment and multi-subject scene composition.

What's New

📢 2024-09: Project page released!

Avatar Gallery

Canonical Avatars. DreamWaltz-G can create whole-body canonical 3D avatars from only textual descriptions:

Animatable Avatars. The generated 3D avatars can be animated given SMPL-X motion sequences:

Motion Source: TalkSHOW

Motion Source: Motion-X
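Driving the avatar with SMPL-X motion sequences amounts to skinning the 3D Gaussians to the posed skeleton. Below is a minimal numpy sketch of linear blend skinning applied to Gaussian centers; the function name and array layout are illustrative assumptions (a full implementation would also transform Gaussian covariances).

```python
import numpy as np

def animate_gaussians(centers, skin_weights, joint_rotations, joint_translations):
    """Linear blend skinning of 3D Gaussian centers (simplified sketch).
    centers:            (N, 3) canonical Gaussian positions
    skin_weights:       (N, K) per-Gaussian weights over K joints (rows sum to 1)
    joint_rotations:    (K, 3, 3) posed joint rotations
    joint_translations: (K, 3)    posed joint translations
    """
    # Apply every joint transform to every center: (K, N, 3)
    posed = np.einsum('kij,nj->kni', joint_rotations, centers) \
            + joint_translations[:, None, :]
    # Blend the per-joint results with the skinning weights: (N, 3)
    return np.einsum('nk,kni->ni', skin_weights, posed)
```

With identity rotations and zero translations the canonical avatar is recovered unchanged, which makes the canonical and animated galleries above two outputs of the same representation.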

Applications

Shape Control and Editing. Our method enables: (a) training-time shape control by modifying the SMPL-X template, and (b) inference-time shape editing by explicitly adjusting the 3D Gaussians:

Human Video Reenactment. Leveraging 3D human pose estimation and video inpainting techniques, the generated 3D avatars can seamlessly reenact 2D human videos:

Multi-Subject Scene Composition. The generated 3D avatars can be composed with existing 3D Gaussian scenes:
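Because avatars and scenes share the same explicit 3D Gaussian representation, composition reduces to rigidly placing the avatar's Gaussians and concatenating the attribute arrays. The sketch below assumes a hypothetical dict-of-arrays layout; a real pipeline would also rotate covariances and spherical-harmonics coefficients.

```python
import numpy as np

def compose_scene(scene, avatar, R, t):
    """Place an avatar's 3D Gaussians into a scene (simplified sketch).
    scene, avatar: dicts of per-Gaussian attribute arrays, e.g.
                   {"centers": (N, 3), "opacity": (N, 1), ...}
    R, t:          rigid transform placing the avatar in scene coordinates
    """
    placed = dict(avatar)
    placed["centers"] = avatar["centers"] @ R.T + t  # rigidly move the avatar
    # Concatenate every shared attribute; the renderer treats the result
    # as one Gaussian set, so no special compositing pass is needed.
    return {k: np.concatenate([scene[k], placed[k]]) for k in scene}
```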

BibTeX

@article{huang2024dreamwaltz-g,
  title={{DreamWaltz-G: Expressive 3D Gaussian Avatars from Skeleton-Guided 2D Diffusion}},
  author={Huang, Yukun and Wang, Jianan and Zeng, Ailing and Zha, Zheng-Jun and Zhang, Lei and Liu, Xihui},
  year={2024},
  journal={arXiv preprint arXiv:2409.17145},
  eprint={2409.17145},
  archivePrefix={arXiv},
  primaryClass={cs.CV},
}

@inproceedings{huang2024dreamwaltz,
  title={{DreamWaltz: Make a Scene with Complex 3D Animatable Avatars}},
  author={Huang, Yukun and Wang, Jianan and Zeng, Ailing and Cao, He and Qi, Xianbiao and Shi, Yukai and Zha, Zheng-Jun and Zhang, Lei},
  booktitle={Proceedings of the 37th International Conference on Neural Information Processing Systems},
  pages={4566--4584},
  year={2023}
}