Abstract

Co-speech gesture generation is crucial for creating lifelike avatars and enhancing human-computer interactions by synchronizing gestures with speech. Despite recent advancements, existing methods struggle with accurately identifying the rhythmic or semantic triggers from audio for generating contextualized gesture patterns and achieving pixel-level realism. To address these challenges, we introduce Contextual Gesture, a framework that improves co-speech gesture video generation through three innovative components: (1) a chronological speech-gesture alignment that temporally connects two modalities, (2) a contextualized gesture tokenization that incorporate speech context into motion pattern representation through distillation, and (3) a structure-aware refinement module that employs edge connection to link gesture keypoints to improve video generation. Our extensive experiments demonstrate that Contextual Gesture not only produces realistic and speech-aligned gesture videos but also supports long-sequence generation and video gesture editing applications.

Method

Left: Contrastive Learning for gesture-speech alignment. We distill the joint speech contextual-aware feature into latent codebook. Right: We use speech for generating discrete gesture motion tokens with Mask Gesture Generator. We apply random mask for token reconstruction during training and iterative remask based on probability for inference. Residual Gesture Generator finally based on the base VQ-tokens to predict the residual quantized tokens.

Comparisons

Long Sequence Generation

BibTeX

@misc{liu2025contextualgesturecospeechgesture, title={Contextual Gesture: Co-Speech Gesture Video Generation through Context-aware Gesture Representation}, author={Pinxin Liu and Pengfei Zhang and Hyeongwoo Kim and Pablo Garrido and Ari Sharpio and Kyle Olszewski}, year={2025}, eprint={2502.07239}, archivePrefix={arXiv}, primaryClass={cs.CV}, url={https://arxiv.org/abs/2502.07239} }

Contexual Gesture: Co-Speech Gesture Video Generation Through Context-aware Gesture Representation

Abstract

Method

Comparisons

We compare our method with S2G-Diffusion, ANGIE, we exlude the results of MM-Diffusion due to its inablilty to generate long sequence videos

Comparison with EchoMimicV2 and Ablation Study of Alignment

Long Sequence Generation

We can achieve longer than 30s or even 1 min video speech-driven video generations.

Gesture Video Editing

In this example, we modify the first 7 seconds with the new audio, and the last 8 seconds from the original videos for generating the edited result.

Gesture Pattern Transfer-1

We can re-enact different characters with the same audio to present the same gesture patterns.

Gesture Pattern Transfer-2

We can re-enact the same character with the same audio to present different gesture patterns.

Comparison on BEAT-X

Emage presents unnatural temporal transitions of gestures and jittorings. Our work achieves more aligned gesture motions conditioned on speech audio. With contextural distillation, the motion patterns can be more natural as shown on the left.

Ablation: Contexutalized Motion Representation

Ablation: Video Avatar Animation

BibTeX