Contextual Gesture: Co-Speech Gesture Video Generation Through Context-aware Gesture Representation

arXiv 2025

Contextual Gesture achieves fine-grained control over video-level gesture motion. Left: We generate speech-conditioned gesture videos of 30 seconds to 1 minute. Middle: We modify the gestures of intermediate frames of a video by providing a new audio segment. Right: Different people present the same gesture patterns for a given audio.

Abstract

Co-speech gesture generation is crucial for creating lifelike avatars and enhancing human-computer interaction by synchronizing gestures with speech. Despite recent advancements, existing methods struggle to identify the rhythmic or semantic triggers in audio for generating contextualized gesture patterns and to achieve pixel-level realism. To address these challenges, we introduce Contextual Gesture, a framework that improves co-speech gesture video generation through three innovative components: (1) a chronological speech-gesture alignment that temporally connects the two modalities, (2) a contextualized gesture tokenization that incorporates speech context into the motion pattern representation through distillation, and (3) a structure-aware refinement module that employs edge connections to link gesture keypoints and improve video generation. Our extensive experiments demonstrate that Contextual Gesture not only produces realistic, speech-aligned gesture videos but also supports long-sequence generation and video gesture editing applications.
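To make the alignment idea concrete, below is a minimal PyTorch sketch of a chronological (frame-level) speech-gesture contrastive objective. The function name, feature shapes, and temperature value are illustrative assumptions for this sketch, not the released implementation.

    # Hypothetical sketch of a frame-level (chronological) contrastive alignment loss.
    import torch
    import torch.nn.functional as F

    def chronological_alignment_loss(speech_feat, gesture_feat, temperature=0.07):
        # speech_feat, gesture_feat: (B, T, D) frame-level features from the two encoders.
        # Temporally matching speech/gesture frames are positives; all other frames
        # in the batch serve as negatives (InfoNCE-style).
        B, T, D = speech_feat.shape
        s = F.normalize(speech_feat.reshape(B * T, D), dim=-1)
        g = F.normalize(gesture_feat.reshape(B * T, D), dim=-1)
        logits = s @ g.t() / temperature                 # (B*T, B*T) similarity matrix
        targets = torch.arange(B * T, device=s.device)   # positives lie on the diagonal
        # Symmetric loss: speech-to-gesture and gesture-to-speech directions.
        return 0.5 * (F.cross_entropy(logits, targets) +
                      F.cross_entropy(logits.t(), targets))

Because positives are defined per frame rather than per clip, such an objective encourages chronologically local correspondence between audio and motion.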

Method

Left: Contrastive learning for gesture-speech alignment. We distill the joint, speech-context-aware feature into the latent codebook. Right: We use speech to generate discrete gesture motion tokens with the Mask Gesture Generator. We apply random masking for token reconstruction during training and probability-based iterative remasking during inference. The Residual Gesture Generator then predicts the residual quantized tokens conditioned on the base VQ tokens.
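As an illustration of the inference procedure described above, here is a minimal PyTorch sketch of confidence-based iterative remasking for decoding base gesture tokens from speech. The generator interface, the cosine mask schedule, and all hyperparameters are assumptions made for this sketch and may differ from the actual Mask Gesture Generator.

    # Hypothetical sketch of confidence-based iterative remasking at inference time.
    import math
    import torch

    @torch.no_grad()
    def iterative_remask_decode(generator, speech_feat, seq_len, mask_id, num_steps=10):
        # Start from an all-masked token sequence and progressively commit the
        # most confident predictions, remasking the rest at each step.
        tokens = torch.full((1, seq_len), mask_id, dtype=torch.long,
                            device=speech_feat.device)
        for step in range(num_steps):
            logits = generator(tokens, speech_feat)      # assumed signature: (1, seq_len, codebook_size)
            probs = logits.softmax(dim=-1)
            conf, pred = probs.max(dim=-1)               # per-token confidence and prediction
            # Cosine schedule: fraction of tokens that remain masked after this step.
            mask_ratio = math.cos(math.pi / 2 * (step + 1) / num_steps)
            num_masked = int(mask_ratio * seq_len)
            # Never remask tokens that are already committed.
            conf = conf.masked_fill(tokens != mask_id, float("inf"))
            if num_masked > 0:
                remask_idx = conf.topk(num_masked, largest=False).indices
                pred[0, remask_idx[0]] = mask_id         # least confident positions stay masked
            tokens = torch.where(tokens == mask_id, pred, tokens)
        return tokens  # base VQ tokens; a residual generator can refine these further

Later steps commit progressively more tokens, so the motion sequence is filled in coarse-to-fine rather than strictly autoregressively.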

Comparisons

We compare our method with S2G-Diffusion and ANGIE. We exclude the results of MM-Diffusion due to its inability to generate long-sequence videos.

Comparison with EchoMimicV2 and Ablation Study of Alignment

We compare with EchoMimicV2 and present an ablation study on the effectiveness of chronological modality alignment during distillation. The results show that chronological alignment helps reduce unnatural temporal transitions and jittering while making gesture patterns more diverse.

Long Sequence Generation

We can generate speech-driven videos longer than 30 seconds, and even up to 1 minute.

Gesture Video Editing

In this example, we modify the first 7 seconds with the new audio and keep the last 8 seconds from the original video to produce the edited result.

Gesture Pattern Transfer-1

We can re-enact different characters with the same audio to present the same gesture patterns.

Gesture Pattern Transfer-2

We can re-enact the same character with the same audio to present different gesture patterns.

Comparison on BEAT-X

EMAGE presents unnatural temporal transitions of gestures and jittering. Our work achieves gesture motions that are better aligned with the conditioning speech audio. With contextual distillation, the motion patterns become more natural, as shown on the left.

Ablation: Contextualized Motion Representation

Relying only on RVQ tokenization, the generated gestures are weakly aligned with the speech audio. Incorporating the pretrained audio encoder from the temporal alignment alleviates this problem. Our contextualized distillation further enhances temporal matching, yielding more natural movements, beat patterns, and facial expressions.
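For reference, the sketch below shows what plain residual vector quantization (RVQ) of motion features looks like; the class name, layer count, and codebook size are illustrative assumptions, not the released tokenizer.

    # Hypothetical sketch of residual vector quantization (RVQ) for motion tokenization.
    import torch
    import torch.nn as nn

    class ResidualVQ(nn.Module):
        def __init__(self, num_layers=6, codebook_size=512, dim=128):
            super().__init__()
            self.codebooks = nn.ModuleList(
                nn.Embedding(codebook_size, dim) for _ in range(num_layers))

        def forward(self, x):
            # x: (B, T, D) continuous motion features -> quantized features and
            # per-layer token ids; each layer encodes the residual left by the previous one.
            residual, quantized, indices = x, torch.zeros_like(x), []
            for codebook in self.codebooks:
                dist = ((residual.unsqueeze(-2) - codebook.weight) ** 2).sum(-1)  # (B, T, C)
                ids = dist.argmin(dim=-1)        # nearest codebook entry per frame
                q = codebook(ids)                # (B, T, D)
                quantized = quantized + q
                residual = residual - q
                indices.append(ids)
            return quantized, indices  # indices[0]: base tokens; indices[1:]: residual tokens

The first codebook captures coarse motion, matching the split into base and residual tokens used by the Mask and Residual Gesture Generators described in the Method section.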

Ablation: Video Avatar Animation

We compare our image-warping-based method with AnimateAnyone for video avatar animation. Although AnimateAnyone achieves high-quality hand structures, it fails to maintain the identity of the source speaker. It also fails to capture the temporal background motion caused by camera movement within the video, leading to unstable background rendering.

BibTeX

@misc{liu2025contextualgesturecospeechgesture,
        title={Contextual Gesture: Co-Speech Gesture Video Generation through Context-aware Gesture Representation}, 
        author={Pinxin Liu and Pengfei Zhang and Hyeongwoo Kim and Pablo Garrido and Ari Shapiro and Kyle Olszewski},
        year={2025},
        eprint={2502.07239},
        archivePrefix={arXiv},
        primaryClass={cs.CV},
        url={https://arxiv.org/abs/2502.07239}
}