When humans speak, gestures help convey communicative intentions, such as adding emphasis or describing concepts. However, current co-speech gesture generation methods rely solely on superficial linguistic cues (e.g., speech audio or text transcripts) and neglect the communicative intention that underpins human gestures. As a result, their outputs are rhythmically synchronized with speech but semantically shallow. To address this gap, we introduce Intentional-Gesture, a novel framework that casts gesture generation as an intention-reasoning task grounded in high-level communicative functions. First, we curate the InG dataset by augmenting BEAT-2 with gesture-intention annotations, i.e., text sentences summarizing intentions, which are automatically produced by large vision-language models. Next, we introduce the Intentional Gesture Motion Tokenizer, which leverages these annotations by injecting high-level communicative functions (e.g., intentions) into tokenized motion representations, enabling intention-aware gesture synthesis. Our framework offers a modular foundation for expressive gesture generation in digital humans and embodied AI.
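
To make the tokenizer idea concrete, below is a minimal PyTorch sketch of how a sentence-level intention embedding could be fused into a VQ-style motion tokenizer via cross-attention before quantization. The module names, layer sizes, codebook size, and fusion scheme here are illustrative assumptions for exposition, not the architecture reported in the paper.

# Hypothetical sketch (not the paper's implementation): a VQ-style motion
# tokenizer in which frame-wise motion features attend to a text-derived
# intention embedding before quantization, so the discrete motion tokens
# carry intention information. All names and sizes are illustrative.
import torch
import torch.nn as nn


class IntentionAwareMotionTokenizer(nn.Module):
    def __init__(self, motion_dim=165, intent_dim=768, hidden=256, codebook_size=512):
        super().__init__()
        # Temporal encoder over per-frame motion features.
        self.encoder = nn.Sequential(
            nn.Conv1d(motion_dim, hidden, kernel_size=3, padding=1),
            nn.GELU(),
            nn.Conv1d(hidden, hidden, kernel_size=3, padding=1),
        )
        # Projects a sentence embedding of the intention annotation.
        self.intent_proj = nn.Linear(intent_dim, hidden)
        # Motion features (queries) attend to the intention embedding (key/value).
        self.cross_attn = nn.MultiheadAttention(hidden, num_heads=4, batch_first=True)
        # Discrete codebook defining the motion-token vocabulary.
        self.codebook = nn.Embedding(codebook_size, hidden)
        # Decoder reconstructs motion from the quantized latents.
        self.decoder = nn.Sequential(
            nn.Conv1d(hidden, hidden, kernel_size=3, padding=1),
            nn.GELU(),
            nn.Conv1d(hidden, motion_dim, kernel_size=3, padding=1),
        )

    def quantize(self, z):
        # Nearest-codebook-entry lookup with a straight-through gradient.
        flat = z.reshape(-1, z.size(-1))                      # (B*T, H)
        dist = torch.cdist(flat, self.codebook.weight)        # (B*T, K)
        idx = dist.argmin(dim=-1).view(z.shape[:-1])          # (B, T)
        z_q = self.codebook(idx)                              # (B, T, H)
        return z + (z_q - z).detach(), idx

    def forward(self, motion, intent_emb):
        # motion: (B, T, motion_dim); intent_emb: (B, intent_dim), e.g. the
        # output of a frozen text encoder run on the intention sentence.
        z = self.encoder(motion.transpose(1, 2)).transpose(1, 2)   # (B, T, H)
        intent = self.intent_proj(intent_emb).unsqueeze(1)         # (B, 1, H)
        fused, _ = self.cross_attn(query=z, key=intent, value=intent)
        z = z + fused                                              # inject intention
        z_q, tokens = self.quantize(z)
        recon = self.decoder(z_q.transpose(1, 2)).transpose(1, 2)  # (B, T, motion_dim)
        return recon, tokens


if __name__ == "__main__":
    model = IntentionAwareMotionTokenizer()
    motion = torch.randn(2, 64, 165)     # two clips, 64 frames each
    intent_emb = torch.randn(2, 768)     # placeholder intention embeddings
    recon, tokens = model(motion, intent_emb)
    print(recon.shape, tokens.shape)     # (2, 64, 165) and (2, 64)

In the full system, such intention-grounded tokens would then serve as targets for a speech-conditioned generator; that stage is not sketched here.
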
@misc{liu2025intentionalgesturedeliverintentions,
  title={Intentional Gesture: Deliver Your Intentions with Gestures for Speech},
  author={Pinxin Liu and Haiyang Liu and Luchuan Song and Chenliang Xu},
  year={2025},
  eprint={2505.15197},
  archivePrefix={arXiv},
  primaryClass={cs.CV},
  url={https://arxiv.org/abs/2505.15197},
}