GestureLSM: Latent Shortcut based Co-Speech Gesture Generation with Spatial-Temporal Modeling

ICCV 2025

¹University of Rochester ²University of Tokyo

Abstract

Generating full-body human gestures from speech signals remains challenging in both quality and speed. Existing approaches model different body regions, such as the body, legs, and hands, separately, which fails to capture the spatial interactions between them and results in unnatural, disjointed movements. Additionally, their autoregressive or diffusion-based pipelines generate slowly because they require dozens of inference steps. To address these two challenges, we propose GestureLSM, a flow-matching-based approach for co-speech gesture generation with spatial-temporal modeling. Our method i) explicitly models the interaction of tokenized body regions through spatial and temporal attention to generate coherent full-body gestures, and ii) introduces flow matching to enable more efficient sampling by explicitly modeling the latent velocity space. To overcome the suboptimal performance of the flow matching baseline, we propose latent shortcut learning and beta-distribution time stamp sampling during training to enhance gesture synthesis quality and accelerate inference. Combining spatial-temporal modeling with the improved flow-matching-based framework, GestureLSM achieves state-of-the-art performance on BEAT2 while significantly reducing inference time compared to existing methods, highlighting its potential for enhancing digital humans and embodied agents in real-world applications.
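To make the flow matching idea in the abstract concrete, here is a minimal NumPy sketch, not the released implementation. It assumes a linear (rectified-flow style) path between Gaussian noise and a gesture latent, draws training time stamps from a Beta distribution, and samples with a few Euler steps. The function names, the Beta parameters (2.0, 1.2), and the step count are illustrative assumptions. Speech/text conditioning and the latent shortcut objective are omitted.

```python
import numpy as np

rng = np.random.default_rng(0)

def sample_timesteps(batch, alpha=2.0, beta=1.2):
    """Draw training time stamps t in (0, 1) from a Beta distribution.
    alpha and beta are illustrative values, not taken from the paper."""
    return rng.beta(alpha, beta, size=batch)

def flow_matching_pair(x1, t):
    """Build the interpolated latent x_t and its target velocity.
    Assumes a linear path from noise x0 (t=0) to data x1 (t=1)."""
    x0 = rng.standard_normal(x1.shape)
    t = t.reshape(-1, *([1] * (x1.ndim - 1)))  # broadcast over latent dims
    x_t = (1.0 - t) * x0 + t * x1
    v_target = x1 - x0                         # constant velocity along the path
    return x_t, v_target

def flow_matching_loss(v_pred, v_target):
    """Mean squared error between predicted and target velocities."""
    return float(np.mean((v_pred - v_target) ** 2))

def euler_sample(velocity_fn, shape, num_steps=4):
    """Integrate dx/dt = v(x, t) from t=0 to t=1 with fixed-size Euler steps."""
    x = rng.standard_normal(shape)
    dt = 1.0 / num_steps
    for i in range(num_steps):
        x = x + dt * velocity_fn(x, i * dt)
    return x
```

In practice `velocity_fn` would be a trained network conditioned on speech features. Fewer Euler steps mean faster inference, which is the trade-off that shortcut-style training aims to make less costly in quality.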

Method

(1) GestureLSM generates full-body gestures from speech and text scripts. From left to right, the concatenated audio and text features are fused into gesture features via cross-attention. The condition-fused gesture features are then decoded into gesture latents by our proposed spatial-temporal decoder. The optimization objective is based on flow matching. (2) The gesture latents come from pretrained RVQ (Residual Vector Quantization) models. (3) Details of the spatial-temporal attention, which integrates position encoding to learn the interaction of body regions.
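The factorized attention described in (3) can be sketched as follows. This is a simplified illustration under stated assumptions, not the paper's decoder: gesture latents are arranged as a (frames, body regions, channels) array, attention is single-head with identity Q/K/V projections, and position encoding is left out. The function names are hypothetical.

```python
import numpy as np

def softmax(x, axis=-1):
    """Numerically stable softmax along the given axis."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(x):
    """Scaled dot-product self-attention over the second-to-last axis.
    x has shape (..., N, D); learned projections are omitted for brevity."""
    d = x.shape[-1]
    scores = x @ np.swapaxes(x, -1, -2) / np.sqrt(d)  # (..., N, N)
    return softmax(scores, axis=-1) @ x

def spatial_temporal_block(latents):
    """latents: (T, R, D) = frames x body regions x channels."""
    # Spatial attention: within each frame, body regions attend to one another.
    h = latents + self_attention(latents)
    # Temporal attention: within each region, frames attend to one another.
    h_t = np.swapaxes(h, 0, 1)          # (R, T, D)
    h_t = h_t + self_attention(h_t)
    return np.swapaxes(h_t, 0, 1)       # back to (T, R, D)

# Usage: 8 frames, 3 hypothetical body regions, 16 channels per token.
x = np.random.default_rng(0).standard_normal((8, 3, 16))
y = spatial_temporal_block(x)
```

Factorizing attention this way keeps the cost at roughly O(T·R² + R·T²) score entries instead of O((T·R)²) for joint attention over all tokens, while still letting information flow between regions and across time.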

Additional Results

BibTeX

@inproceedings{liu2025gesturelsmlatentshortcutbased,
  title={GestureLSM: Latent Shortcut based Co-Speech Gesture Generation with Spatial-Temporal Modeling},
  author={Pinxin Liu and Luchuan Song and Junhua Huang and Chenliang Xu},
  booktitle={IEEE/CVF International Conference on Computer Vision},
  year={2025},
}

Comparison on BEAT2

Additional Results

GestureLSM generates diverse gestures across different body regions and motions, showcasing its potential for applications in digital humans and embodied agents.
