GestureLSM: Latent Shortcut based Co-Speech Gesture Generation with Spatial-Temporal Modeling

GestureLSM. We present the Gesture Latent Shortcut Model, a method that generates high-quality full-body human gestures from speech in real time. It explicitly models interactions between body regions, e.g., between the body and hands, to achieve coherent and smooth overall motion, and supports real-time generation through shortcut sampling.

Abstract

Generating full-body human gestures from speech signals remains challenging in terms of both quality and speed. Existing approaches model different body regions, such as the body, legs, and hands, separately, which fails to capture the spatial interactions between them and results in unnatural, disjointed movements. In addition, their autoregressive or diffusion-based pipelines are slow at generation time, requiring dozens of inference steps. To address these two challenges, we propose GestureLSM, a flow-matching-based approach for co-speech gesture generation with spatial-temporal modeling. Our method (i) explicitly models the interactions among tokenized body regions through spatial and temporal attention to generate coherent full-body gestures, and (ii) introduces flow matching to enable more efficient sampling by explicitly modeling the latent velocity space. To overcome the suboptimal performance of the flow-matching baseline, we propose latent shortcut learning and beta-distribution timestep sampling during training to enhance gesture synthesis quality and accelerate inference. Combining spatial-temporal modeling with this improved flow-matching framework, GestureLSM achieves state-of-the-art performance on BEAT2 while significantly reducing inference time compared to existing methods, highlighting its potential for powering digital humans and embodied agents in real-world applications.
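To make the shortcut-sampling idea concrete, below is a minimal sketch, not the authors' implementation: `velocity_net`, its signature, and the default step count are illustrative assumptions. The key point is that the velocity network is also conditioned on a step size d, so a few large Euler steps can replace the dozens of steps a standard diffusion or flow sampler needs.

```python
import torch

@torch.no_grad()
def shortcut_sample(velocity_net, cond, latent_shape, num_steps=2, device="cpu"):
    # Few-step shortcut sampling (sketch): conditioning on the step size d
    # lets large Euler steps stay close to the learned flow trajectory.
    x = torch.randn(latent_shape, device=device)   # start from Gaussian noise
    d = 1.0 / num_steps                            # shortcut step size
    for i in range(num_steps):
        t = torch.full((latent_shape[0],), i * d, device=device)
        v = velocity_net(x, t, torch.full_like(t, d), cond)
        x = x + d * v                              # one Euler step along the flow
    return x                                       # gesture latents for the decoder
```

With num_steps set to 1 or 2, inference reduces to one or two network evaluations, which is where the real-time speedup comes from.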

Method

(1) GestureLSM generates full-body gestures from speech and text scripts. From left to right: the concatenated audio and text features are fused into gesture features via cross-attention; the condition-fused gesture features are then used to decode gesture latents with our proposed spatial-temporal decoder, optimized with a flow-matching objective. (2) The gesture latents come from pretrained RVQ (Residual Vector Quantization) models. (3) Details of the spatial-temporal attention, which integrates position encoding to learn the interactions among body regions.
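As a rough sketch of the training objective described above, here is a rectified-flow-style flow-matching loss with Beta-distributed timesteps. The Beta(2, 1) parameters and the `velocity_net` signature are assumptions for illustration, not the paper's exact choices:

```python
import torch

def flow_matching_loss(velocity_net, x1, cond):
    # x1: ground-truth gesture latents from the pretrained RVQ encoder.
    # Sample timesteps from a Beta distribution (assumed parameters) rather
    # than Uniform(0, 1), biasing training toward more informative t values.
    t = torch.distributions.Beta(2.0, 1.0).sample((x1.shape[0],)).to(x1.device)
    t_ = t.view(-1, *([1] * (x1.dim() - 1)))       # broadcast over latent dims
    x0 = torch.randn_like(x1)                      # noise endpoint of the path
    xt = (1 - t_) * x0 + t_ * x1                   # point on the straight path
    v_target = x1 - x0                             # constant target velocity
    # Step size d = 0 recovers the plain flow-matching target; shortcut
    # training additionally supervises d > 0 with self-consistency targets.
    v_pred = velocity_net(xt, t, torch.zeros_like(t), cond)
    return torch.mean((v_pred - v_target) ** 2)
```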
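And a minimal sketch of spatial-temporal attention over tokenized body regions; the layer sizes, number of regions, and learned position encodings are illustrative assumptions:

```python
import torch
import torch.nn as nn

class SpatialTemporalAttention(nn.Module):
    """Sketch: alternate attention over body regions (spatial) and over
    time (temporal) for latents of shape (B, T, R, D), where R is the
    number of tokenized body regions (e.g., body, hands, face)."""
    def __init__(self, dim, num_heads=8, num_regions=4, max_len=512):
        super().__init__()
        self.spatial_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.temporal_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        # Learned position encodings for regions and time steps (assumed form).
        self.region_pos = nn.Parameter(torch.zeros(1, num_regions, dim))
        self.time_pos = nn.Parameter(torch.zeros(1, max_len, dim))

    def forward(self, x):                          # x: (B, T, R, D)
        B, T, R, D = x.shape
        # Spatial attention: regions attend to each other at every frame.
        s = (x + self.region_pos).reshape(B * T, R, D)
        s, _ = self.spatial_attn(s, s, s)
        x = x + s.reshape(B, T, R, D)
        # Temporal attention: each region attends across time.
        t = x.transpose(1, 2).reshape(B * R, T, D) + self.time_pos[:, :T]
        t, _ = self.temporal_attn(t, t, t)
        x = x + t.reshape(B, R, T, D).transpose(1, 2)
        return x
```

Alternating the two attentions lets, for example, the hand tokens condition on the body tokens at each frame (spatial) while each region's trajectory stays temporally smooth (temporal).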

Comparison on BEAT2

GestureLSM produces more natural gesture motions conditioned on speech audio, benefiting from spatial-temporal modeling built on spatial and temporal attention. The generated gestures are coherent and smooth, whereas prior works often exhibit temporal jittering, abnormal body-part movements, and disjointed gestures.

Additional Results

GestureLSM generates diverse gestures spanning different body regions and motion styles, showcasing its potential for a range of applications in digital humans and embodied agents.

BibTeX

@misc{liu2025gesturelsmlatentshortcutbased,
      title={GestureLSM: Latent Shortcut based Co-Speech Gesture Generation with Spatial-Temporal Modeling}, 
      author={Pinxin Liu and Luchuan Song and Junhua Huang and Chenliang Xu},
      year={2025},
      eprint={2501.18898},
      archivePrefix={arXiv},
      primaryClass={cs.CV},
      url={https://arxiv.org/abs/2501.18898}, 
}