KinMo: Kinematic-aware Human Motion Understanding and Generation

arXiv 2024

We present KinMo, a method designed for fine-grained motion understanding that enables (a) efficient text-based motion retrieval, (b) text-aligned motion generation, (c) motion editing, and (d) trajectory control of local kinematic body parts.

Abstract

Controlling human motion based on text descriptions presents an important challenge in computer vision. Traditional approaches often rely on holistic action descriptions for motion synthesis, which struggle to capture subtle movements of local body parts. This limitation restricts the ability to isolate and manipulate specific movements. To address this, we propose a novel motion representation that decomposes motion into distinct body joint group movements and interactions from a kinematic perspective. We design an automatic dataset collection pipeline that enhances the existing text-motion benchmark by incorporating fine-grained local joint-group motion and interaction descriptions. To bridge the gap between text and motion domains, we introduce a hierarchical motion semantics approach that progressively fuses joint-level interaction information into the global action-level semantics for modality alignment. With this hierarchy, we introduce a coarse-to-fine motion synthesis procedure for various generation and editing downstream applications. Our quantitative and qualitative experiments demonstrate that the proposed formulation enhances text-motion retrieval by improving joint-spatial understanding, and enables more precise joint-motion generation and control.
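To make the joint-group decomposition concrete, the sketch below shows one way a motion sequence could be split into kinematic groups. The grouping and joint names are illustrative assumptions based on the common 22-joint skeleton used by HumanML3D-style benchmarks, not necessarily the exact definition used in the paper.

import numpy as np

# Hypothetical grouping of a 22-joint skeleton; the paper's exact groups may differ.
JOINT_GROUPS = {
    "torso":     ["pelvis", "spine1", "spine2", "spine3", "neck", "head"],
    "left_arm":  ["left_collar", "left_shoulder", "left_elbow", "left_wrist"],
    "right_arm": ["right_collar", "right_shoulder", "right_elbow", "right_wrist"],
    "left_leg":  ["left_hip", "left_knee", "left_ankle", "left_foot"],
    "right_leg": ["right_hip", "right_knee", "right_ankle", "right_foot"],
}

def split_motion_by_group(motion, joint_names, groups=JOINT_GROUPS):
    """Split a (T, J, 3) joint-position array into per-group sub-motions."""
    index = {name: i for i, name in enumerate(joint_names)}
    return {g: motion[:, [index[j] for j in names]] for g, names in groups.items()}

# Example: a 120-frame, 22-joint motion split into five body-part streams.
joint_names = [j for names in JOINT_GROUPS.values() for j in names]
parts = split_motion_by_group(np.zeros((120, 22, 3)), joint_names)
# parts["left_arm"].shape == (120, 4, 3)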

Method

Overview of KinMo. The pipeline consists of three components: (1) semi-supervised dataset annotation with LLMs; (2) hierarchical text-motion alignment for fine-grained motion understanding; and (3) coarse-to-fine motion generation for downstream applications.
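The minimal sketch below illustrates the hierarchical alignment idea: joint-group features are fused through an interaction stage into a global action embedding, which is aligned with a text embedding via a standard CLIP-style contrastive loss. All module names, the group count, and the dimensions are assumptions for illustration, not the released implementation.

import torch
import torch.nn as nn
import torch.nn.functional as F

class HierarchicalMotionEncoder(nn.Module):
    """Sketch: progressively fuse joint-group -> interaction -> global semantics."""
    def __init__(self, num_groups=6, joint_dim=64, embed_dim=256):
        super().__init__()
        # One encoder per kinematic joint group (the grouping is an assumption).
        self.group_encoders = nn.ModuleList(
            nn.Linear(joint_dim, embed_dim) for _ in range(num_groups)
        )
        # Self-attention over group tokens models inter-group interactions.
        self.interaction = nn.MultiheadAttention(embed_dim, num_heads=4, batch_first=True)
        self.to_global = nn.Linear(embed_dim, embed_dim)

    def forward(self, group_feats):  # group_feats: (B, num_groups, joint_dim)
        tokens = torch.stack(
            [enc(group_feats[:, i]) for i, enc in enumerate(self.group_encoders)], dim=1
        )                                                     # joint-group level
        inter, _ = self.interaction(tokens, tokens, tokens)   # interaction level
        return self.to_global(inter.mean(dim=1))              # global action level

def clip_style_alignment_loss(motion_emb, text_emb, temperature=0.07):
    """Symmetric InfoNCE loss aligning motion and text embeddings (CLIP-style)."""
    m = F.normalize(motion_emb, dim=-1)
    t = F.normalize(text_emb, dim=-1)
    logits = m @ t.T / temperature
    labels = torch.arange(len(m), device=m.device)
    return 0.5 * (F.cross_entropy(logits, labels) + F.cross_entropy(logits.T, labels))

# Example shapes: a batch of 8 motions, each summarized into 6 joint-group features.
enc = HierarchicalMotionEncoder()
motion_emb = enc(torch.randn(8, 6, 64))
loss = clip_style_alignment_loss(motion_emb, torch.randn(8, 256))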

Demo Video

Motion Generation Comparisons

A person holds their arms up as if holding a dance partner's hand and back, then dances in a square pattern.

A person who is standing with his hands by his sides, crosses his arms, then slightly adjusts his arms in the crossed position, then drops his arms to his original position.

A man leans forward to pick up an object slightly to his right, and places it down slightly to his left.

A man is dancing, jumping from side to side waving his arms above his head.

A man stretches his arms out to the side shoulder length, then brings them in front of him with elbow slightly bent.

A person looks for something on the ground with right hand.

We compare our method with MMM, MoMask, and STMC. Our method achieves natural movements that closely align with the text, even for long and complex descriptions. The static texture on the meshes is chosen randomly for better visualization and is not part of our method's output.

Text-to-Motion Generation

A person quickly extends one hand at a time, usually alternating between right and left.

A man is walking up the stairs while holding the railing with their left hand.

A person raises his right arm up in front of his face, and holds it there for a second before putting his arm back down.

We show more visual results of our text-driven motion generation. Our method performs accurate kinematic body-part control, correctly handling references such as left hand, right arm, up, and down.

Text-Motion Editing

We can re-animate the same character by editing the text descriptions to generate different gestures, as sketched below.
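For illustration, editing can be thought of as changing only one joint-group description while keeping the rest fixed; the dictionary structure below is a hypothetical interface we use for exposition, not KinMo's actual API.

# Hypothetical per-group prompt editing; this interface is our illustration.
base = {
    "global":    "a man stretches his arms out to the side at shoulder height",
    "left_arm":  "left arm extended straight out to the side",
    "right_arm": "right arm extended straight out to the side",
}
# Edit only the right-arm description; the untouched groups keep the overall
# action the same, so the regenerated character performs a different gesture.
edited = {**base, "right_arm": "right arm bent at the elbow, hand resting on the hip"}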

Motion Trajectory Control

We can control the motion of any joint by providing a target trajectory during motion generation, as illustrated in the sketch below.
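The following sketch gives a generic, simplified view of trajectory control: the generated positions of one joint are blended toward a user-provided trajectory. The blending scheme and the joint index are assumptions for illustration, not KinMo's actual control mechanism.

import numpy as np

def apply_trajectory_constraint(motion, joint_idx, trajectory, weight=1.0):
    """
    Blend the generated positions of one joint toward a user-provided trajectory.

    motion:     (T, J, 3) generated joint positions over T frames
    trajectory: (T, 3) target positions for the controlled joint
    weight:     1.0 pins the joint to the trajectory; smaller values soft-constrain it
    """
    constrained = motion.copy()
    constrained[:, joint_idx] = (1 - weight) * motion[:, joint_idx] + weight * trajectory
    return constrained

# Example: pin one joint (hypothetical index 21, e.g. a wrist) to a straight line.
T, J = 120, 22
motion = np.random.randn(T, J, 3)
trajectory = np.linspace([0.0, 1.0, 0.0], [0.5, 1.0, 0.5], T)
controlled = apply_trajectory_constraint(motion, joint_idx=21, trajectory=trajectory)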

Ablation Study

A man bends his knees in a squatting motion while holding a bar over his shoulders with both hands.

We generate motion conditioned on (a) global semantics only, (b) global plus joint semantics, and (c) global, joint, and interaction semantics, and compare with (d) the ground truth, to show the importance of our hierarchical text-motion alignment.

BibTeX

@misc{kinmo2024,
  title = {KinMo: Kinematic-aware Human Motion Understanding and Generation},
  year  = {2024},
  note  = {arXiv preprint}
}