KinMo: Kinematic-aware Human Motion Understanding and Generation

arXiv 2024

We present KinMo, a method designed for fine-grained motion understanding that enables (a) efficient text-based motion retrieval, (b) text-aligned motion generation, (c) motion editing, and (d) trajectory control of local kinematic body parts.

Abstract

Controlling human motion based on text descriptions presents an important challenge in computer vision. Traditional approaches often rely on holistic action descriptions for motion synthesis, which struggle to capture subtle movements of local body parts. This limitation restricts the ability to isolate and manipulate specific movements. To address this, we propose a novel motion representation that decomposes motion into distinct body joint group movements and interactions from a kinematic perspective. We design an automatic dataset collection pipeline that enhances the existing text-motion benchmark by incorporating fine-grained local joint-group motion and interaction descriptions. To bridge the gap between text and motion domains, we introduce a hierarchical motion semantics approach that progressively fuses joint-level interaction information into the global action-level semantics for modality alignment. With this hierarchy, we introduce a coarse-to-fine motion synthesis procedure for various generation and editing downstream applications. Our quantitative and qualitative experiments demonstrate that the proposed formulation enhances text-motion retrieval by improving joint-spatial understanding, and enables more precise joint-motion generation and control.
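To make the joint-group decomposition concrete, the sketch below shows one way a motion sequence could be split into kinematic groups. The grouping and joint names are illustrative assumptions based on the common 22-joint skeleton used by HumanML3D-style benchmarks, not necessarily the exact definition used in the paper.

import numpy as np

# Hypothetical grouping of a 22-joint skeleton; the paper's exact groups may differ.
JOINT_GROUPS = {
    "torso":     ["pelvis", "spine1", "spine2", "spine3", "neck", "head"],
    "left_arm":  ["left_collar", "left_shoulder", "left_elbow", "left_wrist"],
    "right_arm": ["right_collar", "right_shoulder", "right_elbow", "right_wrist"],
    "left_leg":  ["left_hip", "left_knee", "left_ankle", "left_foot"],
    "right_leg": ["right_hip", "right_knee", "right_ankle", "right_foot"],
}

def split_motion_by_group(motion, joint_names, groups=JOINT_GROUPS):
    """Split a (T, J, 3) joint-position array into per-group sub-motions."""
    index = {name: i for i, name in enumerate(joint_names)}
    return {g: motion[:, [index[j] for j in names]] for g, names in groups.items()}

# Example: a 120-frame, 22-joint motion split into five body-part streams.
joint_names = [j for names in JOINT_GROUPS.values() for j in names]
parts = split_motion_by_group(np.zeros((120, 22, 3)), joint_names)
# parts["left_arm"].shape == (120, 4, 3)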

Method

Overview of KinMo. The pipeline consists of three components: (1) semi-supervised dataset annotation with LLMs; (2) hierarchical text-motion alignment for fine-grained motion understanding; and (3) coarse-to-fine motion generation for downstream applications.
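The minimal sketch below illustrates the hierarchical alignment idea: joint-group features are fused through an interaction stage into a global action embedding, which is aligned with a text embedding via a standard CLIP-style contrastive loss. All module names, the group count, and the dimensions are assumptions for illustration, not the released implementation.

import torch
import torch.nn as nn
import torch.nn.functional as F

class HierarchicalMotionEncoder(nn.Module):
    """Sketch: progressively fuse joint-group -> interaction -> global semantics."""
    def __init__(self, num_groups=6, joint_dim=64, embed_dim=256):
        super().__init__()
        # One encoder per kinematic joint group (the grouping is an assumption).
        self.group_encoders = nn.ModuleList(
            nn.Linear(joint_dim, embed_dim) for _ in range(num_groups)
        )
        # Self-attention over group tokens models inter-group interactions.
        self.interaction = nn.MultiheadAttention(embed_dim, num_heads=4, batch_first=True)
        self.to_global = nn.Linear(embed_dim, embed_dim)

    def forward(self, group_feats):  # group_feats: (B, num_groups, joint_dim)
        tokens = torch.stack(
            [enc(group_feats[:, i]) for i, enc in enumerate(self.group_encoders)], dim=1
        )                                                     # joint-group level
        inter, _ = self.interaction(tokens, tokens, tokens)   # interaction level
        return self.to_global(inter.mean(dim=1))              # global action level

def clip_style_alignment_loss(motion_emb, text_emb, temperature=0.07):
    """Symmetric InfoNCE loss aligning motion and text embeddings (CLIP-style)."""
    m = F.normalize(motion_emb, dim=-1)
    t = F.normalize(text_emb, dim=-1)
    logits = m @ t.T / temperature
    labels = torch.arange(len(m), device=m.device)
    return 0.5 * (F.cross_entropy(logits, labels) + F.cross_entropy(logits.T, labels))

# Example shapes: a batch of 8 motions, each summarized into 6 joint-group features.
enc = HierarchicalMotionEncoder()
motion_emb = enc(torch.randn(8, 6, 64))
loss = clip_style_alignment_loss(motion_emb, torch.randn(8, 256))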

Demo Video

Motion Generation Comparisons

A person holds their arms up as if holding a dance partner's hand and back, then dances in a square pattern.

A person who is standing with his hands by his sides, crosses his arms, then slightly adjusts his arms in the crossed position, then drops his arms to his original position.

A man leans forward to pick up an object slightly to his right, and places it down slightly to his left.

A man is dancing, jumping from side to side waving his arms above his head.

A man stretches his arms out to the side shoulder length, then brings them in front of him with elbow slightly bent.

A person looks for something on the ground with right hand.

We compare our method with MMM, MoMask, and STMC. Our method achieves natural movements that closely align with the text, even for long and complex descriptions. The static texture on the meshes is chosen randomly for better visualization and is not part of our method's output.

Text-to-Motion Generation

A person quickly extends one hand at a time, usually alternating between right and left.

A man is walking up the stairs while holding the railing with their left hand.

A person raises his right arm up in front of his face, and holds it there for a second before putting his arm back down.

We show more visual results of our text-driven motion generation. Our method performs accurate kinematic body-part control, correctly handling references such as left hand, right arm, up, and down.

Text-Motion Editing

We can re-animate the same character by editing the text descriptions to generate different gestures, as sketched below.
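For illustration, editing can be thought of as changing only one joint-group description while keeping the rest fixed; the dictionary structure below is a hypothetical interface we use for exposition, not KinMo's actual API.

# Hypothetical per-group prompt editing; this interface is our illustration.
base = {
    "global":    "a man stretches his arms out to the side at shoulder height",
    "left_arm":  "left arm extended straight out to the side",
    "right_arm": "right arm extended straight out to the side",
}
# Edit only the right-arm description; the untouched groups keep the overall
# action the same, so the regenerated character performs a different gesture.
edited = {**base, "right_arm": "right arm bent at the elbow, hand resting on the hip"}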

Motion Trajectory Control

We can control the motion of any joint by providing a target trajectory during motion generation, as illustrated in the sketch below.
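The following sketch gives a generic, simplified view of trajectory control: the generated positions of one joint are blended toward a user-provided trajectory. The blending scheme and the joint index are assumptions for illustration, not KinMo's actual control mechanism.

import numpy as np

def apply_trajectory_constraint(motion, joint_idx, trajectory, weight=1.0):
    """
    Blend the generated positions of one joint toward a user-provided trajectory.

    motion:     (T, J, 3) generated joint positions over T frames
    trajectory: (T, 3) target positions for the controlled joint
    weight:     1.0 pins the joint to the trajectory; smaller values soft-constrain it
    """
    constrained = motion.copy()
    constrained[:, joint_idx] = (1 - weight) * motion[:, joint_idx] + weight * trajectory
    return constrained

# Example: pin one joint (hypothetical index 21, e.g. a wrist) to a straight line.
T, J = 120, 22
motion = np.random.randn(T, J, 3)
trajectory = np.linspace([0.0, 1.0, 0.0], [0.5, 1.0, 0.5], T)
controlled = apply_trajectory_constraint(motion, joint_idx=21, trajectory=trajectory)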

Ablation Study

A man bends his knees in a squatting motion while holding a bar over his shoulders with both hands.

We generate motion conditioned on (a) global semantics only, (b) global plus joint semantics, and (c) global, joint, and interaction semantics, and compare with (d) the ground truth, to show the importance of our hierarchical text-motion alignment.

BibTeX

@misc{kinmo2024,
  title = {KinMo: Kinematic-aware Human Motion Understanding and Generation},
  year  = {2024},
  note  = {arXiv preprint}
}