Pinxin Liu

Pinxin Liu (Andy)

I am a graduate student in the Computer Science Department at the University of Rochester. I received my B.S. in Computer Science with highest honor distinction from the University of Rochester.

My research interests are multimodal large language models, visual agents, video generative models, and human-behavior understanding and generation.

Email / Google Scholar / GitHub / CV

Job Interest

Seeking Research / Applied Scientist / ML Engineer positions in Multimodal Language Models, Human Video Generation (3D / Diffusion), Video Understanding (MLLM), Audio-Visual Learning, and 3D Perception & Understanding. My master's research focused primarily on language models and human-centered perception/generation.

News

[02/2026]	One paper accepted to CVPR 2026.
[09/2025]	One co-author paper accepted Cell Reports Methods.
[09/2025]	One co-author paper accepted NeurIPS 2025.
[06/2025]	One first-author paper accepted ACM Multimedia 2025.
[06/2025]	Two first-author papers accepted ICCV 2025.
[05/2025]	Joined Meta Reality Lab as a Research Scientist Intern.
[11/2024]	One first-author paper accepted 3DV 2025.
[09/2024]	One co-author paper accepted Siggraph Asia 2024.
[07/2024]	One first-author paper accepted ECCV 2024.
[06/2024]	Joined FlawlessAI as a Research Scientist Intern.

Experience

Meta, Reality Lab — Research Scientist Intern
May 2025 – Dec 2025
Report to: A. Richard, D. Markovic

LLM-based Motion Understanding & Generation: Integrated 3D body mesh representations into an LLM tokenization pipeline with a diffusion/flow-matching generation head, enabling unified cross-modal understanding and controllable fine-grained body motion generation; unlocking instruction-following avatar animation at scale for interactive digital human experiences on Meta Quest and AR devices.
Automatic Data Annotation at Scale: Designed an automated annotation pipeline converting raw 3D pose and mesh sequences into structured semantic language descriptions without human labelers; reducing data curation cost by orders of magnitude and enabling training on million-scale motion corpora that was previously infeasible.

Flawless AI — Research Scientist Intern
Jun 2024 – Dec 2024
Report to: P. Garrido, A. Shapiro

Foundation Human Video Generation: Built a large-scale diffusion-based foundation model with multi-stage training on 100M+ video-text pairs using FSDP across multi-node GPU clusters, establishing the core generative backbone for the company's film production pipeline.
Speech-to-Gesture Alignment: Developed cross-modal alignment between pixel-level motion and speech semantic embeddings; shipped an upper-body co-speech gesture animation system in production for film post-production clients.
Pixel-Level Artifact Refinement: Shipped a pixel-space refinement module suppressing boundary artifacts across diverse backbones without retraining, adopted as a plug-and-play quality layer across video generation products.

Research

	Bridging Facial Understanding and Animation via Language Models Luchuan Song, Pinxin Liu, Haiyang Liu, Zhenchao Jin, Yunlong Tang, Zicong Xu, Susan Liang, Jing Bi, Jason J. Corso, Chenliang Xu CVPR, 2026 project page / code / data We introduce Open3DFaceVid and cast facial-parameter modeling as a language problem, enabling bidirectional Motion2Language and Language2Motion for text-conditioned facial animation and understanding.
	Contextual Gesture: Co-Speech Gesture Video Generation through Context-aware Gesture Representation Pinxin Liu, Pengfei Zhang, Hyeongwoo Kim, Pablo Garrido, Ari Shapiro, Kyle Olszewski ACM Multimedia, 2025 project page / paper / code We propose a context-aware gesture representation for co-speech gesture video generation.
	Video Understanding with Large Language Models: A Survey Yunlong Tang, Jing Bi, Siting Xu, Luchuan Song, Susan Liang, Teng Wang, ..., Pinxin Liu, Mingqian Feng, Feng Zheng, Jianguo Zhang, Ping Luo, Jiebo Luo, Chenliang Xu IEEE Transactions on Circuits and Systems for Video Technology* paper / project page A comprehensive survey on video understanding techniques powered by large language models.
	GestureLSM: Latent Shortcut based Co-Speech Gesture Generation with Spatial-Temporal Modeling Pinxin Liu, Luchuan Song, Junhua Huang, Haiyang Liu, Chenliang Xu ICCV, 2025 project page / paper / code We propose a latent shortcut mechanism for co-speech gesture generation with spatial-temporal modeling.
	KinMo: Kinematic-aware Human Motion Understanding and Generation Pinxin Liu, Pengfei Zhang, Hyeongwoo Kim, Pablo Garrido, Bindita Chaudhuri ICCV, 2025 project page / paper / code We present a kinematic-aware approach for human motion understanding and generation.
	MMPerspective: Do MLLMs Understand Perspective? A Comprehensive Benchmark for Perspective Perception, Reasoning, and Robustness Pinxin Liu, Yunlong Tang, Zhangyun Tan, Mingqian Feng, Rui Mao, Chao Huang, Jing Bi, Yunzhong Xiao, Susan Liang, Hang Hua, Ali Vosoughi, Luchuan Song, Zeliang Zhang, Chenliang Xu NeurIPS*, 2025 (Datasets and Benchmarks Track) project page / paper We introduce MMPerspective, the first benchmark to systematically evaluate MLLMs' understanding of perspective through 10 tasks across perception, reasoning, and robustness (2,711 images, 5,083 QA pairs). Accepted to NeurIPS 2025 DB Track.
	GaussianStyle: Gaussian Head Avatar via StyleGAN Pinxin Liu, Luchuan Song, Daoan Zhang, Hang Hua, Yunlong Tang, Huaijin Tu, Jiebo Luo, Chenliang Xu 3DV, 2025 paper We propose GaussianStyle, integrating 3D Gaussian Splatting with StyleGAN for head avatars. The framework preserves expression and pose with Gaussians while projecting implicit volume into StyleGAN for high-frequency detail, achieving state-of-the-art in reenactment, novel view synthesis, and animation.
	Generative AI for Cel-Animation: A Survey Yunlong Tang, Junjia Guo, Pinxin Liu, Zhiyuan Wang, Hang Hua, Jia-Xing Zhong, Yunzhong Xiao, Chao Huang, Luchuan Song, Susan Liang and 7 more authors ICCVW, 2025 paper / project page A comprehensive survey on generative AI for cel-animation.
	TextToon: Real-Time Text Toonify Head Avatar from Single Video Luchuan Song, Lele Chen, Celong Liu, Pinxin Liu, Chenliang Xu Siggraph Asia, 2024 project page / paper / code We present a method to generate a drivable toonified avatar. Given a monocular video and a written instruction about the avatar style, it can generate a toonified avatar that can be animated in real time.
	Tri2-plane: Thinking Head Avatar via Feature Pyramid Pinxin Liu, Luchuan Song, Lele Chen, Guojun Yin, Chenliang Xu ECCV, 2024 project page / paper / code We attach the multi-combined tri-plane structure for monocular photo-realistic volumetric head avatar reconstructions.
	Adaptive Super Resolution for One-Shot Talking Head Generation Luchuan Song, Pinxin Liu, Guojun Yin, Chenliang Xu ICASSP, 2024 paper / code / video We apply the mix-resolution images in one-shot talking head training. The resolution could achieve 512px from 256px in previous.

Education

University of Rochester	M.S., Computer Science Jan 2025 – May 2026 (expected)
University of Rochester	B.S., Computer Science; Highest Honor Distinction in Research; GPA: 3.83/4.0 Jul 2020 – May 2024

Teaching

CSC 245/445: Deep Learning — Fall 2025, Fall 2024, Fall 2023

CSC 249/449: Machine Vision — Spring 2024

CSC 446: Machine Learning — Spring 2026, Fall 2023

CSC 247/447: Natural Language Processing — Spring 2023

CSC 248/448: Statistical Speech and Language Processing — Fall 2022