SpatialAlign: Aligning Dynamic Spatial Relationships in Video Generation

Fengming Liu, Tat-Jen Cham, Chuanxia Zheng

Nanyang Technological University

SPATIALALIGN: dynamic spatial relationships are better aligned.

Abstract

Most text-to-video (T2V) generators prioritize aesthetic quality but often ignore the spatial constraints in the generated videos. In this work, we present SPATIALALIGN, a self-improvement framework that enhances T2V models’ ability to depict Dynamic Spatial Relationships (DSR) specified in text prompts. We propose a zeroth-order regularized Direct Preference Optimization (DPO) to fine-tune T2V models towards better alignment with DSR. Specifically, we design DSR-SCORE, a geometry-based metric that quantitatively measures the alignment between generated videos and the DSRs specified in prompts, a step forward from prior works that rely on VLMs for evaluation. We also construct a dataset of text-video pairs with diverse DSRs to facilitate the study. Extensive experiments demonstrate that our fine-tuned model significantly outperforms the baseline in depicting spatial relationships.

Method

Overview of SpatialAlign. (A) Given a text prompt, the pre-trained T2V model generates several video samples. For each sample, we first use GroundedSAM to obtain the bounding boxes (bboxes) of the animal and the object in each frame. (B) Then, for each frame, we compute the Static Spatial Relationship (SSR) Score based on the bboxes. From the SSR Score Sequence (all frames), we derive four metric components that are aggregated into the DSR-SCORE, which quantifies how well the video aligns with the dynamic spatial relationship (DSR). (C) During DPO training, we identify winner/loser pairs based on the DSR-SCORE using a threshold. We then train a LoRA to enhance the model's ability to accurately represent DSR in generated videos, using our proposed zeroth-order regularized DPO.
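
As a rough illustration of step (C), the sketch below forms winner/loser pairs from the per-sample scores of one prompt. It is not the authors' implementation: the overview only says that pairs are identified "using a threshold", so the code assumes the threshold applies to the score gap between two samples. The function name `select_preference_pairs`, the parameter `min_gap`, and the example scores are all hypothetical.

```python
from itertools import combinations
from typing import List, Tuple


def select_preference_pairs(
    scores: List[float], min_gap: float
) -> List[Tuple[int, int]]:
    """Return (winner_index, loser_index) pairs for one prompt.

    `scores` holds one alignment score per generated sample (higher is
    assumed to be better). A pair is kept only if the two scores differ
    by at least `min_gap`. ASSUMPTION: the threshold is on the score gap;
    the source does not specify how the threshold is applied.
    """
    pairs = []
    for i, j in combinations(range(len(scores)), 2):
        gap = scores[i] - scores[j]
        if gap >= min_gap:
            pairs.append((i, j))  # sample i is the winner
        elif -gap >= min_gap:
            pairs.append((j, i))  # sample j is the winner
    return pairs


# Hypothetical scores for four samples generated from the same prompt.
example_scores = [0.9, 0.2, 0.5, 0.85]
example_pairs = select_preference_pairs(example_scores, min_gap=0.25)
```

Samples 0 and 3 score too closely to yield a pair here. Dropping near-ties like this is one way to keep ambiguous comparisons out of the preference data, at the cost of fewer training pairs per prompt.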

Results

Remark: INVALID means that the video does not satisfy the one-animal-one-object criterion.

BibTeX

@misc{liu2026spatialalignaligningdynamicspatial,
      title={SPATIALALIGN: Aligning Dynamic Spatial Relationships in Video Generation}, 
      author={Fengming Liu and Tat-Jen Cham and Chuanxia Zheng},
      year={2026},
      eprint={2602.22745},
      archivePrefix={arXiv},
      primaryClass={cs.CV},
      url={https://arxiv.org/abs/2602.22745}, 
}