Point Prompting: Counterfactual Tracking with Video Diffusion Models

University of Michigan, Cornell University

TL;DR: We propose a method for zero-shot point tracking that simply prompts video diffusion models to visually mark points as they move over time.



These videos are directly generated by the video diffusion model, with red markers propagated across time.

Abstract

Trackers and video generators solve closely related problems: the former analyze motion, while the latter synthesize it. We show that this connection enables pretrained video diffusion models to perform zero-shot point tracking by simply prompting them to visually mark points as they move over time. We place a distinctively colored marker at the query point, then regenerate the rest of the video from an intermediate noise level. This propagates the marker across frames, tracing the point's trajectory. To ensure that the marker remains visible in this counterfactual generation, despite such markers being unlikely in natural videos, we use the unedited initial frame as a negative prompt. Through experiments with multiple image-conditioned video diffusion models, we find that these "emergent" tracks outperform those of prior zero-shot methods and persist through occlusions, often obtaining performance that is competitive with specialized self-supervised models.

Point Propagation


We mark a query point in the first frame with a colored dot. Then, using SDEdit, we regenerate the rest of the video from an intermediate noise level while enhancing the counterfactual signal (described below). The regenerated video propagates the dot across subsequent frames, tracing the underlying point's trajectory.
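The sketch below illustrates this propagation step. It is a minimal illustration, not the actual implementation: the helpers draw_dot, encode, and decode and the scheduler/denoiser interface are hypothetical placeholders, and the noise level and conditioning details follow the paper rather than this code.

import torch

# SDEdit-style point propagation (sketch): draw_dot, encode, and decode are
# placeholder helpers; the scheduler/denoiser interface is generic, not a real API.
def propagate_marker(video, query_xy, model, scheduler, start_step):
    # Place a distinctively colored dot at the query point in frame 0.
    edited_first_frame = draw_dot(video[0], query_xy, color=(255, 0, 0))

    # Noise the original video to an intermediate level (SDEdit), keeping
    # coarse structure while letting details, including the dot, be resampled.
    latents = encode(video)
    t0 = scheduler.timesteps[start_step]
    noisy = scheduler.add_noise(latents, torch.randn_like(latents), t0)

    # Denoise the remaining steps, conditioning on the edited first frame.
    x = noisy
    for t in scheduler.timesteps[start_step:]:
        eps = model(x, t, cond_image=edited_first_frame)
        x = scheduler.step(eps, t, x)

    # In the decoded video, the dot should follow the query point over time.
    return decode(x)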

Enhancing the Counterfactual Signal


We use negative prompting to ensure that the generated video contains the marker. In each denoising step, we condition the denoising on two images: (1) Edited First Frame: the first frame of the video with the marker added, and (2) Unedited First Frame: the original first frame of the video. We then subtract the weighted noise prediction for the latter from that of the former, which enhances the counterfactual signal.
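Concretely, if eps_edited and eps_unedited are the noise predictions conditioned on the edited and unedited first frames at a given denoising step, the combination takes the following form. This is a minimal sketch; the guidance weight and exact parameterization used in the paper are not reproduced here.

def counterfactual_guidance(eps_edited, eps_unedited, weight=2.0):
    # Subtract a weighted copy of the "no marker" prediction from the "marker"
    # prediction (equivalently, (1 + w) * eps_edited - w * eps_unedited),
    # pushing each denoising step toward keeping the marker visible.
    # The default weight here is illustrative, not the paper's value.
    return eps_edited + weight * (eps_edited - eps_unedited)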

Qualitative Results

Point Propagation

These generated videos show the red marker propagated across time. The video model used here is Wan2.1 14B I2V.

Tracking

We run a color-based tracker on the generated videos to recover the track of each propagated point. We then combine the tracks for all propagated points and show them here.
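As a rough illustration, a minimal color-based tracker could simply pick the reddest pixel in each generated frame. The tracker actually used may localize the marker more carefully and handle occlusion differently, so treat this as a sketch.

import numpy as np

def track_red_marker(frames):
    # frames: (T, H, W, 3) uint8 RGB frames of a generated video.
    positions = []
    for frame in frames.astype(np.float32):
        r, g, b = frame[..., 0], frame[..., 1], frame[..., 2]
        redness = r - 0.5 * (g + b)          # crude per-pixel "redness" score
        y, x = np.unravel_index(np.argmax(redness), redness.shape)
        positions.append((x, y))
    return np.asarray(positions)             # (T, 2) pixel coordinates per frame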

Tracking Through Occlusions

Drag the slider on the videos to compare our tracking results with the original video.

Generated Video with Refinement Comparison

The regenerated video (middle) contains objects that differ in appearance or position from the original (left), with artifacts highlighted by white circles. Applying inpainting-based refinement removes these artifacts, producing a refined video (right) that is visually consistent with the original.

Failure Cases

These generated videos show failure cases of our method.

Stationary Point: Point remains stationary with respect to the image boundaries.

Symmetry: Point drawn on the right foot gets propagated to the left foot.

Propagation Failure: Model fails to propagate the point in the last few frames.

Ambiguity Near Edges: Point near the boundary snaps to background.

Point Propagation (with CogVideoX)

BibTeX

@InProceedings{shrivastava2025pointprompting,
      title     = {Point Prompting: Counterfactual Tracking with Video Diffusion Models},
      author    = {Shrivastava, Ayush and Mehta, Sanyam and Geng, Daniel and Owens, Andrew},
      booktitle = {arXiv},
      year      = {2025},
      url       = {https://arxiv.org/abs/2510.11715},
}