Point Prompting: Counterfactual Tracking with Video Diffusion Models

University of Michigan, Cornell University

TL;DR: We propose a method for zero-shot point tracking that simply prompts video diffusion models to visually mark points as they move over time.



These videos are directly generated by the video diffusion model, with red markers propagated across time.

Abstract

Trackers and video generators solve closely related problems: the former analyze motion, while the latter synthesize it. We show that this connection enables pretrained video diffusion models to perform zero-shot point tracking by simply prompting them to visually mark points as they move over time. We place a distinctively colored marker at the query point, then regenerate the rest of the video from an intermediate noise level. This propagates the marker across frames, tracing the point's trajectory. To ensure that the marker remains visible in this counterfactual generation, despite such markers being unlikely in natural videos, we use the unedited initial frame as a negative prompt. Through experiments with multiple image-conditioned video diffusion models, we find that these "emergent" tracks outperform those of prior zero-shot methods and persist through occlusions, often obtaining performance that is competitive with specialized self-supervised models.

Point Propagation


We mark a query point in the first frame with a colored dot. Then, using SDEdit, we regenerate the rest of the video from an intermediate noise level while enhancing the counterfactual signal (described below). The regenerated video propagates the dot across subsequent frames, tracing the underlying point's trajectory.
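The sketch below illustrates this propagation step. It is a minimal illustration, not the actual implementation: the helpers draw_dot, encode, and decode and the scheduler/denoiser interface are hypothetical placeholders, and the noise level and conditioning details follow the paper rather than this code.

import torch

# SDEdit-style point propagation (sketch): draw_dot, encode, and decode are
# placeholder helpers; the scheduler/denoiser interface is generic, not a real API.
def propagate_marker(video, query_xy, model, scheduler, start_step):
    # Place a distinctively colored dot at the query point in frame 0.
    edited_first_frame = draw_dot(video[0], query_xy, color=(255, 0, 0))

    # Noise the original video to an intermediate level (SDEdit), keeping
    # coarse structure while letting details, including the dot, be resampled.
    latents = encode(video)
    t0 = scheduler.timesteps[start_step]
    noisy = scheduler.add_noise(latents, torch.randn_like(latents), t0)

    # Denoise the remaining steps, conditioning on the edited first frame.
    x = noisy
    for t in scheduler.timesteps[start_step:]:
        eps = model(x, t, cond_image=edited_first_frame)
        x = scheduler.step(eps, t, x)

    # In the decoded video, the dot should follow the query point over time.
    return decode(x)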

Enhancing the Counterfactual Signal


We use negative prompting to ensure that the generated video contains the marker. In each denoising step, we condition the denoising on two images: (1) Edited First Frame: the first frame of the video with the marker added, and (2) Unedited First Frame: the original first frame of the video. We then subtract the weighted noise prediction for the latter from that of the former, which enhances the counterfactual signal.
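Concretely, if eps_edited and eps_unedited are the noise predictions conditioned on the edited and unedited first frames at a given denoising step, the combination takes the following form. This is a minimal sketch; the guidance weight and exact parameterization used in the paper are not reproduced here.

def counterfactual_guidance(eps_edited, eps_unedited, weight=2.0):
    # Subtract a weighted copy of the "no marker" prediction from the "marker"
    # prediction (equivalently, (1 + w) * eps_edited - w * eps_unedited),
    # pushing each denoising step toward keeping the marker visible.
    # The default weight here is illustrative, not the paper's value.
    return eps_edited + weight * (eps_edited - eps_unedited)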

Qualitative Results

Point Propagation

These generated videos show the red marker propagated across time. The video model used here is Wan2.1 14B I2V.

Tracking

We run a color-based tracker on the generated videos to recover the track of each propagated point. We then combine the tracks for all propagated points and show them here.
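As a rough illustration, a minimal color-based tracker could simply pick the reddest pixel in each generated frame. The tracker actually used may localize the marker more carefully and handle occlusion differently, so treat this as a sketch.

import numpy as np

def track_red_marker(frames):
    # frames: (T, H, W, 3) uint8 RGB frames of a generated video.
    positions = []
    for frame in frames.astype(np.float32):
        r, g, b = frame[..., 0], frame[..., 1], frame[..., 2]
        redness = r - 0.5 * (g + b)          # crude per-pixel "redness" score
        y, x = np.unravel_index(np.argmax(redness), redness.shape)
        positions.append((x, y))
    return np.asarray(positions)             # (T, 2) pixel coordinates per frame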

Tracking Through Occlusions

Drag the slider on the videos to compare our tracking results with the original video.

Generated Video with Refinement Comparison

The regenerated video (middle) contains objects that differ in appearance or position from the original (left), with artifacts highlighted by white circles. Applying inpainting-based refinement removes these artifacts, producing a refined video (right) that is visually consistent with the original.

Failure Cases

These generated videos show failure cases of our method.

Stationary Point: Point remains stationary with respect to the image boundaries.

Symmetry: Point drawn on the right foot gets propagated to the left foot.

Propagation Failure: Model fails to propagate the point in the last few frames.

Ambiguity Near Edges: Point near the boundary snaps to background.

Point Propagation (with CogVideoX)

BibTeX

@InProceedings{shrivastava2025pointprompting,
      title     = {Point Prompting: Counterfactual Tracking with Video Diffusion Models},
      author    = {Shrivastava, Ayush and Mehta, Sanyam and Geng, Daniel and Owens, Andrew},
      booktitle = {arXiv},
      year      = {2025},
      url       = {https://arxiv.org/abs/2510.11715},
}