A correlated sampling scheme for boosting temporal consistency in diffusion models

Visualization for human motion generation

We present visualizations for the action-to-motion generation task. The left column is produced by MDM, and the right column is produced by MDM + ARTDiff. The videos on the left show jitter, flickering, and other unnatural, inconsistent artifacts, while the videos on the right are consistent and natural.

Eat

Drink

Run

Jump

The following four motions are all generated by ARTDiff from the same prompt "jump", showcasing the diversity of the generated motions.

Visualization for one-shot video generation

We present visualizations for the one-shot video generation task. The left column is produced by Tune-A-Video, and the right column is produced by Tune-A-Video + ARTDiff.

"a jeep car is moving on the snow"

The background on the left changes abruptly, while the background on the right is smooth.

"a raccoon is surfing, cartoon style"

The left raccoon's arms shake and flicker, while the right raccoon's arms are stable.

"spider man is skiing on the beach, cartoon style"

The skateboard under Spider-Man's feet on the left is a single board one moment and a double board the next, while the skateboard on the right stays a single board throughout.

"a tiger is eating a watermelon"

The left tiger's hindquarters keep changing across frames, while the right tiger's remain consistent.

"a person kiteboarding in a stormy sea"

The person flickers at the end of the left video, while the right video is consistent.

"a pink flamingo swimming in a pond"

The left flamingo has two necks, while the right one looks more natural.

Visualization for video editing

We present visualizations for the video editing task. The left column is the input video, the middle column is produced by Video-P2P, and the right column is produced by Video-P2P + ARTDiff.

"a penguin is running on the ice" --> "a crochet penguin is running on the ice"

The mouth of the baseline penguin changes abruptly, while ours is smooth.

"a man is surfing on the ocean" --> "a man is surfing on the desert"

The surfboard in the middle flickers, while the one on the right is stable.

"a rabbit is jumping on the grass" --> "a origami rabbit is jumping the grass"

The head of the baseline-generated rabbit keeps changing, while ours is more consistent.

"a man is walking on the mountain" --> "a panda is walking on the mountain"

The face of the panda in the baseline-generated video flickers, while ours does not.