We present visualizations for the action-to-motion generation task. The left column is produced by MDM, and the right column is produced by MDM + ARTDiff. The videos on the left exhibit jitter, flickering, and other unnatural, inconsistent artifacts, while the videos on the right are consistent and natural.
The following four motions are all generated by ARTDiff from the same prompt, "jump", which showcases the diversity of the generated motions.
We present visualizations for the one-shot video generation task. The left column is produced by Tune-A-Video, and the right column is produced by Tune-A-Video + ARTDiff.
The background on the left changes abruptly, while the background on the right is smooth.
The raccoon's arms on the left shake and flicker, while those on the right are stable.
The skateboard under Spider-Man's feet on the left alternates between a single board and a double board, while the skateboard on the right remains a single board throughout.
The tiger's hindquarters on the left keep changing, while those on the right do not.
The person flickers at the end of the left video, while the right video is consistent.
The flamingo on the left has two necks, while the one on the right looks more natural.
We present visualizations for the video editing task. The left column is the input video, the middle column is produced by Video-P2P, and the right column is produced by Video-P2P + ARTDiff.
The mouth of the baseline penguin changes abruptly, while ours is smooth.
The surfboard in the middle column flickers, while the one on the right is stable.
The head of the baseline-generated rabbit keeps changing, while ours is more consistent.
The face of the panda in the baseline-generated video flickers, while ours does not.