Retiming People in Video

When I was a senior in college, I took a seminar called Creative AI in Visual Computing, taught by Prof. Julie Dorsey. In a week focused on Video Manipulation, we read the paper Layered Neural Rendering for Retiming People in Video, which presented an algorithm for retiming people in a video, that is, changing when each person's motion happens relative to everyone else's. The prime example they showed, which is copied below, blew me away with its realism.

I think there are a lot of examples of synthetic media that ‘aren’t there yet’. Whether it’s a deepfake face straight out of the uncanny valley or a clueless attempt at art generation from a prompt, AI has a lot of room to grow. Yet I couldn’t tell that this video example had been synthesized at all. But it is, undeniably, synthetic media: the order of events is manipulated by a computer, creating causality and/or synchronization where there was none.

Process

The paper explains that their algorithm uses a deep neural network. The biggest breakthrough of this paper is that the model learns to recognize not just a person, but also the impact that person has on the world around them.

The network is trained only on the video itself and is self-supervised: at each iteration it learns by attempting to reconstruct the original video frames. To do so, the model decomposes each frame into RGBA layers, learning to separate not just the people but also their effects on the environment, like splashes, shadows, and trampoline deformations. Thus, when people are retimed, their effects are automatically retimed too. The network's final output is a set of layers that is then ready to be interpolated or temporally shifted.
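To make the layer idea concrete, here is a minimal sketch of how a set of RGBA layers could be recombined with a per-layer time shift. It is not the paper's code: the names (over, composite, offsets), the flattened one-value-per-pixel frame format, and the choice to clamp out-of-range times are all assumptions made for illustration. It only shows the standard back-to-front "over" blend, with each layer sampled at its own time.

```python
from typing import List, Sequence, Tuple

RGB = Tuple[float, float, float]
RGBA = Tuple[float, float, float, float]  # straight (non-premultiplied) alpha in [0, 1]
Frame = List[RGBA]   # hypothetical flattened image: one RGBA value per pixel
Layer = List[Frame]  # one frame per time step


def over(top: RGBA, bottom: RGB) -> RGB:
    """Blend one straight-alpha pixel onto an opaque pixel ('over' operator)."""
    r, g, b, a = top
    return (
        a * r + (1.0 - a) * bottom[0],
        a * g + (1.0 - a) * bottom[1],
        a * b + (1.0 - a) * bottom[2],
    )


def composite(
    layers: Sequence[Layer],
    background: Sequence[RGB],
    t: int,
    offsets: Sequence[int],
) -> List[RGB]:
    """Render output frame t, sampling layer i at time t + offsets[i].

    Layers are blended back to front; out-of-range times are clamped, which
    freezes a layer on its first or last frame (an assumption of this sketch).
    """
    out = list(background)
    for layer, offset in zip(layers, offsets):
        src = min(max(t + offset, 0), len(layer) - 1)
        out = [over(p, q) for p, q in zip(layer[src], out)]
    return out


# A one-pixel, two-frame toy video: the red layer is visible only at t=0,
# the blue layer only at t=1.
CLEAR = (0.0, 0.0, 0.0, 0.0)
red = [[(1.0, 0.0, 0.0, 1.0)], [CLEAR]]
blue = [[CLEAR], [(0.0, 0.0, 1.0, 1.0)]]
black = [(0.0, 0.0, 0.0)]

original = composite([red, blue], black, t=0, offsets=[0, 0])
retimed = composite([red, blue], black, t=0, offsets=[0, 1])
```

In the toy video, shifting the blue layer one frame earlier makes it appear at t=0, on top of the red layer, even though the two never coincided in the original. Because a person's shadow or splash lives in the same layer as the person, shifting the layer moves those effects along with them.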

Ramifications

The resulting videos in the paper are very neat, but we should not overlook the power of literally reconstructing the course of events at will. Today, the question of who acted first is crucial. From pop culture to international warfare, public opinion can change drastically based on whether someone instigated or acted in defense.

For example, in controversial police encounters, the question of who provoked whom is usually revealed by body cam footage. This can make or break a case in court. What’s to stop this technology from being used to doctor video to protect instigating cops?

I could also see this being leveraged for political misinformation: doctoring interactions from events like debates to change the context of a reaction or a statement. While these clips could ultimately be verified, misinformation notoriously spreads unchecked on social media, so this kind of manipulation could be weaponized by entities looking to push an agenda or influence public opinion.

Who shot first? With Video Retiming, it could be anyone.

He said, she said…

Overall, I found the process in this paper fascinating and intuitive, and its results impressive. As synthetic media becomes more and more sophisticated and accessible, it’s important to keep in mind all the good and bad such a technology is capable of.