Temporal In-Context Fine-Tuning
for Versatile Control of Video Diffusion Models

KAIST AI

Abstract

We present Temporal In-Context Fine-Tuning (TIC-FT), a simple and efficient method for adapting pretrained video diffusion models to a wide range of conditional generation tasks. TIC-FT works by concatenating condition and target frames along the temporal axis and inserting buffer frames with increasing noise, enabling smooth transitions and alignment with the model's temporal dynamics. Unlike prior approaches, TIC-FT requires no architectural changes or large datasets, and achieves strong performance with as few as 10–30 training samples. We demonstrate its effectiveness on tasks such as image-to-video and video-to-video generation using large-scale models, showing superior condition fidelity, visual quality, and efficiency compared to existing baselines.

※ All videos below are generated using Wan2.1-T2V-14B and CogVideoX-T2V-5B

Image-to-Video

This task generates a full video conditioned on a single image. The image may represent a high-level concept, such as a character profile or a top-view object, while the video depicts novel dynamics such as a character-centric animation or a 360° rotation.

[Video gallery: cartoon image inputs; 3D-to-video, character-to-video, 360° rotation, and NeRF examples]

Video Style Transfer

This task transforms the visual style of a source video into that of a target domain (e.g., converting a realistic video into an animated version) while preserving motion and structure.

Multiple Image Conditions

This task generates a video from two or more image conditions, such as a person and clothing or a person and an object, capturing their combined semantics in motion.

[Video gallery: advertisement and virtual try-on (VITON) examples]

Keyframe Interpolation

This task fills in intermediate frames between sparse keyframes to generate a smooth and temporally coherent video.

[Video gallery: keyframe interpolation examples]

In-Context Action Transfer

This task continues a novel scene by transferring the motion pattern of a reference action video into a new context, guided by the first frame of the new scene.

[Video: in-context action transfer example on SSv2]

Method: Temporal In-Context Fine-Tuning (TIC-FT)

TIC-FT is a simple yet powerful approach for conditional video generation. The method's core innovation lies in its temporal concatenation strategy, which combines three key components:

  • Clean condition frames that provide the initial context
  • Buffer frames with gradually increasing noise levels
  • Pure noise target frames for generation

This design ensures temporal coherence throughout the sequence while minimizing distribution mismatch during fine-tuning, all without modifying the model architecture.

Key Formulations

1. Temporal Concatenation

The basic temporal concatenation combines condition frames with target frames: \[ \mathbf{z}^{(t)} = [\bar{\mathbf{z}}^{(0)}_{1:L} \parallel \hat{\mathbf{z}}^{(t)}_{L+1:L+K}] \] where \(\bar{\mathbf{z}}^{(0)}\) represents clean condition frames and \(\hat{\mathbf{z}}^{(t)}\) represents noisy target frames.
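To make this concrete, here is a minimal PyTorch sketch of the concatenation. The frame counts, latent shapes, and the linear flow-matching-style noising rule are illustrative assumptions, not details taken from the released implementation:

```python
import torch

L, K = 4, 12                        # condition / target frame counts (assumed)
C, H, W = 16, 60, 104               # latent dimensions (assumed)

cond = torch.randn(L, C, H, W)      # clean condition latents z_bar^(0)_{1:L}
target = torch.randn(K, C, H, W)    # clean target latents
eps = torch.randn_like(target)

s = 0.8                             # normalized noise level t / T
noisy_target = (1 - s) * target + s * eps   # z_hat^(t) under an assumed linear schedule

z = torch.cat([cond, noisy_target], dim=0)  # [z_bar^(0)_{1:L} || z_hat^(t)_{L+1:L+K}]
```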

2. Buffer Frames

Buffer frames are inserted with noise levels that gradually increase: \[ \tilde{\tau}_b = \frac{b}{B+1} \cdot T \quad \text{for} \quad b = 1,\ldots,B \] The complete sequence with buffer frames is then: \[ \mathbf{z}^{(T)} = [\bar{\mathbf{z}}^{(0)} \parallel \tilde{\mathbf{z}}^{(\tilde{\tau}_{1:B})} \parallel \hat{\mathbf{z}}^{(T)}] \] where \(\tilde{\mathbf{z}}^{(\tilde{\tau}_{1:B})}\) represents the buffer frames with interpolated noise levels.
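Continuing the same illustrative setup, a sketch of the buffer schedule and the full training sequence \(\mathbf{z}^{(T)}\):

```python
import torch

L, B, K, T = 4, 3, 12, 1000          # frame counts and max timestep (assumed)
C, H, W = 16, 60, 104

buffer_taus = [b / (B + 1) * T for b in range(1, B + 1)]  # tilde_tau_b: 250, 500, 750

cond = torch.randn(L, C, H, W)       # clean condition frames (noise level 0)
buffer_clean = torch.randn(B, C, H, W)
buffer = torch.stack([
    (1 - tau / T) * x + (tau / T) * torch.randn_like(x)   # interpolated noise level
    for tau, x in zip(buffer_taus, buffer_clean)
])
target = torch.randn(K, C, H, W)     # pure-noise target frames (level T)

z_T = torch.cat([cond, buffer, target], dim=0)   # [z_bar || z_tilde || z_hat^(T)]
```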

3. Loss Function

The training loss is computed only on the target frames: \[ \mathcal{L} = \frac{1}{K} \sum_{i=L+B+1}^{L+B+K} \|\boldsymbol{\epsilon}_i - \hat{\boldsymbol{\epsilon}}_i\|^2 \] This formulation allows buffer frames to evolve naturally, creating a smooth transition between condition and target sequences.
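A sketch of one training step under these assumptions; `model` is a stand-in for the fine-tuned video diffusion backbone, which we assume accepts a per-frame timestep vector:

```python
import torch

L, B, K, T = 4, 3, 12, 1000
C, H, W = 16, 60, 104
buffer_taus = [b / (B + 1) * T for b in range(1, B + 1)]
model = lambda z, taus: torch.randn_like(z)      # stand-in backbone (placeholder)

t = float(torch.randint(1, T + 1, ()).item())    # sampled global timestep
cond = torch.randn(L, C, H, W)                   # clean condition frames
buffer = torch.randn(B, C, H, W)                 # buffer frames, noised as above
x_tgt = torch.randn(K, C, H, W)                  # clean target latents
eps = torch.randn_like(x_tgt)                    # ground-truth noise for the targets
noisy_tgt = (1 - t / T) * x_tgt + (t / T) * eps  # assumed linear noising

z = torch.cat([cond, buffer, noisy_tgt], dim=0)
frame_taus = torch.tensor([0.0] * L + buffer_taus + [t] * K)

pred_eps = model(z, frame_taus)
loss = torch.mean((pred_eps[L + B:] - eps) ** 2)  # MSE over the K target frames only
```

Because condition and buffer positions contribute no gradient, the buffer frames are free to settle into whatever transition best links the condition and target content.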

4. Inference

At each global timestep \(t\), the model identifies frames at the current noise level and selectively denoises only those frames: \[ \mathbf{T}(\mathbf{z}^{(T)}) = [\underbrace{0, \ldots, 0}_{L}, \tilde{\tau}_1, \ldots, \tilde{\tau}_B, \underbrace{T, \ldots, T}_{K}] \] \[ \mathbf{T}(\mathbf{z}^{(t)}) = [0, \ldots, 0, \tau_1(t), \ldots, \tau_B(t), t, \ldots, t] \quad \text{where} \quad \tau_b(t) = \min(t, \tilde{\tau}_b) \] This step-wise denoising proceeds from \(t = T\) down to \(t = 0\), preserving the condition frames and ensuring a smooth noise transition across buffer and target frames.
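A sketch of the resulting sampling loop, under the same illustrative constants; `model` and `scheduler_step` are placeholders for the backbone and its sampler update rule (the actual Wan2.1/CogVideoX samplers differ in detail):

```python
import torch

L, B, K, T = 4, 3, 12, 1000
C, H, W = 16, 60, 104
buffer_taus = [b / (B + 1) * T for b in range(1, B + 1)]

model = lambda z, taus: torch.randn_like(z)   # stand-in backbone (placeholder)

def scheduler_step(x, eps, t):                # stand-in Euler-style update (assumed)
    return x - eps / T

def frame_timesteps(t):
    # T(z^(t)) = [0,...,0, tau_1(t),...,tau_B(t), t,...,t], tau_b(t) = min(t, tilde_tau_b)
    return torch.tensor([0.0] * L + [min(t, tau) for tau in buffer_taus] + [float(t)] * K)

z = torch.randn(L + B + K, C, H, W)           # z^(T), assembled as in training

for t in range(T, 0, -1):                     # t = T, ..., 1
    taus = frame_timesteps(t)
    pred_eps = model(z, taus)
    active = taus == t                        # frames at the current noise level
    z[active] = scheduler_step(z[active], pred_eps[active], t)
```

Condition frames (timestep 0) are never selected, so they are preserved verbatim; each buffer frame joins the loop once the global timestep falls to its fixed level \(\tilde{\tau}_b\).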

Evaluation

I2V Evaluation Metrics

Table 1: Comparison on VBench, GPT-4o, and perceptual similarity metrics for I2V tasks

V2V Evaluation Metrics

Table 2: Comparison on VBench, GPT-4o, and perceptual similarity metrics for V2V tasks

BibTeX

TBD