We present Temporal In-Context Fine-Tuning (TIC-FT), a simple and efficient method for adapting pretrained video diffusion models to a wide range of conditional generation tasks. TIC-FT works by concatenating condition and target frames along the temporal axis and inserting buffer frames with increasing noise, enabling smooth transitions and alignment with the model's temporal dynamics. Unlike prior approaches, TIC-FT requires no architectural changes or large datasets, and achieves strong performance with as few as 10–30 training samples. We demonstrate its effectiveness on tasks such as image-to-video and video-to-video generation using large-scale models, showing superior condition fidelity, visual quality, and efficiency compared to existing baselines.
※ All videos below are generated using Wan2.1-T2V-14B and CogVideoX-T2V-5B
This task generates a full video conditioned on a single image. The image may represent a high-level concept—such as a character profile or a top-view object—with the video depicting novel dynamics, such as a character-centric animation or a 360° rotation.
This task transforms the visual style of a source video into that of a target domain (e.g., converting a realistic video into an animated version) while preserving motion and structure.
This task generates a video based on two or more image conditions—such as a person and clothing, or a person and an object—capturing the combined semantics in motion.
This task fills in intermediate frames between sparse keyframes to generate a smooth and temporally coherent video.
This task continues a novel scene by transferring the motion pattern of a reference action video into a new context, guided by the first frame of the new scene.
TIC-FT is a simple yet powerful approach for conditional video generation. The method's core innovation lies in its temporal concatenation strategy, which combines three key components: (1) clean condition frames concatenated with noisy target frames along the temporal axis, (2) buffer frames inserted between them with gradually increasing noise levels, and (3) a training loss computed only on the target frames.
This design ensures temporal coherence throughout the sequence while minimizing distribution mismatch during fine-tuning.
The basic temporal concatenation combines condition frames with target frames: \[ \mathbf{z}^{(t)} = [\bar{\mathbf{z}}^{(0)}_{1:L} \parallel \hat{\mathbf{z}}^{(t)}_{L+1:L+K}] \] where \(\bar{\mathbf{z}}^{(0)}\) represents clean condition frames and \(\hat{\mathbf{z}}^{(t)}\) represents noisy target frames.
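For illustration, a minimal PyTorch sketch of this step is shown below. The tensor layout \((N, C, F, H, W)\) with \(F\) as the frame axis, the diffusers-style `scheduler.add_noise` interface, and all names are assumptions for the sketch, not the actual implementation.

```python
import torch

def concat_condition_and_target(cond_latents, target_latents, t, scheduler):
    """Build z^(t) = [clean condition frames || target frames noised to level t].

    cond_latents:   (N, C, L, H, W) clean condition frames
    target_latents: (N, C, K, H, W) clean target frames
    t:              sampled diffusion timestep (LongTensor)
    """
    noise = torch.randn_like(target_latents)                  # epsilon for the K target frames
    noisy_target = scheduler.add_noise(target_latents, noise, t)
    # Concatenate along the temporal (frame) axis: result is (N, C, L+K, H, W).
    z_t = torch.cat([cond_latents, noisy_target], dim=2)
    return z_t, noise
```

The condition frames remain at noise level zero throughout, so the model can attend to them as clean in-context frames while denoising the target frames.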
Buffer frames are inserted with noise levels that gradually increase: \[ \tilde{\tau}_b = \frac{b}{B+1} \cdot T \quad \text{for} \quad b = 1,\ldots,B \] The complete sequence with buffer frames is then: \[ \mathbf{z}^{(T)} = [\bar{\mathbf{z}}^{(0)} \parallel \tilde{\mathbf{z}}^{(\tilde{\tau}_{1:B})} \parallel \hat{\mathbf{z}}^{(T)}] \] where \(\tilde{\mathbf{z}}^{(\tilde{\tau}_{1:B})}\) represents the buffer frames with interpolated noise levels.
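The buffer schedule can be sketched in the same style: compute the interpolated levels \(\tilde{\tau}_b\) and assemble the full sequence. As above, the scheduler interface, tensor layout, and helper names are illustrative assumptions, and \(T\) is assumed to be a valid index into the scheduler's noise table (its highest training timestep).

```python
import torch

def buffer_timesteps(B: int, T: int) -> torch.Tensor:
    """tilde_tau_b = b / (B + 1) * T for b = 1..B (strictly increasing, all < T)."""
    b = torch.arange(1, B + 1, dtype=torch.float32)
    return (b / (B + 1) * T).long()

def build_sequence(cond, buffer, target, T, scheduler):
    """z^(T) = [clean condition || buffer frames at tilde_tau_{1:B} || target at T]."""
    taus = buffer_timesteps(buffer.shape[2], T)
    # Noise each buffer frame to its own interpolated level tilde_tau_b.
    noisy_buf = torch.stack(
        [scheduler.add_noise(buffer[:, :, i], torch.randn_like(buffer[:, :, i]), taus[i])
         for i in range(buffer.shape[2])],
        dim=2,
    )
    # Target frames sit at the highest noise level T.
    noisy_tgt = scheduler.add_noise(target, torch.randn_like(target), torch.tensor(T))
    return torch.cat([cond, noisy_buf, noisy_tgt], dim=2)
```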
The training loss is computed only on the target frames: \[ \mathcal{L} = \frac{1}{K} \sum_{i=L+B+1}^{L+B+K} \|\boldsymbol{\epsilon}_i - \hat{\boldsymbol{\epsilon}}_i\|^2 \] This formulation allows buffer frames to evolve naturally, creating a smooth transition between condition and target sequences.
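In code, restricting the loss to the target frames amounts to slicing out the last \(K\) frames before computing the MSE. The sketch below assumes the frame ordering [\(L\) condition | \(B\) buffer | \(K\) target] along the frame axis, with `eps_hat` the model's per-frame noise prediction and `eps` the ground-truth noise; the names are hypothetical.

```python
import torch.nn.functional as F

def tic_ft_loss(eps_hat, eps, L, B, K):
    """MSE restricted to the K target frames (1-indexed frames L+B+1 .. L+B+K)."""
    tgt = slice(L + B, L + B + K)        # 0-indexed slice of the target frames
    return F.mse_loss(eps_hat[:, :, tgt], eps[:, :, tgt])
```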
At each global timestep \(t\), the model identifies frames at the current noise level and selectively denoises only those frames: \[ \mathbf{T}(\mathbf{z}^{(T)}) = [0, \ldots, 0, \tilde{\tau}_1, \ldots, \tilde{\tau}_B, T, \ldots, T] \] \[ \mathbf{T}(\mathbf{z}^{(t)}) = [0, \ldots, 0, \tau_1(t), \ldots, \tau_B(t), t, \ldots, t] \quad \text{where} \quad \tau_b(t) = \min(t, \tilde{\tau}_b) \] This step-wise denoising proceeds from \(t = T\) down to \(t = 0\), keeping the condition frames fixed and ensuring a smooth noise transition across the buffer and target frames.
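The per-frame timestep map can be sketched as follows; `frame_timesteps` and its arguments are hypothetical names, and the trailing comment shows one way to realize the selective denoising described above.

```python
import torch

def frame_timesteps(t: int, buffer_taus: torch.Tensor, L: int, K: int) -> torch.Tensor:
    """[0,...,0 (L times), min(t, tau_1),...,min(t, tau_B), t,...,t (K times)]."""
    cond = torch.zeros(L, dtype=torch.long)                          # condition frames stay clean
    buf = torch.minimum(torch.full_like(buffer_taus, t), buffer_taus)
    tgt = torch.full((K,), t, dtype=torch.long)
    return torch.cat([cond, buf, tgt])

# At each sampling step, only frames currently at level t are updated; a buffer
# frame starts being denoised only once the global t reaches its level tilde_tau_b:
# per_frame_t = frame_timesteps(t, taus, L, K)
# update_mask = per_frame_t == t
```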
Table 1: Comparison on VBench, GPT-4o, and perceptual similarity metrics for I2V tasks
Table 2: Comparison on VBench, GPT-4o, and perceptual similarity metrics for V2V tasks
TBD