The rising popularity of immersive visual experiences has increased interest in stereoscopic 3D video generation. Despite significant advances in video synthesis, creating 3D videos remains challenging due to the relative scarcity of 3D video data. We propose a simple approach for transforming a text-to-video generator into a video-to-stereo generator. Given an input video, our framework automatically produces the video frames from a shifted viewpoint, enabling a compelling 3D effect. Prior and concurrent approaches for this task typically operate in multiple phases, first estimating video disparity or depth, then warping the video accordingly to produce a second view, and finally inpainting the disoccluded regions. This approach inherently fails when the scene involves specular surfaces or transparent objects. In such cases, single-layer disparity estimation is insufficient, resulting in artifacts and incorrect pixel shifts during warping. Our work bypasses these restrictions by directly synthesizing the new viewpoint, avoiding any intermediate steps. This is achieved by leveraging a pre-trained video model's priors on geometry, object materials, optics, and semantics, without relying on external geometry models or manually disentangling geometry from the synthesis process. We demonstrate the advantages of our approach in complex, real-world scenarios featuring diverse object materials and compositions.
Our mono-to-stereo pipeline. We leverage the pre-trained cascaded text-to-video model Lumiere, along with a curated dataset of rectified stereo pairs, to tackle mono-to-stereo synthesis. We fine-tune the base (low-resolution) pre-trained model in two ways. First, we add input channels to condition the model on an input right-eye video, and train the base Eye2Eye generator on downsampled stereo pairs (top left). Second, we train a refiner model with the same conditioning mechanism, only on non-downsampled crops (bottom left). The base Eye2Eye model captures the correct pixel disparities at low resolution, while the Eye2Eye refiner yields higher quality for pixels with large disparities when sampling at high resolution. Our sampling process (right) combines both models' strengths: we first generate a low-resolution output with the base Eye2Eye model to establish appropriate stereo disparity for a compelling 3D effect, then noise and denoise it with the Eye2Eye refiner to achieve high visual quality.
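As a concrete illustration of the conditioning mechanism, below is a minimal PyTorch-style sketch; the module names, tensor shapes, and the assumption that the pre-trained denoiser's first convolution is widened to accept extra input channels are ours, not the actual Lumiere / Eye2Eye implementation.

    # Minimal sketch (not the released code): conditioning a video denoiser on the
    # right-eye input by concatenating it with the noisy sample along the channel axis.
    import torch
    import torch.nn as nn

    class Eye2EyeDenoiser(nn.Module):
        def __init__(self, pretrained_denoiser: nn.Module):
            super().__init__()
            # Assumed: the first convolution of the pre-trained denoiser has been widened
            # (new input channels initialized to zero) to accept the concatenated condition.
            self.denoiser = pretrained_denoiser

        def forward(self, noisy_left: torch.Tensor, right_video: torch.Tensor, t: torch.Tensor) -> torch.Tensor:
            # noisy_left, right_video: (batch, channels, frames, height, width)
            x = torch.cat([noisy_left, right_video], dim=1)  # condition on the right-eye video
            return self.denoiser(x, t)  # predicts the denoised left-eye video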
Click on the buttons to toggle between anaglyph and side-by-side (SBS) views.
Note: Viewing anaglyph content on Safari/iOS devices may not provide the optimal experience due to browser limitations. For the best experience, we recommend using Chrome on desktop.
For more videos, see the results page.
To view the results in a VR headset, use the headset browser and press here.
We combine these stages by first generating a low-resolution output with the base model, then upsampling and noising it, and finally denoising it with the refiner. This results in a high-resolution video that maintains both a proper 3D effect and high visual quality (bottom right).
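A rough sketch of this coarse-to-fine sampling is given below, assuming the two fine-tuned models are wrapped as simple callables; the function names, the 4x resolution factor, and the linear noising step are illustrative simplifications rather than the actual implementation.

    import torch
    import torch.nn.functional as F

    def eye2eye_sample(right_video, base_sampler, refiner_denoiser, noise_level=0.6):
        # right_video: (batch, channels, frames, height, width), the input right-eye frames.
        B, C, T, H, W = right_video.shape

        # 1. Base Eye2Eye model: full sampling at low resolution, conditioned on the
        #    downsampled right-eye video, to establish the correct stereo disparity.
        right_lr = F.interpolate(right_video, size=(T, H // 4, W // 4), mode="trilinear")
        left_lr = base_sampler(cond=right_lr)

        # 2. Upsample the low-resolution left-eye result and re-noise it to an
        #    intermediate level (a simplified linear mix, not an exact diffusion schedule).
        left_up = F.interpolate(left_lr, size=(T, H, W), mode="trilinear")
        noisy = (1.0 - noise_level) * left_up + noise_level * torch.randn_like(left_up)

        # 3. Eye2Eye refiner: denoise from that intermediate level at full resolution,
        #    conditioned on the full-resolution right-eye video, for sharp final frames.
        return refiner_denoiser(init=noisy, cond=right_video, start_noise=noise_level)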
We compare our method with the standard "Warp and Inpaint" approach, which relies on monocular depth estimation to warp the right-eye view to the left eye and uses an inpainting model to fill in the missing regions.
We consider both our implementation and the StereoCrafter [3] method.
This baseline struggles with videos containing specular reflections or transparent objects, as a single depth value per pixel cannot capture both surfaces (e.g., when the depth of the reflection differs from that of the reflecting surface).
Consequently, reflections and transparent surfaces are incorrectly warped, resulting in distorted or incorrect 3D effects. StereoCrafter additionally tends to produce warping artifacts, such as blurred edges and temporal inconsistencies.
In contrast, our method generates stereo RGB views directly, bypassing explicit depth estimation and leveraging the generative model's implicit knowledge about materials and optics to handle challenging scenarios effectively.
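For intuition about why a single depth layer is insufficient, here is a toy sketch of the disparity-based forward warp underlying such baselines (our simplified illustration, not StereoCrafter's code): every pixel carries exactly one horizontal shift, so a reflection and the surface it appears on are forced to move together.

    import torch

    def warp_right_to_left(right_frame, disparity):
        # right_frame: (channels, height, width); disparity: (height, width) horizontal
        # shift in pixels, one value per pixel (the sign convention here is an assumption).
        C, H, W = right_frame.shape
        left = torch.zeros_like(right_frame)
        filled = torch.zeros(H, W, dtype=torch.bool)
        xs = torch.arange(W)
        for y in range(H):
            # Forward-splat each row; with one disparity per pixel, overlapping sources
            # simply overwrite each other in this naive version.
            target_x = (xs + disparity[y]).round().long().clamp(0, W - 1)
            left[:, y, target_x] = right_frame[:, y, xs]
            filled[y, target_x] = True
        holes = ~filled  # disoccluded pixels, to be filled by an inpainting model
        return left, holes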
For more videos, please see the comparisons page.
To view the comparison in a VR headset, use the headset browser and press here.
Note the 3D effect of the reflection of the woman's face; in our output it appears near, while it shows no 3D effect in the Warp & Inpaint and StereoCrafter baselines, since the depth estimation does not account for the reflection.
Note the buildings behind the umbrella; in the Warp & Inpaint and StereoCrafter baselines, they appear as near as the umbrella, while in our result they are correctly distant.
Note the depth of the reflection: it should appear near, as in our result. The Warp & Inpaint baseline fails to create a 3D effect for this reflection, and while the StereoCrafter result shows some 3D effect in the reflection area, it is full of artifacts.
Note the depth difference between the far-away content reflected in the window and the woman's face: in the Warp & Inpaint and StereoCrafter baselines, they both have the same depth, whereas in our result the woman's face is correctly closer than the reflected content.
Note the building behind the reflection on the window: in the Warp & Inpaint and StereoCrafter baselines, the building appears as near as the reflection; in our result the reflection is correctly in front of it.
Note that the reflection of the distant trees on the window appears close in the Warp & Inpaint and StereoCrafter baselines, while it should appear far, as it does in our result.
Note the reflection on the mug; in the Warp & Inpaint output it appears as near as the mug, while in our output it is correctly distant.
We show additional qualitative comparisons to:
[1] Deep3D: Fully Automatic 2D-to-3D Video Conversion with Deep Convolutional Neural Networks. Junyuan Xie, Ross Girshick, Ali Farhadi
[2] Dynamic Gaussian Marbles for Novel View Synthesis of Casual Monocular Videos. Colton Stearns, Adam Harley, Mikaela Uy, Florian Dubost, Federico Tombari, Gordon Wetzstein, Leonidas Guibas
[3] StereoCrafter: Diffusion-based Generation of Long and High-fidelity Stereoscopic 3D from Monocular Videos. Sijie Zhao, Wenbo Hu, Xiaodong Cun, Yong Zhang, Xiaoyu Li, Zhe Kong, Xiangjun Gao, Muyao Niu, Ying Shan
@misc{geyer2025eye2eye,
      title={Eye2Eye: A Simple Approach for Monocular-to-Stereo Video Synthesis},
      author={Michal Geyer and Omer Tov and Linyi Jin and Richard Tucker and Inbar Mosseri and Tali Dekel and Noah Snavely},
      year={2025},
      eprint={2505.00135},
      archivePrefix={arXiv},
      primaryClass={cs.CV},
      url={https://arxiv.org/abs/2505.00135},
}