Eye2Eye: A simple approach for monocular-to-stereo video synthesis

Michal Geyer 1,2     Omer Tov 1     Richard Tucker 1     Linyi Jin 1     Inbar Mosseri 1     Tali Dekel 1, 2     Noah Snavely 1    
Google DeepMind 1     Weizmann Institute of Science 2
Work was done while M. Geyer was an intern at Google DeepMind.

Our model takes a real-world monocular video as an input right-eye view and produces a left-eye video, enabling stereoscopic viewing using 3D glasses or a VR headset.
Our pipeline does not rely on explicit depth estimation and warping, and can therefore plausibly handle
videos with specular and semi-transparent objects, such as the wine glass or the snowflakes shown above.

Abstract

The rising popularity of immersive visual experiences has increased interest in stereoscopic 3D video generation. Despite significant advances in video synthesis, creating 3D videos remains challenging due to the relative scarcity of 3D video data. We propose a simple approach for transforming a text-to-video generator into a video-to-stereo generator. Given an input video, our framework automatically produces the video frames from a shifted viewpoint, enabling a compelling 3D effect. Prior and concurrent approaches for this task typically operate in multiple phases: first estimating video disparity or depth, then warping the video accordingly to produce a second view, and finally inpainting the disoccluded regions. This approach inherently fails when the scene involves specular surfaces or transparent objects. In such cases, single-layer disparity estimation is insufficient, resulting in artifacts and incorrect pixel shifts during warping. Our work bypasses these restrictions by directly synthesizing the new viewpoint, avoiding any intermediate steps. This is achieved by leveraging a pre-trained video model's priors on geometry, object materials, optics, and semantics, without relying on external geometry models or manually disentangling geometry from the synthesis process. We demonstrate the advantages of our approach in complex, real-world scenarios featuring diverse object materials and compositions.

Pipeline

Pipeline Diagram

Our mono-to-stereo pipeline. We leverage the pre-trained cascaded text-to-video model Lumiere, together with a curated dataset of rectified stereo pairs, to tackle mono-to-stereo synthesis. We fine-tune the base (low-resolution) pre-trained model in two ways. First, we add input channels to condition the model on an input right-eye view, and train the base Eye2Eye generator on downsampled stereo pairs (top left). Second, we train a refiner model with the same conditioning mechanism, but only on non-downsampled crops (bottom left). The base Eye2Eye model captures the correct pixel disparity at low resolution, while the Eye2Eye refiner yields higher quality for pixels with large disparities when sampling at high resolution. Our sampling process (right) combines both models' strengths: we first generate a low-resolution output with the base Eye2Eye model to establish appropriate stereo disparity for a compelling 3D effect, then noise and denoise it with the Eye2Eye refiner to achieve high visual quality.
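
To make the conditioning mechanism concrete, below is a minimal PyTorch sketch of one common way to add input channels to a pre-trained denoiser: the first convolution is widened so that the noisy left-eye frames can be concatenated with the clean right-eye frames along the channel axis. The PretrainedDenoiser stand-in, the zero-initialization of the new weights, and the tensor shapes are illustrative assumptions, not the actual Lumiere architecture or training setup.

    import torch
    import torch.nn as nn

    # Hypothetical sketch: widening the first convolution of a pre-trained video
    # denoiser so the noisy left-eye frames can be concatenated with the clean
    # right-eye frames along the channel axis. "PretrainedDenoiser" is a toy
    # stand-in, and zero-initializing the new weights is a common practice we
    # assume here, not a confirmed detail of the actual model.

    class PretrainedDenoiser(nn.Module):
        def __init__(self, in_channels=3, hidden=64):
            super().__init__()
            self.conv_in = nn.Conv3d(in_channels, hidden, kernel_size=3, padding=1)
            self.conv_out = nn.Conv3d(hidden, 3, kernel_size=3, padding=1)

        def forward(self, x):
            return self.conv_out(torch.relu(self.conv_in(x)))

    def add_conditioning_channels(model, extra_channels=3):
        old = model.conv_in
        new = nn.Conv3d(old.in_channels + extra_channels, old.out_channels,
                        kernel_size=3, padding=1)
        with torch.no_grad():
            new.weight.zero_()                            # extra channels start at zero...
            new.weight[:, :old.in_channels] = old.weight  # ...so the widened model initially
            new.bias.copy_(old.bias)                      # behaves like the pre-trained one.
        model.conv_in = new
        return model

    model = add_conditioning_channels(PretrainedDenoiser())
    noisy_left = torch.randn(1, 3, 8, 64, 64)   # (batch, channels, frames, H, W)
    right_eye  = torch.randn(1, 3, 8, 64, 64)   # conditioning: the input right-eye video
    pred = model(torch.cat([noisy_left, right_eye], dim=1))
    print(pred.shape)  # torch.Size([1, 3, 8, 64, 64])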

Sample Results: Anaglyph / SBS

Click on the buttons to toggle between anaglyph and side-by-side (SBS) views.
Note: Viewing anaglyph content on Safari/iOS devices may not provide the optimal experience due to browser limitations. For the best experience, we recommend using Chrome on desktop.
For more videos, see the results page.
To see the results in a VR headset, use the headset browser and click here.

Model Ablation

Our full method combines the base Eye2Eye model and the Eye2Eye refiner by first generating a low-resolution output with the base model, then upsampling and noising it, and finally denoising it with the refiner. This results in a high-resolution video that maintains both a proper 3D effect and high visual quality (bottom right).
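
A minimal sketch of this two-stage sampling is shown below, assuming the two fine-tuned models are available as callables base_sample and refiner_denoise; the 4x downsampling factor, the 0.5 noise level, and the linear noising formula are illustrative assumptions rather than the exact procedure.

    import torch
    import torch.nn.functional as F

    # Hypothetical sketch of the two-stage sampling described above. `base_sample`
    # and `refiner_denoise` stand in for the fine-tuned base Eye2Eye model and the
    # Eye2Eye refiner; the downsampling factor, the noise level, and the linear
    # noising formula are assumptions for illustration only.

    def eye2eye_sample(right_eye_hi, base_sample, refiner_denoise, noise_level=0.5):
        # right_eye_hi: (batch, channels, frames, H, W) high-resolution right-eye video.

        # 1) Base Eye2Eye: synthesize the left eye at low resolution, which
        #    establishes the correct stereo disparity.
        right_lo = F.interpolate(right_eye_hi, scale_factor=(1, 0.25, 0.25),
                                 mode="trilinear", align_corners=False)
        left_lo = base_sample(right_lo)

        # 2) Upsample the low-resolution left eye back to full resolution.
        left_hi = F.interpolate(left_lo, size=right_eye_hi.shape[2:],
                                mode="trilinear", align_corners=False)

        # 3) Noise the upsampled video to an intermediate diffusion level, then
        #    denoise it with the refiner (conditioned on the high-resolution right
        #    eye) to recover fine detail while preserving the base model's disparity.
        noised = (1.0 - noise_level) * left_hi + noise_level * torch.randn_like(left_hi)
        return refiner_denoise(noised, right_eye_hi, start_level=noise_level)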

Base Eye2Eye
(low resolution,
correct parallax but low quality)

Base Eye2Eye
(high resolution sampling
reduces 3D effect)

Eye2Eye Refiner
(good quality,
but uniformly shifts the right eye)

Full Eye2Eye
(correct parallax and
good quality)

Comparison to baselines

We compare our method with the standard "Warp and Inpaint" approach, which relies on monocular depth estimation to warp the right-eye view to the left eye and uses an inpainting model to fill missing areas. We consider both our own implementation and the StereoCrafter [3] method.
This baseline struggles with videos containing specular reflections or transparent objects, as a single depth value cannot be assigned to each pixel (e.g., when the depth of the reflection and that of the surface differ).
Consequently, reflections and transparent surfaces are incorrectly warped, resulting in distorted or incorrect 3D effects. StereoCrafter additionally tends to produce warping artifacts, such as blurred edges and temporal inconsistencies.
In contrast, our method generates stereo RGB views directly, bypassing explicit depth estimation and leveraging the generative model's implicit knowledge of materials and optics to handle these challenging scenarios effectively.
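
For intuition, the sketch below shows a simplified single-layer warping step of the kind this family of baselines relies on (our illustration, not the StereoCrafter implementation): every pixel is shifted horizontally by exactly one disparity value, so a reflective or transparent pixel cannot be placed at both the surface depth and the depth of the content seen through or reflected in it, and disoccluded pixels are left for an inpainting model.

    import numpy as np

    # Simplified illustration of single-layer disparity warping; not taken from
    # any specific baseline implementation. Each right-eye pixel is splatted to
    # a single horizontal location in the left view, so reflections that should
    # appear at a different depth than the surface inherit the surface's shift.

    def warp_right_to_left(right_frame, disparity):
        # right_frame: (H, W, 3) image, disparity: (H, W) horizontal shift in pixels.
        h, w, _ = right_frame.shape
        left = np.zeros_like(right_frame)
        filled = np.zeros((h, w), dtype=bool)
        xs = np.arange(w)
        for y in range(h):
            x_left = np.clip(np.rint(xs + disparity[y]).astype(int), 0, w - 1)
            left[y, x_left] = right_frame[y, xs]
            filled[y, x_left] = True
        # Pixels that were never written are disocclusions, which the baseline
        # hands to an inpainting model; reflections receive no special treatment.
        return left, ~filled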

For more videos, please see the comparisons page.
To view the comparisons in a VR headset, use the headset browser and click here.

Ours

Warp & Inpaint

Stereo-Crafter

Note the 3D effect of the reflection of the woman's face: in our output it appears near, while it shows no 3D effect in the Warp & Inpaint and Stereo-Crafter baselines, since the depth estimation does not account for the reflection.

Ours

Warp & Inpaint

Stereo-Crafter

Note the buildings behind the umbrella; in the Warp & Inpaint and Stereo-Crafter baselines, they appear to be as near as the umbrella, while in our result they are correctly distant.

Ours

Warp & Inpaint

Stereo-Crafter

Note the depth of the reflection: it should appear near, as in our result. The Warp & Inpaint baseline fails to create a 3D effect for this reflection, and while the Stereo-Crafter result has some 3D in the reflection area, it is full of artifacts.

Ours

Warp & Inpaint

Stereo-Crafter

Note the depth difference between the faraway content reflected in the window and the woman's face: in the Warp & Inpaint and Stereo-Crafter baselines, they both have the same depth, while in our result the woman's face is correctly closer than the reflected content.

Ours

Warp & Inpaint

Stereo-Crafter

Note the building behind the reflection on the window: in the Warp & Inpaint and Stereo-Crafter baselines, the building appears as near as the reflection, while in our result the reflection is correctly in front of it.

Ours

Warp & Inpaint

Stereo-Crafter

Note that the reflection of the distant trees on the window appears close in the Warp & Inpaint and Stereo-Crafter baselines, while it should appear far, as in our result.

Ours

Warp & Inpaint

Stereo-Crafter

Note the reflection on the mug; in the Warp & Inpaint output it is as near as the mug, while in our output it is correctly distant.

Additional qualitative comparisons to baselines

We show additional qualitative comparisons to:

  • Deep3D [1], which trains a deep CNN to predict stereo from monocular videos. This early work often does not produce a compelling 3D effect.
  • Dynamic Gaussian Marbles (DGM) [2], a 4D reconstruction method for monocular videos, from which we synthesize stereo views. This method does not leverage a generative prior and relies only on information appearing in the video, so it cannot inpaint missing content, leading to holes in the synthesized views. Additionally, as it uses monocular depth estimation as a regularizer, it often fails to correctly model the geometry in challenging settings involving specular reflections, refractions, etc.
References

[1] Deep3D: Fully Automatic 2D-to-3D Video Conversion with Deep Convolutional Neural Networks. Junyuan Xie, Ross Girshick, Ali Farhadi.

[2] Dynamic Gaussian Marbles for Novel View Synthesis of Casual Monocular Videos. Colton Stearns, Adam Harley, Mikaela Uy, Florian Dubost, Federico Tombari, Gordon Wetzstein, Leonidas Guibas.

[3] StereoCrafter: Diffusion-based Generation of Long and High-fidelity Stereoscopic 3D from Monocular Videos. Sijie Zhao, Wenbo Hu, Xiaodong Cun, Yong Zhang, Xiaoyu Li, Zhe Kong, Xiangjun Gao, Muyao Niu, Ying Shan.

BibTeX

    @misc{geyer2025eye2eye,
          title={Eye2Eye: A Simple Approach for Monocular-to-Stereo Video Synthesis}, 
          author={Michal Geyer and Omer Tov and Linyi Jin and Richard Tucker and Inbar Mosseri and Tali Dekel and Noah Snavely},
          year={2025},
          eprint={2505.00135},
          archivePrefix={arXiv},
          primaryClass={cs.CV},
          url={https://arxiv.org/abs/2505.00135}, 
    }