Eye2Eye: A simple approach for monocular-to-stereo video synthesis

Michal Geyer 1,2     Omer Tov 1     Richard Tucker 1     Linyi Jin 1     Inbar Mosseri 1     Tali Dekel 1, 2     Noah Snavely 1    
Google DeepMind 1     Weizmann Institute of Science 2
Work was done while M. Geyer was an intern at Google DeepMind.

Our model takes a real-world monocular video as an input right-eye view and produces a left-eye video, enabling stereoscopic viewing using 3D glasses or a VR headset.
Our pipeline does not rely on explicit depth estimation and warping, and can therefore plausibly handle
videos with specular and semi-transparent objects, such as the wine glass or the snowflakes shown above.

Abstract

The rising popularity of immersive visual experiences has increased interest in stereoscopic 3D video generation. Despite significant advances in video synthesis, creating 3D videos remains challenging due to the relative scarcity of 3D video data. We propose a simple approach for transforming a text-to-video generator into a video-to-stereo generator. Given an input video, our framework automatically produces the video frames from a shifted viewpoint, enabling a compelling 3D effect. Prior and concurrent approaches for this task typically operate in multiple phases: first estimating video disparity or depth, then warping the video accordingly to produce a second view, and finally inpainting the disoccluded regions. This approach inherently fails when the scene involves specular surfaces or transparent objects. In such cases, single-layer disparity estimation is insufficient, resulting in artifacts and incorrect pixel shifts during warping. Our work bypasses these restrictions by directly synthesizing the new viewpoint, avoiding any intermediate steps. This is achieved by leveraging a pre-trained video model's priors on geometry, object materials, optics, and semantics, without relying on external geometry models or manually disentangling geometry from the synthesis process. We demonstrate the advantages of our approach in complex, real-world scenarios featuring diverse object materials and compositions.

Pipeline

Pipeline Diagram

Our mono-to-stereo pipeline. We leverage the pre-trained cascaded text-to-video model Lumiere, together with a curated dataset of rectified stereo pairs, to tackle mono-to-stereo synthesis. We fine-tune the base (low-resolution) pre-trained model in two ways. First, we add input channels to condition the model on an input right-eye view, and train the base Eye2Eye generator on downsampled stereo pairs (top left). Second, we train a refiner model with the same conditioning mechanism, but only on non-downsampled crops (bottom left). The base Eye2Eye model captures the correct pixel disparity at low resolution, while the Eye2Eye refiner yields higher quality for pixels with large disparities when sampling at high resolution. Our sampling process (right) combines both models' strengths: we first generate a low-resolution output with the base Eye2Eye model to establish appropriate stereo disparity for a compelling 3D effect, then noise and denoise it with the Eye2Eye refiner to achieve high visual quality.
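
To make the conditioning mechanism concrete, below is a minimal PyTorch sketch of one common way to add input channels to a pre-trained denoiser: the first convolution is widened so that the noisy left-eye frames can be concatenated with the clean right-eye frames along the channel axis. The PretrainedDenoiser stand-in, the zero-initialization of the new weights, and the tensor shapes are illustrative assumptions, not the actual Lumiere architecture or training setup.

    import torch
    import torch.nn as nn

    # Hypothetical sketch: widening the first convolution of a pre-trained video
    # denoiser so the noisy left-eye frames can be concatenated with the clean
    # right-eye frames along the channel axis. "PretrainedDenoiser" is a toy
    # stand-in, and zero-initializing the new weights is a common practice we
    # assume here, not a confirmed detail of the actual model.

    class PretrainedDenoiser(nn.Module):
        def __init__(self, in_channels=3, hidden=64):
            super().__init__()
            self.conv_in = nn.Conv3d(in_channels, hidden, kernel_size=3, padding=1)
            self.conv_out = nn.Conv3d(hidden, 3, kernel_size=3, padding=1)

        def forward(self, x):
            return self.conv_out(torch.relu(self.conv_in(x)))

    def add_conditioning_channels(model, extra_channels=3):
        old = model.conv_in
        new = nn.Conv3d(old.in_channels + extra_channels, old.out_channels,
                        kernel_size=3, padding=1)
        with torch.no_grad():
            new.weight.zero_()                            # extra channels start at zero...
            new.weight[:, :old.in_channels] = old.weight  # ...so the widened model initially
            new.bias.copy_(old.bias)                      # behaves like the pre-trained one.
        model.conv_in = new
        return model

    model = add_conditioning_channels(PretrainedDenoiser())
    noisy_left = torch.randn(1, 3, 8, 64, 64)   # (batch, channels, frames, H, W)
    right_eye  = torch.randn(1, 3, 8, 64, 64)   # conditioning: the input right-eye video
    pred = model(torch.cat([noisy_left, right_eye], dim=1))
    print(pred.shape)  # torch.Size([1, 3, 8, 64, 64])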

Sample Results: Anaglyph / SBS

Click on the buttons to toggle between anaglyph and side-by-side (SBS) views.
Note: Viewing anaglyph content on Safari/iOS devices may not provide the optimal experience due to browser limitations. For the best experience, we recommend using Chrome on desktop.
For more videos, see the results page.
To see the results in a VR headset, use the headset browser and click here.

Model Ablation

Our full method combines the base Eye2Eye model and the Eye2Eye refiner by first generating a low-resolution output with the base model, then upsampling and noising it, and finally denoising it with the refiner. This results in a high-resolution video that maintains both a proper 3D effect and high visual quality (bottom right).
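
A minimal sketch of this two-stage sampling is shown below, assuming the two fine-tuned models are available as callables base_sample and refiner_denoise; the 4x downsampling factor, the 0.5 noise level, and the linear noising formula are illustrative assumptions rather than the exact procedure.

    import torch
    import torch.nn.functional as F

    # Hypothetical sketch of the two-stage sampling described above. `base_sample`
    # and `refiner_denoise` stand in for the fine-tuned base Eye2Eye model and the
    # Eye2Eye refiner; the downsampling factor, the noise level, and the linear
    # noising formula are assumptions for illustration only.

    def eye2eye_sample(right_eye_hi, base_sample, refiner_denoise, noise_level=0.5):
        # right_eye_hi: (batch, channels, frames, H, W) high-resolution right-eye video.

        # 1) Base Eye2Eye: synthesize the left eye at low resolution, which
        #    establishes the correct stereo disparity.
        right_lo = F.interpolate(right_eye_hi, scale_factor=(1, 0.25, 0.25),
                                 mode="trilinear", align_corners=False)
        left_lo = base_sample(right_lo)

        # 2) Upsample the low-resolution left eye back to full resolution.
        left_hi = F.interpolate(left_lo, size=right_eye_hi.shape[2:],
                                mode="trilinear", align_corners=False)

        # 3) Noise the upsampled video to an intermediate diffusion level, then
        #    denoise it with the refiner (conditioned on the high-resolution right
        #    eye) to recover fine detail while preserving the base model's disparity.
        noised = (1.0 - noise_level) * left_hi + noise_level * torch.randn_like(left_hi)
        return refiner_denoise(noised, right_eye_hi, start_level=noise_level)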

Base Eye2Eye
(low resolution,
correct parallax but low quality)

Base Eye2Eye
(high resolution sampling
reduces 3D effect)

Eye2Eye Refiner
(good quality,
but uniformly shifts the right eye)

Full Eye2Eye
(correct parallax and
good quality)

Comparison to baselines

We compare our method with the standard "Warp and Inpaint" approach, which relies on monocular depth estimation to warp the right-eye view to the left eye and uses an inpainting model to fill missing areas. We consider both our own implementation and the StereoCrafter [3] method.
This baseline struggles with videos containing specular reflections or transparent objects, as a single depth value cannot be assigned to each pixel (e.g., when the depth of the reflection and that of the surface differ).
Consequently, reflections and transparent surfaces are incorrectly warped, resulting in distorted or incorrect 3D effects. StereoCrafter additionally tends to produce warping artifacts, such as blurred edges and temporal inconsistencies.
In contrast, our method generates stereo RGB views directly, bypassing explicit depth estimation and leveraging the generative model's implicit knowledge of materials and optics to handle these challenging scenarios effectively.
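
For intuition, the sketch below shows a simplified single-layer warping step of the kind this family of baselines relies on (our illustration, not the StereoCrafter implementation): every pixel is shifted horizontally by exactly one disparity value, so a reflective or transparent pixel cannot be placed at both the surface depth and the depth of the content seen through or reflected in it, and disoccluded pixels are left for an inpainting model.

    import numpy as np

    # Simplified illustration of single-layer disparity warping; not taken from
    # any specific baseline implementation. Each right-eye pixel is splatted to
    # a single horizontal location in the left view, so reflections that should
    # appear at a different depth than the surface inherit the surface's shift.

    def warp_right_to_left(right_frame, disparity):
        # right_frame: (H, W, 3) image, disparity: (H, W) horizontal shift in pixels.
        h, w, _ = right_frame.shape
        left = np.zeros_like(right_frame)
        filled = np.zeros((h, w), dtype=bool)
        xs = np.arange(w)
        for y in range(h):
            x_left = np.clip(np.rint(xs + disparity[y]).astype(int), 0, w - 1)
            left[y, x_left] = right_frame[y, xs]
            filled[y, x_left] = True
        # Pixels that were never written are disocclusions, which the baseline
        # hands to an inpainting model; reflections receive no special treatment.
        return left, ~filled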

For more videos, please see the comparisons page.
To view the comparisons in a VR headset, use the headset browser and click here.

Ours

Warp & Inpaint

Stereo-Crafter

Note the 3D effect of the reflection of the woman's face: in our output it appears near, while it shows no 3D effect in the Warp & Inpaint and Stereo-Crafter baselines, since the depth estimation does not account for the reflection.

Ours

Warp & Inpaint

Stereo-Crafter

Note the buildings behind the umbrella; in the Warp & Inpaint and Stereo-Crafter baselines, they appear to be as near as the umbrella, while in our result they are correctly distant.

Ours

Warp & Inpaint

Stereo-Crafter

Note the depth of the reflection: it should appear near, as in our result. The Warp & Inpaint baseline fails to create a 3D effect for this reflection, and while the Stereo-Crafter result has some 3D in the reflection area, it is full of artifacts.

Ours

Warp & Inpaint

Stereo-Crafter

Note the depth difference between the faraway content reflected in the window and the woman's face: in the Warp & Inpaint and Stereo-Crafter baselines, they both have the same depth, while in our result the woman's face is correctly closer than the reflected content.

Ours

Warp & Inpaint

Stereo-Crafter

Note the building behind the reflection on the window: in the Warp & Inpaint and Stereo-Crafter baselines, the building appears as near as the reflection, while in our result the reflection is correctly in front of it.

Ours

Warp & Inpaint

Stereo-Crafter

Note that the reflection of the distant trees on the window appears close in the Warp & Inpaint and Stereo-Crafter baselines, while it should appear far, as in our result.

Ours

Warp & Inpaint

Stereo-Crafter

Note the reflection on the mug; in the Warp & Inpaint output it is as near as the mug, while in our output it is correctly distant.

Additional qualitative comparisons to baselines

We show additional qualitative comparisons to:

  • Deep3D [1], which trains a deep CNN to predict stereo from monocular videos. This early work often does not produce a compelling 3D effect.
  • Dynamic Gaussian Marbles (DGM) [2], a 4D reconstruction method for monocular videos, from which we synthesize stereo views. This method does not leverage a generative prior and relies only on information appearing in the video, so it cannot inpaint missing content, leading to holes in the synthesized views. Additionally, as it uses monocular depth estimation as a regularizer, it often fails to correctly model the geometry in challenging settings involving specular reflections, refractions, etc.
References

[1] Deep3D: Fully Automatic 2D-to-3D Video Conversion with Deep Convolutional Neural Networks. Junyuan Xie, Ross Girshick, Ali Farhadi.

[2] Dynamic Gaussian Marbles for Novel View Synthesis of Casual Monocular Videos. Colton Stearns, Adam Harley, Mikaela Uy, Florian Dubost, Federico Tombari, Gordon Wetzstein, Leonidas Guibas.

[3] StereoCrafter: Diffusion-based Generation of Long and High-fidelity Stereoscopic 3D from Monocular Videos. Sijie Zhao, Wenbo Hu, Xiaodong Cun, Yong Zhang, Xiaoyu Li, Zhe Kong, Xiangjun Gao, Muyao Niu, Ying Shan.

BibTeX

    @misc{geyer2025eye2eye,
          title={Eye2Eye: A Simple Approach for Monocular-to-Stereo Video Synthesis}, 
          author={Michal Geyer and Omer Tov and Linyi Jin and Richard Tucker and Inbar Mosseri and Tali Dekel and Noah Snavely},
          year={2025},
          eprint={2505.00135},
          archivePrefix={arXiv},
          primaryClass={cs.CV},
          url={https://arxiv.org/abs/2505.00135}, 
    }