
Training with an automated data pipeline
Voyager builds on Tencent’s earlier HunyuanWorld 1.0, released in July. Voyager is also part of Tencent’s broader “Hunyuan” ecosystem, which includes the Hunyuan3D-2 model for text-to-3D generation and the previously covered HunyuanVideo for video synthesis.
To train Voyager, researchers developed software that automatically analyzes existing videos to extract camera movements and calculate depth for every frame, eliminating the need for humans to manually label thousands of hours of footage. The system processed more than 100,000 video clips from both real-world recordings and the aforementioned Unreal Engine renders.
A diagram of the Voyager world creation pipeline.
Credit: Tencent
The model demands serious computing power to run, requiring at least 60GB of GPU memory for 540p resolution, though Tencent recommends 80GB for better results. Tencent released the model weights on Hugging Face and included code that works with both single- and multi-GPU setups.
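To put those memory figures in context, here is a small illustrative sketch (a hypothetical pre-flight check, not part of Tencent's released code) that classifies a GPU against the stated 60GB minimum and 80GB recommendation:

```python
# Hypothetical pre-flight check based on the stated requirements:
# at least 60 GB of GPU memory for 540p, 80 GB recommended.
MIN_GB = 60
RECOMMENDED_GB = 80

def memory_verdict(total_bytes: int) -> str:
    """Classify a GPU by whether it can run Voyager at 540p."""
    gb = total_bytes / (1024 ** 3)
    if gb >= RECOMMENDED_GB:
        return "recommended"
    if gb >= MIN_GB:
        return "minimum"
    return "insufficient"

# An 80 GB accelerator meets the recommended spec; a 24 GB consumer
# card falls well short of even the minimum.
print(memory_verdict(80 * 1024 ** 3))  # recommended
print(memory_verdict(24 * 1024 ** 3))  # insufficient
```

In practice, the total memory reported by the driver would be fed into a check like this; the point is simply that no single consumer GPU comes close to the floor.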
The model comes with notable licensing restrictions. Like other Hunyuan models from Tencent, the license prohibits use in the European Union, the United Kingdom, and South Korea. Additionally, commercial deployments serving over 100 million monthly active users require separate licensing from Tencent.
On the WorldScore benchmark developed by Stanford University researchers, Voyager reportedly achieved the highest overall score of 77.62, compared to 72.69 for WonderWorld and 62.15 for CogVideoX-I2V. The model reportedly excelled in object control (66.92), style consistency (84.89), and subjective quality (71.09), though it placed second in camera control (85.95) behind WonderWorld’s 92.98. WorldScore evaluates world generation approaches across multiple criteria, including 3D consistency and content alignment.
While these self-reported benchmark results look promising, wider deployment still faces hurdles given the computational muscle involved. For developers needing faster processing, the system supports parallel inference across multiple GPUs using the xDiT framework. Running on eight GPUs delivers processing speeds 6.69 times faster than single-GPU setups.
Given the processing power required and the limitations in generating long, coherent “worlds,” it may be a while before we see real-time interactive experiences using a similar technique. But as we’ve seen so far with experiments like Google’s Genie, we’re potentially witnessing very early steps into a new interactive, generative art form.