Learn The Fundamentals Of Becoming An AI Engineer On Scrimba: https://v2.scrimba.com/the-ai-engineer-path-c02v?via=developersdigest

In today's video, I discuss Google's latest announcement, Genie 2. This powerful foundation model can generate diverse, dynamic, and interactive 3D environments from a single image. Whether for gaming or virtual reality, the possibilities are vast. Genie 2 leverages extensive video-data training to create immersive experiences complete with character animations, object interactions, and realistic physics. Join me as I explore its potential applications, real examples of how it works, and its future prospects.

00:00 Introduction to Genie 2
00:18 Creating Virtual Environments from a Single Image
00:46 Training and Capabilities of Genie 2
01:44 Interacting with Generated Worlds
02:16 Diverse Trajectories and Counterfactual Experiences
02:45 Potential Applications in VR and Video Games
03:36 Emergent Capabilities and Physics
04:46 Prototyping and Creative Uses
06:43 Technical Details of Genie 2
08:34 Conclusion and Viewer Engagement
---
type: transcript
date: 2024-12-05
youtube_id: Ustm9JBoDcM
---

# Transcript: Genie 2: Google's New AI Model Turns One Image into Infinite Playable Worlds

Just today Google announced Genie 2, a foundation model that's capable of generating an endless variety of action-controllable, playable 3D environments. In the examples that I'm showing you here, all of these are being played and controlled almost as if they were a video game. The way these are created is based on a single image. Imagine just being able to pass in an image, which could be an AI-generated image or a photo you've taken, and ultimately be able to infer what the environment is around it. You can essentially create these simulated virtual environments, and they include whatever the consequences are of taking actions, whether you're swimming or jumping or acting as a character.

The way this was trained is similar to other generative models: it was trained on a large video dataset, and from that training there are capabilities that emerge at that scale, such as being able to detect objects, complex character animations, physics, and the ability to model, and thus predict, the behavior of other agents.

In all the examples that I showed you, the images are actually also generated from a single image from their image generation model, Imagen. You can think of it as something similar to DALL-E or Midjourney; it's Google's version of that diffusion-style model, where you can put in a text prompt and it will generate a photo for you. They describe in the blog post that this effectively means anyone can describe a world in text, select their favorite rendering of the idea, and interact with that newly created world.

How this works is that when you're using your keyboard or mouse to interact with the model, the model simulates the next response, almost like the next token that we see from things like ChatGPT, the GPT series of models, and large language models in general. Another thing that I thought was impressive is that Genie 2 can generate consistent world models for up to a minute, and the majority of examples that I'm showing you in this video are between 10 and 20 seconds. As you see here, there are a bunch of different examples.

Another thing that they highlight is that it can generate diverse trajectories from the same starting frame, which means it's possible to simulate counterfactual experiences for training agents. In these two examples here, every video starts at the same frame, but as you can see, by the end of the video they have very different ultimate outcomes. There's a great distinction on the bottom there of what it's like to take one path versus another path, and the model will infer the subsequent steps.
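To make that idea of diverse trajectories concrete, here is a minimal sketch of counterfactual rollouts, assuming a toy stand-in for the dynamics model; Genie 2's actual model and API are not public, so every name here is hypothetical.

```python
# Hypothetical sketch: counterfactual rollouts from one starting frame.
# `toy_dynamics` stands in for Genie 2's learned dynamics model.
import numpy as np

ACTION_EFFECTS = {
    "left":  np.array([-1.0, 0.0]),
    "right": np.array([1.0, 0.0]),
    "jump":  np.array([0.0, 1.0]),
}

def toy_dynamics(latent: np.ndarray, action: str) -> np.ndarray:
    # The next latent depends on the current latent and the chosen action.
    return np.tanh(latent + 0.5 * ACTION_EFFECTS[action])

start_latent = np.zeros(2)  # the same starting frame for every rollout
paths = {"path A": ["left", "left", "jump"],
         "path B": ["right", "right", "jump"]}

for name, actions in paths.items():
    z = start_latent.copy()
    for a in actions:            # each action conditions the next frame
        z = toy_dynamics(z, a)
    print(name, np.round(z, 3))  # identical start, diverging outcomes
```

Running both paths from the identical starting latent yields different end states, which is the property that makes counterfactual training data for agents possible.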
It's really cool and imaginative to think about, obviously in a video game context, but also in the context of something like a virtual reality environment. If we're suddenly able to generate whatever worlds we want to play around in or exist in, I could really see this becoming a popular use case within VR or within video games. As you can see, it can create a diverse set of environments as well as things like 3D structures. The other cool thing is that there's even the ability to interact with different objects. You can see in some of these examples, like shooting a barrel or jumping into a balloon, that the model produces the appropriate response; similarly, in this example of walking through a door, it's able to detect that object. Those are some of the emergent capabilities.

And as you see, here's an example of the model interacting with NPCs, characters that are within the environment rather than characters you're playing, and having dynamic responses based on your interaction with those characters.

Then if we look at the physics: I know we're probably going to get some comments on this video saying it looks terrible, or what have you, but mind you, this is just the beginning. I'd imagine in a year or two these are going to look really impressive. This might look like an old video game, especially given how sophisticated video games are now, but as these models improve, along with the scale of data and the techniques used to develop them, it's a safe bet that the results are going to improve dramatically. There are examples of smoke, gravity, and lighting; we can even see reflections. And here's an example of playing a world environment based on real-world images: say you take a nice majestic photo and you want to walk around within that environment.

Another thing they call out in the blog post is that Genie makes it easy to rapidly prototype diverse interactive experiences, which enables researchers to quickly experiment with novel environments to train and test AI agents. Here are just a few more examples: a paper plane, a dragon, a bird, as well as what looks like a parachute. They describe this as a great tool for artists and designers to quickly prototype and bootstrap new environments and help the creative process. The way they're describing it, it's not something that's going to replace a complicated video game, at least not in the short term; instead, you'd use it as a tool to augment and ideate different ideas for how that world could potentially look.

Here's an example of an image generated from Imagen, where the prompt was: a screenshot of a third-person open-world exploration game, the player is an adventurer exploring a forest, there is a house with a red door on the left and a house with a blue door on the right, the camera is placed directly behind the player, photorealistic and immersive. Based on that, the SIMA agent they designed, which is built to complete complex tasks across a range of 3D games by following natural language instructions, is given the instruction to open the blue door from that first initial frame, and then in the second one, to open the red door. You can start to see how you could take one image and have an agent run a command, or a series of commands, especially with the model's ability to run for up to a minute. It could give you a lot of ideas, maybe if you're a game developer, to think about what's inside that house, or what's around that corner, or what the environment looks like as a whole. And finally, here are a few more examples of images generated from Imagen, and down here are three different natural language prompts for that SIMA agent, all of which are rendering different environments on the fly.

Finally, just to close out with a couple of technical pieces. As they describe in the blog post, Genie 2 is an autoregressive latent diffusion model trained on a large video dataset. After passing through an autoencoder, latent frames from the video are passed to a large Transformer dynamics model trained with a causal mask, similar to that used by large language models. At inference time, Genie 2 can be sampled in an autoregressive fashion, taking individual actions and past latent frames on a frame-by-frame basis. They use classifier-free guidance to improve action controllability. The samples in the blog post are generated by an undistilled base model to show what's possible; a distilled version can be played in real time, with a reduction in the quality of the outputs.
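As a rough illustration of that classifier-free guidance step, here is a minimal sketch assuming an epsilon-predicting denoiser; the function, its behavior, and the guidance scale are all illustrative stand-ins, since Genie 2's internals are not published.

```python
# Minimal sketch of classifier-free guidance for action conditioning.
# `denoise` is a placeholder for the dynamics model's noise prediction.
import numpy as np

def denoise(noisy_latent, past_latents, action=None):
    # Stand-in epsilon prediction; a real model would attend over the
    # past latents with a causal mask, optionally conditioned on action.
    bias = 0.0 if action is None else 0.1
    return 0.9 * noisy_latent + bias

def guided_denoise(noisy_latent, past_latents, action, scale=3.0):
    eps_uncond = denoise(noisy_latent, past_latents)        # action dropped
    eps_cond = denoise(noisy_latent, past_latents, action)  # action kept
    # Extrapolating toward the conditioned prediction strengthens how
    # much the chosen action steers the next frame:
    return eps_uncond + scale * (eps_cond - eps_uncond)
```

The same trick is standard in text-to-image diffusion models; here the conditioning signal is the player's action rather than a text prompt.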
This video is brought to you by Scrimba, the innovative coding platform that brings interactive learning to life. Dive into a variety of courses, from AI engineering to frontend, Python, UI design, and much more. Scrimba's game-changing feature is their unique scrim screencast format, which lets you pause the lesson anytime and start directly editing the teacher's code. Their curriculum is built in collaboration with industry leaders including Mozilla MDN, Hugging Face, and LangChain, includes building applications with OpenAI, Claude, and Mistral models, and guides you on deploying projects to platforms like Cloudflare. While AI tools can assist with coding, a solid grasp of the fundamentals is essential for gaining real experience. Scrimba offers something for everyone, from complete beginners to advanced developers, and about 80% of Scrimba's content is completely free. Sign up for a free account today using my link below and enjoy an extra 20% discount on their Pro plans when you're ready to upgrade. I'm sure you'll love it.

Here's the flow: text-to-image based on the Imagen model, and then that image is passed into the encoder. Then, depending on the keyboard shortcuts used in video games, W is usually forward, A is usually left, and in this example E represents attack. Here we see how this effectively works: based on the command, the decoder generates the next frame. So from W to A, it's taking the image frame by frame, and based on each action it generates the next image each time. (There's a sketch of this loop at the end of the transcript.)

Finally, they say that this shows the potential of foundation world models for creating diverse 3D environments and accelerating agent research. They describe their research as working towards building more general AI systems and agents that can understand and safely carry out a wide range of tasks in a way that is helpful to people online and in the real world. There are some interesting outtakes here of a few different videos, but otherwise that's pretty much it.

What do you think? Is this something that you're interested in exploring? Is this something that you would use as if it were a video game? Say, if in the future you could generate an hour's worth of content, would you think of spinning up a video game and playing it on the fly, or within VR? Would this be a use case where maybe you'd finally consider buying and using a VR headset? Maybe something like this could make VR more and more interesting, especially over time as these models improve in quality. But otherwise, that's pretty much it for this video. If you found it useful, please like, comment, share, and subscribe. Otherwise, until the next one.
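For readers who want the image-to-encoder-to-decoder flow above in code form, here is a minimal end-to-end sketch; every class, method, and key mapping is a hypothetical stand-in, since Genie 2 has no public API.

```python
# Hypothetical sketch of the encode -> act -> decode loop described above.
# All components are toy stand-ins; Genie 2 has no public API.
import numpy as np

KEYMAP = {"w": "forward", "a": "left", "e": "attack"}  # as in the example

class ToyWorldModel:
    def encode(self, image: np.ndarray) -> np.ndarray:
        return image.mean(axis=(0, 1))              # image -> latent frame

    def next_latent(self, history: list, action: str) -> np.ndarray:
        # Stand-in for the autoregressive dynamics model: the next latent
        # depends on the past latents plus the current action.
        drift = 0.1 * (1 + len(action))
        return np.tanh(history[-1] + drift)

    def decode(self, latent: np.ndarray) -> np.ndarray:
        return np.tile(latent, (4, 4, 1))           # latent -> rendered frame

model = ToyWorldModel()
image = np.random.default_rng(0).random((8, 8, 3))  # the single input image
history = [model.encode(image)]

for key in ["w", "w", "a", "e"]:                    # player keystrokes
    action = KEYMAP[key]
    history.append(model.next_latent(history, action))
    frame = model.decode(history[-1])               # one new frame per action
    print(key, "->", action, "frame shape:", frame.shape)
```

Each keystroke maps to an action, the dynamics model predicts the next latent frame conditioned on everything so far, and the decoder renders it, which is the frame-by-frame, next-token-style loop described earlier.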