I got a laugh recently showing this silly 😛 before-and-after to a family member.

But this expensive silliness is just an example of an otherwise very versatile technique.

I still remember that for most of ChatGPT's brief lifetime, this kind of prompt-driven image manipulation, where you edit an existing image by moving it around in an autoencoder's latent space, was not possible. The only models available for most of that time were DALL-E-style text-to-image models, trained on a dataset of text and image pairs.

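But now, as of around 2025-09-20 at least, I see we also have access to an architecture along the lines of CLIP (Contrastive Language-Image Pre-training), which puts text and images into one shared latent space and so allows cross-modal latent manipulation.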
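
To make the idea of cross-modal latent manipulation a bit more concrete, here is a toy sketch in Python. It is not how any particular product works. It assumes only that text and images share one embedding space, so the difference between two text embeddings can serve as an edit direction for an image latent. The three-dimensional vectors and the names `plain_face`, `silly_face`, `photo`, and `edit_latent` are all made up for illustration; a real encoder would give you hundreds of dimensions, and a decoder would still have to turn the edited latent back into pixels.

```python
import math


def normalize(v):
    """Scale a vector to unit length."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def cosine(a, b):
    """Cosine similarity between two vectors."""
    a, b = normalize(a), normalize(b)
    return sum(x * y for x, y in zip(a, b))


def edit_latent(image_latent, source_text, target_text, strength=1.0):
    """Shift an image latent along the direction from source_text to target_text.

    Assumption: all three vectors live in the same shared embedding space.
    """
    direction = [t - s for s, t in zip(source_text, target_text)]
    moved = [z + strength * d for z, d in zip(image_latent, direction)]
    return normalize(moved)


# Hypothetical embeddings, hand-written for this example only.
plain_face = [1.0, 0.0, 0.2]  # stand-in for the text "a face"
silly_face = [0.0, 1.0, 0.2]  # stand-in for the text "a silly face"
photo = [0.9, 0.1, 0.4]       # stand-in for the encoded "before" photo

edited = edit_latent(photo, plain_face, silly_face, strength=0.8)

print("before:", round(cosine(photo, silly_face), 3))
print("after: ", round(cosine(edited, silly_face), 3))
```

The edited latent ends up much closer to the "silly face" text than the original photo was, and the `strength` knob controls how far the edit goes.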

So yes, silly photo chopping is possible. But I think semantic manipulation in latent space can open the door to vibe editing of all kinds of modalities, including of course code, so the models behind Cursor are probably close cousins here.