> For visual reasoning practice, we can do supervised fine-tuning on sequences similar to the marble example above. For instance, to understand more about the physical world, we can show the model sequential pictures of Slinkys going down stairs, or basketball players shooting 3-pointers, or people hammering birdhouses together....
> But where will we get all this training data? For spatial and physical reasoning tasks, we can leverage computer graphics to generate synthetic data. This approach is particularly valuable because simulations provide a controlled environment where we can create scenarios with known outcomes, making it easy to verify the model's predictions. But we'll also need real-world examples. Fortunately, there's an abundance of video content online that we can tap into. While initial datasets might require human annotation, soon models themselves will be able to process videos and their transcripts to extract training examples automatically.
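To make the quoted simulation idea concrete, here is a minimal sketch of a "known outcome" synthetic example generator. It is purely illustrative: the event, the question format, and the field names are my own assumptions, not anything from the article's actual pipeline.

```python
# Simulate a trivially simple physical event, then package it as an
# (observation, question, answer) pair whose answer is known by construction.
import json

def simulate_drop(height_m: float, dt: float = 0.01, g: float = 9.81) -> list[float]:
    """Ball released from rest; record its height at each timestep until impact."""
    t, heights = 0.0, []
    y = height_m
    while y > 0:
        heights.append(round(y, 3))
        t += dt
        y = height_m - 0.5 * g * t * t
    return heights

def make_example(height_m: float) -> dict:
    heights = simulate_drop(height_m)
    return {
        "frames": heights,  # stand-in for rendered frames from a graphics engine
        "question": "Where is the ball at the end of the clip?",
        "answer": "on the ground",  # verifiable: the simulator defines the outcome
    }

print(json.dumps(make_example(2.0))[:120])
```

The appeal of this setup, as the quoted passage notes, is that the label comes for free from the simulator, so verifying a model's prediction is trivial.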
This seems to ignore the mixed record of video generation models:
Almost every video generator makes constant "folk physics" errors and doesn't understand object permanence. DeepMind's Veo2 is very impressive but still struggles with object permanence and qualitatively nonsensical physics: https://x.com/Norod78/status/1894438169061269750
Humans do not learn these things by pure observation (newborns understand object permanence, I suspect this is the case for all vertebrates). I doubt transformers are capable of learning it as robustly, even if trained on all of YouTube. There will always be "out of distribution" physical nonsense involving mistakes humans (or lizards) would never make, even if they've never seen the specific objects.
> newborns understand object permanence
Is that why the peekaboo game is funny for babies? The violated expectation at the soul of the comedy?
Yeah, I had thought that newborns famously didn't understand object permanence and that it developed sometime during their first year. And that was why peekaboo is fun: you're essentially popping in and out of existence.
This is a case where early 20th century psychology is wrong, yet still propagates as false folk knowledge:
https://en.wikipedia.org/wiki/Object_permanence#Contradictin...
I don't think it's so clear cut. The wikipedia article cites a 1971 book and an article from 1991 as sources.
A paper from 2014 is not so sure this is a settled matter:
"Infant object permanence is still an enigma after four decades of research. Is it an innate endowment, a developmental attainment, or an abstract idea not attributable to non-verbal infants?"
and argues that object permanence is not in fact innate but developed:
"It is argued that object permanence is not innately specified, but develops. The theory proposed here is that object permanence is an attainment that grows from a developmentally prior understanding of object identity"
"In sum, it is posited that permanence is initially dependent on the nature of the occlusion; with development, it becomes a property of objects. Even for 10–12-month-olds, object permanence is still a work-in-progress, manifested on one disappearance transformation but not another."
https://pmc.ncbi.nlm.nih.gov/articles/PMC4215949/
My kid used to drop things and look down at them from his chair. I didn't understand what he was trying to do. I learned that it was his way of trying to understand how the world works: if he dropped something, would it remain there or disappear?
That seems to contradict the claim that object permanence is innate, unless there is another explanation.
My (much) older kid still does ridiculous things that defy reason (and usually end up with something broken). I don't think it's fair to say that every action they take has some deeper meaning behind it.
"Why were you throwing the baseball at the running fan?" "I don't know...I was bored."
I'm not sure that stuff dies out fully with adulthood. I imagine part of Musk doing iffy salutes or Trump doing weird tariffs is curiosity: what if I do this odd thing, I wonder what happens? I can think of examples with myself too - I'm kind of learning short-term trading and the behaviors can be counterintuitive.
A child doesn't always have a mental "why" behind their actions. Sometimes kids just behave in coarse, playful ways, and those ways happen to be very useful for mental development.
If it's a round thing it might well disappear. Even with a full understanding that things don't cease to exist when they leave your field of view, "where do things end up if I drop them" is a pretty big field to experiment with. YouTube is full of videos of adults doing the same, just from more extreme heights or other unusual scenarios (since as adults we have a solid grasp of the simpler scenarios).
> I learned that it was his way of trying to understand how world works. That if he dropped something, will it remain there or disappear.
I don’t understand how you learned this. Who told you that’s what was going on in his head?
That's odd. This was content that was still on the MCAT when I took it last year. I even remember having the formation of object permanence at ~0-2 years of age on my flashcards.
It’s far from perfect in newborns (containers take some time to understand, and in general infants have weak short-term memory). I also wonder how much effort goes into updating MCAT questions with new scientific developments, especially when there is limited clinical significance - an infant who struggles with object permanence likely has serious neurological/cognitive problems across the board.
Have you checked lately, though?
babies pretty much laugh if you're laughing and being silly
Can confirm. Is quite fun.
All passed me by, sad to say, hence the guessing. For what I had to start with I've done pretty well, and I think no one ever really sees their each and every last hope come true. Maybe next time.
No, they understand object permanence just fine.
Peekaboo is fun because fun is fun. When doing peekaboo the other person is paying attention to you, and often smiling and being relaxed.
They laugh just as much if you play ‘peekaboo’ without actually covering your face ;)
You provide no actual arguments as to why LLMs are fundamentally unable to learn this. Your doubt is as valuable as my confidence.
Because the nature of their operation (learning a probability distribution over a corpus of observed data) is not the same as creating synthetic a priori knowledge (object permanence is a case of cause and effect which is synthetic a priori knowledge). All LLM knowledge is by definition a posteriori.
That LLMs cannot synthesize it into a priori knowledge, including other rules of logic and mathematics, is a major failure of the technology...
Well, it's a good thing I didn't say "fundamentally unable to learn this"!
I said that learning visual reasoning from video is probably not enough: if you claim it is enough, you have to reconcile that with failures in Sora, Veo 2, etc. Veo 2's problems are especially serious since it was trained on an all-DeepMind-can-eat diet of YouTube videos. It seems like they need a stronger algorithm, not more Red Dead Redemption 2 footage.
> I said that learning visual reasoning from video is probably not enough
Fair enough; you did indeed say that.
> if you claim it is enough, you have to reconcile that with failures in Sora, Veo 2, etc.
This is flawed reasoning, though. The current state of video-generating AI and the completeness of the training set do not reliably prove that the network used to perform the generation is incapable of physical modeling and/or object permanence. Those things are ultimately (the modeling of) relations between past and present tokens, so the transformer architecture does fit.
It might just be a matter of compute/network size (modeling four dimensional physical relations in high resolution is pretty hard, yo). If you look at the scaling results from the early Sora blogs, the natural increase of physical accuracy with more compute is visible: https://openai.com/index/video-generation-models-as-world-si...
It also might be a matter of fine-tuning training on (and optimizing for) four dimensional/physical accuracy rather than on "does this generated frame look like the actual frame?"
"All of YouTube" brings the same problem as training on all of the text on the Internet. Much of that text is not factual, which is why RLHF and various other fine-tuning efforts need to happen in addition to just reading all the text on the Internet. All videos on YouTube are not unedited footage of the real world faithfully reproducing the same physics you'd get by watching the real world instead of YouTube.
As for object permanence, I don't know jack about animal cognitive development, but it seems important that all animals are themselves also objects. Whether or not they can see at all, they can feel their bodies and sense, in some way or other, their relation to the larger world of other objects. They know they don't blink in and out of existence or teleport, which seems like it would create a strong bias toward believing nothing else can do that, either. The same holds true with physics. As physical objects existing in the physical world, we are ourselves subject to physics and learn a model that is largely correct within the realm of energy densities and speeds we can directly experience. If we had to learn physics entirely from watching videos, I'm afraid Roadrunner cartoons and Fast and the Furious movies would muddy the waters a bit.
The example of the cat and detective hat shows that even with the latest update, it isn't "editing" the image. The generated cat is younger, with bigger, brighter eyes, more "perfect" ears.
I found that when editing images of myself, the result looked weird, like a funky version of me. For the cat, it looks "more attractive" I guess, but for humans (and I'd imagine for a cat looking at the edited cat with a keen eye for cat faces), the features often don't work together when changed slightly.
ChatGPT 4o's advanced image generation seems to have a low-resolution autoregressive part that generates tokens directly, and an image-upscaling decoding step that turns the (perhaps 100 px wide) token-image into the actual 1024 px wide final result. The former step is able to almost nail things perfectly, but the latter step will always change things slightly. That's why it is so good at, say, generating large text but still struggles with fine text, and will always introduce subtle variations when you ask it to edit an existing image.
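If that two-stage picture is roughly right, the structure would look something like the sketch below. This is a toy illustration of the hypothesis only: the grid size, token vocabulary, and the blocky "decoder" are made-up stand-ins, not OpenAI's actual components.

```python
import numpy as np

def autoregressive_image_tokens(prompt: str, grid: int = 100) -> np.ndarray:
    """Stand-in for the low-resolution autoregressive stage: the multimodal
    transformer would emit a coarse grid of image tokens conditioned on the prompt."""
    rng = np.random.default_rng(abs(hash(prompt)) % (2**32))
    return rng.integers(0, 8192, size=(grid, grid))  # token ids on a ~100x100 grid

def decode_and_upscale(tokens: np.ndarray, out_size: int = 1024) -> np.ndarray:
    """Stand-in for the decoder/upscaler that turns tokens into full-res pixels.
    This is the stage the comment above blames for small, uncontrolled changes."""
    coarse = (tokens % 256).astype(np.uint8)               # pretend tokens map to shades
    scale = out_size // coarse.shape[0]
    fine = np.kron(coarse, np.ones((scale, scale), dtype=np.uint8))  # blocky upscale
    return np.stack([fine] * 3, axis=-1)                   # H x W x 3 "image"

image = decode_and_upscale(autoregressive_image_tokens("cat with detective hat"))
print(image.shape)  # (1000, 1000, 3) with the defaults above
```

The point of the sketch is just where the bottleneck sits: everything the first stage controls lives on the coarse grid, so any detail finer than that has to be reinvented by the second stage, which would explain the subtle drift on edits.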
Has anyone tried putting in a model that selects the editing region prior to the process? Training data would probably be hard, but maybe existing image recognition tech that draws rectangles would be a start.
Genuine question - how would such a model "edit" the image, besides manipulating the binary? I.e. changing pixel values programmatically
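One literal answer to that question, and a minimal sketch of the region-first idea: detect a box, redraw only the crop, and composite it back, so everything outside the box really is just untouched pixel values. The detector and patch generator here are hypothetical stand-ins, not any real model's API.

```python
import numpy as np

def detect_edit_region(image: np.ndarray, instruction: str) -> tuple[int, int, int, int]:
    """Stand-in for a grounding/detection model that turns 'give the cat a hat'
    into a bounding box (y0, x0, y1, x1). Here: a fixed box near the top."""
    h, w, _ = image.shape
    return 0, w // 4, h // 3, 3 * w // 4

def generate_patch(patch: np.ndarray, instruction: str) -> np.ndarray:
    """Stand-in for the generative model, asked to redraw only the cropped region."""
    return np.clip(patch.astype(int) + 40, 0, 255).astype(np.uint8)

def masked_edit(image: np.ndarray, instruction: str) -> np.ndarray:
    y0, x0, y1, x1 = detect_edit_region(image, instruction)
    edited = image.copy()
    edited[y0:y1, x0:x1] = generate_patch(image[y0:y1, x0:x1], instruction)
    return edited  # pixels outside the box are bit-identical to the input

img = np.zeros((480, 640, 3), dtype=np.uint8)
out = masked_edit(img, "give the cat a detective hat")
assert np.array_equal(out[200:, :], img[200:, :])  # rows below the box are untouched
```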
It's sad that they used 4o's image generation feature for the cat example, which does some diffusion or something else that results in the whole image changing. They should've instead used Gemini 2.0 Flash's image generation feature (or at least mentioned it!), which, even if far lower quality and resolution (max of 1024x1024, but Gemini will try to match the resolution of the original image, so you can get something like 681x1024), is much, much better at leaving the untouched parts of the image actually "untouched".
Here's the best out of a few attempts for a really similar prompt, made more detailed since Flash is a much smaller model: "Give the cat a detective hat and a monocle over his right eye, properly integrate them into the photo." You can see how the rest of the image is practically untouched to the naked human eye: https://ibb.co/zVgDbqV3
Honestly, Google has been really good at catching up in the LLM race, and their modern models like 2.0 Flash and 2.5 Pro are among the best (or the best) in their respective areas. I hope that they'll scale up their image generation feature to base it on 2.5 Pro (or maybe 3 Pro by the time they do it) for higher quality and prompt adherence.
If you want, you can give 2.0 Flash image gen a try for free (with generous limits) on https://aistudio.google.com/prompts/new_chat, just select it in the model selector on the right.
I'm not sure I see the behavior in the Gemini 2.0 Flash model's image output as a strength. It seems to me it has multiple output modes, one indeed being masked edits. But it also seems to have convolutional matrix edits (e.g. "make this image grayscale" looks practically like it's applying a Photoshop filter) and true latent space edits ("show me this scene 1 minute later" or "move the camera so it is above this scene, pointing down"). And it almost seems to me these are actually distinct modes, which seems like it's been a bit too hand engineered.
On the other hand, OpenAI's model, while it does seem to have some upscaling magic happening (which makes the outputs look a lot nicer than the ones from Gemini FWIW), also seems to perform all its edits entirely in latent space (hence it's easy to see things degrade at a conceptual level such as texture, rotation, position, etc.) But this is a sign that its latent space mode is solid enough to always use, while with Gemini 2.0 Flash I get the feeling when it is used, it's just not performing as well.
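For reference, the "Photoshop filter" style of edit mentioned above (e.g. "make this image grayscale") really can be plain, deterministic pixel math rather than anything generative. A standard luma channel mix looks like this (generic image code, not anything Gemini-specific):

```python
import numpy as np

def grayscale(image: np.ndarray) -> np.ndarray:
    """Fixed channel-mixing 'filter': every output pixel is a weighted sum of R, G, B."""
    weights = np.array([0.299, 0.587, 0.114])   # standard Rec. 601 luma weights
    gray = image.astype(float) @ weights        # (H, W, 3) @ (3,) -> (H, W)
    return np.repeat(gray[..., None], 3, axis=-1).astype(np.uint8)
```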
> You can see how the rest of the image is practically untouched to the naked human eye: https://ibb.co/zVgDbqV3
The cat's facial hair coloring is entirely different in your image, far more so than the OpenAI one. It has a half white nose instead of a half black nose, the ears are black instead of pink, the cheeks are solid colors instead of striped. Yours is the head of an entirely different cat grafted onto the original body.
The OpenAI one is like the original cat run through a beauty filter. The eyes are completely different, but its facial hair patterning is matched much better.
Neither one is great, though. They both output cats that are not recognizable as the original.
For me, Flash is much worse. Both are incapable of generating any kind of flowchart similar to the one I give it, but 4o does slightly better in that I can at least read its output. Neither makes sense.
> Rather watch than read? Hey, I get it - sometimes you just want to kick back and watch! Check out this quick video where I walk through everything in this post
Hm, no, I’ve never had this thought.
Pivot To Video will never die.
Me neither. But some people are dyslexic or non-native speakers (it's common to have listening comprehension better than reading comprehension when learning a language) and have to put more effort into reading than I do, and some people genuinely just prefer video.
I think it's decent of the author to provide a video for these people.
The inconsistency of an optimistic blog post ending with a picture of a terminator robot makes me think this author isn't taking themself seriously enough. Or - the author is the terminator robot?
I think that one reason humans are so good at understanding images is that our eyes see video rather than still images. Video lets us see "cause and effect" by showing what happens after something. It also allows us to grasp the 3D structure of things, since we will almost always see everything from multiple angles. So long as we just feed a big bunch of stills into training these models, they will struggle to understand how things affect one another.
I have some bad news for you about how every digital video you've ever seen in your life is encoded.
"I set a plate on a table, and glass next to it. I set a marble on the plate. Then I pick up the marble, drop it in the glass. Then I turn the glass upside down and set it on the plate. Then, I pick up the glass and put it in the microwave. Where is the marble?"
the author claims that visual reasoning will help the model solve this problem, noting that gpt-4o got the question right after making a mistake in the beginning of the response. i asked gpt-4o, claude 3.7, and gemini 2.5 pro experimental, who all answered 100% correctly.
the author also demonstrates trying to do "visual reasoning" with gpt-4o, notes that the model got it wrong, then handwaves it away by saying the model wasn't trained for visual reasoning.
"visual reasoning" is a tweet-worthy thought that the author completely fails to justify
The first caption of the cat picture may be a bit misleading for those who are not sure how this works: "The best a traditional LLM can do when asked to give it a detective hat and monocle."
The role of the traditional LLM in creating a picture is quite minimal (if an LLM is used at all); it might just tweak the prompt a bit for the diffusion model. It was definitely not the LLM that created the picture: https://platform.openai.com/docs/guides/image-generation
4o image generation is surely a bit different, but I don't have more precise technical information (there must indeed be a specialized transformer model used, linking tokens to pixels: https://openai.com/index/introducing-4o-image-generation/)
What's interesting to me is how many of these advancements are just obvious next steps for these tools. Chain of thought, tree of thought, mixture of experts etc. are things you'd come up with in the first 10 minutes of thinking about improving LLMs.
Of course the devil's always in the details and there have been real non-obvious advancements at the same time.
The problem always was that the improvements worsened other cases.
Mixture of experts tends to lock models up.
Chain of thought tends to turn the models loopy, as in circling the drain even when asked not to.
Tree of thought has a tendency to make the models unstable by flipping between the branches, and otherwise unpredictably complicates training.
...
I've seen some speculate that o3 is already using visual reasoning and that's what made it a breakthrough model.
Excellent write up.
The example you used to demonstrate is well done.
Such a simple way to describe the current issues.