If you want the exact coordinates, you can run a keypoint network to pinpoint where the next click should land. Here I show a simple example prompt which returns the keypoint location of the next button to click and visually localizes that point with a keypoint drawn on the image.
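A rough sketch of what that prompt/response flow can look like. The endpoint, request fields, and response schema below are my own placeholders for illustration, not vlm.run's actual API:

    import base64
    import requests
    from PIL import Image, ImageDraw

    # Encode the current screenshot of the UI.
    with open("screenshot.png", "rb") as f:
        image_b64 = base64.b64encode(f.read()).decode()

    # Hypothetical VLM endpoint and schema; swap in whatever API you actually use.
    point = requests.post(
        "https://example-vlm.invalid/api/chat",
        json={
            "prompt": 'Return the keypoint (pixel x, y) of the next button to click, '
                      'as JSON: {"x": int, "y": int, "label": str}',
            "image_base64": image_b64,
        },
    ).json()  # assumed to return {"x": ..., "y": ..., "label": ...}

    # Visually localize the predicted keypoint on the image.
    img = Image.open("screenshot.png").convert("RGB")
    draw = ImageDraw.Draw(img)
    x, y = point["x"], point["y"]
    draw.ellipse((x - 6, y - 6, x + 6, y + 6), outline="red", width=3)
    img.save("screenshot_keypoint.png")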
I was using this for video understanding with inference from the vlm.run infra. It has definitely outperformed Gemini, which is generally much better than OpenAI or Claude on videos. The detailed extraction is pretty good. With agents you can also crop into a segment and do more operations on it. Have to see how the multimodal space progresses.
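By "crop into a segment" I just mean pulling out a time range and re-running the model on the clip; the filenames and timestamps here are made up:

    import subprocess

    # Cut a 30-second clip starting at 1:00 (stream copy, no re-encode; cuts land
    # on keyframes). The agent can then send just this clip back to the VLM.
    subprocess.run([
        "ffmpeg", "-ss", "00:01:00", "-i", "input.mp4",
        "-t", "30", "-c", "copy", "clip.mp4",
    ], check=True)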
I found it pretty funny how bad Claude was at cropping an image. It was a cute little character with some text off to the side on a white background, all very clean cartoon vibes and it COULD NOT just select the character. I pursued it for 20 minutes because I thought it was funny. Of course it was 45 seconds to do it myself.
A lot of my side projects involve UIs, and almost all of my problems with getting LLMs to write them for me involve "the UI isn't doing what you say it's doing" and struggling with A) a reliable way to get the model to look at the UI so it can continue its loop, and B) getting it to understand what it's looking at well enough to do something about it.
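The loop I keep trying to get working is roughly the one below. ask_vlm is a stand-in for whatever model call you use, and the localhost URL is just an assumed dev server:

    from playwright.sync_api import sync_playwright

    def ask_vlm(png_bytes: bytes, question: str) -> str:
        """Placeholder: wire this to whatever VLM you use; should return a critique."""
        raise NotImplementedError

    # Take a fresh screenshot each iteration so the model critiques what the UI
    # actually looks like, not what it claims it looks like.
    with sync_playwright() as p:
        browser = p.chromium.launch()
        page = browser.new_page()
        page.goto("http://localhost:3000")  # assumed local dev server
        for _ in range(5):
            shot = page.screenshot(full_page=True)
            critique = ask_vlm(shot, "Does this match the spec? List concrete visual problems.")
            print(critique)
            # ...feed the critique back into the code-writing loop, rebuild, then...
            page.reload()
        browser.close()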
I agree: Claude, ChatGPT, and even Gemini do a poor job of detecting and cropping into a region, some of the simplest tasks. Qwen is also great at summarization but not at simple vision tasks like cropping, segmentation, and detection. Here is an example where we compared Claude, Gemini, ChatGPT, and other frontier models on simple (and complicated) visual tasks:
https://chat.vlm.run/showdown#:~:text=Crop%20into%20the%20cl...
The part that was funny to me is I would respond "is that right?" and it would tell me exactly how it was wrong and proceed to do it incorrectly again in a very similar but different way. It was like a Monty Python sketch. I might have also been very tired and easily amused.
Wow, that is crazy. Maybe the next YC-funded company will be built on this. But there are so many ethical considerations here. Tracking immigrants means they will track citizens as well. We need to see how these companies are regulated.
No, I mean: I am an immigrant, and I am already tracked more than a citizen. We have accepted that fact. But that tracking is done through government portals. Asking third-party sources to keep tabs on immigrants can go bad very quickly, both for immigrants and for citizens. I think it is worse for citizens.
I am an immigrant in a very rich and beautiful country (hint: not America), and they don't track me specifically or make my life more miserable. Consider what you do with your life and whether this is what you actually want.
I tested it out here https://news.ysimulator.run/item/3196 for OCR, segmentation, detection, and 3D in a single chat. The comments seem relevant; was this trained on previous Hacker News comments, or is it purely LLMs replying with LLM context?
The model looks good for an open-source model. I want to see how these models are trained. Maybe they start from a base model trained on academic datasets and quickly fine-tune it with outputs from models like Nano Banana Pro or something? That could be the game for such models. But it's great to see an open-source model competing with the big players.
This is more on the technical details, which is great, but it would be nice to see the data. I know they won't expose such information, but it would be good to have some visibility into the datasets and how the data was sourced.
The camera is capturing a 3D projection of a 4D universe. Just as our own cameras capture a 2D projection of a 3D universe.
It's confusing not just because hyperdimensions are inherently confusing to humans, but also because the resulting 3D projection is then reduced to a 2D projection by our computer screens. Just know that the representation is of a 3D projection that you can explore in the same way as any other 3D world inside of a computer game - until you rotate in the ana/kata axes, which will change the 3D universe.
It gets a bit easier to play the game if you hit "v" to enable additional orthogonal projections.
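For intuition, the usual way a 4D-to-3D projection works (not necessarily this game's exact math) is a perspective divide along the extra w (ana/kata) axis, exactly the same trick a pinhole camera uses for 3D-to-2D, one dimension up:

    import numpy as np

    def project_4d_to_3d(points_4d: np.ndarray, camera_w: float = 3.0) -> np.ndarray:
        """Perspective-project 4D points (x, y, z, w) to 3D by dividing by the
        distance from the camera along the w axis; nearer-in-w points appear larger."""
        xyz = points_4d[:, :3]
        w = points_4d[:, 3]
        scale = camera_w / (camera_w - w)
        return xyz * scale[:, None]

    # Example: the 16 vertices of a unit tesseract, projected into a 3D "shadow".
    verts = np.array(
        [[x, y, z, w] for x in (0, 1) for y in (0, 1) for z in (0, 1) for w in (0, 1)],
        dtype=float,
    )
    print(project_4d_to_3d(verts))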
https://chat.vlm.run/c/e12f0153-7121-4599-9eb9-cd8c60bbbd69