If you want the exact coordinates, you can run a keypoint network to pinpoint where the next click should land. Here I show a simple example prompt which returns the keypoint location of the next button to click and visually localizes that point with a keypoint drawn on the image.
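A rough sketch of what that prompt/response flow can look like. The endpoint, request fields, and response schema below are my own placeholders for illustration, not vlm.run's actual API:

    import base64
    import requests
    from PIL import Image, ImageDraw

    # Encode the current screenshot of the UI.
    with open("screenshot.png", "rb") as f:
        image_b64 = base64.b64encode(f.read()).decode()

    # Hypothetical VLM endpoint and schema; swap in whatever API you actually use.
    point = requests.post(
        "https://example-vlm.invalid/api/chat",
        json={
            "prompt": 'Return the keypoint (pixel x, y) of the next button to click, '
                      'as JSON: {"x": int, "y": int, "label": str}',
            "image_base64": image_b64,
        },
    ).json()  # assumed to return {"x": ..., "y": ..., "label": ...}

    # Visually localize the predicted keypoint on the image.
    img = Image.open("screenshot.png").convert("RGB")
    draw = ImageDraw.Draw(img)
    x, y = point["x"], point["y"]
    draw.ellipse((x - 6, y - 6, x + 6, y + 6), outline="red", width=3)
    img.save("screenshot_keypoint.png")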
I was using this for video understanding with inference from the vlm.run infra. It has definitely outperformed Gemini, which is generally much better than OpenAI or Claude on videos. The detailed extraction is pretty good. With agents you can also crop into a segment and do more operations on it. Have to see how the multimodal space progresses.
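By "crop into a segment" I just mean pulling out a time range and re-running the model on the clip; the filenames and timestamps here are made up:

    import subprocess

    # Cut a 30-second clip starting at 1:00 (stream copy, no re-encode; cuts land
    # on keyframes). The agent can then send just this clip back to the VLM.
    subprocess.run([
        "ffmpeg", "-ss", "00:01:00", "-i", "input.mp4",
        "-t", "30", "-c", "copy", "clip.mp4",
    ], check=True)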
I found it pretty funny how bad Claude was at cropping an image. It was a cute little character with some text off to the side on a white background, all very clean cartoon vibes and it COULD NOT just select the character. I pursued it for 20 minutes because I thought it was funny. Of course it was 45 seconds to do it myself.
A lot of my side projects involve UIs, and almost all of my problems with getting LLMs to write them for me involve "the UI isn't doing what you say it's doing" and struggling with A) a reliable way to get the model to look at the UI so it can continue its loop, and B) getting it to understand what it's looking at well enough to do something about it.
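The loop I keep trying to get working is roughly the one below. ask_vlm is a stand-in for whatever model call you use, and the localhost URL is just an assumed dev server:

    from playwright.sync_api import sync_playwright

    def ask_vlm(png_bytes: bytes, question: str) -> str:
        """Placeholder: wire this to whatever VLM you use; should return a critique."""
        raise NotImplementedError

    # Take a fresh screenshot each iteration so the model critiques what the UI
    # actually looks like, not what it claims it looks like.
    with sync_playwright() as p:
        browser = p.chromium.launch()
        page = browser.new_page()
        page.goto("http://localhost:3000")  # assumed local dev server
        for _ in range(5):
            shot = page.screenshot(full_page=True)
            critique = ask_vlm(shot, "Does this match the spec? List concrete visual problems.")
            print(critique)
            # ...feed the critique back into the code-writing loop, rebuild, then...
            page.reload()
        browser.close()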
I agree: Claude, ChatGPT, and even Gemini do a poor job of detecting and cropping into a region, some of the simplest tasks. Qwen is also great at summarization but not at simple vision tasks like cropping, segmentation, and detection. Here is an example where we compared Claude, Gemini, ChatGPT, and other frontier models on simple (and complicated) visual tasks:
https://chat.vlm.run/showdown#:~:text=Crop%20into%20the%20cl...
The part that was funny to me is I would respond "is that right?" and it would tell me exactly how it was wrong and proceed to do it incorrectly again in a very similar but different way. It was like a Monty Python sketch. I might have also been very tired and easily amused.
Wow, that is crazy. Maybe the next YC-funded company will be built on this. But there are so many ethical considerations here. Tracking immigrants means they will track citizens as well. We need to see how these companies are regulated.
No, I mean: I am an immigrant, and I am already tracked more than a citizen. We have accepted that fact. But that tracking is done through government portals. Asking third-party sources to keep tabs on immigrants can go bad very quickly, both for immigrants and for citizens. I think it is worse for citizens.
I am an immigrant in a very rich and beautiful country (hint: not America), and they don't track me specifically or make my life more miserable. Consider what you do with your life and whether this is what you actually want.
I tested it out here https://news.ysimulator.run/item/3196 for OCR, segmentation, detection, and 3D in a single chat. The comments seem relevant; was this trained on previous Hacker News comments, or is it purely LLMs replying with LLM context?
The model looks good for an open-source model. I want to see how these models are trained. Maybe they start from a base model trained on academic datasets and quickly fine-tune it with outputs from models like Nano Banana Pro or something? That could be the game for such models. But it's great to see an open-source model competing with the big players.
This is more on the technical details, which is great, but it would be nice to see the data. I know they won't expose such information, but it would be good to have some visibility into the datasets and how the data was sourced.
The camera is capturing a 3D projection of a 4D universe. Just as our own cameras capture a 2D projection of a 3D universe.
It's confusing not just because hyperdimensions are inherently confusing to humans, but also because the resulting 3D projection is then reduced to a 2D projection by our computer screens. Just know that the representation is of a 3D projection that you can explore in the same way as any other 3D world inside of a computer game - until you rotate in the ana/kata axes, which will change the 3D universe.
It gets a bit easier to play the game if you hit "v" to enable additional orthogonal projections.
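For intuition, the usual way a 4D-to-3D projection works (not necessarily this game's exact math) is a perspective divide along the extra w (ana/kata) axis, exactly the same trick a pinhole camera uses for 3D-to-2D, one dimension up:

    import numpy as np

    def project_4d_to_3d(points_4d: np.ndarray, camera_w: float = 3.0) -> np.ndarray:
        """Perspective-project 4D points (x, y, z, w) to 3D by dividing by the
        distance from the camera along the w axis; nearer-in-w points appear larger."""
        xyz = points_4d[:, :3]
        w = points_4d[:, 3]
        scale = camera_w / (camera_w - w)
        return xyz * scale[:, None]

    # Example: the 16 vertices of a unit tesseract, projected into a 3D "shadow".
    verts = np.array(
        [[x, y, z, w] for x in (0, 1) for y in (0, 1) for z in (0, 1) for w in (0, 1)],
        dtype=float,
    )
    print(project_4d_to_3d(verts))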
https://chat.vlm.run/c/e12f0153-7121-4599-9eb9-cd8c60bbbd69