Saturday, 6 September 2025

vlm

1. Vision language models are large language models that operate on both modalities, text and image, which lets a model perceive and reason over what it sees. CLIP, developed by OpenAI (an open-sourced project), was the pioneer. Alignment methods feel like the hottest topic in the VLM space to me. Alignment means teaching the model what a word looks like visually, through methods like these (code sketches follow the list):
- Contrastive learning: given correct and incorrect image-text samples, the model learns to pull correct pairs closer in its internal representation and push incorrect ones apart.
- Cross-modal attention: image and text interact through the attention mechanism at the same time; while processing text, the model attends to the relevant parts of the image. For "What color is the car?", the model focuses on the car in the image.
- Masked language and image modeling: like BERT, the model guesses a missing part and learns from that. For example, give it an image and delete certain words from the description, so the model has to look at the visual data to predict the missing word.
- Supervised alignment with bounding boxes: during training, bounding boxes are created for grounding, so at generation time the model also produces responses that are grounded (a toy box loss appears after item 2). [[sled-lab]]
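To make the contrastive idea concrete, here is a minimal sketch of a CLIP-style symmetric contrastive loss. The shapes, the temperature value, and the function name are my own illustrative assumptions, not OpenAI's actual implementation:

```python
import torch
import torch.nn.functional as F

def contrastive_loss(image_emb: torch.Tensor, text_emb: torch.Tensor,
                     temperature: float = 0.07) -> torch.Tensor:
    """image_emb, text_emb: (batch, dim); row i of each is a matching pair."""
    image_emb = F.normalize(image_emb, dim=-1)
    text_emb = F.normalize(text_emb, dim=-1)
    # Similarity of every image with every text in the batch: (batch, batch).
    logits = image_emb @ text_emb.T / temperature
    # The correct partner for sample i sits on the diagonal.
    targets = torch.arange(logits.size(0), device=logits.device)
    # Symmetric cross-entropy pulls matching pairs together and
    # pushes mismatched pairs apart.
    return (F.cross_entropy(logits, targets) +
            F.cross_entropy(logits.T, targets)) / 2

# Random tensors standing in for image/text encoder outputs.
img = torch.randn(8, 512)
txt = torch.randn(8, 512)
print(contrastive_loss(img, txt))
```

Each row of the logits matrix is one image scored against every caption in the batch, so cross-entropy toward the diagonal is exactly the "keep correct samples closer, push incorrect ones apart" behaviour.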

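And a hedged sketch of cross-modal attention, where text tokens act as queries over image patch features. The dimensions, token counts, and variable names are assumptions for illustration:

```python
import torch
import torch.nn as nn

cross_attn = nn.MultiheadAttention(embed_dim=512, num_heads=8, batch_first=True)

text_tokens = torch.randn(1, 12, 512)     # e.g. "What color is the car?"
image_patches = torch.randn(1, 196, 512)  # 14x14 patch features from a vision encoder

# Each text token scores every image patch; after training, the weights
# for the "car" token should concentrate on car patches.
fused, attn_weights = cross_attn(query=text_tokens,
                                 key=image_patches,
                                 value=image_patches)
print(fused.shape)         # (1, 12, 512)
print(attn_weights.shape)  # (1, 12, 196)
```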
2. The biggest problem with VLMs is visual hallucination: the LLM component generates responses from its parametric knowledge instead of the image, which is why grounding is so important for dealing with hallucination.
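As a toy illustration of the bounding-box supervision from item 1 (one way to give the model a grounding signal), here is a simple box regression loss. The smooth-L1 + IoU mix is my own assumption, not any specific paper's recipe:

```python
import torch
import torch.nn.functional as F

def iou(pred: torch.Tensor, target: torch.Tensor) -> torch.Tensor:
    """pred, target: (N, 4) boxes as (x1, y1, x2, y2)."""
    x1 = torch.max(pred[:, 0], target[:, 0])
    y1 = torch.max(pred[:, 1], target[:, 1])
    x2 = torch.min(pred[:, 2], target[:, 2])
    y2 = torch.min(pred[:, 3], target[:, 3])
    inter = (x2 - x1).clamp(min=0) * (y2 - y1).clamp(min=0)
    area_p = (pred[:, 2] - pred[:, 0]) * (pred[:, 3] - pred[:, 1])
    area_t = (target[:, 2] - target[:, 0]) * (target[:, 3] - target[:, 1])
    return inter / (area_p + area_t - inter + 1e-6)

def grounding_loss(pred_boxes, gt_boxes):
    # Smooth L1 keeps coordinates close; the IoU term rewards overlap directly.
    return F.smooth_l1_loss(pred_boxes, gt_boxes) + (1 - iou(pred_boxes, gt_boxes)).mean()

pred = torch.tensor([[0.10, 0.10, 0.50, 0.50]])
gt = torch.tensor([[0.15, 0.10, 0.55, 0.50]])
print(grounding_loss(pred, gt))
```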
3. Currently, what's visible in the llama-3.2-vision model is that it uses the LLM backbone itself to correlate vision and text, which doesn't require contrastive pre-training the way CLIP does (rough sketch below).
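A rough sketch of that adapter idea: a gated cross-attention block inserted between the layers of a frozen LLM backbone, so text hidden states can read image features directly. This follows the Flamingo-style tanh-gating pattern; I haven't verified it matches llama-3.2-vision's exact layer design:

```python
import torch
import torch.nn as nn

class GatedCrossAttentionBlock(nn.Module):
    def __init__(self, dim: int = 512, num_heads: int = 8):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)
        # tanh(0) = 0, so at init this block is an identity: the model starts
        # as a pure LLM and training gradually opens the gate to vision.
        self.gate = nn.Parameter(torch.zeros(1))

    def forward(self, text_hidden, image_feats):
        attended, _ = self.attn(self.norm(text_hidden), image_feats, image_feats)
        return text_hidden + torch.tanh(self.gate) * attended

block = GatedCrossAttentionBlock()
text_hidden = torch.randn(1, 12, 512)   # hidden states from the LLM backbone
image_feats = torch.randn(1, 196, 512)  # features from the vision encoder
print(block(text_hidden, image_feats).shape)  # (1, 12, 512)
```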
