Wednesday, 29 October 2025

Transformer Explained for Everyone

 




What Is a Transformer?

The Transformer is a neural network architecture built to model relationships between words in a sentence. Unlike earlier systems that handled words in strict left-to-right order, the Transformer analyzes all words in a sentence simultaneously — and still determines how each word influences the meaning of every other.

For example, in the sentence “The robot kicked the ball because it was programmed to score,” the word “it” could refer to “robot” or “ball.” The Transformer computes numerical connections between “it” and all other words, then assigns the strongest link to “robot” — based on learned patterns from vast amounts of text.

This article explains the Transformer as it operates on sequences of words — not images, sound, or full documents. We focus only on how it transforms a sentence into a set of context-aware numerical representations, one for each word.

Where Do You See Transformers in Real Life?

Transformers power many everyday tools — not just from big tech companies, but across open-source and commercial AI:

  • Autocomplete in messaging apps (e.g., suggesting “coffee?” after you type “Want to grab”)
  • Voice assistants (Siri, Alexa) interpreting commands like “Play jazz from the 1960s”
  • Open-source models like Llama, Mistral, or BERT used in research, customer support bots, and writing aids

What Came Before Transformers — and Why Did We Need Something New?

Before 2017, most language models were based on Recurrent Neural Networks (RNNs), including improved versions like LSTMs (Long Short-Term Memory) and GRUs (Gated Recurrent Units). These were widely used in speech recognition, machine translation, and early chatbots.

How RNNs Actually Worked

An RNN processes a sentence one token at a time, maintaining a hidden state — a vector that summarizes everything seen so far.

For the sentence “The robot kicked…”, it would:

  1. Read “The” → update hidden state
  2. Read “robot” → update hidden state to include “robot”
  3. Read “kicked” → update again, now encoding “robot kicked”

This hidden state acted as working memory, carrying forward information needed to interpret later words — like remembering that “robot” is an agent, so “it” likely refers back to it.

Why RNNs Hit a Wall

Despite years of use, RNNs faced three hard limits:

  1. Sequential Bottleneck
    Because each step depends on the previous one, RNNs cannot be parallelized. Training on a 50-word sentence requires 50 sequential steps, even on powerful GPUs, and a 1,000-word document requires 1,000, so training slows dramatically on long texts.
  2. Vanishing Gradients Over Distance
    In practice, RNNs struggled to connect words more than 10–20 tokens apart. If “robot” appeared at position 3 and “it” at position 42, the gradient signal linking them often faded to near zero during training — making correct resolution unreliable.
  3. Fixed Memory Capacity
    The hidden state has a fixed size (e.g., 512 numbers). It cannot grow to remember more details. Early words get overwritten or diluted as new ones arrive — like a notepad with only one page.

Researchers added fixes: LSTMs/GRUs improved memory control, and attention mechanisms (circa 2015) let RNN decoders focus on relevant input words. But the core sequential processing remained — limiting speed and scalability.

The Transformer’s Solution

The 2017 paper “Attention Is All You Need” proposed a radical shift: remove recurrence entirely.

Instead of passing a hidden state step by step, the Transformer:

  • Treats a sentence (typically up to 512 tokens) as a whole
  • Uses self-attention to compute a relevance score between every pair of words, regardless of distance
  • Runs all these computations in parallel on GPU cores

This eliminated the sequential bottleneck, allowed reliable connections across hundreds of tokens, and leveraged modern hardware far more efficiently. As a result, Transformers trained 5–10x faster than RNNs on the same data and achieved higher accuracy on tasks like translation and coreference resolution.

The Transformer replaced sequential processing with parallel attention — unlocking faster, deeper, and more contextual language understanding.

What Makes a Transformer Work? The Four Core Components

A Transformer’s power comes from four key parts, applied in sequence:

  1. Input Embeddings — Turn words into numbers
  2. Positional Encoding — Add word order
  3. Multi-Head Self-Attention — Model relationships between words
  4. Feed-Forward Blocks + Residual Connections + Layer Normalization — Refine and stabilize meaning

These components repeat in stacked layers (often 6 to 12 times), each deepening the model’s understanding.

From Sentence to Meaning

Imagine you type:

The robot kicked the ball because it was programmed to score.

Your Goal:

The model should understand “it” refers to “the robot”, not “the ball.”

This sentence enters the Transformer as raw text. What comes out — after several layers — is a refined numerical representation of each word, now aware of its context.

Input Embeddings: Turn Words into Numbers

Neural networks cannot process raw text directly. Instead, each word (or word fragment) must be converted into a list of numbers that the network can understand and learn from. This list is called a dense vector, and it serves as a numerical “portrait” of the word’s meaning — its semantic meaning.

Think of semantic meaning like a person’s profile in a social app: just as “likes hiking, speaks French, works in robotics” tells you something about who they are beyond their name, a word’s vector captures its role and relationships in language. The word “kick” will have a vector closer to “throw” or “hit” than to “sleep” or “dream”, because the model learns these associations from vast amounts of text.


How does this conversion happen?

First, a tokenizer — a preprocessing tool built into the model — breaks the input sentence into smaller units called tokens. These tokens can be whole words (“robot”), parts of words (“un”, “do”, “able”), or even punctuation. The choice depends on the tokenizer’s design and the language’s structure.

Each token is then assigned a unique integer ID, known as input_ids. For example, in a given model, “robot” might be ID 4281, “kicked” ID 9032, and so on.

These IDs are not arbitrary labels — they act as addresses that point to specific rows in a large, trainable table called the embedding matrix. This matrix has dimensions vocab_size × d_model:

  • vocab_size is the total number of distinct tokens the model was trained to recognize (e.g., 20,000). This vocabulary is fixed during training and belongs to the model itself, not the user or the dataset alone — it’s built from the text data the model originally learned on.
  • d_model is the number of values (or dimensions) used to represent each token. If d_model = 512, then every token is described by a list of 512 numbers.

So, when the model sees the token ID for “robot”, it looks up the 4281st row in this matrix and retrieves a 512-number vector that stands for “robot” in its internal language.

This step gives every token a rich numerical identity — but it still knows nothing about where the word appears in the sentence. That comes next.

Input embeddings translate words into a numerical language the Transformer can learn from — like giving every word a fingerprint made of numbers.
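
To make the lookup concrete, here is a minimal PyTorch-style sketch. The vocabulary size, d_model, and the specific token IDs are illustrative assumptions, not values from any particular model.

```python
import torch
import torch.nn as nn

vocab_size, d_model = 20_000, 512                 # illustrative sizes from the text
embedding = nn.Embedding(vocab_size, d_model)     # the trainable vocab_size x d_model table

# hypothetical token IDs for "The robot kicked the ball"
input_ids = torch.tensor([[13, 4281, 9032, 13, 771]])   # shape: (batch=1, seq_len=5)

token_vectors = embedding(input_ids)              # each ID selects one row of the table
print(token_vectors.shape)                        # torch.Size([1, 5, 512])
```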

Positional Encoding: Adding Word Order

After converting words into numerical vectors, the model still has no idea about their order in the sentence. Without this, the phrases “robot kicked ball” and “ball kicked robot” would produce identical internal representations — leading to serious misunderstandings. To resolve this, the Transformer adds positional encoding: a structured signal that tells the model where each word appears in the sequence.

But how do we represent position numerically — without overwhelming the model with arbitrary numbers?

The original Transformer paper proposed an elegant solution: use mathematical wave patterns — specifically, sine and cosine functions — to generate position-aware vectors. These functions are smooth, predictable, and, crucially, allow the model to extrapolate to longer sentences than it saw during training.

Here’s the intuition:
Imagine assigning each word a unique “rhythm” based on its position. The first word gets one pattern, the second a slightly shifted version, and so on. Because sine and cosine waves repeat in a structured way, the model can learn relationships like “the word two steps ahead” or “the previous word” just by comparing these rhythms.

For each position in the sentence (e.g., position 1, 2, 3…), the model computes a vector of the same size as the embedding (d_model, e.g., 512). In this vector:

  • The values at even-numbered positions (0th, 2nd, 4th, …) come from a sine function.
  • The values at odd-numbered positions (1st, 3rd, 5th, …) come from a cosine function.

This alternation ensures that every position gets a distinct, high-dimensional signature that changes smoothly as you move through the sentence.

Note: We call the initial numerical representations input embeddings. Once positional encoding is added, we often refer to them simply as embeddings or token representations. The term “word embeddings” is commonly used in the field to describe these learned semantic vectors.

The final representation for each token is the sum of its semantic embedding and its positional vector. The model then learns to interpret this combined signal — for example: “This vector corresponds to the word ‘kicked,’ and it appears in the second position of the sentence.”

Positional encoding ensures that “dog bites man” is not the same as “man bites dog.”
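
Here is a small sketch of the sine/cosine scheme described above, following the formulas from the original paper; the sequence length and d_model are just example values.

```python
import math
import torch

def sinusoidal_positional_encoding(seq_len: int, d_model: int) -> torch.Tensor:
    position = torch.arange(seq_len).unsqueeze(1)                     # (seq_len, 1)
    div_term = torch.exp(torch.arange(0, d_model, 2) * (-math.log(10000.0) / d_model))
    pe = torch.zeros(seq_len, d_model)
    pe[:, 0::2] = torch.sin(position * div_term)                      # even dimensions: sine
    pe[:, 1::2] = torch.cos(position * div_term)                      # odd dimensions: cosine
    return pe

pe = sinusoidal_positional_encoding(seq_len=12, d_model=512)
# final input = token embedding + positional vector, e.g.:
# x = token_vectors + pe[: token_vectors.size(1)]
```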

Layer Normalization: Keeping Numbers in a Healthy Range

As a Transformer processes a sentence, each word is represented by a list of numbers (for example, 512 numbers if d_model = 512). During training, these numbers can grow very large or shrink very close to zero — especially in deep networks with many layers. When this happens, the learning process becomes unstable: small changes in input cause wildly different outputs, and the model struggles to improve.

To prevent this, the Transformer uses layer normalization — a technique that gently rescales the numbers for each word so they stay in a consistent, manageable range.

Here’s how it works:
Take one word in the sentence — say, “kicked.” Its current representation is a list of 512 numbers. Layer normalization looks only at this list (ignoring other words and other sentences in the batch) and adjusts it so that:

  • The average (mean) of the numbers becomes 0
  • The spread (standard deviation) of the numbers becomes 1

This process is called standardization, a specific type of normalization that centers and scales data. It’s like adjusting the volume of each instrument in an orchestra so no single one drowns out the others.

But what if the model wants some numbers to stay large or small? To preserve flexibility, layer normalization includes two learnable parameters — often called scale (α) and shift (β). These are numbers the model can adjust during training to reverse or modify the normalization if it helps performance. In other words, the model gets to decide how much normalization it actually needs.

Finally, to avoid mathematical errors, a tiny value called eps (e.g., 0.00001) is added during the calculation. This prevents division by zero when computing the standard deviation — especially important when all numbers in a vector are nearly identical.

Where does the (batch, seq_len, d_model) shape come from?

  • batch: A group of sentences processed together (e.g., 32 sentences at once).
  • seq_len: The number of tokens in a sentence (e.g., 10 words).
  • d_model: The size of each word’s vector (e.g., 512 numbers).

So the full data structure is a 3D block: 32 sentences × 10 words × 512 numbers per word. Layer normalization operates independently on each of the 320 word-vectors, one at a time.
Layer normalization acts like an automatic volume control for each word’s internal representation — keeping learning smooth and stable.
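
Below is a minimal sketch of this per-token normalization on the (batch, seq_len, d_model) block described above. The manual version mirrors the mean/std standardization, while PyTorch's nn.LayerNorm adds the learnable scale and shift (and places eps inside the square root, so tiny numerical differences are expected).

```python
import torch
import torch.nn as nn

batch, seq_len, d_model = 32, 10, 512
x = torch.randn(batch, seq_len, d_model)        # 32 sentences x 10 tokens x 512 numbers

# Manual standardization of each token vector (last dimension only)
eps = 1e-5
mean = x.mean(dim=-1, keepdim=True)
std = x.std(dim=-1, keepdim=True, unbiased=False)
x_manual = (x - mean) / (std + eps)

# nn.LayerNorm does the same and adds the learnable scale (alpha) and shift (beta)
layer_norm = nn.LayerNorm(d_model, eps=eps)
x_norm = layer_norm(x)                          # same shape: (32, 10, 512)
```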

Feed-Forward Block: Letting Each Word Reflect on Its Meaning

Within each Transformer layer, two main components work in sequence:

  1. Multi-Head Attention (which connects words to each other)
  2. Feed-Forward Block (which processes each word independently)

After attention updates a word’s representation by considering its neighbors, the feed-forward block gives that word a chance to “think on its own” about what it now means in context.

Think of it like this:
During a team meeting, you first listen to everyone’s opinions (that’s attention). Then, you step aside for a moment to reflect privately on what you’ve heard and refine your own thoughts (that’s the feed-forward block). This private reflection happens separately for every participant — no one else is involved.

Technically, this “reflection” is done by a small two-layer neural network applied to each token’s vector:

# Step 1: Expand the vector to a larger space
x → Linear(d_model → d_ff) → ReLU

# Step 2: Compress it back to original size
→ Dropout → Linear(d_ff → d_model)

Here’s what the numbers mean:

  • d_model (e.g., 512) is the size of each word’s vector as it moves through the Transformer.
  • d_ff (e.g., 2048) is the size of an intermediate, expanded space — typically 4 times larger than d_model.

Why expand? Because working in a larger space gives the model more “room” to separate and combine ideas. For example, the word “bank” might activate some dimensions for “financial institution” and others for “river edge.” The expanded space lets these meanings be teased apart before being recombined.

After this expansion and processing, the vector is compressed back to d_model dimensions (e.g., 512). This ensures that the output of the feed-forward block has the same shape as its input, so it can be passed smoothly to the next layer — whether that’s another Transformer block or the final prediction step. This consistency is essential for stacking many layers without shape mismatches.

The ReLU function (Rectified Linear Unit) adds a simple but powerful rule: any negative number becomes zero. This introduces non-linearity, allowing the model to learn complex patterns instead of just straight-line relationships.
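
The expand-then-compress flow above maps directly onto a small PyTorch module; the sizes below follow the example values in this section (512 and 2048) and are otherwise arbitrary.

```python
import torch
import torch.nn as nn

d_model, d_ff, dropout = 512, 2048, 0.1   # d_ff is typically about 4x d_model

feed_forward = nn.Sequential(
    nn.Linear(d_model, d_ff),   # expand: 512 -> 2048
    nn.ReLU(),                  # zero out negatives (non-linearity)
    nn.Dropout(dropout),
    nn.Linear(d_ff, d_model),   # compress back: 2048 -> 512
)

x = torch.randn(1, 11, d_model)           # (batch, seq_len, d_model)
y = feed_forward(x)                       # applied to every token independently; same shape
```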

Multi-Head Attention: Understanding Word Relationships

The Transformer’s core mechanism lets every word determine which other words matter most to it in context. We’ll use this sentence:

“The robot kicked the ball because it was programmed to score.”

Our goal is to help the model link “it” to “robot.”

Step 1: Create Queries, Keys, and Values

From each word’s current embedding, the model generates three vectors:

  • Query: What the word seeks (e.g., “it” seeks a programmable entity).
  • Key: What the word offers (e.g., “robot” signals it is programmable).
  • Value: The actual information the word contributes (e.g., traits like “machine,” “agent”).

These are produced by multiplying the embedding with three learned matrices (W_Q, W_K, W_V).

Step 2: Compute Relevance

For “it,” the model calculates a dot product between its query and every other word’s key.
The dot product — sum of element-wise products — measures alignment. A high value means strong relevance.

Step 3: Normalize with Softmax

All dot products for “it” become raw scores. Softmax converts them into attention weights — positive numbers that sum to 1.
For example: 0.82 for “robot,” 0.06 for “ball,” 0.03 for “programmed,” and small weights for the rest.


Step 4: Build a Contextual Representation

The new vector for “it” is a weighted sum of all value vectors:

New "it" = (0.82 × Value_robot) + (0.06 × Value_ball) +

Because “robot” dominates, the updated “it” now carries numerical features from “robot.” Context is built through this mixture — not by magic, but by math.

Step 5: Use Multiple Heads in Parallel

Instead of one attention process, the Transformer runs 8 (or more) independent heads at once. Each learns different patterns using its own set of W_Q, W_K, W_V.

Each head produces its own output vector for “it.” These are concatenated and passed through a final linear transformation — a matrix multiplication that maps the combined vector back to d_model dimensions (e.g., 512), ensuring compatibility with the next layer.

Multi-head attention lets every word gather contextual evidence from the entire sentence — through multiple independent lenses — before updating its meaning.
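
As a sketch, here is a minimal single-head version of Steps 1 to 4. Real multi-head implementations split d_model across several heads and, as in the original paper, divide the scores by the square root of the key dimension before the softmax.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

d_model = 512
x = torch.randn(1, 11, d_model)                       # 11 token vectors of the example sentence

# Step 1: learned projections W_Q, W_K, W_V
W_Q, W_K, W_V = (nn.Linear(d_model, d_model, bias=False) for _ in range(3))
Q, K, V = W_Q(x), W_K(x), W_V(x)

# Step 2: dot product of every query with every key (scaled by sqrt of the key size)
scores = Q @ K.transpose(-2, -1) / d_model ** 0.5     # (1, 11, 11)

# Step 3: softmax turns each row of scores into attention weights that sum to 1
weights = F.softmax(scores, dim=-1)

# Step 4: each new token vector is a weighted sum of all value vectors
out = weights @ V                                     # (1, 11, 512)
```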

Residual Connections: Preserving Original Information

In every Transformer layer, the output of each major component — multi-head attention and the feed-forward block — is added directly to its original input. This design is called a residual connection.

For example, after the attention mechanism updates the representation of the word “it,” the model computes:

New representation = Attention output + Original input before attention

This simple addition ensures that the core identity of each word — its initial embedding and positional information — is never lost, even after multiple layers of transformation. At the same time, the model can layer on rich contextual updates.

Residual connections are essential for training deep networks. During backpropagation, gradients can flow backward through the “+ input” path without passing through complex transformations. This prevents the gradients from vanishing, which would otherwise stall learning in deep architectures.

Residual connections guarantee that every word carries its original self forward, even as it absorbs new meaning from context.
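
As a sketch, the add-then-normalize pattern around any sublayer (attention or feed-forward) can be written as a small wrapper module; the ordering follows the post-norm layout described in this article.

```python
import torch.nn as nn

class ResidualBlock(nn.Module):
    """Wraps a sublayer (attention or feed-forward) with 'output + input' and LayerNorm."""
    def __init__(self, sublayer: nn.Module, d_model: int = 512):
        super().__init__()
        self.sublayer = sublayer
        self.norm = nn.LayerNorm(d_model)

    def forward(self, x):
        return self.norm(x + self.sublayer(x))   # the original input is always added back
```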

Putting It All Together: The Full Flow

Consider the sentence:

“The robot kicked the ball because it was programmed to score.”

Here is the complete sequence of operations inside a Transformer encoder:

  1. Tokenization: The sentence is split into tokens (“The,” “robot,” “kicked,” …).
  2. Embedding: Each token becomes a dense vector of size d_model (e.g., 512).
  3. Positional Encoding: A position-specific vector is added to each embedding to mark word order.

Then, the data passes through N identical layers (typically 6 or 12). Each layer performs:

  • Multi-Head Self-Attention → Add residual connection → Apply layer normalization
  • Feed-Forward Block → Add residual connection → Apply layer normalization

After the final layer, every token is represented by a context-aware vector — a numerical encoding that reflects not only its own meaning but also its role in the full sentence.

These vectors can now be used for tasks like classification, named entity recognition, or as input to a decoder in sequence-to-sequence models.

The Transformer processes an entire sentence in parallel, refining each word’s meaning through repeated cycles of attention, reflection, and identity-preserving updates.

Final Summary: The Four Pillars

The Transformer’s effectiveness rests on four foundational components:

  • Word Embedding: Converts tokens into dense numerical vectors that capture semantic meaning.
  • Positional Encoding: Encodes the order of tokens in a sequence, enabling the model to distinguish between different word arrangements.
  • Self-Attention: Dynamically models relationships among all words in a sentence, allowing each word to influence and be influenced by others.
  • Residual Connections: Preserve the original signal through each layer, ensuring stable training in deep networks.

Together, these mechanisms enable parallel, context-aware language understanding — without relying on sequential recurrence or local convolution.

Unlike traditional models that process words one at a time, the Transformer analyzes the entire sentence simultaneously, capturing how every word shapes the meaning of every other.

Friday, 19 September 2025

molmo and pixmo

 [[openweights vlm (molmo and pixmo).pdf]]

1. So Ranjay Krishna's lab has built a fully transparent vision-language model: not only the model but the datasets as well. The unique and most important feature of these models is their dataset. Normally, when a dataset is prepared for an LLM or VLM, it is generated in bulk from the OpenAI API or other models like Claude or Gemini, and the model is then trained on that synthetic dataset. The problem is that a model trained this way, even if open source, is just a distilled version of the proprietary model and cannot go much beyond it, especially in the case of vision-language models.
2. So in the Molmo and PixMo paper, they built the datasets for their vision-language model with a genuinely novel approach, which is costly, resource-intensive, and hard to quality-control compared to LLM datasets. They built a highly detailed image-caption dataset for pre-training, a free-form image Q&A dataset (instead of rigid, fixed captions, explainable, general, and natural labels for the images are collected), and an innovative 2D pointing dataset (which also does localization by pointing at objects via coordinates). All of these datasets were built without using any external VLMs.
3. Their best model, at 72B parameters, beat proprietary VLMs like Claude 3.5 Sonnet and Gemini Pro 1.5 on academic benchmarks at the time (5 Dec 2024), and came second only to GPT-4o.
4. This model was the SOTA model in its class of openness, i.e., the class that tries to build everything in an openly reproducible way; otherwise it is hard to compete with proprietary models.
5. Molmo is more specialized in natural image understanding and counting than other models, but on advanced reasoning problems the proprietary models still beat it.
6. Their dataset had 712k images, each with a caption of around 200+ words, which wasn't annotated through a crowdsourcing platform; rather, they innovated a very useful technique.
7. They had annotators describe each image in speech for 60-90 seconds and used that spoken explanation as the annotation. This trick of changing the modality of first-hand data collection helped build a very high-quality dataset without using any proprietary VLMs.
8. PixMo is not a single dataset; it is an array of datasets for pre-training and finetuning. To build the instruction-tuning dataset, they collected free-form data from users in an interactive way for 72k images, with 162k annotations (multiple annotations per image, i.e., different free-form comments about the visual objects).
9. To ground language to images, they collected 2.3 million grounding annotations from 223,000 images. Instead of bounding boxes (rectangles) or segmentation masks, they used single (x, y) points, which made annotation much faster and more feasible, and it also works well for tasks like counting and identification.
10. The system uses a clever HTML-like format where coordinates are scaled from 0-100 regardless of image size: `<point x="10.0" y="10.0" alt="Mt. Rainier">mountain</point>`. This makes the system resolution-agnostic: it works the same whether the image is 100x100 pixels or 4K (see the coordinate-conversion sketch after this list).
11. They also generated synthetic data, but without using any VLM: they used only an LLM to generate code, and for specific tasks (clock reading, chart understanding, and table understanding) they built recipe-like synthetic datasets. It is like studying the chemical formula of a wine instead of learning about it by tasting it.
12. Their training recipe was also innovative and effective. They used a pre-trained LLM and a vision encoder just like any other VLM, but the special parts were: a two-stage training pipeline (normally there are three stages, where an extra connector-tuning stage trains the LLM to find structure in noisy data when the dataset quality is poor); a novel overlapping multi-crop strategy (instead of sliding over the image grid by grid, it builds overlapping crops, so the model does not lose context when reading the image crop by crop, preventing information loss at crop boundaries); efficient multi-annotation learning (if one image has multiple annotations, training uses them in a single pass instead of duplicating the image with different captions); and an improved vision/language connector (the vision-language connector bridges visual and textual understanding in multimodal AI; traditional methods use basic feature stacking, while Molmo employs attention-based pooling that preserves spatial relationships, which significantly enhances performance on visual reasoning and counting tasks).
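
Here is a rough sketch of how the 0-100 scaled point format from item 10 could be converted back to pixel coordinates; the regex and helper below are hypothetical illustrations, not code from the Molmo/PixMo release.

```python
import re

def points_to_pixels(text: str, width: int, height: int):
    """Convert <point x="..." y="..."> coordinates (scaled 0-100) to pixel positions."""
    pattern = r'<point x="([\d.]+)" y="([\d.]+)"[^>]*>'
    return [(float(x) / 100.0 * width, float(y) / 100.0 * height)
            for x, y in re.findall(pattern, text)]

# The same annotation works at any resolution:
ann = '<point x="10.0" y="10.0" alt="Mt. Rainier">mountain</point>'
print(points_to_pixels(ann, 100, 100))      # [(10.0, 10.0)]
print(points_to_pixels(ann, 3840, 2160))    # [(384.0, 216.0)]
```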

Wednesday, 10 September 2025

llms

1. Grouped-query attention is slightly different from multi-head attention and exists to reduce cost: in multi-head attention, queries, keys and values have to be computed for every token in every head, whereas in grouped-query attention a group of query heads shares the same key and value heads, which significantly reduces compute cost and also memory cost (a small sketch follows this list).
2. The analogy is this: MHA is like everyone in the house having an iPhone and each using their own charger cable and plug, whereas in GQA, if there are 4 people in the house, they use 2 chargers, plugs, and cables, so the electricity bill is a bit smaller.
3. Multi-head latent attention is an important concept used in DeepSeek R1/V3 that replaces the multi-head attention of the original Transformer architecture. Additionally, DeepSeek's architecture has 61 transformer blocks. Now, coming to multi-head latent attention: just as in the GPT-2 architecture each query had its corresponding key and value (say, 100 dimensions each), in MLA the keys and values are stored in a compressed form (say, a 10-dimensional vector), which cuts memory usage substantially, and this MLA idea is applied at inference rather than during training. The attention mechanism operates on the query, so it is kept at 100d, while the keys and values act as the knowledge base, which stays preserved even in compressed form.
4. The analogy here is the zip-file concept: say there is a 100 MB .mp4 file that gets compressed into a 10 MB .zip, but when you extract it, everything comes back intact.
5. Inference efficiency is what matters most for real-world deployment, namely memory and speed, and MLA optimizes both nicely. It is not used during training because the model needs full precision there.
6. In the OLMo 2 model, the placement of the normalization layer was something unique. Initially, the original Transformer architecture (decoder part) used post-norm (LayerNorm after MHA and another after the feed-forward layer); then pre-norm came into use in models like Llama 3 8B and GPT-2 (RMSNorm and LayerNorm respectively); and finally the OLMo 2 7B model placed post-norm inside the residual, meaning that where the original Transformer had post-norm outside the residual, OLMo flipped it, so the loss spikes (instability) mostly seen with pre-norm usage did not appear.
7. QK-norm was also used in OLMo 2: an additional normalization applied to the queries and keys, which helps keep attention stable during training.
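
A toy sketch of the grouped-query idea from item 1: a few key/value heads are shared by groups of query heads, so the KV tensors (and the KV cache) stay small. All sizes below are made up for illustration.

```python
import torch

batch, seq_len, head_dim = 1, 8, 64
n_query_heads, n_kv_heads = 8, 2            # 8 query heads share 2 key/value heads (groups of 4)

q = torch.randn(batch, n_query_heads, seq_len, head_dim)
k = torch.randn(batch, n_kv_heads, seq_len, head_dim)    # far fewer K/V heads -> smaller KV cache
v = torch.randn(batch, n_kv_heads, seq_len, head_dim)

group = n_query_heads // n_kv_heads
k = k.repeat_interleave(group, dim=1)       # each K/V head is reused by a group of query heads
v = v.repeat_interleave(group, dim=1)

scores = q @ k.transpose(-2, -1) / head_dim ** 0.5
out = torch.softmax(scores, dim=-1) @ v     # (1, 8, 8, 64), same shape as plain multi-head attention
```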
   

Saturday, 6 September 2025

90k parameters


# You Only Need 90K Parameters to Adapt Light: a Light Weight Transformer for Image Enhancement and Exposure Correction

1. The paper addresses the common problem that images taken in difficult lighting conditions (too dark, too bright, under- or over-exposed) look bad and also **degrade the performance** of computer vision algorithms (like object detection).
2. Normally, a camera's internal **Image Signal Processor (ISP)** converts the raw sensor data into the standard image format (sRGB) we usually see. This process involves steps like **color correction** and adjusting brightness/contrast **(gamma correction).**
3. The researchers propose a new method called the **Illumination Adaptive Transformer (IAT).** Instead of just trying to fix the final image, IAT works by essentially **learning to *adjust the parameters of the ISP process itself* based on the input image**. It breaks down the ISP process and uses a Transformer model (specifically, attention queries) to figure out the **best adjustments** for things like color and gamma needed to correct the lighting.
4. The key advantages highlighted are that IAT is very small (lightweight, only **90k parameters**) and extremely fast (takes only **0.004 seconds per image**). Despite its efficiency, it performs better than current leading methods (State-Of-The-Art) on **standard tests for fixing low-light and exposure problems**. Importantly, fixing the images with IAT also significantly helps other computer vision tasks, like detecting objects or understanding image segments, perform better in these challenging lighting conditions.

**Notes from the Paper Text:**

- **Problem:** Real-world challenging illumination (low light, under/over-exposure) harms visual quality and computer vision task performance.
- **Background:** Cameras use an Image Signal Processor (ISP) to convert raw data to sRGB images, involving steps like color/gamma correction.
- **Proposed Solution:** Illumination Adaptive Transformer (IAT).
    - A lightweight and fast transformer model.
    - Aims to restore normally lit sRGB images from poorly lit inputs.
- **IAT Mechanism:**
    - Decomposes the ISP pipeline conceptually (into local/global components).
    - Uses attention queries to learn and adjust ISP-related parameters (e.g., color correction, gamma correction).
- **IAT Features:**
    - Very lightweight: ~90k parameters.
    - Very fast: ~0.004s processing time per image [inference time].
- **Performance:**
    - Consistently outperforms State-Of-The-Art (SOTA) methods on low-light enhancement and exposure correction benchmarks.
    - Significantly improves downstream tasks (object detection, semantic segmentation) under various lighting conditions.

---

---

## Content Based Image Retrieval

**Content-based image retrieval** (CBIR) is a process in image retrieval where the system searches for images based on their **visual and semantic content**, rather than **metadata or textual descriptions.** It involves extracting features from images, such as color, texture, and shape, and using these features to **compare and rank images in the database**. This technology is often used in applications like facial recognition, image search engines, and medical imaging.
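
A minimal illustration of the CBIR idea using a color-histogram feature and cosine similarity; real systems use far richer texture/shape or learned features, so treat this purely as a sketch.

```python
import numpy as np

def color_histogram(image: np.ndarray, bins: int = 8) -> np.ndarray:
    """A very simple visual feature: a normalized per-channel color histogram."""
    hist = [np.histogram(image[..., c], bins=bins, range=(0, 256))[0] for c in range(3)]
    feat = np.concatenate(hist).astype(float)
    return feat / (np.linalg.norm(feat) + 1e-8)

def rank_by_similarity(query_feat: np.ndarray, database_feats: list) -> list:
    """Rank database images by cosine similarity to the query features."""
    sims = [float(query_feat @ f) for f in database_feats]
    return sorted(range(len(sims)), key=lambda i: sims[i], reverse=True)

database = [np.random.randint(0, 256, (64, 64, 3), dtype=np.uint8) for _ in range(5)]
feats = [color_histogram(img) for img in database]
print(rank_by_similarity(color_histogram(database[0]), feats))   # image 0 ranks itself first
```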

xtuner


[[llms]]

# Xtuner 

### Single-turn and multi-turn conversation datasets

A single-turn dataset is effective for simple *FAQ bots and text-classification-related tasks.*

A multi-turn conversation dataset is required for applications needing sustained (long-running) interaction like *customer support, mental health counselling and talkbot robots.*


#### incremental pre-training: 

    training Llama 2 on a Nepali corpus to boost its Nepali language understanding.


**For instruction tuning, the response-generation (output) loss is used for weight updates, while the loss on the instruction part (system input) is ignored.**


*amalgamate:combine*


#### multi-turn conversation dataset sample: 


    <|system|> You are a helpful assistant.
    <|user|> What is the capital of France?
    <|assistant|> Paris is the capital of France.
    <|user|> What's the population?
    <|assistant|> About 67 million people live in France.
    <|user|> Who is the president?
    <|assistant|> Emmanuel Macron is the current president.



**xtuner uses their own method to deal with multi-turn conversation dataset**

i. concatenate the full conversation into one sequence. 
ii. add special **<|user|> and <|assistant|>** tokens to *mark who said what.*

iii. compute the loss only for *assistant tokens* (a loss mask is used: 1 means compute the loss, 0 means ignore), as sketched below.

iv. training becomes fast and efficient.
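
A rough sketch of the loss-masking idea in step iii (not Xtuner's actual implementation): positions inside assistant replies get mask 1, everything else 0, and the cross-entropy loss is averaged only over the masked positions.

```python
import torch
import torch.nn.functional as F

def masked_lm_loss(logits, labels, loss_mask):
    """Cross-entropy averaged only over positions where loss_mask == 1 (assistant tokens)."""
    per_token = F.cross_entropy(logits, labels, reduction="none")    # loss at every position
    return (per_token * loss_mask).sum() / loss_mask.sum().clamp(min=1)

logits = torch.randn(6, 100)                        # (seq_len, vocab_size)
labels = torch.randint(0, 100, (6,))                # target token IDs
loss_mask = torch.tensor([0., 0., 1., 1., 0., 1.])  # 1 only inside <|assistant|> spans
print(masked_lm_loss(logits, labels, loss_mask))
```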


**OpenAI's text-davinci-003 engine was used for dataset generation; the Alpaca dataset was generated with that engine.** It is a single-turn dataset.



[arxiv_dataset](https://www.kaggle.com/datasets/Cornell-University/arxiv)

[MOSS: an open conversational llm](https://link.springer.com/article/10.1007/s11633-024-1502-8)

A 16B-parameter model which can perform a variety of instructions in *multi-turn interactions with humans*.

**Datasets** are also provided for *SFT*.

[moss-oo3-sft](https://github.com/OpenLMLab/MOSS/tree/main/SFT_data)

    a multi-turn dataset, 1.1 million dialogue samples *(full open-source)*


### Preference-aware training

A method to align the model training process explicitly with human preferences (RLHF).


#### Spinning up training job with Xtuner

1. SLURM: Simple Linux Utility for Resource Management,

a fault-tolerant and highly scalable cluster management and job scheduling system. 

manages resources (CPU, GPU, RAM, and nodes in a Linux machine)

reference command: **srun**

2. Kubernetes

A container orchestration platform, used with Xtuner for orchestrating containerized training jobs across multiple nodes. 
 
######################################################################

[**accumulative_counts = 4** *(we do 4 forward/backward passes before stepping the optimizer, so the effective batch size is 4× the per-pass batch size)*]


##### norm-based gradient clipping

rescales the gradient vector value if ||g|| > 1, limiting the magnitude to 1. 

if ||g|| <= 1, gradient left unchanged. 
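
A minimal plain-PyTorch sketch of what accumulative_counts = 4 plus norm-based gradient clipping amount to; Xtuner handles this through its config, and the tiny model and data below are stand-ins.

```python
import torch
import torch.nn as nn

model = nn.Linear(16, 2)                             # stand-in model
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)
dataloader = [(torch.randn(8, 16), torch.randint(0, 2, (8,))) for _ in range(8)]

accumulative_counts = 4                              # 4 forward/backward passes per optimizer step
max_grad_norm = 1.0

optimizer.zero_grad()
for step, (x, y) in enumerate(dataloader):
    loss = nn.functional.cross_entropy(model(x), y) / accumulative_counts
    loss.backward()                                  # gradients accumulate across micro-batches
    if (step + 1) % accumulative_counts == 0:
        # norm-based clipping: rescale the gradients if ||g|| > 1, otherwise leave them unchanged
        torch.nn.utils.clip_grad_norm_(model.parameters(), max_grad_norm)
        optimizer.step()
        optimizer.zero_grad()
```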



datasets and finetuning

 [[llms]]

# Domain Specific Dataset Curation for Effective Finetuning

[[axolotl]]

# LLM Finetuning Datasets & Methodologies: Comprehensive Technical Guide

(referenced from claude.ai)

## Models Consideration

Qwen2.5-7B, llama2-7b and llama3.2-3b

## Quantization

qlora 4-bit quantization for 7b models and standard lora for 3b models. 


## Datasets

### Instruction tuning datasets

#### Multi Domain datasets

[Alpaca-52k ](https://huggingface.co/datasets/tatsu-lab/alpaca)

Alpaca is the format for **single-turn conversation datasets.**

It can be used for general reasoning and creative writing, with 4-6 hours of training on a 3B model. 

[Ultrachat-200k](https://huggingface.co/datasets/HuggingFaceH4/ultrachat_200k)

A multi-turn conversation dataset with *complex reasoning chains* and *natural dialogue*.

[anon8231489123/ShareGPT_Vicuna_unfiltered](https://huggingface.co/datasets/anon8231489123/ShareGPT_Vicuna_unfiltered/blob/main/ShareGPT_V3_unfiltered_cleaned_split_no_imsorry.json) ([text](https://huggingface.co/datasets/anon8231489123/ShareGPT_Vicuna_unfiltered/blob/main/ShareGPT_V3_unfiltered_cleaned_split.json))

90k ChatGPT conversations, for human-like behavior finetuning. 

#### Enhanced instruction datasets

[microsoft/orca-math-word-problems-200k](http://huggingface.co/datasets/microsoft/orca-math-word-problems-200k)

for mathematical reasoning with step-by-step solutions. 

up to grade 12. 

format: problem statement....reasoning......final answer. 

### Maths Dataset

1. GSM8K contains 8,500 grade-school maths problems, ranging from basic arithmetic through pre-algebra. 


2. [hendrycks/competition_math](taken down)

 12,500 competition-level problems (algebra, theory, calculus and number theory)

 ### Conversational Assistant Datasets

1. PersonaChat [bavard/personachat_truecased](https://huggingface.co/datasets/bavard/personachat_truecased)
        Contains 160k dialogues with personality traits; good for a **human-like engagement training objective.**
2. Empathetic dialogues [empathetic_dialogues](discarded)

    25k conversations

    for emotional understanding and assistant like behavior development. 

3. BlenderBot3-Dialog [facebook/blended_skill_talk](https://www.kaggle.com/datasets/thedevastator/multi-modal-conversation-data)
    
    76k conversations
    knowledge, empathy, personality and consistency. 


### Specific Assistant Behavior Datasets

1. Assistant Conversations by Anthropic [Anthropic/hh-rlhf](https://huggingface.co/datasets/Anthropic/hh-rlhf)

    161k human-assistant dialogues. 
    Helpful, harmless and honest responses, in an RLHF-ready format. 

2. OpenAssistant Conversations [OpenAssistant/oasst1](https://huggingface.co/datasets/OpenAssistant/oasst1)

    161k human-generated conversations. 
    Includes multiple languages (might contain Nepali as well).


### Nepali TTS Development

    OpenSLR Nepali [openslr/43](https://openslr.org/43/)
   


   ## Can we finetune a vision language model on Maths Dataset/pictures?

### InternLM-Math [internlm/internlm2-math-plus-7b](https://huggingface.co/internlm/internlm2-math-plus-7b)

7b and 20b models which are pre-trained with ~100B math-related tokens and *SFT* with
~2M bilingual math supervised data. 

{{MinHash and exact number match were used to decontaminate possible test-set leakage.}}

InternLM-Math is a solver, prover, verifier and augmentor. 

It was evaluated for formal math reasoning with this evaluation set [MiniF2F-test](https://github.com/openai/miniF2F)

the dataset contains maths problems (theorem proving) from olympiads as well as high-school and undergraduate maths classes. 


For informal maths reasoning, MATH, MATH-Python and GSM8K are used as evaluation sets. 

InternLM-Math-7b performance: **34.6, 50.9, 78.1**

the 7b model outperforms the deepseek-7b-rl model 


InternLM-Math will be combined with Lean 3 (for theorem proving and maths problem solving). 

[Lean 3](https://lean-lang.org/doc/reference/latest/Elaboration-and-Compilation/) is an interactive theorem prover and functional programming language based on dependent type theory, which means types can depend on terms, enabling expressive formalization of mathematics and programs.

#### How does test-set leakage happen?

Future data used for training in time series. 

Improper cross-validation and repeated use of the test set during hyperparameter tuning. 

Information from the test fold influences the training process of the model, causing data leakage. 



## Mixture of Experts
A machine learning architecture where the LLM is divided into multiple networks called experts, and a **gating network** dynamically selects and routes each input to one or a few relevant experts.

Models like Mixtral-8x7B, the YouTube recommendation system, Z-code and the Switch Transformer are based on MoE.

Different routing methods in MoE are top-k routing, top-1 routing (only one expert per input token), expert-choice routing (experts decide which inputs they can handle best), and sparse activation/routing (only a subset of experts is activated); a minimal top-k sketch follows below.

**Capacity factor** is the hyperparameter that influences how many tokens each expert can handle during training and inference.
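
The top-k routing sketch mentioned above, as a toy PyTorch module: a gating network scores the experts, only the top-k are run per token (sparse activation), and their outputs are mixed by the softmaxed gate weights. The per-token loop is for clarity, not efficiency, and there is no capacity-factor logic here.

```python
import torch
import torch.nn as nn

class TopKMoE(nn.Module):
    """Toy mixture-of-experts layer: a gating network picks the top-k experts per token."""
    def __init__(self, d_model=64, d_ff=128, n_experts=8, k=2):
        super().__init__()
        self.experts = nn.ModuleList([
            nn.Sequential(nn.Linear(d_model, d_ff), nn.ReLU(), nn.Linear(d_ff, d_model))
            for _ in range(n_experts)
        ])
        self.gate = nn.Linear(d_model, n_experts)    # the gating network
        self.k = k

    def forward(self, x):                            # x: (tokens, d_model)
        scores = self.gate(x)                        # (tokens, n_experts)
        topk_scores, topk_idx = scores.topk(self.k, dim=-1)
        weights = torch.softmax(topk_scores, dim=-1)
        out = torch.zeros_like(x)
        for t in range(x.size(0)):                   # naive per-token loop for clarity
            for slot in range(self.k):               # only k experts run per token (sparse)
                e = topk_idx[t, slot].item()
                out[t] += weights[t, slot] * self.experts[e](x[t])
        return out

moe = TopKMoE()
print(moe(torch.randn(5, 64)).shape)                 # torch.Size([5, 64])
```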


## Xtuner by InternLM

A finetuning toolkit for large language models; it can finetune 7B models within 8 GB of VRAM. 

Supported models are **internlm, mixtral, llama and qwen**. 

QLORA can be used for finetuning InternLM with publicly available datasets. 

For example
```xtuner train internlm_7b_qlora_oasst_e3```

**Python3.10 support**
```conda create -n xtuner_env python=3.10```
``` pip install -U xtuner```

*Deepspeed* module not found; it can be installed with ```pip install deepspeed```

encountered another issue

{{ raise MissingCUDAException("CUDA_HOME does not exist, unable to compile CUDA op(s)")
      op_builder.builder.MissingCUDAException: CUDA_HOME does not exist, unable to compile CUDA op(s)
      [end of output]}}


The above error is encountered due to the lack of a CUDA compiler: PyTorch installs the CUDA runtime, but *nvcc --version* checks whether the CUDA compiler is installed or not. 

### [CUDA Compiler Installation](https://docs.nvidia.com/cuda/cuda-installation-guide-linux/#ubuntu-installation)

1. ```wget https://developer.download.nvidia.com/compute/cuda/repos/ubuntu2404/x86_64/cuda-keyring_1.1-1_all.deb```

2. ```sudo dpkg -i cuda-keyring_1.1-1_all.deb```

3. ``` sudo apt install cuda-toolkit -y```
4. ``` sudo apt install nvidia-gds```
5. ``` reboot ```


Or

1. ```sudo apt install nvidia-cuda-toolkit``` in Ubuntu 24.04.2 LTS. 





### What is the Nvidia GDS package?

GDS means GPUDirect Storage, which enables bypassing the CPU in the data path.
It allows **direct memory access (DMA) transfers between the GPU and storage devices**.




axolotl


## Axolotl (alternative to huggingface/transformers)

It is a tool for full finetuning, parameter-efficient finetuning and alignment techniques, with support for multiple model architectures like llama, mixtral, phi, qwen, mixtral-moe, gemma, gpt-j, pythia etc. 

Support includes fp16/fp32, LoRA, QLoRA, GPTQ and flash attention, plus pre-training, finetuning and preference-based post-training (DPO, ORPO and PRMs).

Its installation requires **packaging==23.2, setuptools==75.8.0, wheel, ninja along with flash-attn and deepspeed.**

Finetuning is configured through YAML files. 

### Dataset Format required for pre-training. 

```python
{"text": "first row"}
{"text": "second row"}
...
```
in **.jsonl format.** 

```python
from datasets import load_dataset
```
It loads various dataset formats including *jsonl, csv, arrow, parquet, sql and WebDataset*.

### Dataset Format required for SFT

SFT means training model to respond to an instruction or chat input. (chatbots like GPT and Gemini)

Formats supported are **Conversation Dataset and Instruction Dataset** along with *tokenized dataset*

#### Conversation dataset 

It usually contains **role and content** keys. 

This is handled by a chat_template, a Jinja2 template which formats the list of messages into a prompt. 


#### <|im_start|> and <|im_end|>

They are **delimiters**: markers that separate the different speakers, allowing the model to identify which portion belongs to whom.

##### Sharegpt format

{"conversations": [{"from": "...", "value": "..."}]}

##### OpenAI format

{"messages": [{"role": "...", "content": "..."}]}

possible roles are *user, system, assistant*

{{**What do you want to mask?** }}

we can bring our own custom template via: 

**chat_template_jinja: # your template**
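
As an illustration of how such a Jinja2 chat template is applied in practice, the Hugging Face tokenizer API can render role/content messages into a single prompt string; the model name below is just an example of a chat model whose template uses <|im_start|>/<|im_end|> delimiters.

```python
from transformers import AutoTokenizer

# Example chat model; its tokenizer ships with a Jinja2 chat_template
tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen2.5-7B-Instruct")

messages = [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "What is the capital of France?"},
    {"role": "assistant", "content": "Paris is the capital of France."},
]

prompt = tokenizer.apply_chat_template(messages, tokenize=False)
print(prompt)   # one formatted string with the model's speaker delimiters
```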

#### Instruction Dataset

used for training instruction following models. 

common format

```{"instruction": "...", "input": "...", "output": "..."}```

This is called the Alpaca instruction dataset format. 

but custom instruction prompts are also supported. 

## RLHF

RLHF means the language model is optimized through human feedback, which means its responses are tuned toward outputs that human raters prefer.


### Methods for RLHF

#### DPO 

Direct Preference Optimization


#### IPO

Identity preference optimization

#### KTO

Kahneman-Tversky Optimization

#### GRPO

Group relative policy optimization

#### ORPO

Odds ratio preference optimization




komputer vision


# COMPUTER VISION

```python
    transforms.Normalize((0.5, 0.5, 0.5), (0.5, 0.5, 0.5))
```

The first tuple is the per-channel mean for the RGB color channels and the second tuple is the per-channel standard deviation. Input pixel values in the range (0, 1) are mapped to (-1, 1), so the values are centered around 0 after normalization. This improves training stability and convergence. 

**Formula**

```python
normalized_value = (original_value - mean) / std
```

**Before Normalization**

```python
[[[0.0, 0.5, 1.0],
  [0.2, 0.7, 0.9]],
 [[0.1, 0.6, 0.8],
  [0.3, 0.4, 0.5]]]
```

**After Normalization**

```python
[[[-1.0, 0.0, 1.0],
  [-0.6, 0.4, 0.8]],
 [[-0.8, 0.2, 0.6],
  [-0.4, -0.2, 0.0]]]
```

---

---

---

# Image Captioning

***https://readmedium.com/image-caption-model-from-scratch-vit-gpt-94afaae30fb7***

1. patch_size: This variable determines the patch size used in the ViT component. The patch size is the size of the small image patches that are used as input to the ViT model. Here, the patch size is 16x16 pixels.
2. d_model_vit: This variable determines the dimensionality of the output embedding from the ViT component. It is calculated from the patch size and the number of color channels (a flattened 16x16 RGB patch has 16 × 16 × 3 = 768 values).
3. num_patches: This variable determines the number of patches in the input image. It is calculated by dividing the image size by the patch size along each spatial dimension (so a square image gives (image_size / patch_size)² patches).
4. softmax_denom_eps: This variable determines a small value added to the denominator of the softmax function to prevent division by zero.

## Patch Embeddings

Patch Embedding is a technique used in computer vision to convert an image into a format that can be fed into a neural network.

Imagine an image is made up of small, non-overlapping squares called patches. Each patch is a small portion of the image, and it can be thought of as a tiny, independent image.

The Patch Embedding process involves:

- **Dividing the image into patches:** The image is divided into a grid of patches, where each patch is a small square portion of the image.
- **Representing each patch as a vector**: Each patch is represented as a vector, which is a list of numbers that describe the patch's color and texture.
- **Flattening the patch vectors:** The patch vectors are flattened into a long, one-dimensional list of numbers.

In Vision Transformers (ViT), Patch Embedding is used to convert the input image into a sequence of patch embeddings, which are then fed into the transformer network. The transformer network processes the patch embeddings in parallel, allowing it to learn global features and relationships in the image.

**PatchEmbeddings class**

The `PatchEmbeddings` class is responsible for creating the patches of an image using a convolutional layer. Here's a step-by-step explanation:

1. **Convolutional layer**: The class uses a convolutional layer (`nn.Conv2d`) to create the patches of the image. The convolutional layer takes the input image and applies a filter to it, resulting in a feature map.
2. **Flatten**: The feature map is then flattened using the `nn.Flatten` layer, which converts the 3D feature map into a 2D tensor.
3. **Permute**: The flattened tensor is then permuted using the `permute` method, which rearranges the dimensions of the tensor. The resulting tensor has shape `(B, N, D_MODEL)`, where `B` is the batch size, `N` is the number of patches, and `D_MODEL` is the dimensionality of the patch embeddings.

```python
self.conv_patch_layer = nn.Conv2d(in_channels=config['channels'],
                                  out_channels=config['d_model'],
                                  kernel_size=config['patch_size'],
                                  stride=config['patch_size'])
```

**ViTEmbedding class**

The `ViTEmbedding` class creates the input embeddings for the ViT model by combining both patch and positional embeddings. Here's how it works:

1. **Class token embedding**: The class token embedding is a learnable parameter that represents the class token. The class token is a special token that is used to represent the entire image.
2. **Positional embedding**: The positional embedding is a learnable parameter that represents the position of each patch in the image. The positional embedding is used to capture the spatial relationships between patches.
3. **Patch embeddings layer**: The `PatchEmbeddings` class is used to create the patch embeddings from the input image.
4. **Dropout**: The patch embeddings are then passed through a dropout layer, which randomly sets a fraction of the output elements to zero during training.
5. **Add positional embedding**: The patch embeddings with class token are then added to the positional embedding, resulting in the final input embeddings for the ViT model (a small sketch follows this list).
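
Below is a self-contained sketch of the two classes described above, assuming standard ViT-style dimensions (16x16 patches, 224x224 images); the exact sizes and ordering in the referenced article may differ.

```python
import torch
import torch.nn as nn

class PatchEmbeddings(nn.Module):
    """Conv2d patchify -> flatten -> permute to (B, N, D_MODEL), as described above."""
    def __init__(self, channels=3, d_model=768, patch_size=16):
        super().__init__()
        self.conv_patch_layer = nn.Conv2d(channels, d_model,
                                          kernel_size=patch_size, stride=patch_size)
        self.flatten = nn.Flatten(start_dim=2)               # (B, D, H/P, W/P) -> (B, D, N)

    def forward(self, x):
        return self.flatten(self.conv_patch_layer(x)).permute(0, 2, 1)   # (B, N, D)

class ViTEmbedding(nn.Module):
    """Patch embeddings + class token + dropout + positional embedding."""
    def __init__(self, channels=3, d_model=768, patch_size=16, image_size=224, dropout=0.1):
        super().__init__()
        num_patches = (image_size // patch_size) ** 2
        self.patch_embed = PatchEmbeddings(channels, d_model, patch_size)
        self.cls_token = nn.Parameter(torch.randn(1, 1, d_model))
        self.pos_embed = nn.Parameter(torch.randn(1, num_patches + 1, d_model))
        self.dropout = nn.Dropout(dropout)

    def forward(self, x):                                    # x: (B, C, H, W)
        patches = self.dropout(self.patch_embed(x))          # (B, N, D)
        cls = self.cls_token.expand(x.size(0), -1, -1)       # one class token per image
        tokens = torch.cat([cls, patches], dim=1)            # prepend the class token
        return tokens + self.pos_embed                       # add positional information

print(ViTEmbedding()(torch.randn(2, 3, 224, 224)).shape)     # torch.Size([2, 197, 768])
```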

## Creating Patch Embeddings using Convolutional Layers

Patch embeddings are created using a two-dimensional convolutional layer. This might seem surprising, as many people think that patch embeddings are created by simply dividing an image into patches and flattening them.

### Why Convolutional Layers?

However, using convolutional layers to create patch embeddings has several advantages:

1. Computational Efficiency: Convolutional layers are highly optimized and come pre-built with deep learning libraries like PyTorch and TensorFlow. This means that they can be used efficiently and effectively, without the need to implement custom patch embedding code.
2. Capturing Different Information: Convolutional layers can capture different types of information from the image, such as edges, textures, and patterns. This is because they are designed to extract features from images, which is exactly what we need to create patch embeddings.

**How it Works**

Here's a step-by-step explanation of how patch embeddings are created using convolutional layers:

1. **Convolutional Layer**: A 2D convolutional layer is applied to the input image. This layer extracts features from the image, such as edges and textures.
2. **Feature Maps**: The convolutional layer produces a feature map, which is a 2D array of values that represent the features extracted from the image.
3. **Flattening**: The feature map is then flattened into a 1D array, which represents the patch embeddings.
4. **Classification Token**: The classification token is appended to the front of the patch embeddings, which represents the entire image.

## Normalization

In reality, the mean of 0 and a standard deviation of 1 are mathematical concepts that are used to normalize the input data. Here's what it means in simple terms:

**Mean of 0**

Imagine you have a dataset of exam scores, and the average score is 80. If you subtract 80 from each score, you get a new set of scores that have a mean of 0. This means that the scores are centered around 0, and there is no longer an overall bias or shift in the data.

In the context of neural networks, normalizing the input data to have a mean of 0 helps to:

- Reduce the effect of outliers or extreme values
- Improve the stability of the model
- Enhance the accuracy of the model

**Standard Deviation of 1**

The standard deviation is a measure of how spread out the data is. If the standard deviation is 1, it means that the data points are relatively close to the mean, and there is not a lot of variation in the data.

In the context of neural networks, normalizing the input data to have a standard deviation of 1 helps to:

- Improve the convergence of the model during training
- Enhance the interpretability of the model
- Reduce the risk of overfitting

rest api

 #rnd #web

[[proposal_rudra#Software Tools]]

# REST [REPRESENTATIONAL STATE TRANSFER] API

REST stands for representational state transfer and is a software architecture style that defines a pattern for client and server communications over a network. Performance, scalability, simplicity and reliability are some of the features of the REST architecture which ease the development of websites and software.

### Constraints

1. Stateless server that doesn’t maintain the state between requests from the client. In simpler words, it doesn’t remember anything about the past requests which keep the task of request and response simple and robust.
2. Independent server and client by decoupling each other allowing the changes and updates integration seamless, making maintenance easier.
3. The data retrieved from the server should be cacheable either by client or the server which reduces the load on server along with improved performance.
4. The REST architecture may contain intermediary layers between the client and the main server, which add security, traffic management and extra features. The client may access the resources on the server indirectly through such layers, for example a proxy or a load balancer. 
5. The server provides a uniform interface for accessing resources regardless of their representation. There is a standard way to make requests and receive responses no matter what kind of data it is. Those standards include:
    1. Resources and URLs: Each individual data have their unique address.
    2. HTTP Methods: These methods includes standard methods like GET(get/read data), POST(create/order new data), PUT/PATCH(update existing data) and DELETE(remove  data). 
    3. Representations: Data is sent in standard formats like JSON or XML.

The constraints mentioned above aren’t a specification; rather, they are guidelines and best practices for building a web system. The more we adhere to these principles, the more benefits we get, but they aren’t strict rules that must all be followed. 

## REST APIs and Web Services

REST web service is any web service that adheres to REST architecture constraints. These web services expose their data to the outside world through an API which can be accessed using the REST API with public web URLs. 

Github’s REST API URL: `https://api.github.com/users/<username>`

Data is accessed from a REST API by sending HTTP requests to a specific URL; the API inspects the HTTP method to know which operation to perform on the web service’s resources. Resources can be accessed and manipulated with HTTP requests.

### Status Code

1. 2xx: Successful Operation
2. 3xx: Redirection
3. 4xx: Client Error
4. 5xx: Server Error

### API Endpoints[Door to Web Service]

A REST API exposes a set of public URLs that client applications use to access the resources of a web service. These URLs are called endpoints. Each endpoint is designed for a specific purpose: the endpoint URL selects the web service resource that the HTTP method interacts with.
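
As a quick illustration, here is how a client might call the GitHub endpoint mentioned above with Python's requests library; "octocat" is just an example username.

```python
import requests

# GitHub's REST API endpoint from above; "octocat" stands in for <username>
response = requests.get("https://api.github.com/users/octocat")

print(response.status_code)         # 200 -> successful operation (2xx)
if response.ok:
    user = response.json()          # the resource representation, sent as JSON
    print(user["login"], user["public_repos"])
```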

reinforce


# REINFORCEMENT LEARNING

## Theoretical Foundations of Reinforcement Learning

### Markov Decision Process (MDP)

A Markov process is the simplest member of the Markov family, also known as a Markov chain. Imagine a system that you can only observe: what you observe are called states, and the system can switch between those states. The set of all possible states is known as the state space. For a Markov process, the number of possible states needs to be finite. Also, the system cannot be influenced by you, but it can be observed while it changes. 

For example, looking at the simplest model of the weather in some city, we can observe the current day as sunny or rainy, which is our state space. A sequence of observations over time forms a chain of states, such as [sunny, sunny, rainy, sunny, …], and this is called history.

To call such a system a Markov process, it needs to fulfill the Markov property, which means that the future system dynamics from any state depend on that state only. The main point of the Markov property is that every observable state is self-contained enough to describe the future of the system. In this case, **only one state is required to model the future dynamics of the system, not the whole history** or, say, the last N states. 

As the system complies with the Markov property, you can capture transition probabilities with a transition matrix: a square matrix of size N x N, where N is the number of states in our model. The cell in row i and column j contains the probability of the system transitioning from state i to state j. The transition matrix defines the system dynamics. Additionally, a Markov process implies stationarity: no external factor changes the system dynamics over time. 

In a state transition graph, circles represent the states, arrows represent the possible transitions, and self-loops represent transitions back to the same state. If the model is in the Coffee state, then its next state depends only on the Coffee state, not on any state before it. 

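A tiny sketch of the sunny/rainy example as a transition matrix; the probabilities are made up, and sampling shows that each next state depends only on the current one.

```python
import numpy as np

states = ["sunny", "rainy"]                       # the state space of the weather example
P = np.array([[0.8, 0.2],                         # row i, column j = P(next = j | current = i)
              [0.4, 0.6]])

rng = np.random.default_rng(0)
state, history = 0, ["sunny"]
for _ in range(10):                               # a chain of observations (the "history")
    state = rng.choice(len(states), p=P[state])   # the next state depends only on the current one
    history.append(states[state])
print(history)
```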

In Markov reward processes, the Markov process is extended a bit by adding a reward value to our transitions from state to state. The reward is another square matrix, similar to the transition matrix, where the cell in row i and column j holds the reward given for transitioning from state i to state j. 


The return (Gt) is the sum of rewards the agent collects from time t onward. The discount factor (gamma) is applied at every step starting from the point where the return Gt is calculated: the farther a reward is in time, the higher the power of gamma, and therefore the bigger the discount.
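In standard notation (assuming rewards indexed from t+1, as in most RL texts), the discounted return is:

```latex
G_t = R_{t+1} + \gamma R_{t+2} + \gamma^2 R_{t+3} + \dots
    = \sum_{k=0}^{\infty} \gamma^{k} R_{t+k+1}, \qquad \gamma \in [0, 1]
```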

In RL, the agent uses these rewards to calculate:

1. **Immediate reward** (R) for each step.
2. **Return** (Gt) by summing up discounted future rewards.
3. **Value function** (V(s)) to average the returns for a state.

The state value V(s) is the expected (average) return obtained from the Markov reward process when starting in state s. The equation of the state value is given as:

![image.png](https://prod-files-secure.s3.us-west-2.amazonaws.com/df89fc0b-ce89-458c-88fa-b18c8cdb3e6b/ad74b21c-9945-4fbe-b64f-d5260391d76c/image.png)

This equation simply asks: **“If I start at state s, what is the average total reward I can expect over time?”**
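In standard notation, the state value is the expectation of the return conditioned on the starting state:

```latex
V(s) = \mathbb{E}\left[ G_t \mid S_t = s \right]
```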

V(s) quantifies how **good** a state s is in terms of long-term rewards. In RL, this concept is extended to find optimal policies that maximize V(s).

![image.png](https://prod-files-secure.s3.us-west-2.amazonaws.com/df89fc0b-ce89-458c-88fa-b18c8cdb3e6b/ad5a668d-8551-456f-bfbf-7deae856c754/image.png)

In infinite-horizon problems without terminal (sink) states, setting gamma = 1 makes the agent completely far-sighted: it cares about all future rewards equally, no matter how far away they are, and ends up summing an infinite series of rewards.

Gamma = 1 is fine for finite-horizon problems (e.g., tic-tac-toe) but impractical in infinite-horizon problems without a stopping condition. A larger gamma (e.g., 0.9 or 0.99) means the agent weighs the **long-term future more**, but rewards farther in time still diminish in value. Gamma < 1 avoids the problem of infinite sums, which is common in infinite-horizon settings.

---

### Policy

A policy is a set of rules that controls the agent's behavior. The main objective of RL is to gather as much cumulative return as possible. Mathematically, a policy can be represented by the following equation:

![image.png](https://prod-files-secure.s3.us-west-2.amazonaws.com/df89fc0b-ce89-458c-88fa-b18c8cdb3e6b/1306c345-53e4-4133-9b89-187b8c9f4f5e/image.png)

where ‘|’ denotes conditioning, P denotes probability, At denotes the action ‘a’ chosen by the agent at time step t, and St denotes the state ‘s’ the agent is in at time t.

The equation basically asks: **“If I am in state s, what action a should I take?”**

There are two types of policies in RL (a small sketch follows the list):

1. Deterministic Policy where the action ‘a’ is chosen with certainty.
2. Stochastic Policy where the actions are chosen randomly based on the probability distribution. 
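A minimal sketch of the distinction, assuming a toy state and action space invented purely for illustration:

```python
import numpy as np

actions = ["left", "right"]
rng = np.random.default_rng(0)

def deterministic_policy(state):
    # Always maps a state to exactly one action
    return "right" if state == "near_goal" else "left"

def stochastic_policy(state):
    # Samples an action from a probability distribution pi(a | s)
    probs = {"near_goal": [0.1, 0.9], "far_from_goal": [0.7, 0.3]}[state]
    return actions[rng.choice(len(actions), p=probs)]

print(deterministic_policy("near_goal"))   # always "right"
print(stochastic_policy("far_from_goal"))  # "left" with probability 0.7
```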

---


## Dynamic Programming and Bellman Equation

Dynamic programming consists of two parts, "dynamic" and "programming": "dynamic" refers to problems with a temporal or sequential aspect, and "programming" refers to optimizing the policy mathematically.

Any problem to be solved with dynamic programming requires the following properties:

1. Optimal substructure
2. Overlapping sub problems

These properties are satisfied by MDPs (Markov Decision Processes), which allows the Bellman optimality equation to decompose the problem recursively.

---

The Bellman equation of optimality applies to two different cases:

1. Deterministic Case
2. Stochastic Case

### Deterministic Case for Bellman Equation of Optimality

Deterministic cases are problems where actions have a 100% guaranteed outcome and are not influenced by randomness. The equation for the deterministic case is given by:

![image.png](https://prod-files-secure.s3.us-west-2.amazonaws.com/df89fc0b-ce89-458c-88fa-b18c8cdb3e6b/d8de4770-54a2-4451-828b-f7c7791dabdd/image.png)

Here:

- ( V^*(s) ) is the optimal value function for state ( s ).
- ( R(s, a) ) is the reward received after taking action ( a ) in state ( s ).
- ( \gamma ) is the discount factor, which determines the importance of future rewards.
- ( s' ) is the next state resulting from taking action ( a ) in state ( s ).
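Written out in standard notation, consistent with the symbol definitions above, the deterministic Bellman optimality equation is:

```latex
V^*(s) = \max_{a} \left[ R(s, a) + \gamma \, V^*(s') \right]
```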

### Stochastic Case for Bellman Equation of Optimality

In the stochastic case of the Bellman optimality equation, the outcomes of actions are governed by probabilities: taking an action in a given state can lead to multiple possible next states, each with a certain probability.

The equation for the stochastic case is given as: 

![image.png](https://prod-files-secure.s3.us-west-2.amazonaws.com/df89fc0b-ce89-458c-88fa-b18c8cdb3e6b/e8db5520-6b2f-498b-9a21-83f9b9d92829/image.png)

Here:

- ( V^*(s) ) is the optimal value function for state ( s ).
- ( R(s, a) ) is the expected reward received after taking action ( a ) in state ( s ).
- ( \gamma ) is the discount factor, which determines the importance of future rewards.
- ( P(s' | s, a) ) is the probability of transitioning to state ( s' ) from state ( s ) after taking action (a).
- ( s' ) represents the possible next states.
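In standard notation, matching the symbol definitions above, the stochastic Bellman optimality equation is:

```latex
V^*(s) = \max_{a} \left[ R(s, a) + \gamma \sum_{s'} P(s' \mid s, a) \, V^*(s') \right]
```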


### Q(s,a) & V(s)

Q(s,a) is known as the Q-value (state-action value) function, whereas V(s) is known as the (state) value function. They are connected mathematically as follows:

![image.png](https://prod-files-secure.s3.us-west-2.amazonaws.com/df89fc0b-ce89-458c-88fa-b18c8cdb3e6b/ee9b6fec-56ac-43c8-9e68-7fecad2cd460/image.png)

The key difference between Q(s,a) and V(s) lies in what they evaluate:

- **V(s) - State Value Function:** Represents the expected return starting from state s and following the current policy. It tells us how good it is to be in a particular state.
- **Q(s,a) - State-Action Value Function:** Represents the expected return starting from state s, taking action a, and then following the current policy. It tells us how good it is to take a specific action in a particular state.

The relationship between Q(s,a) and V(s) can be expressed as:

V(s) = max Q(s,a) for all actions **a**

This means the value of a state is equal to the maximum Q-value possible from that state across all possible actions.

### Bellman Equation for General Case

According to Bellman's optimality proof, at every state the agent ends up in, it needs to select the action with the maximum expected reward, which is a sum of the immediate reward and the one-step discounted long-term reward.

![image.png](https://prod-files-secure.s3.us-west-2.amazonaws.com/df89fc0b-ce89-458c-88fa-b18c8cdb3e6b/3553e82c-6346-485a-997a-c526cfc31de4/image.png)

Representation of Q(s,a) recursively: 

![image.png](https://prod-files-secure.s3.us-west-2.amazonaws.com/df89fc0b-ce89-458c-88fa-b18c8cdb3e6b/357aad7a-4bf7-4c32-a859-02d1fcc8045e/image.png)
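In standard notation, this recursive relationship can be written as:

```latex
Q(s, a) = R(s, a) + \gamma \sum_{s'} P(s' \mid s, a) \, \max_{a'} Q(s', a')
```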

## Value Iteration Method



[[computer-vision]]

# DEEPFAKE DETECTION

Deepfake technology is dedicated to creating highly realistic facial images and videos under specific conditions, and it has significant application potential in fields such as entertainment, movie production, and digital human creation. In parallel with deepfake generation, deepfake detection technology continuously evolves to regulate potential misuse, such as privacy invasion and phishing attacks.

As deepfakes become more realistic and widespread across social media, it becomes harder to verify the authenticity of all kinds of information sources. Manipulating content such as photography or audio also raises ethical issues around consent.

## How are Deepfakes Made?

There are two main ways to create deepfake images, videos, and audio. They are listed below:

1. Generative Adversarial Networks
2. Diffusion Models

1. GANs (Generative Adversarial Networks): A GAN is composed of two models that play a game against each other. The first model, the generator, produces fake images or videos. The second model, the discriminator, decides whether a given image or video is real or fake. The generator wins the game if the discriminator cannot tell that generated content is fake. Playing this game over and over trains the generator to produce realistic content, while the discriminator keeps improving its ability to guess correctly whether the content is real (a minimal training-loop sketch follows this list).
2. Diffusion models: A diffusion model is trained to restore an image or video to its original state after visual 'noise' has been added. Some diffusion models are trained with guidance, such as text prompts encouraging them to generate particular images, while others decide on their own what the likeliest output is. The resulting models can 'inpaint' missing patches in an image, filling the gaps with something plausible. Stable Diffusion and DALL-E 2 are both examples of diffusion models that take text prompts as part of their input. Diffusion models are newer than GANs and likely to become more prominent in deepfake generation, as they are believed to be easier to train than GANs.
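A minimal sketch of the adversarial game described in item 1, using PyTorch on flat vectors rather than real images; the network sizes, learning rates, and fake "data" are arbitrary choices for illustration, not taken from any particular deepfake system:

```python
import torch
from torch import nn

latent_dim, data_dim = 16, 64

# Generator: maps random noise to a fake sample
G = nn.Sequential(nn.Linear(latent_dim, 128), nn.ReLU(), nn.Linear(128, data_dim))
# Discriminator: scores a sample as real (1) or fake (0)
D = nn.Sequential(nn.Linear(data_dim, 128), nn.ReLU(), nn.Linear(128, 1), nn.Sigmoid())

opt_g = torch.optim.Adam(G.parameters(), lr=2e-4)
opt_d = torch.optim.Adam(D.parameters(), lr=2e-4)
bce = nn.BCELoss()

for step in range(100):
    real = torch.randn(32, data_dim) + 2.0          # stand-in for real data
    fake = G(torch.randn(32, latent_dim))

    # Discriminator step: label real samples as 1 and generated samples as 0
    d_loss = bce(D(real), torch.ones(32, 1)) + bce(D(fake.detach()), torch.zeros(32, 1))
    opt_d.zero_grad(); d_loss.backward(); opt_d.step()

    # Generator step: try to make the discriminator output 1 for fakes
    g_loss = bce(D(fake), torch.ones(32, 1))
    opt_g.zero_grad(); g_loss.backward(); opt_g.step()
```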

### Deepfake Detection

Deepfakes are becoming increasingly hard to detect due to advances in the generative AI methods used to create them. Images, videos, and audio can still be classified as deepfakes based on spatial and visual inconsistencies in the generated content. Video and audio deepfakes can be given away by time-based inconsistencies, such as a mismatch between speech and mouth movements. Generation methods such as GANs and diffusion models can also leave detectable 'fingerprints' within the pixels of images or videos.

Deepfake detection open-source projects:

1. **Faceswap.dev**
2. https://github.com/shaoanlu/faceswap-GAN; Face tracking/alignment using MTCNN and kalman filter in video conversion

![image.png](https://prod-files-secure.s3.us-west-2.amazonaws.com/df89fc0b-ce89-458c-88fa-b18c8cdb3e6b/aee735a5-eac4-4d5e-b899-fc378944231e/image.png)

1. https://www.youtube.com/watch?v=x2g48Q2I2ZQ
2. https://github.com/Billy1900/Awesome-DeepFake-Learning?tab=readme-ov-file#3-curated-lists

**spatio-temporal action recognition**

https://www.creativebloq.com/features/deepfake-examples

https://github.com/jacobgil/pytorch-grad-cam

https://typeset.io/

# Classification Model Neural Architecture with PyTorch.

Recent evidence shows that network depth has a critical effect on performance: the main results on the challenging ImageNet dataset all employ very deep models, ranging from 16 to 30 layers (https://arxiv.org/pdf/2208.08231).

However, the first obstacle is the infamous vanishing and exploding gradient problem, which hinders the convergence of the network. Researchers later found that this problem can be alleviated by normalizing the input data and by batch normalization, which is generally enough for networks a dozen layers deep.

Simply stacking layers to increase the depth of the network does not improve its performance. He et al. call this phenomenon the degradation problem, which shows that not all systems are equally easy to optimize (He et al.).

## Inception-ResNetv1

Inception-ResNet v1 is a hybrid network inspired by Inception and by the performance of ResNet. There are two versions, v1 and v2, with v2 having a comparatively higher computational cost.

Inception-ResNet uses a stem module, several types of Inception-ResNet blocks, and reduction blocks. The output of each Inception block is added to its input (a residual connection).

![Schematic for Inception-ResNet v1 AND v2  Network](https://prod-files-secure.s3.us-west-2.amazonaws.com/df89fc0b-ce89-458c-88fa-b18c8cdb3e6b/2ebf2ce2-dad3-44ff-8636-6a14f3106f02/image.png)

Schematic for Inception-ResNet v1 AND v2  Network

The output of the Inception block and the input from the previous layer must have the same dimension so they can be added without any alteration; factorization of the convolution filters becomes important for matching these dimensions. However, further studies showed that the network "dies" when the number of convolution filters exceeds 1000 (https://iq.opengenus.org/inception-resnet-v1/). This problem was later solved by introducing activation scaling.
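A minimal sketch of a residual addition with activation scaling; the scale value, branch layers, and tensor shapes are chosen only for illustration and do not reproduce the actual Inception-ResNet blocks:

```python
import torch
from torch import nn

class ScaledResidualBlock(nn.Module):
    def __init__(self, channels, scale=0.2):
        super().__init__()
        # Stand-in for an Inception-style branch; output channels match the input
        self.branch = nn.Sequential(
            nn.Conv2d(channels, channels, kernel_size=3, padding=1),
            nn.ReLU(),
            nn.Conv2d(channels, channels, kernel_size=3, padding=1),
        )
        self.scale = scale

    def forward(self, x):
        # Scale the branch output before adding it back to the input
        return torch.relu(x + self.scale * self.branch(x))

block = ScaledResidualBlock(channels=32)
print(block(torch.randn(1, 32, 56, 56)).shape)  # torch.Size([1, 32, 56, 56])
```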

![    Inception-ResNetv1](https://prod-files-secure.s3.us-west-2.amazonaws.com/df89fc0b-ce89-458c-88fa-b18c8cdb3e6b/b35538f6-71c4-4e6d-b27a-5487e8f42699/image.png)

Inception-ResNetv1

![Inception-ResNetv2](https://prod-files-secure.s3.us-west-2.amazonaws.com/df89fc0b-ce89-458c-88fa-b18c8cdb3e6b/477b6378-9d9e-410a-9571-75f7653e5e42/image.png)

Inception-ResNetv2

# LSTM (Long Short Term Memory)

## Video Vision Transformer for Video Classification

The VivitForVideoClassification class in the Hugging Face Transformers library provides a PyTorch implementation of the Video Vision Transformer designed for video classification. This class requires a VivitConfig object, which contains all the necessary parameters for the model's architecture. Initializing the model from a configuration only builds the structure; it does not load the model's weights. To load pretrained weights, the **from_pretrained()** method needs to be used.

The ViViT model is a powerful architecture for video classification. It processes video frames as sequences of patches, with a classification head attached on top of the model. It also supports fine-tuning at new resolutions by enabling **interpolate_pos_encoding**, which adjusts the position embeddings. This allows pre-trained weights to be leveraged effectively on datasets like **Kinetics-400**.

The forward method accepts parameters such as **pixel_values, head_mask, output_attentions, output_hidden_states, interpolate_pos_encoding, return_dict,** and **labels.**

```python
import torch
from transformers import VivitForVideoClassification, VivitImageProcessor

# The original snippet used VivitModel; VivitForVideoClassification adds the
# classification head and accepts labels. The checkpoint name below is an
# assumption: the published ViViT checkpoint on the Hub is
# "google/vivit-b-16x2-kinetics400" (Kinetics-400 head).
ckpt = "google/vivit-b-16x2-kinetics400"
model = VivitForVideoClassification.from_pretrained(ckpt)
processor = VivitImageProcessor.from_pretrained(ckpt)

# The processor expects decoded video frames (e.g. numpy arrays), not file paths.
# Here a random 32-frame "video" stands in for real data.
video = list(torch.randint(0, 256, (32, 224, 224, 3), dtype=torch.uint8).numpy())
inputs = processor(video, return_tensors="pt")

labels = torch.tensor([0])   # example ground-truth class index
head_mask = None             # no masking of attention heads

# Forward pass (calling the model invokes forward())
outputs = model(
    pixel_values=inputs.pixel_values,
    head_mask=head_mask,
    labels=labels,
    output_attentions=False,
    output_hidden_states=False,
    interpolate_pos_encoding=True,
    return_dict=True,
)

logits = outputs.logits          # classification scores, one per class
print("Model outputs:", logits.shape)
if labels is not None:
    print("Loss:", outputs.loss)  # cross-entropy loss when labels are given
```

### Logits in PyTorch

The raw outputs from the output layer of a neural network are called logits (also known as activations). Because deep networks are, at their core, matrix multiplications and non-linearities like ReLU, logits can take any real value. They cannot be interpreted directly as model scores, which is why an activation such as sigmoid or softmax is applied to them to produce the final score.
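A small sketch of turning logits into scores with PyTorch (the numbers are arbitrary):

```python
import torch

logits = torch.tensor([2.0, -1.0, 0.5])       # raw, unbounded network outputs

probs_softmax = torch.softmax(logits, dim=0)  # multi-class: values sum to 1
probs_sigmoid = torch.sigmoid(logits)         # per-output probability in (0, 1)

print(probs_softmax)  # largest logit gets the highest probability
print(probs_sigmoid)
```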

### Init and Forward Method in PyTorch

```python
from torch import nn

class MyNeuralNetwork(nn.Module):
    def __init__(self, input_size, hidden_size, output_size):
        super(MyNeuralNetwork, self).__init__()
        self.fc1 = nn.Linear(input_size, hidden_size)
        self.relu = nn.ReLU()
        self.fc2 = nn.Linear(hidden_size, output_size)
        self.sigmoid = nn.Sigmoid()

    def forward(self, x):
        out = self.fc1(x)
        out = self.relu(out)
        out = self.fc2(out)
        out = self.sigmoid(out)
        return out
```

**init** is a constructor method used to initialize the parameters of the network; it is executed when an object of the class is created. In PyTorch, this method defines the layers of the network, such as convolutional layers, linear layers, and activation functions. **forward** defines the forward pass of the neural network: it takes the input data and passes it through the layers of the network to produce the output. It is executed whenever the model is called to make a prediction or to compute the loss during training.

In other words, **init** sets up the network by defining the layers, while **forward** specifies how data flows through the network. Both methods are required to create a neural network in PyTorch, and they serve different purposes.
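A brief usage sketch of the class defined above (the sizes are chosen arbitrarily):

```python
import torch

model = MyNeuralNetwork(input_size=10, hidden_size=32, output_size=1)
x = torch.randn(4, 10)   # batch of 4 examples with 10 features each
y = model(x)             # calling the model runs forward() under the hood
print(y.shape)           # torch.Size([4, 1]), values in (0, 1) from the sigmoid
```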

# Model Training on Celebs Dataset

```python
""" """
import os
import cv2
import torch
import numpy as np
from torch import nn
from torchvision import transforms
"""Transforms are common image transformations. They can be chained together using
        Compose. There is also a functional module for transform which provides the 
        fine-grained control over transformations."""
        
from torch.utils.data import Dataset, DataLoader
"""
"""
from facenet_pytorch import InceptionResnetV1
from PIL import Image

class VideoDataset(Dataset):
    def __init__(self, folder_paths, frame_count=20, transform=None):
        self.frame_count = frame_count
        self.transform = transform
        self.videos = []
        self.labels = []
       
        # Process real celebrity videos
        for video_file in os.listdir(folder_paths[0]):
            if video_file.endswith(('.mp4')):
                self.videos.append(os.path.join(folder_paths[0], video_file))
                self.labels.append(0)  # Real
               
        # Process fake celebrity videos
        for video_file in os.listdir(folder_paths[1]):
            if video_file.endswith(('.mp4')):
                self.videos.append(os.path.join(folder_paths[1], video_file))
                self.labels.append(1)  # Fake
               
        # Process YouTube real videos
        for video_file in os.listdir(folder_paths[2]):
            if video_file.endswith(('.mp4')):
                self.videos.append(os.path.join(folder_paths[2], video_file))
                self.labels.append(0)  # Real
   
    def extract_frames(self, video_path):
        frames = []
        cap = cv2.VideoCapture(video_path)
       
        total_frames = int(cap.get(cv2.CAP_PROP_FRAME_COUNT))
        interval = max(total_frames // self.frame_count, 1)
       
        frame_counter = 0
        while len(frames) < self.frame_count and frame_counter < total_frames:
            ret, frame = cap.read()
            if not ret:
                break
               
            if frame_counter % interval == 0:
                frame = cv2.cvtColor(frame, cv2.COLOR_BGR2RGB)
                frame = Image.fromarray(frame)
                if self.transform:
                    frame = self.transform(frame)
                frames.append(frame)
               
            frame_counter += 1
           
        cap.release()
       
        # Pad sequence if necessary
        while len(frames) < self.frame_count:
            frames.append(torch.zeros_like(frames[0]))
           
        return torch.stack(frames)
   
    def __len__(self):
        return len(self.videos)
   
    def __getitem__(self, idx):
        video_path = self.videos[idx]
        frames = self.extract_frames(video_path)
        label = self.labels[idx]
        return frames, torch.tensor(label, dtype=torch.float32)

class DeepFakeDetector(nn.Module):
    def __init__(self, frame_count=20, hidden_size=512):
        super(DeepFakeDetector, self).__init__()
       
        # Load pretrained InceptionResNetV1
        self.feature_extractor = InceptionResnetV1(pretrained='vggface2')
        # Freeze feature extractor parameters
        for param in self.feature_extractor.parameters():
            param.requires_grad = False
           
        # LSTM for sequence processing
        self.lstm = nn.LSTM(
            input_size=512,  # InceptionResNetV1 output size
            hidden_size=hidden_size,
            num_layers=2,
            batch_first=True,
            dropout=0.5
        )
       
        # Final classification layers
        self.classifier = nn.Sequential(
            nn.Linear(hidden_size, 256),
            nn.ReLU(),
            nn.Dropout(0.5),
            nn.Linear(256, 1),
            nn.Sigmoid()
        )
       
    def forward(self, x):
        batch_size, seq_len, c, h, w = x.size()
       
        # Reshape for feature extraction
        x = x.view(-1, c, h, w)
       
        # Extract features
        features = self.feature_extractor(x)
       
        # Reshape for LSTM
        features = features.view(batch_size, seq_len, -1)
       
        # Process with LSTM
        lstm_out, _ = self.lstm(features)
       
        # Use last LSTM output
        lstm_out = lstm_out[:, -1, :]
       
        # Final classification
        output = self.classifier(lstm_out)
        return output

def train_model(model, train_loader, val_loader, epochs=50, device='cuda'):
    criterion = nn.BCELoss()
    optimizer = torch.optim.Adam(model.parameters())
   
    model = model.to(device)
    best_val_loss = float('inf')
   
    for epoch in range(epochs):
        # Training phase
        model.train()
        train_loss = 0
        for frames, labels in train_loader:
            frames, labels = frames.to(device), labels.to(device)
           
            optimizer.zero_grad()
            outputs = model(frames)
            loss = criterion(outputs.squeeze(), labels)
           
            loss.backward()
            optimizer.step()
           
            train_loss += loss.item()
           
        # Validation phase
        model.eval()
        val_loss = 0
        correct = 0
        total = 0
       
        with torch.no_grad():
            for frames, labels in val_loader:
                frames, labels = frames.to(device), labels.to(device)
                outputs = model(frames)
                loss = criterion(outputs.squeeze(), labels)
                val_loss += loss.item()
               
                predicted = (outputs.squeeze() > 0.5).float()
                total += labels.size(0)
                correct += (predicted == labels).sum().item()
       
        train_loss /= len(train_loader)
        val_loss /= len(val_loader)
        accuracy = 100 * correct / total
       
        print(f'Epoch {epoch+1}/{epochs}:')
        print(f'Training Loss: {train_loss:.4f}')
        print(f'Validation Loss: {val_loss:.4f}')
        print(f'Validation Accuracy: {accuracy:.2f}%')
       
        # Save best model
        if val_loss < best_val_loss:
            best_val_loss = val_loss
            torch.save(model.state_dict(), 'best_model.pth')

# Example usage
def main():
    # Set device
    device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
    print(device)
   
    # Define transforms
    transform = transforms.Compose([
        transforms.Resize((160, 160)),
        transforms.ToTensor(),
        transforms.Normalize(mean=[0.485, 0.456, 0.406],
                           std=[0.229, 0.224, 0.225])
    ])
   
    # Create datasets
    folder_paths = [
        'datasets/Celeb-real',
        'datasets/Celeb-synthesis',
        'datasets/YouTube-real'
    ]
   
    # Create full dataset
    dataset = VideoDataset(folder_paths, frame_count=20, transform=transform)
   
    # Split dataset
    train_size = int(0.8 * len(dataset))
    val_size = len(dataset) - train_size
    train_dataset, val_dataset = torch.utils.data.random_split(
        dataset, [train_size, val_size]
    )
   
    # Create data loaders
    train_loader = DataLoader(
        train_dataset,
        batch_size=8,
        shuffle=True,
        num_workers=4
    )
    val_loader = DataLoader(
        val_dataset,
        batch_size=8,
        shuffle=False,
        num_workers=4
    )
   
    # Create and train model
    model = DeepFakeDetector()
    train_model(model, train_loader, val_loader, epochs=50, device=device)

if __name__ == '__main__':
    main()
```

### Probability Distribution

A **probability distribution** is a mathematical function that gives the probabilities of occurrence of the possible outcomes of an experiment. It is a mathematical description of a random phenomenon in terms of its sample space and the probabilities of events. The sample space, often denoted **Ω** (omega), is the set of all possible outcomes of the random phenomenon being observed. The sample space may be any set: a set of real numbers, a set of vectors, a set of arbitrary non-numerical values, etc. The sample space of a coin flip would be **Ω** = {"heads", "tails"}.

How a distribution is defined depends on the type of random variable: **discrete or absolutely continuous.** In the discrete case, it is sufficient to specify a probability mass function *p* assigning a probability to each possible outcome. In contrast, when a random variable takes values from a continuum, any individual outcome is, by convention, assigned probability zero; for such continuous random variables, only events that include infinitely many outcomes, such as intervals, can have probability greater than zero.
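A tiny sketch of a discrete distribution (a biased coin, with probabilities made up for illustration):

```python
# Probability mass function over the sample space {"heads", "tails"}
pmf = {"heads": 0.6, "tails": 0.4}

assert abs(sum(pmf.values()) - 1.0) < 1e-9   # probabilities sum to 1
print(pmf["heads"])                          # probability of the outcome "heads"
```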