Saturday, 6 September 2025

datasets and finetuning

 [[llms]]

# Domain Specific Dataset Curation for Effective Finetuning

[[axolotl]]

# LLM Finetuning Datasets & Methodologies: Comprehensive Technical Guide

(referenced from claude.ai)

## Models Considered

Qwen2.5-7B, Llama-2-7B and Llama-3.2-3B

## Quantization

QLoRA 4-bit quantization for the 7B models, and standard LoRA for the 3B model.
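As a rough sketch of what that looks like in practice, assuming the Hugging Face transformers + bitsandbytes + peft stack (hyperparameters and target modules here are illustrative, not the exact recipe):

```python
# Minimal QLoRA setup sketch: 4-bit frozen base model + trainable LoRA adapters.
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig
from peft import LoraConfig, get_peft_model

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,                      # 4-bit weights for the frozen base model
    bnb_4bit_quant_type="nf4",              # NormalFloat4, the usual QLoRA choice
    bnb_4bit_use_double_quant=True,
    bnb_4bit_compute_dtype=torch.bfloat16,
)

model = AutoModelForCausalLM.from_pretrained(
    "Qwen/Qwen2.5-7B", quantization_config=bnb_config, device_map="auto"
)

lora_config = LoraConfig(
    r=16, lora_alpha=32, lora_dropout=0.05,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],  # illustrative choice
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_config)   # only the LoRA adapters are trainable
model.print_trainable_parameters()
```

For the 3B model with standard LoRA, the same `LoraConfig` applies; drop the `quantization_config` and load the base model in bf16/fp16 instead.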


## Datasets

### Instruction tuning datasets

#### Multi Domain datasets

[Alpaca-52k](https://huggingface.co/datasets/tatsu-lab/alpaca)

Alpaca is the format for **single-turn, conversation-style datasets.**

It can be used for general reasoning and creative writing, with roughly 4-6 hours of training for the 3B model.
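A minimal sketch of what an Alpaca-style record looks like and how it might be rendered into a single-turn prompt (field names follow the tatsu-lab/alpaca dataset card; the prompt template is illustrative):

```python
# Sketch: load Alpaca-52k and render one record into a single-turn prompt.
from datasets import load_dataset

ds = load_dataset("tatsu-lab/alpaca", split="train")
example = ds[0]  # fields: instruction, input (may be empty), output

prompt = (
    f"### Instruction:\n{example['instruction']}\n\n"
    + (f"### Input:\n{example['input']}\n\n" if example["input"] else "")
    + f"### Response:\n{example['output']}"
)
print(prompt)
```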

[Ultrachat-200k](https://huggingface.co/datasets/HuggingFaceH4/ultrachat_200k)

A multi-turn conversation dataset with *complex reasoning chains* and *natural dialogue*.

[anon8231489123/ShareGPT_Vicuna_unfiltered](https://huggingface.co/datasets/anon8231489123/ShareGPT_Vicuna_unfiltered/blob/main/ShareGPT_V3_unfiltered_cleaned_split_no_imsorry.json) ([text](https://huggingface.co/datasets/anon8231489123/ShareGPT_Vicuna_unfiltered/blob/main/ShareGPT_V3_unfiltered_cleaned_split.json))

~90k ChatGPT conversations; useful for finetuning toward human-like behavior.
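Unlike Alpaca, ShareGPT records are multi-turn. A sketch of one record and a naive flattening (the `conversations` / `from` / `value` field names follow the common ShareGPT/Vicuna convention, assumed here rather than verified against this exact file):

```python
# Sketch of a ShareGPT-style multi-turn record and a simple role-tagged flattening.
record = {
    "id": "example-1",
    "conversations": [
        {"from": "human", "value": "Explain what LoRA is in one sentence."},
        {"from": "gpt", "value": "LoRA adds small trainable low-rank matrices to a frozen model."},
        {"from": "human", "value": "And QLoRA?"},
        {"from": "gpt", "value": "QLoRA applies LoRA on top of a 4-bit quantized base model."},
    ],
}

text = "\n".join(f"{turn['from']}: {turn['value']}" for turn in record["conversations"])
print(text)
```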

#### Enhanced instruction datasets

[microsoft/orca-math-word-problems-200k](http://huggingface.co/datasets/microsoft/orca-math-word-problems-200k)

For mathematical reasoning with step-by-step solutions, up to grade 12.

Format: problem statement → reasoning → final answer.

### Maths Dataset

1. GSM8K contains 8,500 grade-school maths problems, ranging from basic arithmetic through pre-algebra.


2. [hendrycks/competition_math](taken down)

    12,500 competition-level problems (algebra, geometry, number theory, precalculus, and more).

### Conversational Assistant Datasets

1. PersonaChat [bavard/personachat_truecased](https://huggingface.co/datasets/bavard/personachat_truecased)

    Contains ~160k dialogue utterances with personality traits. Good for a **human-like engagement-pattern training objective.**
2. Empathetic Dialogues [empathetic_dialogues](discarded)

    ~25k conversations, for emotional understanding and assistant-like behavior development.

3. BlenderBot3-Dialog [facebook/blended_skill_talk](https://www.kaggle.com/datasets/thedevastator/multi-modal-conversation-data)

    76k conversations covering knowledge, empathy, personality and consistency.


### Specific Assistant Behavior Datasets

1. Assistant Conversations by Anthropic [Anthropic/hh-rlhf](https://huggingface.co/datasets/Anthropic/hh-rlhf)

    161k human-assistant dialogues.
    Helpful, harmless and honest responses, in an RLHF-ready format (see the sketch after this list).

2. OpenAssistant Conversations [OpenAssistant/oasst1](https://huggingface.co/datasets/OpenAssistant/oasst1)

    161k human-generated messages organised into conversation trees.
    Includes multiple languages (might contain Nepali as well).
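What makes hh-rlhf "RLHF-ready" is that each record is a preference pair. A quick inspection sketch, assuming the standard `chosen` / `rejected` columns shown on the Hugging Face dataset card:

```python
# Sketch: each hh-rlhf record carries a preferred ("chosen") and a dispreferred
# ("rejected") dialogue, which is what reward-modelling / DPO pipelines consume.
from datasets import load_dataset

ds = load_dataset("Anthropic/hh-rlhf", split="train")
print(ds[0]["chosen"][:300])
print("---")
print(ds[0]["rejected"][:300])
```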


### Nepali TTS Development

OpenSLR Nepali [openslr/43](https://openslr.org/43/)


## Can we finetune a vision-language model on maths datasets/pictures?

### InternLM-Math [internlm/internlm2-math-plus-7b](https://huggingface.co/internlm/internlm2-math-plus-7b)

7B and 20B models which are pre-trained on ~100B math-related tokens and *SFT*-ed on ~2M bilingual math supervised examples.

{{MinHash and exact number matching are used to decontaminate possible test-set leakage.}}
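The notes don't say which tooling InternLM-Math used for this; as an illustrative sketch, here is how MinHash near-duplicate detection plus an exact-number match could flag training examples that overlap a test set (using the `datasketch` library, an assumption on my part):

```python
# Sketch: drop training examples that look like near-duplicates of test examples,
# or that share exactly the same set of numbers (a crude exact-number match).
import re
from datasketch import MinHash, MinHashLSH

def minhash(text, num_perm=128):
    m = MinHash(num_perm=num_perm)
    for token in text.lower().split():
        m.update(token.encode("utf-8"))
    return m

test_set = ["Compute 12 * 37 + 5.", "Prove that sqrt(2) is irrational."]
train_set = ["Compute 12 * 37 + 5, showing each step.", "Integrate x^2 from 0 to 1."]

# Index the test set, then check each training example against it.
lsh = MinHashLSH(threshold=0.7, num_perm=128)
for i, t in enumerate(test_set):
    lsh.insert(f"test-{i}", minhash(t))

test_numbers = [set(re.findall(r"\d+", t)) for t in test_set]

for s in train_set:
    near_dup = bool(lsh.query(minhash(s)))
    nums = set(re.findall(r"\d+", s))
    exact_num_hit = any(nums and nums == tn for tn in test_numbers)
    if near_dup or exact_num_hit:
        print("drop (possible leakage):", s)
```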

InternLM-Math is a solver, prover, verifier and augmentor.

It was evaluated for formal maths reasoning with the [MiniF2F-test](https://github.com/openai/miniF2F) evaluation set.

The dataset contains maths problems (theorem proving) from olympiads as well as high-school and undergraduate maths classes.


For informal maths reasoning, MATH, MATH-Python and GSM8K are used as evaluation sets.

InternLM-Math-7B performance: **34.6, 50.9, 78.1** respectively.

The 7B model outperforms the DeepSeek-7B-RL model.


InternLM-Math will be combined with Lean 3 (for theorem proving and maths problem solving). 

[Lean 3](https://lean-lang.org/doc/reference/latest/Elaboration-and-Compilation/) is an interactive theorem prover and functional programming language based on dependent type theory, which means types can depend on terms, enabling expressive formalization of mathematics and programs.
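For flavour, a tiny Lean 3 snippet (illustrative only, not drawn from InternLM-Math's data) showing the kind of statement-plus-proof a formal prover targets:

```lean
-- Two toy theorems in Lean 3: a definitional equality closed by `rfl`,
-- and commutativity of natural-number addition via a core library lemma.
theorem two_add_two : (2 : ℕ) + 2 = 4 := rfl

theorem add_comm_example (a b : ℕ) : a + b = b + a := nat.add_comm a b
```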

#### How does test-set leakage happen?

In time series, future data gets used for training.

Improper cross-validation, and repeated use of the same validation data during hyperparameter tuning.

Information from the test fold influences the training process of the model, causing data leakage.
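A small scikit-learn sketch of the cross-validation variant of this mistake: fitting preprocessing on the full dataset before splitting lets the test folds influence training, whereas putting the preprocessing inside a pipeline keeps each fold clean (the data here is random, so the numeric gap is negligible; the point is the structure):

```python
# Leaky vs. clean cross-validation: where the scaler is fitted is what matters.
import numpy as np
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score, KFold
from sklearn.pipeline import make_pipeline

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 20))
y = rng.integers(0, 2, size=200)

cv = KFold(n_splits=5, shuffle=True, random_state=0)

# Leaky: the scaler sees every row, including rows later used as test folds.
X_leaky = StandardScaler().fit_transform(X)
leaky_scores = cross_val_score(LogisticRegression(max_iter=1000), X_leaky, y, cv=cv)

# Clean: the scaler is refit inside each training fold via a pipeline.
pipe = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
clean_scores = cross_val_score(pipe, X, y, cv=cv)

print("leaky CV accuracy:", leaky_scores.mean())
print("clean CV accuracy:", clean_scores.mean())
```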



## Mixture of Experts
A machine learning architecture where the LLM is divided into multiple sub-networks called experts, and a **gating network** dynamically selects and routes each input to one or a few relevant experts.

Models like Mixtral-8x7B, the YouTube recommendation system, Z-code, and the Switch Transformer are based on MoE.

Different modes or methods of MoE routing are: top-k routing, top-1 routing (only one expert per input token), expert-choice routing (experts decide which inputs they can handle best), and sparse activation/routing (only a subset of experts is activated).

The **capacity factor** is the hyperparameter that influences how many tokens each expert can handle during training and inference.
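A minimal, illustrative PyTorch sketch of top-k routing (not any particular production implementation; a real layer would also enforce the capacity factor and add a load-balancing loss):

```python
# Top-k MoE routing sketch: a gating network scores experts per token,
# and each token is processed only by its top-k experts.
import torch
import torch.nn as nn
import torch.nn.functional as F

class TopKMoE(nn.Module):
    def __init__(self, d_model=64, d_ff=128, num_experts=4, k=2):
        super().__init__()
        self.k = k
        self.gate = nn.Linear(d_model, num_experts)   # the gating network
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Linear(d_model, d_ff), nn.ReLU(), nn.Linear(d_ff, d_model))
            for _ in range(num_experts)
        )

    def forward(self, x):                                   # x: (num_tokens, d_model)
        logits = self.gate(x)                               # (num_tokens, num_experts)
        weights, idx = torch.topk(logits, self.k, dim=-1)   # pick top-k experts per token
        weights = F.softmax(weights, dim=-1)                # normalise the selected scores
        out = torch.zeros_like(x)
        # Sparse activation: each expert only processes the tokens routed to it.
        for slot in range(self.k):
            for e, expert in enumerate(self.experts):
                mask = idx[:, slot] == e
                if mask.any():
                    out[mask] += weights[mask, slot].unsqueeze(-1) * expert(x[mask])
        return out

moe = TopKMoE()
tokens = torch.randn(10, 64)
print(moe(tokens).shape)  # torch.Size([10, 64])
```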


## Xtuner by InternLM

A finetuning toolkit for large language models; it can finetune 7B models within 8 GB of VRAM.

Supported models are **internlm, mixtral, llama and qwen**. 

QLoRA can be used for finetuning InternLM with publicly available datasets.

For example
```xtuner train internlm_7b_qlora_oasst_e3```

**Python 3.10 support**
```conda create -n xtuner_env python=3.10```
```pip install -U xtuner```

*DeepSpeed* module not found; it can be installed with ```pip install deepspeed```.

Encountered another issue:

{{ raise MissingCUDAException("CUDA_HOME does not exist, unable to compile CUDA op(s)")
      op_builder.builder.MissingCUDAException: CUDA_HOME does not exist, unable to compile CUDA op(s)
      [end of output] }}


The above error occurs due to the lack of a CUDA compiler: PyTorch installs the CUDA runtime, but *nvcc --version* checks whether the CUDA compiler is installed or not.

### [CUDA Compiler Installation](https://docs.nvidia.com/cuda/cuda-installation-guide-linux/#ubuntu-installation)

1. ```wget https://developer.download.nvidia.com/compute/cuda/repos/ubuntu2404/x86_64/cuda-keyring_1.1-1_all.deb```

2. ```sudo dpkg -i cuda-keyring_1.1-1_all.deb```

3. ```sudo apt install cuda-toolkit -y```
4. ```sudo apt install nvidia-gds```
5. ```reboot```


Or

1. ```sudo apt install nvidia-cuda-toolkit``` on Ubuntu 24.04.2 LTS.





### What is the NVIDIA GDS package?

GDS stands for GPUDirect Storage, which bypasses the CPU in the data path.
It allows **direct memory access (DMA) transfers between the GPU and storage devices**.



