LLM research and building models
Since you're interested in LLM research and building models, generating large, high-quality datasets automatically is extremely important. Below is a practical pipeline for generating roughly 1 million samples for training models like T5, using Python plus augmentation techniques.
The idea is not just filling random templates, but multi-stage dataset generation.
1️⃣ The Real Dataset Generation Pipeline (Used in AI Research)
Modern LLM dataset pipelines usually follow this structure:
Seed Data
│
Template Generation
│
Paraphrase Augmentation
│
Noise Injection
│
LLM Synthetic Expansion
│
Filtering & Deduplication
│
Final Dataset
Goal:
100 seed samples
→ 10,000 augmented
→ 1,000,000 final samples
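The stages above can be sketched as a chain of plain Python functions (a minimal skeleton — the stage bodies here are placeholders that show the data flow, not real implementations):

```python
# Minimal skeleton of the pipeline: each stage takes a list of
# samples and returns a (usually larger or cleaner) list.
def generate_from_templates(seeds):
    return seeds  # placeholder: expand templates here

def paraphrase_augment(samples):
    return samples  # placeholder: add paraphrased variants here

def inject_noise(samples):
    return samples  # placeholder: add noisy variants here

def llm_expand(samples):
    return samples  # placeholder: LLM synthetic expansion here

def dedupe_and_filter(samples):
    # exact deduplication, preserving the original order
    return list(dict.fromkeys(samples))

def build_dataset(seeds):
    samples = generate_from_templates(seeds)
    samples = paraphrase_augment(samples)
    samples = inject_noise(samples)
    samples = llm_expand(samples)
    return dedupe_and_filter(samples)
```

Each real stage is covered in the steps below; keeping them as separate functions makes it easy to re-run or swap out any one stage.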
2️⃣ Step 1 — Generate Base Template Data
Your current method is good for initial seed data.
Example:
```python
names = ["Ankit", "Rahul", "Priya"]
frameworks = ["PyTorch", "TensorFlow", "Keras", "JAX"]
tasks = ["NLP", "deep learning", "AI research", "cybersecurity"]
adjectives = ["exciting", "powerful", "challenging", "rewarding"]

template = "{name} finds {framework} {adjective} for {task}"
```
With a handful of templates and larger word lists, this approach can easily generate 10k–100k samples.
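Enumerating every filling of the single template above makes the scaling concrete — these word lists yield only 3 × 4 × 4 × 4 = 192 unique sentences, which is exactly why many templates and larger lists are needed:

```python
from itertools import product

names = ["Ankit", "Rahul", "Priya"]
frameworks = ["PyTorch", "TensorFlow", "Keras", "JAX"]
tasks = ["NLP", "deep learning", "AI research", "cybersecurity"]
adjectives = ["exciting", "powerful", "challenging", "rewarding"]
template = "{name} finds {framework} {adjective} for {task}"

# Enumerate every unique filling of the template.
samples = [
    template.format(name=n, framework=f, adjective=a, task=t)
    for n, f, a, t in product(names, frameworks, adjectives, tasks)
]
print(len(samples))  # 3 * 4 * 4 * 4 = 192
```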
3️⃣ Step 2 — Convert to T5 Format
For T5, the dataset should follow:
input_text → target_text
Example:
| input_text | target_text |
|---|---|
| classify sentiment: Ankit finds PyTorch exciting for NLP | positive |
Code modification:

```python
input_text = f"classify sentiment: {text}"
target_text = sentiment
```
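As a small worked example, the conversion step can be wrapped in a helper (a sketch — `labeled_pairs` is an assumed list of `(text, sentiment)` tuples coming out of the previous step):

```python
def to_t5_format(labeled_pairs):
    """Convert (text, sentiment) pairs into T5-style
    (input_text, target_text) rows with a task prefix."""
    rows = []
    for text, sentiment in labeled_pairs:
        rows.append((f"classify sentiment: {text}", sentiment))
    return rows

pairs = [("Ankit finds PyTorch exciting for NLP", "positive")]
print(to_t5_format(pairs))
# [('classify sentiment: Ankit finds PyTorch exciting for NLP', 'positive')]
```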
4️⃣ Step 3 — Paraphrase Expansion
Instead of 1 sentence → create 5–10 variations.
Example:
Original
Ankit finds PyTorch exciting for NLP
Paraphrases
Ankit enjoys using PyTorch for NLP
PyTorch makes NLP exciting for Ankit
Working on NLP with PyTorch excites Ankit
Ankit loves PyTorch when doing NLP
Example code using a Hugging Face pipeline (note: plain `t5-base` was not fine-tuned for paraphrasing, so in practice you would swap in a paraphrase-tuned T5 checkpoint; `num_return_sequences` also requires beam search or sampling to return multiple outputs):

```python
from transformers import pipeline

# Note: plain t5-base is not paraphrase-tuned; use a
# paraphrase-fine-tuned T5 checkpoint for real results.
paraphraser = pipeline("text2text-generation", model="t5-base")

def paraphrase(text, n=3):
    prompt = f"paraphrase: {text}"
    # num_return_sequences requires num_beams >= n (beam search)
    outputs = paraphraser(prompt, num_beams=n, num_return_sequences=n)
    return [o["generated_text"] for o in outputs]
```
Now each sentence becomes 3–5 sentences.
5️⃣ Step 4 — Noise Injection (Very Important)
Real-world data is messy. Inject noise like:
typos
lowercase
missing punctuation
extra words
Example:
ankit finds pytorch exciting for nlp
Ankit finds PyTorch exciting for NLP!!
Ankit finds pytorch exciting
Code example:

```python
import random

def add_noise(text):
    if random.random() < 0.3:
        text = text.lower()
    if random.random() < 0.2:
        text = text.replace("PyTorch", "pytorch")
    if random.random() < 0.1:
        text = text.rstrip(".!?")  # drop trailing punctuation
    return text
```
This makes models robust to messy real-world input.
6️⃣ Step 5 — Self-Instruct Dataset Generation
This is a very powerful technique used in LLM training.
Instead of just templates, use an LLM to generate more examples.
Prompt example:
Generate 50 examples of sentences about machine learning
with sentiment labels (positive, neutral, negative).
Output example:
Sentence: PyTorch makes deep learning easier
Label: positive
You can automate this with APIs or local models.
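The raw LLM reply then needs to be parsed back into training pairs. A minimal parser (a sketch — it assumes the reply alternates `Sentence:` / `Label:` lines exactly as prompted):

```python
def parse_self_instruct(raw):
    """Parse alternating 'Sentence: ...' / 'Label: ...' lines
    from an LLM reply into (sentence, label) pairs."""
    pairs = []
    sentence = None
    for line in raw.splitlines():
        line = line.strip()
        if line.startswith("Sentence:"):
            sentence = line[len("Sentence:"):].strip()
        elif line.startswith("Label:") and sentence is not None:
            pairs.append((sentence, line[len("Label:"):].strip()))
            sentence = None
    return pairs

reply = """Sentence: PyTorch makes deep learning easier
Label: positive
Sentence: The docs are confusing
Label: negative"""
print(parse_self_instruct(reply))
```

Malformed lines (a `Label:` with no preceding `Sentence:`) are simply skipped, which matters when parsing thousands of imperfect LLM replies.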
7️⃣ Step 6 — Dataset Deduplication
Large synthetic datasets often contain duplicates.
Use hashing to remove duplicates.
Example (`dict.fromkeys` keeps the first occurrence and preserves order, unlike `set`):

```python
unique_texts = list(dict.fromkeys(texts))
```
Or advanced:
MinHash
Cosine similarity filtering
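Exact deduplication misses near-duplicates that differ only in case or punctuation. A cheap first pass is to hash a normalized form of each text (a sketch — for fuzzier matching, real pipelines use MinHash/LSH, e.g. via the `datasketch` library):

```python
import hashlib
import re

def normalize(text):
    # lowercase, strip punctuation, collapse whitespace
    text = re.sub(r"[^\w\s]", "", text.lower())
    return re.sub(r"\s+", " ", text).strip()

def dedupe_normalized(texts):
    """Keep the first text seen for each normalized form."""
    seen, unique = set(), []
    for t in texts:
        h = hashlib.md5(normalize(t).encode("utf-8")).hexdigest()
        if h not in seen:
            seen.add(h)
            unique.append(t)
    return unique

texts = [
    "Ankit finds PyTorch exciting for NLP",
    "ankit finds pytorch exciting for nlp!!",
]
print(dedupe_normalized(texts))  # keeps only the first variant
```

This catches exactly the kind of duplicates that the noise-injection step creates on purpose, so run it after — not before — augmentation.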
8️⃣ Step 7 — Dataset Scaling Strategy
Typical dataset growth:
| Stage | Samples |
|---|---|
| Seed templates | 10k |
| Paraphrase | 50k |
| Noise injection | 100k |
| LLM generation | 500k |
| Final filtered | 1M |
9️⃣ Full Example Code (Large Dataset Generator)
```python
import csv
import random

def generate_dataset(size=100000, path="t5_dataset.csv"):
    frameworks = ["PyTorch", "TensorFlow", "Keras", "JAX"]
    tasks = ["NLP", "AI research", "deep learning"]
    adjectives = ["exciting", "powerful", "challenging", "frustrating"]
    templates = [
        "{framework} is {adjective} for {task}",
        "{task} using {framework} is {adjective}",
    ]
    # Derive the label from the adjective so input_text and
    # target_text are actually correlated; a randomly chosen
    # label would give the model nothing to learn.
    sentiment_of = {
        "exciting": "positive",
        "powerful": "positive",
        "challenging": "neutral",
        "frustrating": "negative",
    }

    data = []
    for _ in range(size):
        template = random.choice(templates)
        adjective = random.choice(adjectives)
        text = template.format(
            framework=random.choice(frameworks),
            task=random.choice(tasks),
            adjective=adjective,
        )
        input_text = f"classify sentiment: {text}"
        data.append((input_text, sentiment_of[adjective]))

    with open(path, "w", newline="", encoding="utf-8") as f:
        writer = csv.writer(f)
        writer.writerow(["input_text", "target_text"])
        writer.writerows(data)

generate_dataset()
```
🔟 Ideal Dataset Format for T5
Example dataset:
```
input_text,target_text
classify sentiment: PyTorch is exciting for NLP,positive
classify sentiment: TensorFlow feels challenging for beginners,neutral
classify sentiment: AI research with JAX is frustrating,negative
```
🚀 Advanced Dataset Generation (Research Level)
Large companies combine:
Template Generation
+
Web Scraping
+
LLM Synthetic Data
+
Human Filtering
Well-known instruction-tuning datasets were generated this way.
⚠️ Important Advice
Do not train only on template data.
Mix with:
real text
forums
documentation
StackOverflow
GitHub comments
Otherwise the model becomes a pattern memorizer.
✅ Since you are working on LLM + cybersecurity projects, I can also show you something very powerful:
How to generate a 1M-sample dataset for vulnerability detection using automatic exploit-text generation.
That technique is used in security AI research models.