Your function is already good: it generates a **synthetic NLP dataset** for training. I'll improve it slightly for better tutorial-quality data and then answer your main question about CSV vs JSON for training (especially with `.safetensors`).
## Improved Version (Better Training Data)

This version:
- Adds more variety
- Keeps the sentiment labels balanced
- Avoids repetition via deduplication
- Is more useful for real model training
```python
import random
import csv


def generate_and_save_custom_csv(target_size=1000, file_path='custom_dataset.csv'):
    print("Generating and saving custom CSV dataset...")

    names = ["Ankit", "Rahul", "Priya", "Neha"]
    frameworks = ["PyTorch", "TensorFlow", "Keras", "JAX", "Scikit-learn", "Hugging Face"]
    tasks = ["NLP", "deep learning", "machine learning", "AI development", "cybersecurity"]

    # Map each sentiment label to adjectives that express it
    adjective_map = {
        "positive": ["exciting", "powerful", "innovative", "rewarding"],
        "neutral": ["okay", "average", "manageable"],
        "negative": ["challenging", "difficult", "frustrating"],
    }

    templates = [
        "{name} finds {framework} {adjective} for {task}.",
        "{name}'s work in {task} is {adjective} with {framework}.",
        "{name} thinks {task} using {framework} is {adjective}.",
        "{task} with {framework} feels {adjective} to {name}.",
    ]

    texts = set()
    data = []
    # Note: target_size must stay below the number of unique combinations,
    # otherwise this loop can never finish
    while len(data) < target_size:
        sentiment = random.choice(["positive", "neutral", "negative"])
        adjective = random.choice(adjective_map[sentiment])
        text = random.choice(templates).format(
            name=random.choice(names),
            framework=random.choice(frameworks),
            task=random.choice(tasks),
            adjective=adjective,
        )
        # Deduplicate so the dataset contains only unique sentences
        if text not in texts:
            texts.add(text)
            data.append((text, sentiment))

    # Save as CSV with a header row
    with open(file_path, 'w', newline='', encoding='utf-8') as f:
        writer = csv.writer(f)
        writer.writerow(['text', 'label'])
        writer.writerows(data)

    print(f"Dataset saved to {file_path} with {len(data)} samples.")
```
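To sanity-check the file the function writes, you can read it back with the standard library and count labels. A minimal sketch, using a tiny inline sample in place of the real `custom_dataset.csv`:

```python
import csv
import io
from collections import Counter

# Tiny inline stand-in for custom_dataset.csv (same text,label layout)
sample = 'text,label\n"AI is powerful",positive\n"ML is hard",negative\n"DL is okay",neutral\n'

rows = list(csv.DictReader(io.StringIO(sample)))
label_counts = Counter(row['label'] for row in rows)
print(len(rows), dict(label_counts))
```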
## CSV vs JSON for Training (Important)

**First, understand:** `.safetensors` is NOT a dataset format. It is used to store:

- model weights (like `.bin`)
- tensors, safely

The dataset format is separate from the model format.
### 1. CSV Format

```csv
text,label
"AI is powerful",positive
"ML is hard",negative
```
**Pros:**
- Simple and lightweight
- Fast to load
- Works directly with:
  - pandas
  - PyTorch `DataLoader`
  - Hugging Face `datasets`

**Cons:**
- Not good for complex data
- Only flat structure (no nesting)
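The flat-structure limitation is easy to see in code: anything nested gets stringified on the way into a CSV. A small sketch using only the standard library:

```python
import csv
import io

buf = io.StringIO()
writer = csv.writer(buf)
writer.writerow(['text', 'tags'])
writer.writerow(['AI is powerful', ['nlp', 'ml']])  # the list is silently str()-ified

buf.seek(0)
row = next(csv.DictReader(buf))
# What comes back is the string "['nlp', 'ml']", not a Python list
print(type(row['tags']).__name__)
```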
### 2. JSON Format

```json
[
  {"text": "AI is powerful", "label": "positive"},
  {"text": "ML is hard", "label": "negative"}
]
```
**Pros:**
- Supports complex structure
- Better for:
  - conversations (chat models)
  - multi-field data (instruction, input, output)
- Preferred in modern LLM training

**Cons:**
- Slightly heavier
- Slower than CSV
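By contrast, JSON round-trips structured records intact. A minimal sketch with the standard library:

```python
import json

records = [
    {"text": "AI is powerful", "label": "positive"},
    {"text": "ML is hard", "label": "negative"},
]

blob = json.dumps(records, indent=2)   # serialize the whole list
loaded = json.loads(blob)              # nesting and types come back unchanged
print(loaded[0]["label"])
```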
## Best Format (Important for YOU)

Since you are:
- building an LLM project
- doing fine-tuning
- working with transformers / `.safetensors`

**Best choice = JSON (or JSONL)**
## Recommended Formats by Use Case

| Use Case | Best Format |
|---|---|
| Simple classification | CSV |
| Transformer fine-tuning | JSON |
| ChatGPT-style model | JSONL |
| Instruction tuning | JSONL |
## JSONL (Best for LLM Training)

Each line is one JSON object:

```jsonl
{"text": "AI is powerful", "label": "positive"}
{"text": "ML is hard", "label": "negative"}
```

Why JSONL is best:
- Streamable (used by Hugging Face, OpenAI, etc.)
- Memory efficient
- Perfect for large datasets
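The streaming property comes from the one-object-per-line layout: you can write and read records one at a time. A minimal sketch using an in-memory buffer in place of a file:

```python
import io
import json

records = [
    {"text": "AI is powerful", "label": "positive"},
    {"text": "ML is hard", "label": "negative"},
]

# Write: one JSON object per line
buf = io.StringIO()
for rec in records:
    buf.write(json.dumps(rec) + "\n")

# Read: stream line by line, so the whole file never sits in memory at once
buf.seek(0)
streamed = [json.loads(line) for line in buf]
print(streamed[1]["label"])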
## Final Answer (Simple)

For your case:
- If you're learning: use CSV (easy)
- If you're building a real LLM: use JSONL (best)
- `.safetensors` is only for model weights, not datasets
## Pro Tip (Advanced, for Your Project)

For your cybersecurity + LLM project, use this format:

```jsonl
{"instruction": "Analyze sentiment", "input": "AI is powerful", "output": "positive"}
```

This helps you build:
- instruction-tuned models
- chat-based AI
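Converting the `text,label` CSV above into that instruction format is a few lines of standard-library Python. A sketch, assuming a fixed "Analyze sentiment" prompt for every row (that fixed prompt is my assumption, not something your data requires):

```python
import csv
import io
import json

# Inline CSV in the text,label layout produced earlier
csv_text = 'text,label\n"AI is powerful",positive\n"ML is hard",negative\n'

jsonl_lines = []
for row in csv.DictReader(io.StringIO(csv_text)):
    record = {
        "instruction": "Analyze sentiment",  # fixed task prompt (assumption)
        "input": row["text"],
        "output": row["label"],
    }
    jsonl_lines.append(json.dumps(record))

print(jsonl_lines[0])
```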
If you want a next step, I can help you:

- convert CSV to JSONL
- create a dataset for code generation (your goal)
- build the full pipeline: dataset, tokenizer, fine-tuning, `.safetensors` model

Just tell me!