Your function is already good: it generates a **synthetic NLP dataset** for training. I'll improve it slightly for better tutorial-quality data and then answer your main question about CSV vs JSON for training (especially with .safetensors).
Improved Version (Better Training Data)
This version:
- Adds more variety
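- Samples each label with equal probability, so the classes come out roughly balanced
- Avoids repetition by skipping sentences that were already generated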
- Makes it more useful for real model training
import csv
import random

def generate_and_save_custom_csv(target_size=1000, file_path='custom_dataset.csv'):
    print("Generating and saving custom CSV dataset...")
    names = ["Ankit", "Rahul", "Priya", "Neha"]
    frameworks = ["PyTorch", "TensorFlow", "Keras", "JAX", "Scikit-learn", "Hugging Face"]
    tasks = ["NLP", "deep learning", "machine learning", "AI development", "cybersecurity"]
    # Better sentiment mapping: each label has its own adjectives
    adjective_map = {
        "positive": ["exciting", "powerful", "innovative", "rewarding"],
        "neutral": ["okay", "average", "manageable"],
        "negative": ["challenging", "difficult", "frustrating"]
    }
    templates = [
        "{name} finds {framework} {adjective} for {task}.",
        "{name}'s work in {task} is {adjective} with {framework}.",
        "{name} thinks {task} using {framework} is {adjective}.",
        "{task} with {framework} feels {adjective} to {name}."
    ]
    # Only a finite number of distinct sentences exists; without this check,
    # a larger target_size would make the loop below run forever.
    num_adjectives = sum(len(adjectives) for adjectives in adjective_map.values())
    max_unique = len(templates) * len(names) * len(frameworks) * len(tasks) * num_adjectives
    if target_size > max_unique:
        raise ValueError(f"target_size must be at most {max_unique}")
    texts = set()  # sentences generated so far, used to skip duplicates
    data = []
    while len(data) < target_size:
        sentiment = random.choice(["positive", "neutral", "negative"])
        adjective = random.choice(adjective_map[sentiment])
        text = random.choice(templates).format(
            name=random.choice(names),
            framework=random.choice(frameworks),
            task=random.choice(tasks),
            adjective=adjective
        )
        if text not in texts:
            texts.add(text)
            data.append((text, sentiment))
    # Save CSV
    with open(file_path, 'w', newline='', encoding='utf-8') as f:
        writer = csv.writer(f)
        writer.writerow(['text', 'label'])
        writer.writerows(data)
    print(f"Dataset saved to {file_path} with {len(data)} samples.")
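Before training on the file, it can help to check that the labels really are roughly balanced and that no sentence appears twice. The sketch below is one way to do that with the standard library only. The helper name `summarize_dataset` is hypothetical (it is not part of the function above), and the three-row in-memory sample only stands in for the real `custom_dataset.csv`.

```python
import csv
import io
from collections import Counter


def summarize_dataset(csv_file):
    """Return (label_counts, duplicate_count) for a CSV with 'text' and 'label' columns.

    Hypothetical helper for illustration; csv_file is any open text file object.
    """
    reader = csv.DictReader(csv_file)
    label_counts = Counter()
    seen_texts = set()
    duplicate_count = 0
    for row in reader:
        label_counts[row["label"]] += 1
        if row["text"] in seen_texts:
            duplicate_count += 1  # same sentence seen earlier in the file
        seen_texts.add(row["text"])
    return label_counts, duplicate_count


# Tiny in-memory sample standing in for custom_dataset.csv (assumed layout: text,label)
sample = io.StringIO(
    "text,label\n"
    "Ankit finds PyTorch exciting for NLP.,positive\n"
    "Priya thinks NLP using Keras is average.,neutral\n"
    "Ankit finds PyTorch exciting for NLP.,positive\n"
)
counts, duplicates = summarize_dataset(sample)
print("Label counts:", dict(counts))
print("Duplicate sentences:", duplicates)
```

To run the same check on the generated file, open it with `open('custom_dataset.csv', newline='', encoding='utf-8')` and pass the file object to `summarize_dataset`. For a file produced by the generator above, the duplicate count should be 0.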
CSV vs JSON for Training (Important)
First understand:
.safetensors is NOT a dataset format
It is used to store:
- model weights (like )