I Built a Free Synthetic Data Generator — Here's How (React + Tailwind)
Synthetic data for developers, QA teams, and healthcare IT — with FHIR export, custom scenarios, and zero setup
Every developer hits the same wall eventually.
You're building a new feature. You need data. Not 5 rows — you need 10,000 realistic user records, or a batch of patient data for a healthcare demo, or transaction logs to stress-test a dashboard.
So you do what we all do:
```javascript
// The classics
const users = Array.from({ length: 100 }, (_, i) => ({
  name: `User ${i}`,
  email: `user${i}@test.com`,
  phone: '555-0000',
}));
```
Beautiful. Professional. Definitely won't embarrass you in a demo. 🙃
I got tired of this, so I built **DataForge** — a free, browser-based synthetic data generator that creates realistic fake data instantly.
**No signup. No server. No limits. No "upgrade to premium for more than 100 rows."**
🔗 **[Try DataForge Live →](https://data-faker-tool.vercel.app)**
---
## The Problem I Was Solving
I kept running into the same situations:
1. **Building a dashboard?** Need 5,000 transaction records with realistic amounts, merchants, and dates
2. **Testing a user management system?** Need diverse names, emails, phone numbers — not `test1@test.com`
3. **Working on a healthcare app?** Need patient records with real ICD-10 codes, lab values, and vitals — but can't touch real data (HIPAA)
4. **Demoing to a client?** Need data that looks *real*, not obviously fake
The existing solutions all had problems:
| Tool | Issue |
|------|-------|
| **Faker.js** | Requires code setup, no UI |
| **Mockaroo** | Free tier limited to 1,000 rows |
| **Synthea** | Requires Java installation, CLI only |
| **Manual scripts** | Time-consuming, never realistic enough |
So I built something better.
---
## What DataForge Does
### 9 Data Types Across 2 Categories
**📊 General Data:**
| Type | Fields Generated |
|------|-----------------|
| **Users** | First/last name, email, phone, DOB, gender, company, job title, username, active status |
| **Addresses** | Street, apt, city, state, ZIP, country, lat/lng coordinates, address type |
| **Transactions** | Date, description, merchant, amount, currency, credit/debit, status, category, account number |
**🏥 Healthcare Data (HIPAA-Safe):**
| Type | Fields Generated |
|------|-----------------|
| **Patients** | MRN, demographics, blood type, allergies, conditions, insurance, emergency contact, status |
| **Medical Records** | Visit type, ICD-10 diagnosis, CPT procedure, vitals (BP, HR, temp, SpO2), clinical notes, provider |
| **Prescriptions** | Rx number, medication, dosage, form, frequency, quantity, refills, DEA number, NDC code, pharmacy |
| **Lab Results** | 26 real lab tests, result values, reference ranges, High/Low/Normal flags, specimen type, LOINC codes |
| **Insurance Claims** | Claim number, charged/allowed/paid amounts, patient responsibility, plan type, approval status |
| **Providers** | NPI number, credentials (MD/DO/PhD), specialty, facility, accepting patients, ratings |
All healthcare data uses **real medical codes and reference ranges** — not random strings.
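As one example of what "not random strings" can mean for identifiers: an NPI is a 10-digit number whose final digit is a Luhn check digit, computed over the first nine digits with the prefix `80840` prepended. The sketch below shows how a structurally valid NPI could be produced. It is an illustration only, not necessarily how DataForge generates them, and `npiCheckDigit` and `makeNpi` are hypothetical helper names.

```typescript
// Luhn check digit over "80840" + the 9-digit base, as defined for NPIs.
function npiCheckDigit(base9: string): number {
  if (!/^\d{9}$/.test(base9)) throw new Error("expected exactly 9 digits");
  const digits = ("80840" + base9).split("").map(Number);
  let sum = 0;
  let doubleIt = true; // the rightmost payload digit is doubled first
  for (let i = digits.length - 1; i >= 0; i--) {
    let d = digits[i];
    if (doubleIt) {
      d *= 2;
      if (d > 9) d -= 9; // same as adding the two digits of the product
    }
    sum += d;
    doubleIt = !doubleIt;
  }
  return (10 - (sum % 10)) % 10;
}

// Hypothetical helper: appends the check digit to a 9-digit base.
function makeNpi(base9: string): string {
  return base9 + npiCheckDigit(base9);
}

console.log(makeNpi("123456789")); // 1234567893
```

A generator built this way only has to randomize the nine base digits; the tenth follows from them, so every generated value passes format validation in downstream systems.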
---
## Features That Make It Powerful
### ⚡ Generate Up to 50,000 Records Instantly
Data generation runs entirely client-side in your browser. No API calls. No server round-trips. No waiting.
```
10 → 25 → 50 → 100 → 250 → 500 → 1K → 2.5K → 5K → 10K → 25K → 50K
```
I generated 50,000 patient records in **1.8 seconds** on my laptop. Try doing that with a REST API.
### 🎬 Custom Scenarios
This is my favorite feature. Instead of purely random data, you can define **rules** that shape your dataset:
**19 Preset Scenarios:**
| Category | Scenarios |
|----------|-----------|
| **General** | Clean Data, High-Value Transactions ($5K+), Fraud Patterns, Micro-Transactions, New User Signups, Churned Users |
| **Healthcare** | Elderly Patient Cohort (65+), Pediatric Cohort (0-17), Critical Lab Values, Denied Claims, High-Cost Claims ($10K+), ER Visits, Telehealth Visits, Controlled Substances |
| **Edge Cases** | Dirty/Messy Data (15% nulls), Sparse Data (40% nulls), Unicode Names, Boundary Values, Heavy Duplicates |
**Custom Builder** — Set rules per field:
```
Age: Min 65, Max 95 → Elderly cohort
Status: Fixed "Denied" → All denied claims
Null Rate: 15% → Realistic dirty data
Duplicates: 30% → Test deduplication logic
Errors: 5% → Test error handling
```
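Rules like these could be applied as a pass over the generated rows. Below is a minimal sketch, assuming a hypothetical `ScenarioRules` shape and an `applyScenario` helper (neither is DataForge's actual API). Range rules such as the age bounds would more naturally live in the field generators and are left out here. The random source is passed in so the output stays reproducible with a seeded generator.

```typescript
// Hypothetical rule shape: names are illustrative, not DataForge's real API.
type Row = Record<string, string | number | boolean | null>;

interface ScenarioRules {
  fixed?: Record<string, string | number | boolean>; // e.g. { status: "Denied" }
  nullRate?: number;      // probability (0..1) that a non-fixed field becomes null
  duplicateRate?: number; // probability (0..1) that a row repeats an earlier row
}

// `rand` must return a number in [0, 1); pass a seeded generator for reproducibility.
function applyScenario(rows: Row[], rules: ScenarioRules, rand: () => number): Row[] {
  const out: Row[] = [];
  for (const row of rows) {
    // Duplicate injection: emit a copy of an earlier row instead of the new one.
    if (out.length > 0 && rand() < (rules.duplicateRate ?? 0)) {
      const source = out[Math.floor(rand() * out.length)];
      out.push({ ...source });
      continue;
    }
    const next: Row = { ...row, ...(rules.fixed ?? {}) };
    // Null injection: blank out fields that are not pinned by a fixed rule.
    for (const key of Object.keys(next)) {
      const pinned = rules.fixed !== undefined && key in rules.fixed;
      if (!pinned && rand() < (rules.nullRate ?? 0)) next[key] = null;
    }
    out.push(next);
  }
  return out;
}
```

Keeping the rules as plain data means a preset scenario is just a saved `ScenarioRules` object, and the same pass works for every data type.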
### 📤 Four Export Formats
```json
// JSON
[{ "firstName": "Sarah", "lastName": "Chen", "email": "schen@email.com" }]
```
```csv
# CSV
firstName,lastName,email
Sarah,Chen,schen@email.com
```
```sql
-- SQL
INSERT INTO users (firstName, lastName, email)
VALUES ('Sarah', 'Chen', 'schen@email.com');
```
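CSV and SQL output needs escaping as soon as values contain commas, quotes, or apostrophes (think `O'Brien`). Here is a sketch of how such serializers might look, assuming flat records whose values are strings, numbers, booleans, or null. `toCsv` and `toSqlInserts` are hypothetical helper names, not DataForge's export code.

```typescript
type Cell = string | number | boolean | null;
type FlatRecord = Record<string, Cell>;

// Quote a CSV field when it contains a comma, quote, or line break (RFC 4180 style).
function csvField(value: Cell): string {
  if (value === null) return "";
  const text = String(value);
  return /[",\r\n]/.test(text) ? `"${text.replace(/"/g, '""')}"` : text;
}

// Column order is taken from the first record.
function toCsv(records: FlatRecord[]): string {
  if (records.length === 0) return "";
  const columns = Object.keys(records[0]);
  const lines = [columns.map((column) => csvField(column)).join(",")];
  for (const record of records) {
    lines.push(columns.map((column) => csvField(record[column] ?? null)).join(","));
  }
  return lines.join("\n");
}

// Render a value as a SQL literal; single quotes are doubled per standard SQL.
function sqlLiteral(value: Cell): string {
  if (value === null) return "NULL";
  if (typeof value === "number") return Number.isFinite(value) ? String(value) : "NULL";
  if (typeof value === "boolean") return value ? "TRUE" : "FALSE";
  return `'${value.replace(/'/g, "''")}'`;
}

// Table and column names are assumed to be trusted identifiers here.
function toSqlInserts(table: string, records: FlatRecord[]): string {
  return records
    .map((record) => {
      const columns = Object.keys(record);
      const values = columns.map((column) => sqlLiteral(record[column]));
      return `INSERT INTO ${table} (${columns.join(", ")}) VALUES (${values.join(", ")});`;
    })
    .join("\n");
}
```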
```json
// HL7 FHIR Bundle
{
  "resourceType": "Bundle",
  "type": "collection",
  "entry": [{
    "resource": {
      "resourceType": "Patient",
      "name": [{ "family": "Chen", "given": ["Sarah"] }]
    }
  }]
}
```
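A bundle like this could be assembled from flat patient rows with a small mapping function. The sketch below assumes a hypothetical `PatientRow` shape and `toFhirBundle` name; only the element names (`resourceType`, `identifier`, `name.family`, `name.given`, `gender`, `birthDate`) come from the FHIR Patient and Bundle resources.

```typescript
// Hypothetical flat record; field names are illustrative.
interface PatientRow {
  mrn: string;
  firstName: string;
  lastName: string;
  gender: "male" | "female" | "other" | "unknown"; // FHIR administrative-gender codes
  dob: string; // ISO date, YYYY-MM-DD
}

function toFhirBundle(rows: PatientRow[]) {
  return {
    resourceType: "Bundle",
    type: "collection",
    entry: rows.map((row) => ({
      resource: {
        resourceType: "Patient",
        // The identifier system below is a placeholder URI, not a registered namespace.
        identifier: [{ system: "urn:example:mrn", value: row.mrn }],
        name: [{ family: row.lastName, given: [row.firstName] }],
        gender: row.gender,
        birthDate: row.dob,
      },
    })),
  };
}

const bundle = toFhirBundle([
  { mrn: "MRN-0001", firstName: "Sarah", lastName: "Chen", gender: "female", dob: "1985-04-12" },
]);
console.log(JSON.stringify(bundle, null, 2));
```

Structural validity against a specific FHIR version or profile still has to be checked by the receiving server or a validator; the mapping only guarantees the shape.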
### 🌱 Reproducible with Seeds
Set a seed value → get the **exact same dataset** every time.
```
Seed 42 → Always generates the same 1,000 users
Seed 42 → Same data on your machine, my machine, CI/CD
```
Perfect for consistent test suites and shared fixtures.
---
## How I Built It
### Tech Stack
- **React 18** + TypeScript
- **Tailwind CSS** — Dark HD interface with glassmorphism
- **Vite** — Sub-second builds
- **Supabase** — Authentication and user tracking
- **Custom seeded PRNG** — No external faker library
### The Secret: A Custom Random Number Generator
The core of DataForge is a **seeded pseudo-random number generator**. Unlike `Math.random()` (which can't be seeded), this gives deterministic, reproducible output:
```typescript
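class SeededRandom {
  private static readonly MODULUS = 2147483647; // 2^31 - 1, a Mersenne prime
  private seed: number;

  constructor(seed: number) {
    // Normalize into [1, MODULUS - 1]: a state of 0 would stay at 0 forever.
    const m = SeededRandom.MODULUS;
    this.seed = ((Math.trunc(seed) % m) + m) % m || 1;
  }

  // Returns a float strictly between 0 and 1.
  next(): number {
    // Lehmer/Park-Miller PRNG
    this.seed = (this.seed * 16807) % SeededRandom.MODULUS;
    return this.seed / SeededRandom.MODULUS;
  }

  // Returns an integer in [min, max], both ends inclusive.
  nextInt(min: number, max: number): number {
    return Math.floor(this.next() * (max - min + 1)) + min;
  }

  pick<T>(array: T[]): T {
    return array[this.nextInt(0, array.length - 1)];
  }

  // Picks a value with probability proportional to its weight.
  weightedPick<T>(items: { value: T; weight: number }[]): T {
    const totalWeight = items.reduce((sum, item) => sum + item.weight, 0);
    let random = this.next() * totalWeight;
    for (const item of items) {
      random -= item.weight;
      if (random <= 0) return item.value;
    }
    // Fallback in case floating-point rounding leaves a tiny remainder.
    return items[items.length - 1].value;
  }
}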
```
Same seed → same sequence → same data. Every time.
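A quick way to see this property is to generate the same list twice and compare. The sketch below re-declares a compact version of the same recurrence so it runs on its own; `makeRng`, `sampleUsers`, and the name lists are illustrative.

```typescript
// Compact Park-Miller generator, re-declared so this snippet is standalone.
function makeRng(seed: number): () => number {
  const m = 2147483647;
  let state = ((Math.trunc(seed) % m) + m) % m || 1; // keep state in [1, m - 1]
  return () => {
    state = (state * 16807) % m;
    return state / m;
  };
}

const firstNames = ["Sarah", "Miguel", "Aisha", "Tomasz"];
const lastNames = ["Chen", "Okafor", "Silva", "Novak"];

// Builds `count` full names from a fresh generator seeded with `seed`.
function sampleUsers(seed: number, count: number): string[] {
  const rand = makeRng(seed);
  const pick = (items: string[]) => items[Math.floor(rand() * items.length)];
  return Array.from({ length: count }, () => `${pick(firstNames)} ${pick(lastNames)}`);
}

const runA = sampleUsers(42, 5);
const runB = sampleUsers(42, 5);
console.log(runA.join("|") === runB.join("|")); // true: same seed, same sequence
```

The one thing that breaks this guarantee is consuming random numbers in a different order, so each dataset should get its own generator instance rather than sharing a global one.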
### Making Healthcare Data Realistic
Random strings don't cut it for healthcare testing. I embedded real medical data:
**Blood types with real-world frequency distribution:**
```typescript
const bloodTypes = [
  { value: 'O+', weight: 37 },  // 37% of population
  { value: 'A+', weight: 36 },  // 36%
  { value: 'B+', weight: 8 },   // 8%
  { value: 'AB+', weight: 3 },  // 3%
  { value: 'O-', weight: 7 },   // 7%
  { value: 'A-', weight: 6 },   // 6%
  { value: 'B-', weight: 2 },   // 2%
  { value: 'AB-', weight: 1 },  // 1%
];
```
**26 lab tests with actual clinical reference ranges:**
```typescript
const labTests = [
  { name: 'Glucose', unit: 'mg/dL', low: 70, high: 100 },
  { name: 'Hemoglobin A1c', unit: '%', low: 4.0, high: 5.6 },
  { name: 'TSH', unit: 'mIU/L', low: 0.4, high: 4.0 },
  { name: 'Creatinine', unit: 'mg/dL', low: 0.6, high: 1.2 },
  { name: 'Total Cholesterol', unit: 'mg/dL', low: 0, high: 200 },
  // ... 21 more real tests
];
```
25% of lab results are intentionally generated **outside** the reference range — because real data has abnormals. Each gets flagged as ↑ High or ↓ Low, just like a real lab report.
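One way a generator like this could work: decide first whether the result should be abnormal, draw a value inside or just outside the reference range, then derive the flag from the final value. The sketch below assumes an even split between high and low and an arbitrary overshoot of up to 50% of the range width; `generateLabResult` and its parameters are illustrative, not DataForge's code.

```typescript
interface LabTest { name: string; unit: string; low: number; high: number }
interface LabResult { test: string; value: number; unit: string; flag: "High" | "Low" | "Normal" }

// `rand` returns a number in [0, 1); pass a seeded generator for reproducible output.
function generateLabResult(test: LabTest, rand: () => number, abnormalRate = 0.25): LabResult {
  const span = test.high - test.low;
  let value: number;
  if (rand() < abnormalRate) {
    // Abnormal: overshoot the range by up to 50% of its width (illustrative choice).
    const overshoot = span * 0.5 * rand();
    value = rand() < 0.5 ? test.high + overshoot : test.low - overshoot;
    value = Math.max(0, value); // a measured quantity cannot be negative
  } else {
    value = test.low + span * rand();
  }
  value = Math.round(value * 10) / 10;
  // Derive the flag from the rounded value so value and flag can never disagree.
  const flag = value > test.high ? "High" : value < test.low ? "Low" : "Normal";
  return { test: test.name, value, unit: test.unit, flag };
}
```

Because the flag is computed from the value, tests whose lower bound is 0 (such as Total Cholesterol) can never be flagged Low, so the observed abnormal rate ends up slightly under the configured one.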
**Real medications with correct dosages:**
```typescript
const medications = [
  { name: 'Lisinopril', dosage: '10mg', form: 'Tablet', frequency: 'Once daily' },
  { name: 'Metformin', dosage: '500mg', form: 'Tablet', frequency: 'Twice daily' },
  { name: 'Atorvastatin', dosage: '20mg', form: 'Tablet', frequency: 'Once daily at bedtime' },
  { name: 'Omeprazole', dosage: '20mg', form: 'Capsule', frequency: 'Once daily before breakfast' },
  // ... 21 more real medications
];
```
### The Dark HD Interface
I wanted DataForge to feel premium. The UI uses:
- **Deep dark background** (`#08090d`) with ambient glow effects
- **Glassmorphism header** using `backdrop-filter: blur(20px)`
- **Inter + JetBrains Mono** fonts for crisp text and data
- **Color-coded badges** — blood types in red, lab flags in amber/emerald, claim status in semantic colors
- **Custom scrollbars** matching the dark theme
- **Subtle animations** — float-in effects, glow pulses, hover states
It's the kind of interface you actually *want* to use, not just tolerate.
---
## The Healthcare Angle: Why This Matters
If you work in Health IT, you know the pain:
❌ **Can't use real patient data** — HIPAA violation, massive fines
❌ **Epic/Cerner sandboxes** have limited pre-loaded test patients
❌ **Synthea** is powerful but requires Java + CLI + configuration files
❌ **Online generators** don't understand ICD-10 codes or lab reference ranges
❌ **Manual creation** — entering test patients one by one is soul-crushing
DataForge solves this:
✅ **100% synthetic** — Zero PHI, zero HIPAA risk
✅ **Real medical codes** — ICD-10 diagnoses, CPT procedures, not random strings
✅ **Clinical accuracy** — Lab values with actual reference ranges and abnormal flags
✅ **FHIR-native** — Export valid HL7 FHIR Bundles, ready for sandbox import
✅ **Custom clinical scenarios** — Elderly cohorts, pediatric data, critical labs, denied claims
✅ **Instant** — 50K records in seconds, no Java, no CLI, no setup
✅ **Shareable** — Send the URL to your QA team, they can use it immediately
### Real Use Cases
**EHR Integration Testing:**
> Generate 10,000 patients → Export as FHIR → Import into Epic/Cerner sandbox → Test your interface
**QA Load Testing:**
> Generate 50,000 lab results with critical values → Verify your alerting system catches them all
**Training Environments:**
> Generate diverse patient populations → Load into training EHR → Train new clinicians without PHI risk
**Client Demos:**
> Generate realistic insurance claims → Show your claims processing dashboard with convincing data
---
## Architecture Overview
```
┌───────────────────────────────────────────┐
│             Browser (Client)              │
│                                           │
│  ┌──────────┐  ┌────────────┐  ┌────────┐ │
│  │ React    │  │ Seeded     │  │ Export │ │
│  │ UI +     │→ │ PRNG       │→ │ Engine │ │
│  │ Tailwind │  │ Engine     │  │ JSON/  │ │
│  │          │  │            │  │ CSV/   │ │
│  │ Config   │  │ Data       │  │ SQL/   │ │
│  │ Panel    │  │ Generators │  │ FHIR   │ │
│  └──────────┘  └────────────┘  └────────┘ │
│                                           │
│  ┌─────────────────────────────────────┐  │
│  │           Scenario Engine           │  │
│  │  (Field rules, null injection,      │  │
│  │   error injection, duplicates)      │  │
│  └─────────────────────────────────────┘  │
│                                           │
│  ┌─────────────┐                          │
│  │  Supabase   │ ← Auth + Usage Tracking  │
│  │  Client     │                          │
│  └─────────────┘                          │
└───────────────────────────────────────────┘
```
No server-side rendering. No API calls for data generation. Everything happens in the browser tab.
---
## Performance
| Records | Generation Time | File Size (JSON) |
|---------|----------------|-------------------|
| 100 | ~15ms | ~50 KB |
| 1,000 | ~80ms | ~500 KB |
| 10,000 | ~400ms | ~5 MB |
| 50,000 | ~1.8s | ~25 MB |
All measured on a standard laptop (M1 MacBook Air). Your mileage may vary, but it's fast.
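Figures like these can be collected by wrapping the generation call in `performance.now()` and measuring the encoded JSON. The sketch below uses a stand-in `generateRows` function, since the real generators produce far richer records, so its timings will not match the table above.

```typescript
// Stand-in generator: builds simple rows so the harness has something to time.
function generateRows(count: number): { id: number; email: string }[] {
  return Array.from({ length: count }, (_, i) => ({ id: i, email: `user${i}@example.com` }));
}

function measure(count: number): { count: number; ms: number; bytes: number } {
  const start = performance.now();
  const rows = generateRows(count);
  const ms = performance.now() - start;
  // Byte length of the UTF-8 encoded JSON, i.e. the size of the exported file.
  const bytes = new TextEncoder().encode(JSON.stringify(rows)).length;
  return { count, ms, bytes };
}

for (const count of [100, 1000, 10000]) {
  const result = measure(count);
  console.log(`${result.count} rows: ${result.ms.toFixed(1)} ms, ${(result.bytes / 1024).toFixed(0)} KB`);
}
```

Serialization is deliberately kept outside the timed region so that generation time and export cost are reported separately.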
---
## What I Learned Building This
### 1. Seeded randomness is underrated
Most developers reach for `Math.random()` and call it a day. But seedable PRNGs unlock:
- **Reproducible test suites** — Same seed = same data = deterministic tests
- **Shareable datasets** — "Use seed 42 to reproduce this bug"
- **Debugging** — Reproduce exact conditions that caused a failure
### 2. Healthcare data has hidden complexity
Building a realistic patient record is surprisingly complex:
- Blood types follow a specific population distribution
- Lab values need to be within realistic ranges (but sometimes abnormal)
- ICD-10 codes need to match plausible diagnoses
- Medications have specific dosages, forms, and frequencies
- Vitals are interconnected (low SpO2 often pairs with high HR)
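The last point, correlated vitals, can be sketched by drawing one value first and letting it shift the other. The ranges and the 3 bpm coefficient below are made up for test-data purposes and carry no clinical meaning; `generateVitals` is a hypothetical name.

```typescript
interface Vitals { spo2: number; heartRate: number }

// Illustrative coupling only: coefficients are invented for test data, not clinical guidance.
// `rand` returns a number in [0, 1).
function generateVitals(rand: () => number): Vitals {
  // SpO2 as a whole percentage between 88 and 100.
  const spo2 = 88 + Math.round(rand() * 12);
  // Baseline heart rate of 60 to 90 bpm, plus 3 bpm per point of SpO2 below 95.
  const deficit = Math.max(0, 95 - spo2);
  const heartRate = 60 + Math.round(rand() * 30) + deficit * 3;
  return { spo2, heartRate };
}
```

Drawing the fields independently would produce records that look fine column by column but implausible as a whole, which is exactly the kind of fake data a clinician spots at a glance.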
### 3. Dark themes are harder than they look
Getting a dark UI to look premium (not just "dark gray everything") requires:
- Multiple surface levels (not just one dark color)
- Careful contrast ratios for accessibility
- Ambient light effects for depth
- Thoughtful use of color for hierarchy
### 4. Client-side generation scales well
I was worried about generating 50K records in the browser. Turns out, JavaScript is fast enough — the bottleneck is usually DOM rendering, not data generation. That's why the table caps at 500 visible rows while the full dataset is available for download.
---
## Try It Yourself
🔗 **Live App:** [https://data-faker-tool.vercel.app](https://data-faker-tool.vercel.app)
🏆 **Product Hunt:** [DataForge on Product Hunt](https://www.producthunt.com/products/dataforge-synthetic-data-generator)
It's free. It's fast. It runs in your browser. Generated data never leaves your machine.
---
## What's Coming Next
I'm actively building new features:
- [ ] **Custom schema builder** — Define your own data types with custom fields
- [ ] **API mode** — Use DataForge as a mock REST API endpoint
- [ ] **Table relationships** — Foreign keys between generated tables
- [ ] **More healthcare standards** — HL7v2 messages, C-CDA documents
- [ ] **Localization** — Non-US names, addresses, phone formats, currencies
- [ ] **Saved configurations** — Save and share your generation setups
- [ ] **CLI version** — `npx dataforge generate --type patients --count 10000`
If you have ideas, [open an issue on GitHub](YOUR_GITHUB_URL/issues) or drop a comment below.
---
## Final Thoughts
I built DataForge because I was tired of the same problem every developer faces: needing realistic test data and settling for garbage.
If this tool saves you even 30 minutes, it was worth building.
Drop a ⭐ on the [GitHub repo](YOUR_GITHUB_URL) if you find it useful. And if you're using it for something cool, I'd love to hear about it in the comments.
Happy testing! 🚀
---
*Built with React, TypeScript, Tailwind CSS, Vite, and Supabase. Deployed on Vercel.*