I Built a Free Synthetic Data Generator — Here's How (React + Tailwind)

Synthetic data for developers, QA teams, and healthcare IT — with FHIR export, custom scenarios, and zero setup

Every developer hits the same wall eventually.

You're building a new feature. You need data. Not 5 rows — you need 10,000 realistic user records, or a batch of patient data for a healthcare demo, or transaction logs to stress-test a dashboard.

So you do what we all do:

```javascript
// The classics
const users = Array.from({ length: 100 }, (_, i) => ({
  name: `User ${i}`,
  email: `user${i}@test.com`,
  phone: '555-0000',
}));
```

Beautiful. Professional. Definitely won't embarrass you in a demo. 🙃

I got tired of this, so I built **DataForge** — a free, browser-based synthetic data generator that creates realistic fake data instantly.

**No signup. No server. No limits. No "upgrade to premium for more than 100 rows."**

🔗 **[Try DataForge Live →](https://data-faker-tool.vercel.app)**


---

## The Problem I Was Solving

I kept running into the same situations:

1. **Building a dashboard?** Need 5,000 transaction records with realistic amounts, merchants, and dates
2. **Testing a user management system?** Need diverse names, emails, phone numbers — not `test1@test.com`
3. **Working on a healthcare app?** Need patient records with real ICD-10 codes, lab values, and vitals — but can't touch real data (HIPAA)
4. **Demoing to a client?** Need data that looks *real*, not obviously fake

The existing solutions all had problems:

| Tool | Issue |
|------|-------|
| **Faker.js** | Requires code setup, no UI |
| **Mockaroo** | Free tier limited to 1,000 rows |
| **Synthea** | Requires Java installation, CLI only |
| **Manual scripts** | Time-consuming, never realistic enough |

So I built something better.

---

## What DataForge Does

### 9 Data Types Across 2 Categories

**📊 General Data:**

| Type | Fields Generated |
|------|-----------------|
| **Users** | First/last name, email, phone, DOB, gender, company, job title, username, active status |
| **Addresses** | Street, apt, city, state, ZIP, country, lat/lng coordinates, address type |
| **Transactions** | Date, description, merchant, amount, currency, credit/debit, status, category, account number |

**🏥 Healthcare Data (HIPAA-Safe):**

| Type | Fields Generated |
|------|-----------------|
| **Patients** | MRN, demographics, blood type, allergies, conditions, insurance, emergency contact, status |
| **Medical Records** | Visit type, ICD-10 diagnosis, CPT procedure, vitals (BP, HR, temp, SpO2), clinical notes, provider |
| **Prescriptions** | Rx number, medication, dosage, form, frequency, quantity, refills, DEA number, NDC code, pharmacy |
| **Lab Results** | 26 real lab tests, result values, reference ranges, High/Low/Normal flags, specimen type, LOINC codes |
| **Insurance Claims** | Claim number, charged/allowed/paid amounts, patient responsibility, plan type, approval status |
| **Providers** | NPI number, credentials (MD/DO/PhD), specialty, facility, accepting patients, ratings |

All healthcare data uses **real medical codes and reference ranges** — not random strings.

---

## Features That Make It Powerful

### ⚡ Generate Up to 50,000 Records Instantly

Data generation runs entirely client-side in your browser. No API calls. No server round-trips. No waiting.

```
10 → 25 → 50 → 100 → 250 → 500 → 1K → 2.5K → 5K → 10K → 25K → 50K
```

I generated 50,000 patient records in **1.8 seconds** on my laptop. Try doing that with a REST API.

### 🎬 Custom Scenarios

This is my favorite feature. Instead of purely random data, you can define **rules** that shape your dataset:

**19 Preset Scenarios:**

| Category | Scenarios |
|----------|-----------|
| **General** | Clean Data, High-Value Transactions ($5K+), Fraud Patterns, Micro-Transactions, New User Signups, Churned Users |
| **Healthcare** | Elderly Patient Cohort (65+), Pediatric Cohort (0-17), Critical Lab Values, Denied Claims, High-Cost Claims ($10K+), ER Visits, Telehealth Visits, Controlled Substances |
| **Edge Cases** | Dirty/Messy Data (15% nulls), Sparse Data (40% nulls), Unicode Names, Boundary Values, Heavy Duplicates |

**Custom Builder** — Set rules per field:

```
Age:        Min 65, Max 95       → Elderly cohort
Status:     Fixed "Denied"       → All denied claims
Null Rate:  15%                  → Realistic dirty data
Duplicates: 30%                  → Test deduplication logic
Errors:     5%                   → Test error handling
```
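To make the rule idea concrete, here is a minimal sketch of how null and duplicate injection could be layered on top of already-generated rows. It is an illustration under stated assumptions, not DataForge's actual code: `ScenarioRules`, `applyScenario`, and the `rng` parameter (any function returning a number in [0, 1)) are hypothetical names.

```typescript
// Hypothetical rule shape; the real scenario config may look different.
interface ScenarioRules {
  nullRate: number;      // probability (0..1) that a field is replaced by null
  duplicateRate: number; // probability (0..1) that a row repeats an earlier row
}

type Row = Record<string, string | number | null>;

// Applies duplicate and null injection to rows that were already generated.
function applyScenario(rows: Row[], rules: ScenarioRules, rng: () => number): Row[] {
  const out: Row[] = [];
  for (const row of rows) {
    // Duplicate injection: emit a copy of an earlier output row instead.
    if (out.length > 0 && rng() < rules.duplicateRate) {
      const source = out[Math.floor(rng() * out.length)];
      out.push({ ...source });
      continue;
    }
    // Null injection: decide independently for each field.
    const copy: Row = {};
    for (const key of Object.keys(row)) {
      copy[key] = rng() < rules.nullRate ? null : row[key];
    }
    out.push(copy);
  }
  return out;
}
```

Passing the random source in as a parameter keeps the function deterministic whenever the caller supplies a seeded generator.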

### 📤 Four Export Formats

**JSON:**

```json
[{ "firstName": "Sarah", "lastName": "Chen", "email": "schen@email.com" }]
```

**CSV:**

```csv
firstName,lastName,email
Sarah,Chen,schen@email.com
```

**SQL:**

```sql
INSERT INTO users (firstName, lastName, email)
VALUES ('Sarah', 'Chen', 'schen@email.com');
```

**HL7 FHIR Bundle:**

```json
{
  "resourceType": "Bundle",
  "type": "collection",
  "entry": [{
    "resource": {
      "resourceType": "Patient",
      "name": [{ "family": "Chen", "given": ["Sarah"] }]
    }
  }]
}
```
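The part of exporting that usually goes wrong is escaping. Below is a rough sketch of CSV and SQL serializers that handle quotes and commas. The function names (`toCsv`, `toSqlInserts`) and the `Row` type are assumptions for illustration and are not taken from the DataForge source; table and column names are assumed to be trusted identifiers.

```typescript
type Row = Record<string, string | number | boolean | null>;

// Quote a CSV field only when it contains a comma, quote, or line break;
// embedded double quotes are doubled (RFC 4180 style).
function csvField(value: Row[string]): string {
  if (value === null) return "";
  const text = String(value);
  return /[",\r\n]/.test(text) ? `"${text.replace(/"/g, '""')}"` : text;
}

// Uses the keys of the first row as the header line.
function toCsv(rows: Row[]): string {
  if (rows.length === 0) return "";
  const headers = Object.keys(rows[0]);
  const lines = rows.map((row) => headers.map((h) => csvField(row[h] ?? null)).join(","));
  return [headers.join(","), ...lines].join("\n");
}

// Render a SQL literal; single quotes are doubled so names like O'Brien stay valid.
function sqlLiteral(value: Row[string]): string {
  if (value === null) return "NULL";
  if (typeof value === "number") return String(value);
  if (typeof value === "boolean") return value ? "TRUE" : "FALSE";
  return `'${value.replace(/'/g, "''")}'`;
}

// One INSERT statement per row. Identifiers are NOT escaped here.
function toSqlInserts(table: string, rows: Row[]): string {
  return rows
    .map((row) => {
      const columns = Object.keys(row);
      const values = columns.map((c) => sqlLiteral(row[c] ?? null));
      return `INSERT INTO ${table} (${columns.join(", ")}) VALUES (${values.join(", ")});`;
    })
    .join("\n");
}
```

Without the quote doubling, a single generated surname with an apostrophe would break the whole SQL file on import.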

### 🌱 Reproducible with Seeds

Set a seed value → get the **exact same dataset** every time.

```
Seed 42 → Always generates the same 1,000 users
Seed 42 → Same data on your machine, my machine, CI/CD
```

Perfect for consistent test suites and shared fixtures.
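As a small illustration of what a seed buys you, the sketch below drives a toy user generator from a compact Park-Miller step. Everything in it (`makeRng`, `generateUsers`, the `FIRST_NAMES` pool, the `User` shape) is a hypothetical stand-in rather than DataForge's real generator.

```typescript
// Compact Park-Miller step wrapped in a closure; returns floats in (0, 1).
function makeRng(seed: number): () => number {
  const m = 2147483647;
  // Normalize the seed into [1, m - 1]; a state of 0 would never change.
  let state = ((Math.floor(seed) % m) + m) % m || 1;
  return () => {
    state = (state * 16807) % m;
    return state / m;
  };
}

interface User {
  id: number;
  firstName: string;
  age: number;
}

// Hypothetical name pool, for illustration only.
const FIRST_NAMES = ["Sarah", "Miguel", "Aisha", "Chen", "Priya", "Tom"];

function generateUsers(seed: number, count: number): User[] {
  const rng = makeRng(seed);
  return Array.from({ length: count }, (_, i) => ({
    id: i + 1,
    firstName: FIRST_NAMES[Math.floor(rng() * FIRST_NAMES.length)],
    age: 18 + Math.floor(rng() * 63), // integer in 18..80
  }));
}

const runA = generateUsers(42, 1000);
const runB = generateUsers(42, 1000);
const runC = generateUsers(43, 1000);
// Same seed: the two datasets serialize identically.
console.log("seed 42 twice identical:", JSON.stringify(runA) === JSON.stringify(runB));
// Different seed: the datasets differ.
console.log("seed 42 vs 43 identical:", JSON.stringify(runA) === JSON.stringify(runC));
```

In a test suite, the seed effectively becomes part of the fixture: commit the seed and the row count, not the generated file.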

---

## How I Built It

### Tech Stack

- **React 18** + TypeScript
- **Tailwind CSS** — Dark HD interface with glassmorphism
- **Vite** — Sub-second builds
- **Supabase** — Authentication and user tracking
- **Custom seeded PRNG** — No external faker library

### The Secret: A Custom Random Number Generator

The core of DataForge is a **seeded pseudo-random number generator**. Unlike `Math.random()` (which can't be seeded), this gives deterministic, reproducible output:

```typescript
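class SeededRandom {
  // Park-Miller "minimal standard" constants: modulus 2^31 - 1, multiplier 7^5
  private static readonly MODULUS = 2147483647;
  private static readonly MULTIPLIER = 16807;

  private seed: number;

  constructor(seed: number) {
    // Map any integer seed into [1, MODULUS - 1]. A state of 0 would
    // otherwise stay 0 forever and next() would always return 0.
    const m = SeededRandom.MODULUS;
    const normalized = ((Math.floor(seed) % m) + m) % m;
    this.seed = normalized === 0 ? 1 : normalized;
  }

  // Returns a float in (0, 1)
  next(): number {
    // Lehmer/Park-Miller PRNG
    this.seed = (this.seed * SeededRandom.MULTIPLIER) % SeededRandom.MODULUS;
    return this.seed / SeededRandom.MODULUS;
  }

  // Returns an integer in [min, max], both ends inclusive
  nextInt(min: number, max: number): number {
    return Math.floor(this.next() * (max - min + 1)) + min;
  }

  // Picks one element uniformly at random (array must be non-empty)
  pick<T>(array: T[]): T {
    return array[this.nextInt(0, array.length - 1)];
  }

  // Picks a value with probability proportional to its weight
  weightedPick<T>(items: { value: T; weight: number }[]): T {
    const totalWeight = items.reduce((sum, item) => sum + item.weight, 0);
    let random = this.next() * totalWeight;
    for (const item of items) {
      random -= item.weight;
      if (random <= 0) return item.value;
    }
    // Fallback in case floating-point rounding leaves a tiny remainder
    return items[items.length - 1].value;
  }
}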
```

Same seed → same sequence → same data. Every time.

### Making Healthcare Data Realistic

Random strings don't cut it for healthcare testing. I embedded real medical data:

**Blood types with real-world frequency distribution:**

```typescript
const bloodTypes = [
  { value: 'O+', weight: 37 },  // 37% of population
  { value: 'A+', weight: 36 },  // 36%
  { value: 'B+', weight: 8 },   // 8%
  { value: 'AB+', weight: 3 },  // 3%
  { value: 'O-', weight: 7 },   // 7%
  { value: 'A-', weight: 6 },   // 6%
  { value: 'B-', weight: 2 },   // 2%
  { value: 'AB-', weight: 1 },  // 1%
];
```
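To see how the weights turn into frequencies, here is a small sketch that walks the same cumulative-weight logic as `weightedPick`, but takes the uniform draw as an argument so the outcome is easy to inspect. `pickByWeight`, `tallyPicks`, and `bloodTypeWeights` are hypothetical helper names; the weights are copied from the table above.

```typescript
interface Weighted<T> {
  value: T;
  weight: number;
}

// Cumulative-weight walk: `u` is a uniform draw, expected in [0, 1).
function pickByWeight<T>(items: Weighted<T>[], u: number): T {
  const total = items.reduce((sum, item) => sum + item.weight, 0);
  let remaining = u * total;
  for (const item of items) {
    remaining -= item.weight;
    if (remaining <= 0) return item.value;
  }
  return items[items.length - 1].value;
}

const bloodTypeWeights: Weighted<string>[] = [
  { value: "O+", weight: 37 },
  { value: "A+", weight: 36 },
  { value: "B+", weight: 8 },
  { value: "AB+", weight: 3 },
  { value: "O-", weight: 7 },
  { value: "A-", weight: 6 },
  { value: "B-", weight: 2 },
  { value: "AB-", weight: 1 },
];

// Sweep evenly spaced draws across [0, 1) and count how often each type wins.
function tallyPicks(samples: number): Map<string, number> {
  const counts = new Map<string, number>();
  for (let i = 0; i < samples; i++) {
    const picked = pickByWeight(bloodTypeWeights, (i + 0.5) / samples);
    counts.set(picked, (counts.get(picked) ?? 0) + 1);
  }
  return counts;
}

// Counts come out proportional to the weights.
tallyPicks(1000).forEach((count, bloodType) => console.log(bloodType, count));
```

With a real PRNG feeding `u`, the counts fluctuate around these proportions instead of matching them exactly.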

**26 lab tests with actual clinical reference ranges:**

```typescript
const labTests = [
  { name: 'Glucose', unit: 'mg/dL', low: 70, high: 100 },
  { name: 'Hemoglobin A1c', unit: '%', low: 4.0, high: 5.6 },
  { name: 'TSH', unit: 'mIU/L', low: 0.4, high: 4.0 },
  { name: 'Creatinine', unit: 'mg/dL', low: 0.6, high: 1.2 },
  { name: 'Total Cholesterol', unit: 'mg/dL', low: 0, high: 200 },
  // ... 21 more real tests
];
```

25% of lab results are intentionally generated **outside** the reference range — because real data has abnormals. Each gets flagged as ↑ High or ↓ Low, just like a real lab report.
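A minimal sketch of that abnormal-value logic might look like the following. The 25% rate comes from the text above; everything else is an assumption made for illustration, including the names (`generateLabResult`, `flagFor`), the `rng` parameter, and how far an abnormal value strays outside the range (5% to 50% of the range width).

```typescript
interface LabTest {
  name: string;
  unit: string;
  low: number;
  high: number;
}

type Flag = "High" | "Low" | "Normal";

interface LabResult {
  test: string;
  value: number;
  unit: string;
  flag: Flag;
}

// Compare a value against the test's reference range.
function flagFor(value: number, test: LabTest): Flag {
  if (value > test.high) return "High";
  if (value < test.low) return "Low";
  return "Normal";
}

// `rng` is any function returning a number in [0, 1).
function generateLabResult(test: LabTest, rng: () => number, abnormalRate = 0.25): LabResult {
  const width = test.high - test.low;
  let value: number;
  if (rng() < abnormalRate) {
    // Assumed overshoot: 5% to 50% of the range width beyond a bound.
    const overshoot = width * (0.05 + rng() * 0.45);
    // Go below the range only if the value stays non-negative; otherwise go above.
    const goLow = rng() < 0.5 && test.low - overshoot >= 0;
    value = goLow ? test.low - overshoot : test.high + overshoot;
  } else {
    value = test.low + rng() * width;
  }
  // Round to two decimals, then flag the rounded value so both always agree.
  const rounded = Math.round(value * 100) / 100;
  return { test: test.name, value: rounded, unit: test.unit, flag: flagFor(rounded, test) };
}
```

Flagging the rounded value rather than the raw one avoids reports where the displayed number sits inside the range but carries a High or Low flag.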

**Real medications with correct dosages:**

```typescript
const medications = [
  { name: 'Lisinopril', dosage: '10mg', form: 'Tablet', frequency: 'Once daily' },
  { name: 'Metformin', dosage: '500mg', form: 'Tablet', frequency: 'Twice daily' },
  { name: 'Atorvastatin', dosage: '20mg', form: 'Tablet', frequency: 'Once daily at bedtime' },
  { name: 'Omeprazole', dosage: '20mg', form: 'Capsule', frequency: 'Once daily before breakfast' },
  // ... 21 more real medications
];
```

### The Dark HD Interface

I wanted DataForge to feel premium. The UI uses:

- **Deep dark background** (`#08090d`) with ambient glow effects
- **Glassmorphism header** using `backdrop-filter: blur(20px)`
- **Inter + JetBrains Mono** fonts for crisp text and data
- **Color-coded badges** — blood types in red, lab flags in amber/emerald, claim status in semantic colors
- **Custom scrollbars** matching the dark theme
- **Subtle animations** — float-in effects, glow pulses, hover states

It's the kind of interface you actually *want* to use, not just tolerate.

---

## The Healthcare Angle: Why This Matters

If you work in Health IT, you know the pain:

- ❌ **Can't use real patient data**: HIPAA violation, massive fines
- ❌ **Epic/Cerner sandboxes** have limited pre-loaded test patients
- ❌ **Synthea** is powerful but requires Java + CLI + configuration files
- ❌ **Online generators** don't understand ICD-10 codes or lab reference ranges
- ❌ **Manual creation**: entering test patients one by one is soul-crushing

DataForge solves this:

- ✅ **100% synthetic**: Zero PHI, zero HIPAA risk
- ✅ **Real medical codes**: ICD-10 diagnoses, CPT procedures, not random strings
- ✅ **Clinical accuracy**: Lab values with actual reference ranges and abnormal flags
- ✅ **FHIR-native**: Export valid HL7 FHIR Bundles, ready for sandbox import
- ✅ **Custom clinical scenarios**: Elderly cohorts, pediatric data, critical labs, denied claims
- ✅ **Instant**: 50K records in seconds, no Java, no CLI, no setup
- ✅ **Shareable**: Send the URL to your QA team, they can use it immediately
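
For readers curious what the FHIR mapping involves, here is a minimal sketch that wraps flat patient rows in a `collection` Bundle. It covers only a small subset of the FHIR Patient resource (`identifier`, `name`, `gender`, `birthDate`). The input shape `PatientRecord`, the function name `toFhirBundle`, and the placeholder identifier system are assumptions, not DataForge's actual schema, and a given sandbox may require additional fields or profiles.

```typescript
// Hypothetical flat input row.
interface PatientRecord {
  mrn: string;
  firstName: string;
  lastName: string;
  gender: "male" | "female" | "other" | "unknown";
  dateOfBirth: string; // ISO date, e.g. "1958-03-14"
}

// Subset of the FHIR Patient resource used in this sketch.
interface FhirPatient {
  resourceType: "Patient";
  identifier: { system: string; value: string }[];
  name: { family: string; given: string[] }[];
  gender: PatientRecord["gender"];
  birthDate: string;
}

interface FhirEntry {
  resource: FhirPatient;
}

interface FhirBundle {
  resourceType: "Bundle";
  type: "collection";
  entry: FhirEntry[];
}

// Placeholder identifier system; a real deployment would use its own URI.
const MRN_SYSTEM = "urn:example:mrn";

function toFhirBundle(patients: PatientRecord[]): FhirBundle {
  return {
    resourceType: "Bundle",
    type: "collection",
    entry: patients.map((p): FhirEntry => ({
      resource: {
        resourceType: "Patient",
        identifier: [{ system: MRN_SYSTEM, value: p.mrn }],
        name: [{ family: p.lastName, given: [p.firstName] }],
        gender: p.gender,
        birthDate: p.dateOfBirth,
      },
    })),
  };
}

const bundle = toFhirBundle([
  { mrn: "MRN-0001", firstName: "Sarah", lastName: "Chen", gender: "female", dateOfBirth: "1958-03-14" },
]);
console.log(JSON.stringify(bundle, null, 2));
```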

### Real Use Cases

**EHR Integration Testing:**
> Generate 10,000 patients → Export as FHIR → Import into Epic/Cerner sandbox → Test your interface

**QA Load Testing:**
> Generate 50,000 lab results with critical values → Verify your alerting system catches them all

**Training Environments:**
> Generate diverse patient populations → Load into training EHR → Train new clinicians without PHI risk

**Client Demos:**
> Generate realistic insurance claims → Show your claims processing dashboard with convincing data

---

## Architecture Overview

```
┌───────────────────────────────────────────┐
│              Browser (Client)             │
│                                           │
│  ┌──────────┐  ┌────────────┐  ┌────────┐ │
│  │ React    │  │ Seeded     │  │ Export │ │
│  │ UI +     │→ │ PRNG       │→ │ Engine │ │
│  │ Tailwind │  │ Engine     │  │ JSON/  │ │
│  │          │  │            │  │ CSV/   │ │
│  │ Config   │  │ Data       │  │ SQL/   │ │
│  │ Panel    │  │ Generators │  │ FHIR   │ │
│  └──────────┘  └────────────┘  └────────┘ │
│                                           │
│  ┌─────────────────────────────────────┐  │
│  │  Scenario Engine                    │  │
│  │  (Field rules, null injection,      │  │
│  │   error injection, duplicates)      │  │
│  └─────────────────────────────────────┘  │
│                                           │
│  ┌─────────────┐                          │
│  │  Supabase   │ ← Auth + Usage Tracking  │
│  │  Client     │                          │
│  └─────────────┘                          │
└───────────────────────────────────────────┘
```

No server-side rendering. No API calls for data generation. Everything happens in the browser tab.

---

## Performance

| Records | Generation Time | File Size (JSON) |
|---------|----------------|-------------------|
| 100 | ~15ms | ~50 KB |
| 1,000 | ~80ms | ~500 KB |
| 10,000 | ~400ms | ~5 MB |
| 50,000 | ~1.8s | ~25 MB |

All measured on a standard laptop (M1 MacBook Air). Your mileage may vary, but it's fast.

---

## What I Learned Building This

### 1. Seeded randomness is underrated

Most developers reach for `Math.random()` and call it a day. But seedable PRNGs unlock:
- **Reproducible test suites** — Same seed = same data = deterministic tests
- **Shareable datasets** — "Use seed 42 to reproduce this bug"
- **Debugging** — Reproduce exact conditions that caused a failure

### 2. Healthcare data has hidden complexity

Building a realistic patient record is surprisingly complex:
- Blood types follow a specific population distribution
- Lab values need to be within realistic ranges (but sometimes abnormal)
- ICD-10 codes need to match plausible diagnoses
- Medications have specific dosages, forms, and frequencies
- Vitals are interconnected (low SpO2 often pairs with high HR)
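
The last point is the easiest one to get wrong with independent random draws. As a toy sketch of coupled vitals: the ranges and the "3 bpm per SpO2 point below 95" coupling below are illustrative assumptions, not clinical guidance and not DataForge's actual model. `generateVitals` and the `rng` parameter are hypothetical names.

```typescript
interface Vitals {
  spo2: number;      // oxygen saturation, percent
  heartRate: number; // beats per minute
}

// `rng` is any function returning a number in [0, 1).
function generateVitals(rng: () => number): Vitals {
  // Assumed SpO2 range: integer in 88..100.
  const spo2 = 88 + Math.floor(rng() * 13);
  // Assumed baseline resting heart rate: integer in 60..89.
  const baseline = 60 + Math.floor(rng() * 30);
  // Illustrative coupling: add 3 bpm for every point of SpO2 below 95.
  const compensation = Math.max(0, 95 - spo2) * 3;
  return { spo2, heartRate: baseline + compensation };
}
```

Drawing heart rate conditionally on SpO2, instead of drawing both independently, is what keeps a generated record from pairing a very low saturation with a perfectly calm pulse.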

### 3. Dark themes are harder than they look

Getting a dark UI to look premium (not just "dark gray everything") requires:
- Multiple surface levels (not just one dark color)
- Careful contrast ratios for accessibility
- Ambient light effects for depth
- Thoughtful use of color for hierarchy

### 4. Client-side generation scales well

I was worried about generating 50K records in the browser. Turns out, JavaScript is fast enough — the bottleneck is usually DOM rendering, not data generation. That's why the table caps at 500 visible rows while the full dataset is available for download.

---

## Try It Yourself

🔗 **Live App:** [https://data-faker-tool.vercel.app](https://data-faker-tool.vercel.app)



🏆 **Product Hunt:** [DataForge on Product Hunt](https://www.producthunt.com/products/dataforge-synthetic-data-generator)

It's free. It's fast. It runs in your browser. No generated data ever leaves your machine.

---

## What's Coming Next

I'm actively building new features:

- [ ] **Custom schema builder** — Define your own data types with custom fields
- [ ] **API mode** — Use DataForge as a mock REST API endpoint
- [ ] **Table relationships** — Foreign keys between generated tables
- [ ] **More healthcare standards** — HL7v2 messages, C-CDA documents
- [ ] **Localization** — Non-US names, addresses, phone formats, currencies
- [ ] **Saved configurations** — Save and share your generation setups
- [ ] **CLI version** — `npx dataforge generate --type patients --count 10000`

If you have ideas, [open an issue on GitHub](YOUR_GITHUB_URL/issues) or drop a comment below.

---

## Final Thoughts

I built DataForge because I was tired of the same problem every developer faces: needing realistic test data and settling for garbage.

If this tool saves you even 30 minutes, it was worth building.

Drop a ⭐ on the [GitHub repo](YOUR_GITHUB_URL) if you find it useful. And if you're using it for something cool, I'd love to hear about it in the comments.

Happy testing! 🚀

---

*Built with React, TypeScript, Tailwind CSS, Vite, and Supabase. Deployed on Vercel.*