We Never See Your Data
We're an AI-first engineering shop. That means every dataset we touch could end up in a prompt, a test suite, or a log file. So we built a system that makes it impossible for us to leak what we never had.
Intelligrit LLC | April 2026
Bottom Line Up Front
Decoy is our synthetic data generation system. A single binary deploys inside your security boundary, connects to your real data, uses an AI agent to analyze structure and distributions, and generates a complete synthetic dataset — same shape, same relationships, same statistical properties, zero real data. Only the synthetic output crosses the trust boundary. We build your application on the decoy. You can verify we never saw the original.
The Contradiction
We build enterprise applications that touch sensitive data — financial records, healthcare information, acquisition details, personnel systems. To build well, we need realistic data: correct schemas, plausible distributions, proper relationships between tables, enough volume to test performance. To build safely, we need to never see any of it.
This isn't theoretical caution. We're an AI-first shop. AI is how we deliver fast. That means every dataset we work with is one careless prompt away from being sent to a model provider, one test fixture away from being committed to a repository, one log statement away from being written to disk. The standard approach — NDAs, access controls, "just be careful" — doesn't eliminate the risk. It manages it. We wanted to eliminate it.
So we built Decoy.
How It Works
Decoy is a single compiled binary that deploys inside your trust boundary — your ATO, your VPC, your air-gapped network. It never phones home. It connects to your data sources with read-only access: S3 buckets, databases, data lakes, Dataverse, flat files. It supports over 310 data types out of the box, from common types like names, addresses, dates, phone numbers, and currency amounts to specialized generators for UUIDs, IBAN numbers, latitude/longitude pairs, MAC addresses, and NAICS codes.
Inside your boundary, Decoy pairs with an AI agent that is authorized to see the data — a model running in your environment, like Bedrock inside your AWS ATO. The AI agent does the work a human data analyst would do: it reads schemas, understands field semantics, discovers relationships (both explicit foreign keys and implied soft-references like matching GUIDs across tables), identifies embedded structures (pipe-delimited values, nested JSON, arrays), and notes data quirks that developers would need to know.
Then the AI writes code. It builds custom data providers — generation rules for every field, every relationship, every distribution. Not just types, but relationships: if your vendor table has 200 entries and your invoice table references them with realistic frequency distributions, the synthetic data preserves that. If 20% of rates across all vendors fall between $80 and $120/hour, the synthetic distribution matches. This is configurable — if even the distribution shape is sensitive, you can mask it.
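The kind of provider code the agent writes can be sketched as follows. This is an illustrative Python sketch, not Decoy's actual generated code; the field names, bucket shapes, and helper function are invented for the example, using the 200-vendor / rate-bucket scenario from the paragraph above.

```python
import random

# Hypothetical sketch of a generated data provider. It preserves two
# observed properties: which vendors invoices reference (frequency
# distribution) and how hourly rates cluster into buckets.

def make_invoice_provider(vendor_ids, vendor_weights, rate_buckets):
    """Return a generator of synthetic invoice rows.

    vendor_weights: relative reference frequency per vendor, as
    observed in the real invoice table (preserves the soft reference).
    rate_buckets: list of ((low, high), probability) pairs, e.g. a
    0.20 weight on (80, 120) mirrors "20% of rates fall in $80-120".
    """
    def next_invoice():
        # Pick a vendor with the same relative frequency as the real data.
        vendor_id = random.choices(vendor_ids, weights=vendor_weights)[0]
        # Pick a rate bucket by observed probability, then a value inside it.
        (low, high) = random.choices(
            [bounds for bounds, _ in rate_buckets],
            weights=[weight for _, weight in rate_buckets],
        )[0]
        return {"vendor_id": vendor_id,
                "hourly_rate": round(random.uniform(low, high), 2)}
    return next_invoice

provider = make_invoice_provider(
    vendor_ids=[f"V{n:03d}" for n in range(200)],      # 200 vendors, as above
    vendor_weights=[1 + (n % 5) for n in range(200)],  # skewed, not uniform
    rate_buckets=[((80, 120), 0.20), ((40, 80), 0.50), ((120, 300), 0.30)],
)
rows = [provider() for _ in range(1000)]
```

The point of the sketch is that generation is driven by measured statistics, not by per-field randomness, so cross-table references and distribution shapes survive into the synthetic output.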
The generated provider code runs, produces synthetic data, and if it fails — wrong types, broken constraints, missing relationships — the AI agent debugs and fixes it. The loop continues until the output is clean.
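That generate-validate-repair loop can be sketched in a few lines. The three helpers here are toy stand-ins for the agent's real tooling, and every name is hypothetical:

```python
# Toy sketch of the loop: run the generated provider, validate its
# output, and let the repair step rewrite the provider until the
# validation pass comes back clean.

def run_until_clean(provider, validate, repair, max_attempts=5):
    for _ in range(max_attempts):
        rows = provider()
        errors = validate(rows)      # wrong types, broken constraints, ...
        if not errors:
            return rows              # clean output: done
        provider = repair(provider, errors)
    raise RuntimeError("provider still failing after max_attempts")

# Minimal demonstration: a provider that emits rates as strings
# until the "repair" step wraps it with a type fix.
def bad_provider():
    return [{"rate": "95.0"}]

def validate(rows):
    return [r for r in rows if not isinstance(r["rate"], float)]

def repair(provider, errors):
    def fixed():
        return [{**r, "rate": float(r["rate"])} for r in provider()]
    return fixed

rows = run_until_clean(bad_provider, validate, repair)  # [{"rate": 95.0}]
```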
What Crosses the Trust Boundary
Only three things leave your environment, and only after you review them:
1. The Synthetic Dataset
Same schema, same volume, same statistical properties. If your real JSON has 1.2M lines, the synthetic one has 1.2M lines. Same field lengths, same nested structures, same referential integrity. Zero real values.
2. The Data Documentation
Schema maps, field profiles, relationship graphs, data quirks. Everything a developer needs to understand the data without seeing it. Great for onboarding and transitions too — you keep a copy.
3. The Provider Code
The generation rules themselves. No real data — just the logic for producing synthetic data that matches the original's shape. Useful for regeneration when schemas evolve.
Decoy writes its output to a location you specify — an S3 bucket, a shared drive, wherever you control. You review everything before we ever see it. Even in the default posture, the only information that could possibly leak is statistical: the distributions of numeric fields. No names, no identifiers, no text content, no specific values. And if even that is too much, distribution masking is a config toggle away.
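The effect of that masking toggle can be sketched as follows. The function and parameter names are invented for illustration; the real toggle's name and mechanics may differ.

```python
import random

# Illustrative sketch: with mask_distributions=True the generator
# ignores the observed bucket shape and samples uniformly over the
# field's range, so not even the distribution of real values leaks.

def sample_rate(observed_buckets, lo, hi, mask_distributions=False):
    if mask_distributions:
        # Leak nothing but the field's min/max bounds.
        return random.uniform(lo, hi)
    # Otherwise reproduce the observed bucket frequencies.
    (b_lo, b_hi) = random.choices(
        [bounds for bounds, _ in observed_buckets],
        weights=[weight for _, weight in observed_buckets],
    )[0]
    return random.uniform(b_lo, b_hi)

buckets = [((80, 120), 0.2), ((40, 80), 0.5), ((120, 300), 0.3)]
matched = [sample_rate(buckets, 40, 300) for _ in range(1000)]
masked = [sample_rate(buckets, 40, 300, mask_distributions=True) for _ in range(1000)]
```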
The Documentation Bonus
The AI agent that builds your synthetic data also builds something arguably as valuable: a complete, non-sensitive guide to your data. Because it has to deeply understand the data to generate convincing fakes, it produces documentation as a side effect:
Schema Map
Every table, collection, and file — fields, types, nullability, constraints. Auto-generated, always current.
Field Profiles
Per field: cardinality, min/max/average for numerics, max string lengths, enum values for categoricals, embedded structure notes (pipe-delimited, nested JSON, arrays).
Relationship Graph
Explicit foreign keys plus discovered soft-references — matching GUIDs, identical long numbers across tables, implied parent-child patterns the original developers never documented.
Data Quirks Log
Things a human analyst would discover after a week of exploration: "field labeled 'date' contains epoch milliseconds," "vendor_code uses 3 different formats across time periods," "free text field maxes at 4,200 characters but averages 80."
Provider Reference
What each custom generator does and why it exists, so developers on our side understand exactly what the synthetic data represents.
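The numeric and text profiles described above can be sketched as small summary functions — an illustration of the kind of statistics a field profile carries, not Decoy's actual output format. Computing them requires the real values, but only the summary crosses the boundary.

```python
# Hypothetical field-profile helpers: everything returned here is
# aggregate statistics, never a real value from the source data.

def profile_numeric(name, values):
    return {
        "field": name,
        "cardinality": len(set(values)),   # distinct values seen
        "min": min(values),
        "max": max(values),
        "avg": round(sum(values) / len(values), 2),
    }

def profile_text(name, values):
    return {
        "field": name,
        "cardinality": len(set(values)),
        "max_len": max(len(v) for v in values),   # e.g. "maxes at 4,200 chars"
        "avg_len": round(sum(len(v) for v in values) / len(values), 2),
    }

profile = profile_numeric("hourly_rate", [80.0, 95.5, 120.0, 95.5])
# -> {'field': 'hourly_rate', 'cardinality': 3, 'min': 80.0,
#     'max': 120.0, 'avg': 97.75}
```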
This documentation package is yours to keep. Not just for us — it's valuable for your own onboarding, transitions, and institutional knowledge. Most organizations don't have documentation this thorough about their own data. Now you do.
Not Just a Faker
Off-the-shelf fake data libraries generate random values by type. They'll give you a plausible name, a plausible address, a plausible phone number. That's useful for unit tests. It's useless for building real applications.
Real data has relationships. Invoices reference vendors. Vendors have rate histories. Rate histories cluster by job category. Job categories map to contract vehicles. None of these relationships exist in randomly generated data, and without them your application logic has nothing meaningful to exercise.
Decoy preserves all of it. Explicit foreign keys, discovered soft-references (matching GUIDs across tables that nobody documented as a relationship), frequency distributions, cardinality ratios, even temporal patterns. The AI agent discovers relationships the original developers may not have documented — because it's looking at the actual data, not the schema alone.
Off-the-Shelf Fakers
- Random values by type
- No cross-table relationships
- Uniform distributions
- Fixed row counts
- Breaks application logic on first real query
Decoy
- AI-analyzed field semantics
- Full referential integrity preserved
- Real distribution shapes (configurable)
- Matching volume for performance testing
- Applications work identically on synthetic data
Why This Matters
Our focus is fast development of high-quality, enterprise-ready applications. To deliver at that speed, we need high-quality inputs we can use freely — pipe into AI assistants, commit to repositories, share in code reviews, run in CI/CD, test at full scale on developer laptops. That requirement clashes directly with the nature of the data our systems touch.
Decoy resolves the contradiction. We can work on your financial data without ever seeing it. We can modernize your PHI-touching system without ever seeing it. We can build your acquisition management tools without ever seeing a single vendor rate or contract value. And you can check — at every step — that we never saw it.
This isn't just about compliance. It's about working fearlessly. When developers know that every byte of data on their machine is synthetic, they stop second-guessing. They stop worrying about accidentally including a test fixture in a commit. They stop avoiding AI tools because "what if the data leaks." They just build.
Configuration
Decoy is configured with a straightforward file that specifies data source connections (S3 bucket addresses, database connection strings, file paths) and optional prompts that steer the AI agent through corner cases. You can write this file, we can, or we can write it together. We recommend granting Decoy read-only permissions on all data sources; it never needs to write to any of them.
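A configuration along those lines might look like the following sketch, written here as a Python dict purely for illustration — the real file format, key names, endpoints, and paths are all hypothetical.

```python
# Hypothetical Decoy configuration sketch. Every key name and value
# here is invented to illustrate the shape of the file described above.
decoy_config = {
    "sources": [
        {"type": "s3", "uri": "s3://agency-data/finance/", "access": "read-only"},
        {"type": "postgres", "dsn": "postgresql://readonly@db.internal/erp"},
    ],
    # Output lands in a location you control and review before release.
    "output": {"path": "s3://agency-review/decoy-output/"},
    "agent": {
        "endpoint": "bedrock",  # a model authorized inside your boundary
        # Optional prompting for corner cases the agent should know about.
        "hints": ["column 'date' in invoices holds epoch milliseconds"],
    },
    # Flip to True if even distribution shapes are sensitive.
    "mask_distributions": False,
}
```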
The AI agent running in your environment handles the rest: schema discovery, relationship mapping, provider generation, synthetic data output, and documentation. When your data structures change — new columns, new tables, new relationships — you rerun Decoy. Most government data structures are fairly locked in, so this is an occasional operation, not a continuous pipeline.
A note on availability: Yes, we know there's interest in buying Decoy as a standalone product. As of now, you have to hire us to get it.