Differential Privacy

Differential privacy is a rigorous mathematical definition of privacy. It guarantees that the distribution of an analysis's output is nearly the same whether or not your data is included.

1. Intuition: Plausible Deniability

Before complex algorithms, we start with Randomized Response. Imagine we ask a sensitive question: "Have you ever cheated on a test?" To protect your privacy, we introduce a coin flip mechanism. This gives you plausible deniability—even if you say "Yes", we don't know if it's the truth or the coin.

The Protocol

  1. Flip a coin (hidden from the analyst).
  2. If Heads: Answer Truthfully.
  3. If Tails: Flip again. Answer Yes if Heads, No if Tails (Random Answer).
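The protocol above can be sketched in a few lines of Python. This is a minimal illustration, not a production implementation: the function name `randomized_response` and the seeded generator are assumptions made here, and a fair coin is modeled as a uniform draw below 0.5.

```python
import random

def randomized_response(truth: bool, rng: random.Random) -> bool:
    """Return the answer a respondent reports under the two-coin protocol."""
    if rng.random() < 0.5:       # first flip is Heads: answer truthfully
        return truth
    return rng.random() < 0.5    # first flip is Tails: second flip picks Yes or No

rng = random.Random(0)  # seeded only to make the sketch repeatable
# Everyone's true answer is "Yes", yet roughly a quarter of reports say "No".
reports = [randomized_response(True, rng) for _ in range(10_000)]
yes_rate = sum(reports) / len(reports)
```

For a population whose true answer is always "Yes", the reported "Yes" rate settles near 0.75: half answer truthfully and half of the remainder say "Yes" by chance.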


Analyst View

We recover the true statistics by mathematically removing the expected noise.

Insight: Even though ~25% of people are forced to lie, with enough data, we can estimate the true percentage accurately, while no single user can be proven to have answered truthfully.
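To make the "removing the expected noise" step concrete: if p is the true share of "Yes" and q is the observed share, the protocol gives q = 0.5 * p + 0.25, so p = 2q - 0.5. The sketch below simulates this under assumed values (a true rate of 30% and 100,000 respondents, both invented for illustration); `estimate_true_rate` is a hypothetical helper name.

```python
import random

def estimate_true_rate(reports):
    """Invert q = 0.5 * p + 0.25, where q is the observed share of "Yes"."""
    q = sum(reports) / len(reports)
    return 2.0 * q - 0.5

rng = random.Random(42)
true_rate = 0.30  # assumed ground truth, unknown to the analyst in practice
reports = []
for _ in range(100_000):
    truth = rng.random() < true_rate
    if rng.random() < 0.5:                    # Heads: truthful answer
        reports.append(truth)
    else:                                     # Tails: random answer
        reports.append(rng.random() < 0.5)

estimate = estimate_true_rate(reports)
```

With small samples the estimate can fall outside the range 0 to 1, so a real analysis would typically clamp it.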

2. The Privacy Budget (ε)

Modern DP adds noise from a specific distribution (like Laplace). The amount of noise is controlled by Epsilon (ε).

High ε (e.g., 10): low privacy, high accuracy.
Low ε (e.g., 0.1): high privacy, low accuracy.
Lowering ε buys more privacy (more noise); raising ε buys more utility (more accuracy).
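For readers who want the formal statement behind ε, the standard definition can be written as below. The symbols are notation introduced here rather than taken from the text above: M is the randomized mechanism, D and D' are any two datasets that differ in one person's record, and S is any set of possible outputs.

```latex
\Pr[\,M(D) \in S\,] \;\le\; e^{\varepsilon} \cdot \Pr[\,M(D') \in S\,]
```

When ε is close to 0 the factor e^ε is close to 1, so the two output distributions are nearly indistinguishable. As ε grows the bound loosens and the output may reveal more about any one record.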

Scenario: Employee Salaries

We are querying the average salary of a department. To protect any single high-earner, we add Laplace Noise.

True Average: $75,000
DP Reported: $75,000
Error: 0%
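A rough sketch of how such a noisy average could be produced follows. It rests on several assumptions not stated in the scenario: salaries are clipped to an assumed range of $0 to $300,000, the number of employees is treated as public, and the data is synthetic. Laplace noise is sampled with the standard library as the difference of two exponential draws.

```python
import random

def laplace_noise(scale, rng):
    # The difference of two i.i.d. exponentials with mean `scale` is Laplace(0, scale).
    return rng.expovariate(1.0 / scale) - rng.expovariate(1.0 / scale)

def dp_mean(values, lower, upper, epsilon, rng):
    """Laplace-noised mean of values clipped to [lower, upper]."""
    clipped = [min(max(v, lower), upper) for v in values]
    # Replacing one of n clipped values moves the mean by at most (upper - lower) / n.
    sensitivity = (upper - lower) / len(clipped)
    return sum(clipped) / len(clipped) + laplace_noise(sensitivity / epsilon, rng)

rng = random.Random(7)
salaries = [rng.gauss(75_000, 15_000) for _ in range(200)]  # made-up salaries
reported = dp_mean(salaries, 0, 300_000, epsilon=1.0, rng=rng)
```

Clipping is what keeps a single very high earner from dominating the sensitivity. Without a bound, one salary could shift the average arbitrarily.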

The Laplace Mechanism

Noise is drawn from a Laplace distribution centered at 0 with scale sensitivity / ε:
Lap(sensitivity / ε).
Lower ε = Wider curve = More possible noise.
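Written out, the Laplace density with scale b is shown below. The symbols b (scale) and Δf (sensitivity) are notation introduced here for convenience.

```latex
f(x \mid b) = \frac{1}{2b} \exp\!\left(-\frac{|x|}{b}\right),
\qquad b = \frac{\Delta f}{\varepsilon},
\qquad \operatorname{Var}(x) = 2b^{2}
```

As a worked example with sensitivity Δf = 1: ε = 0.1 gives b = 10, while ε = 10 gives b = 0.1, so the noise is typically a hundred times larger at the stricter setting.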

Noise Distribution Visualization

The blue area is the probability distribution of the noise. The orange bar is a single random sample added to the data.

Key Algorithms

Different data types require different noise mechanisms.

1. Laplace Mechanism

Numerical Data

Used for counting queries (e.g., "How many people have diabetes?"). Adds noise from the Laplace distribution centered at 0. Simple and effective for numerical answers.
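A counting query is the simplest case, because one person changes the count by at most 1. The sketch below assumes a hypothetical list of patient records and a helper named `dp_count`; neither comes from a specific library.

```python
import random

def laplace_noise(scale, rng):
    # The difference of two i.i.d. exponentials with mean `scale` is Laplace(0, scale).
    return rng.expovariate(1.0 / scale) - rng.expovariate(1.0 / scale)

def dp_count(records, predicate, epsilon, rng):
    """Count matching records, then add Laplace noise with scale 1 / epsilon."""
    true_count = sum(1 for r in records if predicate(r))
    return true_count + laplace_noise(1.0 / epsilon, rng)  # sensitivity of a count is 1

rng = random.Random(3)
patients = [{"diabetes": i % 5 == 0} for i in range(1_000)]  # hypothetical records
noisy = dp_count(patients, lambda r: r["diabetes"], epsilon=0.5, rng=rng)
```

The released value is a real number, so it may be fractional or differ from the true count of 200 by a few units. Rounding after the noise is added does not weaken the guarantee.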

2. Exponential Mechanism

Categorical / Best Item

Used when picking the "best" item (e.g., "What is the most common disease?"). Instead of adding noise to a count, it selects each item with probability proportional to exp(ε × score / (2 × sensitivity)), so higher-scoring items are exponentially more likely to be chosen.
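A minimal sketch of the standard exponential mechanism, in which each item is weighted by exp(ε × score / (2 × sensitivity)). The disease counts are invented, and `exponential_mechanism` is a name chosen here for illustration.

```python
import math
import random

def exponential_mechanism(scores, epsilon, sensitivity, rng):
    """Pick a key with probability proportional to exp(epsilon * score / (2 * sensitivity))."""
    items = list(scores)
    top = max(scores.values())
    # Subtracting the max score avoids overflow in exp() and leaves the ratios unchanged.
    weights = [math.exp(epsilon * (scores[i] - top) / (2.0 * sensitivity)) for i in items]
    return rng.choices(items, weights=weights, k=1)[0]

rng = random.Random(1)
disease_counts = {"flu": 120, "asthma": 95, "diabetes": 60}  # hypothetical scores
picks = [exponential_mechanism(disease_counts, 0.5, 1.0, rng) for _ in range(2_000)]
```

With a score gap of 25 between the top two items, the best item wins almost every draw at ε = 0.5. When scores are close, or ε is small, the choice becomes noticeably more random.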

3. Local DP (Randomized Response)

Client-Side

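Noise is added on the user's device before data is ever sent to the server, so the server never sees the true value. Deployed by Apple (for example, for keyboard and emoji usage statistics) and by Google (for example, RAPPOR in Chrome).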
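The coin-flip protocol from the intuition section is itself a local DP mechanism, and its ε can be worked out directly. A respondent whose true answer is "Yes" reports "Yes" with probability 1/2 + 1/4, while one whose true answer is "No" reports "Yes" with probability 1/4. The same ratio holds for reporting "No", so:

```latex
\frac{\Pr[\text{report Yes} \mid \text{truth Yes}]}{\Pr[\text{report Yes} \mid \text{truth No}]}
= \frac{3/4}{1/4} = 3
\quad\Longrightarrow\quad
\varepsilon = \ln 3 \approx 1.10
```

A coin biased further toward random answers would lower this ratio and therefore lower ε.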

DP vs. Others

Why Differential Privacy is the gold standard.

Method | Technique | Vulnerability
Anonymization | Removing names/IDs | Linkage attacks (e.g., Netflix Prize)
K-Anonymity | Grouping k users together | Homogeneity attacks
Differential Privacy | Adding calibrated mathematical noise | Provable guarantee, with strength set by ε
* The comparison shows that removing identifiers is insufficient. DP provides a mathematical guarantee regardless of any auxiliary information an attacker might possess.

How to Apply (The Recipe)

1. Identify Sensitivity

Determine the "Sensitivity" of your query. How much can one person change the result? (e.g., for a count query, sensitivity is 1).
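A few common sensitivities, sketched as small helper functions. The function names are hypothetical, and the right value depends on which notion of "neighboring datasets" is used: the count and sum helpers assume one person is added or removed, while the mean helper assumes one record is replaced and the number of records n is public.

```python
def count_sensitivity():
    # Adding or removing one person changes a count by at most 1.
    return 1.0

def bounded_sum_sensitivity(lower, upper):
    # Adding or removing one value in [lower, upper] changes the sum by at most its magnitude.
    return float(max(abs(lower), abs(upper)))

def bounded_mean_sensitivity(lower, upper, n):
    # Replacing one of n values in [lower, upper] changes the mean by at most (upper - lower) / n.
    return (upper - lower) / n

salary_mean_sensitivity = bounded_mean_sensitivity(0, 300_000, 200)  # 1500.0
```

Queries over unbounded values have unbounded sensitivity, which is why inputs are usually clipped to a fixed range first.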

2. Set Budget (ε)

Choose your Epsilon based on privacy needs. Common values are between 0.1 (strict) and 10 (loose).
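The word "budget" reflects the fact that ε accumulates: under basic sequential composition, answering several queries on the same data costs the sum of their ε values. The class below is a hypothetical bookkeeping sketch of that rule, not part of any particular library.

```python
class PrivacyBudget:
    """Tracks cumulative epsilon under basic sequential composition."""

    def __init__(self, total_epsilon):
        self.total = total_epsilon
        self.spent = 0.0

    def spend(self, epsilon):
        """Reserve epsilon for one query and return the remaining budget."""
        if epsilon <= 0:
            raise ValueError("epsilon must be positive")
        if self.spent + epsilon > self.total + 1e-12:  # small tolerance for float error
            raise RuntimeError("privacy budget exhausted")
        self.spent += epsilon
        return self.total - self.spent

budget = PrivacyBudget(1.0)
budget.spend(0.4)  # first query
budget.spend(0.4)  # second query; a third at 0.4 would exceed the total of 1.0
```

Tighter composition results exist, so simple addition should be read as a conservative upper bound.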

3. Add Noise

Sample noise from the chosen distribution (Laplace or Gaussian), scaled to the sensitivity and ε, and add it to the final aggregate result. Never release raw data.
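The Laplace case is covered by Lap(sensitivity / ε). For the Gaussian option, a sketch under extra assumptions is given below: it uses the classic calibration σ = Δ × sqrt(2 ln(1.25 / δ)) / ε, which requires a second parameter δ (a small failure probability not discussed above) and is usually stated for 0 < ε < 1. The function names are hypothetical.

```python
import math
import random

def gaussian_sigma(l2_sensitivity, epsilon, delta):
    """Noise standard deviation under the classic (epsilon, delta) calibration."""
    if not (0.0 < epsilon < 1.0):
        raise ValueError("this calibration assumes 0 < epsilon < 1")
    if not (0.0 < delta < 1.0):
        raise ValueError("delta must lie in (0, 1)")
    return l2_sensitivity * math.sqrt(2.0 * math.log(1.25 / delta)) / epsilon

def gaussian_mechanism(value, l2_sensitivity, epsilon, delta, rng):
    # Add zero-mean Gaussian noise to an aggregate that was already computed.
    return value + rng.gauss(0.0, gaussian_sigma(l2_sensitivity, epsilon, delta))

rng = random.Random(5)
sigma = gaussian_sigma(1.0, epsilon=0.5, delta=1e-5)
released = gaussian_mechanism(200.0, 1.0, epsilon=0.5, delta=1e-5, rng=rng)
```

Unlike the Laplace mechanism, this gives the relaxed (ε, δ) guarantee rather than pure ε-DP, so the two are not interchangeable without adjusting the privacy claim.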