Detect PII Before Sending to an AI API

Sending personally identifiable information (PII) to third-party AI APIs creates legal and compliance risk under regulations such as GDPR, CCPA, and HIPAA. Before forwarding user-generated text to an LLM, production applications should scan for and redact PII: emails, phone numbers, social security numbers, credit card numbers, dates of birth, and, in certain contexts, names. This example shows a support ticket containing several types of PII and demonstrates how the detector identifies and categorizes each one.

The detector combines regex patterns for structured PII (phone numbers, SSNs, credit cards, emails) with named entity recognition for unstructured PII (person names, addresses). Structured PII such as an email address follows a predictable format that regex handles reliably. Unstructured PII such as a person's name requires model-based extraction because almost any combination of words can be a name.

For compliance, detection is only the first step. The recommended pattern is to redact PII before the text leaves your infrastructure, replacing each value with a placeholder token such as [EMAIL_1] or [PHONE_1]. If the LLM response references those placeholders, you can substitute the original values back in before displaying the result to the authorized user.
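The redact-then-restore pattern can be sketched with Python's standard library. This is a minimal illustration under stated assumptions, not the detector's actual implementation: the `PATTERNS` table, the `redact` and `restore` function names, and the specific regexes are hypothetical, and they cover only emails, US-style phone numbers, and full SSNs. Names and addresses would still need an entity-recognition step.

```python
import re

# Illustrative patterns only (assumed for this sketch); real inputs vary more.
PATTERNS = {
    "EMAIL": re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}"),
    "PHONE": re.compile(r"\(?\b\d{3}\)?[-.\s]?\d{3}[-.\s]?\d{4}\b"),
    "SSN": re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
}


def redact(text):
    """Replace each match with a typed placeholder such as [EMAIL_1].

    Returns the redacted text and a placeholder -> original value mapping.
    The mapping holds raw PII, so it should stay inside your infrastructure.
    """
    mapping = {}
    for label, pattern in PATTERNS.items():
        seen = {}  # original value -> placeholder, so repeated values reuse a token

        def substitute(match, label=label, seen=seen):
            value = match.group(0)
            if value not in seen:
                token = f"[{label}_{len(seen) + 1}]"
                seen[value] = token
                mapping[token] = value
            return seen[value]

        text = pattern.sub(substitute, text)
    return text, mapping


def restore(text, mapping):
    """Substitute original values back into a response that echoes placeholders."""
    for token, value in mapping.items():
        text = text.replace(token, value)
    return text


ticket = "Reach me at jane@example.com or (555) 234-7890, or 555-234-7890."
redacted, mapping = redact(ticket)
print(redacted)  # Reach me at [EMAIL_1] or [PHONE_1], or [PHONE_2].
```

Note that the two phone numbers get separate placeholders because the sketch compares raw strings; normalizing values before lookup (for example, stripping punctuation from phone numbers) would let differently formatted copies of the same number share one token.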

Example
Hi, I'm having trouble with my account. My name is John Mitchell and I registered with [email protected]. My phone number is (555) 234-7890 and my account was created on 03/15/1985. My SSN ending in 6742 is on file — please look it up. I also tried paying with my card ending in 4242. Can someone call me back at 555-234-7890?

FAQ

What types of PII does the detector find?
The detector identifies emails, US phone numbers, social security numbers (SSNs), credit card numbers, dates of birth, IP addresses, and person names. The structured patterns target US formats, so non-US national ID formats need additional patterns.
Is regex enough for PII detection?
Regex works well for structured PII like emails, phone numbers, and SSNs that follow predictable patterns. Person names, addresses, and medical terms require NLP-based entity recognition that can handle context-dependent identification.
How should I handle detected PII in my pipeline?
Replace detected PII with typed placeholders before sending the text to the API, then substitute the original values back in after receiving the response. Log that PII was detected, but never log the PII values themselves.
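One way to log that PII was detected without logging the values is to record only per-type counts taken from the placeholders in the redacted text. The sketch below assumes placeholders of the form [TYPE_N]; the function names, the logger name, and the `request_id` parameter are hypothetical and not part of the detector.

```python
import logging
import re
from collections import Counter

logger = logging.getLogger("pii")  # hypothetical logger name

# Matches typed placeholders such as [EMAIL_1] or [PHONE_2].
PLACEHOLDER = re.compile(r"\[([A-Z]+)_\d+\]")


def summarize_placeholders(redacted_text):
    """Count distinct placeholders per PII type in already-redacted text."""
    # Keyed by the full token so a placeholder repeated in the text counts once.
    tokens = {m.group(0): m.group(1) for m in PLACEHOLDER.finditer(redacted_text)}
    return dict(Counter(tokens.values()))


def log_detection(redacted_text, request_id):
    """Log PII types and counts only; raw values never reach the log."""
    counts = summarize_placeholders(redacted_text)
    if counts:
        logger.info("PII redacted request_id=%s counts=%s", request_id, counts)
    return counts
```

Because the function only ever sees redacted text, a raw value cannot leak into the log even if the message format changes later.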
