dataOpenai

CSV Data Processing Prompt (ChatGPT)

CSV processing tasks involve numerous small decisions about null handling, column types, and deduplication logic. This prompt makes those decisions explicit before code generation, which prevents the AI from making silent assumptions. The validation section (row count before/after) is essential for catching bugs where transformations silently drop more rows than expected. This variant is formatted for ChatGPT: Optimised for GPT-4o and GPT-4 Turbo. Uses markdown formatting and system/user message separation.

Prompt Template
## System
You are an expert AI assistant. Respond using clear markdown formatting.

## User
You are a data engineer. Write {{language}} code to process the following CSV data.

Task: {{task}}
Output format: {{output_format}}

CSV sample (first 5 rows):
```
{{csv_sample}}
```

Full column list: {{columns}}
Expected output columns: {{output_columns}}

Requirements:
- Handle missing values by {{missing_value_strategy}}
- {{additional_requirements}}

Provide:
1. Complete, runnable {{language}} code
2. Comments explaining each transformation step
3. A short validation section that prints row counts before and after to confirm no unexpected data loss

Variables

{{language}}Code language: Python (pandas), JavaScript, R, or SQL
{{task}}What to do with the data, e.g., "deduplicate by email, normalise phone numbers, pivot sales by month"
{{output_format}}Output format: CSV, JSON, pandas DataFrame, SQL INSERT statements
{{csv_sample}}The first 5 rows of your CSV including headers
{{columns}}All column names and their data types
{{output_columns}}The columns expected in the output
{{missing_value_strategy}}How to handle nulls, e.g., "filling with 0", "dropping the row", "keeping as NaN"
{{additional_requirements}}Any other constraints, or "None"

Example

Input
language: Python (pandas)
task: Deduplicate by email (keep most recent row), normalise phone numbers to E.164 format
output_format: CSV
csv_sample:
name,email,phone,signup_date
Alice,[email protected],(555) 123-4567,2024-01-15
Bob,[email protected],5559876543,2024-02-20
Alice,[email protected],555-123-4567,2024-03-01
Output
```python
import pandas as pd
import re

df = pd.read_csv('input.csv')
print(f"Rows before: {len(df)}")

# Normalise phone numbers to E.164 (+1XXXXXXXXXX for US)
def normalise_phone(phone):
    digits = re.sub(r'\D', '', str(phone))
    return f"+1{digits}" if len(digits) == 10 else None

df['phone'] = df['phone'].apply(normalise_phone)

# Deduplicate by email, keeping the most recent signup_date
df['signup_date'] = pd.to_datetime(df['signup_date'])
df = df.sort_values('signup_date').drop_duplicates('email', keep='last')

print(f"Rows after: {len(df)}")
df.to_csv('output.csv', index=False)
```

Related Tools

FAQ

Can this process files that are too large to paste?
Paste only the header row and first 5 data rows as the csv_sample, then list all column names in the columns field. The AI generates code based on the structure, which then runs against your full file.
Can I use this to convert CSV to JSON or SQL?
Yes. Set output_format to "JSON array" or "SQL INSERT statements" and specify the target table name in additional_requirements.
What if my CSV has inconsistent encoding or BOM characters?
Add to additional_requirements: "Read the file with encoding='utf-8-sig' to handle BOM characters and use errors='replace' for encoding errors." The AI will add the appropriate read_csv parameters.

Related Prompts