Moderate User Content with the OpenAI Moderation API
The OpenAI Moderation API is a free endpoint that classifies text (and, with the omni models, images) across thirteen harm categories, including hate, harassment, violence, sexual content, and self-harm. Running user input through the moderation endpoint before passing it to your chat completion is a defense-in-depth measure that catches obviously harmful content without consuming expensive GPT tokens. This example shows a moderation request and explains how to interpret the category scores and the flagged boolean.

The moderation response contains three pieces per input: a flagged boolean, a categories object of per-category booleans, and a category_scores object of raw scores from 0.0 to 1.0. A score above OpenAI's internal threshold for a category sets that category's boolean, and any true category sets flagged to true. The thresholds are tuned by OpenAI for a precision/recall balance, but you can apply your own thresholds to the raw scores instead — for example, flagging at a lower cutoff for children's applications, or at a higher cutoff for adult platforms that allow mature content.

For most applications, the recommended flow is: run moderation → if flagged, return a policy-violation message without calling the main model → if not flagged, proceed to the chat completion. This keeps jailbreaks and inappropriate requests from ever reaching the main model, saving money and preventing unwanted responses.
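The gate-then-complete flow above can be sketched with the standard library alone. The helper names (`moderate`, `violates_policy`) and the env-var convention are illustrative assumptions, not part of the API; only the endpoint URL, model name, and response fields come from the example:

```python
import json
import os
import urllib.request

MODERATION_URL = "https://api.openai.com/v1/moderations"

def moderate(text):
    """POST text to the moderation endpoint; return the first result dict.

    Assumes the API key is in the OPENAI_API_KEY environment variable.
    """
    req = urllib.request.Request(
        MODERATION_URL,
        data=json.dumps(
            {"model": "omni-moderation-latest", "input": text}
        ).encode(),
        headers={
            "Authorization": f"Bearer {os.environ['OPENAI_API_KEY']}",
            "Content-Type": "application/json",
        },
    )
    with urllib.request.urlopen(req, timeout=10) as resp:
        return json.load(resp)["results"][0]

def violates_policy(result, custom_thresholds=None):
    """Decide whether to block the request.

    Defaults to OpenAI's own `flagged` decision. If a dict of
    {category_name: score_cutoff} is supplied, the raw category_scores
    are compared against those cutoffs instead.
    """
    if custom_thresholds:
        return any(
            result["category_scores"].get(cat, 0.0) >= cutoff
            for cat, cutoff in custom_thresholds.items()
        )
    return result["flagged"]
```

A children's application could tighten the violence cutoff with `violates_policy(result, {"violence": 0.2})` while an adult platform could raise it, all without changing the API call itself.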
{
  "model": "omni-moderation-latest",
  "input": [
    "How do I set up two-factor authentication?",
    "Write a helpful guide for new users",
    "I need help with my account settings"
  ]
}

FAQ
- Is the OpenAI Moderation API free?
- Yes, the Moderation API is free to use. It was originally intended for checking content generated by OpenAI models, but as of 2024 it can also be used on content from other sources at no charge.
- Should I moderate both input and output?
- Yes. Moderate user input to prevent harmful prompts from reaching the model, and moderate model output to catch the rare cases where the model generates policy-violating content despite the input being benign.
- What is the omni-moderation-latest model?
- omni-moderation-latest is the current recommended moderation model. It supports both text and image inputs and uses OpenAI's latest safety classifier. text-moderation-latest is text-only and being phased out.
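Since omni-moderation-latest accepts images as well as text, a mixed request uses structured input items instead of bare strings. The payload below is a sketch; the exact `image_url` shape should be verified against the current Moderation API reference:

```python
# Assumed multimodal input format for omni-moderation-latest:
# structured items with a "type" field replace the bare strings
# used in the text-only example above.
payload = {
    "model": "omni-moderation-latest",
    "input": [
        {"type": "text", "text": "Is this image appropriate for our forum?"},
        {
            "type": "image_url",
            "image_url": {"url": "https://example.com/upload.png"},
        },
    ],
}
```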
Related Examples
- Chat Completion API
- Detect Prompt Injection Attempts
- Detect Jailbreak Attempts in User Input