The OpenAI API is the fastest path from "I have a cool idea" to a working AI-powered product. Whether you want to build a customer support bot, a code reviewer, an image-analysis tool, or a fully autonomous agent — it all starts with a handful of API calls. This guide takes you from account creation to production-ready patterns, with real Python code you can run today.

Getting Your API Key & Setup

Everything begins at platform.openai.com. Sign up or log in, navigate to API Keys in the left sidebar, and click Create new secret key. Copy it immediately — you won't see it again after closing the dialog.

Next, install the official Python SDK and set up your environment:

bash
# Create a virtual environment (recommended)
python -m venv .venv
source .venv/bin/activate  # Windows: .venv\Scripts\activate
# Install the latest OpenAI SDK (v1+)
pip install openai python-dotenv

Store your key as an environment variable — never paste it directly in code. Create a .env file:

.env
OPENAI_API_KEY=sk-proj-xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx

Add .env to your .gitignore right now. One accidental commit can expose your key to the world and rack up charges you didn't expect.
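
A small guard at startup can turn a missing key into an immediate, readable error instead of a failed request later. The sketch below assumes the key lives in OPENAI_API_KEY as described above; require_api_key and mask_key are hypothetical helper names, not part of the OpenAI SDK.

```python
import os


def require_api_key(var_name: str = "OPENAI_API_KEY") -> str:
    """Return the key from the environment, or fail fast with a clear message.

    Hypothetical helper for illustration; not part of the OpenAI SDK.
    """
    key = os.environ.get(var_name, "").strip()
    if not key:
        raise RuntimeError(
            f"{var_name} is not set. Add it to your .env file or shell environment."
        )
    return key


def mask_key(key: str) -> str:
    """Keep only the first 7 and last 4 characters, for safe logging."""
    if len(key) <= 11:
        return "*" * len(key)
    return f"{key[:7]}...{key[-4:]}"
```

Calling require_api_key() once, right after load_dotenv(), is usually enough. If you ever need to log which key is in use, log mask_key(key) rather than the key itself.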

2025 Model Pricing Cheat Sheet

Choosing the right model is the single biggest cost lever you have. Here's the current pricing landscape (per 1 million tokens):

Model | Input ($/1M tokens) | Output ($/1M tokens) | Best For | Context Window
gpt-4o | $2.50 | $10.00 | Complex reasoning, vision, production quality | 128k tokens
gpt-4o-mini | $0.15 | $0.60 | High-volume tasks, prototyping, classification | 128k tokens
gpt-3.5-turbo | $0.50 | $1.50 | Legacy integrations, simple Q&A | 16k tokens
o4-mini | $1.10 | $4.40 | Math, code, multi-step reasoning tasks | 200k tokens
text-embedding-3-small | $0.02 | n/a (embeddings produce no output tokens) | Semantic search, RAG, similarity | 8k tokens
💡

Start with gpt-4o-mini

At $0.15/1M input tokens, gpt-4o-mini is roughly 17× cheaper than gpt-4o and handles the vast majority of tasks excellently. Switch to gpt-4o only when you hit a real quality ceiling, not before.
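
To make the price gap concrete, here is a minimal cost estimator. It assumes the per-million-token prices from the cheat sheet above, which change over time; the function name estimate_cost and the example workload are made up for illustration.

```python
# Prices in USD per 1M tokens (input, output), copied from the cheat sheet above.
# They change over time, so treat this table as an assumption to re-check.
PRICES_PER_MILLION = {
    "gpt-4o": (2.50, 10.00),
    "gpt-4o-mini": (0.15, 0.60),
    "gpt-3.5-turbo": (0.50, 1.50),
    "o4-mini": (1.10, 4.40),
}


def estimate_cost(model: str, prompt_tokens: int, completion_tokens: int) -> float:
    """Estimate the USD cost of one request from its token counts."""
    input_price, output_price = PRICES_PER_MILLION[model]
    return (prompt_tokens * input_price + completion_tokens * output_price) / 1_000_000


# Example workload: 1,000 requests, each with 800 prompt and 200 completion tokens.
for model in ("gpt-4o", "gpt-4o-mini"):
    total = 1_000 * estimate_cost(model, 800, 200)
    print(f"{model}: ${total:.2f}")
```

For this workload the estimate comes to $4.00 on gpt-4o and $0.24 on gpt-4o-mini. You can feed the function the prompt_tokens and completion_tokens values that every response reports in its usage field.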

Your First API Call

The OpenAI SDK v1+ uses a clean, synchronous client pattern. Every chat request sends an array of messages, each with a role (system, user, or assistant) and content. The system message sets the model's persona and constraints; the user message is what you're asking.

Python
import os

from openai import OpenAI
from dotenv import load_dotenv

load_dotenv()  # loads OPENAI_API_KEY from .env

# The client automatically reads OPENAI_API_KEY from the environment
client = OpenAI()

response = client.chat.completions.create(
    model="gpt-4o-mini",
    messages=[
        {
            "role": "system",
            "content": "You are a concise technical assistant. Answer in plain English with no markdown."
        },
        {
            "role": "user",
            "content": "Explain what a transformer attention mechanism does in two sentences."
        }
    ],
    temperature=0.3,  # lower = more deterministic
    max_tokens=200,   # cap output length to control costs
)

# The model's reply lives here
print(response.choices[0].message.content)

# Check how many tokens were used this request
print(f"Tokens used - prompt: {response.usage.prompt_tokens}, "
      f"completion: {response.usage.completion_tokens}")

Running this prints something like: "Attention lets every token in a sequence directly look at every other token to figure out which ones matter most for understanding the current word. It does this by computing weighted sums of value vectors, where the weights come from similarity scores between query and key vectors." Clean, factual, two sentences.

⚠️

Never Hardcode Your API Key

Writing api_key="sk-proj-..." directly in Python is a critical security mistake. Once that file reaches version control, the key stays in the history: bots scan public GitHub around the clock for leaked keys, and even a private repo exposes the key to every collaborator and integration with read access. Always use environment variables or a secrets manager like AWS Secrets Manager or HashiCorp Vault.

Streaming Responses

By default, the API waits until the full response is generated before returning anything. For a 500-token response at modest speed, that could be 10+ seconds of staring at a blank screen. Streaming fixes this: the API sends tokens to your client as they're generated, just like you see in ChatGPT.

The SDK ships a streaming helper, client.chat.completions.stream(). Use it as a context manager and iterate over the events it yields:

Python
from openai import OpenAI
from dotenv import load_dotenv

load_dotenv()
client = OpenAI()

with client.chat.completions.stream(
    model="gpt-4o-mini",
    messages=[
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "Write a haiku about building software at 2am."}
    ],
    max_tokens=100,
) as stream:
    for event in stream:
        # "content.delta" events carry the newly generated text
        if event.type == "content.delta":
            print(event.delta, end="", flush=True)
print()  # newline after stream ends

The stream() helper is the cleanest approach, but it is only available in recent SDK releases (older 1.x versions exposed it as client.beta.chat.completions.stream()). If your installed version lacks it, pass stream=True to create() and iterate over the raw chunks instead:

Python
# Low-level streaming (compatible with all SDK v1+ versions)
stream = client.chat.completions.create(
    model="gpt-4o-mini",
    messages=[{"role": "user", "content": "List 5 Python tips for beginners."}],
    stream=True,
)

for chunk in stream:
    if not chunk.choices:  # some chunks carry no choices; skip them
        continue
    delta = chunk.choices[0].delta.content
    if delta is not None:
        print(delta, end="", flush=True)
print()

Each chunk is a partial response object. chunk.choices[0].delta.content holds the new token(s) for this chunk — it can be None on the final chunk, so always guard against that. In a web app, you'd pipe these chunks straight into a Server-Sent Events (SSE) response for a ChatGPT-like effect in the browser.
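
As a rough sketch of the SSE step, the generator below wraps text deltas in Server-Sent Events frames. It is framework-agnostic and runs offline: the list of deltas stands in for the chunk.choices[0].delta.content values from the loop above, and the "[DONE]" sentinel and the {"text": ...} payload shape are assumptions you would adapt to your front end.

```python
import json
from typing import Iterable, Iterator, Optional


def sse_frames(deltas: Iterable[Optional[str]]) -> Iterator[str]:
    """Turn streamed text deltas into Server-Sent Events frames.

    Each frame is a "data: <payload>" line followed by a blank line. The
    payload is JSON-encoded so newlines inside a delta cannot break the framing.
    """
    for delta in deltas:
        if not delta:  # skip None and empty deltas
            continue
        payload = json.dumps({"text": delta})
        yield f"data: {payload}\n\n"
    # Assumed end-of-stream sentinel; use whatever your front end expects.
    yield "data: [DONE]\n\n"


# Stand-in for the delta values produced by the streaming loop.
for frame in sse_frames(["Hello", None, ", world"]):
    print(frame, end="")
```

In a web framework you would return this generator as the body of a response with the text/event-stream content type.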

Vision: Analyzing Images with GPT-4o

GPT-4o is natively multimodal — it can look at images and answer questions about them. You pass images either as a public URL or as a base64-encoded string, both inline in the messages array.

Sending an Image URL

Python
response = client.chat.completions.create(
    model="gpt-4o",  # gpt-4o-mini also supports vision
    messages=[
        {
            "role": "user",
            "content": [
                {
                    "type": "image_url",
                    "image_url": {
                        "url": "https://upload.wikimedia.org/wikipedia/commons/thumb/4/47/PNG_transparency_demonstration_1.png/280px-PNG_transparency_demonstration_1.png",
                        "detail": "low"  # "low" or "high"; low saves tokens
                    }
                },
                {
                    "type": "text",
                    "text": "Describe what you see in this image in one sentence."
                }
            ]
        }
    ],
    max_tokens=150,
)

print(response.choices[0].message.content)

Sending a Local Image as Base64

Python
import base64

def encode_image(image_path: str) -> str:
    """Read a local image file and return it as a base64 string."""
    with open(image_path, "rb") as f:
        return base64.b64encode(f.read()).decode("utf-8")

image_data = encode_image("screenshot.png")

response = client.chat.completions.create(
    model="gpt-4o",
    messages=[
        {
            "role": "user",
            "content": [
                {
                    "type": "image_url",
                    "image_url": {
                        "url": f"data:image/png;base64,{image_data}"
                    }
                },
                {
                    "type": "text",
                    "text": "Identify any errors or warnings visible in this screenshot."
                }
            ]
        }
    ],
    max_tokens=300,
)

print(response.choices[0].message.content)
ℹ️

Supported Image Formats & Limits

GPT-4o Vision accepts PNG, JPEG, WEBP, and GIF (non-animated). Maximum image size is 20 MB per image. You can include up to 10 images in a single request. The detail: "low" setting costs a flat 85 tokens per image — great for simple descriptions. Use detail: "high" when you need fine-grained analysis of charts, screenshots, or diagrams, which tiles the image and costs proportionally more.
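
To get a feel for what "proportionally more" means, here is a rough token estimator for image inputs. The constants (a 2048 px bounding square, a 768 px shortest side, 512 px tiles, 170 tokens per tile, 85 base tokens) reflect OpenAI's published description for gpt-4o as I understand it. Treat them as assumptions and check the current docs, since other models may be billed differently. The function name is hypothetical.

```python
import math


def estimate_image_tokens(width: int, height: int, detail: str = "high") -> int:
    """Estimate image input tokens for gpt-4o (assumed tiling rules, see above)."""
    if detail == "low":
        return 85  # flat cost, regardless of image size

    # 1. Shrink to fit inside a 2048 x 2048 square, keeping the aspect ratio.
    scale = min(1.0, 2048 / max(width, height))
    w, h = width * scale, height * scale

    # 2. Shrink again so the shortest side is at most 768 px.
    scale = min(1.0, 768 / min(w, h))
    w, h = w * scale, h * scale

    # 3. Count 512 px tiles: 170 tokens per tile plus an 85-token base.
    tiles = math.ceil(w / 512) * math.ceil(h / 512)
    return 170 * tiles + 85


print(estimate_image_tokens(1024, 1024, detail="low"))
print(estimate_image_tokens(1024, 1024, detail="high"))
print(estimate_image_tokens(2048, 4096, detail="high"))
```

Under these assumptions a 1024 x 1024 image costs 85 tokens in low detail and 765 in high detail, and a 2048 x 4096 image costs 1105 tokens in high detail.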

Function Calling (Tool Use)

Function calling is the mechanism behind AI agents. Instead of just returning text, the model can decide to call a function you've defined — returning structured JSON arguments you can pass to real code. This is how you connect an LLM to databases, APIs, calculators, or any external service.

The flow works like this: you describe available tools in JSON Schema → the model decides which tool to call and with what arguments → you execute the function → you feed the result back to the model → the model produces a final response.

Step 1 — Define the Tool Schema

Python
import json

from openai import OpenAI
from dotenv import load_dotenv

load_dotenv()
client = OpenAI()

# Define the tools the model can call
tools = [
    {
        "type": "function",
        "function": {
            "name": "get_weather",
            "description": "Get the current weather for a given city. Call this whenever the user asks about weather.",
            "parameters": {
                "type": "object",
                "properties": {
                    "city": {
                        "type": "string",
                        "description": "The city name, e.g. 'London' or 'Tokyo'"
                    },
                    "unit": {
                        "type": "string",
                        "enum": ["celsius", "fahrenheit"],
                        "description": "Temperature unit"
                    }
                },
                "required": ["city"]
            }
        }
    }
]

Step 2 — Let the Model Decide to Call a Tool

Python
messages = [
    {"role": "user", "content": "What's the weather like in Karachi right now?"}
]

response = client.chat.completions.create(
    model="gpt-4o-mini",
    messages=messages,
    tools=tools,
    tool_choice="auto",  # let the model decide when to use tools
)

assistant_msg = response.choices[0].message

# Check if the model wants to call a function
if assistant_msg.tool_calls:
    tool_call = assistant_msg.tool_calls[0]
    func_name = tool_call.function.name
    func_args = json.loads(tool_call.function.arguments)
    print(f"Model wants to call: {func_name}({func_args})")
    # → Model wants to call: get_weather({'city': 'Karachi', 'unit': 'celsius'})

Step 3 — Execute the Function & Return the Result

Python
# Your real implementation would call a weather API here
def get_weather(city: str, unit: str = "celsius") -> dict:
    # Stub: replace with requests.get("https://api.openweathermap.org/...")
    return {"city": city, "temperature": 34, "unit": unit, "condition": "Sunny and humid"}

# Call our function with the model-provided arguments
result = get_weather(**func_args)

# Feed the function result back to the model for the final response
messages.append(assistant_msg)  # append the tool_call assistant message
messages.append({
    "role": "tool",
    "tool_call_id": tool_call.id,
    "content": json.dumps(result)
})

final_response = client.chat.completions.create(
    model="gpt-4o-mini",
    messages=messages,
)

print(final_response.choices[0].message.content)
# → "Right now in Karachi it's 34°C and sunny with high humidity. Quite hot!"
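
The three steps above handle only the first tool call and assume the arguments are always valid. A slightly more defensive dispatch might look like the sketch below. TOOL_REGISTRY and run_tool_call are hypothetical names, and the get_weather stub mirrors the one above; the block runs offline, with plain strings standing in for the SDK's tool_call fields.

```python
import json
from typing import Any, Callable, Dict


def get_weather(city: str, unit: str = "celsius") -> dict:
    """Stub with the same shape as the example above."""
    return {"city": city, "temperature": 34, "unit": unit, "condition": "Sunny and humid"}


# Hypothetical registry mapping tool names to Python callables.
TOOL_REGISTRY: Dict[str, Callable[..., Any]] = {"get_weather": get_weather}


def run_tool_call(call_id: str, name: str, arguments_json: str) -> dict:
    """Execute one tool call and return the "tool" message to append.

    Problems are reported back to the model as JSON instead of raising, so the
    model can recover or retry with different arguments.
    """
    func = TOOL_REGISTRY.get(name)
    if func is None:
        result = {"error": f"unknown tool: {name}"}
    else:
        try:
            result = func(**json.loads(arguments_json))
        except (json.JSONDecodeError, TypeError) as exc:
            result = {"error": f"bad arguments: {exc}"}
    return {"role": "tool", "tool_call_id": call_id, "content": json.dumps(result)}


print(run_tool_call("call_1", "get_weather", '{"city": "Karachi"}'))
```

With the real SDK you would loop over every entry in assistant_msg.tool_calls, pass its id, function.name, and function.arguments to run_tool_call, and append each returned message before asking the model for its final answer.
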
🤖

This Is How AI Agents Work

Agents are just this loop — repeated. The model calls a tool, gets a result, decides if it needs more information, calls another tool, and so on until it has enough to answer. Frameworks like LangChain, LlamaIndex, and OpenAI's own Assistants API automate this loop for you, but the underlying primitive is always the same function-calling mechanism shown above.
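
The loop itself fits in a few lines. The sketch below is a simplified, offline illustration rather than any framework's implementation: call_model is a hypothetical adapter around the API, the message dicts use a reduced shape instead of the exact wire format, and fake_model is a scripted stand-in so the example runs without an API key.

```python
import json
from typing import Callable, Dict, List


def run_agent(
    call_model: Callable[[List[dict]], dict],
    tools: Dict[str, Callable[..., dict]],
    messages: List[dict],
    max_steps: int = 5,
) -> str:
    """Repeat model -> tool -> model until the model answers in plain text.

    call_model returns a dict with either "content" (the final answer) or
    "tool_calls" (a list of {"id", "name", "arguments"} dicts).
    """
    for _ in range(max_steps):
        reply = call_model(messages)
        if not reply.get("tool_calls"):
            return reply["content"]
        messages.append({"role": "assistant", "tool_calls": reply["tool_calls"]})
        for call in reply["tool_calls"]:
            result = tools[call["name"]](**json.loads(call["arguments"]))
            messages.append(
                {"role": "tool", "tool_call_id": call["id"], "content": json.dumps(result)}
            )
    raise RuntimeError("agent did not finish within max_steps")


def fake_model(messages: List[dict]) -> dict:
    """Scripted stand-in for the API: request the weather once, then answer."""
    if messages[-1]["role"] == "tool":
        data = json.loads(messages[-1]["content"])
        return {"content": f"It is {data['temperature']} degrees in {data['city']}."}
    call = {"id": "call_1", "name": "get_weather", "arguments": '{"city": "Karachi"}'}
    return {"tool_calls": [call]}


answer = run_agent(
    fake_model,
    {"get_weather": lambda city: {"city": city, "temperature": 34}},
    [{"role": "user", "content": "Weather in Karachi?"}],
)
print(answer)  # It is 34 degrees in Karachi.
```

The max_steps cap is a design choice worth keeping in any real agent: without it, a model that keeps requesting tools would loop, and bill, indefinitely.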

Generating Embeddings

Embeddings convert text into dense numerical vectors that capture semantic meaning. Similar texts end up close together in vector space — which makes them the foundation of semantic search, RAG pipelines, recommendation systems, and anomaly detection. OpenAI's text-embedding-3-small model gives you 1536-dimensional vectors at an extremely low cost ($0.02/1M tokens).

Python
import numpy as np
from openai import OpenAI
from dotenv import load_dotenv

load_dotenv()
client = OpenAI()

def get_embedding(text: str) -> list[float]:
    response = client.embeddings.create(
        model="text-embedding-3-small",
        input=text,
        encoding_format="float"
    )
    return response.data[0].embedding

def cosine_similarity(a: list, b: list) -> float:
    a, b = np.array(a), np.array(b)
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Embed three sentences and find the most similar pair
sentences = [
    "How do I reset my password?",
    "I forgot my login credentials and can't access my account.",
    "What are the opening hours of your store?"
]

embeddings = [get_embedding(s) for s in sentences]

sim_01 = cosine_similarity(embeddings[0], embeddings[1])
sim_02 = cosine_similarity(embeddings[0], embeddings[2])

print(f"'Reset password' ↔ 'Forgot credentials': {sim_01:.3f}")   # → ~0.89
print(f"'Reset password' ↔ 'Store opening hours': {sim_02:.3f}")  # → ~0.31

The high similarity (~0.89) between the first two sentences means they'd be retrieved together in a semantic search, even though they share no keywords. This is the core insight behind RAG: instead of keyword matching, you match meaning. Store your embeddings in a vector database (Pinecone, Qdrant, pgvector) for production-scale retrieval.
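
Before reaching for a vector database, it helps to see that retrieval is just "score every stored vector against the query and keep the best k". The sketch below does that in pure Python. The 3-dimensional vectors are toy values standing in for real 1536-dimensional embeddings, and top_k is a hypothetical helper name.

```python
import math
from typing import List, Sequence, Tuple


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm


def top_k(
    query: Sequence[float],
    corpus: List[Tuple[str, Sequence[float]]],
    k: int = 2,
) -> List[Tuple[float, str]]:
    """Return the k (score, text) pairs most similar to the query vector."""
    scored = [(cosine(query, vec), text) for text, vec in corpus]
    return sorted(scored, reverse=True)[:k]


# Toy vectors; in practice these would come from the embeddings endpoint.
corpus = [
    ("I forgot my login credentials.", [0.9, 0.1, 0.0]),
    ("What are your opening hours?", [0.0, 0.2, 0.9]),
    ("How can I change my email address?", [0.6, 0.6, 0.1]),
]
query_vec = [1.0, 0.0, 0.0]  # pretend embedding of "How do I reset my password?"

for score, text in top_k(query_vec, corpus):
    print(f"{score:.3f}  {text}")
```

A vector database performs the same ranking, but with an index that avoids scanning every stored vector on each query.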

Cost & Best Practices

The difference between a $50/month AI app and a $5,000/month one often comes down to a few engineering decisions made early. These are the levers that matter most:

Strategy | Typical Savings | Implementation Effort | Notes
Use gpt-4o-mini instead of gpt-4o | Up to 94% | Low (change model string) | Quality gap is smaller than you think for most tasks
Set a max_tokens limit | 10–40% | Low (one parameter) | Prevents runaway outputs; tune per use case
Cache repeated prompts | 30–80% on cache-hit requests | Medium (add Redis/disk cache) | OpenAI auto-discounts prompts >1024 tokens that are reused
Trim system prompts | 5–20% | Low (review prompt length) | Every token in every request adds up at scale
Batch embedding requests | Latency savings | Low (pass a list) | Pass up to 2048 texts in one API call instead of N calls
Use temperature=0 for deterministic tasks | Indirect (fewer retries) | Low (one parameter) | Reduces hallucinations in classification and extraction tasks
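
The "cache repeated prompts" row refers to an application-level cache, which is separate from OpenAI's automatic prompt caching. A minimal in-memory version might look like this. ResponseCache is a hypothetical class, the dict would be Redis or a disk store in production, and fake_api stands in for the real API call so the example runs offline.

```python
import hashlib
import json
from typing import Callable, Dict, List


class ResponseCache:
    """In-memory response cache keyed by a hash of (model, messages).

    Only sensible for requests where a repeated answer is acceptable,
    for example classification at temperature=0.
    """

    def __init__(self, call_api: Callable[[str, List[dict]], str]) -> None:
        self._call_api = call_api
        self._store: Dict[str, str] = {}
        self.hits = 0

    def _key(self, model: str, messages: List[dict]) -> str:
        payload = json.dumps({"model": model, "messages": messages}, sort_keys=True)
        return hashlib.sha256(payload.encode("utf-8")).hexdigest()

    def complete(self, model: str, messages: List[dict]) -> str:
        key = self._key(model, messages)
        if key in self._store:
            self.hits += 1
            return self._store[key]
        answer = self._call_api(model, messages)
        self._store[key] = answer
        return answer


calls = []


def fake_api(model: str, messages: List[dict]) -> str:
    """Stand-in for client.chat.completions.create(...)."""
    calls.append(model)
    return f"echo: {messages[-1]['content']}"


cache = ResponseCache(fake_api)
question = [{"role": "user", "content": "What is 2+2?"}]
cache.complete("gpt-4o-mini", question)
cache.complete("gpt-4o-mini", question)  # served from the cache
print(f"API calls: {len(calls)}, cache hits: {cache.hits}")
```

Hashing the full message list means any change to the system prompt or the question produces a new key, so stale answers are not served for edited prompts.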

Four Essential Best Practices

💾
Prompt Caching
OpenAI automatically caches the common prefix of prompts longer than 1,024 tokens, giving a 50% discount on cached input tokens. Structure your system prompt first and keep it stable across requests to maximize cache hits.
🎯
Always Set max_tokens
Without a limit, the model can generate thousands of tokens for a simple question. Set max_tokens appropriate to your use case: 100 for summaries, 500 for explanations, 2000 for code generation. The cap does not stop prompt injection, but it bounds the bill when a prompt, malicious or not, tries to elicit a huge output. Reasoning models such as o4-mini take max_completion_tokens instead of max_tokens.
⚖️
Right-Size Your Model
Use a routing layer: classify the incoming request first (cheap), then route simple queries to gpt-4o-mini and complex reasoning or vision tasks to gpt-4o or o4-mini. Many companies achieve GPT-4-quality results at 80% lower cost this way.
🔄
Handle Rate Limits Gracefully
The OpenAI API enforces both RPM (requests per minute) and TPM (tokens per minute) limits. Use exponential backoff with jitter on RateLimitError and APITimeoutError. The tenacity library makes this trivial to implement and is worth adding on day one.
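
The routing layer described under "Right-Size Your Model" can start as a plain function. The sketch below uses local heuristics only; the keyword list, the 4,000-character threshold, and the name pick_model are illustrative assumptions, and a real router might use a small classifier model instead.

```python
def pick_model(prompt: str, has_image: bool = False) -> str:
    """Route a request to a model tier using cheap, local heuristics."""
    reasoning_hints = ("prove", "step by step", "debug", "optimize", "derive")
    text = prompt.lower()
    if has_image:
        return "gpt-4o"        # vision task
    if any(hint in text for hint in reasoning_hints):
        return "o4-mini"       # multi-step reasoning
    if len(prompt) > 4000:
        return "gpt-4o"        # long, likely complex input
    return "gpt-4o-mini"       # default: cheapest tier


for prompt in ("What are your opening hours?", "Debug this stack trace for me"):
    print(f"{pick_model(prompt)}: {prompt}")
```

Logging which tier each request was routed to, alongside user feedback, gives you the data to tune these rules later.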

Production-Ready Error Handling

Python
import random
import time

from openai import OpenAI, RateLimitError, APITimeoutError, APIConnectionError

# Disable the SDK's built-in retries so this loop is the only retry layer
client = OpenAI(max_retries=0)

def chat_with_retry(messages: list, max_retries: int = 5) -> str:
    """Call the API with exponential backoff and jitter on transient errors."""
    for attempt in range(max_retries):
        try:
            response = client.chat.completions.create(
                model="gpt-4o-mini",
                messages=messages,
                max_tokens=500,
                timeout=30.0,
            )
            return response.choices[0].message.content
        except (RateLimitError, APITimeoutError, APIConnectionError) as e:
            if attempt == max_retries - 1:
                raise  # out of retries: surface the last error
            # 1, 2, 4, 8 seconds, plus up to 1 second of random jitter
            wait = 2 ** attempt + random.uniform(0, 1)
            print(f"{type(e).__name__}. Retrying in {wait:.1f}s… (attempt {attempt + 1}/{max_retries})")
            time.sleep(wait)
    raise RuntimeError("Max retries exceeded")

# Usage
answer = chat_with_retry([{"role": "user", "content": "What is 2+2?"}])
print(answer)
🚀

Where to Go Next

Once you're comfortable with the basics, explore the Assistants API for built-in thread management and file retrieval, Structured Outputs (a response_format of type "json_schema" together with your schema) to get JSON that conforms to that schema, and the Realtime API for low-latency voice applications. Each of these unlocks an entirely new class of products.
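
As a starting point for Structured Outputs, the sketch below shows the shape of a response_format value as I understand the Chat Completions API, plus a small local check of a reply. The "ticket" schema and the parse_ticket helper are made-up examples, and the JSON string stands in for response.choices[0].message.content so the block runs offline.

```python
import json

# Assumed shape of the response_format argument; the schema is a made-up example.
ticket_format = {
    "type": "json_schema",
    "json_schema": {
        "name": "ticket",
        "strict": True,
        "schema": {
            "type": "object",
            "properties": {
                "category": {"type": "string", "enum": ["billing", "bug", "other"]},
                "urgent": {"type": "boolean"},
            },
            "required": ["category", "urgent"],
            "additionalProperties": False,
        },
    },
}


def parse_ticket(raw: str) -> dict:
    """Parse a model reply and check it against the fields declared above."""
    data = json.loads(raw)
    schema = ticket_format["json_schema"]["schema"]
    missing = [key for key in schema["required"] if key not in data]
    if missing:
        raise ValueError(f"missing fields: {missing}")
    if data["category"] not in schema["properties"]["category"]["enum"]:
        raise ValueError(f"unexpected category: {data['category']}")
    return data


# Stand-in for response.choices[0].message.content
print(parse_ticket('{"category": "billing", "urgent": true}'))
```

You would pass ticket_format as the response_format argument of client.chat.completions.create(...). Keeping a local check like parse_ticket is still worthwhile for replies that were cut off by a token limit.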