Why Claude for Production Apps
When choosing an LLM for production, I evaluated GPT-4, Gemini, and Claude. For the use cases I build - document processing, academic assistance, and structured data extraction - Claude's instruction-following and long-context performance won out.
Basic Integration with Django
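import os

import anthropic

# Read the API key from the environment rather than hard-coding it
client = anthropic.Anthropic(api_key=os.getenv("ANTHROPIC_API_KEY"))


def ask_claude(system_prompt: str, user_message: str) -> str:
    """Send a single-turn request and return the text of the reply."""
    message = client.messages.create(
        model="claude-sonnet-4-6",
        max_tokens=1024,
        system=system_prompt,
        messages=[{"role": "user", "content": user_message}],
    )
    # Assumes the first content block is a text block
    return message.content[0].text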
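Indexing `content[0]` works for plain text replies, but it assumes the first content block is a text block. A slightly more defensive sketch follows, assuming each block exposes `type` and `text` attributes the way the SDK's text blocks do. `extract_text` is a hypothetical helper name, not part of the SDK, and the stand-in object only mimics the response shape so the helper can be tried without an API call.

```python
from types import SimpleNamespace


def extract_text(message) -> str:
    """Join the text of every text block in a Messages API response."""
    parts = [
        block.text
        for block in message.content
        if getattr(block, "type", None) == "text"
    ]
    return "".join(parts)


# Stand-in shaped like a response (hypothetical data, no network call)
fake_message = SimpleNamespace(
    content=[
        SimpleNamespace(type="text", text="Hello, "),
        SimpleNamespace(type="text", text="world."),
    ]
)
print(extract_text(fake_message))
```

In `ask_claude`, the last line could then become `return extract_text(message)`.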
Prompt Caching - The Feature That Cuts Costs by 80%
If your system prompt is long (instructions, context, documents), prompt caching is essential. Cached content has a 5-minute lifetime by default, and the timer resets each time the cache is read.
message = client.messages.create(
    model="claude-sonnet-4-6",
    max_tokens=1024,
    system=[
        {
            "type": "text",
            "text": LONG_SYSTEM_PROMPT,  # your 2000-token instructions
            "cache_control": {"type": "ephemeral"},  # cache this!
        }
    ],
    messages=[{"role": "user", "content": user_question}],
)
Cache hits cost 90% less than regular input tokens. On one app processing ~500 questions/day with a 1,500-token system prompt, this saved roughly $40/month.
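To confirm that caching is actually kicking in, it helps to look at the token counts reported on each response. The sketch below assumes the response's `usage` object carries `input_tokens`, `cache_creation_input_tokens`, and `cache_read_input_tokens`. `cache_hit_ratio` is a hypothetical helper, and the sample numbers are made up for illustration.

```python
from types import SimpleNamespace


def cache_hit_ratio(usage) -> float:
    """Fraction of this request's input tokens that were read from the cache."""
    read = getattr(usage, "cache_read_input_tokens", 0) or 0
    written = getattr(usage, "cache_creation_input_tokens", 0) or 0
    uncached = getattr(usage, "input_tokens", 0) or 0
    total = read + written + uncached
    return read / total if total else 0.0


# Made-up usage figures: a cached system prompt plus a short, uncached question
sample_usage = SimpleNamespace(
    input_tokens=20,
    cache_creation_input_tokens=0,
    cache_read_input_tokens=1480,
)
print(f"{cache_hit_ratio(sample_usage):.1%} of input tokens came from cache")
```

With a real response you would pass `message.usage`. A ratio that stays at zero across repeated requests suggests the cached prefix is changing between calls or is too short to be cached.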
Streaming for Better UX
For chat interfaces, stream the response so users see text as it's generated:
# Django view with streaming response
import json

from django.http import StreamingHttpResponse


def chat_stream(request):
    user_msg = request.GET.get("message", "")

    def generate():
        with client.messages.stream(
            model="claude-sonnet-4-6",
            max_tokens=1024,
            messages=[{"role": "user", "content": user_msg}],
        ) as stream:
            for text in stream.text_stream:
                # Each SSE event is a "data:" line terminated by a blank line
                yield f"data: {json.dumps({'text': text})}\n\n"

    return StreamingHttpResponse(generate(), content_type="text/event-stream")
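The view above emits Server-Sent Events, where each event is a `data:` line followed by a blank line. The framing is easy to get wrong, so here is a small sketch of it in isolation, together with a parser that is handy in tests. `sse_event` and `parse_sse` are hypothetical helper names, and the parser only handles the single-line `data:` frames produced here, not the full SSE format.

```python
import json


def sse_event(payload: dict) -> str:
    """Format one payload as a Server-Sent Events frame."""
    # json.dumps escapes newlines, so the payload always fits on one line
    return f"data: {json.dumps(payload)}\n\n"


def parse_sse(raw: str) -> list[dict]:
    """Decode a buffer of single-line `data:` frames back into payloads."""
    events = []
    for frame in raw.split("\n\n"):
        if frame.startswith("data: "):
            events.append(json.loads(frame[len("data: "):]))
    return events


# Round trip: two text chunks framed as events, then decoded again
raw_stream = sse_event({"text": "Hel"}) + sse_event({"text": "lo\n"})
chunks = [event["text"] for event in parse_sse(raw_stream)]
print("".join(chunks))
```

Serialising each chunk as JSON is what keeps the stream valid when the model's output itself contains newlines.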
Real Cost Numbers
For a student Q&A app averaging 200 sessions/day:
- Without caching: ~$45/month
- With prompt caching: ~$9/month
- With caching + response length limits: ~$5/month
Plan your token budgets before launch.
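A rough way to do that planning is to estimate the monthly input cost of the repeated system prompt with and without caching. The sketch below is a best-case estimate under stated assumptions: the price per million tokens is a placeholder (check current pricing), cache reads are billed at 10% of the input price as described above, and cache writes, cache misses, and output tokens are ignored. The workload figures are hypothetical, not the ones from my apps.

```python
def monthly_input_cost(
    requests_per_day: int,
    prompt_tokens: int,
    price_per_mtok: float,
    cached: bool = False,
    cache_read_multiplier: float = 0.1,
    days: int = 30,
) -> float:
    """Estimate the monthly cost of resending the same prompt prefix."""
    tokens = requests_per_day * prompt_tokens * days
    rate = price_per_mtok * (cache_read_multiplier if cached else 1.0)
    return tokens / 1_000_000 * rate


# Hypothetical workload and placeholder price, for illustration only
PRICE_PER_MTOK = 2.00  # assumption, not a quoted price
uncached = monthly_input_cost(1000, 2000, PRICE_PER_MTOK)
with_cache = monthly_input_cost(1000, 2000, PRICE_PER_MTOK, cached=True)
print(f"uncached: ${uncached:.2f}/month, cached: ${with_cache:.2f}/month")
```

Real bills will land somewhere between the two figures, since the first request after the cache expires pays for a cache write instead of a read.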