# LlamaIndex Streaming Server (FastAPI)
Latest: 0.14.20 | Updated: April 2026 | Last verified: 2025-11
```python
from fastapi import FastAPI
from fastapi.responses import StreamingResponse
from llama_index.core import VectorStoreIndex, Document

app = FastAPI()

# Build a small in-memory index; streaming=True makes query() return a
# streaming response whose response_gen yields tokens as they are produced.
docs = [Document(text="LangGraph is a graph framework for agents.")]
index = VectorStoreIndex.from_documents(docs)
query_engine = index.as_query_engine(streaming=True)

@app.get("/stream")
def stream(q: str):
    def gen():
        resp = query_engine.query(q)
        for token in resp.response_gen:
            # Server-Sent Events: each event is a "data: ..." line
            # terminated by a blank line.
            yield f"data: {token}\n\n"
    return StreamingResponse(gen(), media_type="text/event-stream")
```
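On the client side, each event arrives as a `data: <token>` line followed by a blank line. A minimal parser sketch for reassembling the streamed answer (the `iter_sse_tokens` helper is our name for illustration, not part of LlamaIndex or FastAPI):

```python
def iter_sse_tokens(lines):
    """Yield token payloads from a stream of SSE lines.

    This server emits one `data: <token>` line per event, followed by
    a blank separator line, so we only need to strip the prefix.
    """
    for line in lines:
        if line.startswith("data: "):
            yield line[len("data: "):]

# Example: reassemble a streamed answer from raw SSE lines.
events = ["data: Lang", "", "data: Graph", "", "data: !", ""]
answer = "".join(iter_sse_tokens(events))  # "LangGraph!"
```

In practice the lines would come from an HTTP client that supports streamed reads (e.g. iterating over response lines) rather than a hard-coded list.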
## Deployment

### Dockerfile
```dockerfile
FROM python:3.11-slim
WORKDIR /app
COPY requirements.txt .
RUN pip install -r requirements.txt
COPY . .
EXPOSE 8080
CMD ["uvicorn", "app:app", "--host", "0.0.0.0", "--port", "8080"]
```
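The Dockerfile copies a `requirements.txt` that is not shown in this guide. A plausible minimal one for the app above (unpinned here for brevity; pin the versions you have actually verified):

```text
fastapi
uvicorn[standard]
llama-index
```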
### Kubernetes

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: llamaindex-stream
spec:
  replicas: 2
  selector:
    matchLabels:
      app: llamaindex-stream
  template:
    metadata:
      labels:
        app: llamaindex-stream
    spec:
      containers:
        - name: app
          image: ghcr.io/yourorg/llamaindex-stream:latest
          ports:
            - containerPort: 8080
---
apiVersion: v1
kind: Service
metadata:
  name: llamaindex-stream
spec:
  selector:
    app: llamaindex-stream
  ports:
    - port: 80
      targetPort: 8080
```
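Streaming endpoints hold connections open much longer than typical request/response traffic, so an ingress or reverse proxy in front of this Service may need a raised read timeout and response buffering disabled. You may also want health probes on the container; a hedged sketch to place under the container entry (the `/healthz` path assumes you add such a route, which the app as written does not have):

```yaml
          readinessProbe:
            httpGet:
              path: /healthz
              port: 8080
            initialDelaySeconds: 5
            periodSeconds: 10
          livenessProbe:
            httpGet:
              path: /healthz
              port: 8080
            periodSeconds: 30
```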
## Security Best Practices

- Rate limit queries and set maximum lengths
- Sanitize logs and avoid indexing sensitive prompts
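The length cap in the first point can be enforced before a query ever reaches the index. A minimal sketch (the `validate_query` helper and the 2000-character limit are our assumptions for illustration, not LlamaIndex or FastAPI API):

```python
MAX_QUERY_CHARS = 2000  # assumption: tune this to your workload

def validate_query(q: str, max_chars: int = MAX_QUERY_CHARS) -> str:
    """Reject empty or oversized queries before they hit the query engine."""
    q = q.strip()
    if not q:
        raise ValueError("empty query")
    if len(q) > max_chars:
        raise ValueError(f"query exceeds {max_chars} characters")
    return q

# In the endpoint, call q = validate_query(q) before query_engine.query(q)
# and map ValueError to an HTTP 422 via fastapi.HTTPException.
```

Rate limiting itself is better handled at the gateway or with middleware; the validator above only bounds per-request cost.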