Ollama Integration¶
This page details how the RAG pipeline will integrate with Ollama to provide context-aware responses.
Integration Architecture¶
The RAG pipeline will interact with Ollama through its API:
sequenceDiagram
    participant User
    participant OpenWebUI as Open WebUI
    participant RAGMiddleware as RAG Middleware
    participant VectorDB as Vector Database
    participant Ollama

    User->>OpenWebUI: Ask question
    OpenWebUI->>RAGMiddleware: Forward query
    RAGMiddleware->>VectorDB: Retrieve context
    VectorDB-->>RAGMiddleware: Return relevant docs
    RAGMiddleware->>Ollama: Send query + context
    Ollama-->>RAGMiddleware: Return response
    RAGMiddleware-->>OpenWebUI: Return enhanced response
    OpenWebUI-->>User: Display response
Ollama API Interaction¶
Model Generation Endpoint¶
The RAG pipeline will use Ollama's generation API:
POST /api/generate HTTP/1.1
Host: localhost:11434
Content-Type: application/json
{
  "model": "mistral",
  "prompt": "System: You are a helpful assistant. Use the following context to answer the question.\n\nContext: {retrieved_context}\n\nQuestion: {user_question}",
  "stream": false,
  "options": {
    "temperature": 0.7,
    "top_p": 0.9,
    "top_k": 40,
    "num_ctx": 4096
  }
}
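The same request can be issued from the RAG middleware in code. The sketch below is illustrative only: it uses the requests library, assumes Ollama is listening on its default port (11434), and the helper name is not part of the current codebase.

# Illustrative sketch (not part of the current implementation)
import requests

OLLAMA_URL = "http://localhost:11434"

def generate_with_context(question, context, model="mistral"):
    """Send a context-augmented prompt to Ollama's /api/generate endpoint."""
    prompt = (
        "System: You are a helpful assistant. Use the following context "
        "to answer the question.\n\n"
        f"Context: {context}\n\nQuestion: {question}"
    )
    payload = {
        "model": model,
        "prompt": prompt,
        "stream": False,
        "options": {"temperature": 0.7, "top_p": 0.9, "top_k": 40, "num_ctx": 4096},
    }
    response = requests.post(f"{OLLAMA_URL}/api/generate", json=payload, timeout=120)
    response.raise_for_status()
    # With "stream": false, the body is a single JSON object whose
    # "response" field holds the generated text.
    return response.json()["response"]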
Context Formatting¶
The context will be formatted before being sent to Ollama:
# Future implementation example
def format_context_for_ollama(retrieved_chunks):
    """Format retrieved chunks into a context string for Ollama."""
    formatted_chunks = []
    for i, chunk in enumerate(retrieved_chunks):
        formatted_chunks.append(
            f"[Document {i+1}] {chunk['metadata']['source']}\n"
            f"---\n"
            f"{chunk['text']}\n"
        )
    return "\n\n".join(formatted_chunks)
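Numbering each chunk and separating the source line from the chunk text with a --- delimiter keeps source attribution unambiguous, which the citation extraction described under Response Processing below can rely on.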
Custom Modelfiles¶
Obelisk will provide custom Modelfiles optimized for RAG:
FROM mistral:latest

# Optimize for RAG
PARAMETER num_ctx 8192
PARAMETER temperature 0.7
PARAMETER top_p 0.9
PARAMETER top_k 40

# System instruction for RAG
SYSTEM """
You are a helpful documentation assistant for the Obelisk documentation system.
When given context from the documentation, use this information to answer the user's question.
Always attribute your sources and only provide information contained in the given context.
If the information can't be found in the context, acknowledge this and suggest where the user might find more information.
"""
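Once saved (for example as Modelfile.rag), the customized model can be built locally with ollama create obelisk-rag -f Modelfile.rag and then referenced by that name in generation requests; the obelisk-rag model name and file name here are only examples, not a shipped artifact.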
Embedding Models¶
The RAG pipeline will use embedding models via Ollama:
Model | Description | Size | Performance |
---|---|---|---|
nomic-embed-text | General text embeddings | 137M | High quality |
mxbai-embed-large | Multilingual embeddings | 137M | Multilingual support |
all-mxbai-embed-large | Specialized code embeddings | 137M | Code-focused |
Example embedding request:
POST /api/embeddings HTTP/1.1
Host: localhost:11434
Content-Type: application/json
{
  "model": "nomic-embed-text",
  "prompt": "The text to be embedded"
}
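As with generation, the middleware can call this endpoint directly. The sketch below is illustrative, assumes the default local Ollama port, and uses a helper name that is not part of the current codebase.

# Illustrative sketch (not part of the current implementation)
import requests

def embed_text(text, model="nomic-embed-text"):
    """Request an embedding vector from Ollama's /api/embeddings endpoint."""
    response = requests.post(
        "http://localhost:11434/api/embeddings",
        json={"model": model, "prompt": text},
        timeout=60,
    )
    response.raise_for_status()
    # The response body contains an "embedding" field: a list of floats.
    return response.json()["embedding"]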
Memory Management¶
When dealing with limited resources:
- Context pruning: Dynamically adjust context size based on available memory (see the sketch after this list)
- Quantization selection: Use higher quantization levels (Q4_0) for large models
- Context batching: Process chunks in batches for large documents
- Model swapping: Automatically switch between models based on query complexity
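As an example of the first strategy, context pruning could be approximated with a simple character budget. The sketch below is illustrative; the constants and the characters-per-token heuristic are assumptions, not current Obelisk behavior.

# Illustrative sketch (not part of the current implementation)
def prune_context(formatted_chunks, num_ctx=4096, reserved_tokens=1024, chars_per_token=4):
    """Drop trailing chunks so the prompt fits within the model's context window.

    Assumes roughly `chars_per_token` characters per token and reserves room
    for the system prompt, the question, and the generated answer.
    """
    budget_chars = (num_ctx - reserved_tokens) * chars_per_token
    pruned, used = [], 0
    for chunk in formatted_chunks:
        if used + len(chunk) > budget_chars:
            break
        pruned.append(chunk)
        used += len(chunk)
    return pruned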
Response Processing¶
After receiving responses from Ollama:
- Citation extraction: Identify and format source citations
- Response validation: Verify that the response uses the provided context
- Confidence scoring: Assess confidence in the generated response
- Metadata enrichment: Add metadata about sources used
Example response processing:
# Future implementation example
def process_ollama_response(response, retrieved_chunks):
    """Process and enhance response from Ollama."""
    text = response["response"]

    # Extract and verify citations
    citations = extract_citations(text)
    valid_citations = verify_citations(citations, retrieved_chunks)

    # Format citations and append source information
    enhanced_response = format_response_with_citations(text, valid_citations)

    # Add metadata
    metadata = {
        "sources_used": [chunk["metadata"]["source"] for chunk in retrieved_chunks],
        "confidence_score": calculate_confidence(text, retrieved_chunks),
        "model_used": response["model"],
        "response_time": response["total_duration"] / 1000000000  # Convert ns to s
    }

    return {
        "response": enhanced_response,
        "metadata": metadata
    }
Performance Optimization¶
To optimize performance with Ollama:
- Batched embeddings: Process multiple chunks in a single API call
- Connection pooling: Maintain persistent connections
- Response streaming: Stream responses for faster initial display
- Caching layer: Cache common queries and embeddings (see the sketch after this list)
- Load balancing: Support multiple Ollama instances for high demand
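As an example of the caching layer, embeddings for repeated queries and unchanged chunks could simply be memoized. The sketch below is illustrative and reuses the hypothetical embed_text helper shown earlier.

# Illustrative sketch (not part of the current implementation)
from functools import lru_cache

@lru_cache(maxsize=4096)
def embed_text_cached(text, model="nomic-embed-text"):
    """Memoize embedding calls for repeated queries and unchanged chunks."""
    # Return a tuple so the cached value is immutable.
    return tuple(embed_text(text, model=model))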
Fallback Mechanisms¶
The system will include fallbacks when Ollama cannot provide good answers:
- General knowledge fallback: Use model's general knowledge when no context is found (see the sketch after this list)
- Search suggestions: Recommend search terms for more relevant results
- Documentation navigation: Suggest navigation paths in the documentation
- Human handoff: Provide contact information for human assistance
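A sketch of how the general knowledge fallback might be wired in, assuming the retrieval step returns an empty list when nothing relevant is found; the helper names reuse the earlier illustrative sketches and are not current Obelisk APIs.

# Illustrative sketch (not part of the current implementation)
def answer_query(question, retrieved_chunks):
    """Fall back to the model's general knowledge when retrieval finds nothing."""
    if not retrieved_chunks:
        # No relevant documentation: answer from general knowledge and say so.
        fallback_context = (
            "No relevant documentation was found. Answer from general knowledge "
            "and tell the user that the documentation does not cover this topic."
        )
        return generate_with_context(question, fallback_context)
    context = format_context_for_ollama(retrieved_chunks)
    return generate_with_context(question, context)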
Security Considerations¶
Important security aspects of the Ollama integration:
- Input validation: Sanitize all inputs to prevent prompt injection (see the sketch after this list)
- Rate limiting: Prevent abuse with rate limits
- Output filtering: Filter sensitive information from responses
- Network isolation: Restrict network access to the Ollama API
- Authentication: Add optional authentication for API access
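Basic input hygiene might look like the sketch below. This is only a first line of defense, not a complete protection against prompt injection, and the length limit and character pattern are illustrative assumptions.

# Illustrative sketch (not part of the current implementation)
import re

MAX_QUERY_LENGTH = 2000  # illustrative limit

def validate_query(query):
    """Apply basic hygiene to user input before it reaches the prompt."""
    if not query or not query.strip():
        raise ValueError("Empty query")
    if len(query) > MAX_QUERY_LENGTH:
        raise ValueError("Query too long")
    # Strip control characters that could confuse prompt formatting.
    return re.sub(r"[\x00-\x08\x0b-\x1f\x7f]", "", query).strip()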
Advanced Models¶
Obelisk's RAG system supports various advanced models through Ollama integration, each offering different capabilities and performance characteristics.
Recommended Models¶
Model | Size | Use Case | Performance |
---|---|---|---|
Llama3 | 8B | General purpose, balanced | Good general performance |
Phi-3-mini | 3.8B | Lightweight, efficient | Excellent for resource-constrained systems |
Mistral | 7B | Technical content | Strong reasoning capabilities |
Mixtral | 8x7B | Complex questions | High quality with MoE architecture |
DeepSeek Coder | 6.7B | Code-focused | Excellent for developer documentation |
Model Configuration¶
Advanced models can be configured in various ways:
# Pull recommended models
ollama pull llama3
ollama pull phi3
ollama pull mistral
ollama pull mixtral
ollama pull deepseek-coder
# Configure Obelisk to use a specific model
export OLLAMA_MODEL="mixtral"
obelisk-rag config --set "ollama_model=mixtral"
Custom Model Parameters¶
For advanced users, Obelisk allows customizing model parameters:
# Example future configuration
rag:
  ollama:
    model: "llama3"
    parameters:
      temperature: 0.1
      top_p: 0.9
      top_k: 40
      num_ctx: 8192
      repeat_penalty: 1.1
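These values would map onto the options object of the generation request shown earlier, so tuning them in configuration changes the payload sent to /api/generate.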
Multi-Model Strategy¶
The RAG system can use different models for different types of queries:
- Technical questions: Use models like Mistral or DeepSeek Coder
- General questions: Use Llama3 or Phi-3
- Complex reasoning: Use Mixtral or larger variants
To implement a multi-model strategy, you can use the model selection feature:
# Different models for different queries
obelisk-rag query "How do I use Docker with Obelisk?" --model deepseek-coder
obelisk-rag query "Explain the architecture of Obelisk" --model mixtral
GPU Acceleration¶
For optimal performance with advanced models, GPU acceleration is recommended:
- NVIDIA GPUs: Fully supported through CUDA
- AMD GPUs: Experimental support through ROCm
- Metal (Apple Silicon): Native support on Mac
Configure GPU usage with: