Foundry Local is the on-device path. You run a small Phi model next to your app, index your docs locally, and answer questions without sending data out. This walkthrough uses Ollama (a local LLM runner) and Phi-3 mini, plus a tiny API for your app. You test features fast, keep files on disk, and skip cloud bills while you prototype.
You will set up a local Phi-3 model, verify it can chat, then wire a tiny API that answers questions over your PDFs. Everything runs on your laptop. No uploads.
First, install Ollama (a local model runner) so you can run Phi-3 with one command. On macOS:
bash
brew install ollama
Start the server, pull a small Phi model, and a tiny embedding model for indexing:
bash
ollama serve &
ollama pull phi3:mini
ollama pull nomic-embed-text
Verify the model answers locally. This runs the chat in your terminal, no internet needed:
bash
ollama run phi3:mini
> You are a helpful assistant.
> What is 2+2?
If you see "4" in the reply, your local LLM is live. Next you will stand up a small FastAPI server that exposes /ask to your app, chunk PDFs, embed with nomic-embed-text, and retrieve top passages for Phi-3. Stop here if you just wanted a quick sanity check.
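Before moving on to the API, you can confirm from code that both models are pulled. The sketch below is a minimal check, assuming Ollama is listening on its default address and that its /api/tags endpoint returns a "models" list with "name" fields; the helper names missing_models and fetch_tags are made up for this example.

```python
import json
import urllib.request
from typing import Dict, List

OLLAMA = "http://localhost:11434"  # assumed default Ollama address
REQUIRED = ["phi3:mini", "nomic-embed-text"]


def missing_models(tags: Dict, required: List[str]) -> List[str]:
    """Return the required models that are absent from an /api/tags response."""
    installed = {m["name"] for m in tags.get("models", [])}
    missing = []
    for name in required:
        # Untagged pulls are listed as "<name>:latest", so accept either form.
        candidates = {name} if ":" in name else {name, f"{name}:latest"}
        if not candidates & installed:
            missing.append(name)
    return missing


def fetch_tags(base_url: str = OLLAMA) -> Dict:
    """Ask the local Ollama server which models it has pulled."""
    with urllib.request.urlopen(f"{base_url}/api/tags", timeout=5) as resp:
        return json.load(resp)
```

With the server running, missing_models(fetch_tags(), REQUIRED) should return an empty list once both pulls have finished.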
Setup
Pick your platform and get Ollama running. Then pull the Phi and embedding models.
macOS (Apple Silicon or Intel):
bash
brew install ollama
ollama serve &
ollama pull phi3:mini
ollama pull nomic-embed-text
Linux (Debian, Ubuntu, Fedora, Arch):
bash
curl -fsSL https://ollama.com/install.sh | sh
ollama serve &
ollama pull phi3:mini
ollama pull nomic-embed-text
Windows 11:
powershell
winget install Ollama.Ollama
# Open a new terminal after install
ollama serve
# In another terminal
ollama pull phi3:mini
ollama pull nomic-embed-text
Now create a minimal local RAG service. It indexes a docs/ folder, then answers questions via HTTP.
Install Python deps in a virtualenv:
bash
python -m venv .venv
source .venv/bin/activate # Windows: .venv\Scripts\activate
pip install fastapi "uvicorn[standard]" faiss-cpu numpy pypdf requests
Create main.py:
py
from fastapi import FastAPI
from pydantic import BaseModel
from typing import List
import os, glob
from pypdf import PdfReader
import requests
import faiss
import numpy as np

OLLAMA = "http://localhost:11434"
EMBED_MODEL = "nomic-embed-text"
GEN_MODEL = "phi3:mini"
INDEX_PATH = "index.faiss"
CHUNK_SIZE = 800       # characters per chunk
CHUNK_OVERLAP = 120    # characters shared between neighboring chunks
DOCS_DIR = "docs"

app = FastAPI()

# Simple in-memory store for chunks
docs_chunks: List[str] = []
index = None

def embed(texts: List[str]) -> np.ndarray:
    """Embed each text with the local Ollama embedding model."""
    vecs = []
    for t in texts:
        r = requests.post(f"{OLLAMA}/api/embeddings", json={"model": EMBED_MODEL, "prompt": t})
        r.raise_for_status()
        vecs.append(r.json()["embedding"])
    return np.array(vecs).astype("float32")

def load_pdfs(dirpath: str) -> List[str]:
    """Read every PDF and .txt file under dirpath and split it into overlapping chunks."""
    chunks = []
    step = CHUNK_SIZE - CHUNK_OVERLAP
    # Sort paths so chunk order is stable across restarts and matches the saved index
    for path in sorted(glob.glob(os.path.join(dirpath, "**", "*.pdf"), recursive=True)):
        reader = PdfReader(path)
        # extract_text() can come back empty for scanned pages
        text = "\n".join(page.extract_text() or "" for page in reader.pages)
        for i in range(0, len(text), step):
            chunks.append(text[i:i + CHUNK_SIZE])
    for path in sorted(glob.glob(os.path.join(dirpath, "**", "*.txt"), recursive=True)):
        with open(path, "r", encoding="utf-8", errors="ignore") as f:
            text = f.read()
        for i in range(0, len(text), step):
            chunks.append(text[i:i + CHUNK_SIZE])
    return [c.strip() for c in chunks if c.strip()]
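
def build_index():
    """Embed all chunks and write a cosine-similarity FAISS index to disk."""
    global docs_chunks, index
    docs_chunks = load_pdfs(DOCS_DIR)
    if not docs_chunks:
        raise RuntimeError("No docs found. Put PDFs or .txt files under ./docs")
    vecs = embed(docs_chunks)
    dim = vecs.shape[1]
    index = faiss.IndexFlatIP(dim)
    # Normalize so inner product equals cosine similarity
    faiss.normalize_L2(vecs)
    index.add(vecs)
    faiss.write_index(index, INDEX_PATH)


# On startup, reuse a saved index if present; otherwise build one.
if os.path.exists(INDEX_PATH):
    index = faiss.read_index(INDEX_PATH)
    docs_chunks = load_pdfs(DOCS_DIR)
    # Only the vectors are persisted. If the docs changed, the chunk list no
    # longer lines up with them, so rebuild when the counts differ.
    if index.ntotal != len(docs_chunks):
        build_index()
else:
    os.makedirs(DOCS_DIR, exist_ok=True)
    build_index()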

class AskBody(BaseModel):
    question: str
    k: int = 4


@app.post("/ask")
def ask(body: AskBody):
    qv = embed([body.question])
    faiss.normalize_L2(qv)
    D, I = index.search(qv, body.k)
    # FAISS pads results with -1 when fewer than k vectors exist, so skip those ids
    ctx = "\n\n".join([docs_chunks[i] for i in I[0] if 0 <= i < len(docs_chunks)])
    prompt = f"Use the context to answer. If unknown, say you do not know.\n\nContext:\n{ctx}\n\nQuestion: {body.question}"
    r = requests.post(f"{OLLAMA}/api/chat", json={
        "model": GEN_MODEL,
        "messages": [
            {"role": "system", "content": "You are a concise assistant."},
            {"role": "user", "content": prompt},
        ],
        # /api/chat streams by default; ask for a single JSON response instead
        "stream": False,
        "options": {"num_ctx": 4096},
    })
    r.raise_for_status()
    data = r.json()
    content = data.get("message", {}).get("content", "")
    return {"answer": content.strip()}
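The chunking inside load_pdfs is a sliding character window: each chunk is CHUNK_SIZE characters long and the window advances by CHUNK_SIZE minus CHUNK_OVERLAP. Here is a minimal standalone sketch of that idea so you can experiment with sizes outside the server. The helper name chunk_text is hypothetical and not part of main.py.

```python
from typing import List


def chunk_text(text: str, size: int = 800, overlap: int = 120) -> List[str]:
    """Split text into fixed-size character windows that overlap by `overlap`."""
    if size <= 0 or not 0 <= overlap < size:
        raise ValueError("need size > 0 and 0 <= overlap < size")
    step = size - overlap  # how far the window advances each time
    chunks = []
    for start in range(0, len(text), step):
        piece = text[start:start + size].strip()
        if piece:  # drop windows that are only whitespace
            chunks.append(piece)
    return chunks
```

With the defaults, a 2,000-character document produces three chunks starting at offsets 0, 680, and 1360. A larger overlap keeps more sentences intact across chunk boundaries, at the cost of more vectors to embed and store.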
Create a docs/ folder with a couple of PDFs or .txt files, then run the API:
bash
mkdir -p docs
# Copy at least one PDF or .txt file into docs/ first; startup fails on an empty folder
uvicorn main:app --reload --port 8000
Test it from another terminal:
bash
curl -s -X POST localhost:8000/ask \
-H 'content-type: application/json' \
-d '{"question": "Summarize the key points in our doc"}' | jq
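If your app is in Python, the same call can be made without curl. This is a rough sketch using only the standard library, assuming the API is running on port 8000 as above; build_ask_request and ask are illustrative names, not part of any library.

```python
import json
import urllib.request

API_URL = "http://localhost:8000/ask"  # matches the uvicorn port used above


def build_ask_request(question: str, k: int = 4, url: str = API_URL) -> urllib.request.Request:
    """Build the POST request that /ask expects: a JSON body with question and k."""
    body = json.dumps({"question": question, "k": k}).encode("utf-8")
    return urllib.request.Request(
        url,
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def ask(question: str, k: int = 4, url: str = API_URL, timeout: float = 120.0) -> str:
    """Send the question to the local API and return the answer text."""
    with urllib.request.urlopen(build_ask_request(question, k, url), timeout=timeout) as resp:
        return json.load(resp)["answer"]
```

Calling ask("Summarize the key points in our doc") should return the same text as the "answer" field in the curl output. The generous timeout is a guess to allow for slow first responses while the model loads.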
Verify
Quick model sanity check:
bash
ollama run phi3:mini <<'EOF'
You are a helpful assistant.
What is the capital of France?
EOF
Expected: a short text reply that includes "Paris". If the model streams tokens in your terminal, the local runtime is working.
API sanity check:
bash
curl -s localhost:8000/ask \
-H 'content-type: application/json' \
-d '{"question": "What does the first page say?"}'
Expected JSON with an "answer" field that mentions content from your docs.
Configuration tips
- Choose model size. For speed on laptops, use phi3:mini. For better recall on longer questions, try phi3:medium and set options.num_ctx to 4096 or higher.
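- Control context window. Set num_ctx in the chat payload to fit your chunking strategy. Keep the k retrieved chunks plus the question and system prompt under the limit. With the defaults here, 4 chunks of 800 characters come to about 3,200 characters, which is roughly 800 tokens if you assume about 4 characters per token for English text, so 4096 leaves plenty of room.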
- Speed up retrieval. Precompute and persist FAISS once. Rebuild only when docs change. You already write index.faiss to disk.
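- GPU acceleration. On Apple Silicon, Ollama uses Metal by default. On NVIDIA, Ollama should pick up the GPU automatically once a recent NVIDIA driver is installed. Set CUDA_VISIBLE_DEVICES before starting the server if you need to restrict which GPUs it uses.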
- Caching. Wrap embed() with a tiny sqlite or disk cache keyed by sha256(text). This avoids re-embedding unchanged chunks.
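The caching tip can be sketched as follows. This is one possible shape, not the only one: a small sqlite table keyed by a sha256 hash, with vectors stored as JSON text. The class name EmbeddingCache, the file name embed_cache.sqlite, and the choice to mix the model name into the key are all assumptions added for this example.

```python
import hashlib
import json
import sqlite3
from typing import Callable, List, Sequence


class EmbeddingCache:
    """Disk cache mapping sha256(model + text) to an embedding vector."""

    def __init__(self, path: str = "embed_cache.sqlite"):
        self.db = sqlite3.connect(path)
        self.db.execute(
            "CREATE TABLE IF NOT EXISTS cache (key TEXT PRIMARY KEY, vec TEXT NOT NULL)"
        )

    @staticmethod
    def key(model: str, text: str) -> str:
        # Include the model name so switching models never reuses old vectors.
        return hashlib.sha256(f"{model}\n{text}".encode("utf-8")).hexdigest()

    def get_or_compute(
        self,
        model: str,
        texts: Sequence[str],
        embed_one: Callable[[str], Sequence[float]],
    ) -> List[List[float]]:
        """Return one vector per text, calling embed_one only on cache misses."""
        out = []
        for t in texts:
            k = self.key(model, t)
            row = self.db.execute("SELECT vec FROM cache WHERE key = ?", (k,)).fetchone()
            if row is None:
                vec = [float(x) for x in embed_one(t)]
                self.db.execute(
                    "INSERT OR REPLACE INTO cache (key, vec) VALUES (?, ?)",
                    (k, json.dumps(vec)),
                )
            else:
                vec = json.loads(row[0])
            out.append(vec)
        self.db.commit()
        return out
```

To use it in main.py, you would move the single-text request from embed() into a small function, pass that function as embed_one, and convert the returned lists to a float32 NumPy array before handing them to FAISS.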
Troubleshooting
- Connection refused at http://localhost:11434. Fix: start the server with ollama serve and keep it running. Then retry pulls and API calls.
- Model not found error in chat or embeddings. Fix: run ollama pull phi3:mini and ollama pull nomic-embed-text. Restart your API after new pulls.
- Embedding dimension mismatch in FAISS. Fix: delete index.faiss and rebuild after switching embedding models to keep vector dims consistent.
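One way to catch the dimension mismatch before FAISS raises is to record which embedding model built the index and compare on startup. The sketch below assumes a hypothetical sidecar file, index.meta.json, written next to index.faiss; neither the file nor the helper names exist in main.py as written.

```python
import json
import os

META_PATH = "index.meta.json"  # hypothetical sidecar file next to index.faiss


def save_index_meta(model: str, dim: int, path: str = META_PATH) -> None:
    """Record which embedding model and vector size built the current index."""
    with open(path, "w", encoding="utf-8") as f:
        json.dump({"embed_model": model, "dim": dim}, f)


def index_is_compatible(model: str, dim: int, path: str = META_PATH) -> bool:
    """Return True only if the saved metadata matches the current model and dim."""
    if not os.path.exists(path):
        return False  # no record, so treat the index as stale
    with open(path, "r", encoding="utf-8") as f:
        meta = json.load(f)
    return meta.get("embed_model") == model and meta.get("dim") == dim
```

A reasonable wiring would be to call save_index_meta(EMBED_MODEL, vecs.shape[1]) at the end of build_index(), and on startup to rebuild whenever index_is_compatible(EMBED_MODEL, index.d) returns False, where index.d is the dimension FAISS reports for the loaded index.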
When it beats cloud chat tools
- Prototyping doc chat inside your app. You change prompts or chunking and get instant feedback, with no queue time or per-message costs.
- Handling private support docs during early testing. You keep files on disk and avoid accidental uploads while you tune retrieval.
- Shipping offline helpers. Field teams with spotty internet still get answers over local manuals and notes.
- Cost control for long docs. Embeddings run locally, so you test with 100 to 1,000 pages without spiking a cloud bill.