Foundry Local: run Phi on your machine for fast offline document Q&A

Foundry Local is the on-device path. You run a small Phi model next to your app, index your docs locally, and answer questions without sending data out. This walkthrough uses Ollama (a local LLM runner) and Phi-3 mini, plus a tiny API for your app. You test features fast, keep files on disk, and skip cloud bills while you prototype.

You will set up a local Phi-3 model, verify it can chat, then wire a tiny API that answers questions over your PDFs. Everything runs on your laptop. No uploads.

First, install Ollama (a local model runner) so you can run Phi-3 with one command. On macOS:

bash
brew install ollama

Start the server, pull a small Phi model, and a tiny embedding model for indexing:

bash
ollama serve &
ollama pull phi3:mini
ollama pull nomic-embed-text

Verify the model answers locally. This runs the chat in your terminal; once the models are pulled, no internet is needed:

bash
ollama run phi3:mini
> You are a helpful assistant.
> What is 2+2?

If you see "4" in the reply, your local LLM is live. Next you will stand up a small FastAPI server that exposes /ask to your app, chunk PDFs, embed with nomic-embed-text, and retrieve top passages for Phi-3. Stop here if you just wanted a quick sanity check.

Setup

Pick your platform and get Ollama running. Then pull the Phi and embedding models.

macOS (Apple Silicon or Intel):

bash
brew install ollama
ollama serve &
ollama pull phi3:mini
ollama pull nomic-embed-text

Linux (Debian, Ubuntu, Fedora, Arch):

bash
curl -fsSL https://ollama.com/install.sh | sh
ollama serve &  # skip if the installer already started Ollama as a service
ollama pull phi3:mini
ollama pull nomic-embed-text

Windows 11:

bash
winget install Ollama.Ollama
# Open a new terminal after install
ollama serve  # skip if the Ollama app is already running in the background
# In another terminal
ollama pull phi3:mini
ollama pull nomic-embed-text
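
Before moving on, it can help to confirm from code that the server is reachable and both models are present. The sketch below assumes Ollama's `GET /api/tags` endpoint, which lists locally pulled models. The helper names `normalize`, `missing_models`, and `fetch_tags` are made up for this example and are not part of Ollama.

```python
import json
import urllib.request

OLLAMA = "http://localhost:11434"  # default Ollama address used throughout this guide
REQUIRED = ["phi3:mini", "nomic-embed-text"]


def normalize(name: str) -> str:
    """Treat 'model' and 'model:latest' as the same name."""
    return name if ":" in name else f"{name}:latest"


def missing_models(tags_response: dict, required: list[str]) -> list[str]:
    """Return the required models that do not appear in an /api/tags response."""
    installed = {normalize(m["name"]) for m in tags_response.get("models", [])}
    return [r for r in required if normalize(r) not in installed]


def fetch_tags(base_url: str = OLLAMA) -> dict:
    """Ask the local server which models it has pulled (raises if it is not running)."""
    with urllib.request.urlopen(f"{base_url}/api/tags", timeout=5) as resp:
        return json.load(resp)
```

Calling `missing_models(fetch_tags(), REQUIRED)` should return an empty list once both pulls have finished. A connection error at that point usually means the server is not running.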

Now create a minimal local RAG service. It indexes a docs/ folder, then answers questions via HTTP.

Install Python deps in a virtualenv:

bash
python -m venv .venv
source .venv/bin/activate  # Windows: .venv\Scripts\activate
pip install fastapi "uvicorn[standard]" faiss-cpu pypdf requests numpy

Create main.py:

py
from fastapi import FastAPI
from pydantic import BaseModel
from typing import List
import os, glob
from pypdf import PdfReader
import requests
import faiss
import numpy as np

OLLAMA = "http://localhost:11434"
EMBED_MODEL = "nomic-embed-text"
GEN_MODEL = "phi3:mini"
INDEX_PATH = "index.faiss"
CHUNK_SIZE = 800
CHUNK_OVERLAP = 120
DOCS_DIR = "docs"

app = FastAPI()

# Simple in-memory store for chunks
docs_chunks: List[str] = []
index = None

def embed(texts: List[str]) -> np.ndarray:
    vecs = []
    for t in texts:
        r = requests.post(f"{OLLAMA}/api/embeddings", json={"model": EMBED_MODEL, "prompt": t})
        r.raise_for_status()
        vecs.append(r.json()["embedding"]) 
    return np.array(vecs).astype("float32")

def load_pdfs(dirpath: str) -> List[str]:
    """Read every PDF and .txt file under dirpath and split the text into overlapping chunks."""
    chunks = []
    for path in glob.glob(os.path.join(dirpath, "**", "*.pdf"), recursive=True):
        reader = PdfReader(path)
        buf = []
        for page in reader.pages:
            buf.append(page.extract_text() or "")
        text = "\n".join(buf)
        i = 0
        while i < len(text):
            chunks.append(text[i:i+CHUNK_SIZE])
            i += CHUNK_SIZE - CHUNK_OVERLAP
    for path in glob.glob(os.path.join(dirpath, "**", "*.txt"), recursive=True):
        with open(path, "r", encoding="utf-8", errors="ignore") as f:
            t = f.read()
            i = 0
            while i < len(t):
                chunks.append(t[i:i+CHUNK_SIZE])
                i += CHUNK_SIZE - CHUNK_OVERLAP
    return [c.strip() for c in chunks if c.strip()]

def build_index():
    global docs_chunks, index
    docs_chunks = load_pdfs(DOCS_DIR)
    if not docs_chunks:
        raise RuntimeError("No docs found. Put PDFs or .txt files under ./docs")
    vecs = embed(docs_chunks)
    dim = vecs.shape[1]
    index = faiss.IndexFlatIP(dim)
    # Normalize for cosine similarity
    faiss.normalize_L2(vecs)
    index.add(vecs)
    faiss.write_index(index, INDEX_PATH)

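if os.path.exists(INDEX_PATH):
    # Reuse the saved index. Chunks are re-read from disk, so they only line up
    # with the index if docs/ is unchanged; delete index.faiss after editing docs.
    index = faiss.read_index(INDEX_PATH)
    docs_chunks = load_pdfs(DOCS_DIR)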
else:
    os.makedirs(DOCS_DIR, exist_ok=True)
    build_index()

class AskBody(BaseModel):
    question: str
    k: int = 4

@app.post("/ask")
def ask(body: AskBody):
    qv = embed([body.question])
    faiss.normalize_L2(qv)
    D, I = index.search(qv, body.k)
    # FAISS pads missing results with -1, which would otherwise pick the last chunk
    ctx = "\n\n".join([docs_chunks[i] for i in I[0] if 0 <= i < len(docs_chunks)])
    prompt = f"Use the context to answer. If unknown, say you do not know.\n\nContext:\n{ctx}\n\nQuestion: {body.question}"
    r = requests.post(f"{OLLAMA}/api/chat", json={
        "model": GEN_MODEL,
        "messages": [
            {"role": "system", "content": "You are a concise assistant."},
            {"role": "user", "content": prompt},
        ],
        "options": {"num_ctx": 4096},
        # Ollama streams newline-delimited JSON by default; ask for a single JSON reply
        "stream": False,
    })
    r.raise_for_status()
    data = r.json()
    content = data.get("message", {}).get("content", "")
    return {"answer": content.strip()}
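
The handler above packs up to k chunks of CHUNK_SIZE characters into one prompt, so it is worth checking that the total stays inside num_ctx. Here is a rough sketch of that arithmetic. It assumes about 4 characters per token, which is a common rule of thumb for English text and not a measurement from the Phi-3 tokenizer. The `overhead_chars` and `reserve_for_answer` values are also illustrative guesses.

```python
CHUNK_SIZE = 800      # characters per chunk, as in main.py
CHUNK_OVERLAP = 120   # characters shared between neighbouring chunks
CHARS_PER_TOKEN = 4   # ASSUMPTION: rough average for English text, not measured


def chunk_starts(text_len: int, size: int = CHUNK_SIZE, overlap: int = CHUNK_OVERLAP) -> list[int]:
    """Start offsets produced by the sliding window in load_pdfs()."""
    return list(range(0, text_len, size - overlap))


def estimated_prompt_tokens(k: int, question: str, overhead_chars: int = 200) -> int:
    """Worst-case prompt size in tokens: k full chunks + question + instructions."""
    chars = k * CHUNK_SIZE + len(question) + overhead_chars
    return -(-chars // CHARS_PER_TOKEN)  # ceiling division


def fits(k: int, question: str, num_ctx: int = 4096, reserve_for_answer: int = 512) -> bool:
    """True if the estimated prompt leaves room for the answer inside num_ctx."""
    return estimated_prompt_tokens(k, question) + reserve_for_answer <= num_ctx
```

With the defaults, a 2,000-character document yields chunks starting at 0, 680, and 1360, and k=4 with a 100-character question comes to roughly 875 estimated tokens, well under 4096. Raising k to 20 would exceed the budget under the same assumptions.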

Create a docs/ folder with a couple of PDFs or .txt files, then run the API:

bash
mkdir -p docs
uvicorn main:app --reload --port 8000

Test it from another terminal:

bash
curl -s -X POST localhost:8000/ask \
  -H 'content-type: application/json' \
  -d '{"question": "Summarize the key points in our doc"}' | jq

Verify

Quick model sanity check:

bash
ollama run phi3:mini <<'EOF'
You are a helpful assistant.
What is the capital of France?
EOF

Expected: a short text reply that includes "Paris". If the model streams tokens in your terminal, the local runtime is working.

API sanity check:

bash
curl -s localhost:8000/ask \
  -H 'content-type: application/json' \
  -d '{"question": "What does the first page say?"}'

Expected JSON with an "answer" field that mentions content from your docs.
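
If your app is written in Python, the same check can be done without curl. This is a minimal sketch using only the standard library. It assumes the API is listening on port 8000 as started above, and the function names `build_ask_request`, `extract_answer`, and `ask` are hypothetical helpers for this example.

```python
import json
import urllib.request

API = "http://localhost:8000/ask"  # the uvicorn address used above


def build_ask_request(question: str, k: int = 4, url: str = API) -> urllib.request.Request:
    """Build the POST request that /ask expects: a JSON body with question and k."""
    body = json.dumps({"question": question, "k": k}).encode("utf-8")
    return urllib.request.Request(
        url, data=body, headers={"Content-Type": "application/json"}, method="POST"
    )


def extract_answer(payload: dict) -> str:
    """Pull the answer out of the response, failing loudly if the shape is wrong."""
    answer = payload.get("answer")
    if not isinstance(answer, str) or not answer.strip():
        raise ValueError(f"unexpected /ask response: {payload!r}")
    return answer


def ask(question: str, k: int = 4) -> str:
    """Send the question to the local API and return the answer text."""
    with urllib.request.urlopen(build_ask_request(question, k), timeout=120) as resp:
        return extract_answer(json.load(resp))
```

With the server running, `ask("What does the first page say?")` should return a non-empty string. The generous timeout is a guess; the first request can be slow while the model loads.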

Configuration tips

Choose model size. For speed on laptops, use phi3:mini. For better recall on longer questions, try phi3:medium and set options.num_ctx to 4096 or higher.
Control context window. Set num_ctx in the chat payload to fit your chunking strategy. Keep the k retrieved chunks, the system prompt, and the question under the limit, with room left for the answer.
Speed up retrieval. Precompute and persist FAISS once. Rebuild only when docs change. You already write index.faiss to disk.
GPU acceleration. On Apple Silicon, Ollama uses Metal by default. On NVIDIA, install a current NVIDIA driver; Ollama detects supported GPUs on its own, and CUDA_VISIBLE_DEVICES limits which ones it uses.
Caching. Wrap embed() with a tiny sqlite or disk cache keyed by sha256(text). This avoids re-embedding unchanged chunks.
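
The caching tip could look something like the sketch below. It stores vectors in SQLite keyed by a sha256 hash. The class name `EmbeddingCache`, the file name `embed_cache.sqlite`, and the choice to mix the model name into the key are assumptions of this example; the `embed_fn` argument stands in for a function that embeds one string, such as a single-text version of embed() in main.py.

```python
import hashlib
import json
import sqlite3
from typing import Callable, List, Sequence


class EmbeddingCache:
    """Disk cache for embeddings, keyed by sha256 of model name + chunk text."""

    def __init__(self, path: str = "embed_cache.sqlite"):
        self.db = sqlite3.connect(path)
        self.db.execute(
            "CREATE TABLE IF NOT EXISTS cache (key TEXT PRIMARY KEY, vec TEXT NOT NULL)"
        )

    @staticmethod
    def key(model: str, text: str) -> str:
        # Including the model name keeps vectors from different models apart.
        return hashlib.sha256(f"{model}\n{text}".encode("utf-8")).hexdigest()

    def get_or_compute(
        self, model: str, texts: List[str], embed_fn: Callable[[str], Sequence[float]]
    ) -> List[List[float]]:
        """Return one vector per text, calling embed_fn only for unseen texts."""
        out = []
        for t in texts:
            k = self.key(model, t)
            row = self.db.execute("SELECT vec FROM cache WHERE key = ?", (k,)).fetchone()
            if row is None:
                vec = [float(x) for x in embed_fn(t)]
                self.db.execute(
                    "INSERT INTO cache (key, vec) VALUES (?, ?)", (k, json.dumps(vec))
                )
            else:
                vec = json.loads(row[0])
            out.append(vec)
        self.db.commit()
        return out
```

Storing vectors as JSON text keeps the sketch short. For large corpora a binary column would be more compact.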

Troubleshooting

Connection refused at http://localhost:11434. Fix: start the server with ollama serve and keep it running. Then retry pulls and API calls.
Model not found error in chat or embeddings. Fix: ollama pull phi3:mini and ollama pull nomic-embed-text. Restart your API after new pulls.
Embedding dimension mismatch in FAISS. Fix: delete index.faiss and rebuild after switching embedding models to keep vector dims consistent.
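
One way to catch the dimension mismatch before FAISS raises is to record which embedding model built the index and compare at startup. The sketch below is an assumption layered on top of main.py, not something it already does: the sidecar file `index.meta.json` and the helpers `write_meta` and `needs_rebuild` are hypothetical.

```python
import json
import os

META_PATH = "index.meta.json"  # hypothetical sidecar file next to index.faiss


def write_meta(model: str, dim: int, path: str = META_PATH) -> None:
    """Record the embedding model and vector size used to build the index."""
    with open(path, "w", encoding="utf-8") as f:
        json.dump({"embed_model": model, "dim": dim}, f)


def needs_rebuild(model: str, dim: int, path: str = META_PATH) -> bool:
    """True if no metadata exists or it disagrees with the current model or dimension."""
    if not os.path.exists(path):
        return True
    with open(path, "r", encoding="utf-8") as f:
        meta = json.load(f)
    return meta.get("embed_model") != model or meta.get("dim") != dim
```

In main.py this would mean calling `write_meta(EMBED_MODEL, vecs.shape[1])` right after the index is written, and calling build_index() at startup whenever `needs_rebuild` returns True. The current dimension can be found by embedding one short probe string.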

When it beats cloud chat tools

Prototyping doc chat inside your app. You change prompts or chunking and get instant feedback, with no queue time or per-message costs.
Handling private support docs during early testing. You keep files on disk and avoid accidental uploads while you tune retrieval.
Shipping offline helpers. Field teams with spotty internet still get answers over local manuals and notes.
Cost control for long docs. Embeddings run locally, so you test with 100 to 1,000 pages without spiking a cloud bill.
