You slot the card in a desktop, add Furiosa's software, pick a ready model like Llama or Qwen, and chat without renting a server. This guide shows the free stack you can stand up today, and where the RNGD card fits when you add it.
You need a box with a free PCIe x16 slot, a 650 W or better PSU, and Ubuntu 22.04 or Windows 11. Even before the card shows up, set up your local LLM stack so you have a clean baseline. I used a mid-tower with 64 GB RAM and Ubuntu 22.04.
First, install Ollama (a free local LLM runner) and pull a chat model. On Linux (on macOS, the app download from ollama.com is the usual route):
bash
curl -fsSL https://ollama.com/install.sh | sh
Then start a model. Llama 3 and Qwen work well:
bash
ollama run llama3:8b
Verify it answers locally. Ask a direct question. You should see a response within a few seconds and your network monitor should stay quiet:
text
You: Summarize the 3 shell commands I should know for disk space.
When your Furiosa RNGD is installed, the vendor runtime slots under the same local stack so you keep the same chat workflow. In the full guide I walk through Linux and Windows setup, a clean verify routine, and the knobs that cut latency without wrecking accuracy. Stop here until you want driver, firmware, and service tuning.
Setup
Linux (Ubuntu 22.04)
1) Prepare the box
- Update packages and basic tools.
bash
sudo apt update && sudo apt -y upgrade
sudo apt -y install build-essential curl git pciutils htop nvme-cli
- Check the free slot and lane width.
bash
sudo lspci -vv | grep -E "^[0-9a-f]{2}:[0-9a-f]{2}\.[0-9a-f]|LnkCap:|LnkSta:"
- Set BIOS to Resizable BAR off and Above 4G decoding on if your board exposes them. Save and reboot.
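If the raw lspci dump is too noisy, a small filter helps. This is a minimal sketch, not a vendor tool: link_widths is a name I made up, and it assumes the usual lspci -vv layout where each device line is followed by an indented LnkSta line.

```shell
# Print "<device> x<width>" for every PCIe function that reports a LnkSta line.
# Input: the text of `sudo lspci -vv` on stdin.
# link_widths is a hypothetical helper name, not part of pciutils.
link_widths() {
  awk '
    /^([0-9a-f]+:)+[0-9a-f]+\.[0-9a-f]/ { dev = $1 }   # device header, e.g. 01:00.0
    /LnkSta:/ {
      # Pull the negotiated width (x1, x4, x8, x16) out of the status line
      if (match($0, /Width x[0-9]+/))
        print dev, substr($0, RSTART + 6, RLENGTH - 6)
    }
  '
}
```

Run it as sudo lspci -vv | link_widths. A card in a x16 slot that reports x8 or x4 often means the slot is wired for fewer lanes or is sharing them with another slot or an M.2 drive.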
2) Install Ollama and a baseline model
bash
curl -fsSL https://ollama.com/install.sh | sh
ollama run llama3:8b
Optional second model for comparison:
bash
ollama pull qwen2:7b
3) Prep the system for steady throughput
bash
# Performance governor on every core
sudo apt -y install linux-tools-common linux-tools-generic
for c in /sys/devices/system/cpu/cpu*/cpufreq/scaling_governor; do echo performance | sudo tee "$c" >/dev/null; done
# Disable turbo boost to cut frequency jitter during tests (Intel pstate only; silently skipped elsewhere)
echo 1 | sudo tee /sys/devices/system/cpu/intel_pstate/no_turbo >/dev/null 2>&1 || true
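Governor writes can fail quietly on some boards or inside VMs, so it is worth reading the setting back. A sketch under the assumption that your kernel exposes the standard cpufreq sysfs files; governor_mismatches is a hypothetical helper name.

```shell
# Count cores that are NOT on the expected governor.
# $1: expected governor (default: performance)
# $2: sysfs CPU root (default: /sys/devices/system/cpu); overridable for testing
# governor_mismatches is a hypothetical helper name.
governor_mismatches() {
  local want="${1:-performance}" root="${2:-/sys/devices/system/cpu}" n=0 f
  for f in "$root"/cpu*/cpufreq/scaling_governor; do
    [ -r "$f" ] || continue                      # glob did not match or file unreadable
    [ "$(cat "$f")" = "$want" ] || n=$((n + 1))  # tally cores on a different governor
  done
  echo "$n"
}
```

Calling governor_mismatches with no arguments should print 0 once every core has taken the performance setting.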
4) Install your Furiosa RNGD card
- Power down, slot the card in a x16 mechanical slot, connect any required aux power, and boot.
- Confirm the PCIe device enumerates.
bash
sudo lspci | grep -i "accelerator\|npu\|coprocessor"
- Install the vendor driver and runtime using their installer for your distro. Reboot when the installer tells you.
5) Keep your chat workflow
- Keep using Ollama for prompts while you validate system stability. In the next section you will confirm local-only traffic and speed.
Windows 11
1) System prep
- Update BIOS and chipset drivers. Set power plan to High performance.
2) Install Ollama
powershell
winget install Ollama.Ollama --silent
Then run from PowerShell:
powershell
ollama run llama3:8b
3) Install the Furiosa RNGD driver and runtime using the Windows installer from the vendor, if one is offered for your card and release. Reboot.
4) Verify the device shows in Device Manager under System devices or a dedicated NPU section.
macOS (Apple Silicon or Intel)
RNGD is a PCIe card, and most Macs have no internal PCIe slots, so treat macOS as a test bench only. You can still run the same local stack to test prompts and models now.
1) Install Ollama (on macOS use Homebrew or the app download from ollama.com)
bash
brew install ollama
brew services start ollama
2) Run a baseline model
bash
ollama run llama3:8b
Verify
Goal: prove local answers, measure baseline latency, and confirm the accelerator is present on Linux.
1) Local only
bash
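# Terminal 1: refresh the list of established TCP connections every second
sudo watch -n 1 "ss -tnp state established | grep -i ollama"
# Terminal 2: start a chat against the local server
OLLAMA_HOST=127.0.0.1:11434 ollama run llama3:8b
Type a question. Connections between 127.0.0.1 addresses are just the client talking to the local server. Once the model is already pulled, you should not see outbound TCP connections open to the internet.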
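Eyeballing ss output gets old fast. Here is a sketch of a filter that prints only non-loopback peers. It assumes the no-header format of ss -Htn state established, where the peer address is the last column, and external_peers is a hypothetical helper name.

```shell
# Print peer addresses that are not loopback.
# Input: output of `ss -Htn state established` (no header, peer address last).
# external_peers is a hypothetical helper name.
external_peers() {
  awk '
    NF == 0 { next }                         # skip blank lines
    { peer = $NF }
    peer ~ /^127\./           { next }       # IPv4 loopback
    peer ~ /^\[::1\]/         { next }       # IPv6 loopback
    peer ~ /^\[::ffff:127\./  { next }       # IPv4-mapped loopback
    { print peer }
  '
}
```

Run ss -Htn state established | external_peers while a prompt is generating. It lists every program's connections, not only Ollama's, so close browsers and sync clients first or expect some noise.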
2) Model list and pull cache
bash
ollama list
# Expected entry after first run
# NAME ID SIZE MODIFIED
# llama3:8b sha256:... 4.7 GB Just now
3) Baseline latency
bash
time ollama run llama3:8b "Say hello in 10 words."
Look at the real value in the output. Save it for later comparison.
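A single timing is noisy, since the first run also pays for loading the model. A sketch that takes the median of several runs instead; median is a helper I am defining here, and the commented loop assumes GNU date and an Ollama server that is already running.

```shell
# Print the median of the numbers on stdin (space, tab, or newline separated).
median() {
  tr -s ' \t' '\n' | grep -v '^$' | sort -n | awk '
    { v[NR] = $1 }
    END {
      if (NR == 0) exit 1                          # no samples
      if (NR % 2) print v[(NR + 1) / 2]            # odd count: middle value
      else print (v[NR / 2] + v[NR / 2 + 1]) / 2   # even count: mean of the two middle values
    }'
}

# Example (needs GNU date for %N and a running Ollama):
# for i in 1 2 3 4 5; do
#   s=$(date +%s.%N)
#   ollama run llama3:8b "Say hello in 10 words." >/dev/null
#   e=$(date +%s.%N)
#   awk -v s="$s" -v e="$e" 'BEGIN { print e - s }'
# done | median
```

The median shrugs off one slow cold start, which makes it a fairer number to compare against after the card is installed.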
4) Accelerator presence on Linux
bash
sudo lspci -nn | grep -i "accelerator\|coprocessor\|processing"
# Expect a new PCIe device entry after install
Configuration tips
- Use smaller quantization for speed. In Ollama, pick q4 or q5 variants when you pull models. Example: ollama pull llama3:8b-instruct-q4_K_M.
- Warm up the model. Ask a short question first to load the model into RAM, then start your real task to avoid a cold start penalty.
- Pin CPU cores during tests. Use taskset to bind the server to a fixed set of cores, ideally all on one NUMA node, for steadier latency.
bash
sudo apt -y install util-linux
sudo systemctl stop ollama   # free port 11434 if the installer's service is running
taskset -c 0-15 ollama serve   # adjust 0-15 to cores that exist on your CPU
- Keep context length realistic. Keep the system prompt short and cap the window at 4k tokens for most chats to avoid paging, for example with /set parameter num_ctx 4096 inside an ollama run session.
- Log tokens per second. Add --verbose to ollama run to see throughput, or pipe the output through pv for a rough stream rate in a pinch.
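For a throughput number you can script, Ollama's HTTP API reports eval_count (generated tokens) and eval_duration (nanoseconds) in the final response of /api/generate. A sketch of the arithmetic; tokens_per_second is a hypothetical helper name, and the commented call assumes curl and jq are installed.

```shell
# tokens_per_second <eval_count> <eval_duration_ns>
# Converts the two fields from an Ollama response into tokens per second.
# tokens_per_second is a hypothetical helper name.
tokens_per_second() {
  awk -v n="$1" -v ns="$2" 'BEGIN {
    if (ns <= 0) exit 1                 # missing or zero duration
    printf "%.1f\n", n * 1e9 / ns       # nanoseconds to seconds
  }'
}

# Example (assumes curl, jq, and a server on the default port):
# curl -s http://127.0.0.1:11434/api/generate \
#   -d '{"model":"llama3:8b","prompt":"Say hello in 10 words.","stream":false}' \
#   | jq -r '"\(.eval_count) \(.eval_duration)"' \
#   | { read -r n ns; tokens_per_second "$n" "$ns"; }
```

Log this figure next to your baseline latency so you can compare like with like later.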
Troubleshooting
1) Model will not start, memory error
- Symptom: "not enough memory" or process killed.
- Fix: pick a smaller quantization like llama3:8b-instruct-q4_K_M, close background apps, and add a 16 GB swap file.
bash
sudo fallocate -l 16G /swapfile
sudo chmod 600 /swapfile
sudo mkswap /swapfile
sudo swapon /swapfile
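Before leaning on swap, it helps to know how much RAM is actually free. A sketch that reads MemAvailable from /proc/meminfo (Linux only). has_free_ram is a hypothetical helper, and the 6 GiB in the usage line is my rough assumption for a 4.7 GB model plus working space, not a figure from Ollama.

```shell
# Succeed (exit 0) if available RAM is at least <need_gib> GiB.
# $1: required GiB
# $2: meminfo path (default: /proc/meminfo); overridable for testing
# has_free_ram is a hypothetical helper name.
has_free_ram() {
  awk -v need="$1" '
    $1 == "MemAvailable:" { found = 1; ok = ($2 / 1048576 >= need) }  # kB to GiB
    END { exit !(found && ok) }
  ' "${2:-/proc/meminfo}"
}

# Example:
# has_free_ram 6 || echo "Less than 6 GiB available; close apps or pick a smaller model."
```

Swap keeps the process alive but is far slower than RAM, so treat it as a safety net and not a fix.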
2) Slow first token every run
- Symptom: 20 to 60 seconds before the first token on each prompt.
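- Fix: keep ollama serve running and raise the keep-alive so the model stays in RAM. By default Ollama unloads an idle model after about five minutes.
bash
OLLAMA_KEEP_ALIVE=-1 ollama serve &
Also avoid system sleep and memory pressure.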
3) Port already in use
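- Symptom: "bind: address already in use" on 11434.
- Fix: find and stop the other server or change the port. On Linux the installer usually registers a systemd service, so try stopping that before killing a PID.
bash
sudo lsof -i :11434
sudo systemctl stop ollama   # if the listener is the installed service
kill <PID>                   # otherwise; use kill -9 only if it ignores this
# or
OLLAMA_HOST=127.0.0.1:11500 ollama serve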
When it beats cloud chatbots
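- Constant daily prompts without a surprise bill. If you push heavy traffic every day, a one-time card purchase can work out cheaper over time. The break-even point depends on the card price, your power cost, and the cloud rate you would otherwise pay.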
- Private data stays at home. Good for drafts, notes, and code with secrets on an offline LAN.
- Low, steady latency for short prompts. No round trip to a remote region. First token hits fast once the model is warm.
- Batch family use. Several people on the same LAN can hit your box at once and you control the queue.
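Whether the box wins on cost comes down to simple arithmetic: card price divided by what you save per day. A sketch with hypothetical inputs; break_even_days is a made-up name, and every number you feed it should come from your own invoices and power bill.

```shell
# break_even_days <card_price> <cloud_cost_per_day> <power_cost_per_day>
# All three inputs use the same currency. Prints the number of days until the
# card has paid for itself, or "never" if local power costs at least as much
# as the cloud bill it replaces.
# break_even_days is a hypothetical helper name.
break_even_days() {
  awk -v card="$1" -v cloud="$2" -v power="$3" 'BEGIN {
    saved = cloud - power               # net saving per day
    if (saved <= 0) { print "never"; exit }
    printf "%.0f\n", card / saved
  }'
}
```

With placeholder values, break_even_days 1000 12 2 prints 100. Those inputs are illustrative only and say nothing about real RNGD or cloud pricing.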