Furiosa RNGD: run Llama or Qwen locally and stop paying cloud bills

You slot the card in a desktop, add Furiosa's software, pick a ready model like Llama or Qwen, and chat without renting a server. This guide shows the free stack you can stand up today, and where the RNGD card fits when you add it.

You need a box with a free PCIe x16 slot, a 650 W or better PSU, and Ubuntu 22.04 or Windows 11. Even before the card shows up, set up your local LLM stack so you have a clean baseline. I used a mid-tower with 64 GB RAM and Ubuntu 22.04.

First, install Ollama (a free local LLM runner) and pull a chat model. On Linux:

bash
curl -fsSL https://ollama.com/install.sh | sh

Then start a model. Llama 3 and Qwen work well:

bash
ollama run llama3:8b

Verify it answers locally. Ask a direct question. You should see a response within a few seconds and your network monitor should stay quiet:

text
>>> Summarize the 3 shell commands I should know for disk space.

When your Furiosa RNGD is installed, the vendor runtime slots under the same local stack so you keep the same chat workflow. In the full guide I walk through Linux and Windows setup, a clean verify routine, and the knobs that cut latency without wrecking accuracy. Stop here until you want driver, firmware, and service tuning.

Setup

Linux (Ubuntu 22.04)

1) Prepare the box

Update packages and basic tools.

bash
sudo apt update && sudo apt -y upgrade
sudo apt -y install build-essential curl git pciutils htop nvme-cli

Check the free slot and lane width.

bash
sudo dmidecode -t slot | grep -E "Designation|Type|Current Usage"
sudo lspci -vv | grep -E "LnkCap:|LnkSta:"

Set BIOS to Resizable BAR off and Above 4G decoding on if your board exposes them. Save and reboot.
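
If you want the lane-width check as something you can script, the sketch below pulls the negotiated width out of `lspci -vv` text. It assumes the usual `LnkSta: Speed ..., Width xN` line format; the function name `link_widths` is made up for this example.

```bash
#!/usr/bin/env bash
# Sketch: read `lspci -vv` output on stdin and print the negotiated
# PCIe width (for example "x16") for every LnkSta line found.
# LnkCap lines (what the device could do) are deliberately ignored.
link_widths() {
  grep -oE 'LnkSta:.*Width x[0-9]+' | grep -oE 'x[0-9]+$'
}

# Usage on a real machine (root is needed to read link capabilities):
#   sudo lspci -vv | link_widths | sort | uniq -c
```

A card that sits in an x16 slot but reports x8 or x4 here is usually sharing lanes with another slot or an M.2 drive; check the board manual.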

2) Install Ollama and a baseline model

bash
curl -fsSL https://ollama.com/install.sh | sh
ollama run llama3:8b

Optional second model for comparison:

bash
ollama pull qwen2:7b

3) Prep the system for steady throughput

bash
# Performance governor (cpupower ships in linux-tools)
sudo apt -y install linux-tools-common linux-tools-generic
for c in /sys/devices/system/cpu/cpu*/cpufreq/scaling_governor; do echo performance | sudo tee "$c" >/dev/null; done
# Intel only: disable turbo boost to reduce frequency jitter during tests
echo 1 | sudo tee /sys/devices/system/cpu/intel_pstate/no_turbo >/dev/null 2>&1 || true

4) Install your Furiosa RNGD card

Power down, slot the card in a x16 mechanical slot, connect any required aux power, and boot.
Confirm the PCIe device enumerates.

bash
sudo lspci -nn | grep -i "accelerator\|coprocessor\|processing"

Install the vendor driver and runtime using their installer for your distro. Reboot when the installer tells you.

5) Keep your chat workflow

Keep using Ollama for prompts while you validate system stability. In the next section you will confirm local-only traffic and speed.

Windows 11

1) System prep

Update BIOS and chipset drivers. Set power plan to High performance.

2) Install Ollama

powershell
winget install Ollama.Ollama --silent

Then run from PowerShell:

powershell
ollama run llama3:8b

3) Install the Furiosa RNGD driver and runtime using the vendor's Windows installer, if one is offered for your card; check the vendor's supported-OS list first. Reboot.

4) Verify the device shows in Device Manager under System devices or a dedicated NPU section.

macOS (Apple Silicon or Intel)

RNGD is a PCIe card. Macs do not accept internal PCIe cards. You can still run the same local stack to test prompts and models now.

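1) Install Ollama

The Linux install script does not run on macOS. Use Homebrew, or download the app from ollama.com/download.

bash
brew install ollama
brew services start ollama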

2) Run a baseline model

bash
ollama run llama3:8b

Verify

Goal: prove local answers, measure baseline latency, and confirm the accelerator is present on Linux.

1) Local only

bash
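# Terminal 1: list ollama's established TCP connections, refreshed every second
sudo watch -n 1 "ss -tnp | grep ESTAB | grep -i ollama"
# Terminal 2: chat against the loopback address only
OLLAMA_HOST=127.0.0.1:11434 ollama run llama3:8b

Type a question. Connections between 127.0.0.1 addresses are the client talking to the local server and are expected. You should not see any connection to a public internet address.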
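
For a pass/fail version of this check, the sketch below filters `ss` output down to established connections whose peer is not a loopback address. It assumes the default `ss -Htnp` column layout (state first, peer address fifth); `non_loopback_peers` is a made-up helper name.

```bash
#!/usr/bin/env bash
# Sketch: read `ss -Htnp` lines on stdin and print the peer address of
# every ESTABLISHED connection that does not point at loopback.
# Empty output means nothing was talking to a non-local host.
non_loopback_peers() {
  awk '$1 == "ESTAB" && $5 !~ /^(127\.|\[::1\])/ { print $5 }'
}

# Usage while a chat is running (root lets ss show process names):
#   sudo ss -Htnp | grep -i ollama | non_loopback_peers
```

Run it mid-generation rather than right after a model pull, since pulling a model legitimately contacts the registry.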

2) Model list and pull cache

bash
ollama list
# Expected entry after first run
# NAME           ID              SIZE    MODIFIED
# llama3:8b      sha256:...      4.7 GB  Just now

3) Baseline latency

bash
time ollama run llama3:8b "Say hello in 10 words."

Look at real time. Save it for later comparison.
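
Wall-clock time mixes model load, prompt processing, and generation. If you also want a generation-only number, Ollama's HTTP API returns `eval_count` (tokens generated) and `eval_duration` (nanoseconds) with each non-streamed response. The sketch below turns those two fields into tokens per second; the helper name `tokens_per_second` and the use of `jq` in the usage lines are assumptions of this example.

```bash
#!/usr/bin/env bash
# Sketch: compute generation throughput from Ollama's response fields.
#   $1 = eval_count     (number of tokens generated)
#   $2 = eval_duration  (time spent generating, in nanoseconds)
tokens_per_second() {
  awk -v n="$1" -v ns="$2" 'BEGIN {
    if (ns > 0) printf "%.2f\n", n / (ns / 1e9)
    else        print "n/a"
  }'
}

# Usage with a running local server (needs curl and jq):
#   resp=$(curl -s http://127.0.0.1:11434/api/generate \
#     -d '{"model":"llama3:8b","prompt":"Say hello in 10 words.","stream":false}')
#   tokens_per_second "$(jq -r .eval_count <<<"$resp")" "$(jq -r .eval_duration <<<"$resp")"
```

Record this number next to the `time` result so later comparisons separate startup cost from steady-state speed.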

4) Accelerator presence on Linux

bash
sudo lspci -nn | grep -i "accelerator\|coprocessor\|processing"
# Expect a new PCIe device entry after install

Configuration tips

Use smaller quantization for speed. In Ollama, pick q4 or q5 variants when you pull models. Example: ollama pull llama3:8b-instruct-q4_K_M.
Warm prompt cache. Ask a short question first to load the model into RAM, then start your real task to avoid a cold start penalty.
Pin CPU cores during tests. Use taskset to bind the server to a fixed set of cores, ideally all on one NUMA node, for steadier latency.

bash
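# taskset ships with util-linux, which Ubuntu installs by default
# Stop the installer's service first so the port is free: sudo systemctl stop ollama
taskset -c 0-15 ollama serve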

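Keep context length realistic. A 4k context covers most chats; inside an ollama run session, /set parameter num_ctx 4096 caps the window and helps avoid paging.
Log tokens per second. Add --verbose to ollama run to print timings and the eval rate after each reply. In a pinch, piping the output through pv gives a rough byte rate, not a true token rate.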
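
To make a capped context stick across sessions, one option is a small Modelfile that sets `num_ctx` on top of a base model. The sketch below only writes that file; the helper `write_modelfile` and the local model name `llama3-4k` in the usage lines are made up for this example.

```bash
#!/usr/bin/env bash
# Sketch: write a two-line Ollama Modelfile that pins the context length.
#   $1 = base model tag, $2 = context length in tokens, $3 = output path
write_modelfile() {
  printf 'FROM %s\nPARAMETER num_ctx %s\n' "$1" "$2" > "$3"
}

# Usage:
#   write_modelfile llama3:8b 4096 ./Modelfile
#   ollama create llama3-4k -f ./Modelfile   # "llama3-4k" is a hypothetical local name
#   ollama run llama3-4k
```

Keeping the setting in a Modelfile means every client that uses the derived model gets the same context cap without remembering a per-session command.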

Troubleshooting

1) Model will not start, memory error

Symptom: not enough memory or process killed.
Fix: pick a smaller quantization like llama3:8b-instruct-q4_K_M, close background apps, and add a 16 GB swap file.

bash
sudo fallocate -l 16G /swapfile
sudo chmod 600 /swapfile
sudo mkswap /swapfile
sudo swapon /swapfile

2) Slow first token every run

Symptom: 20 to 60 seconds before the first token on each prompt.
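Fix: keep ollama serve running and raise the keep-alive so the model stays in RAM. By default Ollama unloads an idle model after about five minutes.

bash
OLLAMA_KEEP_ALIVE=-1 ollama serve &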

Also avoid system sleep and memory pressure.
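
A quick way to spot memory pressure before loading a model is to read `MemAvailable` from `/proc/meminfo` and compare it with the model size that `ollama list` reports. The sketch below does the unit conversion; `mem_available_gib` is a made-up helper name, and the comparison against model size is left to you.

```bash
#!/usr/bin/env bash
# Sketch: read /proc/meminfo-style text on stdin and print the
# MemAvailable value converted from kB to GiB (one decimal place).
mem_available_gib() {
  awk '/^MemAvailable:/ { printf "%.1f\n", $2 / 1048576 }'
}

# Usage on Linux:
#   mem_available_gib < /proc/meminfo
```

If the result is close to or below the model's size on disk, expect swapping and slow first tokens.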

3) Port already in use

Symptom: bind: address already in use on 11434.
Fix: find and stop the other server or change the port.

bash
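sudo lsof -i :11434
kill <PID>            # use kill -9 only if the process ignores this
# if the installer's systemd service owns the port: sudo systemctl stop ollama
# or
OLLAMA_HOST=127.0.0.1:11500 ollama serve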

When it beats cloud chatbots

Constant daily prompts without a surprise bill. A one-time card has a fixed cost, so heavy daily use can pay it off; how quickly depends on the card price, your power cost, and what the same tokens would cost in the cloud.
Private data stays at home. Good for drafts, notes, and code with secrets on an offline LAN.
Low, steady latency for short prompts. No round trip to a remote region. First token hits fast once the model is warm.
Batch family use. Several people on the same LAN can hit your box at once and you control the queue.
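
To put a number on the cost argument, a rough break-even estimate divides the one-time hardware cost by what you save per day. The sketch below is only that arithmetic. Every figure in the usage lines is hypothetical; plug in your own card price, cloud spend, and power cost.

```bash
#!/usr/bin/env bash
# Sketch: days until a one-time hardware purchase pays for itself.
#   $1 = one-time hardware cost
#   $2 = cloud spend per day that the local box replaces
#   $3 = local running cost per day (mostly electricity)
# Prints a whole number of days (rounded up), or "never" if running
# locally costs as much as or more than the cloud spend it replaces.
break_even_days() {
  awk -v hw="$1" -v cloud="$2" -v power="$3" 'BEGIN {
    saved = cloud - power
    if (saved <= 0) { print "never"; exit }
    d = hw / saved
    c = int(d)
    if (c < d) c++          # round up to the next full day
    printf "%d\n", c
  }'
}

# Usage with made-up numbers (not real prices):
#   break_even_days 1000 12 2    # 1000 / (12 - 2) = 100 days
#   break_even_days 1000 1 2     # local costs more per day: never
```

Light personal use often lands on "never" or on a very long payback, so the privacy and latency points below may matter more than cost.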
