How It Works — Jonestech

THE PROCESS

Five Steps.
Zero Guesswork.

Every deployment follows the same sequence. Each step feeds directly into the next — skipping any of them is how deployments go wrong. We've refined this across enough installs to know exactly where the edge cases are.

⬡ HARDWARE AUDIT

Profile Your Machine

Before touching a model file, we need to know exactly what we're working with. GPU model, VRAM, system RAM, storage speed, CPU architecture — every spec influences which models will run well and which will struggle.

GPU VRAM is the primary constraint — it determines max model size
System RAM matters for models that overflow to CPU offload
NVMe read speed affects model load time (42GB+ files)
Multi-GPU setups unlock tensor parallelism for larger models
We check CUDA version, driver state, and thermal headroom

// HARDWARE TARGETS

RTX 4090 · 24GB70B Q4 ✓

RTX 3090 · 24GB34B Q5 ✓

RTX 3080 · 10GB13B Q8 ✓

2× RTX 4090671B MoE ✓

CPU only7B Q4 ✓

◈ MODEL SELECTION

Pick the Right Model

Not every model is right for every use case. A 70B general-purpose model is overkill for a focused code assistant; a 7B coding model won't hold up for complex document analysis. We match capability to need.

General chat / assistant → Llama 3.3 70B or Mistral Small
Code generation / completion → Qwen 2.5 Coder 32B
Fast RAG retrieval → Mistral 7B or Phi-4 14B
Complex reasoning → DeepSeek R1 or Phi-4
Embeddings → nomic-embed-text or bge-m3 (run locally)

// MODEL MATCH

Chat / GeneralLlama 3.3 70B

CodeQwen 2.5 32B

RAG / SpeedMistral 22B

ReasoningPhi-4 14B

Embeddingsnomic-embed

▲ QUANTIZATION

Compress to Fit

A full-precision 70B model weighs ~140GB — impossible on a single consumer GPU. Quantization reduces each weight from 16-bit to 4-8 bits, shrinking file size dramatically with minimal quality loss if done correctly.

Q4_K_M is our default — best size/quality balance for most use cases
Q8_0 for maximum quality when VRAM allows
Q2_K only for CPU-only machines with severe memory limits
GGUF format for llama.cpp/Ollama; AWQ for vLLM
We benchmark quality drop before committing to a quantization

quantize.sh

$ollama pull llama3.3:70b-q4_K_M

pulling manifest...

pulling 42.5 GB ████████ 100%

verifying sha256... OK

$ollama run llama3.3:70b-q4_K_M

loaded in VRAM: 22.8 GB / 24 GB

◎ SERVER DEPLOYMENT

Production-Hardened Server

A model running in a terminal isn't a production system. We configure a proper inference server with process management, automatic restarts, health checks, logging, and optional load balancing for multi-GPU or multi-instance setups.

Ollama for ease-of-use and multi-model management
vLLM for maximum throughput on high-traffic deployments
llama.cpp server for lean CPU/hybrid setups
systemd service with restart policies and logging
Optional: nginx reverse proxy with auth and rate limiting

inference_server.sh

$systemctl status ollama

● ollama.service — active (running)

Loaded: enabled; vendor preset: enabled

PID: 1842 · Uptime: 14d 6h 22m

$curl localhost:11434/api/tags

{"models":[{"name":"llama3.3:70b-q4_K_M"}]}

✦ API INTERFACE

Drop-in Endpoint

The final layer is an OpenAI-compatible REST API. Your existing apps, scripts, and integrations can point at your local endpoint — same request format, same response format. One URL change and you're off cloud.

Identical to the OpenAI API spec — /v1/chat/completions
Streaming support (SSE) out of the box
Works with any OpenAI SDK — Python, Node, Go, etc.
Optional: Open WebUI for a browser-based chat interface
Optional: basic API key auth for multi-user environments

test_endpoint.py

$python test_endpoint.py

endpoint: http://localhost:11434/v1

model: llama3.3:70b-q4_K_M

TTFT: 84ms · tok/s: 47.3

status: READY ✓

MODEL	SIZE	BEST FOR	STATUS
Llama 3.3	70B Q4	General / Chat	PROD READY
Qwen 2.5 Coder	32B Q5	Code Gen	PROD READY
Mistral Small	22B Q6	RAG / Fast	PROD READY
Phi-4	14B Q8	Reasoning	PROD READY
DeepSeek R2	671B MoE	Research	HIGH VRAM

DEEP DIVE

Quantization Levels Explained

Q2_K

2-BIT · MINIMUM SIZE

70B file size~26 GB

Quality lossSignificant

Best forCPU-only

Last resort when VRAM is severely limited. Noticeable degradation on complex reasoning tasks.

Q4_K_M

4-BIT · RECOMMENDED

70B file size~42.5 GB

Quality lossMinimal

Best forMost use cases

Best balance of size and quality. Runs on a single RTX 4090. Hard to tell apart from Q8 in practice.

Q6_K

6-BIT · HIGH QUALITY

70B file size~58 GB

Quality lossVery minimal

Best forDual GPU / high VRAM

Marginal improvement over Q4 for most tasks. Worth it on multi-GPU setups where VRAM isn't a constraint.

Q8_0

8-BIT · NEAR LOSSLESS

70B file size~74 GB

Quality lossNegligible

Best forResearch / fine-tune prep

Effectively full quality. Requires 80GB+ VRAM for 70B. Overkill for most production workloads.

The Full
Stack Explained

Five Steps.
Zero Guesswork.

The Stack Behind Every Deployment

Quantization Levels Explained

See It Running
On Your Hardware

The Full Stack Explained

Five Steps.Zero Guesswork.

The Stack Behind Every Deployment

Quantization Levels Explained

See It RunningOn Your Hardware

The Full
Stack Explained

Five Steps.
Zero Guesswork.

See It Running
On Your Hardware