JONESTECH_SYS
STACK
DOC:001 // FROM HARDWARE TO INFERENCE

The Full
Stack Explained

Five steps from bare metal to a running LLM endpoint. No black boxes — here's exactly what we do, why we do it, and what you end up with.

THE PROCESS

Five Steps.
Zero Guesswork.

Every deployment follows the same sequence. Each step feeds directly into the next — skipping any of them is how deployments go wrong. We've refined this across enough installs to know exactly where the edge cases are.

01
HARDWARE AUDIT
Profile Your Machine

Before touching a model file, we need to know exactly what we're working with. GPU model, VRAM, system RAM, storage speed, CPU architecture — every spec influences which models will run well and which will struggle.

  • GPU VRAM is the primary constraint — it determines max model size
  • System RAM matters for models that overflow to CPU offload
  • NVMe read speed affects model load time (42GB+ files)
  • Multi-GPU setups unlock tensor parallelism for larger models
  • We check CUDA version, driver state, and thermal headroom
// HARDWARE TARGETS
RTX 4090 · 24GB70B Q4 ✓
RTX 3090 · 24GB34B Q5 ✓
RTX 3080 · 10GB13B Q8 ✓
2× RTX 4090671B MoE ✓
CPU only7B Q4 ✓
02
MODEL SELECTION
Pick the Right Model

Not every model is right for every use case. A 70B general-purpose model is overkill for a focused code assistant; a 7B coding model won't hold up for complex document analysis. We match capability to need.

  • General chat / assistant → Llama 3.3 70B or Mistral Small
  • Code generation / completion → Qwen 2.5 Coder 32B
  • Fast RAG retrieval → Mistral 7B or Phi-4 14B
  • Complex reasoning → DeepSeek R1 or Phi-4
  • Embeddings → nomic-embed-text or bge-m3 (run locally)
// MODEL MATCH
Chat / GeneralLlama 3.3 70B
CodeQwen 2.5 32B
RAG / SpeedMistral 22B
ReasoningPhi-4 14B
Embeddingsnomic-embed
03
QUANTIZATION
Compress to Fit

A full-precision 70B model weighs ~140GB — impossible on a single consumer GPU. Quantization reduces each weight from 16-bit to 4-8 bits, shrinking file size dramatically with minimal quality loss if done correctly.

  • Q4_K_M is our default — best size/quality balance for most use cases
  • Q8_0 for maximum quality when VRAM allows
  • Q2_K only for CPU-only machines with severe memory limits
  • GGUF format for llama.cpp/Ollama; AWQ for vLLM
  • We benchmark quality drop before committing to a quantization
quantize.sh
$ollama pull llama3.3:70b-q4_K_M
pulling manifest...
pulling 42.5 GB ████████ 100%
verifying sha256... OK
$ollama run llama3.3:70b-q4_K_M
loaded in VRAM: 22.8 GB / 24 GB
04
SERVER DEPLOYMENT
Production-Hardened Server

A model running in a terminal isn't a production system. We configure a proper inference server with process management, automatic restarts, health checks, logging, and optional load balancing for multi-GPU or multi-instance setups.

  • Ollama for ease-of-use and multi-model management
  • vLLM for maximum throughput on high-traffic deployments
  • llama.cpp server for lean CPU/hybrid setups
  • systemd service with restart policies and logging
  • Optional: nginx reverse proxy with auth and rate limiting
inference_server.sh
$systemctl status ollama
● ollama.service — active (running)
Loaded: enabled; vendor preset: enabled
PID: 1842 · Uptime: 14d 6h 22m
$curl localhost:11434/api/tags
{"models":[{"name":"llama3.3:70b-q4_K_M"}]}
05
API INTERFACE
Drop-in Endpoint

The final layer is an OpenAI-compatible REST API. Your existing apps, scripts, and integrations can point at your local endpoint — same request format, same response format. One URL change and you're off cloud.

  • Identical to the OpenAI API spec — /v1/chat/completions
  • Streaming support (SSE) out of the box
  • Works with any OpenAI SDK — Python, Node, Go, etc.
  • Optional: Open WebUI for a browser-based chat interface
  • Optional: basic API key auth for multi-user environments
test_endpoint.py
$python test_endpoint.py
endpoint: http://localhost:11434/v1
model: llama3.3:70b-q4_K_M
TTFT: 84ms · tok/s: 47.3
status: READY ✓
ARCHITECTURE

The Stack Behind Every Deployment

Each layer serves a specific purpose. Together they give you a private, fast, fully-owned inference endpoint that any application can call — with no changes to how you already write code.

MODELSIZEBEST FORSTATUS
Llama 3.370B Q4General / ChatPROD READY
Qwen 2.5 Coder32B Q5Code GenPROD READY
Mistral Small22B Q6RAG / FastPROD READY
Phi-414B Q8ReasoningPROD READY
DeepSeek R2671B MoEResearchHIGH VRAM
// DEPLOYMENT ARCHITECTURE
YOUR APPLICATIONAny language · REST clientCONSUMER
API GATEWAYOpenAI-compatible endpoint:8080
INFERENCE SERVEROllama / vLLM / llama.cppRUNNING
QUANTIZED MODELGGUF / AWQ — fits your VRAMLOADED
GPU / CPU HARDWAREYour machine · Your data centerLOCAL
DEEP DIVE

Quantization Levels Explained

Q2_K
2-BIT · MINIMUM SIZE
70B file size~26 GB
Quality lossSignificant
Best forCPU-only

Last resort when VRAM is severely limited. Noticeable degradation on complex reasoning tasks.

Q6_K
6-BIT · HIGH QUALITY
70B file size~58 GB
Quality lossVery minimal
Best forDual GPU / high VRAM

Marginal improvement over Q4 for most tasks. Worth it on multi-GPU setups where VRAM isn't a constraint.

Q8_0
8-BIT · NEAR LOSSLESS
70B file size~74 GB
Quality lossNegligible
Best forResearch / fine-tune prep

Effectively full quality. Requires 80GB+ VRAM for 70B. Overkill for most production workloads.

READY TO BUILD

See It Running
On Your Hardware

Now you know the stack — let's run it on yours. Tell us your hardware and use case and we'll have you live inside 48 hours.