Porting Undervolt from Cloud to a Jetson Under My Desk
Undervolt won the DGX AITX hackathon running on a Jetson at the venue. When we packed up and went home, the Jetson came with us, but the model traffic for the live site was hitting NVIDIA's hosted Nemotron API. That worked. It also wasn't the point.
The point was: this should run on a box. Not in a region. Not behind a per-token meter. On the same Jetson that won the prize, sitting in our office, returning answers in tens of milliseconds with no third-party in the path.
This is what changed when we made that switch.
Why bother
For most projects, "use the API" is the right answer. You get the best model. You don't manage infrastructure. The marginal token cost is small until it isn't.
Undervolt isn't most projects. It has three properties that flip the math:
- High volume. A live map with ten thousand users a month, each one running multiple queries, fans out to hundreds of thousands of model calls. At hosted-Nemotron rates, that's a meaningful monthly invoice. At electricity cost, it's near zero.
- Stable schema. The questions users ask Undervolt are bounded. "What's been permitted near this lat/lng?" is a different question from "Write me a poem." A small model trained on the right system prompt is genuinely sufficient.
- No reason to leak. Permit data is public, but the queries hint at what someone's investigating — a developer scoping a parcel, a planner running a what-if, a competitor mapping a market. Sending those through a third-party logger isn't a leak in the legal sense, but it's still data we don't need to ship out.
When all three are true, local is the right call. Most production AI workloads I've looked at have at least two of the three.
The architecture, before and after
Before:
User → Vercel → Next.js API route → integrate.api.nvidia.com (Nemotron Nano 8B)
Clean. Fast to build. Pay per token.
After:
User → Vercel → Next.js API route → LiteLLM proxy → Jetson (Ollama, Nemotron 3 Nano)
↓
fallback: integrate.api.nvidia.com
Same shape, one extra hop. The hop is LiteLLM, an OpenAI-compatible router that takes the same calls the front-end was already making and decides per-call where to send them. The Jetson runs Ollama with Nemotron 3 Nano pulled locally — 24GB on disk, fits in unified memory, returns first token in the low hundreds of milliseconds.
The fallback to NVIDIA's hosted API is the safety net. If the Jetson is offline, if Ollama crashes, if an anomalous query needs a bigger model, LiteLLM routes around the failure. The user never sees it.
What broke when we cut over
Three things, in order of how long each took to fix:
1. Tokenizer mismatch. The hosted Nemotron returns slightly different token counts than the locally-running Nemotron 3 Nano. Our prompt engineering had encoded subtle assumptions — "leave 800 tokens for the response" — that broke when the local model split tokens differently. Fix: stop reasoning about tokens, reason about characters, leave a 30% buffer.
2. Streaming format. Ollama streams in NDJSON; the hosted API streams in OpenAI-style SSE. LiteLLM normalizes most of this, but our Next.js handler had a bespoke parser for SSE chunks. Replaced the parser with the LiteLLM client SDK. Lost 60 lines of code. Net positive.
3. Cold-start latency. The Jetson's Ollama daemon evicts the model after thirty minutes of idle. First query after a quiet period took eight seconds while the model loaded. Fix: a tiny cron that pings /api/generate every fifteen minutes with a single-token request. Keep-alive without overspending battery.
Total port time: about eleven hours of focused work over two weeks.
What got better
Latency dropped. The hosted API was returning first token at 600–900 ms, mostly network round-trip. Local Jetson returns first token at 120–200 ms because the network round-trip is the wall socket.
Costs went to zero on the variable axis. The Jetson is a sunk cost. The wall power is real but rounds to about $40 a month at full utilization, which we're nowhere near. There's no incremental dollar per query.
The site got more honest. "Local inference" stopped being a marketing line and started being a deployment fact. When users asked whether their query was being logged anywhere, the answer became "no" without an asterisk.
What I learned about the local-vs-cloud decision
The decision isn't binary. Most teams I talk to are running cloud everywhere because the local question feels like infrastructure work nobody has time for. That's the right move when the workload is bursty, when the model needs to be GPT-4 class, or when you're still figuring out what the product is.
It stops being right when:
- Volume is steady and growing. The cloud line is variable cost. The local line is fixed cost. They cross.
- Latency is product-defining. Voice assistants, live captioning, agent loops where every hop matters.
- Compliance is asking questions. Healthcare, legal, regulated finance, internal codebases.
- You've outgrown the smallest cloud tier but aren't ready for the next one up. Local is often a better fit than mid-cloud.
Undervolt hit three of those four. The fourth (compliance) is hypothetical for permits but very real for the next thing we built.
The broader pattern
After Undervolt, we started defaulting to LiteLLM in front of every model call we ship. Not because we always run local, but because the option matters. With the router in place, switching a workload from cloud to local is a config change. Without it, you're rewriting client code every time the deployment shape moves.
If you're standing up a new AI product and you're unsure whether you'll go local later, put LiteLLM in early. Cost is essentially zero. Future-you will thank present-you when an enterprise customer asks where the data goes.
The Jetson today
The Jetson sits next to the standing desk, twenty-four hours a day, fan barely audible. It serves Undervolt. It also serves three other things we've shipped since — the home-network monitor, the photographer copilot, and a sports-AI prototype. Each one gets its own Ollama-served model. Each one is one more thing that doesn't need a cloud bill to keep running.
It's not the right tool for everything. But for "I have a steady workload and I'd like the model to live in my office," it's an underrated option.
Considering local-LLM deployment for a real workload? AISOFT plans, builds, and operates production local-inference stacks — Jetson, GB10, on-prem, hybrid with LiteLLM routing. hello@aisoft.us · book a 30-min consult →