Edge AI and On-Device Intelligence: A Developer's Guide to Local Inference

Not every inference should hit OpenAI. Here's how to run useful models on laptops, phones, and edge boards, with real hardware numbers and working code.

~13 min read

Your app sends a lecture transcript to a cloud API. The user is on a train with spotty Wi-Fi. The request times out. They blame your product, not the network.

That failure mode is why edge AI stopped being an embedded-systems niche and became a normal part of product architecture in 2026. You do not need a data center to run useful inference. You need the right model size, the right runtime, and honest UX about what local models can and cannot do.

This guide covers what edge AI actually means, which hardware runs which models, how to wire Ollama and llama.cpp into a real app, and where a hybrid local-plus-cloud pattern wins.

What is edge AI?

Edge AI means running machine learning inference on or near the device that collects the data, instead of sending raw inputs to a remote server.

That includes:

  • On-device LLMs on a laptop (Ollama, llama.cpp, MLX)
  • NPUs on phones and PCs (Apple Neural Engine, Qualcomm Hexagon, Intel NPU)
  • Edge GPUs on factory floors, drones, and retail kiosks
  • Microcontrollers running tiny vision or audio models via TensorFlow Lite or ONNX Runtime

The unifying idea is simple: compute where the data already lives. Less upload bandwidth, lower round-trip latency, and fewer copies of sensitive content sitting in someone else's log pipeline.

Edge AI is not the opposite of cloud AI. Most production apps I see use both. The question is which tasks stay local by default.

Why on-device inference is viable again

Three shifts made 2026 different from 2020:

  1. Quantization matured. A 7B parameter model at 4-bit precision fits in ~4–5 GB of VRAM or unified memory and often matches the quality of a full-precision 7B from two years ago for drafting and Q&A.
  2. Runtimes got fast. llama.cpp, ONNX Runtime, and vendor stacks (Core ML, TensorRT) squeeze usable token speeds out of CPU, GPU, and NPU paths.
  3. Small open models got good. Llama 3.x, Qwen 2.5/3, Gemma, Phi, and Mistral families ship in 1B–14B sizes that are actually helpful for summarization, quiz generation, and code explanation.
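The memory figure in point 1 follows from simple arithmetic: parameters times bits per weight, divided by eight. The sketch below is a rough estimate, not a measurement. The function names are hypothetical, the 4.85 bits-per-weight figure is an assumed approximation for a Q4_K_M-style mix, and the 0.5 GB overhead allowance is a guess for runtime buffers and a short context.

```typescript
// Rough weight-memory estimate for a quantized model, in decimal GB.
// bitsPerWeight is an assumption: 4 is pure 4-bit, ~4.85 approximates a
// Q4_K_M-style mix, 16 is fp16.
function estimateWeightGb(paramsBillions: number, bitsPerWeight: number): number {
  const bytes = paramsBillions * 1e9 * (bitsPerWeight / 8);
  return bytes / 1e9;
}

// Adds a flat allowance for runtime buffers and a short context.
// The default overhead is an assumed placeholder, not a measured value.
function estimateLoadedGb(
  paramsBillions: number,
  bitsPerWeight: number,
  overheadGb = 0.5
): number {
  return estimateWeightGb(paramsBillions, bitsPerWeight) + overheadGb;
}

console.log(estimateWeightGb(7, 4).toFixed(2), "GB for pure 4-bit 7B weights");
console.log(estimateLoadedGb(7, 4.85).toFixed(2), "GB for a Q4_K_M-style 7B plus overhead");
console.log(estimateWeightGb(7, 16).toFixed(2), "GB for fp16 7B weights");
```

Under these assumptions a 4-bit 7B model lands in the 4–5 GB range once overhead is included, while the same model at fp16 needs roughly three times that.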

You still cannot run GPT-4-class reasoning on a phone. You can run a tutor that summarizes the current lecture slide without uploading the video file.

Hardware requirements for on-device LLMs

Model choice starts with hardware reality, not benchmark leaderboard hype.

Laptop and desktop tiers

| Hardware profile | Typical RAM/VRAM | Models that run well | Approx. speed (7B Q4) | Good for |
|---|---|---|---|---|
| Budget laptop (i5, no dGPU) | 8 GB unified | 1B–3B Q4 only | 8–18 tok/s CPU | Keyword extract, short summaries |
| Standard dev machine | 16 GB, optional GTX 1650 | 7B Q4 (tight on 8 GB machines) | 15–30 tok/s CPU, 40–70 tok/s GPU | Drafting, RAG Q&A, study helpers |
| Creator / ML box | 32 GB + RTX 3060–4070 | 13B Q4, 7B Q8, multi-LoRA | 50–90 tok/s GPU | Longer context, better reasoning |
| Workstation | 64 GB + RTX 4090 / dual GPU | 32B–70B Q4 (with offload tricks) | 30–80 tok/s depending on model | Local coding agents, batch jobs |

Numbers vary by model family, context length, and batch size. Treat the table as order-of-magnitude planning, not an SLA.
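One way to turn the table into a default is a small lookup keyed on detected memory. This is a minimal sketch: `pickModelTier` and the `ModelTier` shape are hypothetical, and the thresholds sit slightly below the nominal sizes because operating systems usually report a little less than the installed RAM. It only considers discrete GPUs, so Apple Silicon would need its own branch.

```typescript
type ModelTier = { label: string; maxParamsB: number; quant: string };

// Thresholds mirror the planning table above. They are illustrative
// cut-offs, not measured limits.
function pickModelTier(totalRamGb: number, hasDiscreteGpu: boolean): ModelTier {
  if (totalRamGb >= 60 && hasDiscreteGpu) {
    return { label: "workstation", maxParamsB: 70, quant: "Q4" };
  }
  if (totalRamGb >= 30 && hasDiscreteGpu) {
    return { label: "creator", maxParamsB: 13, quant: "Q4" };
  }
  if (totalRamGb >= 14) {
    return { label: "standard", maxParamsB: 7, quant: "Q4" };
  }
  // Anything smaller gets the conservative default.
  return { label: "budget", maxParamsB: 3, quant: "Q4" };
}

console.log(pickModelTier(8, false));  // budget tier, up to 3B
console.log(pickModelTier(32, true));  // creator tier, up to 13B
```

In a Node.js or Electron main process, `os.totalmem() / 2 ** 30` is one way to obtain the `totalRamGb` input.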

Apple Silicon

| Chip | Unified memory | Practical local LLM ceiling | Notes |
|---|---|---|---|
| M1 / M2 (8 GB) | 8 GB shared with OS | 3B Q4 comfortable, 7B Q4 marginal | Close browser tabs before loading 7B |
| M2 Pro / M3 (16–18 GB) | 16–18 GB | 7B–13B Q4 via MLX or llama.cpp | Metal acceleration is the default win |
| M3 Max / M4 Pro (36–48 GB) | 36–48 GB | 13B–32B Q4 | Strong default for offline dev workflows |

MLX (Apple's array framework) is worth benchmarking on Mac if you ship a native desktop app. llama.cpp with Metal backend is the portable fallback.

Mobile and embedded

| Platform | Accelerator | Typical model size | Latency target / typical use |
|---|---|---|---|
| Android flagship (Snapdragon 8 Gen 3) | Hexagon NPU | 1B–3B INT4 | < 200 ms first token for short prompts |
| iPhone 15 Pro and newer | Neural Engine + GPU | 1B–3B via Core ML or MLX | On-device Siri-style tasks, not long chat |
| Raspberry Pi 5 + Hailo / Coral | USB accelerator | Sub-100M param vision | Detection, not generative chat |
| Jetson Orin Nano | 8–16 GB, CUDA | 7B Q4 with tuning | Robotics, CV pipelines |

Phones are great for classification, transcription snippets, and tiny assistants. Long-context tutoring still belongs on desktop or hybrid cloud for most teams.

Runtime stack: what to actually install

| Runtime | Best when | Tradeoff |
|---|---|---|
| Ollama | You want a daemon + OpenAI-compatible API in ten minutes | Less granular than raw llama.cpp; great for prototypes |
| llama.cpp | You need max control, embedded binaries, or custom builds | You manage models and flags yourself |
| MLX | macOS-only apps targeting Apple Silicon | Not portable to Windows/Linux |
| ONNX Runtime | Classical ML + small transformers, cross-platform | LLM story is improving but not the default for 7B chat |
| TensorRT-LLM / vLLM | Server-grade NVIDIA deployment at the edge rack | Overkill for a student laptop |

For desktop apps (Electron, Tauri, native), the pattern I use: bundle or detect Ollama, fall back to llama.cpp CLI or server mode, and keep cloud APIs behind an explicit user toggle.
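That fallback order can be written down as a small pure function, which keeps the decision testable apart from any network probing. This is a sketch under assumptions: the `BackendStatus` fields and `selectBackend` are hypothetical names, and how each flag gets populated (port probe, bundled binary check, settings toggle) is left to the app.

```typescript
type Backend = "ollama" | "llama.cpp" | "cloud" | "none";

interface BackendStatus {
  ollamaReachable: boolean;    // e.g. the local Ollama port answered a probe
  llamaCppAvailable: boolean;  // e.g. a bundled llama.cpp binary was found
  cloudEnabledByUser: boolean; // explicit opt-in toggle, off by default
}

// Prefer local runtimes in order; only consider cloud if the user opted in.
function selectBackend(s: BackendStatus): Backend {
  if (s.ollamaReachable) return "ollama";
  if (s.llamaCppAvailable) return "llama.cpp";
  if (s.cloudEnabledByUser) return "cloud";
  return "none";
}

console.log(
  selectBackend({ ollamaReachable: false, llamaCppAvailable: true, cloudEnabledByUser: false })
);
```

Returning `"none"` rather than throwing lets the UI show an install prompt instead of an error.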

Architecture patterns for edge AI apps

Pattern 1: Local-first with optional cloud escalation

flowchart LR
  subgraph device [User Device]
    UI[App UI]
    CTX[Context Builder]
    LOC[Local Runtime Ollama or llama.cpp]
    IDX[Local Vector Index]
    UI --> CTX
    CTX --> IDX
    CTX --> LOC
  end
  subgraph optional [Optional Cloud]
    API[Frontier API]
  end
  LOC -->|draft answer| UI
  CTX -->|user opts in hard question| API
  API --> UI

The local runtime handles 80% of requests: summarize, quiz, rewrite notes. Cloud handles multi-hop reasoning only when the user accepts data leaving the device.

Pattern 2: Preprocessing on device, aggregation in cloud

Factory cameras run a 50 MB defect detector on an edge GPU. Only anomaly frames upload. You cut bandwidth by orders of magnitude and keep most frames private.
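The upload filter in this pattern is little more than a threshold on the detector's score. The sketch below assumes a hypothetical `FrameResult` shape with a score between 0 and 1; the detector itself and the threshold value are outside its scope.

```typescript
interface FrameResult {
  frameId: number;
  anomalyScore: number; // assumed to be in [0, 1], produced by the edge model
  bytes: number;        // encoded size of the frame
}

// Keep only frames at or above the threshold and report the bandwidth saved.
function selectForUpload(frames: FrameResult[], threshold: number) {
  const upload = frames.filter((f) => f.anomalyScore >= threshold);
  const totalBytes = frames.reduce((sum, f) => sum + f.bytes, 0);
  const uploadBytes = upload.reduce((sum, f) => sum + f.bytes, 0);
  const savedFraction = totalBytes === 0 ? 0 : 1 - uploadBytes / totalBytes;
  return { upload, savedFraction };
}

const sampleFrames: FrameResult[] = [
  { frameId: 1, anomalyScore: 0.1, bytes: 100 },
  { frameId: 2, anomalyScore: 0.9, bytes: 100 },
  { frameId: 3, anomalyScore: 0.2, bytes: 100 },
  { frameId: 4, anomalyScore: 0.95, bytes: 100 },
];
console.log(selectForUpload(sampleFrames, 0.8));
```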

Pattern 3: Fully air-gapped

No outbound inference at all. Model weights ship on USB or internal mirror. Updates are quarterly, not continuous. Common in defense, finance back-office, and some healthcare imaging workflows.

I default to Pattern 1 for consumer software. Users get offline value day one; power users opt into cloud quality.

Use cases that justify edge AI

| Use case | Why local wins | Cloud still helps when |
|---|---|---|
| Study and education apps | Course videos and notes are sensitive; offline flights and dorms are real | User asks for research beyond course scope |
| Code completion in enterprise | Repo snippets must not hit vendor logs | Rare architecture questions needing huge context |
| Voice wake word + command | Sub-100 ms latency, always on | Open-ended conversation |
| Vision QC on factory line | 30 fps camera streams cannot all upload | Retraining on aggregated defect trends |
| Medical draft assist (non-diagnostic) | HIPAA-adjacent data minimization | Specialist second opinion workflows |

If your feature works without network, ship it without network. Make cloud a conscious upgrade, not a hidden dependency.

Ollama: fastest path to on-device inference

Ollama wraps model pull, GPU detection, and an HTTP API. For desktop apps, it is the quickest way to prove local AI before you optimize binaries.

Install and pull a model

# Linux (including WSL). On macOS and Windows, use the installer from ollama.com/download instead.
curl -fsSL https://ollama.com/install.sh | sh

# Small model for 8–16 GB machines
ollama pull llama3.2:3b

# Better drafting on 16 GB+ with GPU or Apple Silicon
ollama pull qwen2.5:7b-instruct-q4_K_M

Smoke test in the terminal

ollama run llama3.2:3b "Explain gradient descent in two paragraphs for a CS student."

Call from your app (native or OpenAI-compatible chat)

Ollama serves an OpenAI-compatible endpoint at http://localhost:11434/v1/chat/completions out of the box, with no extra mode to enable. The example below uses its native chat API at /api/chat instead:

curl http://localhost:11434/api/chat -d '{
  "model": "llama3.2:3b",
  "messages": [
    {"role": "system", "content": "You are a concise study assistant."},
    {"role": "user", "content": "List three differences between stack and heap memory."}
  ],
  "stream": false
}'

TypeScript example (Electron main process or local Node.js server)

const OLLAMA = "http://127.0.0.1:11434";

type ChatMessage = { role: "system" | "user" | "assistant"; content: string };

export async function askLocalLlm(
  model: string,
  messages: ChatMessage[]
): Promise<string> {
  const res = await fetch(`${OLLAMA}/api/chat`, {
    method: "POST",
    headers: { "Content-Type": "application/json" },
    body: JSON.stringify({ model, messages, stream: false }),
  });

  if (!res.ok) {
    throw new Error(`Ollama error ${res.status}: ${await res.text()}`);
  }

  const data = (await res.json()) as { message: { content: string } };
  return data.message.content;
}

Production tip: probe GET /api/tags on startup. If Ollama is missing, show "Enable local AI" with install docs instead of failing mid-question.
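A startup probe along those lines might look like the sketch below. The `GetJson` type is a minimal stand-in for `fetch` so the function can be exercised without a network. `probeOllama` and `ProbeResult` are hypothetical names, and the only part of the `/api/tags` response relied on is the `models[].name` field.

```typescript
// Minimal fetch-like signature; in an app, pass (url) => fetch(url).
type GetJson = (url: string) => Promise<{ ok: boolean; json(): Promise<unknown> }>;

type ProbeResult =
  | { available: true; models: string[] }
  | { available: false; reason: string };

async function probeOllama(
  get: GetJson,
  baseUrl = "http://127.0.0.1:11434"
): Promise<ProbeResult> {
  try {
    const res = await get(`${baseUrl}/api/tags`);
    if (!res.ok) return { available: false, reason: "unexpected HTTP status" };
    const body = (await res.json()) as { models?: { name: string }[] };
    // An empty list means the daemon runs but no model has been pulled yet.
    return { available: true, models: (body.models ?? []).map((m) => m.name) };
  } catch {
    // A rejected request usually means the daemon is not installed or not running.
    return { available: false, reason: "not reachable" };
  }
}
```

The UI can then branch on `available` to show either the model picker or the "Enable local AI" prompt.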

llama.cpp: when you need control and smaller footprints

llama.cpp is the reference implementation that many other runtimes build on or fork. You give up convenience in exchange for control over binary size and compile flags, and for the ability to embed inference in C++, Rust, or Go hosts.

Download a GGUF model

Grab a quantized build from Hugging Face (look for Q4_K_M as a balanced default):

mkdir -p models
# Example: 3B instruct, ~2 GB on disk
wget -O models/llama-3.2-3b-instruct.Q4_K_M.gguf \
  "https://huggingface.co/bartowski/Llama-3.2-3B-Instruct-GGUF/resolve/main/Llama-3.2-3B-Instruct-Q4_K_M.gguf"

Interactive CLI

./llama-cli \
  -m models/llama-3.2-3b-instruct.Q4_K_M.gguf \
  -p "Summarize binary search in 3 bullet points." \
  -n 200 \
  -c 4096 \
  --temp 0.7

Flags worth knowing:

  • -ngl 99 offloads layers to GPU (NVIDIA/Metal/Vulkan depending on build)
  • -c sets context window (RAM scales with context)
  • -t sets CPU threads on machines without GPU offload
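If the app starts llama.cpp as a child process, those flags can be assembled from a typed options object instead of string concatenation. This is a sketch: `LlamaOptions` and `buildLlamaArgs` are hypothetical names, and only the flags listed above plus `-m` are used.

```typescript
interface LlamaOptions {
  modelPath: string;     // passed to -m
  contextTokens: number; // passed to -c
  gpuLayers?: number;    // passed to -ngl; omit on CPU-only machines
  threads?: number;      // passed to -t
}

// Returns an argv array suitable for a process spawner.
function buildLlamaArgs(o: LlamaOptions): string[] {
  const args = ["-m", o.modelPath, "-c", String(o.contextTokens)];
  if (o.gpuLayers !== undefined) args.push("-ngl", String(o.gpuLayers));
  if (o.threads !== undefined) args.push("-t", String(o.threads));
  return args;
}

console.log(
  buildLlamaArgs({
    modelPath: "models/llama-3.2-3b-instruct.Q4_K_M.gguf",
    contextTokens: 4096,
    gpuLayers: 99,
  }).join(" ")
);
```

Passing an array to something like Node's `child_process.spawn` avoids shell quoting problems when the model path contains spaces.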

Server mode for app integration

./llama-server \
  -m models/llama-3.2-3b-instruct.Q4_K_M.gguf \
  --host 127.0.0.1 \
  --port 8080 \
  -c 8192

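Then POST to the native /completion endpoint or the OpenAI-compatible /v1/chat/completions. The fetch pattern is the same as for Ollama, but you swap the base URL and path, and the OpenAI-style response carries the reply in choices[0].message.content rather than message.content.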
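Because Ollama's native chat API and the OpenAI-style endpoint return different response shapes, a thin adapter keeps the rest of the app runtime-agnostic. The sketch below is an assumption about how one might structure that; `chatEndpoint` and `extractReply` are hypothetical helpers, and only the fields shown are read.

```typescript
// Ollama native /api/chat (non-streaming) reply.
type NativeChatResponse = { message: { content: string } };
// OpenAI-style /v1/chat/completions reply.
type OpenAiChatResponse = { choices: { message: { content: string } }[] };

function chatEndpoint(runtime: "ollama" | "llama-server", baseUrl: string): string {
  return runtime === "ollama"
    ? `${baseUrl}/api/chat`
    : `${baseUrl}/v1/chat/completions`;
}

// Pulls the assistant text out of either response shape.
function extractReply(data: NativeChatResponse | OpenAiChatResponse): string {
  if ("choices" in data) {
    const first = data.choices[0];
    if (!first) throw new Error("empty choices array");
    return first.message.content;
  }
  return data.message.content;
}

console.log(chatEndpoint("llama-server", "http://127.0.0.1:8080"));
console.log(extractReply({ choices: [{ message: { content: "hello" } }] }));
```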

When I pick llama.cpp over Ollama: shipping a single static binary inside an installer, CI builds that must not depend on a background daemon, or squeezing inference onto hardware Ollama does not support yet.

Case study: Study Stream Black and optional on-device AI

I build Study Stream Black, an offline-first Electron desktop app for downloaded courses, timestamped notes, and focus tooling. Core playback and notes never require the cloud.

The Study Room AI feature is different. It needs lesson context: current video, subtitles, notes at timestamps. Today that can route through cloud APIs (e.g. Gemini) when the user opts in.

The edge AI roadmap looks like this:

flowchart TB
  subgraph study [Study Stream Desktop]
    PLAYER[Video Player]
    SUBS[Subtitle Parser]
    NOTES[Timestamped Notes]
    ROOM[Study Room UI]
    PLAYER --> SUBS
    PLAYER --> NOTES
    SUBS --> CTX[Lesson Context Bundle]
    NOTES --> CTX
    ROOM --> CTX
  end
  subgraph inference [Inference Layer]
    OLL[Ollama localhost]
    CLOUD[Optional Cloud Model]
  end
  CTX -->|offline mode| OLL
  CTX -->|user enables cloud tutor| CLOUD
  OLL --> ROOM
  CLOUD --> ROOM

Design choices that matter:

  1. Context stays assembled locally. Only the text bundle (subtitle window + note excerpts) crosses the inference boundary, not the video file.
  2. Local model handles quizzes and recap. "Give me five questions from the last ten minutes of subtitles" is a great 3B–7B task.
  3. Cloud is labeled. UI copy says what leaves the device. No silent upload.
  4. Degrade gracefully. If Ollama is down, Study Room still shows notes and bookmarks. AI is enhancement, not gatekeeping.
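To make the first two choices concrete, here is a minimal sketch of a lesson context bundle. It is not the app's actual implementation: every type and field name is hypothetical, the ten-minute window matches the quiz example above, and the character cap is an assumed stand-in for a proper token budget.

```typescript
interface SubtitleCue { startSec: number; endSec: number; text: string }
interface Note { atSec: number; text: string }
interface ContextBundle {
  windowStartSec: number;
  windowEndSec: number;
  subtitles: string;
  notes: string[];
}

// Collects subtitle text and notes from the last `windowSec` seconds of playback.
function buildContextBundle(
  cues: SubtitleCue[],
  notes: Note[],
  nowSec: number,
  windowSec = 600,
  maxChars = 6000
): ContextBundle {
  const start = Math.max(0, nowSec - windowSec);
  const inWindow = cues.filter((c) => c.endSec > start && c.startSec <= nowSec);
  let subtitles = inWindow.map((c) => c.text.trim()).join(" ");
  // If the window is too long for a small model, keep the most recent text.
  if (subtitles.length > maxChars) {
    subtitles = subtitles.slice(subtitles.length - maxChars);
  }
  const noteTexts = notes
    .filter((n) => n.atSec >= start && n.atSec <= nowSec)
    .map((n) => n.text);
  return { windowStartSec: start, windowEndSec: nowSec, subtitles, notes: noteTexts };
}
```

The bundle is plain JSON-serializable text, so the same object can go to a local runtime or, after opt-in, to a cloud model without touching the video file.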

That is the product pattern edge AI enables: offline core, intelligent optional layer.

Edge vs cloud: honest tradeoffs

| Dimension | On-device | Cloud API |
|---|---|---|
| Privacy | Data stays on machine by default | Requires trust, DPA, zero-retention clauses |
| Latency | 15–80 tok/s typical on laptop; no network RTT | Fast models, but adds 100–400 ms network + queue |
| Model quality | 3B–13B practical; frontier reasoning unavailable | GPT-4 class, tools, huge context |
| Cost at scale | Upfront hardware; near-zero marginal per token | Per-token billing; predictable until you go viral |
| Updates | You ship model + runtime updates | Vendor improves weights silently |
| Compliance | Easier data residency story | Depends on region and contract |

Hybrid cascade is the pattern I recommend: classify the request locally (short prompt, keyword rules, or a 1B classifier). Easy tasks never leave the machine. Hard tasks show a confirmation dialog before cloud call.
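The cheapest version of that classifier is a length check plus keyword rules. The sketch below is illustrative only: `routeRequest`, the hint list, and the length cut-off are all hypothetical and would need tuning against real prompts.

```typescript
type Route = "local" | "ask-before-cloud";

// Hypothetical phrases that tend to signal multi-step or out-of-scope questions.
const HARD_HINTS = ["compare", "prove", "research", "step by step"];

function routeRequest(
  prompt: string,
  cloudAllowed: boolean,
  maxLocalChars = 2000
): Route {
  // With cloud disabled, everything stays on the device.
  if (!cloudAllowed) return "local";
  const text = prompt.toLowerCase();
  const looksHard =
    text.length > maxLocalChars || HARD_HINTS.some((hint) => text.includes(hint));
  // "ask-before-cloud" means: show the confirmation dialog, do not send yet.
  return looksHard ? "ask-before-cloud" : "local";
}

console.log(routeRequest("Summarize this slide", true));
console.log(routeRequest("Compare quicksort and mergesort, then prove the bounds", true));
```

The function never sends anything itself. It only decides whether the UI should ask, which keeps the consent step in one place.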

What breaks if you ignore constraints

I have hit these repeatedly:

  • Loading 13B on an 8 GB laptop swaps the OS to death. Default to 3B, detect RAM, offer upsell paths.
  • Giant context windows eat RAM linearly. A 32k context on a 7B model can need 8+ GB just for the KV cache, depending on the attention layout and cache precision. Chunk subtitles instead of dumping entire courses.
  • First-token latency kills chat UX. Warm the model on app launch or show "Starting local AI…" with a real progress state.
  • Battery drain on laptops without GPU efficiency. Offer a "battery saver" mode that disables local inference when the laptop is on battery and below 20% charge.
  • Stale local weights. Cloud models update monthly; your bundled 2024 weights will feel dumb. Plan a model update channel.
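The KV cache point can be checked with arithmetic before loading anything. The sketch below uses the standard estimate (keys plus values, per layer, per KV head, per token) and assumes a 16-bit cache. The two shapes are illustrative: one in the style of Llama 2 7B with full multi-head attention, one with grouped-query attention and 8 KV heads as in Llama 3 8B. Real usage also depends on the runtime and any cache quantization.

```typescript
interface AttentionShape { layers: number; kvHeads: number; headDim: number }

// KV cache size in GiB for a given context length.
function kvCacheGib(
  shape: AttentionShape,
  contextTokens: number,
  bytesPerValue = 2 // 16-bit cache entries (assumption)
): number {
  // Factor 2 covers keys and values.
  const bytes =
    2 * shape.layers * shape.kvHeads * shape.headDim * bytesPerValue * contextTokens;
  return bytes / 2 ** 30;
}

const fullAttention: AttentionShape = { layers: 32, kvHeads: 32, headDim: 128 };
const groupedQuery: AttentionShape = { layers: 32, kvHeads: 8, headDim: 128 };

console.log(kvCacheGib(fullAttention, 32768)); // 16
console.log(kvCacheGib(groupedQuery, 32768));  // 4
console.log(kvCacheGib(groupedQuery, 4096));   // 0.5
```

Under these assumptions, cutting context from 32k to 4k shrinks the cache eightfold, which is why chunking subtitles matters more than shaving model size.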

FAQ

What is the smallest useful on-device LLM?

For English Q&A and summarization, 1B–3B instruct models at Q4 quantization are the floor. Below that, hallucination rates spike on technical topics. For coding help, 7B is the practical minimum on 16 GB hardware.

Can edge AI run without a GPU?

Yes. llama.cpp on CPU is how most developers start. Expect 10–25 tok/s on a modern 8-core laptop for 7B Q4. Usable for study helpers; painful for long codegen sessions.

Ollama vs llama.cpp for production desktop apps?

Ollama if you want fast integration and can ask users to install a daemon. llama.cpp if you ship everything inside your installer or need fine-grained binary control. Many teams prototype on Ollama and embed llama.cpp for release.

Does on-device AI mean zero data collection?

No. Local inference removes cloud prompt logging, but your app can still phone home analytics. If you market privacy, audit your crash reporters and feature flags too.

How does edge AI relate to federated learning?

Edge inference runs a finished model locally. Federated learning trains models across devices without centralizing raw data. They solve different problems. Most app developers only need inference.

Will NPUs replace GPUs for LLMs on laptops?

NPUs excel at sustained low-power workloads (webcam effects, small classifiers). LLM decoding still favors the GPU (CUDA or Metal) or the CPU today. Watch NPU drivers in 2027, but do not block shipping on them.

What to do next

  1. Benchmark llama3.2:3b and qwen2.5:7b on your target hardware with Ollama. Log tok/s and RAM.
  2. Sketch a context bundle format (JSON of local text slices) before wiring UI.
  3. Add an explicit cloud escalation toggle with plain-language data disclosure.
  4. Read the companion post on local LLMs and privacy for RAG and compliance detail.
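For step 1, Ollama's completed (non-streaming) responses include timing fields that make tok/s easy to log. The sketch below assumes the `eval_count` and `eval_duration` fields, with durations in nanoseconds; the `OllamaTimings` interface and `decodeTokensPerSecond` are hypothetical names.

```typescript
// Subset of the timing fields in a completed Ollama response.
interface OllamaTimings {
  eval_count: number;    // number of generated tokens
  eval_duration: number; // generation time in nanoseconds
}

// Decode speed in tokens per second; returns 0 if the duration is missing or zero.
function decodeTokensPerSecond(t: OllamaTimings): number {
  if (t.eval_duration <= 0) return 0;
  return t.eval_count / (t.eval_duration / 1e9);
}

// Example: 120 tokens generated in 4 seconds.
console.log(decodeTokensPerSecond({ eval_count: 120, eval_duration: 4e9 })); // 30
```

Logging this next to the model name and context length gives a small table you can compare against the planning numbers earlier in the post.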

Edge AI is not about avoiding the cloud forever. It is about choosing which bytes never need to leave the machine, and building features that still work when the network is gone.


Rohit Singh builds Study Stream Black and writes about shipping desktop software from Jaipur.