Local AI Distillation on WebGPU vs Cloud: When to Use Each
Distillation is the step that turns a raw conversation into useful memory. MindLock can run it two ways: locally on your device using WebLLM and your GPU, or in the cloud using Gemini on the Pro plan. Both produce the same output shape. The question is which one to use when.
This post is a practical guide — what each mode is, what it costs in time and privacy, and how to decide.
What Distillation Does
A conversation is a long sequence of messages. A memory document is a compact, structured summary of what is actually worth keeping. Distillation reads the conversation and writes:
- Profile memory — durable facts about you and your work.
- Topic memories — focused documents grouped by theme.
The output is markdown you can read, edit, search, and feed back into any AI as context. Full context on this: Memory Documents.
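To make the output shape concrete, here is a minimal TypeScript sketch. The type names (`ProfileMemory`, `TopicMemory`, `DistillationResult`) and the `renderMarkdown` helper are hypothetical, invented for this post; they are not MindLock's actual schema.

```typescript
// Hypothetical shapes for illustration only; not MindLock's actual schema.
interface ProfileMemory {
  facts: string[]; // durable facts about you and your work
}

interface TopicMemory {
  title: string; // the theme this document groups
  points: string[]; // the focused notes worth keeping
}

interface DistillationResult {
  profile: ProfileMemory;
  topics: TopicMemory[];
}

// Render a result as plain markdown that can be read, edited, and searched.
function renderMarkdown(result: DistillationResult): string {
  const lines: string[] = ["# Profile"];
  for (const fact of result.profile.facts) {
    lines.push(`- ${fact}`);
  }
  for (const topic of result.topics) {
    lines.push("", `## ${topic.title}`);
    for (const point of topic.points) {
      lines.push(`- ${point}`);
    }
  }
  return lines.join("\n");
}

const example: DistillationResult = {
  profile: { facts: ["Prefers TypeScript for tooling"] },
  topics: [
    { title: "Search index", points: ["Decided to keep embeddings on-device"] },
  ],
};
console.log(renderMarkdown(example));
```

Whatever the real schema looks like, the point is the same: both distillation modes fill in one shape, so everything downstream treats them identically.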
Local Distillation via WebLLM
Local mode runs an LLM inside your browser tab using WebGPU. The model weights are downloaded once and cached, and inference then runs on your GPU.
Three models are offered:
| Tier | Use when |
|---|---|
| Fast | Low-end GPU or you want quick turnaround and are okay with shorter summaries. |
| Balanced | Default choice for a modern laptop. Good quality, reasonable speed. |
| Quality | Desktop GPU with plenty of VRAM. Slowest, best summaries. |
What you get:
- Privacy: the conversation never leaves your device. Not even to MindLock's servers.
- Offline: works with no network after the first model download.
- Cost: free.
What you pay:
- Speed: a local model is slower than a hosted model, especially on laptops.
- First-run cost: the model download is multi-GB. Plan for it once.
- Hardware floor: WebGPU-capable browser and a GPU with enough memory for the tier you pick.
Model selection and loading lives in Settings.
Cloud Distillation via Gemini
Cloud mode (Pro) sends the conversation to Gemini 3.0 Flash for distillation. Pro includes 100 operations per month.
What you get:
- Speed: dramatically faster than local for long conversations.
- Quality ceiling: a frontier hosted model beats what you can run in-browser.
- No hardware floor: works on any device, including phones.
What you pay:
- $5/month for the Pro plan.
- Cloud transit: the conversation is sent to Gemini for processing. If the conversation itself is sensitive, this matters.
- Quota: 100 operations/month. Heavy users should track their usage.
How to Decide
A simple heuristic:
| Situation | Use |
|---|---|
| Sensitive conversation (client, legal, medical, strategy) | Local. The content never leaves your device. |
| Long conversation, tight deadline | Cloud. The speed difference is real. |
| Low-power laptop, casual work | Cloud if you're on Pro; otherwise Local Fast. |
| Offline or on a plane | Local. Nothing else works. |
| You want the best summary quality | Cloud for most cases, Local Quality if the content must stay local. |
| First time trying MindLock | Local Balanced. See what it does for free before paying. |
You are not locked in. Pro users can still run local distillation any time. Free users always run local. The two modes are a menu, not a commitment.
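The table above can be read as a small decision function. The sketch below is one possible encoding, not how MindLock decides anything internally: the `Situation` fields and the `chooseDistillation` name are made up for illustration, and two branches (a low-power device with sensitive content, and the final fallback) are assumptions the table does not spell out.

```typescript
type Mode = "local" | "cloud";
type Tier = "fast" | "balanced" | "quality";

// Hypothetical inputs; each field corresponds to a row of the table.
interface Situation {
  sensitive: boolean; // client, legal, medical, strategy
  offline: boolean; // no network available
  onPro: boolean; // cloud mode requires the Pro plan
  wantsSpeed: boolean; // long conversation, tight deadline
  wantsBestQuality: boolean;
  lowPowerDevice: boolean;
}

interface Choice {
  mode: Mode;
  tier?: Tier; // only meaningful for local mode
}

function chooseDistillation(s: Situation): Choice {
  // Hard constraints first: in these cases cloud is not an option.
  if (s.sensitive || s.offline || !s.onPro) {
    // Assumption: a low-power device falls back to Fast even for sensitive content.
    if (s.lowPowerDevice) return { mode: "local", tier: "fast" };
    return { mode: "local", tier: s.wantsBestQuality ? "quality" : "balanced" };
  }
  // Pro user, online, non-sensitive content: cloud wins on speed and quality.
  if (s.wantsSpeed || s.wantsBestQuality || s.lowPowerDevice) {
    return { mode: "cloud" };
  }
  // Assumption: with no pressure either way, stay local and save quota.
  return { mode: "local", tier: "balanced" };
}
```

The order of the checks carries the argument: privacy and connectivity rule options out, and only then do speed and quality get a vote.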
A Mixed Workflow That Works Well
Many users end up on a hybrid:
- Sensitive conversations → Local distillation.
- Bulk, long, non-sensitive conversations → Cloud distillation to save time.
- Everything ends up in the same memory store and the same semantic search.
You pay only for the speed you actually need, on the conversations where the tradeoff makes sense.
Embeddings and Search Are Always Local
Worth calling out: the semantic search index — what powers Ctrl+K across all your content — runs on-device regardless of which distillation mode you pick. Your search queries don't leave your machine.
Start
If you are new, open the Dashboard, load a local model in Settings, and run a distillation on a real conversation. You will know within one run whether local is fast enough for how you work. If it isn't, Free vs Pro lays out the cloud option honestly.
What WebGPU Actually Is and Why It Matters Here
A short detour, because the distinction shapes everything else. WebGPU is a relatively new browser API that gives JavaScript real access to the GPU, not just for rendering but for general-purpose compute. Before WebGPU, running a meaningful language model in the browser was impractical at usable speeds. With WebGPU, web pages can use the GPU much as a native app would, and running a 3B–7B parameter LLM in a tab becomes feasible.
The implications for privacy and ownership are concrete:
- The model runs in your browser process. It is not a Chrome extension reaching out to a cloud endpoint. It is the same JavaScript runtime that renders the page, executing model inference on your local hardware.
- There is no required server round-trip. After the initial weight download, the model produces tokens entirely on your device. Network latency stops mattering; bandwidth stops mattering.
- Browser sandboxing applies. The model can't read files outside what the page is given access to. The privacy story isn't just "local execution"; it's "local execution inside a sandbox."
This is why local distillation is an actual privacy primitive rather than marketing copy. The conversation is read by code running in a browser tab on your machine. Nothing about the inference produces a network packet that touches a model provider.
The downside mirrors the benefit: browser-based inference is slower than a hosted GPU running the same model class, because your laptop's hardware is what it is and the browser adds some overhead compared with native code. WebGPU narrows the gap from "impossible" to "noticeably slower", and that gap is what makes the local-vs-cloud decision interesting.
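If you are curious whether a given browser can run local mode at all, the usual check is to look for `navigator.gpu` and then ask it for an adapter. The sketch below follows that pattern. `NavigatorLike`, `GpuLike`, `hasWebGpuApi`, and `canRunLocally` are names invented here so the snippet compiles without WebGPU typings; only `navigator.gpu` and `requestAdapter()` belong to the WebGPU API itself, and nothing here is taken from MindLock's code.

```typescript
// Minimal structural types so the sketch compiles without WebGPU typings.
interface GpuLike {
  requestAdapter(): Promise<object | null>;
}
interface NavigatorLike {
  gpu?: GpuLike;
}

// True when the browser exposes the WebGPU entry point at all.
function hasWebGpuApi(nav: NavigatorLike): boolean {
  return nav.gpu !== undefined;
}

// Resolves to true only if the browser actually hands back an adapter.
async function canRunLocally(nav: NavigatorLike): Promise<boolean> {
  if (nav.gpu === undefined) return false;
  try {
    const adapter = await nav.gpu.requestAdapter();
    return adapter !== null;
  } catch {
    return false; // treat any failure as "no usable GPU"
  }
}

// In a browser you would pass the real navigator object, for example:
//   canRunLocally(navigator as NavigatorLike).then((ok) => console.log(ok));
```

Both steps matter: a browser can expose the API and still return no adapter, for instance when the GPU or driver is unsupported.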
Choosing a Model Tier in More Detail
The Fast / Balanced / Quality tiers map to actual model sizes, and the right pick depends on three things that are easy to assess.
Your GPU's available memory. WebGPU models load weights into VRAM. Quality tier wants more VRAM than Fast. Pick the highest tier that fits comfortably; if loading freezes the tab or crashes mid-distillation, drop a tier.
Your tolerance for first-run friction. The model weight download is multi-gigabyte and only happens once per browser profile. If you're on a metered or slow connection, Fast loads quicker and gets you to a first successful distillation sooner. You can switch to Quality later when you're on a better network.
The shape of the conversations you distill. Short, dense conversations distill well at every tier — Fast is often enough. Long, rambling conversations benefit from a stronger model that can hold more context coherently. If you find yourself disappointed in the summaries from a thin conversation, the model isn't usually the problem; the conversation is.
A reasonable default: start at Balanced, distill ten real conversations, and decide whether summary quality is the bottleneck. If yes, try Quality and see if the upgrade pays for itself in summary usefulness. If no, stay where you are and save the VRAM for browser tabs.
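The "highest tier that fits comfortably, drop a tier if it crashes" rule is simple enough to write down. In this sketch the VRAM figures are placeholders chosen for illustration, not the real requirements of MindLock's models, and `pickTier` and `dropTier` are hypothetical helper names.

```typescript
type ModelTier = "fast" | "balanced" | "quality";

// Hypothetical VRAM needs in GB, ordered from heaviest to lightest.
// The real figures depend on the models behind each tier.
const TIER_VRAM_GB: Array<[ModelTier, number]> = [
  ["quality", 8],
  ["balanced", 4],
  ["fast", 2],
];

// Pick the highest tier that fits with some headroom left over.
function pickTier(freeVramGb: number, headroomGb = 1): ModelTier | null {
  for (const [tier, needGb] of TIER_VRAM_GB) {
    if (freeVramGb >= needGb + headroomGb) return tier;
  }
  return null; // nothing fits: local mode is not practical on this GPU
}

// If loading froze the tab or crashed mid-distillation, step down one tier.
function dropTier(tier: ModelTier): ModelTier | null {
  if (tier === "quality") return "balanced";
  if (tier === "balanced") return "fast";
  return null; // already at the lightest tier
}

console.log(pickTier(6)); // "balanced" under the assumed thresholds
```

The headroom parameter stands in for "comfortably": the browser and everything else on screen also want GPU memory, so a tier that only just fits tends to be the one that crashes.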
Cloud Distillation: When the Tradeoff Makes Sense
The cloud-mode case is easier to articulate when you've used both. A few situations where it's clearly the right choice:
- You imported a back catalog. Suddenly you have fifty conversations to distill. Local mode does fifty distillations sequentially on your GPU; cloud mode does them in parallel with frontier-quality output. The first time you process a real archive, cloud mode is dramatically less painful.
- You're on a phone or tablet. WebGPU support on mobile is uneven; even when it works, the GPU is usually too limited for anything but the lightest tier. Cloud mode is often the only practical option for on-the-go distillation.
- The conversation isn't sensitive and you're under time pressure. A long meeting transcript that needs to be useful in twenty minutes is a textbook case for cloud distillation. The privacy tradeoff is minor (in many cases a hosted model has already processed the meeting) and the speed difference is real.
The Pro plan's 100 operations per month is the ceiling on cloud use. Heavy users sometimes hit it during back-catalog passes; once they reach a steady state, most users stay well under. Track usage in Dashboard settings if you're unsure.
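Before a back-catalog pass it helps to do the quota arithmetic up front. The sketch below assumes one distillation consumes one operation, which this post does not actually state, and the timing inputs to `estimateSeconds` are yours to supply. `planBatch` and `estimateSeconds` are hypothetical names.

```typescript
const MONTHLY_QUOTA = 100; // Pro plan operations per month

interface BatchPlan {
  cloudCount: number; // conversations that fit in the remaining quota
  localCount: number; // overflow that has to run locally or wait
  remainingAfter: number; // operations left once the batch is done
}

// Assumption: one distillation consumes exactly one operation.
function planBatch(conversations: number, usedThisMonth: number): BatchPlan {
  const remaining = Math.max(0, MONTHLY_QUOTA - usedThisMonth);
  const cloudCount = Math.min(conversations, remaining);
  return {
    cloudCount,
    localCount: conversations - cloudCount,
    remainingAfter: remaining - cloudCount,
  };
}

// Wall-clock estimate: parallelism 1 models sequential local runs,
// a higher value models cloud requests running side by side.
function estimateSeconds(count: number, secondsEach: number, parallelism: number): number {
  return Math.ceil(count / parallelism) * secondsEach;
}

console.log(planBatch(50, 70)); // { cloudCount: 30, localCount: 20, remainingAfter: 0 }
```

With fifty conversations and seventy operations already spent, thirty go to the cloud and twenty are left for local mode or next month, which is the "hit the ceiling during a back-catalog pass" case in numbers.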
Hybrid Workflows in Practice
The distillation choice does not have to be made once and lived with. Three patterns that work well:
- Sensitivity-tiered. Default to local for anything client-related, financial, medical, or strategic. Default to cloud for everything else. The decision is per-conversation and takes one second.
- Speed-tiered. Default to cloud during work hours when speed matters, default to local in evenings and on weekends when batching old conversations. The privacy implications of each mode stay the same; only the time pressure changes.
- Bootstrap-then-local. Use cloud distillation aggressively in the first month to populate your memory layer fast, then shift to local for steady-state weekly batches once the back catalog is processed.
Whichever pattern you pick, the embeddings powering Ctrl+K search are always computed locally. That layer doesn't change with your distillation choice — only the distillation step itself moves between local and cloud.
What Distillation Quality Actually Looks Like
Worth setting expectations honestly: distilled output is a structured summary, not a verbatim record. Expect the document to capture:
- The decisions made and the reasoning behind them.
- Recurring preferences and constraints.
- Concrete artifacts (code snippets, names, numbers) that appeared in the conversation.
- Open questions or unresolved issues.
Expect it to drop:
- Conversational filler and back-and-forth.
- Reasoning chains the model considered and discarded.
- Tangential discussion that wasn't tied to a decision.
If you find the summary missed something important, edit the memory document. It is plain markdown, so you can add a paragraph by hand, and that human edit becomes the canonical version the next time you generate context.
Cost Comparison Over a Year
A back-of-the-envelope calculation helps if you're undecided. Suppose you distill twenty conversations a month, a typical pace for someone treating AI as a primary tool.
Local mode. Free in software. The cost is the time spent waiting for distillation: at Balanced tier on a modern laptop, a typical conversation distills in tens of seconds, so twenty conversations add up to roughly ten minutes per month (at, say, 30 seconds each). The hardware cost is borne whether you use the GPU or not. Total monthly cost: zero dollars, about ten minutes of waiting.
Cloud mode. $5 per month for Pro. Twenty conversations a month is well within the 100-operation quota. Distillation is dramatically faster, so the time cost falls below local. Total monthly cost: five dollars, less attention.
Over a year that's $60 — the price of a cheap dinner — for noticeably faster distillation across your back catalog and freedom from worrying about local hardware. If your time is worth anything in dollar terms, Pro pays for itself; the question is whether the privacy posture of cloud distillation is acceptable for the conversations you're processing.
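The same arithmetic as a small sketch you can adjust. The price and quota come from this post; the per-conversation timings are assumptions picked to match "tens of seconds" locally and "dramatically faster" in the cloud, so measure your own before trusting the result. `yearlySummary` is a hypothetical helper.

```typescript
const PRO_PRICE_PER_MONTH = 5; // dollars
const PRO_QUOTA_PER_MONTH = 100; // operations

// Assumed timings for illustration only; measure your own.
const LOCAL_SECONDS_EACH = 30; // "tens of seconds" at Balanced tier
const CLOUD_SECONDS_EACH = 5; // hypothetical

function yearlySummary(conversationsPerMonth: number) {
  // Work in whole seconds first to keep the arithmetic exact.
  const secondsSavedPerYear =
    conversationsPerMonth * (LOCAL_SECONDS_EACH - CLOUD_SECONDS_EACH) * 12;
  return {
    proDollarsPerYear: PRO_PRICE_PER_MONTH * 12,
    quotaUsedFraction: conversationsPerMonth / PRO_QUOTA_PER_MONTH,
    minutesSavedPerYear: secondsSavedPerYear / 60,
  };
}

console.log(yearlySummary(20));
// { proDollarsPerYear: 60, quotaUsedFraction: 0.2, minutesSavedPerYear: 100 }
```

Under these assumed timings, twenty conversations a month use a fifth of the quota and save about a hundred minutes of waiting per year. Swap in your own timings and volume; the balance shifts quickly for long conversations and large batches.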
Most users converge on a hybrid: Pro for the convenience of cloud distillation when speed matters, local distillation for sensitive items, the same memory store backing both.
Why You Might Pick Local Even When Cloud Is Faster
A few reasons that aren't about privacy specifically:
- Predictability. Local distillation runs the same way next month as it does today. Cloud quality changes when the underlying provider updates the model — sometimes for the better, sometimes not.
- Offline reliability. If you're on a plane, in a tunnel, or on a flaky conference network, local distillation just works. Cloud doesn't.
- No quota anxiety. A hundred operations a month is plenty for most users until it isn't. If you're processing a back catalog and watching the meter, local removes the meter entirely.
- Cost predictability for teams. A small team using local distillation has a known $0 cost for the distillation step. Pro multiplied by team size adds up. The team pattern is often "everyone uses local, the lead has Pro for batch work."
These are real factors even when the privacy story isn't the one driving the decision.
When Cloud Becomes the Default
Conversely, a few situations where cloud distillation becomes the right default and local becomes the exception:
- You travel between machines. If you're on a desktop one day and a laptop the next, cloud distillation gives you consistent quality regardless of which GPU you're sitting in front of.
- You batch heavily. If you import in bursts — a week of conversations at once rather than one at a time — cloud's parallelism saves enough time to matter. Local serializes; cloud does not.
- Your laptop is older. WebGPU works on a wide range of hardware but runs noticeably slower on older GPUs. If Balanced tier is sluggish and Quality tier won't load, cloud is the path that respects your time.
- You collaborate. If a teammate needs the distilled output quickly, the cloud round-trip is shorter than waiting for a local pass to complete on someone else's hardware.
The right answer is rarely "always local" or "always cloud." It is "local by default for these reasons, cloud by default for those reasons" — and the Settings page lets you switch in two clicks.
Related reading: Generating Context.