From 46% to 90%: Fine-Tuning Tiny LLMs for On-Device Agents — Cormac Brick, Google


AI Engineer | 21:00 | Added May 21, 12:53 am GMT+8

Actionable Insights

Define 30-200 labeled rows of [user_prompt, tool_name, tool_args_json] for one on-device skill.
Use FunctionGemma Tuning Lab or the Google FunctionGemma fine-tuning guide to train, evaluate, and export the model.
Evaluate intent accuracy, exact JSON args, latency, memory, battery, and offline behavior on real devices.
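
The row shape above can be checked before any training run. The following is a minimal sketch, assuming the rows live in a CSV whose header matches the three field names; the sample prompts and tool names (set_flashlight, start_timer, open_doors) are hypothetical and not taken from the talk.

```python
import csv
import io
import json

# Hypothetical sample data; only the three column names come from the notes above.
SAMPLE_CSV = '''user_prompt,tool_name,tool_args_json
"Turn on the flashlight",set_flashlight,"{""on"": true}"
"Set a timer for 5 minutes",start_timer,"{""minutes"": 5}"
"Open the pod bay doors",open_doors,"{not valid json"
'''

REQUIRED_COLUMNS = ("user_prompt", "tool_name", "tool_args_json")


def load_rows(csv_text):
    """Return (valid_rows, problems) for a CSV with the three required columns."""
    reader = csv.DictReader(io.StringIO(csv_text))
    missing = [c for c in REQUIRED_COLUMNS if c not in (reader.fieldnames or [])]
    if missing:
        raise ValueError(f"missing columns: {missing}")

    valid, problems = [], []
    # Line 1 is the header, so data rows start at line 2
    # (assumes no field spans multiple lines).
    for line_no, row in enumerate(reader, start=2):
        prompt = (row["user_prompt"] or "").strip()
        tool = (row["tool_name"] or "").strip()
        try:
            args = json.loads(row["tool_args_json"] or "")
        except json.JSONDecodeError as exc:
            problems.append((line_no, f"bad JSON args: {exc.msg}"))
            continue
        if not prompt or not tool:
            problems.append((line_no, "empty prompt or tool name"))
        elif not isinstance(args, dict):
            problems.append((line_no, "args must be a JSON object"))
        else:
            valid.append(
                {"user_prompt": prompt, "tool_name": tool, "tool_args": args}
            )
    return valid, problems


if __name__ == "__main__":
    rows, issues = load_rows(SAMPLE_CSV)
    print(f"{len(rows)} valid rows, {len(issues)} problems")
    for line_no, message in issues:
        print(f"  line {line_no}: {message}")
```

Parsing tool_args_json into a dict at load time means later exact-match scoring compares structures rather than strings, so key order and whitespace differences do not count as errors.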

Quick implementation checklist

Pick one narrow workflow; define success/failure before using the tool.
Run a baseline without the proposed technique, then repeat with it.
Log latency, cost/tokens, error category, user correction rate, and qualitative trace notes.
Keep a rollback path: do not replace proven deterministic workflows until the agentic version wins on measured reliability.
Document setup commands, model/version, hardware, data size, and cache state so results are reproducible.
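
The baseline-versus-technique comparison in this checklist could be sketched as below. This is an illustration under stated assumptions: predict is any callable that maps a prompt to a dict with tool_name and tool_args, the two stub predictors stand in for a real untuned and tuned model, and the error category names are made up for this example.

```python
import time


def classify(expected, predicted):
    """Bucket one prediction into a coarse, hypothetical error category."""
    if predicted is None:
        return "no_call"
    if predicted.get("tool_name") != expected["tool_name"]:
        return "wrong_tool"
    if predicted.get("tool_args") != expected["tool_args"]:
        return "wrong_args"
    return "ok"


def run_eval(label, predict, rows):
    """Run predict(prompt) on every row and summarize accuracy and latency."""
    if not rows:
        raise ValueError("need at least one eval row")
    traces = []
    for row in rows:
        start = time.perf_counter()
        try:
            predicted = predict(row["user_prompt"])
        except Exception:
            predicted = None  # a crash is counted as "no_call"
        latency_ms = (time.perf_counter() - start) * 1000.0
        traces.append(
            {
                "prompt": row["user_prompt"],
                "category": classify(row, predicted),
                "latency_ms": latency_ms,
            }
        )
    counts = {}
    for trace in traces:
        counts[trace["category"]] = counts.get(trace["category"], 0) + 1
    return {
        "label": label,
        "n": len(rows),
        "accuracy": counts.get("ok", 0) / len(rows),
        "categories": counts,
        "traces": traces,
    }


# Hypothetical eval rows and stub predictors, for illustration only.
ROWS = [
    {"user_prompt": "Turn on the flashlight",
     "tool_name": "set_flashlight", "tool_args": {"on": True}},
    {"user_prompt": "Set a timer for 5 minutes",
     "tool_name": "start_timer", "tool_args": {"minutes": 5}},
]


def baseline_stub(prompt):
    # Stands in for an untuned model that always picks the same tool.
    return {"tool_name": "set_flashlight", "tool_args": {"on": True}}


def candidate_stub(prompt):
    # Stands in for a tuned model; here it is just a lookup table.
    answers = {r["user_prompt"]: r for r in ROWS}
    hit = answers.get(prompt)
    if hit is None:
        return None
    return {"tool_name": hit["tool_name"], "tool_args": hit["tool_args"]}


if __name__ == "__main__":
    for label, fn in (("baseline", baseline_stub), ("candidate", candidate_stub)):
        summary = run_eval(label, fn, ROWS)
        print(label, summary["accuracy"], summary["categories"])
```

Keeping the per-row traces alongside the summary makes it possible to attach qualitative notes to individual failures later.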

Direct links to try / inspect

Google FunctionGemma fine-tuning: https://ai.google.dev/gemma/docs/functiongemma/finetuning-with-functiongemma
Google Developers FunctionGemma guide: https://developers.googleblog.com/a-guide-to-fine-tuning-functiongemma/
Gemini Nano / AICore on Android: https://developer.android.com/ai/gemini-nano

Core thesis

On-device agents become useful when tiny models are specialized for tool/function calling rather than treated as shrunken chatbots.

Big ideas / key insights

The useful pattern is not “let the model figure everything out.” It is to give the agent a narrow, current, inspectable operating surface and then measure whether it improves the actual workflow.
The video is strongest when treated as a workflow design prompt: identify the state, tools, traces, tests, and guardrails needed to make the idea reproducible.
The weakest claims are performance/savings claims without hardware, data, cache-state, or baseline details. Those should be treated as hypotheses until reproduced.

Best timestamped moments with interpretation

Interpretation: these moments define the author’s workflow claim and the technical constraints. I used them as evidence for the verdicts below rather than treating the title/marketing copy as proof.

Practical takeaways / recommended workflow

Recreate the author’s demo on a disposable repo/project first.
Add instrumentation before optimizing: traces, logs, eval rows, and a small human-reviewed failure taxonomy.
Compare against a boring baseline: manual workflow, grep/RAG, existing local runner, or deterministic code path.
Promote only what passes an evaluation gate: better first-pass success, fewer correction turns, lower cost/latency, or clearer operator control.
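
One way to make the evaluation gate explicit is a small function that compares two measured summaries. The sketch below is an assumption-laden example: the field names, the 0.05 success-gain threshold, and the 1.2 latency ratio are all hypothetical, and it deliberately requires every criterion to pass, which is stricter than the "or" in the list above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class EvalSummary:
    first_pass_success: float  # fraction of tasks solved without correction
    correction_turns: float    # mean user corrections per task
    p50_latency_ms: float      # median latency on the target device


def passes_gate(baseline, candidate, min_success_gain=0.05, max_latency_ratio=1.2):
    """Return (ok, reasons); reasons lists every criterion that failed."""
    reasons = []
    if candidate.first_pass_success < baseline.first_pass_success + min_success_gain:
        reasons.append("first-pass success gain below threshold")
    if candidate.correction_turns > baseline.correction_turns:
        reasons.append("more correction turns than baseline")
    if candidate.p50_latency_ms > baseline.p50_latency_ms * max_latency_ratio:
        reasons.append("latency regression beyond allowed ratio")
    return (not reasons, reasons)


if __name__ == "__main__":
    # Hypothetical numbers, not measurements from the talk.
    baseline = EvalSummary(first_pass_success=0.60, correction_turns=1.4, p50_latency_ms=300.0)
    candidate = EvalSummary(first_pass_success=0.80, correction_turns=0.9, p50_latency_ms=320.0)
    ok, reasons = passes_gate(baseline, candidate)
    print("promote" if ok else "hold", reasons)
```

Returning the failed criteria rather than a bare boolean keeps the decision inspectable when a candidate is held back.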

Comment insights

Sparse comments: one viewer liked using Gemma on a plane but noted battery drain; another dismissed the talk. Treat enthusiasm as anecdotal and the battery warning as the most actionable user caveat.

Raw comment sample considered:

No substantial comments were extracted.

Deep research

External sources checked or named:

Google FunctionGemma fine-tuning: https://ai.google.dev/gemma/docs/functiongemma/finetuning-with-functiongemma
Google Developers FunctionGemma guide: https://developers.googleblog.com/a-guide-to-fine-tuning-functiongemma/
Gemini Nano / AICore on Android: https://developer.android.com/ai/gemini-nano

Supporting evidence: the named docs generally support the existence of the tools and the broad technical direction described in the video. For example, the official docs cover FunctionGemma fine-tuning and Gemini Nano via AICore on Android.

Contradicting or limiting evidence: comments and external docs do not prove the broadest performance or reliability claims. In particular, claims about accuracy gains, speedups, or cost savings require controlled reproduction on the target hardware and workload.

Verdict

Claim: On-device inference is valuable for latency, privacy, offline use, reliability, and cost.
- Verdict: Agree; confidence medium-high.
- What is overclaimed/underclaimed: The direction is credible, but exact magnitude should be reproduced with the same model, data, hardware, and baseline.
- Practical takeaway: Treat it as an experiment template; ship only after your evals confirm the benefit.
Claim: Tiny LLMs need task-specific tuning for reliable tool/function calling.
- Verdict: Agree; confidence medium-high.
- What is overclaimed/underclaimed: The direction is credible, but exact magnitude should be reproduced with the same model, data, hardware, and baseline.
- Practical takeaway: Treat it as an experiment template; ship only after your evals confirm the benefit.
Claim: A small model can jump dramatically in benchmark/task success after supervised tuning (the title claim says 46% to 90%).
- Verdict: Mixed; confidence medium.
- What is overclaimed/underclaimed: The direction is credible, but exact magnitude should be reproduced with the same model, data, hardware, and baseline.
- Practical takeaway: Treat it as an experiment template; ship only after your evals confirm the benefit.
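
How much weight a before/after accuracy pair deserves depends on the size of the eval set, which these notes do not record. The sketch below computes a Wilson score interval for a measured accuracy; the eval set size of 50 and the success counts (23 and 45, chosen only because they equal 46% and 90% of 50) are hypothetical and are not figures from the talk.

```python
import math


def wilson_interval(successes, n, z=1.96):
    """Wilson score interval for a binomial proportion (z=1.96 is about 95%)."""
    if n <= 0:
        raise ValueError("n must be positive")
    if not 0 <= successes <= n:
        raise ValueError("successes must be between 0 and n")
    p = successes / n
    denom = 1.0 + z * z / n
    center = (p + z * z / (2.0 * n)) / denom
    half = z * math.sqrt(p * (1.0 - p) / n + z * z / (4.0 * n * n)) / denom
    return center - half, center + half


if __name__ == "__main__":
    n = 50  # hypothetical eval set size
    for label, successes in (("before tuning", 23), ("after tuning", 45)):
        low, high = wilson_interval(successes, n)
        print(f"{label}: {successes}/{n} -> [{low:.3f}, {high:.3f}]")
```

If the two intervals overlap on your own eval set, the measured jump may not be distinguishable from noise, and more eval rows are needed before drawing a conclusion.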

Screen-level insights

Frames show Google AI Edge scale metrics, Gemini Nano/AICore vs Apple Intelligence, a “Restaurant Roulette” Gemma skill demo, macOS local-app demo, and FunctionGemma Tuning Lab with JSON tool schemas, CSV training data, train/eval/export, and Hugging Face sign-in.
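
To illustrate how JSON tool schemas and labeled rows fit together, the sketch below checks a predicted call against a schema. The schema layout (name plus a JSON-Schema-style parameters object) is an assumption for illustration; the exact format used by FunctionGemma Tuning Lab is not confirmed by these notes, and start_timer is a hypothetical tool.

```python
import json

# Hypothetical tool schema in a JSON-Schema-like shape (an assumption).
TOOL_SCHEMAS = json.loads('''
[
  {
    "name": "start_timer",
    "parameters": {
      "type": "object",
      "properties": {"minutes": {"type": "integer"}, "label": {"type": "string"}},
      "required": ["minutes"]
    }
  }
]
''')

# Minimal mapping from JSON type names to Python types.
JSON_TYPES = {
    "string": str,
    "integer": int,
    "number": (int, float),
    "boolean": bool,
    "object": dict,
    "array": list,
}


def check_call(call, schemas):
    """Return a list of problems with one {tool_name, tool_args} call."""
    by_name = {schema["name"]: schema for schema in schemas}
    schema = by_name.get(call.get("tool_name"))
    if schema is None:
        return [f"unknown tool: {call.get('tool_name')}"]

    params = schema.get("parameters", {})
    props = params.get("properties", {})
    args = call.get("tool_args", {})
    errors = []
    for key in params.get("required", []):
        if key not in args:
            errors.append(f"missing required arg: {key}")
    for key, value in args.items():
        if key not in props:
            errors.append(f"unexpected arg: {key}")
            continue
        expected = JSON_TYPES.get(props[key].get("type"))
        if expected is None:
            continue  # type not covered by this minimal checker
        # bool is a subclass of int in Python, so reject it for non-boolean types.
        is_stray_bool = isinstance(value, bool) and expected is not bool
        if not isinstance(value, expected) or is_stray_bool:
            errors.append(f"wrong type for arg: {key}")
    return errors


if __name__ == "__main__":
    call = {"tool_name": "start_timer", "tool_args": {"minutes": "five"}}
    print(check_call(call, TOOL_SCHEMAS))
```

This checker covers only required keys and primitive types; a full JSON Schema validator would be the better choice once the real schema format is known.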

Why the visual step matters: the frames show whether the talk is conceptual, slide-driven, or actually demonstrating a tool. They also expose implementation details the transcript compresses away: file names, dashboards, command shapes, cost gates, model identifiers, and evaluation UI.

My read / why it matters

This is worth saving because it converts a YouTube idea into an engineering checklist. The practical value is not the claim itself; it is the repeatable loop: set up the tool, trace or benchmark it, compare against a baseline, and write down the failure modes before scaling.

Verification notes

Source/evidence audit: checked extracted transcript snippets, extracted comments, frame analysis, and current web sources listed above. Strong claims were downgraded to “mixed” where only marketing/title/comment evidence supported the magnitude.
Transcript/comment/frame fidelity audit: timestamped transcript bullets and frame-derived observations were kept aligned with the draft packet evidence; raw transcript is not duplicated beyond selected moments.
Hallucination/overclaim audit: avoided asserting exact benchmark results unless visible in frames or named as a claim; marked unverified magnitude claims as hypotheses.
Actionable Insights audit: top section includes concrete first steps, links, metrics/evaluation criteria, and cautions. Residual uncertainty remains around extracted transcript completeness and exact tool versions shown in the video.