Self-Training Agents: Hermes Agent, HF Traces, Skills, MCP & Finetuning — Merve Noyan, Hugging Face
Actionable Insights
- Create a trace dataset from agent runs. Export Claude Code/Codex/Pi/Hermes traces into a Hugging Face dataset repo using the new traces type when available. Include task, tools, outputs, human correction, and pass/fail. Evaluate by whether you can query recurring failure modes and build a fine-tuning/eval set.
- Benchmark open models for your agent task, not generic chat. Use Hugging Face model filters and benchmark datasets, then run your own task eval. Candidate tools: Hugging Face Hub (https://huggingface.co/models), Inference Providers (https://huggingface.co/docs/inference-providers), vLLM (https://docs.vllm.ai), llama.cpp (https://github.com/ggml-org/llama.cpp). Metrics: task success, tool-call validity, latency, cost, privacy posture.
- Try a local coding-agent loop with open weights. Serve a model with llama.cpp/vLLM/MLX and connect a lightweight coding agent such as Pi/OpenCode-style tools if supported. Start on non-critical repos. Caution: local models may lag frontier models on long-horizon coding even if benchmarks look strong.
- Use MCP/skills as reusable capability layers. Package repeatable tasks—dataset upload, model eval, trace review—as skills or MCP tools. First experiment: a skill that turns a failed trace into an eval case. Evaluate by reduced manual steps and lower repeated mistakes.
- Separate “open weights” from “open source” in governance. The speaker makes this distinction; adopt it in model review. Record license, commercial rights, data policy, weights availability, and deployment mode before using a model in production.
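The first and fourth bullets can be sketched together as follows. This is a minimal illustration, assuming a hypothetical `TraceRecord` with the fields listed above (task, tools, outputs, human correction, pass/fail) plus a reviewer-assigned `failure_mode` label. None of these names come from the Hugging Face traces format, and the eval-case layout is likewise an assumption.

```python
from collections import Counter
from dataclasses import dataclass
from typing import Optional


@dataclass
class TraceRecord:
    """Hypothetical trace row; NOT the Hugging Face traces schema."""
    task: str
    tools: list[str]
    output: str
    passed: bool
    human_correction: Optional[str] = None
    failure_mode: Optional[str] = None  # assumed label added during trace review


def recurring_failure_modes(traces: list[TraceRecord], min_count: int = 2) -> dict[str, int]:
    """Count labeled failure modes that occur at least min_count times."""
    counts = Counter(t.failure_mode for t in traces if not t.passed and t.failure_mode)
    return {mode: n for mode, n in counts.items() if n >= min_count}


def failed_trace_to_eval_case(trace: TraceRecord) -> dict:
    """Turn a failed, human-corrected trace into an eval case."""
    if trace.passed:
        raise ValueError("only failed traces become eval cases")
    if trace.human_correction is None:
        raise ValueError("a human correction is needed as the expected answer")
    return {
        "input": trace.task,
        "expected": trace.human_correction,
        "rejected": trace.output,
        "tags": [trace.failure_mode] if trace.failure_mode else [],
    }


# Illustrative data only.
traces = [
    TraceRecord("Fix failing unit test", ["bash", "edit"], "deleted the test",
                False, "patch the off-by-one in parse()", "test_deletion"),
    TraceRecord("Add retry to client", ["edit"], "removed the test file",
                False, "wrap the call in a retry loop", "test_deletion"),
    TraceRecord("Rename module", ["bash"], "renamed", True),
    TraceRecord("Upload dataset", ["hf_upload"], "wrong repo type",
                False, None, "bad_tool_args"),
]

eval_cases = [
    failed_trace_to_eval_case(t)
    for t in traces
    if not t.passed and t.human_correction is not None
]
print(recurring_failure_modes(traces))
print(len(eval_cases), "eval cases")
```

Failed traces without a human correction are skipped here rather than guessed at, which keeps the eval set limited to cases with a reviewed expected answer.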
Core thesis
The Hugging Face ecosystem is becoming an agent platform: open models, inference routing, MCP, skills, trace datasets, and local/remote agents can be combined into a self-improving workflow, but the talk is more ecosystem tour than proof of self-training.
Big ideas / key insights
- The valuable pattern is not “let the agent run longer”; it is to make the work inspectable, measurable, and interruptible.
- In this talk the durable artifacts are concrete: trace datasets, benchmark datasets, and skills or MCP tools that outlive a single chat context.
- The comment evidence is used as a sanity check: where practitioners push back, the verdicts below are deliberately more conservative.
- The strongest practical takeaway is to convert the creator’s idea into a small pilot with explicit success/failure criteria before standardizing it.
Best timestamped moments
- 0:46 — Clear distinction between open weights, open-source licenses, and fully open agent harnesses.
- 1:17 — Open weights allow quantization, fine-tuning, and private deployment.
- 2:20 — Claim that open models are catching up on intelligence indices.
- 4:24 — HF benchmark datasets are shown as model-selection aids.
- 5:26 — Inference Providers allow comparison by provider/cost/speed/tool use.
- 5:57 — HF MCP server and skills are introduced.
- 7:29 — Hermes Agent is recommended for memory/open-model workflows.
- 9:03 — The new HF traces dataset repository type can host Codex, Claude Code, and Pi traces.
Practical takeaways / recommended workflow
- Create the durable artifact first. Write the spec/rubric/policy/trace schema before letting agents perform expensive work.
- Run a constrained pilot. Pick one repository, one team, or one workflow; record baseline cost, latency, failure rate, and review time.
- Instrument the loop. Capture traces, commands, tool calls, test results, and human corrections so the workflow can be evaluated later.
- Add gates. Require acceptance tests, human approval for sensitive actions, and rollback paths before allowing broader automation.
- Review after 5-10 runs. Keep the practice only if it improves measurable outcomes, not just because the demo felt compelling.
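The pilot-and-review steps above can be sketched as a small comparison of baseline and pilot runs. The `RunMetrics` fields mirror the metrics named in the list (cost, latency, failure rate, review time). The decision rule itself (enough runs, no metric worse, at least one better) is an assumption chosen for illustration, not something the talk prescribes.

```python
from dataclasses import dataclass
from statistics import mean


@dataclass
class RunMetrics:
    cost_usd: float
    latency_s: float
    failed: bool
    review_minutes: float


def summarize(runs: list[RunMetrics]) -> dict[str, float]:
    """Average each metric over a set of runs; lower is better for all of them."""
    return {
        "cost_usd": mean(r.cost_usd for r in runs),
        "latency_s": mean(r.latency_s for r in runs),
        "failure_rate": sum(r.failed for r in runs) / len(runs),
        "review_minutes": mean(r.review_minutes for r in runs),
    }


def keep_practice(baseline: list[RunMetrics], pilot: list[RunMetrics], min_runs: int = 5) -> bool:
    """Assumed rule: enough pilot runs, no metric regresses, at least one improves."""
    if len(pilot) < min_runs:
        return False
    b, p = summarize(baseline), summarize(pilot)
    no_regression = all(p[k] <= b[k] for k in b)
    some_improvement = any(p[k] < b[k] for k in b)
    return no_regression and some_improvement


# Illustrative numbers only.
baseline = [RunMetrics(1.0, 60.0, i < 2, 30.0) for i in range(5)]
pilot = [RunMetrics(0.5, 45.0, i < 1, 20.0) for i in range(5)]
print(keep_practice(baseline, pilot))
```

A stricter team might weight the metrics or require a minimum improvement margin; the point is that the keep-or-drop decision is written down before the pilot starts.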
Comment insights
Comments are supportive but note the talk feels like Hugging Face 101 or an ad. One useful addition: human-in-the-loop creates richer training data than unattended AFK runs. Another asks why choose OpenRouter versus HF routing, highlighting provider-selection tradeoffs.
Deep research
- Hugging Face Hub docs. HF hosts models, datasets, Spaces, and inference/provider integrations. Source: https://huggingface.co/docs
- Open Source Initiative / model licenses. Open weights are not always open source; license review is required.
- vLLM / llama.cpp / MLX. These projects support local or self-hosted inference for open models.
- SWE-bench and benchmark caveats. Benchmarks help triage models but do not replace task-specific evals.
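To make the last point concrete, a task-specific eval can be as small as scoring stored agent outputs for task success and tool-call validity, two of the metrics named in Actionable Insights. The sketch below assumes a hypothetical tool registry (`TOOLS`) and a hypothetical JSON tool-call shape with `name` and `arguments` keys. Real agents and providers use their own formats, so treat the shape as a placeholder.

```python
import json

# Hypothetical tool registry: tool name -> required argument names.
TOOLS = {
    "run_tests": {"path"},
    "upload_dataset": {"repo_id", "folder"},
}


def is_valid_tool_call(raw: str) -> bool:
    """Check that a raw tool call parses and matches the assumed registry."""
    try:
        call = json.loads(raw)
    except json.JSONDecodeError:
        return False
    if not isinstance(call, dict):
        return False
    name = call.get("name")
    args = call.get("arguments")
    if not isinstance(name, str) or name not in TOOLS or not isinstance(args, dict):
        return False
    # Every required argument must be present; extra arguments are tolerated.
    return TOOLS[name].issubset(args)


def score_model(results: list[dict]) -> dict[str, float]:
    """Aggregate per-task results into task success and tool-call validity rates."""
    n = len(results)
    return {
        "task_success": sum(r["passed"] for r in results) / n,
        "tool_call_validity": sum(is_valid_tool_call(r["tool_call"]) for r in results) / n,
    }


# Illustrative results for one candidate model.
results = [
    {"passed": True, "tool_call": '{"name": "run_tests", "arguments": {"path": "tests/"}}'},
    {"passed": True, "tool_call": '{"name": "upload_dataset", "arguments": {"repo_id": "x/y", "folder": "out"}}'},
    {"passed": True, "tool_call": '{"name": "run_tests", "arguments": {}}'},
    {"passed": False, "tool_call": "run_tests(path=tests/)"},
]
print(score_model(results))
```

Running the same scoring over each candidate model's outputs gives a comparison grounded in your own tasks rather than in a public leaderboard.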
Evidence quality note: research here uses named public documentation, standards, and widely known project sources where available. Some vendor claims are treated as product claims unless independently benchmarked in the user’s environment.
Verdicts
- Open models are catching up: Mixed / medium confidence. Some open models are very strong on benchmarks; frontier closed models may still lead in certain agentic settings.
- HF traces can support self-training loops: Agree directionally / medium confidence. Trace datasets are a necessary substrate, but quality labels and evals determine training value.
- Hermes/open agents are ready for production autonomy: Mixed / low-medium confidence. Promising for experiments; production needs permissions, evals, rollback, and data controls.
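The last verdict names what production use needs (permissions, evals, rollback, data controls), and Actionable Insights lists the fields to record for each model. One way to make both checkable is a review record with a blocker list. The field names and the choice of which checks block a rollout are assumptions for illustration. Note that `open_weights` and `osi_open_source` are kept as separate fields, since the first does not imply the second.

```python
from dataclasses import dataclass


@dataclass
class ModelReview:
    """Hypothetical review record; field names are illustrative."""
    model_id: str
    license: str
    open_weights: bool
    osi_open_source: bool  # reviewer's judgment; open weights alone do not imply this
    commercial_use_allowed: bool
    data_policy_reviewed: bool
    deployment_mode: str  # e.g. "local", "self-hosted", "provider"
    has_permissions_policy: bool
    has_task_evals: bool
    has_rollback_path: bool


def production_blockers(review: ModelReview) -> list[str]:
    """Return the names of checks that still block production use."""
    checks = {
        "commercial rights": review.commercial_use_allowed,
        "data policy": review.data_policy_reviewed,
        "permissions policy": review.has_permissions_policy,
        "task evals": review.has_task_evals,
        "rollback path": review.has_rollback_path,
    }
    return [name for name, ok in checks.items() if not ok]


# Placeholder model ID and values, not a real review.
review = ModelReview(
    model_id="example-org/example-model",
    license="apache-2.0",
    open_weights=True,
    osi_open_source=False,
    commercial_use_allowed=True,
    data_policy_reviewed=True,
    deployment_mode="self-hosted",
    has_permissions_policy=True,
    has_task_evals=True,
    has_rollback_path=False,
)
print(production_blockers(review))
```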
Screen-level insights
Frames show Hugging Face Hub model filters, benchmark dataset UI, inference-provider comparisons, MCP/skills slides, trace dataset viewer, and local serving options. The visual step matters because the talk is a product/ecosystem tour; the UI frames reveal which features are concrete.
Representative extracted frame anchors checked against transcript context:
- 0:15 — image youtube-extract/OV56RddyFuU/frames/000_000015.jpg; transcript context: Hello everyone and welcome to this talk in open agent ecosystem and I would like to call it having an AI engineer at your fingertips. I’m Merve and I work in the open source team of Hugging Face. How many of you are using Hugging Face on daily basis? Oh, let’s change that. This is not okay. But first let’s talk a bit about open source and what it is.
- 2:20 — image youtube-extract/OV56RddyFuU/frames/003_000140.jpg; transcript context: intelligence index and the green ones are open models meanwhile the black ones are the closed models and we are we just catched up. And we will catch up even more with the upcoming models and stuff. And let’s go back to Hugging Face Hub. So everything is facilitated through Hugging Face Hub, all of the open releases. It’s the infra layer for all of your open
- 2:50 — image youtube-extract/OV56RddyFuU/frames/004_000170.jpg; transcript context: models. I should have updated the number. It’s probably close to 3 million. A lot of data sets, spaces, and everything, but that’s not all when it comes to the agentic ecosystem and this is what we’re going to talk about today. So when you go to the models, you can filter for agentic models. They are mostly the trending ones. And there is like two types of m
- 3:22 — image youtube-extract/OV56RddyFuU/frames/005_000202.jpg; transcript context: agent over the screenshots. They know where to click etc., which is pretty cool. And one trend I have recently noticed is the fact that you have labs releasing their LLMs as vision with vision capabilities day zero. Like for instance the Gemini 4 was an omni model and still it’s an agentic model. There is Qwen 3.5, there is Chimera Chimera 2.5. These were VL
- 4:24 — image youtube-extract/OV56RddyFuU/frames/007_000264.jpg; transcript context: we have recently launched this feature called benchmark datasets. So, when you go to the datasets on the left-hand side, there is like on the bottom there is a bunch benchmark button. You just click it and then you can see the popular benchmarks such as SWE-Bench Pro or Humanity’s Last Exam or AIME and others. And when you go to for instance SWE-Bench to
- 5:26 — image youtube-extract/OV56RddyFuU/frames/009_000326.jpg; transcript context: providers out there there’s Groq, Cerebras, I don’t know Novita and everything. And then it’s super easy to compare them as well. If you see like you have the cheapest or the fastest option. Actually, I had to truncate it, but also there is the tool use column. So, you can actually pick one of the open source models for the agentic use case and stuff. And go
- 5:57 — image youtube-extract/OV56RddyFuU/frames/010_000357.jpg; transcript context: shipped a ton of features for you to use open models with agents agents and stuff. And first of like there is the MCP server where you can plug the hub into your LLM. And there is skills which allow you to even vibe train models. Like you just go to your agent and say train Qwen 3.5 on this data set for me and then it just trains. Which to me is like a sci-
- 9:03 — image youtube-extract/OV56RddyFuU/frames/011_000543.jpg; transcript context: was a rumored uh Minimax model coming up, so I will also probably try with that and share my findings. So, I absolutely recommend using Hermes agent with the open models. And one more thing. So, basically, uh Hugging Face Hub now has a new data set repository type called traces, and this is basically all of your uh Codex, uh Claude Code, or Pi traces they h
- 9:33 — image youtube-extract/OV56RddyFuU/frames/012_000573.jpg; transcript context: And for instance, if you go to your um if you pushed uh trace and then you go over there you will see in the data set viewer if you click on the traces column it pops up like this it is very nicely parsed and you can just explore your data and then later if you want you can even train a model on that which is pretty cool in my opinion and if you want to push
- 10:03 — image youtube-extract/OV56RddyFuU/frames/013_000603.jpg; transcript context: traces you can just upload your sessions from these paths and nothing else is needed and we will also probably have Hermes agent very soon for traces going back if you want to use if you want more options to serve LLM behind the agent locally so some tips and tricks in finding a good model you just go to hugging face there is an other tab under the other tab
My read / why it matters
This video is useful if you convert it into an operating procedure rather than copying the headline. The durable lesson is about control surfaces for AI work: specs humans read, traces teams audit, evals that catch regressions, identity policies that revoke access, or graphs that preserve provenance. The risky version is adopting the slogan without the measurement and governance layer.
Verification notes
- Source/evidence audit: Checked the extracted transcript/comment packet and named external sources/docs relevant to the main claims. Vendor/tool links are identified as vendor/project sources, not neutral proof of effectiveness.
- Transcript/comment/frame fidelity audit: Timestamped moments and comment insights were kept close to extracted evidence in youtube-extract/OV56RddyFuU/ and the draft packet. Screen claims are limited to the extracted key-frame metadata and visible UI descriptions; for -QFHIoCo-Ko, no frame-derived claims are made because key frames were not extracted.
- Hallucination/overclaim audit: Headline claims were softened where evidence was insufficient. Verdicts explicitly mark mixed/low-confidence claims and separate practical heuristics from proven facts.
- Actionable Insights audit: The top section was checked for executable first steps, tools/commands or links where available, evaluation criteria, and cautions. Generic summary bullets were rewritten as workflow steps.
- Residual uncertainty: I did not have independent benchmark results for the specific demos, and several claims would need local measurement before adoption. Transcript extraction status was marked unknown by the extractor, so the analysis relies on the processor’s excerpted transcript evidence rather than a full raw transcript page.