Transcript: Skill issue: Lessons from skilling up coding agents to use Langfuse - Marc Klingen, Clickhouse

AI Engineer · 24:08 · Transcript added May 21, 12:53 am GMT+8

Source video ID: vNCY9kXXyDQ

Transcript

0:15 — Okay, that was quick. Hi everyone. Super excited to be here. I’m Marc, one of the founders of Langfuse. We started Langfuse about three years ago, when everything felt quite early. We were building agents that didn’t work, and then realized there needs to be some evaluation and tracing built. Langfuse is the open source project in the space, and by now, by the metrics that we track, we seem to be the largest one in the space. We do all product engineering out of Europe, so I’m very excited this conference is coming to Europe, because there are so many great people here, and we always need to resist the urge to ship the whole team to another continent because
0:45 — actually being here is very nice, and you can just travel and hang with people on Discord and Zoom. So yeah, very excited to be here in person. What I want to talk about today is the lessons we learned from skilling up coding agents and actually adding Langfuse to an application. Back in the day, when you added observability or evals, you needed to read hundreds of pages of docs and figure out your own mental model. Now you kind of expect to be hand-held by a custom agent to do this. We’ve come a long way in achieving this vision, and I just want to explain what we learned on
1:16 — the way. First of all, I’ll start very conceptual and easy, then go a bit deeper conceptually, and then I’ll go to the learnings. My mental model for skills is this: I got a Rubik’s cube when I was a kid, and I had no idea what to actually do. I basically have a bash tool; I can do whatever I want with this Rubik’s cube, but it just looks colorful in different ways and I have no idea how to solve it. Skills are like getting the manual: once you have it, it’s easy. You just need to follow the manual and you can solve a Rubik’s cube. I feel like the same thing now applies to agents, where there was this whole debate of workflow versus fully autonomous
1:46 — agent. There was a huge fight on X about what the best way to build an application is. Everyone was kind of like, yes, you need both. And I think Malta this morning had a good note that the surface area of deploying agents is so broad that for some you don’t need a coding agent, even if it’s the best way you can build agents at the frontier today. You don’t need this for every application; it makes things slow and expensive. There’s this balance between a workflow being very reliable and an agent now having kind of unlimited capabilities. And I think what’s very exciting is that skills are kind of
2:17 — like a formalized shortcut to make things more reliable where you historically would have built a workflow. Say you have a customer support agent and someone asks for a password reset. Historically you would have built a very reliable workflow: you have a router that routes to an agent that can only do password resets, and that agent has the context to do password resets. That was great. But if the user then wants to do a password reset and also change the email address at the same time, the router is kind of like, okay, I have this email router, I have this password reset router, what do I even do? And I think what’s really
2:48 — exciting now is that an agent can just progressively get the context needed to solve a problem that’s multi-domain, one that would historically have been spread across multiple workflows. So that’s very exciting. However, it has always been hard to know how to build an agent and what kind of use cases even exist, because you often have this open-ended text input box, or open-world context. So you didn’t really know. What we now see many teams do is: they have the agent runtime and they trace everything. I’m building Langfuse, so I put my logo here, but you can also use whatever you want. In
3:18 — the end it’s more about the concept: tracing is what you need to identify what happens at runtime when an agent is executed, because it helps you learn two things. One, new use cases that you didn’t expect users to have. You might have expected that nobody ever wants to change a password, but now they want to change a password, and you need to see the execution trace of someone being upset, surfaced via a production eval, to then derive that you need to add a skill for handling password resets well. And two, once you have these skills, they can get out of date, or you can realize they’re not the
3:49 — most efficient way of actually solving for this use case. So the second thing is helping you improve skills that you already have in your agent. This is very conceptual, but this is what we now see most teams that use Langfuse do. Now more towards the learnings we made when building a skill to help customers add Langfuse to their project. What was the lay of the land before we got started with this? 478 pages of documentation. Whenever I see the thing deploying I’m like, who wrote all of this? So apparently if you build a
4:19 — project over three years, it just grows in complexity. People can do all of these different things, but then you need to read all of these things, and nobody has the time. That’s across about five different feature areas, with a lot of implementation flexibility, because there have always been projects in the eval space that were, I’d say, very opinionated. They were like, “Oh, you have a chatbot? Then add this project and it’ll just solve this for you, opinionated, end to end.” We always were like, “No, no, we are infrastructure. We do tracing well: if you ingest billions of traces, it will still work. If you want to customize your evals, it’ll still work.” So we’re
4:49 — always more on the unopinionated side, which was always, I’d say, a weakness compared to projects that are more opinionated. But now I think it’s a strength, because in the end, what do you need when agents do all of this? You only need the infrastructure piece if agents can then customize for different workflows. But there’s the problem: if people want to add Langfuse to a project, how to do it correctly for that project has to be figured out by an agent. Interestingly, when we first got into model pre-training context as a project, you could just ask an agent how to add
5:19 — Langfuse and it would spit out Langfuse SDK logic. The first time this happens, it’s amazing. But two years in, the project evolves, interfaces change, and now being in pre-training context might even be a disadvantage if you don’t fetch up-to-date information. We get all of these hallucinations of methods that were available in the past but are not available today. So when skills launched, we felt this is the exact pattern that we need in order to help teams achieve this. I’ll use an example: when you just
5:51 — ask Claude Code to add Langfuse to a project, it just worked, but it was not working in the best way possible. For example, the user asks, “add tracing to my agent,” and Claude Code implements the instrumentation based on the outdated pre-training context, then tries to verify whether the tracing works, then realizes it doesn’t work, and only in a second step fetches up-to-date information to correct the issue. At the same time, when you add tracing or evaluation to a project, you can evaluate in a gazillion
6:21 — different ways: online evals, offline evals, human in the loop. There are so many different things, and often the question is what is even relevant for your application. So human and agent kind of need to figure it out on the way, but the agent is not tasked to help you figure out what’s the best thing for your application. So the main problems: outdated training data; a non-optimal setup, because the agent wasn’t really primed to help you discover what to do for your app; and it’s very slow, because you first add instrumentation the wrong way, then you figure out it’s wrong, and then you need to fetch more documentation to fix
6:52 — the issues. So what did we do? Oh yeah, just a meta note: this is what the trace looked like when we just tried it with Claude Code. It just tracks two LLM calls in an agent, but you still don’t know what the agent’s actually doing. So what was the goal of our skill? Give every Langfuse user (there are thousands of teams in the community, thousands of customers on our cloud product) a Langfuse expert to help them quickly set up observability, prompt management, and evals in line with best practices and up-to-date docs and references. Because all of you, I mean, there’s
7:22 — Annabal from the team here as well; if you have questions regarding observability or evals you can talk to us, but in the end that doesn’t scale to thousands of people, to basically talk through your problems and figure out what the best strategy is for you. So we were like, okay, skills is the way to go. And this is, very conceptually, how our skill works: the user comes in and asks the coding agent to do something, and then the skill has a SKILL.md that is more about what kind of style we want in order to implement Langfuse, for example: ask follow-up questions
7:52 — before making decisions, because there’s so much you could be doing. Then there are references for the different product modules, to progressively disclose additional hints that the agent might need, and then it can call the documentation. Interestingly enough, as we started open source and saw ourselves as unopinionated infrastructure, we always had APIs for everything, because teams built their own labeling UIs on top of it, or their own evaluation execution logic on top of our backend. Now we’ve wrapped it in a CLI, and an agent can just do everything humans needed to do in the
8:23 — UI in the past. Which is very cool, because so many teams spend so many hours every week clicking around in our UI to evaluate and improve the application. And how will this look by the end of the year? It’ll probably just be: connect a repository to Langfuse, and then an agent just does the whole thing autoregressively. That’s what we are building towards, that’s what everyone is building towards, and I think that’s a cool step in the right direction. So to shortcut to the end result: after conversing with the agent, the trace for a similar thing now looks way more
8:53 — detailed: detailed evals that are relevant and detailed steps regarding tool execution. So there’s a stark difference. And what did we learn on the way? Six main things. I’ll go through every single one of them; they were basically our realizations when building the skill. One: looking at traces still gets you to 80% of the detail. This was always what we tried to preach regarding evals, where many people try to complicate things right away while they haven’t dug through just what did the
9:23 — agent actually do at runtime themselves a couple of times. So what did we do? We have instrumentation for Claude Code, and we ourselves interactively tried to use Langfuse with Claude Code and then looked through the traces in Langfuse: where did the agent err, and how can we improve the skill to make it shoot straight at the goal instead of wandering in different ways to the target? That was really helpful too. There were some interesting learnings here. For
9:54 — example, for humans we try to cut down on the number of environment variables that you need to set in order to set up Langfuse. So, for example, we just auto-assumed a data region: Langfuse is available in Europe and in the US. As an anecdote, we assumed that only Europeans care about data regions, so we made Europe the default. Then we learned some US enterprises also care about data regionality. Now we have a US data region and many other data regions. So we always defaulted to Europe, but for an agent, adding another environment variable is no effort; they don’t care. So we now always prompt it to figure out what data region
10:24 — the user is actually in, and don’t assume Europe, for example. Second, hallucinated CLI parameters: the agent goes, oh, this includes the word trace, I’ve seen tracing CLIs before, I’ll just assume what we could be doing here. So we just advertise the help flag more aggressively. It takes another turn, but it’s fast, and thereby it directly knows what the CLI can do. Two: we try to help the agent understand how to navigate the available information, because with 500 documentation pages, how do you find the
10:54 — right one instead of looping through, fetching one, learning something, then fetching another one, always with a thought process in between? So what did we do? We always had this llms.txt, which was very hyped when it launched but never actually used. I think what’s now cool is we have this kind of agent sitemap that we just expose to a coding agent via the skill: go there first in order to learn what kind of documentation is available. And second, there’s this whole content negotiation where, if you send an Accept request header saying that you want markdown, you get markdown back from the docs. But some coding agents don’t do
11:25 — this by default, so we just advertise this, because otherwise some coding agents might try to parse the HTML, which just adds additional tokens. For Langfuse, for example, you can just add .md to any documentation page, or you request markdown, and you’ll get a markdown page. Three: I think this was one of the things I was most excited about. We always had this docs Q&A agent that was able to answer questions about Langfuse more interactively. For that, we built a RAG stack, and now we just surface this RAG stack again via a search endpoint. So a coding agent
11:56 — can just ask whatever natural language query about Langfuse, and it’ll get back documentation chunks for this query. Why is this exciting? One, you don’t need to fetch five different doc pages; you can just ask a question, get something back that’s relevant, and directly solve the problem. And two, we get to track these search parameters. If a coding agent fetches documentation, it’s very difficult to understand what Claude Code on our user’s laptop did, but if they ask questions about Langfuse to our search endpoint, we can track the searches and thereby understand what problems they run into and where we need to add more
12:26 — documentation pages, because maybe we didn’t expect this kind of problem to happen. So yes, adding a search endpoint was really cool for capturing more data. Then, four: a basic eval setup is better than none. We initially struggled to get this done because it’s so broad. Some Langfuse users build chat applications, real-time voice, video generation, batch processing of invoices in the background of some kind of text software; so many different use cases, where the question is what’s even a good evaluation setup. And we just
12:56 — created five different ones, and this was already helpful, because otherwise it’s really hard to measure anything. And what we did here (can I zoom in? No.) is basically this: we have just a prompt, “instrument application with Langfuse,” and then a sample repository folder, for example an OpenAI custom-function RAG application or whatever. And our checks are just natural language statements that we then, via LLM-as-a-judge, try to evaluate on top of the file system and diff state before and after
13:26 — running the skill. So, for example, we expect that our OpenAI instrumentation was added, because it’s an OpenAI example, and because it’s RAG, we expect some retrieval spans to show up in our trace, because if there are no retrieval spans then we probably only capture, for example, LLM calls. This was already helpful, because we were able to make changes here and know that we didn’t break anything. The whole reason why we even built Langfuse for building AI agents now also applies here. Five: dynamic content should be referenced, because there’s a huge, I’d say, incentive for
13:57 — developers on the team, but also for users in the community, to just contribute a lot of context to the skill, because then it’s kind of like a local cache of the documentation that’s immediately available. However, the same thing applies that applies to pre-training context: it goes out of date, and now we have the documentation and yet another representation of what Langfuse is. So you’d rather try to point straight to the reference documentation, because otherwise you just duplicate all content. And six: we applied auto research
14:27 — to the skill: if we have a target function, how can agents help us improve the agent? Because there are so many different patterns that we can explore. So we set up a target function. Our experiment here was: help teams move prompts from their local git repository into Langfuse prompt management, which is used by larger teams to collaborate on prompts with their non-engineering counterparts, because then PMs can make changes to prompts, iterate in a playground, all of this collaborative stuff. And the task was: how do we improve the skill to migrate prompts out of any kind of
14:57 — codebase into our managed prompt system? In the end we accepted three out of the six improvements that were suggested, which I think is a success, and it allowed us to experiment much more than we could have explored manually with the time that we have, as we are a very small team. Learnings: the target function really matters. It sounds obvious, but for us, defining the right target function was very hard, because we assumed a prompt migration should be fast, and we measured that in the number of turns. But
15:27 — if we basically asked to minimize the number of turns, then the agent that tried to optimize the skill just took out all of the notes that we had about fetching documentation, because it was like, “I know how Langfuse prompt management works, I don’t need this, I’ll just try it myself.” Which then negates the whole point that we want to fetch up-to-date context, because otherwise, if you install the skill once and wait three months, you’ll have the wrong context, because we duplicated information. Two, we usually had an approval gate, where we want to ask a question or
15:59 — suggest a plan to a user before doing anything, because we push their prompts to a repository, and it’s their data leaving their laptop for somewhere else. But the sandbox didn’t have this, so we weren’t really able to try this. And with Langfuse, usually we try to make it easy to get going with something, but then it’s very deep how to do it in a good way, and we want agents to directly go for the good way: figure out with the user what they want to achieve and then have a very full implementation, not start with something and then two months later go deeper. However, if the target
16:30 — function does not include that we want, for example, linking prompt versions to traces, so that you can see how different prompt versions impact, for example, production results; we didn’t have this in the target function. So everything that nudged towards this was removed, because it’s just garbage on the way that we don’t need to achieve the goal. So again, the target function really matters. At a high level, these were the six main takeaways. Looking at traces gets you 80% of the way. The production signals really helped, so the search endpoint was really helpful
17:00 — for our documentation. Help the agent navigate the information, because otherwise it just searches with Google, Brave, whatever search, and finds all sorts of different things on the internet. Even a basic eval setup helped, and it wasn’t that hard to set up. Dynamic content should be referenced, otherwise you just have duplicates. And the auto research was very helpful to explore things, but is bound by the target function. Topics on our minds here: it’s so powerful, but at the same time you then duplicate stuff into user space, kind of like
17:32 — somewhere on a machine. There’s no package management for this that tells the user this is outdated. We thought about just adding a timestamp of the date when the skill was fetched, and then, if it is older than a month, try to update. But then we get to the second problem, skill distribution, and installing into the agent environment. Usually this is gated or not possible for the agent, depending on what you use. Thus the user needs to do something to install the skill. This
18:03 — also means auto-upgrading doesn’t really work, but it really depends on the coding agent that you use. And the target function is interesting for us, because we can either go for: the user needs to get to an initial aha of “oh, this works,” or we shoot straight for: this is the perfect setup of how you would do evals for this use case. But without a skill, it usually takes an AI engineering team months to get to a perfect setup. Do we now aim for an agent to do this in a single shot and overload the user with lots of questions, or do we just try to get to
18:34 — something, and then you can still invoke it again with an “improve my setup” ask, and then it can ask questions to improve it? So what’s the target for the skill? That was very interesting for us. I would invite you to try it and give us feedback, because that’ll be really interesting. We do lots of calls with people from the community every week, and I think it’s not a surprise that nobody reads documentation themselves and everyone is just like, yeah, just add this to my, like, I just want this to work, like
19:04 — just add it. So yeah, the skill is the primary way of how things get done. This is also now the advertised way across all of our documentation: you should just ask your coding agent to do whatever you’re trying to do right now. I’m very excited that it works really well, but I’m also excited to see what comes next. For us as a project, roadmap-wise, we see that right now our users use the skill when getting started with the project, but also to drive a lot of automation around the evaluation life cycle: oh, I now want to create an LLM-as-a-judge that’s aligned with user
19:34 — preferences, or: I got user feedback on 100 different executions, what do they have in common? And you then just fetch this via the CLI. So many of these workflows that people needed to do manually, coding agents now do for them. We’ll help automate this via skills, one; bring this in product, two; and then three, I feel like we just need this orchestration agent that does what the team is doing right now. So yeah, I’m very excited for our roadmap to automate all of this. If you have any feedback, I’m around, Annabal is around, and I would love to talk to you. And yeah, thanks so much for your time.
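The target-function pitfall from the auto research experiment can be illustrated with a small sketch. This is not the team's actual setup: `RunResult`, both scoring functions, and the penalty weight are hypothetical names and values chosen for illustration. It only shows why a score that rewards fewer turns alone favors a skill variant that skips fetching documentation, while a score that also encodes the desired behaviors does not.

```python
from dataclasses import dataclass


@dataclass
class RunResult:
    """Outcome of one sandboxed skill run (all fields are hypothetical)."""
    turns: int
    migration_succeeded: bool
    fetched_docs: bool
    linked_prompts_to_traces: bool


def naive_score(run: RunResult) -> float:
    # Rewards speed only: fewer turns is always better.
    return -float(run.turns)


def guarded_score(run: RunResult, penalty: float = 10.0) -> float:
    # A failed migration is never acceptable, however fast it was.
    if not run.migration_succeeded:
        return float("-inf")
    score = -float(run.turns)
    # Penalize dropping behaviors we care about (weights are assumptions).
    if not run.fetched_docs:
        score -= penalty
    if not run.linked_prompts_to_traces:
        score -= penalty
    return score


# Two candidate skill variants, described by made-up example runs.
fast_but_stale = RunResult(turns=4, migration_succeeded=True,
                           fetched_docs=False, linked_prompts_to_traces=False)
thorough = RunResult(turns=7, migration_succeeded=True,
                     fetched_docs=True, linked_prompts_to_traces=True)

candidates = [fast_but_stale, thorough]
best_naive = max(candidates, key=naive_score)
best_guarded = max(candidates, key=guarded_score)
```

With the turn-only score, the variant that never fetches documentation wins; once the score includes the behaviors the skill is supposed to preserve, the thorough variant wins instead.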
20:09 — I don’t know, do we have time for a question? Okay. » Yep, I may have missed where you were the human. So you were basically directly trying to implement using the skill, right, or » Yeah, it was kind of, I mean, you want to be out of the loop for the experimentation and then just review the suggested changes. So it
20:40 — was kind of: the experiments give us all sorts of different recommendations, and then humans review the suggestions. We didn’t accept all of them, because many didn’t make sense, because our target function wasn’t perfect, because it’s really difficult to get to a perfect target function. But it’s good at just creating ideas, and then we human-reviewed all of the ideas to make the changes to the skill. » So at the end the skill might not be optimizing for human interaction
21:18 — or I don’t know. » Yeah, that’s what we try to do. You need to try it yourself to just get a sense of how it feels to use the skill to add Langfuse to an application. So we just use it ourselves to get a sense for the feeling, because where we want to go is: it should feel like an expert user trying to guide you through what you need for your problem. Usually someone comes in with just, “I need evals because I read about it online, but I don’t know what I actually need for my application,” and they need guidance on where to go. Like, what is
21:49 — your problem? What do you worry about? You probably don’t need a, I don’t know, hallucination eval, but you probably need something that’s very specific to your application. And that’s what we want to achieve with the skill: that you get some professional guidance. » Yep. » I really rellistly insane right now of just like install. What are your thoughts on sort of treating skills as packages
22:20 — like skills kind of approach, or like going full on, you know, plugin marketplace? » As a small team, I’m not that excited about plugin marketplaces, because then you need to maintain all of these proprietary integrations and update them in, I don’t know, Anthropic, OpenAI, Cursor, whenever you make » an open air now agreeing on sort of like
22:51 — » Yeah, still, for the skill I think it would be cool if we just had a well-known skill or something, so whenever someone is like, oh, I want to, for example, use Langfuse, the agent can just auto-discover that it exists. We have it across all of our docs, so I think it would be enough if the agent can ask the user, “I want to install this skill.” The question is, do you even need to ask? I think you only need to ask if the skill is more trusted than the public web. If it’s the same trust level, then why even bother asking? And then two: if I have this installed, that’s kind of
23:21 — like a cache of something that was up to date when I installed it, and then the question is how do I know whether it’s out of date. So I think just timestamping it is enough, so when you use the skill the agent can be like, oh, this seems old, I should probably fetch a new one. I think this would already go a long way. We are going more the timestamp-and-fetch route, or alerting the user that this might be out of date; that’s at least what we discussed for now. But yeah, I’m excited to see what everyone is shipping in the space. I’m around. Thanks so much. Bye-bye.
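The timestamp idea discussed in this answer could look roughly like the sketch below. It is only an illustration under stated assumptions: the function names, the ISO 8601 timestamp format, and the 30-day threshold (standing in for the "older than a month" rule mentioned in the talk) are all hypothetical, not how the Langfuse skill actually stores or checks its fetch date.

```python
from datetime import datetime, timedelta, timezone
from typing import Optional

# Assumed threshold: treat a skill fetched more than 30 days ago as stale.
MAX_SKILL_AGE = timedelta(days=30)


def parse_fetched_at(stamp: str) -> datetime:
    """Parse an ISO 8601 timestamp that was recorded when the skill was fetched."""
    return datetime.fromisoformat(stamp)


def skill_is_stale(fetched_at: datetime,
                   now: Optional[datetime] = None,
                   max_age: timedelta = MAX_SKILL_AGE) -> bool:
    """Return True when the installed copy is older than max_age."""
    current = now if now is not None else datetime.now(timezone.utc)
    return current - fetched_at > max_age


# Example with fixed dates so the outcome does not depend on today's date.
reference_now = datetime(2025, 3, 1, tzinfo=timezone.utc)
old_copy = parse_fetched_at("2025-01-01T00:00:00+00:00")
fresh_copy = parse_fetched_at("2025-02-20T00:00:00+00:00")

should_refresh_old = skill_is_stale(old_copy, now=reference_now)
should_refresh_fresh = skill_is_stale(fresh_copy, now=reference_now)
```

An agent that reads such a timestamp when loading the skill could then either fetch a new copy itself or alert the user that the installed copy may be out of date, which are the two routes mentioned in the answer.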