Transcript: LLM codegen fails and how to stop 'em — Danilo Campos, PostHog

0:15
Good morning. >> Who's afraid of robots? Anyone afraid of robots? I'm not afraid of robots, because they have already bloodied my nose so many times that there is no more pain they can give me. And that is what I want to tell you about on this fine morning. Thanks for coming to hang with me. So, my name is Danilo. I work at PostHog and I make the PostHog Wizard.
0:45
And the very strange thing the PostHog Wizard does is it takes two hours of misery that you would never get back in your life and hands it back to you as eight minutes of pseudo-entertainment. Now, how do we get away with this? We're talking 15,000 people every single month running this wizard,
1:15
and in exchange for their trouble, they get a PostHog integration that works and that they actually like. How do we do it? I'm going to tell you all about it today. And just to underscore the point that this actually works: within the last six hours, we got two unprompted posts on Bluesky and Twitter where people are actually happy. Now,
1:48
this should be terrifying, right? I got a robot out there. It's writing code for people. What if it's doing a bad job? Well, we learned all the ways that it could do a bad job. I'm going to tell you the ways that those bad jobs happen. I'm going to tell you some strategies that you can use so that your autonomous coding agents do the right thing as well. All right, let's start with the easy one. We got model rot.
2:18
Now, training a model takes a lot of time, but it's not even the time, it's the money, right? Anthropic is not screwing around training a model on a weekend as a lark. This is serious capital expense. And the trade-off is that the models sit there no longer representing reality. They are a snapshot of the world and the web as it was, you know, six, eight, twelve, eighteen months ago, perhaps. Now,
2:51
this is useful for many things, but if you're a fast-moving software project, and there are loads of fast-moving software projects, the trade-off is that the model doesn't know what the hell is going on anymore. So you've got to deal with model rot. Now, this is fairly straightforward stuff; you've probably dealt with this sort of thing before. Does anyone here have a conviction about how you deal with model rot? Any guesses?
3:23
Seeing some shaking heads. What's that? >> RAG. >> RAG is good. Although, I'll tell you what, with context windows being what they are at this point, you can't beat just shoving a bunch of markdown files into the context and patching the holes. And this is exactly what we do with the PostHog Wizard: we have documentation that is fresh, hot off the presses on posthog.com, and we allow the agent to make a selection. We say, "Hey, what are you doing? What are you integrating right here? What have we detected?"
3:54
And the agent can use tools to go out, pick from a menu of fresh, hot markdown that it can slide right into its context, get the job done, and do things correctly.
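(For illustration: a minimal sketch of what a docs-menu tool pair like this could look like, here using the in-process MCP helpers from the Claude Agent SDK. The tool names, framework list, and URLs are assumptions, not PostHog's actual implementation; in particular, it assumes the URLs serve raw markdown.)

```typescript
// Hypothetical docs-menu tools: the agent lists what's available, then pulls
// one fresh markdown file straight into its context.
import { tool, createSdkMcpServer } from "@anthropic-ai/claude-agent-sdk";
import { z } from "zod";

// Assumed map of framework -> docs location. Point these at whatever your
// docs pipeline publishes as raw markdown.
const DOCS_MENU: Record<string, string> = {
  nextjs: "https://posthog.com/docs/libraries/next-js",
  react: "https://posthog.com/docs/libraries/react",
  python: "https://posthog.com/docs/libraries/python",
};

const listDocs = tool(
  "list_integration_docs",
  "List the frameworks that have fresh integration docs available.",
  {},
  async () => ({
    content: [{ type: "text", text: Object.keys(DOCS_MENU).join("\n") }],
  })
);

const fetchDocs = tool(
  "fetch_integration_docs",
  "Fetch the current markdown docs for one framework.",
  { framework: z.enum(["nextjs", "react", "python"]) },
  async ({ framework }) => {
    const res = await fetch(DOCS_MENU[framework]);
    return { content: [{ type: "text", text: await res.text() }] };
  }
);

// Exposed to the agent as an in-process MCP server.
export const freshDocs = createSdkMcpServer({
  name: "fresh-docs",
  version: "1.0.0",
  tools: [listDocs, fetchDocs],
});
```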
4:25
Now, what spurred all of this was that a year ago, people started asking their very primitive agents, "All right, Cursor, I want you to integrate PostHog for me." And it would do a terrible job, right? It's making up keys. It's making up patterns. It's inventing APIs that don't exist. And it's not our fault, we didn't do anything, but it was our problem. So figuring out ways to serve correct, up-to-date context to the agent so that it does the correct job is part of how we get people posting happily about what the wizard did for them.
4:55
All right. Now, these models have clearly been scraping every kind of project out there, and I have to guess that not all of those projects had great architecture, because some of the decisions these agents make when they're putting a project together are very strange. So what do you do? How do you deal with the fact that an agent's conception of how to put something together may be technically workable but not exactly ideal?
5:25
Well, my homies and I on the PostHog Wizard team maintain a fleet of what we call model airplanes. These are projects that have PostHog implemented in them, across a bunch of frameworks and a bunch of languages. But what makes one a model airplane is that there isn't an entire proper production application in there. What we have is something much thinner: a simulacrum of a real application.
5:58
For example, the auth doesn't work. Or rather, the auth works for anything: you can put whatever you want in the password field and you're going to be able to log in. But the auth is auth-shaped, which means we can provide these model airplanes to the agent, and then the agent knows: oh, cool, so when auth shows up, this is a great place to put the particular event tracking one would want for tracking login and identity in PostHog.
6:28
And so through the maintenance of a thing that isn't quite as elaborate as the real production application, which of course also makes it more token-efficient, what you get is the correct shape of an integration: a pattern that the model and agent are able to complete consistently every time.
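(For illustration: a tiny slice of what a model airplane file might look like, using Express and the posthog-node client. The route, event name, and layout are hypothetical; the point is the auth-shaped pattern with the tracking calls in the right place.)

```typescript
// A model airplane: auth-shaped, but not real auth. Any password "works",
// yet the file shows the agent exactly where identity and event tracking go.
import express from "express";
import { PostHog } from "posthog-node";

const posthog = new PostHog(process.env.POSTHOG_API_KEY ?? "phc_test", {
  host: "https://us.i.posthog.com",
});

const app = express();
app.use(express.json());

// The pattern the agent learns to complete: identify on login, then capture
// the event right where the business action happens.
app.post("/login", (req, res) => {
  const { email } = req.body; // the password is deliberately ignored
  posthog.identify({ distinctId: email, properties: { email } });
  posthog.capture({ distinctId: email, event: "user_logged_in" });
  res.json({ ok: true, token: "fake-token" });
});

app.listen(3000);
```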
6:58
All right. So in addition to weird architecture, the agent can find a weird path through the problem space. And with 15,000 integrations per month, it might find 15,000 ways to get a PostHog integration done. While that would technically satisfy the requirement that we've automated the integration, it would leave us with a very strange support burden, because there would be too many different ways that PostHog was set up. It's like, what the hell is this? How do I make sense of this? Right? This would be a problem at scale. This would be some Sorcerer's Apprentice stuff.
7:29
So to limit improvisation, what we do is breadcrumb the agent. We don't tell the agent up front exactly what we're going to do. Maybe you've seen this even when you're using Claude Code: if you tell it exactly where you want to go, it might punch a Claude Code-shaped hole through the first four tasks and then get really rock-polishy with the fifth, right? And that is not what we want for our case. And so one of the things we do is start off by barely even telling the agent what we're doing.
8:00
We don't even really mention that we're doing a PostHog integration. We start with something like: where are the files with interesting business value in this project? Can you find something that looks like a login, or a Stripe interface, or something that might indicate someone's about to churn? We go looking for the files that would be responsive to impact in somebody's business.
8:31
And the funny thing is that business stuff casts a huge shadow in code, so we can detect this kind of thing very reliably. From there we say: okay, here are some cool files. What are the interesting events going on in those files? Don't write any code right now; let's just think about some cool events we might want to sprinkle through here. What might those be? So we make a list of these. We get the event names, we get the descriptions for those events, and we just tuck them into a little file.
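(For illustration: one plausible shape for that little events file, sketched as a typed structure. The event names, descriptions, and paths are invented examples.)

```typescript
// Hypothetical shape of the little events file the wizard writes before any
// code is touched: names and descriptions only, no implementation yet.
interface PlannedEvent {
  name: string;        // snake_case event name
  description: string; // why this event matters to the business
  file: string;        // where we expect to instrument it
}

export const plannedEvents: PlannedEvent[] = [
  {
    name: "user_logged_in",
    description: "A user authenticated; anchors identity for later events.",
    file: "src/routes/login.ts",
  },
  {
    name: "checkout_started",
    description: "A Stripe checkout session opened; a revenue signal.",
    file: "src/routes/checkout.ts",
  },
];
```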
9:01
And this is just the start of things, right? We don't even know where we're going necessarily. And so the next breadcrumb is: okay, let's start to actually implement PostHog. We now know a bunch of events, and we've really thought carefully about what those events might be. And now we have documentation and everything, which we can load at whim according to the framework and language we care about here. And so we can reliably go in there and start to make modifications to people's files, and the modifications are, once again, not stupid, and people aren't mad about it.
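(For illustration: the breadcrumb sequence, sketched as staged prompts run one at a time. The wording and the runAgentTurn helper are assumptions; only the ordering, orient, then plan, then implement, reflects the talk.)

```typescript
// The breadcrumb sequence as staged prompts. The agent only ever sees the
// current stage, never the destination.
const breadcrumbs = [
  // 1. Orient; PostHog is not mentioned at all yet.
  `Where are the files with interesting business value in this project?
Look for logins, Stripe interfaces, anything suggesting a user might churn.`,

  // 2. Plan; still no code allowed.
  `For the files you found, list the interesting events going on in them.
Don't write any code. Save event names and descriptions to events.json.`,

  // 3. Implement, now that events are chosen and fresh docs are loaded.
  `Using the integration docs for the detected framework and the events in
events.json, implement PostHog in this project.`,
];

// Assumed helper: drives one agent session to completion. In the wizard this
// would call into the agent harness; here it's a stub.
async function runAgentTurn(prompt: string): Promise<void> {
  console.log(`[agent turn]\n${prompt}\n`);
}

async function main() {
  for (const prompt of breadcrumbs) {
    await runAgentTurn(prompt); // each stage builds on the last one's output
  }
}

main();
```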
9:31
Okay. Now, we can do all the thoughtful stuff we want to make the agent successful, but the biggest threat to our agent outcomes is ourselves. We're feeble little beings. We've got a little bit of meat locked inside our heads, and we have a context limit too. We can't really quantify it, and it varies by how long ago we had some coffee and whether we had breakfast that morning.
10:02
Our context is not just limited but fragmentary. There's stuff we remember implementing last week and stuff we forgot from last month. So we're making changes, editing code, evolving the stuff our agent is working around, and sometimes we drop things that really matter. At one point, we had an MCP tool instruction that contradicted a different tool. And the agent is like,
10:32
man, I don't know what to do here; you're putting me in an impossible spot. We had a situation where we were telling it, hey, there's a tool you definitely need to use to conclude this setup. And the agent gets there: all right, cool, let's use the tool. Wait, the MCP server does not have a tool by this name. And we're talking hundreds of runs going by with this missing tool. What was going on there?
11:02
If we didn't ask, we wouldn't know. So one thing you can do that is really handy and fairly cheap is a little bit of inference-time interrogation of what just happened with your agent. At the end of every run, right at the stop hook, we ask a very simple question. We're doing a little bit of user research, but the user, in this case, is a robot. We ask the robot user: hey, what could we have done better to set you up for success in this run?
11:34
And then it tells us. That's how we found out things like: oh, we didn't give you permission to access the tool, so there was no tool. Hey, you've got these contradictory directives. A good one was that we kept giving it instructions for JavaScript when it was working in a Python project; frustrating for it, presumably, but without this ongoing interrogation we would never have identified any of it. So human error is a big deal, and you have to ask to find out.
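(For illustration: a sketch of that inference-time interrogation. AgentSession, its ask method, and the feedback endpoint are all hypothetical stand-ins for whatever harness and logging you already have; the question itself is the part that matters.)

```typescript
// Inference-time interrogation: one extra question at the very end of a run.
interface AgentSession {
  id: string;
  ask(prompt: string): Promise<string>;
}

const REFLECTION_PROMPT = `The run is over. What could we have done better to
set you up for success in this run? Missing tools, contradictory directives,
docs in the wrong language: be specific.`;

export async function interrogate(session: AgentSession): Promise<void> {
  const answer = await session.ask(REFLECTION_PROMPT);
  // Ship the answer somewhere a human will actually read it.
  await fetch("https://example.com/wizard-feedback", {
    method: "POST",
    headers: { "content-type": "application/json" },
    body: JSON.stringify({ runId: session.id, answer }),
  });
}
```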
12:06
out. Now, there's also shenanigans that you got to be concerned about here because running an agent on someone else's machine demands a huge amount of trust, right? We we've got this robot that could do anything potentially and we don't want to do something uh bad or destructive to the user's project. We don't want to put them in a worse spot. Um, and one of the early versions of our wizards would actually just readenv
12:36
Reading is necessary to do writes, right? You can't just write blind to a file; it's one of the mechanics of how these agents work. But it's also not ideal to be sending people's .env contents up to a cloud, where they're sitting in someone's damn log you don't know about. So this was obviously bad news. But when you're designing these things, you have fine-grained control over tool usage, right? You can decide: all right, these tools are okay, these kinds of reads are okay,
13:07
these kinds of reads are not okay. So we really locked down what the agent was allowed to do around anything that was a .env file, and then we built it a tool that could do exactly two things: it could check the presence of a key (does this key exist?) and it could write a new value to a key. That was it. Nothing from the .env file was going up for inference.
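(For illustration: a sketch of such a locked-down .env tool's two operations. The function names are invented; note that neither returns file contents to the agent.)

```typescript
// The locked-down .env tool: exactly two operations, and the file's contents
// never enter the model's context or leave the machine.
import * as fs from "node:fs";

// Operation 1: does this key exist? Returns a boolean, never the value.
export function envKeyExists(envPath: string, key: string): boolean {
  if (!fs.existsSync(envPath)) return false;
  return fs
    .readFileSync(envPath, "utf8")
    .split("\n")
    .some((line) => line.trim().startsWith(`${key}=`));
}

// Operation 2: write a new value for a key, replacing any existing line.
export function writeEnvKey(envPath: string, key: string, value: string): void {
  const lines = fs.existsSync(envPath)
    ? fs.readFileSync(envPath, "utf8").split("\n")
    : [];
  const idx = lines.findIndex((line) => line.trim().startsWith(`${key}=`));
  if (idx >= 0) lines[idx] = `${key}=${value}`;
  else lines.push(`${key}=${value}`);
  fs.writeFileSync(envPath, lines.join("\n"));
}
```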
13:38
And so as a result, we were no longer touching this stuff. But again, man, you are setting these robots loose on anybody's computer. You've got to keep an eye on these shenanigans, because even if you're kind of solving the problem you promised you'd solve, you might be doing it in a way that makes you look like an [expletive]. All right. Now, this is the big one. This is the weird one.
14:09
Our whole careers, we have been rewarded for writing the code. We write the code. We write more code. We write the clever code. Oh, I've got a structure in here. This thing works. This thing is reliable. This thing is elaborate, but the performance is really good. And man, we've just got to code the [expletive] out of this thing. If we code our way out of this problem, everything's going to be great, right? That is not the world we live in anymore.
14:42
And a very funny thing about code is that if today you write some code you think is good, and tomorrow a new model drops, the code you wrote has the exact same value. If anything, it might be declining a bit. Code has always been a depreciating asset: you write it, you might ship it into the world a little bit rotten because you've got some tech debt you'll have to deal with at some point, but meanwhile you shipped on time. You got done what you had to do.
15:13
The wizard that makes everybody so happy is 90% markdown files, 8% tools for delivering and processing markdown files, and the rest is agent-harness stuff, right? Plain-text prose is where so much of our value now lives. When you write great prose today and tomorrow an even better model drops, it's going to be able to take that prose and do even more with it.
15:44
And so an agent is an octopus, right? It can wriggle. It can squeeze into tight corners. It can maneuver itself around problems. You do not want to overconstrain the agent's ability to get problems solved, aside from the shenanigans stuff we talked about, right? So instead of thinking, man, how can I scaffold the hell out of this agent's behavior, it's about saying: how do I step back?
16:15
How do I give it enough information, and how do I sequence the information I give it, so that it does the thing I want it to do and makes people happy in the process? So this is what I know from the robots bloodying my nose. I see on my clock here I've got a couple of minutes left. Does anyone have questions about the strange adventure of building this robot that makes people happy? >> Shoot.
16:46
>> Yeah. >> How is that structured? >> Oh, sure. The way we drive context for the wizard is we use skill files that are generated from our context service. That context service takes all of those model airplanes, flattens them into a single markdown file, and includes it as a reference in the skill file.
17:18
And so we always have access to the full model airplane, which the model can grep and otherwise churn through. >> [Audience question about using different skills] >> Oh, sure.
17:48
So, yeah, this is just part of the supplemental content included in the skill. What we found was that there's a range of useful input we can include as part of the skill file. We've got documentation, which is plain-text prose, but then we also include the model airplane so the agent can see the shape of a successful integration, and it references all of that as part of getting the job done.
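(For illustration: roughly the shape a generated skill file could take, assuming the common SKILL.md-with-YAML-frontmatter convention. The name, sections, and placeholders are guesses, not PostHog's actual output.)

```markdown
---
name: posthog-nextjs-integration
description: Integrate PostHog into a Next.js project using current docs and a reference implementation.
---

# PostHog Next.js integration

Use the docs below, then study the flattened model airplane at the end of
this file: it shows the exact shape of a finished integration.

## Integration docs (generated from posthog.com)

<!-- fresh markdown injected here by the context service -->

## Reference implementation (model airplane, flattened)

<!-- every airplane file concatenated into one markdown reference -->
```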
18:19
>> Yeah. >> [Audience question about how the agent runs] >> Oh, sure. So this uses the Claude Agent SDK, which we wrap inside a CLI, so you just run a single command, and then we give you free inference by logging into PostHog. We've got this LLM gateway where we can cover all of the tokens on your behalf. Which was a whole zoo, because sometimes Claude Code would store auth information in a place we weren't expecting and then it would just break for people. It's early days for doing this as kind of a service.
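(For illustration: a minimal sketch of wrapping the Claude Agent SDK in a CLI, assuming the published @anthropic-ai/claude-agent-sdk package. The prompt and options are a plausible subset, not the wizard's actual configuration.)

```typescript
// Wrapping the Claude Agent SDK in a CLI: one command, one agent run.
import { query } from "@anthropic-ai/claude-agent-sdk";

async function main() {
  for await (const message of query({
    prompt: "Where are the files with interesting business value here?",
    options: {
      permissionMode: "acceptEdits", // runs unattended inside the wizard
      allowedTools: ["Read", "Grep", "Glob", "Edit"],
    },
  })) {
    if (message.type === "result") console.log(message);
  }
}

main();
```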
18:49
Anything else I can tell you? Well, then I'm going to scoot out of the next speaker's way. Thank you for hanging out. It's great to see you. Have yourself a great rest of your day.