Transcript: Fast Models Need Slow Developers — Sarah Chieng, Cerebras

AI Engineer · 18m 01s · Transcript · Added May 25, 6:19 pm GMT+8

Source video ID: TeGsFFNqRLA

Transcript

0:16 — Hi everyone. So we’ll just get right into it. Over the past few years, we as developers have developed a series of bad habits as a result of slow AI code generation. And we’re all familiar with it. We do things like write massive prompts and try to one-shot. We’ll make huge commits, or we’ll have our 10 agents all on the screen at the same time, combobulating, cogitating, thinking. And so about a
0:47 — month ago, we at Cerebras and OpenAI released a new state-of-the-art model called Codex Spark. Codex Spark can generate code at 1,200 tokens per second. To put that into perspective, if you look at the Sonnet family or the Opus family, those generate code at about 40 to 60 tokens per second. So in this new era, as we’re starting to see much faster coding models, this is 20 times faster. Not only does it unlock new capabilities and use cases, but it also requires us to
1:19 — rethink how we as developers interact with the coding model. A lot of the bad habits that we had before were generating maybe 50 tokens per second of bad code, and unless we fix them, they’re going to start generating 1,200 tokens per second of bad code. And so that is the topic of today’s talk. So to get started, my name is Sarah Chieng. I’m the head of developer experience at Cerebras, where we are building the world’s largest and fastest AI processor.
1:50 — A large part of my job is that I get to introduce fast inference and fast coding models to developers for the very first time. And for most people, it’s a very exciting moment. There’s none of the thinking and waiting and starting up that you might be really annoyed about. But at the same time, as I said, unless we change our habits, we are not going to have good code in the future. And so this talk really is a practical playbook for how we as developers can think about how we
2:20 — interact with the models in this new regime, especially in a future where the models are generating code faster than we humans can keep up. So I want to look back at history a little bit. We’ve had a very exciting past two years. The models have gotten bigger, they’re getting smarter, we have bigger context windows, but the thing that has remained relatively constant over the past two years is coding speed, model speed. So if we look at a lot of the popular families, we have Gemini, Claude, GPT, Sonnet. Over the past two
2:51 — years, they’ve always been within, you know, 50 to 150 tokens per second. And this is Codex Spark. Again, Codex Spark is just the first of many models that we as developers can expect to be much faster than what we were previously used to. We even had to change the y-axis because it’s so much faster. And so before we get into the actual playbook and tips, I want to talk about why this is happening. Why are we suddenly seeing such fast models? It’s actually a very exciting development. It’s what many of you
3:22 — probably work on day to day, and there are so many companies working on this problem all at the same time. As a result, the entire AI inference stack is getting optimized all at once. So let’s break it down and go through it really quickly. We have hardware. This is the physical device that inference, training, all of our compute is happening on. One of the biggest things that we have to think about with hardware is the memory wall. And this is exactly why hardware and memory movement takes up 50 to 80% of that latency time
3:52 — for inference. This is where a lot of the frustration comes from. When we are running inference, we have to constantly move our weights and KV cache values between memory and the actual chip. On an NVIDIA GPU, the most traditional type of hardware, all of this memory is stored off chip in HBM, and we now have a memory bandwidth bottleneck. What a lot of newer companies, companies like Cerebras or Groq, are thinking about is how do we move this memory to be as close to the chip as possible. And so here’s an example of
4:22 — the Cerebras wafer, where all the memory is distributed across the chip in SRAM, so every core has direct access to the values it needs. Even more exciting, we have disaggregated inference, which has really become commercialized in the last few months. This is why Nvidia bought Groq for $20 billion a few months ago. And this is also why Cerebras and AWS are now partnering to serve the wafer and AWS Trainium together. So in traditional
4:52 — inference, there are two steps: prefill and decode. Traditionally, both of these steps have always been run on the same piece of hardware. Prefill is where we’re taking every token that the user inputs, processing it, embedding it, and adding it to our KV cache. This is a step that can happen in parallel, and so it’s compute-bound. Decode, on the other hand, is where we’re actually generating the output token by token. This is sequential and, as we mentioned, memory-bound. Again, it goes back to the same problems that we mentioned before. And so what we’re doing and
5:24 — seeing now commercially is that we’re splitting up these two steps so that prefill is done on one type of hardware that is compute-optimized and decode is done on another piece of hardware that is memory-optimized. Going up the stack, there’s the diagram. Going up the stack, we look at model architecture. There are so many ways that we are training and shaping our models to cater to our hardware. We have specific layer dimensions, memory, and model size that we’re always thinking about. A
5:54 — great example is a very standard model architecture, mixture of experts. Here, instead of activating the entire model all at once for every single token, we’re only activating a subset of experts each time. What this does is allow us to have the intelligence of a much larger model for the compute cost of a much smaller model. And again, we’re always thinking about memory and the size of our models. A lot of people have been building on top of this in recent years. An example is REAP, router-weighted expert activation pruning. I had to read that one. And here we’re
6:25 — looking at the specific use case. We’re seeing which experts aren’t being activated at all and we’re pruning them altogether. We’re getting rid of them. Again, we’re always thinking about model size. And then at the very top layer of the stack, we have inference optimizations. This is where many of you might be working, and a lot of companies that you’re probably familiar with are also working here. These are companies like Together, Baseten, Modal, who’s also here, and Fireworks. One of the biggest things that we’re thinking about at this level is KV cache reuse. By storing and reusing previously
6:58 — computed token representations, we don’t have to recalculate attention over the sequence at every step. And now I want to get to the very top and most exciting part, the developer. This is the current state of what the internet looks like, or what Twitter and LinkedIn look like. We have someone running six Claude Code terminals at once. A 500-plus agent coding swarm. Someone running eight agents across five screens. And I get how tempting doing something like this can be. I feel like
7:28 — if you’re on Twitter at all these days, unless you are doing something like that, the internet is basically convincing you that you are living in the stone age and that you need to catch up. But the reality of what is happening in all these setups is that we’re generating massive amounts of code that nobody is verifying. And in the new future with much faster inference, this becomes increasingly dangerous. Especially with fast inference, we’re now going to be generating technical debt at a level that we’ve
7:58 — never seen before. And we’re not going to know what to do with it. So I’m going to pivot now and spend the rest of the talk on the practical playbook: tips, workflows, and how we can reimagine how we as developers should operate in this new regime of faster inference. As I mentioned, Codex Spark operates at 1,200 tokens per second, but it really is just the first model in what we as developers should expect and prepare for: a new regime of faster models across the board. And so starting with the first one, the first category is just choosing
8:29 — the right models and how we orchestrate our agents so that we’re leveraging different model strengths. I think historically we always think about intelligence. It’s no secret that we as developers are not particularly loyal and that we will switch to whatever model, whatever family, is most intelligent at a given time. Maybe we also think about cost, unless our company pays for whatever we want. But now that the inference speed is a 20x difference, we also have another vertical to think about: speed. And so a
8:59 — good mental model is to use a larger model like GPT 5.4 or 5.3 for your planning or your long-horizon workflows, and then use a faster model like Codex Spark as your actual executor. Here’s an example. You might ask GPT 5.4 to generate your plan. You would then spawn all of your subagents with Codex Spark and have them actually execute all of those steps one by one. Another really helpful
9:30 — trick is to make skills out of successful sessions and capture trajectories that are working really well. A thing that you can do here is have a model like GPT 5.4 do the initial harder, larger task, capture that as a skill, thereby making it a verifiable, repeatable workflow, and then have a smaller, faster agent like Codex Spark just do it again and again in the background. The next category I think is even more exciting because this is a category of
10:01 — things that just were not possible and were not practical before. These are things we wouldn’t do because we’re tired of the cogitating, gesticulating, germinating that you might have seen. And so here I really want us to think about this and internalize it: at 1,200 tokens per second, a model like Codex Spark makes validation basically free. There is no excuse and no reason why you should not be doing things like this. Test suites, linting, pre-commit
10:32 — hooks, diff reviews, browser-based QA automations. There are all these things that you can add to every step of your workflow because it is instant. It’s not slowing you down, and it’s no longer something you do all at the very end, right before you’re about to push your code. Another tip that I really like is exploring and cherry-picking. Let’s say that I want to code a navbar and I want it to be midnight blue. I want four different icons. I give it to the model and the result’s fine. Instead, what I can do with Codex Spark or a much faster model is I can have
11:03 — it generate 15 versions in the same time that it would have taken a previous model to generate one version, and I can cherry-pick the version that I like the best. Even better, I can spawn five subagents that are each generating 15 versions, and now I have 75 versions and I pick the one that’s best. This is great for things where we really value quantity or variety. So things like research directions, different architecture directions, or even just graphic design. And the reason why I really like this one is because it almost allows us to
11:34 — artificially induce taste into our model output. Traditionally, it’s no secret that it’s very easy to sniff out any UI or text that a model writes. The models themselves do not have taste. And the ways that we’ve kind of brute-forced our way around this are that we either create an example ourselves, or we find examples for the model, which is time-consuming, or we give the prompt so much detail that we might as well have completed the task ourselves. This is a great way of saving our time and also
12:05 — getting much better results. The next tip is more of a mental model: now that the models are so fast, it should not be that you spawn a session, you go get a hamburger, you scroll Twitter, and then you come back. Now you can actually sit down and have a real-time collaboration with this model. You should view it much more as a pair programmer. And this is the only way that you are going to avoid
12:35 — having bad code. So you can sit down and do things like have it collect all the context across your repo and actually ask it how things work, while you are the one in the front seat making decisions and implementations. The AI should always be helping you make decisions, not the other way around. The next one, I hate this slide because it’s everyone’s trigger word and an overused word, but how do we avoid slop? So, as I was mentioning before, it really shouldn’t be, you know, you spawn 10 agents, you never verify the code.
13:06 — You don’t know what’s happening behind the scenes. Someone asks you to explain, and you have to read the code for the first time. Now, you can actually have two to three sessions and actually sit down next to your code. I know this is something we’re not really used to, but sit down with it and actually steer it, understand what’s happening, because again, we are now experiencing real-time collaboration as we code with this agent. You can be super specific. You can do things like ban the model from deleting files, give it a max diff size, have the model only read and write, and even give it
13:37 — steering directions, things like: only change this; don’t touch types yet; wait, that implementation wasn’t quite right, let’s redo that. The graph on the left is a helpful mental model, as an example of how the developer, the AI agent, and the codebase can all work together and what that should look like. This next step, refactoring, is very similar to what I was talking about with verification. Just like with verification, something like constantly refactoring and cleaning up your code automatically is basically free at 1,200
14:08 — tokens per second. So instead of doing it at the very end, right before you’re about to commit your code, you can just bake this into your automatic workflow, so that after every single task on that checklist is complete, you’re just asking the model to automatically, you know, delete unused imports, clean up unnecessary lines of code, make it so that all of my functions are structured the same way. The last category that I want to talk about, and I’m sure that so many of you have already heard these two words countless times over the past
14:38 — few days and across so many talks, is context management. But the reason I’m going to talk to you about it again is because let’s say that historically it took you 10 minutes to fill up your context before you saw, you know, the god-feared word compaction. Now, if you take 10 minutes, divide it by 20, you are now getting compaction in 30 seconds. And so, context management, especially with fast inference, is more important to think about than ever. And you can’t get away with sloppy practices
15:08 — anymore. All of these really are just good practices no matter what coding model you are using or at what speed. But a general, very high-level framework is to always, always break up large tasks into smaller bounded goals. And this graph on the right is a good mental model for how the fullness of your context will affect the model’s behavior. You always want to avoid the 80 to 100% range because you’re going to get compaction. And right now we all know some things
15:38 — might get lost. So a good thing to think about is: how do I externalize this memory so that I can have these small bounded goals, and what does that look like? An example of how you can set up an external memory system that is persistent every time you start a new session is this four-file system. We have agents.md, which is where we’re actually defining all our agents and subagents. We have plan.md, which is what we’re creating at the very beginning, and this is where we’re just generating the entire plan and the
16:09 — step-by-step checklist that we’re going to go through. We have progress.md, which is where we’re keeping track of what we need to do and what has been done before. So every time you spawn a new agent or session, it has no context. It comes in, it looks at progress.md, it sees what’s been done before, and it’s like, okay, here’s where I pick up. Here’s the next task that needs to be done. And then the last is verify.md. This is what we’re using at every single step to just make sure everything looks good, it’s clean code, and we can move on to the next step. And
16:39 — so an example of this is, again, leveraging different models: using GPT 5.3 or 5.4 Codex to create your plan, and then having GPT 5.3 Codex Spark actually execute the checklist one by one, much faster than before. And as a final slide, I want to share a few helpful commands for how you can get the best out of Codex. Things like permissions, experimental, skills, review, and rename. But the biggest thing that I really want to emphasize here is that honestly it’s not really about just having faster coding models. What it
17:11 — really means is that the developer experience is actually going to become so much better. And when it’s becoming so much better, there’s so much more we can do. There are so many ways that we can now avoid creating bad code in a way that isn’t miserable or us staring at a screen for 30 minutes. So thank you guys so much for welcoming me today. My name is Sarah Chieng. I’m visiting from SF. It’s an honor to be here in London. If you have any questions or need any credits, my
17:41 — handle is milks and matcha across every platform. Thank you guys.
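
Editor's sketch (not from the talk): the progress-file loop described around 15:38 to 16:39 could look roughly like the following. It assumes the progress file (written here as progress.md) is a Markdown checklist using "- [ ]" and "- [x]" markers, which the talk does not specify. The function names `next_task`, `mark_done`, and `run_session` are hypothetical, and the `execute` and `verify` callables are stand-ins for the fast executor model and the verification step; no real model API is called.

```python
import tempfile
from pathlib import Path
from typing import Callable, Optional

TODO = "- [ ] "  # assumed marker for an open checklist item
DONE = "- [x] "  # assumed marker for a completed checklist item


def next_task(progress: Path) -> Optional[str]:
    """Return the first unchecked task in the checklist, or None if none remain."""
    for line in progress.read_text().splitlines():
        if line.startswith(TODO):
            return line[len(TODO):]
    return None


def mark_done(progress: Path, task: str) -> None:
    """Flip matching unchecked lines to checked, leaving all other lines as they are."""
    lines = progress.read_text().splitlines()
    updated = [DONE + task if line == TODO + task else line for line in lines]
    progress.write_text("\n".join(updated) + "\n")


def run_session(progress: Path,
                execute: Callable[[str], str],
                verify: Callable[[str, str], bool]) -> Optional[str]:
    """One fresh session with no prior context: read the progress file,
    run the next task, and record it as done only if verification passes."""
    task = next_task(progress)
    if task is None:
        return None
    result = execute(task)  # hypothetical stand-in for the fast executor model
    if verify(task, result):  # hypothetical stand-in for the verify step
        mark_done(progress, task)
    return task


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        progress = Path(tmp) / "progress.md"
        progress.write_text("- [x] scaffold repo\n- [ ] add navbar\n- [ ] write tests\n")
        # Each loop iteration plays the role of a brand-new session.
        while run_session(progress, lambda t: "patch for " + t, lambda t, r: True):
            pass
        print(progress.read_text())
```

Because all state lives in the file, each session can start with an empty context and still pick up where the last one stopped, and a task that fails verification stays unchecked, so the next session retries it.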