
Exploring Kimi K2: Moonshot's Latest Open Source Model

In this video, we dive into Kimi K2, the newest open-source model from Moonshot. This mixture-of-experts model boasts 32 billion activated parameters and a trillion total parameters. It's designed for agentic tasks and exhibits notable performance across several benchmarks. We discuss its standing relative to other models, highlighting its non-reasoning nature and impressive performance given its specifications and open-source status. The video covers benchmark comparisons, token usage, and how to access and utilize the model effectively. The insights provided will help you understand where Kimi K2 stands in the landscape of large language models.

https://www.kimi.com/

00:00 Introduction to Kimi K2
00:05 Model Specifications and Benchmarks
00:51 Artificial Analysis Insights
02:57 Token Usage and Cost Implications
04:52 Accessing and Using Kimi K2
06:07 Conclusion and Final Thoughts
---
type: transcript
date: 2025-07-15
youtube_id: gq3oddN-u1Q
---

# Transcript: Kimi K2 in 6 minutes

In this video, I'm going to be going over Kimi K2, which is the latest open-source model from the team over at Moonshot. Right off the bat, this is a mixture-of-experts model with 32 billion activated parameters and 1 trillion total parameters. They describe it as a non-thinking model, and one of the really big use cases for it is agentic tasks. Now, the one thing that I've really come to realize with a lot of these large language models coming out is that, increasingly, one of the most important benchmarks for these coding models is actually their agentic capabilities.

So now if we look at the benchmarks, basically across the board we can see that this model outperforms on a number of different benchmarks, and where it doesn't, it comes quite close. The one thing to note is that this isn't comparing against reasoning models: they don't have it plotted against things like o3 or o4, or against the reasoning modes of models such as Opus.

Now I want to take a look at Artificial Analysis, which ran their benchmarks on Kimi K2 just before I recorded this video. One thing that I do want to call out is that this model is a non-reasoning model, and a lot of the oxygen in the room over the past several months has really been around reasoning models. But one of the key aspects of reasoning models to consider is that their responses do generally take a fair bit longer. If we take Kimi K2 and compare it to the non-reasoning models, we can see that it plots above all of them: DeepSeek V3, Llama 4, as well as some others like GPT-4o, and so on. It is impressive in its own regard that it is both open-source and outperforms those models. But in terms of raw intelligence, you can see that it is a ways away from a number of these other models.
Now, in terms of some of the other benchmarks: on MMLU, we have it at 82%. On GPQA Diamond, it's at 77%, just shy of o4-mini and just above Claude 4 Sonnet with thinking. On Humanity's Last Exam, it ranked at 7%, just below Magistral Small from Mistral. To put this into perspective, Grok 4 just came out, and it ranked just shy of 24%. And again, mind you, this is a non-reasoning model. On their evaluation of LiveCodeBench, we have it at 56, just shy of Claude 4 Sonnet thinking. And for HumanEval, we have it at 93%.

Now, in terms of its intelligence-to-price ratio, this is an important factor given how much you get out of the model for what you're actually paying. At a blended rate of $1.50 per million tokens, we see it rank at 57.45 on the Artificial Analysis Intelligence Index, which does outperform models like GPT-4.1. But with that being said, if I move the tooltip over here, there are a number of other models that are both cheaper and rank higher on the overall intelligence index: models like DeepSeek R1, Grok 3 Mini Reasoning, Gemini 2.5 Flash Reasoning, as well as MiniMax M1.

Now, one really interesting thing I want to point out from Artificial Analysis: they mention that while Moonshot AI's Kimi K2 is the leading open-weights non-reasoning model on the Artificial Analysis Intelligence Index, it outputs 3x more tokens than the median non-reasoning model, which blurs the line between reasoning and non-reasoning. This is where it gets quite interesting. Kimi K2 is the largest major open-weights model yet: 1 trillion total parameters with 32 billion active parameters, which requires a massive 1 TB of memory. They mention, just as I pointed out, that they have K2 at 57 on the Artificial Analysis Intelligence Index.
That's an impressive score, one that puts it above models like GPT-4.1 and DeepSeek V3, but behind the leading reasoning models. Until now, there has been a clear distinction between reasoning and non-reasoning models in their evals, defined partly by whether or not the model uses reasoning tags, but primarily by token usage: the median number of tokens used to answer all of the evals in the Artificial Analysis Intelligence Index is 10x higher for reasoning models than for non-reasoning models. They then go on to describe that Kimi K2 uses 3x the number of tokens that the median non-reasoning model uses. What's really interesting is that its token usage is only up to 30% lower than Claude 4 Sonnet and Opus when running their maximum-budget extended thinking mode, and is nearly triple the token usage of both Claude 4 Sonnet and Opus with reasoning turned off. They therefore recommend that Kimi K2 be compared to Claude 4 Sonnet and Opus in their maximum-budget extended thinking modes, not to the non-reasoning scores for the Claude 4 models.

So, this is a really interesting aspect to consider, because the extra tokens you have to spend to get those quality responses do in fact incur more cost: you're going to be using more tokens. It definitely does seem like it might sit somewhere in the middle, between a reasoning and a non-reasoning model.

Now, in terms of actually accessing the model, you can pull it down from Hugging Face. Obviously, given the size of the model, you will need substantial hardware to actually be able to run it, but given that it is open source, it will be interesting to see what types of models we see derived from it. To try out the model, you can go to kimi.com and use their web interface. In terms of speed, Groq is, at the time of recording, by far the fastest option for running this new Kimi model.
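To make the memory and cost points above concrete, here is a quick back-of-the-envelope sketch. The $1.50 blended rate, the 1-trillion-parameter count, and the 3x token multiplier are the figures quoted above; the 500-token baseline response and the 1-byte-per-parameter weight size are illustrative assumptions, not measured values.

```python
# Back-of-the-envelope arithmetic for the figures discussed above.

# ~1 TB of weights: 1 trillion parameters at roughly 1 byte each
# (e.g. 8-bit weights -- an assumption for illustration).
TOTAL_PARAMS = 1_000_000_000_000
BYTES_PER_PARAM = 1
print(f"~{TOTAL_PARAMS * BYTES_PER_PARAM / 1e12:.0f} TB of weights")

# Per-answer cost at the quoted $1.50 blended rate per million tokens.
BLENDED_RATE_PER_M = 1.50   # USD per million tokens (quoted above)
TOKEN_MULTIPLIER = 3        # Kimi K2 vs. median non-reasoning model (quoted)

def cost_usd(tokens: int, rate_per_million: float) -> float:
    """Cost of one response at a blended per-million-token rate."""
    return tokens / 1_000_000 * rate_per_million

baseline_tokens = 500       # hypothetical median non-reasoning answer length
k2_tokens = baseline_tokens * TOKEN_MULTIPLIER

print(f"Baseline answer: ${cost_usd(baseline_tokens, BLENDED_RATE_PER_M):.6f}")
print(f"Kimi K2 answer:  ${cost_usd(k2_tokens, BLENDED_RATE_PER_M):.6f}")
```

The takeaway is that a cheap per-token price can hide a higher per-answer price: if the model emits 3x the tokens, each answer costs 3x as much at the same rate.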
So, here is an example of a little application that I built out, where I'm demonstrating a Mario-style game that it was able to generate on the fly. One quick tip that I saw on X: if you want to use Kimi K2 as your agent within Claude Code, you can do that by changing out the base URL and setting the API key before actually running the `claude` command. From there, all of the prompts and code requests will go through Kimi in the Claude Code terminal you're used to. All you need to do to try this out is get your API key from Moonshot AI, set `ANTHROPIC_BASE_URL` to the Moonshot API URL, and set `ANTHROPIC_AUTH_TOKEN` to the API key you got from Moonshot AI. I'll put the link to that in the description of the video.

But otherwise, that's pretty much it for this video. Kudos to the team at Kimi for this impressive model they just put out. Let me know your thoughts in the comments below. And if you found this video useful, please comment, share, and subscribe. Otherwise, until the next one!
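The Claude Code tip described above boils down to two environment variables. A minimal sketch, assuming Moonshot exposes an Anthropic-compatible endpoint; the exact URL below is an assumption, so confirm it against Moonshot AI's documentation:

```shell
# Point Claude Code at Moonshot's Anthropic-compatible endpoint.
# The URL is an assumption; check Moonshot AI's docs for the exact value.
export ANTHROPIC_BASE_URL="https://api.moonshot.ai/anthropic"
export ANTHROPIC_AUTH_TOKEN="$MOONSHOT_API_KEY"   # your Moonshot AI API key

# Then launch Claude Code as usual; prompts now route through Kimi K2:
# claude
```

Because these are plain environment variables, unsetting them (or opening a new shell) restores Claude Code to its default Anthropic backend.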