Kimi K2: Open Agentic Intelligence


We assess Kimi-K2-Instruct across different areas. For coding, we adopt LiveCodeBench v6 \parencite{jain2024livecodebench} (questions from August 2024 to May 2025), OJBench \parencite{wang2025ojbenchcompetitionlevelcode}, MultiPL-E \parencite{10103177}, SWE-bench Verified \parencite{jimenez2024swebench,yang2025swesmith}, TerminalBench \parencite{tbench_2025}, Multi-SWE-bench \parencite{zan2025multi}, SWE-Lancer \parencite{miserendino2025swelancer}, PaperBench \parencite{starace2025paperbench}, and Aider-Polyglot \parencite{aider}. For tool-use tasks, we evaluate performance on $\tau^2$-Bench \parencite{barres2025tau2} and AceBench \parencite{chen2025acebench}, which emphasize multi-turn tool-calling capabilities. For reasoning, we include a wide range of mathematical, scientific, and logical tasks: AIME 2024/2025, MATH-500, HMMT 2025, CNMO 2024, PolyMath-en, ZebraLogic \parencite{lin2025zebralogicscalinglimitsllms}, AutoLogi \parencite{zhu2025autologiautomatedgenerationlogic}, GPQA-Diamond \parencite{rein2024gpqa}, SuperGPQA \parencite{du2025supergpqa}, and Humanity’s Last Exam (Text-Only) \parencite{phan2025humanitysexam}. We benchmark long-context capabilities on MRCR for long-context retrieval, and on DROP \parencite{DBLP:journals/corr/abs-1903-00161}, FRAMES \parencite{krishna2025factfetchreasonunified}, and LongBench v2 \parencite{bai2025longbenchv2deeperunderstanding} for long-context reasoning. For factuality, we evaluate on FACTS Grounding \parencite{jacovi2025factsgroundingleaderboardbenchmarking}, the Vectara Hallucination Leaderboard \parencite{hhem-2.1-open}, and FaithJudge \parencite{tamber2025benchmarking}. Finally, general capabilities are assessed using MMLU \parencite{hendrycks2021measuringmassivemultitasklanguage}, MMLU-Redux \parencite{gema2024we}, MMLU-Pro \parencite{wang2024mmluprorobustchallengingmultitask}, IFEval \parencite{Zhou2023InstructionFollowingEF}, Multi-Challenge \parencite{sirdeshmukh2025multichallengerealisticmultiturnconversation}, SimpleQA \parencite{wei2024measuring}, and LiveBench \parencite{livebench} (as of 2024-11-25).
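For reference, the evaluation suite above can be summarized as a mapping from capability area to benchmarks. The sketch below is purely illustrative: the grouping and benchmark names follow the paragraph above, while the `EVAL_SUITE` structure and the summary loop are hypothetical and not the authors' actual evaluation harness.

```python
# Illustrative grouping of the Kimi-K2-Instruct evaluation suite by capability area.
# Benchmark names follow the text above; EVAL_SUITE itself is a hypothetical summary,
# not the authors' evaluation harness.
EVAL_SUITE = {
    "coding": [
        "LiveCodeBench v6 (2024-08 to 2025-05)", "OJBench", "MultiPL-E",
        "SWE-bench Verified", "TerminalBench", "Multi-SWE-bench",
        "SWE-Lancer", "PaperBench", "Aider-Polyglot",
    ],
    "tool_use": ["tau^2-Bench", "AceBench"],
    "reasoning": [
        "AIME 2024/2025", "MATH-500", "HMMT 2025", "CNMO 2024", "PolyMath-en",
        "ZebraLogic", "AutoLogi", "GPQA-Diamond", "SuperGPQA",
        "Humanity's Last Exam (Text-Only)",
    ],
    "long_context": ["MRCR", "DROP", "FRAMES", "LongBench v2"],
    "factuality": [
        "FACTS Grounding", "Vectara Hallucination Leaderboard", "FaithJudge",
    ],
    "general": [
        "MMLU", "MMLU-Redux", "MMLU-Pro", "IFEval", "Multi-Challenge",
        "SimpleQA", "LiveBench (as of 2024-11-25)",
    ],
}

if __name__ == "__main__":
    # Print a per-area benchmark count as a quick sanity check of the grouping.
    for area, benchmarks in EVAL_SUITE.items():
        print(f"{area}: {len(benchmarks)} benchmarks")
```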