Andrej Karpathy just dropped another open-source project, and this one’s ambitious even by his standards. AutoResearch, now live on GitHub, is a framework designed to automate the entire scientific research pipeline — from literature review to hypothesis generation to running experiments and writing up results. All of it handled by AI agents with minimal human intervention.
That’s a big swing.
The project, which Karpathy announced in late June 2025, builds on the growing wave of “AI scientist” tools that have emerged over the past year, but it takes a more opinionated and end-to-end approach than most. Rather than offering a chatbot that helps you brainstorm or a tool that summarizes papers, AutoResearch chains together multiple LLM-powered agents that each handle a discrete phase of the research process. One agent searches and synthesizes existing literature. Another formulates hypotheses based on gaps it identifies. A third writes and executes code to test those hypotheses. And a final agent compiles findings into a structured research report.
The repo is early-stage — Karpathy himself labels it experimental. But the architecture is already instructive for anyone building agentic AI systems.
At its core, AutoResearch relies on frontier language models (currently supporting OpenAI’s models and Anthropic’s Claude) to power each agent. The agents communicate through a shared context that accumulates as the pipeline progresses, so later stages can reference decisions and findings from earlier ones. It’s not a single monolithic prompt. It’s a multi-step workflow where each agent operates semi-independently but contributes to a coherent whole. Think of it less like a chatbot and more like a software pipeline where LLMs replace human researchers at each station.
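To make the shared-context pattern concrete, here is a minimal sketch of that kind of pipeline. This is illustrative only, not AutoResearch's actual code: the `SharedContext`, `Agent`, and `run_pipeline` names are hypothetical, and the lambdas stand in for real LLM calls.

```python
from dataclasses import dataclass, field

@dataclass
class SharedContext:
    """Accumulating state passed between pipeline stages."""
    topic: str
    notes: dict = field(default_factory=dict)

    def record(self, stage, output):
        self.notes[stage] = output

class Agent:
    """One pipeline stage. In a real system, fn would prompt an LLM
    with the accumulated context; here it is a plain function stub."""
    def __init__(self, name, fn):
        self.name = name
        self.fn = fn

    def run(self, ctx):
        ctx.record(self.name, self.fn(ctx))

def run_pipeline(topic, agents):
    ctx = SharedContext(topic=topic)
    for agent in agents:
        agent.run(ctx)  # each stage can read all earlier outputs via ctx.notes
    return ctx

# Hypothetical stages mirroring the four agents described above
agents = [
    Agent("literature", lambda ctx: f"survey of work on {ctx.topic}"),
    Agent("hypothesis", lambda ctx: "gap identified in " + ctx.notes["literature"]),
    Agent("experiment", lambda ctx: "results testing: " + ctx.notes["hypothesis"]),
    Agent("report",     lambda ctx: "write-up combining " + ", ".join(ctx.notes)),
]

ctx = run_pipeline("scaling laws for small language models", agents)
print(ctx.notes["report"])
```

The design point is that later stages never need the full transcript of earlier prompts, only the structured outputs recorded in the shared context, which keeps each agent's prompt focused on its own task.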
Karpathy has been vocal on X (formerly Twitter) about his belief that AI research itself will be one of the first domains to be substantially automated. He’s argued that much of the grunt work in ML research — running ablations, surveying related work, writing boilerplate sections of papers — is formulaic enough that current models can handle it competently. AutoResearch is essentially him putting code behind that thesis.
The timing matters. Several teams have released similar tools in recent months. Sakana AI’s “The AI Scientist,” released in mid-2024, demonstrated that LLMs could generate novel ML research papers and even submit them to peer review. More recently, projects from groups at Stanford and MIT have explored automated experiment design. But Karpathy’s version benefits from his enormous personal following — he has over 1.5 million followers on X — and his reputation as a builder who ships clean, well-documented code. His earlier projects like nanoGPT and llm.c became de facto educational resources for the AI community.
So what can AutoResearch actually do right now? Based on the repository’s documentation and examples, it can take a broad research topic — say, “scaling laws for small language models” — and produce a multi-page report that includes a literature synthesis, identified research questions, experimental code, results with visualizations, and a written analysis. The quality varies. Some outputs read like competent survey papers. Others show the familiar failure modes of LLM-generated research: shallow analysis, hallucinated citations, experiments that test obvious things.
That unevenness is the honest reality of where these tools stand.
For industry professionals, the implications split a few ways. If you’re running an ML research lab, tools like AutoResearch won’t replace your senior researchers anytime soon, but they could meaningfully accelerate the early stages of a project — especially literature review and initial experiment scoping. If you’re building agentic AI products, the repo offers a clean reference implementation for multi-agent pipelines with shared state. And if you’re thinking about the future of scientific publishing, automated research tools raise real questions about attribution, review standards, and what counts as original contribution.
There are obvious limitations. The system can only run experiments it can express in code, which currently means Python-based computational experiments. No wet labs. No physical simulations beyond what standard scientific Python libraries support. It also inherits all the well-known weaknesses of its underlying models — tendency toward plausible-sounding but incorrect reasoning, difficulty with truly novel ideas, and inconsistent mathematical rigor.
Karpathy hasn’t indicated whether this will become a maintained product or remain a proof of concept. Given his track record, it’ll likely evolve through community contributions. Within days of release, the repo had already picked up hundreds of stars and forks.
The broader signal here is clear. The people who know AI best are increasingly building tools to automate their own work. Whether that’s hubris or inevitability probably depends on your time horizon. But when someone with Karpathy’s track record bets on automated research, the industry pays attention. And it should.