The Karpathy Loop: build a Claude skill that improves itself

Leonardo Garcia-Curtis · 13/05/2026

TL;DR

The Karpathy Loop is a feedback pattern named after Andrej Karpathy's iterative training playbook. Applied to Claude skills, it looks like this: a skill produces output, a second skill critiques it against criteria you defined, the original skill updates its own instructions based on the critique, and the next run is slightly better. Set it up as a weekly Cowork task and the skill is genuinely better six months later than it was the day you shipped it. The trick is defining the critique criteria sharply. Otherwise the loop drifts. Four moving parts: producer skill, critic skill, evaluation criteria, and a change log.

8 min read · Operator-level pattern · Last updated 13 May 2026

Part of Learn Claude Code: The Complete Operator's Guide. For the operator's overview of Skills, Connectors, Cowork, and Artifacts, start there.

Most people ship a skill, use it for a month, and the skill stays the same.

The output gets better only when you remember to fix it. Which is almost never.

Karol Zieminski borrowed an idea from Andrej Karpathy's training loops and applied it to Cowork. The skill itself iterates. Every week it critiques its own output and updates its instructions. The version you ship in month six is genuinely better than the version you shipped in week one, and you did not touch it.

This is how to build one.

What a Karpathy Loop actually is

In machine learning, Karpathy's iterative loop is the discipline of building a baseline, evaluating it carefully, identifying the single biggest failure mode, and fixing that one thing. Repeat forever. Compounding.

Applied to a Claude skill, the loop has four moving parts:

The producer skill. The thing that does the work. Writes the blog post, summarises the meeting, builds the report.

The critic skill. A separate skill whose only job is to score the producer's output against criteria you defined.

The evaluation criteria. A short rubric the critic uses. "Does the post sound like Leo? Does it use any of the banned phrases? Does it have a clear call to action?"

The change log. A file the loop writes describing what changed in the producer's instructions this cycle.

The loop runs weekly. The producer writes. The critic scores. If the score is below threshold on a specific criterion, the loop updates the producer's instructions to address that one criterion. Change log written. Next week the producer is slightly better.

Why it works

Three things compound at once.

First, the critic stays calibrated. It is reading the same rubric every week, so its scoring is consistent.

Second, the producer never drifts off the criteria. Every cycle, the instructions get refined toward what the criteria reward.

Third, you stay informed. The change log every week is a 30-second read. You spot when the loop is improving or going in a weird direction.

It is the same pattern good editors use on writers. The skill gets edited by another skill, on a schedule, against rules you defined.

The four-part setup

Part 1: Build the producer skill normally

Use the Module 4 walkthrough in Skills 101. Build the skill that does the work you actually want done. Get it to v1.

For our example we will use a blog-post writer skill called blog-writer.

Part 2: Build a critic skill

This is the new piece. The critic skill's only job is to score output against your rubric.

In Claude Desktop, hit the + next to Skills, pick Create skill, then Write skill instructions. The dialog gives you three fields: name, description, instructions. Fill them in for the critic.

[Image: the Create skill flow in Claude Desktop. Plus button, then Create skill, then Write skill instructions.]

The critic skill instructions look like this:

You are a critic skill for blog posts. Read the post the producer wrote. Score it 1-5 on each of these criteria:

1. Voice match: does it sound like Leo?
2. Phrase ban: any use of "leverage", "stakeholders", "moving forward", em-dashes, en-dashes?
3. Call to action: does it end with a workshop CTA?
4. Internal links: are there at least 3 real internal links to existing pages?
5. Structure: bold sub-headers on their own line, short paragraphs, no walls of text?

For each criterion below 4, write one sentence of specific feedback. Output as JSON.
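The prompt asks for JSON but does not pin down its shape. Here is a minimal sketch of one possible shape, plus a check that rejects malformed output before the loop acts on it. The key names (voice_match, phrase_ban and so on) and the score/feedback layout are assumptions for illustration, not something Claude or Cowork defines.

```python
import json

# Hypothetical key names for the five rubric criteria; the post does not fix them.
CRITERIA = ["voice_match", "phrase_ban", "call_to_action", "internal_links", "structure"]


def parse_critic_output(raw: str) -> dict:
    """Parse the critic's JSON and reject anything that breaks the rubric's rules."""
    data = json.loads(raw)
    if not isinstance(data, dict):
        raise ValueError("expected a JSON object keyed by criterion")
    for name in CRITERIA:
        entry = data.get(name)
        if not isinstance(entry, dict):
            raise ValueError(f"{name}: missing or malformed entry")
        score = entry.get("score")
        # bool is a subclass of int in Python, so rule it out explicitly.
        if isinstance(score, bool) or not isinstance(score, int) or not 1 <= score <= 5:
            raise ValueError(f"{name}: score must be an integer from 1 to 5")
        # The critic prompt asks for one sentence of feedback below 4.
        if score < 4 and not entry.get("feedback"):
            raise ValueError(f"{name}: a score below 4 needs feedback")
    return data


# Assumed example of what one critic run might return.
example = json.dumps({
    "voice_match": {"score": 4},
    "phrase_ban": {"score": 2, "feedback": "Paragraph three uses 'leverage'."},
    "call_to_action": {"score": 5},
    "internal_links": {"score": 3, "feedback": "Only two internal links found."},
    "structure": {"score": 4},
})
scores = parse_critic_output(example)
print(scores["phrase_ban"]["feedback"])
```

If you settle on a shape like this, spell it out in the critic's instructions so every weekly run returns the same keys.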

Sharp criteria are everything. "Better tone" is useless. "Does it use the word stakeholders" is testable.
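To see why the phrase ban counts as testable, note that it can be checked without any judgement at all. A rough sketch in Python, using the banned list from the critic prompt. The function name find_banned is hypothetical, and the critic skill itself does not need to run code like this; the point is that a criterion this sharp could be checked mechanically.

```python
import re

# Banned items taken from the critic prompt above.
BANNED_PHRASES = ["leverage", "stakeholders", "moving forward"]
# Dashes are written as escapes so the characters are unambiguous.
BANNED_CHARS = {"\u2014": "em-dash", "\u2013": "en-dash"}


def find_banned(text: str) -> list[str]:
    """Return the banned items found in text, in rubric order."""
    hits = []
    for phrase in BANNED_PHRASES:
        # Whole-word, case-insensitive match. Inflections such as
        # "leveraging" are not caught and would need their own patterns.
        if re.search(r"\b" + re.escape(phrase) + r"\b", text, flags=re.IGNORECASE):
            hits.append(phrase)
    for char, label in BANNED_CHARS.items():
        if char in text:
            hits.append(label)
    return hits


draft = "We leverage our stakeholders \u2014 always."
print(find_banned(draft))
```

There is no equivalent function for "better tone". That is the difference between a sharp criterion and a soft one.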

Part 3: Schedule the loop in Cowork

In Cowork, schedule a weekly task. Something like Sunday at 8pm.

Take the three blog posts the blog-writer skill produced this week from the "Blog drafts" Notion database. For each, invoke the critic-blog skill to score it. Identify the criterion with the lowest aggregate score across the three posts. Update the blog-writer skill's instructions to specifically address that criterion. Write what-changed.md noting which criterion scored lowest and what change was made.

That is the loop. The critic looks at last week's work, finds the weakest pattern, updates the producer to fix it.
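The "lowest aggregate score" step is simple arithmetic. A minimal sketch, assuming each post's scores arrive as a dictionary of criterion to score. The function names, the key names and the what-changed.md entry layout are all illustrative assumptions; Cowork does this work from the task prompt, not from this code.

```python
from datetime import date


def weakest_criterion(scorecards: list[dict[str, int]]) -> tuple[str, float]:
    """Return the criterion with the lowest mean score across the posts.

    Assumes every scorecard scores every criterion. Ties go to the
    criterion that appears first in the first scorecard.
    """
    totals: dict[str, int] = {}
    for card in scorecards:
        for criterion, score in card.items():
            totals[criterion] = totals.get(criterion, 0) + score
    name = min(totals, key=totals.get)
    return name, totals[name] / len(scorecards)


def changelog_entry(day: date, criterion: str, mean: float, change: str) -> str:
    """Format one entry for what-changed.md (layout is an assumption)."""
    return (
        f"## {day.isoformat()}\n"
        f"- Lowest criterion: {criterion} (mean {mean:.2f} of 5)\n"
        f"- Change made: {change}\n"
    )


# Made-up scores for one week's three posts.
week = [
    {"voice_match": 4, "phrase_ban": 2, "call_to_action": 5, "internal_links": 3, "structure": 4},
    {"voice_match": 4, "phrase_ban": 3, "call_to_action": 5, "internal_links": 3, "structure": 4},
    {"voice_match": 3, "phrase_ban": 2, "call_to_action": 4, "internal_links": 4, "structure": 5},
]
name, mean = weakest_criterion(week)
print(changelog_entry(date(2026, 5, 17), name, mean, "Added an explicit banned-phrase list."))
```

Picking one criterion per cycle keeps each change small enough to judge in the Monday read.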

Part 4: Read the change log every week

This is the discipline that keeps the loop honest.

Open the change log every Monday. 30 seconds. Three questions:

Does the change make sense?

Is the loop moving the producer in the direction I actually want?

Is any criterion stuck at a low score for multiple weeks (meaning the loop cannot fix it on its own)?

If anything looks off, intervene. The loop is a junior teammate, not a god. You are still the lead.
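The third question, the stuck criterion, is the easiest to miss by eye. A small sketch of how you might flag it, assuming you keep one dictionary of mean scores per week. The function name, the threshold of 4 and the three-week window are assumptions, not settings the loop ships with.

```python
def stuck_criteria(
    history: list[dict[str, float]], threshold: float = 4.0, weeks: int = 3
) -> list[str]:
    """Criteria whose weekly mean stayed below threshold for the last `weeks` entries.

    `history` is ordered oldest to newest. A criterion missing from a
    week is treated as not stuck.
    """
    if len(history) < weeks:
        return []
    recent = history[-weeks:]
    return sorted(
        name
        for name in recent[0]
        if all(week.get(name, threshold) < threshold for week in recent)
    )


# Made-up weekly means, oldest first.
history = [
    {"voice_match": 3.0, "phrase_ban": 2.3, "internal_links": 3.3},
    {"voice_match": 3.3, "phrase_ban": 3.7, "internal_links": 3.0},
    {"voice_match": 3.0, "phrase_ban": 4.3, "internal_links": 4.0},
    {"voice_match": 3.3, "phrase_ban": 4.7, "internal_links": 4.3},
]
print(stuck_criteria(history))
```

In this made-up history the phrase ban and internal links recover while voice match stays low, which is the signal to rewrite that criterion yourself.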

Where this breaks

Soft criteria. "Tone is better" is not measurable. The loop cannot improve against it. Every criterion in your rubric must be checkable in one read.

Too many criteria. Five is sharp. Twenty is noise. The loop will jitter, trying to improve everything and fixing nothing.

Critic too lenient. If the critic scores everything 4-5, there is no improvement signal. Calibrate the critic on a deliberately bad draft and a deliberately good one. Make sure the score range is wide.

No human in the read. The change log is what keeps the loop accountable. Skip it for a month and you stop knowing what your producer skill is doing.
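The calibration check for a lenient critic can be sketched the same way. Score one deliberately bad draft and one deliberately good draft, then look at the gap per criterion. The function names and the minimum gap of 2 points are assumptions for illustration.

```python
def score_spread(bad: dict[str, int], good: dict[str, int]) -> dict[str, int]:
    """Gap between the good draft's score and the bad draft's, per criterion."""
    return {name: good[name] - bad[name] for name in good}


def weak_discriminators(
    bad: dict[str, int], good: dict[str, int], min_gap: int = 2
) -> list[str]:
    """Criteria where the critic barely separates the bad draft from the good one."""
    return [name for name, gap in score_spread(bad, good).items() if gap < min_gap]


# Made-up critic scores for the two calibration drafts.
bad_draft = {"voice_match": 4, "phrase_ban": 1, "call_to_action": 2}
good_draft = {"voice_match": 5, "phrase_ban": 5, "call_to_action": 5}
print(weak_discriminators(bad_draft, good_draft))
```

Any criterion this flags is one where the critic's wording needs tightening before the loop can learn from it.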

What we use this for at Waboom AI

If you want to see a Karpathy Loop running live on real client work, we demo it at our Claude Code course. Watching a skill rewrite its own instructions on a Sunday night is the moment most operators stop treating Claude as a chat tool.

Our voice-DNA enforcer skill runs a Karpathy Loop weekly. Every weekend it reads the week's blog drafts, scores them against our voice rules, and tightens the rules where the drafts drifted.

Six months in, the skill catches things we did not even know to flag in week one. Em-dashes hidden in image alt-text. Three-adjective fanfare. "Imagine this" openings.

The original v1 of that skill had nine rules. The current version has 28. We wrote nine of them. The loop wrote the other 19.

What to do next

You need v1 of a producer skill before you build a loop. Do Skills 101 first if you have not.

Then pick the work you do most often, build a producer, build a critic, schedule the loop. By month three you will feel the compounding.

Credit: this post adapts Karol Zieminski's original Karpathy Loop write-up. We strongly recommend reading the source for additional context.

Self-paced

Build your first skill before you build a loop. Six short modules. One hour. Free.

Start Claude Skills 101 →

Hands-on with us

The live workshop covers loops, critics, and the change-log discipline. You leave with a producer + critic running on your real work.

See the workshop →

