AI · Comparison · WordPress

Claude Opus vs GPT-4 for writing WordPress plugins: full test

Head-to-head test: I asked Claude Opus and GPT-4 to build the same three WordPress plugins. The winner was judged on code quality, security defaults, and WP standards compliance.

10 min read

I gave Claude Opus 4.6 and GPT-4 the same three WordPress plugin prompts, with the same system instructions about WordPress Coding Standards. Then I reviewed the output blind (model labels stripped) and scored it. Here is what I found.

The test setup

Same prompts, same system message describing the WordPress context (WP 6.4, PHP 8.1, WooCommerce 8.5 where applicable), same temperature and top_p settings. No reference code provided. I ran each model three times per prompt and picked the best attempt, then reviewed each output blind against a numbered rubric: structure (15), security defaults (25), performance defaults (15), WP standards compliance (15), readability (15), feature completeness (15). Total out of 100.

Prompt 1: Simple — "Shortcode to show the current moon phase with an emoji"

Expected output is short, maybe 40 lines. The hard part is using a correct astronomical formula rather than a cheap approximation.

Claude Opus 4.6: used Conway's algorithm with the correct epoch, cached the calculation in a one-hour transient, returned an escaped string, and added a small test-friendly function with dependency injection for the current date. 94 out of 100. Lost points on the algorithm not being the most accurate choice, but it was clearly the right trade-off for a plugin this size.

GPT-4: used a simpler 29.53-day modulo formula, no caching, returned HTML directly instead of escaping. Less defensible on the formula side (accurate to maybe a day), but it worked. 72 out of 100. Lost points on the unescaped output, which is a real bug.
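For context, the simple modulo approach works roughly like this. A minimal sketch in Python, assuming a reference new moon on 2000-01-06 and the mean synodic month of 29.53059 days (the epoch, function names, and eight-emoji mapping are my illustration, not either model's actual output):

```python
from datetime import date

SYNODIC_MONTH = 29.53059           # mean length of a lunation, in days
KNOWN_NEW_MOON = date(2000, 1, 6)  # a reference new moon (approximate)

PHASE_EMOJI = ["🌑", "🌒", "🌓", "🌔", "🌕", "🌖", "🌗", "🌘"]

def moon_age(on: date) -> float:
    """Days since the last new moon, by simple modulo arithmetic."""
    return (on - KNOWN_NEW_MOON).days % SYNODIC_MONTH

def moon_emoji(on: date) -> str:
    """Map the moon's age to the nearest of eight phase emoji."""
    index = int(moon_age(on) / SYNODIC_MONTH * 8 + 0.5) % 8
    return PHASE_EMOJI[index]
```

Because it treats every lunation as exactly average length, this is only accurate to about a day, which is exactly the trade-off noted above; Conway-style algorithms correct for that at the cost of more arithmetic.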

Winner on this prompt: Claude.

Prompt 2: Medium — "Email digest of all WooCommerce low-stock products, weekly, configurable threshold"

Requires: cron, options page, email templating, stock lookup, multi-store support. Maybe 200 to 300 lines of real code.

Claude Opus 4.6: clean architecture split across four files, used wp_schedule_event with an unschedule on deactivate (correct), built the email from an HTML template using wp_mail with proper Content-Type headers, options page with nonce protection, and handled the zero-product edge case. Added a "Send test now" button on the settings page I had not asked for. 91 out of 100. Lost points for not deduplicating notifications if the same product stayed low across weeks.

GPT-4: single-file implementation, used wp_mail correctly, but stored the email template as a giant string constant which made customization painful. Cron was scheduled on activation but not unscheduled on deactivation, which means deactivating and reactivating would create duplicate cron events. 78 out of 100.
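The lifecycle pairing that GPT-4 missed is small but matters. A minimal sketch of the pattern, assuming the hypothetical `lsd_` prefix and hook names (these are illustrative, not from either model's output; this fragment only runs inside WordPress):

```php
<?php
register_activation_hook( __FILE__, 'lsd_activate' );
register_deactivation_hook( __FILE__, 'lsd_deactivate' );

function lsd_activate() {
    // Guard against stacking a duplicate event if activation runs twice.
    if ( ! wp_next_scheduled( 'lsd_weekly_digest' ) ) {
        wp_schedule_event( time(), 'weekly', 'lsd_weekly_digest' );
    }
}

function lsd_deactivate() {
    // Without this, every deactivate/reactivate cycle adds another event.
    wp_clear_scheduled_hook( 'lsd_weekly_digest' );
}

add_action( 'lsd_weekly_digest', 'lsd_send_digest' );
```

The `wp_next_scheduled` guard plus the `wp_clear_scheduled_hook` on deactivation is exactly what prevents the duplicate-cron bug described above.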

Winner on this prompt: Claude.

Prompt 3: Complex — "Sync WooCommerce orders to HubSpot as deals, bidirectionally, with field mapping UI"

This is the "real client project" scale. Expected around 600 to 900 lines. Multiple APIs, stateful logic, admin UI, queue, retry.

Claude Opus 4.6: split into seven files, created a queue using a custom table (not options, which is correct for volume), used Action Scheduler when available with a WP-Cron fallback, OAuth token refresh built in, field mapping UI was a real React-via-WP-Scripts component. Bidirectional sync was correctly debounced to prevent echo loops. 89 out of 100. Lost points because the React admin UI required a build step and the plugin ZIP did not include the compiled assets, so it would not actually work out of the box without running npm install and npm run build first.

GPT-4: five files, HubSpot client with manual retries, field mapping UI built with plain jQuery (simpler, worked immediately). Queue used the options table, which would not scale past a few thousand orders. Did not handle OAuth token refresh automatically, so the plugin would break after an hour. 69 out of 100.
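On the echo-loop point: one common way to debounce a bidirectional sync is to fingerprint the last payload you pushed and skip any inbound change that matches it. A minimal Python sketch of that idea (the class and method names are mine; neither model's actual implementation is shown here):

```python
import hashlib
import json

class EchoGuard:
    """Skip a sync when an inbound payload matches what we last pushed,
    so an update we wrote to the remote side does not bounce back and
    trigger another outbound write (an echo loop)."""

    def __init__(self) -> None:
        self._last_pushed: dict[str, str] = {}  # record id -> payload hash

    @staticmethod
    def _fingerprint(payload: dict) -> str:
        # Canonical JSON so key order does not change the hash.
        return hashlib.sha256(
            json.dumps(payload, sort_keys=True).encode()
        ).hexdigest()

    def record_push(self, record_id: str, payload: dict) -> None:
        self._last_pushed[record_id] = self._fingerprint(payload)

    def is_echo(self, record_id: str, payload: dict) -> bool:
        return self._last_pushed.get(record_id) == self._fingerprint(payload)
```

Any inbound webhook whose payload fingerprint matches the last outbound push for that record gets dropped instead of re-synced.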

Winner on this prompt: Claude.

Overall scores

  • Claude Opus 4.6 average: 91.3 out of 100
  • GPT-4 average: 73.0 out of 100

Claude was consistently better at security defaults (escaping, nonces, capability checks) and at thinking in terms of plugin lifecycle (activate, deactivate, uninstall). GPT-4 was slightly better at producing code that "just worked" without extra build steps, which matters if you are pasting into production in a hurry.

Where each model wins

Claude is the model I trust for client work where I will actually review the code and where I care that it follows conventions. The defaults are closer to what a senior plugin developer would produce.

GPT-4 is useful when the priority is a working prototype you will throw away. The simpler architectures it produces are sometimes easier to read if you do not need to scale.

One important caveat

I tested the models through an agent loop with tool use (file writes, reads, bash). Running Claude Opus 4.6 as a raw chat completion without the agent framework produces notably worse results, because the agent lets the model correct itself after reading its own output. If you are comparing models, make sure you are comparing like with like.

Cost

Claude Opus 4.6 at current pricing (5 dollars per million input tokens, 25 dollars per million output tokens) produced the complex plugin for about 2 dollars of raw API cost. GPT-4 was similar in cost. The quality difference is the story, not the price.

Would I switch between them?

I would use Claude Opus 4.6 for anything I would put my name on. I would use GPT-4 when the stakes are low and I want a one-shot answer. For everything in between, I would run both and keep whichever one I end up reading more carefully, which is usually a Claude output.