Example · Case study

Cartpilot: 38% fewer agent drop-offs in 30 days.

Cartpilot converted well for humans, but AI agents could not compare variants, service levels, or bundle rules. We rebuilt the product evidence, clarified checkout, and left behind a repeatable agent-readiness test.

The context

Cartpilot supplies industrial and workshop customers with components and consumables — 12,400 SKUs, most with two to seven variants. Human conversion was steady at 4.7%.

In early 2026, AI agents started crawling the catalog: GPT Operator and Perplexity Comet sessions showed up in the logs. They converted at 0.4%. The agents found Cartpilot and bought from competitors.

Where the path broke

We reconstructed 47 agent sessions from the last 60 days. Three breaks explained most of the failures:

Variants without an anchor

Marketing named a 'Pro' version, but nothing told the agent which SKU that was or how it differed on voltage, fit, or capacity. Agents chose by shipping speed and often picked the wrong variant.

Rules that appeared too late

Service level and bundle constraints were resolved only at the cart. Headless agents gave up after 30 seconds of inactivity, before they ever saw the options.

Compatibility locked in PDFs

Compatibility information (which part combines with which) lived in PDF spec sheets the crawler couldn't cite. Agents built invalid bundles and read the failure as a fault of the shop.

"We didn't have a conversion problem. We had an agent-trust problem. In 30 days it was gone."

Head of Product, Cartpilot

What we did

Four workstreams ran in parallel over 30 days, each with a clear owner and a pass criterion before going live.

01

Mapped the decision hotspots

We replayed the real buying task — 'find a 110V express-shippable drill compatible with the existing battery system' — and marked every point where agents stalled. The result was a 12-point heatmap per top category, weighted by revenue risk.

Tactic

GPT-4 Action Replay + internal browser telemetry. Output: a hotspot list, not a 60-page audit.
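
One way to turn replayed sessions into a revenue-weighted hotspot list is sketched below. The record shape, the step labels, and the scoring rule (stall rate multiplied by category revenue) are assumptions made for illustration; the case study does not say how the heatmap was scored.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass(frozen=True)
class Stall:
    """One point in a replayed session where the agent stopped progressing."""
    session_id: str
    category: str
    step: str  # hypothetical step label, e.g. "variant_choice"


def rank_hotspots(stalls, sessions_per_category, revenue_by_category, top_n=12):
    """Rank (category, step) pairs by stall rate weighted by category revenue."""
    # Count each (category, step) at most once per session, so a single
    # agent looping on one step does not dominate the ranking.
    seen = {(s.session_id, s.category, s.step) for s in stalls}
    counts = Counter((cat, step) for _, cat, step in seen)

    scored = []
    for (cat, step), n in counts.items():
        stall_rate = n / sessions_per_category[cat]
        scored.append((cat, step, round(stall_rate * revenue_by_category[cat], 2)))

    # Highest score first; ties broken alphabetically for a stable order.
    scored.sort(key=lambda row: (-row[2], row[0], row[1]))
    return scored[:top_n]


# Illustrative data only, not Cartpilot's numbers.
stalls = [
    Stall("s1", "drills", "variant_choice"),
    Stall("s1", "drills", "variant_choice"),  # same session, counted once
    Stall("s2", "drills", "variant_choice"),
    Stall("s2", "drills", "service_level"),
    Stall("s3", "batteries", "bundle_check"),
]
ranked = rank_hotspots(
    stalls,
    sessions_per_category={"drills": 4, "batteries": 2},
    revenue_by_category={"drills": 100_000.0, "batteries": 30_000.0},
)
for category, step, score in ranked:
    print(f"{category:10s} {step:16s} {score:>10.2f}")
```

Capping the output at `top_n=12` mirrors the 12-point heatmap per category described above; a real scoring would likely also weight by how late in the task the stall occurred.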

02

Made each variant self-describing

For 11 recurring variant axes we built distinguishing-fact templates and rolled them across 6,214 PDPs as JSON-LD, so agents parse the facts directly instead of scraping prose.

Tactic

Schema.org ProductGroup with variesBy + per-variant Product nodes. 6,214 pages migrated in 8 days, validated against Google Rich Results and a headless agent crawl.

product-group.jsonld (JSON-LD). variesBy lists the axes an agent must disambiguate.

    {
      "@context": "https://schema.org",
      "@type": "ProductGroup",
      "name": "Cordless Impact Drill",
      "variesBy": ["voltage", "batterySystem"],
      "hasVariant": [{
        "@type": "Product",
        "sku": "DRL-110-PRO",
        "name": "Impact Drill 110V Pro",
        "additionalProperty": [
          { "@type": "PropertyValue", "name": "voltage", "value": "110V" },
          { "@type": "PropertyValue", "name": "batterySystem", "value": "M-Core" }
        ]
      }]
    }
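
A check along the following lines could confirm that every variant states a value for each axis named in variesBy, which is what lets an agent tell variants apart. This is a minimal sketch: the function name, the second SKU, and the rule that axis names must match additionalProperty names exactly are assumptions, not part of Schema.org or of Cartpilot's pipeline.

```python
import json


def missing_axes(product_group: dict) -> dict:
    """Return {sku: [axes with no stated value]} for incomplete variants."""
    axes = product_group.get("variesBy", [])
    if isinstance(axes, str):  # JSON-LD allows a single value instead of a list
        axes = [axes]

    variants = product_group.get("hasVariant", [])
    if isinstance(variants, dict):  # same single-value shorthand
        variants = [variants]

    gaps = {}
    for variant in variants:
        props = variant.get("additionalProperty", [])
        stated = {p.get("name") for p in props if p.get("value") not in (None, "")}
        missing = [axis for axis in axes if axis not in stated]
        if missing:
            gaps[variant.get("sku", "<no sku>")] = missing
    return gaps


# DRL-EXAMPLE-2 is a hypothetical SKU added to show a failing variant.
doc = json.loads("""
{
  "@context": "https://schema.org",
  "@type": "ProductGroup",
  "name": "Cordless Impact Drill",
  "variesBy": ["voltage", "batterySystem"],
  "hasVariant": [
    {"@type": "Product", "sku": "DRL-110-PRO",
     "additionalProperty": [
       {"@type": "PropertyValue", "name": "voltage", "value": "110V"},
       {"@type": "PropertyValue", "name": "batterySystem", "value": "M-Core"}]},
    {"@type": "Product", "sku": "DRL-EXAMPLE-2",
     "additionalProperty": [
       {"@type": "PropertyValue", "name": "voltage", "value": "110V"}]}
  ]
}
""")

print(missing_axes(doc))  # {'DRL-EXAMPLE-2': ['batterySystem']}
```

Run across all migrated pages, a non-empty result would flag variants that still leave an agent guessing on at least one axis.
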
03

Surfaced the rules before the cart

Service level and bundle constraints moved up to the product page as semantic options agents could evaluate before committing — so the decision happened where the agent still had context.

Tactic

Service level as a labelled <select>, defaults mapped per category, readable in headless browsing without JS.
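
Whether the options are labelled in the static HTML, before any JavaScript runs, can be checked with the standard library alone. The sketch below is an assumption about how such a check might look: it only recognises explicit `<label for="...">` associations (not a label wrapping the control, and not aria-label), and the field names in the sample markup are invented.

```python
from html.parser import HTMLParser


class SelectLabelAudit(HTMLParser):
    """Collect <select> elements and <label for=...> targets from static HTML."""

    def __init__(self):
        super().__init__()
        self.label_targets = set()
        self.selects = []  # (id, name) for each <select> found

    def handle_starttag(self, tag, attrs):
        attr = dict(attrs)
        if tag == "label" and attr.get("for"):
            self.label_targets.add(attr["for"])
        elif tag == "select":
            self.selects.append((attr.get("id"), attr.get("name")))


def unlabelled_selects(html: str) -> list:
    """Return the name (or id) of every <select> with no matching <label for>."""
    audit = SelectLabelAudit()
    audit.feed(html)
    return [
        name or sel_id or "<anonymous>"
        for sel_id, name in audit.selects
        if sel_id not in audit.label_targets
    ]


# Hypothetical product-page fragment as served, with no script execution.
page = """
<label for="service-level">Service level</label>
<select id="service-level" name="service_level">
  <option value="standard" selected>Standard</option>
  <option value="express">Express</option>
</select>
<select name="bundle"><option>Battery kit</option></select>
"""

print(unlabelled_selects(page))  # ['bundle']
```

Because the parser sees only the served markup, a control that appears in this audit is one a headless agent without JS could also see.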

04

Handed over the test

The 47 agent tasks became a CI check that runs before every release. Below 90% task completion, the deploy is blocked, so a weak new product page is caught before launch, not after.

Tactic

Playwright + Anthropic Claude API as agent simulator, integrated via GitHub Actions.
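
The gating rule itself is simple enough to sketch. The snippet below assumes the agent simulator has already produced a pass/fail result per task; the task names and function names are hypothetical, and only the 90% threshold and the count of 47 tasks come from the case study.

```python
THRESHOLD = 0.90  # from the case study: below 90% completion the deploy is blocked


def completion_rate(results: dict) -> float:
    """Fraction of agent tasks that completed successfully."""
    if not results:
        raise ValueError("no task results to evaluate")
    return sum(1 for ok in results.values() if ok) / len(results)


def gate(results: dict, threshold: float = THRESHOLD) -> int:
    """Return a process exit code: 0 allows the deploy, 1 blocks it."""
    rate = completion_rate(results)
    failed = sorted(task for task, ok in results.items() if not ok)
    passed = len(results) - len(failed)
    print(f"agent task completion: {rate:.1%} ({passed}/{len(results)})")
    for task in failed:
        print(f"  failed: {task}")
    return 0 if rate >= threshold else 1


# Illustrative run: 47 tasks, 43 of which pass.
results = {f"task-{i:02d}": i < 43 for i in range(47)}
exit_code = gate(results)
print("deploy allowed" if exit_code == 0 else "deploy blocked")
```

With 47 tasks, 43 passes (91.5%) clear the bar while 42 (89.4%) do not. In a CI step, passing the return value of `gate` to `sys.exit` would make a failing run block the job.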

Results after 30 days

The headline numbers, with the main one broken down by lever.

38% fewer agent drop-offs
61% more product evidence live

−38% agent drop-offs

The 38% reduction broke down into four levers we measured separately, each shown as its contribution in percentage points:

  • clearer variant evidence: 14 pts
  • service levels surfaced upstream: 12 pts
  • bundle constraints in structured markup: 8 pts
  • QA-validated PDPs before launch: 4 pts
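
As a quick arithmetic check, the four lever figures can be summed and expressed as shares of the total. Reading them as percentage-point contributions to the 38% is an assumption; the source lists the figures without stating the unit.

```python
# Lever contributions as listed in the results, read as percentage points.
levers = {
    "clearer variant evidence": 14,
    "service levels surfaced upstream": 12,
    "bundle constraints in structured markup": 8,
    "QA-validated PDPs before launch": 4,
}

total = sum(levers.values())
assert total == 38  # matches the headline reduction in agent drop-offs

shares = {name: points / total for name, points in levers.items()}
for name, points in levers.items():
    print(f"{name}: {points} of {total} points ({shares[name]:.0%} of the reduction)")
```

On this reading, variant evidence and upstream service levels together account for 26 of the 38 points, roughly two thirds of the improvement.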

What they kept

The durable output wasn't the redesign — it was the test. Cartpilot now runs the 47 agent tasks before every release, so product-readiness is something the team checks itself, the same way it checks that a page renders. Maintenance: about two hours a month.