Pilot-to-Production: The Criteria That Mean You're Actually Ready to Scale
Most pilots that advance to production scaling do so because someone was impatient, not because they were ready. Here's how to tell the difference.

McKinsey's research on scaling robotics deployments beyond the pilot phase found that roughly 40% of executives reported deployed pilots that were technically exciting but produced unclear business value. [REPORTED] The robots worked. The business case didn't — or couldn't be proven well enough to justify production investment.
The gap between a successful pilot and a scalable deployment is not primarily technical. Pilots succeed in controlled conditions. Production scaling succeeds only when those results generalize across more facilities, more shifts, more operators, and more variability than the pilot ever encountered.
Scaling before you're ready is more expensive than not scaling at all. A failed production rollout doesn't just waste capital — it burns organizational credibility for the next robotics initiative, sometimes for years.
The Readiness Framework
Eight criteria across four domains. Each should be confirmed, not assumed, before a production scaling decision.
Domain 1: Technology Readiness
Criterion 1: Sustained uptime above 90%, measured over at least 60 consecutive operating days
The pilot's uptime must be measured over a window long enough to capture realistic variability: shift changes, weekend restarts, software updates, and at least one maintenance cycle. Sixty consecutive operating days is the minimum; ninety is better. Thirty days of clean data proves nothing.
The measurement must reflect actual production conditions, not a curated window. If you're reporting uptime metrics by excluding days when the robot was down for "scheduled" vendor visits, your number is wrong.
Why 90%? Below 90% uptime, the labor planning equation breaks down. Your workforce can't plan around a robot that might be down 10%+ of the time — they'll route around it habitually, and your labor displacement gains evaporate.
Anything below 85% sustained uptime in a pilot should be treated as a vendor SLA failure and resolved before scaling is considered.
Criterion 2: Performance in edge cases is documented and handled
A pilot that has only run under ideal conditions hasn't proved anything. The production readiness question isn't "does the robot work?" It's "what happens when things go wrong, and can your team handle it?"
Before scaling:
- Document the 10 most common fault types encountered during the pilot
- Verify that your operators can diagnose and respond to each without vendor assistance
- Document the 5 edge cases the robot handled poorly and either resolve them or explicitly exclude them from the scaled deployment scope
If you can't list the edge cases because you don't track them, you're not ready.
Criterion 3: Software update risk is understood and managed
Software updates are how vendors improve their products and also how they break deployments. Before scaling, establish:
- How often does the vendor push updates, and how much advance notice do they provide?
- What's your tested process for rolling back an update that causes a regression?
- Does the vendor notify you before pushing to production, or do they push automatically?
Scaling across multiple facilities or units without a managed software update process means one bad update can take your entire fleet offline simultaneously. This is an operational risk that's easy to prevent and rarely scoped before scaling.
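The questions above amount to a staged-rollout discipline: canary first, health check, then the rest of the fleet. A minimal sketch of the fleet-side gate, assuming the vendor exposes per-unit update, rollback, and health-check hooks (all three callables here are placeholders for whatever your vendor actually provides):

```python
def staged_rollout(fleet, apply_update, rollback, healthy, canary_count=1):
    """Update a canary subset first; roll back and stop on any regression.

    `apply_update`, `rollback`, and `healthy` are placeholders for the
    vendor's actual update, rollback, and health-check mechanisms.
    """
    canaries, rest = fleet[:canary_count], fleet[canary_count:]
    for unit in canaries:
        apply_update(unit)
    if not all(healthy(u) for u in canaries):
        for unit in canaries:
            rollback(unit)          # tested rollback path, per Criterion 3
        return "rolled_back"
    for unit in rest:
        apply_update(unit)
    return "complete"
```

The point is not this particular function; it is that a one-bad-update-takes-down-the-fleet failure is impossible by construction once updates must pass a canary gate.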
Domain 2: Operations Readiness
Criterion 4: Operator competency is institutional, not individual
Pilot success often rides on one or two exceptional operators who became experts through the pilot experience. This is not institutional competency.
Before scaling, test your training infrastructure:
- Can a new operator reach basic certification in under 60 days using your internal materials, without vendor involvement?
- Do your supervisors know how to interpret the robot's performance data and escalate appropriately?
- Does your maintenance team know the planned maintenance schedule and have the parts to execute it?
If any of these require the vendor's ongoing presence, you have a vendor dependency, not a competency. At scale across multiple sites, vendor dependency is a production bottleneck.
Criterion 5: Your integration is documented and maintainable
Custom software integrations built during a pilot are frequently underdocumented. Before scaling, require:
- Complete documentation of the integration architecture: what connects to what, how, with what credentials and protocols
- A clear answer to "who maintains this when the vendor releases a software update on either side?"
- Source code and configuration files held by your organization, not only by the vendor or integrator
At single-site pilot scale, institutional knowledge can substitute for documentation. Across five facilities, it can't.
Domain 3: Organization Readiness
Criterion 6: An executive owner with cross-functional authority
Pilots can succeed with a mid-level operations champion who has enough informal authority to make things happen within their facility. Production scaling requires someone with budget authority across facilities, the ability to resolve resource conflicts between the robotics program and competing operational priorities, and accountability for the program outcome.
This person must exist and must be named before scaling begins. A steering committee is not the same as an executive owner.
McKinsey's research specifically identifies organizational trust — not technical readiness — as the decisive factor in whether robotics programs advance from pilot to scale. Two to three years of organizational adjustment are typically required before companies are confident enough to expand. [REPORTED] That adjustment is cultural as much as operational, and it requires sustained leadership attention.
Criterion 7: Change management infrastructure is in place at scale
Your pilot change management worked for one site with a team that went through the deployment together. At scale, new sites start from zero organizational readiness.
Before scaling:
- Is the change management playbook documented in enough detail that a site that wasn't in the pilot can execute it?
- Is there a dedicated change management resource who will be on-site at each new location during go-live?
- Is the communication strategy for new sites — addressing workforce anxiety, documenting role changes — ready to execute?
If the answer to any of these is "we'll figure it out site by site," you're planning to repeat your pilot mistakes at scale.
Domain 4: Financial Readiness
Criterion 8: The pilot's financial performance is fully reconciled, and the scaling case uses actual costs
This is the gate that is most commonly skipped because it requires admitting that the pilot's real costs differed from the original projection.
Before a scaling decision, require a full reconciliation of actual pilot costs versus projected costs:
- What did integration actually cost, versus the original estimate?
- What was the actual utilization rate, versus the vendor's reference case?
- What were actual maintenance costs, including unplanned repairs and parts?
- What did change management actually cost, including the undocumented hours?
Then build the scaling financial case using actual multipliers from the pilot, not the original vendor projections.
| Cost Category | Projected (pilot) | Actual (pilot) | Scaling assumption |
|---|---|---|---|
| Hardware (per unit) | $X | $X | Same |
| Integration (per site) | $Y | $Y+Z | Actual + 10% buffer |
| Year-1 maintenance | $A | $B | Actual |
| Change management | $C | $D | Actual per site |
| Payback period | N months | N+Δ months | Recalculated |
If the payback period calculated on actual costs still makes the scaling decision financially compelling, proceed. If it only works when you revert to projected costs, the decision isn't supported by your data.
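To make the recalculation concrete, here is a minimal payback sketch mirroring the table above. All figures are illustrative; the point is that the same formula run on actual pilot costs can stretch the payback period far beyond the projected case:

```python
def payback_months(hardware: float, integration: float,
                   annual_run_costs: float, monthly_savings: float,
                   integration_buffer: float = 0.10) -> float:
    """Simple payback: upfront cost divided by monthly net savings.

    Mirrors the scaling assumptions above: integration uses the actual
    pilot cost plus a 10% buffer; run costs cover maintenance and
    change management at actual (not projected) levels.
    """
    upfront = hardware + integration * (1 + integration_buffer)
    monthly_net = monthly_savings - annual_run_costs / 12
    if monthly_net <= 0:
        return float("inf")  # never pays back
    return upfront / monthly_net

# Illustrative figures only: projected vs. actual pilot costs, one site.
projected = payback_months(250_000, 80_000, 30_000, 25_000)   # ~15 months
actual = payback_months(250_000, 140_000, 55_000, 19_000)     # ~28 months
```

If the `actual` number still clears your investment threshold, the scaling case stands on its own data. If only `projected` clears it, it doesn't.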
The Readiness Scorecard
Use this before the scaling decision meeting. Each criterion is Pass / Conditional / Not Ready.
| Criterion | Status | Notes |
|---|---|---|
| 1. Sustained uptime ≥90% over 60 operating days | | |
| 2. Edge cases documented and handled | | |
| 3. Software update risk managed | | |
| 4. Operator competency is institutional | | |
| 5. Integration documented and maintainable | | |
| 6. Executive owner with cross-functional authority | | |
| 7. Change management infrastructure at scale | | |
| 8. Scaling case built on actual pilot costs | | |
Green light for scaling: All 8 are Pass, or 7 are Pass and 1 is Conditional with a documented remediation plan.
Conditional green: 5–6 Pass, 2–3 Conditional with clear remediation, 0 Not Ready. Scale to one additional site as a "controlled expansion" before committing the full fleet.
Red: delay scaling. Any "Not Ready" is a stop-sign, not a discussion point. The cost of scaling before a Not Ready criterion is resolved is consistently higher than the cost of the delay.
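The green / conditional-green / red rule can be written down exactly, which removes wiggle room from the decision meeting. A sketch, assuming a status is recorded for each of the eight criteria (criterion names are whatever your scorecard uses):

```python
def scaling_gate(statuses: dict[str, str]) -> str:
    """Apply the green / conditional-green / red decision rule.

    `statuses` maps each of the 8 criteria to "pass", "conditional",
    or "not_ready". Conditional criteria are assumed to carry a
    documented remediation plan; verify that separately.
    """
    assert len(statuses) == 8, "score all eight criteria before deciding"
    counts = {s: list(statuses.values()).count(s)
              for s in ("pass", "conditional", "not_ready")}
    if counts["not_ready"] > 0:
        return "red"                  # any Not Ready is a stop sign
    if counts["pass"] >= 7:
        return "green"                # all 8 pass, or 7 pass + 1 conditional
    if counts["pass"] >= 5:
        return "conditional_green"    # 5-6 pass, 2-3 conditional
    return "red"
```

Note the asymmetry: a single "not_ready" short-circuits everything else, exactly as the red rule above demands.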
What "Controlled Expansion" Looks Like
When you're between conditional green and green, the right move is a controlled expansion rather than a full fleet rollout.
A controlled expansion replicates the pilot deployment at one additional site, with all the same rigor as the original pilot, before committing capital to the full rollout. The controlled expansion is specifically designed to test the readiness gaps. It answers: does the playbook work when the team that built it isn't on the ground?
Controlled expansion budget: treat it as a mini-pilot, with its own baseline measurement, KPI targets, and decision gate. Not a foregone conclusion. If the controlled expansion doesn't perform as well as the original pilot, find out why before you multiply the problem.
The Cases That Rushed
The organizations that regret production scaling decisions are, almost without exception, organizations that advanced before hitting Criteria 4 (institutional competency) and 8 (actual-cost financial case). They had enthusiasm and organizational pressure for progress, which substituted for data.
The pattern: a pilot shows promising results. Leadership sets a deadline for the full rollout — "we'll have all five sites live by Q3." The deployment team knows the competency infrastructure isn't ready, but the deadline is set, and they spend Q2 rushing training programs instead of building them properly. Go-live happens on schedule. Performance is 60% of the pilot benchmark. The program is declared underperforming. The robots get reprogrammed for narrower use cases. The ROI never closes.
This is the most common failure mode in production scaling, and it's entirely preventable by the 8-criterion gate. The delay is never as expensive as the failed rollout.
Next in this series: Measuring Success — The KPIs That Matter Beyond "It Works"


