Running a Humanoid Pilot: Scope, Success Criteria, and Exit Options
The pilot that produces no decision is the most expensive outcome. Design yours to avoid it.

When Agility Robotics signed its commercial agreement with Toyota Motor Manufacturing Canada in February 2026, it followed a structured pilot that produced a clear positive outcome: measurable productivity, sufficient reliability, and a customer confident enough to pay for ongoing service. That outcome didn't happen accidentally. The Toyota pilot had defined task scope (RAV4 logistics support, not general-purpose floor work), defined success metrics, and a commercial structure that gave both parties clear terms for expansion.
The contrast with most humanoid pilot programs in the market is stark. Gartner's assessment of the broader humanoid market is that the majority of enterprises advancing proofs-of-concept will never reach production deployment. That failure rate is not primarily technological. It's structural: pilots without defined scope, without quantifiable success criteria, and without a named owner who is accountable for producing a decision.
A humanoid pilot that ends at day 90 with "the team feels positive about the technology" is a failed pilot. It consumed capital, staff time, and vendor support resources — and produced nothing that can inform a go/no-go recommendation to leadership. Design your pilot to avoid that outcome from day one.
Pre-Pilot: The Work That Determines the Outcome
The most important decisions in a humanoid pilot are made before the robot arrives.
Define the task scope in writing
The task scope has two dimensions: the physical task the robot will perform, and the operational boundaries within which it will perform it.
The physical task must be specific enough to measure. "Support warehouse operations" is not a task scope. "Transport totes from the inbound receiving station (Zone A) to the staging buffer (Zone C) on Shift 2, Monday through Friday" is a task scope. The level of specificity determines whether you can measure performance.
The operational boundaries define what the robot will not do during the pilot. This matters because vendor account managers will advocate for expanding scope as soon as the initial task is running — "the robot is doing great, let's also have it handle returns" — and scope expansion during a pilot destroys the ability to measure performance cleanly. Hold the line. One task, one zone, one shift.
Establish the baseline before the robot arrives
Two weeks of baseline measurement is non-negotiable. Run the operation as you normally run it, measure every KPI in your success definition, and log the results. This is the only honest comparison point when you reach day 90.
Common baseline metrics for humanoid logistics deployments:
| KPI | Unit | How to Measure |
|---|---|---|
| Task throughput | Totes/cases/units per shift | WMS logs or manual count |
| Labor hours on target task | Hours per shift | Timecard data or direct observation |
| Exception rate | Incidents requiring human intervention per 100 cycles | Manual incident log |
| Task cycle time | Minutes per completed cycle | WMS timestamps or stopwatch sampling |
Pick three. More than three metrics create ambiguity about what actually drives the go/no-go recommendation. Fewer than two leave you without a fallback if one metric is noisy.
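A minimal sketch of the baseline step, assuming KPI samples are kept as simple per-shift logs (every name and number below is illustrative, not real facility data). The coefficient of variation gives an early warning that a metric is too noisy to serve as a clean comparison point:

```python
from statistics import mean, stdev

def summarize(samples):
    """Per-shift KPI samples -> (mean, coefficient of variation)."""
    m = mean(samples)
    cv = stdev(samples) / m if len(samples) > 1 and m else 0.0
    return m, cv

# Illustrative two-week baseline: one value per shift, ten shifts.
baseline = {
    "throughput_totes_per_shift": [410, 395, 402, 420, 388, 415, 398, 407, 412, 400],
    "exception_rate_per_100":     [1.8, 2.1, 1.6, 2.4, 1.9, 2.0, 1.7, 2.2, 1.8, 2.1],
    "cycle_time_min":             [3.2, 3.4, 3.1, 3.5, 3.3, 3.2, 3.6, 3.1, 3.3, 3.4],
}

for kpi, samples in baseline.items():
    m, cv = summarize(samples)
    noisy = cv > 0.15  # arbitrary noise threshold; calibrate to your operation
    flag = "  <- noisy, weak comparison point" if noisy else ""
    print(f"{kpi}: mean={m:.2f}, cv={cv:.2%}{flag}")
```

The point of the sketch is the discipline, not the code: if a candidate KPI's baseline is already swinging 15%+ shift to shift, it will not support a clean day-90 comparison and should not be one of your three.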
Get the success definition signed by GM and Finance
Turning a success definition into a document that two senior stakeholders have signed changes its status from a working assumption to a commitment. The signature is not bureaucratic theater — it is protection against two very common failure modes.
The first failure mode: at day 90, the numbers don't clearly meet the target, and there is pressure to extend the pilot "because we're almost there." A signed KPI document gives the pilot owner standing to recommend a kill decision without navigating organizational politics around whether the target was actually the target.
The second failure mode: a new executive inherits the program and has different expectations. A signed document establishes what the program was supposed to accomplish before their involvement.
Name a single owner
One person is accountable for the pilot outcome. Not the technology committee. Not the operations team collectively. One person, named in writing, whose job description for the duration of the pilot includes running weekly stand-ups, reviewing KPI data, logging incidents, and making the kill/scale recommendation at day 90.
Without a single owner, nobody does the unglamorous mid-pilot work: the daily incident log, the week 6 KPI review that surfaces a negative trend, the vendor call to escalate when performance degrades. By day 90 you have impressions but not data.
The 90-Day Pilot Timeline
Days 1–14: Environment preparation and vendor onboarding
- Complete facility mapping (if the vendor requires it)
- Install and test charging infrastructure
- Validate Wi-Fi coverage in the deployment zone — dead zones surface here, not in week 6
- Mandatory staff demo sessions for all workers in the deployment zone
- Confirm incident reporting protocol: who logs what, where, and how quickly
Do not rush through this phase to get the robot running faster. Infrastructure gaps and staff resistance that surface in week 6 almost always trace back to skipped steps during setup.
Days 15–30: Supervised operation (teleop heavy)
The first two weeks of robot operation will involve significant vendor support — either on-site engineers or high-ratio remote teleop. This is normal and expected. The robot is being environment-mapped and task-trained on your specific facility.
Track the teleop ratio from day one — it is your primary indicator of autonomy progress. Log total robot operating hours and total teleop intervention hours, capturing both the duration of each teleop session and the number of interventions. Calculate the ratio weekly.
A healthy trajectory: a week-2 teleop ratio of 1:3 (one teleop hour per three robot operating hours) is acceptable; by week 8 it should be tracking toward 1:8 or better for a well-scoped task in a structured environment.
Red flag: If teleop ratio is not improving week-over-week by week 4, surface this immediately with the vendor. Either the task scope is outside the robot's current training envelope, or the environment has characteristics that aren't being addressed.
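The weekly ratio check above can be reduced to a few lines. This is a sketch with made-up log values, not a vendor API — the assumption is simply that you record robot operating hours and teleop hours per week:

```python
def teleop_ratio(robot_hours, teleop_hours):
    """Robot operating hours per teleop hour; 3.0 means a 1:3 ratio."""
    return robot_hours / teleop_hours if teleop_hours else float("inf")

# Illustrative weekly log: (week number, robot operating hours, teleop hours).
weekly_log = [(2, 90, 30), (3, 95, 28), (4, 100, 34)]

ratios = [(week, teleop_ratio(r, t)) for week, r, t in weekly_log]
for (w_prev, prev), (w_cur, cur) in zip(ratios, ratios[1:]):
    if cur <= prev:  # ratio should rise week-over-week as autonomy improves
        print(f"RED FLAG: week {w_cur} ratio {cur:.2f} did not improve on "
              f"week {w_prev} ({prev:.2f}) — escalate with the vendor")
```

In the illustrative log, week 4 regresses (100 robot hours but 34 teleop hours), which is exactly the pattern that should trigger the escalation conversation.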
Days 31–60: Unsupervised operation target
By day 30, the pilot should be targeting operation without vendor personnel physically on site. Remote monitoring and on-call support are fine; having a vendor engineer on the floor every day is not a sustainable production configuration and masks the true operational performance of the robot.
This phase is where the real data is generated. KPIs measured with vendor support on site are artificially inflated — the engineer resolves exceptions before they become logged incidents. Measure everything from day 30 onward against the baseline, with vendor presence clearly logged.
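One way to keep vendor presence from silently inflating the numbers is to tag every measured shift with a vendor-on-site flag and report the two populations separately. A sketch, with invented shift records:

```python
from statistics import mean

# Illustrative per-shift records: (totes moved, vendor engineer on site?).
shifts = [
    (430, True), (445, True), (390, False), (402, False),
    (438, True), (385, False), (410, False), (441, True),
]

with_vendor = [t for t, on_site in shifts if on_site]
without_vendor = [t for t, on_site in shifts if not on_site]

# The unsupervised figure is the one that predicts production performance.
print(f"with vendor on site:    {mean(with_vendor):.1f} totes/shift "
      f"({len(with_vendor)} shifts)")
print(f"without vendor on site: {mean(without_vendor):.1f} totes/shift "
      f"({len(without_vendor)} shifts)")
```

If the two numbers diverge materially, the day-90 recommendation should be built on the unsupervised population only.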
Run the week 4 and week 8 staff surveys. The data from the 2024 Washington State University study on hospitality robots applies broadly: workers with direct robot experience can develop elevated turnover intention even when the robot is performing well, particularly if task displacement anxiety is not addressed directly. Catching this at week 4 gives you time to intervene with targeted communication rather than discovering it as a retention problem in month 4.
Days 61–90: Final measurement and decision preparation
The final 30 days should look like normal production operations with a running measurement protocol. No scope changes, no task additions, no changes to the measurement methodology.
At day 75, the pilot owner should draft the initial decision recommendation. This forces a preliminary synthesis of the data while there is still time to address any measurement gaps before the formal day-90 review. Common gap: realizing at day 88 that the baseline measurement period was inconsistently collected and the comparison is not clean.
At day 90, the pilot owner presents a recommendation to GM and Finance: scale to production, extend with a defined new target and timeline, or kill. The recommendation must be driven by the KPIs established in the pre-pilot agreement — not by impressions, vendor advocacy, or the sunk cost of the pilot investment.
The Kill Criteria (Write These Before the Pilot Starts)
The kill criteria are the specific outcomes at day 90 that trigger a decision to discontinue — not extend.
A practical kill criteria template:
"If [KPI 1] has not improved by at least [threshold] compared to the baseline, OR if [KPI 2] falls below [minimum level] at any point during the measurement period, the program does not advance to Phase 2 regardless of other factors."
The specific thresholds should be calibrated to your unit economics — what level of performance improvement justifies the ongoing cost of the robot deployment? Work this backward from your TCO model. If the robot needs to displace X hours of labor to justify the annual cost, what throughput does it need to achieve to hit that displacement?
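The backward calculation might look like this. Every number is a placeholder — the structure (annual cost, loaded labor rate, labor-equivalent throughput) is the point, not the values:

```python
# Assumed annual cost of the deployment: RaaS fee plus amortized
# integration and support. Placeholder figure, not vendor pricing.
annual_robot_cost = 120_000
fully_loaded_labor_rate = 38.0   # $/hour, placeholder

# Labor hours the robot must displace per year just to break even.
breakeven_hours = annual_robot_cost / fully_loaded_labor_rate

# Translate hours into a throughput target: if one worker moves
# 40 totes per hour on this task, the robot must sustain at least...
totes_per_labor_hour = 40
shifts_per_year = 250            # one shift/day, five days/week

required_totes_per_shift = breakeven_hours * totes_per_labor_hour / shifts_per_year
print(f"break-even displacement: {breakeven_hours:.0f} labor hours/year")
print(f"required throughput:     {required_totes_per_shift:.0f} totes/shift")
```

The output of that target becomes the threshold in the kill-criteria template: if measured throughput at day 90 does not clear it with margin, the economics do not close, whatever the demo footage looks like.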
The hardest discipline in humanoid pilots is pulling the plug when the numbers don't work. There is always a reason to extend. "The environment mapping just improved." "We had an unusual season." "The vendor pushed a major autonomy update that should kick in next month." Vendors are structurally incentivized to advocate for extension — an extension is a delayed kill, and a delayed kill is another 30 days of demonstration data.
A kill decision with clean data is valuable. It tells you: this platform, this use case, this facility — not right now. That information informs your next procurement decision. An indefinitely extended pilot that never produces a decision is a sunk cost machine.
Exit Options: What They Look Like in Practice
Structure three exit paths into the pilot contract before signing:
Scale to production: The vendor transitions from pilot pricing to production pricing, the task scope expands to full operational deployment, and the robot is integrated into the WMS/MES as a permanent operational asset. Confirm the commercial terms for this transition before the pilot starts — discovering at day 90 that production pricing is 40% higher than pilot pricing is a common surprise.
Defined extension: A 30-day extension with a specific new KPI target and a clear understanding that the extension results in either a production decision or a kill. An extension with no new target is not an extension — it is a delay. Some pilot contracts allow only one extension; this is worth negotiating into the original agreement.
Kill with wind-down: The equipment is returned, the environment mapping and task training data are either retained by the customer or deleted per the contract, and the staff configuration reverts to pre-pilot. Confirm the data retention and deletion terms before the pilot. The autonomy training data generated in your environment has value — in some contracts, vendors retain exclusive rights to all training data generated on their hardware, which means you cannot use that data to train an alternative platform.
Common Failure Modes in Humanoid Pilots
Scope creep: The pilot starts on one task and expands to three tasks by week 6. Now you can't measure any of them cleanly. Hold the scope.
The unmeasured baseline: The robot arrives before the baseline measurement period completes. You have no comparison point. Enforce the two-week baseline window even if the vendor is pushing to deploy faster.
Vendor presence masking performance: The vendor engineer is on site 4 days a week resolving exceptions. Day 90 data looks great. Day 91, when they've left, reveals that 40% of the robot's operating hours required human intervention that wasn't being logged as teleop.
Missing the staff signal: Week 8 staff surveys surface significant anxiety, but the pilot owner is focused on throughput numbers and doesn't escalate. By day 90, three experienced workers have quietly transferred away from the deployment zone.
No single owner: The pilot is owned "by the team." Nobody does the daily incident log. Nobody escalates the week 4 anomaly. Day 90 arrives with impressions and no data.
The next article covers vendor selection: the 10 questions you must ask before signing any humanoid deployment contract, and what the answers reveal about vendor maturity.


