Running an AGV pilot: what should pass and what should fail it
A pilot without written pass/fail criteria is a sales exercise, not an evaluation. Here is the framework.

The majority of AGV pilots are structured to succeed. The vendor's demo team is on-site for commissioning. Routes are pre-optimized. Traffic is managed. The pilot period is short enough that the low-frequency failure modes — battery capacity degradation, floor-wear-related sensor drift, maintenance staff turnover — haven't had time to manifest.
This is not malicious. It is the natural result of a vendor deploying their best team under close observation during a defined evaluation window. The problem is that a pilot designed to succeed tells you very little about how the system will perform in year 3, when the commissioning team has left, two of the original maintenance technicians have moved on, and the floor has been repainted twice.
A properly structured AGV pilot is designed to expose failure modes, not to suppress them. It tests the system under real operating conditions, creates a written record of exceptions, and produces a go/no-go decision tied to pre-agreed criteria rather than gut feeling at the end of a 90-day window.
Before the pilot starts: the pre-conditions that must be in writing
1. Pilot scope document, signed by both parties
The pilot scope document defines what is and is not being evaluated. It should specify:
- Vehicle configuration: which vehicles, in what configuration, on which routes
- Route definition: the exact guide path, stations, and traffic zones included in the pilot
- Operating hours: which shifts, how many days per week, over what period
- Integration scope: which WMS/WCS touchpoints are live for the pilot vs mocked
- Staff support: which vendor resources will be present during the pilot and for how long
- Exclusions: what conditions are out of scope (e.g., "vehicle performance in cold storage zones not included in Phase 1 pilot")
The exclusions section is as important as the inclusions. Vendors will naturally scope pilots to their system's strengths. If the excluded zone is the cold-storage area that makes up 30% of your facility and carries your most throughput-critical flows, that is not a neutral exclusion. It is a risk deferral that you are accepting.
2. Baseline measurement
Two weeks of baseline measurement before the first AGV arrives. You need before-and-after data for every KPI the pilot is supposed to improve. Common AGV pilot KPIs that require baseline:
- Pallet moves per shift (measured by WMS task completion data, not manual count)
- Labor hours per 1,000 pallet moves (warehouse management labor reports)
- Median and P95 cycle time for the route being replaced
- Incident rate on the route (forklift near-misses, load drops, floor damage events)
- Unplanned downtime on adjacent processes caused by transport delays
Without baseline data, the pilot produces absolute numbers but no comparison. "The AGV fleet moved 420 pallets in the Tuesday day shift" means nothing unless you know the manual baseline was 380 pallets. Or was it 450? If you didn't measure before, you'll never know.
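The baseline cycle-time KPIs above can be computed directly from WMS task timestamps. A minimal sketch, assuming the export is a list of records with `start` and `end` datetime fields (the field names are assumptions; adapt them to your WMS export format):

```python
from statistics import median, quantiles

def cycle_time_stats(task_log):
    """Compute median and P95 cycle time (minutes) from WMS task records.

    task_log: list of dicts with assumed 'start' and 'end' datetime
    fields, one record per completed move.
    """
    durations = [(t["end"] - t["start"]).total_seconds() / 60 for t in task_log]
    # quantiles(n=20) returns 19 cut points at 5%, 10%, ..., 95%;
    # index 18 is the 95th-percentile cut point
    return median(durations), quantiles(durations, n=20)[18]
```

Run this over the two-week baseline window and again over the pilot measurement window so the comparison uses identical definitions on both sides.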
3. Written pass/fail criteria
This is the document that most pilots lack and that most pilot failures trace back to. Pass/fail criteria must be:
- Quantitative. "Throughput improved" is not a criterion. "P50 cycle time on Route A ≤ 4:30 min/move, measured over 10 consecutive operating days" is a criterion.
- Signed by decision-makers. The plant manager, the logistics director, and finance or operations leadership must sign the criteria document before the pilot starts. This prevents post-pilot scope creep ("the numbers look good but let's extend another 30 days to see...").
- Inclusive of failure modes, not just success metrics. A criterion that says "the system must achieve 98% availability" implies that if it achieves 97% availability, you have grounds to exit the contract. Make this explicit: "Availability < 96% over any rolling 14-day period is a contract failure event."
- Time-bounded. The pilot has a defined end date. At that date, the pass/fail assessment is performed and a go/no-go decision is made within 5 business days. No automatic extensions.
The six performance criteria that should be in every AGV pilot
Criterion 1: Availability
Definition: Percentage of scheduled operating hours during which the AGV fleet is in active service, excluding planned maintenance windows.
How to measure: (Operating hours − unplanned downtime hours) / Operating hours, measured daily, reported weekly, assessed over the full pilot period.
Minimum threshold for a pass:
| Application type | Minimum availability |
|---|---|
| Non-critical intralogistics (material replenishment) | 94% |
| Production-support (line-side delivery) | 97% |
| Takt-critical (BIW transfer, JIT pull) | 98.5% |
Common failure modes to watch for: sensor drift from floor contamination, WiFi roaming failures causing task dropouts, battery charge failures due to incorrect charge programming, traffic deadlocks from flow management errors.
Distinguish between vehicle downtime (one unit is unavailable) and fleet downtime (the entire system is unavailable). A pilot with six vehicles where two are simultaneously down 15% of the time is delivering only 67% of planned capacity during those periods, even though no single unit may trigger the availability threshold.
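Both numbers are one-line calculations, shown here as a sketch so the two metrics stay explicitly separate in the weekly report:

```python
def availability(operating_hours, unplanned_downtime_hours):
    """Fleet availability per the pilot definition:
    (operating hours - unplanned downtime) / operating hours."""
    return (operating_hours - unplanned_downtime_hours) / operating_hours

def effective_capacity(fleet_size, avg_vehicles_down):
    """Fraction of planned fleet capacity actually delivered while
    avg_vehicles_down units are out of service simultaneously."""
    return (fleet_size - avg_vehicles_down) / fleet_size
```

For example, `effective_capacity(6, 2)` is roughly 0.67: the two-vehicles-down condition from the paragraph above, which the single-unit availability metric alone would not surface.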
Criterion 2: Cycle time P95
Definition: The 95th percentile cycle time for the primary route, measured over 10 consecutive operating days after commissioning stabilization (typically days 14–24 of a 90-day pilot).
Why P95 matters: Median cycle time tells you what happens most of the time. P95 tells you what happens during operational stress — when the route is busiest, when interfering traffic is at peak, when the system is managing multiple simultaneous exception conditions.
Target: Specify the P95 target at the RFP stage, before commissioning. If the vendor cannot commit to a P95 SLA in the contract, treat this as a red flag (see article 6 in this series).
A common rule of thumb for laser-guided AGVs on clean, dedicated routes: P95 should be ≤ 1.15× the quoted median. If the vendor quotes a 4-minute median and the pilot produces a P95 of 6.5 minutes, the system is not performing to specification even if the median looks fine.
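The rule of thumb reduces to a single comparison; a sketch, with the 1.15 ratio as the assumed default:

```python
def p95_within_spec(median_minutes, p95_minutes, max_ratio=1.15):
    """Rule-of-thumb check for laser-guided AGVs on clean, dedicated
    routes: P95 cycle time should not exceed max_ratio x the median."""
    return p95_minutes <= max_ratio * median_minutes
```

With the numbers from the example above, `p95_within_spec(4.0, 6.5)` returns `False`: a 6.5-minute P95 against a 4-minute median is a 1.6x ratio, well outside specification.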
Criterion 3: Throughput rate
Definition: Pallet moves (or equivalent task units) completed per shift, averaged over the measurement period.
Measurement: Use WMS task completion timestamps, not the AGV FMS's own reporting. The WMS timestamp captures when the task was completed in the management system, which is the operationally relevant metric. The AGV FMS may report "move completed" on vehicle return without accounting for upstream or downstream delays.
Pass threshold: This should be derived from your production requirement, not negotiated with the vendor. If line-side delivery requires 45 moves per shift to maintain production cadence, the pilot threshold is 45 moves per shift minimum, 48 moves per shift at full specification.
Watch for gaming: pilots that are scoped to run during off-peak periods, or on the most favorable route segments, will produce throughput numbers that do not extrapolate to full-operation conditions. Ensure the pilot measurement window includes at least 5 peak-production-day measurements.
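Both checks, the throughput threshold and the peak-day representativeness requirement, can be combined in one pass over the measurement window. A sketch, assuming moves-per-shift counts derived from WMS timestamps and a parallel list of peak-day flags:

```python
def throughput_passes(shift_moves, peak_flags, threshold=45, min_peak_days=5):
    """Pilot throughput check.

    shift_moves: moves-per-shift counts from WMS completion timestamps
    peak_flags:  parallel list of booleans marking peak-production days
    threshold:   minimum average moves per shift (from the production
                 requirement, not vendor negotiation)
    """
    if sum(peak_flags) < min_peak_days:
        return False  # window not representative: too few peak days measured
    return sum(shift_moves) / len(shift_moves) >= threshold
```

Note that a measurement window with too few peak days fails outright, regardless of the throughput numbers, which is the anti-gaming property described above.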
Criterion 4: Fault classification and resolution time
Definition: Log every fault event during the pilot, classified by severity:
- Class A (critical): system-level fault requiring human intervention to resume operation. Route blocked, entire fleet stopped, safety system triggered.
- Class B (unit fault): single vehicle fault requiring maintenance attention, reducing fleet capacity.
- Class C (transient): self-recovering navigation exception (obstacle detection and stop, automatic retry within 90 seconds).
Pass thresholds:
| Fault class | Maximum acceptable rate |
|---|---|
| Class A | < 1 per week after day 14 of pilot |
| Class B | < 3 per vehicle per week after day 14 |
| Class C | < 8 per vehicle per shift |
Class C faults are expected and indicate normal obstacle encounter behavior. A very high Class C rate (> 15 per vehicle per shift) may indicate route design problems — excessive cross-traffic, narrow corridors, or obstacle patterns the system wasn't designed for.
Class A faults are the most expensive. Every Class A event requires human intervention and creates a period of zero throughput. Document the cause of every Class A event during the pilot. If the same cause produces Class A events more than twice, it is a systemic issue, not a one-off.
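The "same cause more than twice" rule is easy to automate over the pilot's exception register. A minimal sketch, assuming the log is kept as (fault class, cause) pairs (the representation is an assumption; any structured fault log works):

```python
from collections import Counter

def systemic_class_a_causes(fault_log):
    """Return causes that produced Class A events more than twice,
    i.e. systemic issues rather than one-offs.

    fault_log: iterable of (fault_class, cause) tuples.
    """
    counts = Counter(cause for cls, cause in fault_log if cls == "A")
    return [cause for cause, n in counts.items() if n > 2]
```

Run this weekly during the pilot review so a developing pattern is flagged in week 3, not discovered in the day-90 assessment.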
Criterion 5: Maintenance response time
Definition: Time from fault notification to system restored to service, for Class B faults.
Test: During the pilot, deliberately induce at least three maintenance scenarios (in coordination with the vendor) on non-critical operating days. Document the elapsed time from notification to restoration.
Pass threshold: The actual response time achieved during the pilot must be ≤ the SLA specified in the maintenance contract. If the SLA says 4-hour on-site response and the pilot shows 7-hour average response (because the nearest technician is actually 3 hours away), the SLA is not achievable under normal conditions.
This test sounds adversarial. It is not — it is protecting both parties from a contract whose service terms cannot be fulfilled.
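The assessment over the induced scenarios is a straightforward comparison against the contractual SLA. A sketch, assuming each scenario is logged as a (notified, restored) datetime pair:

```python
def response_times_within_sla(events, sla_hours):
    """Check induced maintenance scenarios against the contract SLA.

    events: list of (notified, restored) datetime pairs.
    Returns (all_within_sla, average_response_hours).
    """
    elapsed = [(restored - notified).total_seconds() / 3600
               for notified, restored in events]
    return max(elapsed) <= sla_hours, sum(elapsed) / len(elapsed)
```

Report both the worst case and the average: a contract SLA is a per-event commitment, so a single 7-hour response against a 4-hour SLA is a failure even if the average looks acceptable.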
Criterion 6: Integration stability
Definition: Zero data integrity failures between the AGV FMS and the WMS over the pilot measurement period. Data integrity failure = task assignment that was confirmed by the WMS but not executed by the AGV, or execution confirmed by the AGV but not reflected in the WMS.
Why this matters: Integration failures are invisible in operational metrics until they cause inventory discrepancies or missed production schedules. A pilot that runs clean on throughput and availability but has 2–3 integration dropouts per week is producing unreliable inventory data. This is not a minor issue — it is a data quality problem that will worsen as transaction volume scales.
Measurement: Cross-reference WMS task completion logs against AGV FMS logs daily. Any discrepancy is a failure event.
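The daily cross-reference is a set comparison once both logs are reduced to completed task IDs. A sketch, assuming each side's export can be distilled to a set of task identifiers for the day (the ID field name will depend on your WMS and FMS exports):

```python
def integration_discrepancies(wms_completed, fms_completed):
    """Daily WMS/FMS cross-check.

    Both arguments are sets of task IDs confirmed completed that day.
    Any ID present on one side but not the other is a data integrity
    failure event per the pilot definition.
    """
    return {
        "confirmed_by_wms_only": wms_completed - fms_completed,
        "confirmed_by_fms_only": fms_completed - wms_completed,
    }
```

An empty result on both keys, every day of the measurement period, is the pass condition: the criterion is zero failures, not a low rate.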
Organizational conditions that make pilots work
Technical criteria matter, but they only produce a valid result if the organizational conditions support honest measurement.
Assign one internal pilot owner. This person is responsible for logging faults, compiling metrics, running the weekly pilot review, and making the go/no-go recommendation. If no one owns the pilot, no one is watching the fault log in week 6 when the pattern is developing.
Keep the vendor's commissioning team off the monitoring system. During normal pilot operation (days 14+), the vendor's team should not have direct access to the fault log or the metric compilation. You are measuring the vendor's system, not the vendor's monitoring of their system. This is not about distrust — it is about eliminating confirmation bias in the reporting.
Run the pilot for at least 10 weeks. Short pilots miss the failure modes that emerge after the commissioning team has left and the plant maintenance staff are responsible for the system. Battery capacity starts showing measurable decline at 150–300 cycles. Floor surface wear at high-traffic points becomes visible at 6–8 weeks. Software issues that weren't triggered in commissioning emerge when the WMS sends edge-case task sequences at 8 weeks.
Define the kill decision in advance. If a Class A fault type recurs more than three times in two weeks, what happens? If availability drops below threshold for two consecutive weeks, what happens? These should not be decisions made in the room at the time — they should be decisions committed to in the pass/fail criteria document signed before the pilot started.
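Both kill conditions named above can be encoded so the weekly review runs them mechanically. A sketch with example thresholds; the actual values must come from your signed pass/fail criteria document, not from this code:

```python
def kill_triggered(weekly_availability, threshold=0.97,
                   class_a_counts_last_two_weeks=None, max_recurrence=3):
    """Evaluate two example kill criteria.

    weekly_availability: availability per week, oldest first.
    class_a_counts_last_two_weeks: dict mapping Class A fault type
    to its count over the trailing two weeks.
    """
    # Kill condition 1: two consecutive weeks below the availability threshold
    for a, b in zip(weekly_availability, weekly_availability[1:]):
        if a < threshold and b < threshold:
            return True
    # Kill condition 2: one Class A fault type recurring more than
    # max_recurrence times in two weeks
    if class_a_counts_last_two_weeks:
        if max(class_a_counts_last_two_weeks.values()) > max_recurrence:
            return True
    return False
```

The value of writing the criteria this way is that the week-6 review cannot soften them: the function returns the same answer regardless of who is in the room.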
The most important decision: what happens at day 90
A well-structured pilot ends with a pre-agreed process:
- Metrics compiled against baseline, presented in a standard format
- Pass/fail determination against criteria — not a qualitative assessment
- Go/no-go decision by the signed decision-makers, within 5 business days
- If go: proceed to contract on pre-agreed terms
- If no-go: document the specific failure modes, issue revised RFP if appropriate
The most common pilot failure is not a bad result — it is no result. The 90-day window passes, the results are ambiguous, someone suggests an extension, and the pilot drifts into month 6. At that point, the organizational investment in the vendor relationship, the infrastructure already in place for the pilot, and the sunk cost of the pilot period all create pressure toward continuing regardless of the data.
A pilot with clear kill criteria, signed by leadership before it starts, is the only reliable protection against this outcome.
Read next: AGV RFP red flags: avoiding vendor lock-in


