Introduction
Field biologists know a non-obvious rule of thumb about scorpions: the smaller the pincers, the deadlier the sting. So don’t trust your eyes too easily. The obvious thing to see, the big, impressive claws, usually signals mild venom. The small, forgettable ones often mean a tail loaded with neurotoxin, and that brings Us to today’s topic: mind the danger that lives in the tails.
With industrial asset availability, the tails are just as dangerous. “90% availability” is the big thing to see: large, visible, comforting. But the poison is in the tails of the distribution: in the rare, long outages, in the small clusters of failures happening exactly when prices spike, or in the hours where you are “up” at 60% capacity while your KPI tracker counts that as “available”.
This article is an invitation for Us to mind “the tail”: that statistical place where low‑frequency, high‑impact events live and cause damage, almost invisibly. We will dissect that tail into event structure, time context, and capacity level, and put hard money on each piece so you can see it clearly, maybe for the first time. After learning about tails, you will not be able to see the “90% availability good news” without also noticing the scorpion tail of your own downtime distribution, and wondering which part of your P&L it is already stinging.
Part 1 — Why “90% Availability” Tells You Almost Nothing
A plant can be wildly out of control and still have a beautiful availability number.
The illusion of control
Let’s start by noticing a few things about averages:
- A drunk driver has an “average lane position.”
- A collapsing bridge has an “average load capacity.”
- A plant with chaos in planning, maintenance, and operations has an “average availability”..so what!
The existence of an average proves nothing about control. Only low variability and predictable patterns prove control. A single availability KPI completely hides whether you are in control or not.
If you tell me “we are at 90% availability,” you have demonstrated exactly two facts:
- You can add the up‑hours.
- You can divide by total hours.
Nothing in that ratio tells me how uptime is distributed, whether the pattern is getting tighter or wilder, or whether you can predict anything about next month.
You have provided zero evidence that the system is under control, improving or worsening. You have only shown that you own a calculator.
The numbers that make denial impossible
ABB surveyed 3,600 industrial decision‑makers worldwide.
- 44% suffer equipment‑related interruptions at least monthly.
- 14% suffer them weekly.
- At the same time, 83% say unplanned downtime costs at least $10,000 per hour, and three‑quarters put the figure as high as $500,000 per hour.
Now put those two facts next to the typical KPI dashboard:
- Line 1: “Plant availability: 90.3%.”
- Line 2 (missing): “We have unplanned stoppages every single week.”
- Line 3 (missing): “Each stoppage costs us up to $500,000 per hour.”
The availability number is not lying. It is simply not answering any of the questions that matter if you want to assess who is the boss..the plant, or you!
Why the average is structurally misleading
Availability is defined as:
A = uptime / (uptime + downtime)
It is a time‑weighted average. For that average to mean anything as a business signal, the world would have to be flat: prices flat, demand flat, contractual windows symmetric, operating stress flat. Bad news is…
In energy, mining, water, and any industrial plant worth working at, the world is not flat:
- Spot prices and netbacks move.
- Seasonal demand moves.
- Penalty clauses kick in only when you miss specific windows.
- Failure rates themselves climb with loading and stress.
Such variability implies that a 1% loss of availability in January can destroy 10× the value of a 1% loss in June. If, for example, January is the peak‑price, peak‑demand, high‑penalty season, January’s 1% simply cannot be compared with June’s 1%. Customers never buy “plant availability.” They buy megawatt‑hours, tonnes, cubic meters, milk bottles or whatever, delivered when they need them, not when your plant is “available” to provide it!
The core problem in one sentence
So let’s close the first idea. An availability number silently assumes that time is all the same and that the business variabilities mentioned above, very visible to all of Us (I know… I know… not to all of Us), simply don’t matter.
- It ignores the when. Whether the plant is down or up gets decoupled from how badly we need it.
- It ignores how downtime clusters.
- It ignores what is happening in the market when failures occur.
That is why “we have 90% availability” is not a reassurance; its meaning is limited. Over-reliance on such a number is a red flag that the organization might be mistaking arithmetic for control.
Part 2 — What “90% Availability” Actually Hides
Ok, that was my first attempt in this article at “killing” the comfort blanket of availability percentages. Now let’s go under the number and show exactly what it throws away:
- event structure,
- time context, and
- capacity level.
2.1 The textbook definition, and its blind spot
In the simplest repairable-component model, the asset flips between two states: up and down.
- Failure rate λ: transitions per unit time from up to down.
- Repair rate μ: transitions per unit time from down to up.
Classic Markov analysis gives the long‑run probabilities:
A = μ / (λ + μ), U = λ / (λ + μ)
Availability is just the fraction of time spent in the up state over an infinite horizon. That is all it is.
From the same model you also get:
- Frequency of down events (how often you go down):
f_down = A · λ = λμ / (λ + μ)
- Mean duration of each down event:
d_down = U / f_down = 1 / μ
Our traditional availability calculation keeps A and discards f_down and d_down.
It keeps how much time you are down and throws away how that downtime is structured.
That is the first fatal collapse: the event structure.
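If you like to see the arithmetic run, here is a minimal Python sketch of those three quantities; the λ and μ values (an MTBF of roughly 730 hours, an MTTR of 24 hours) are assumptions for illustration only:

```python
# Minimal sketch: the three quantities the two-state Markov model gives you,
# of which the scalar availability KPI keeps only the first.
lam = 1 / 730   # failure rate, per hour (MTBF ~ 730 h) -- assumed value
mu = 1 / 24     # repair rate, per hour (MTTR ~ 24 h)   -- assumed value

A = mu / (lam + mu)     # long-run availability
f_down = A * lam        # down events per hour, equal to lam*mu/(lam+mu)
d_down = 1 / mu         # mean duration of each down event, in hours

print(f"A = {A:.1%}")                                  # ~96.8%
print(f"down events per year = {f_down * 8760:.1f}")   # ~11.6
print(f"mean outage duration = {d_down:.0f} h")        # 24 h
```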
2.2 Same availability, different universe
Now the identity that should be tattooed on every KPI dashboard: take two systems.
- System 1: failure rate λ, repair rate μ
- System 2: failure rate 2λ, repair rate 2μ
Compute availability for each:
A₁ = μ / (λ + μ), A₂ = 2μ / (2λ + 2μ)
So, from an availability perspective, both systems are exactly equal: A₂ = 2μ / (2λ + 2μ) = μ / (λ + μ) = A₁.
But operationally, they are not even in the same universe!
- System 2 has twice as many failures per year.
- Each failure in System 2 is half as long.
The same availability represents completely different worlds:
- Different maintenance workload.
- Different spares and staffing profile.
- Different exposure to contractual windows.
- Different impact on buffers and scheduling.
Availability cannot tell you whether you live in the “rare, long outages” world or the “frequent, short outages” world. Mathematically, it erases that dimension.
That is not a philosophical quibble. It is the difference between “a few outages you rebuild around” and “a pattern that will eventually blow up your contracts” while completely exhausting your organization’s logistics and personnel.
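To feel the difference rather than just read it, here is a minimal simulation sketch: exponential up and down times, with assumed MTBF/MTTR values, one system failing twice as often but repaired twice as fast as the other.

```python
import numpy as np

def simulate(mtbf_h, mttr_h, horizon_h=20 * 8760, seed=0):
    """Alternate exponential up/down periods; collect what availability discards."""
    rng = np.random.default_rng(seed)
    t, up_time, downs = 0.0, 0.0, []
    while t < horizon_h:
        up = rng.exponential(mtbf_h)              # time to next failure
        up_time += min(up, horizon_h - t)
        t += up
        if t >= horizon_h:
            break
        down = rng.exponential(mttr_h)            # time to repair
        downs.append(min(down, horizon_h - t))
        t += down
    downs = np.array(downs)
    return {"availability": up_time / horizon_h,
            "events_per_year": len(downs) / (horizon_h / 8760),
            "mean_down_h": float(downs.mean())}

print(simulate(730, 24))    # System 1: MTBF ~730 h, MTTR ~24 h
print(simulate(365, 12))    # System 2: fails twice as often, repaired twice as fast
# Both land near 96.8% availability; events/year and outage length differ ~2x.
```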
2.3 The second collapse: time is treated as if it were all the same
Availability is a time‑weighted average:
A = uptime / (uptime + downtime)
It treats every hour of the year as identical.
For this number to be a reasonable proxy for economic performance, the world would have to be:
- Flat price, flat demand.
- Flat penalty structure.
- Flat stress (failure rate independent of load).
Power‑system reliability theory has matured far beyond simple “availability” precisely because this KPI, on its own, is not enough to support decision-making. LOLE/LOEE and frequency–duration analysis were built precisely to measure how often, and for how long, capacity shortfalls overlap high‑demand, high‑value periods.
In real assets:
- Failure rates climb under high load and thermal stress.
- Prices and contribution margins move with season and market.
- Penalty clauses trigger only in specific windows.
Availability silently assumes that an hour in January at peak price is equivalent to an hour in June at trough price. It is not. A 1% availability loss in January can be worth 10× a 1% loss in June if January carries peak price, peak demand, and penalty exposure.
That is the second collapse: time context.
2.4 The third collapse: "availability of what capacity?"
So far we have pretended the plant is either 0% or 100%. That is the next lie.
Real units have derated states: 0% (forced outage), 50%, 70%, 100%.
If you want to honor a contractual commitment, reliability analysis should not just ask “is the unit up?” It should ask “how much capacity is actually there for Us to deliver what we promised?”.
Let’s make a naïve example. Suppose a 100 MW generating unit with these long‑run state probabilities:
- 5% at 0 MW
- 10% at 50 MW
- 85% at 100 MW
Then:
- Binary availability (“not completely dead”):
A_binary = 1 − 0.05 = 95%
- Expected power capacity delivery:
E[C] = 0.10 × 50 + 0.85 × 100 = 90 MW
It should be very clear that, from a contractual perspective, and for a buyer’s perspective, there is a huge difference between:
- “95% of the time I get 100 MW,” and
- “95% of the time I get somewhere between 50 and 100 MW, average 90 MW.”
Standard availability reporting hides this. It collapses “barely limping” and “fully healthy” into the same “up” state.
Formally, if C is available capacity and C_rated is nameplate, what you really care about is:
A_α = P( C / C_rated ≥ α )
for thresholds like:
- A_0.99: availability of ≥99% capacity.
- A_0.95: availability of ≥95% capacity.
- A_0.70: availability of ≥70% capacity.
For the same plant, these can be radically different.
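A minimal sketch of that calculation for the naïve 100 MW example above; the state probabilities are the ones just quoted, and the α thresholds are illustrative:

```python
# Capacity-band availability A_alpha and expected capacity E[C]
# for the 100 MW example above.
states = {0.0: 0.05, 50.0: 0.10, 100.0: 0.85}   # available MW -> long-run probability
rated = 100.0

expected_mw = sum(c * p for c, p in states.items())     # E[C] = 90 MW
binary_avail = 1 - states[0.0]                          # "not completely dead" = 95%

def a_alpha(alpha):
    """P(C / C_rated >= alpha): availability of at least alpha * nameplate."""
    return sum(p for c, p in states.items() if c / rated >= alpha)

print(f"E[C] = {expected_mw:.0f} MW, binary availability = {binary_avail:.0%}")
for alpha in (0.99, 0.95, 0.70, 0.50):
    print(f"A_{alpha:.2f} = {a_alpha(alpha):.0%}")   # 85%, 85%, 85%, 95%
```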
That is the third collapse: capacity level. Availability does not answer “availability of what?”, and it just creates an ambiguous zone for Us to feel good about something we haven’t truly achieved.
2.5 What the full reliability picture looks like (minimal version)
From the same Markov and frequency-duration framework, for each plant you can cheaply compute:
- U_plant: long‑run fraction of time below minimum acceptable capacity.
- f_down,plant: how many plant‑down events per year.
- d_down,plant: mean duration per plant‑down event.
- d_down,P95: 95th‑percentile outage duration (tail risk).
- A_α: availability of ≥α·nameplate (capacity bands).
Yes… you can calculate it! And your management team can learn to read it in less time than it takes to check a TikTok video!
You can also compute how often plant‑down or derated states coincide with high‑load / high‑price periods by combining the capacity model with a load/price model, exactly as LOLE/LOEE does for generation systems.
That gives you a vector like:
“We are ≥95% of capacity for 88% of the year, in 9 events, with P95 duration 11 hours, and only 0.8 events/year overlap our top 10% margin hours.”
This more complete type of measurement lets Us talk about something that has a clear, almost “mechanical” relationship to cash and risk.
Compare that to:
“We are at 90% availability.”
Both can be derived from the same underlying model. One is almost pure theater. The other is engineering.
2.6 The scalar KPI of Availability is structurally unfit for capital decisions
When you roll everything into “90% availability,” you have:
- Collapsed event structure – You lost how often you go down and how long each outage lasts.
- Collapsed time context – You lost when those outages land relative to demand and price.
- Collapsed capacity level – You lost whether “up” means 100%, 95%, or 60% of nameplate.
So, if you ever felt the availability KPI was empty, it is because it IS empty! When you look at it with a finance or operations brain, it is a ratio of expectations that throws away exactly the dimensions where value is created and destroyed.
But my mission is not “killing availability.” Used as a derived index inside a proper reliability model, availability is fine. Used as a headline KPI for “are we in control / improving?”, it is almost the wrong concept.
Part 3 — The Price of Each Collapse
Parts 1 and 2 established (I hope…) that availability is structurally blind because it erases the pattern of failures, ignores when they occur, and collapses partial output into a binary state. This section shows what that hiding costs, expressed in dollars, tonnes, and megawatt‑hours. I use three worked examples, one for each collapse. Every number is checkable. Every scenario is one that experienced operations professionals will recognize immediately… (I hope, also!).
3.1 Case 1: The Event‑Structure Collapse. Two Plants at 90%, Twenty Million Dollars Apart
$20.8 million per year. That is the revenue gap between two mineral processing plants, identical in size, that report the same 90% availability.
Plant A and Plant B each operate a five‑stage series system: crushing, grinding, flotation, thickening, and filtration. Each stage is rated at 500 tonnes per hour.
Plant A experiences 36 failure events per year, each averaging 24 hours. Total downtime: 864 hours. Availability: 90.1%.
Failure Profile A: Frequent but short; bearing seizures, instrument faults. The kind of events maintenance crews handle between shift changes.
Plant B experiences 4 failure events per year, each averaging 219 hours — roughly nine days. Total downtime: 876 hours. Availability: 90.0%.
Failure Profile B: Rare but catastrophic: a mill gearbox failure, a flotation cell structural collapse, a months‑in‑the‑making corrosion breach.
The availability metric sees these two plants as identical, BUT the bank account does not!
Why the dollars diverge. The mechanism is buffer interaction in series systems. Intermediate stockpiles like:
- surge bins between crushing and grinding,
- slurry tanks between flotation and thickening
exist to decouple stages.
A typical inter‑stage buffer in mineral processing holds 4–12 hours of throughput. The exact size varies by circuit, but it is finite and knowable.
| Metric | Plant A | Plant B |
|---|---|---|
| Failure events/year | 36 | 4 |
| Avg outage duration | 24 hours | 219 hours (9.1 days) |
| Total downtime | 864 hours | 876 hours |
| Availability | 90.1% | 90.0% |
| Buffer coverage | Most outages absorbed | Buffers empty by hour 8 |
| Cascade propagation | Minimal | Full system stall per event |
| Contract penalties | None triggered | Continuity clause breached |
| Annual revenue gap | | $20.8M behind Plant A |
Plant A’s outages are short enough for the buffers to absorb the shock. Downstream stages continue production while the failed stage is repaired. Lost throughput is limited mainly to the failed stage for the failed hours.
Plant B’s outages exceed buffer capacity within the first 8–12 hours. Once buffers drain, every stage downstream goes idle. The surge bin between crushing and grinding empties. The slurry tank feeding flotation empties. The entire line stops until the repair is complete and the system ramps back up. Each of Plant B’s 4 events is not just a single‑stage failure. It is a plant‑wide production halt lasting a week or more.
Additionally, Plant B’s nine‑day outages trigger contractual continuity clauses that Plant A’s 24‑hour outages never approach. The penalties alone add several million dollars per year.
So the net result is this: same availability, but a $20.8 million annual revenue gap. That is the event-structure collapse: invisible to the metric, devastating to the income statement.
Maybe I’m wrong, but a part of me believes you should want to know whether your plant is more like A or like B. If so, scalar availability is NOT going to give you a hand here.
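To make the buffer mechanism concrete, here is a deliberately crude sketch, not the model behind the $20.8M figure. It assumes a single effective buffer of 12 hours (the upper end of the 4–12 hour range above), and it ignores buffer refill between Plant A’s short stops, ramp-up losses, and the contract penalties that only Plant B’s long outages trigger; all of those make the real gap wider than this toy suggests.

```python
# Crude approximation: once the inter-stage buffer drains, every remaining
# outage hour stalls the whole line, not just the failed stage.
BUFFER_H = 12.0   # assumed effective buffer, hours of throughput

def full_line_stall_hours(events_per_year, outage_h, buffer_h=BUFFER_H):
    """Approximate plant-wide stall hours per year after buffers empty."""
    return events_per_year * max(0.0, outage_h - buffer_h)

plant_a = full_line_stall_hours(36, 24)    # frequent, short outages
plant_b = full_line_stall_hours(4, 219)    # rare, nine-day outages

print(f"Plant A: {36 * 24} h of downtime, ~{plant_a:.0f} h of full-line stall")
print(f"Plant B: {4 * 219} h of downtime, ~{plant_b:.0f} h of full-line stall")
# Nearly identical downtime (864 h vs 876 h); roughly 430 h vs 830 h of
# whole-plant stalls, before ramp-up losses and continuity penalties.
```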
3.2 Case 2: The Time‑Context Collapse. Same 91% Availability, $74.3 Million Apart
Now a power generation case: two dispatchable gas‑fired generators, each rated at 400 MW, operating in the ERCOT market in Texas.
Both report 91% availability. Both have 788 hours of downtime per year. Same number of hours. Same headline KPI.
The revenue gap between them: $74.3 million per year.
How. In merchant power markets, price is not flat. ERCOT has an energy‑only design with a $5,000/MWh price cap. Prices are bimodal: thousands of hours at $20–$40/MWh (surplus periods) and a few hundred hours at $200–$5,000/MWh (scarcity events, typically in summer peaks and winter cold snaps).
For this example, let us simplify the price distribution into:
| Price Tier | Hours/year | Price ($/MWh) | Revenue potential |
|---|---|---|---|
| Baseload | 7,972 | $35 | $111.6M |
| Shoulder | 600 | $150 | $36.0M |
| Scarcity | 188 | $1,500 | $112.8M |
| Total | 8,760 | | $260.4M |
Plant C’s 788 hours of downtime fall essentially at random across the year, which in practice means mostly in the thousands of cheap baseload hours. Plant D’s 788 hours of downtime cluster during summer peaks and winter events: 186 of its 788 hours fall during scarcity pricing, and another 200 fall during shoulder hours.
Plant C’s revenue loss: $14.2 million. Its downtime, like a random sample of hours, lands mostly in baseload.
Plant D’s revenue loss: $88.5 million. The bulk of it comes from the 186 lost hours that land in scarcity pricing, where every megawatt-hour is worth roughly 40 times a baseload one.
The gap: $74.3 million per year. Same downtime. Same availability. The only difference is when the downtime occurs.
That is the time‑context collapse. Availability treats a baseload hour and a scarcity hour as identical. The market does not. In ERCOT’s case, the value difference is more than 40:1 between the cheapest and most expensive hours.
This is not an exotic edge case. Scarcity pricing is a designed feature of energy‑only markets, and it is the mechanism by which investment in reliable capacity is supposed to be rewarded.
If your reliability metric cannot distinguish between losing those hours and losing cheap ones, it is failing at its one job.
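A minimal sketch of that “one job”, using the simplified tier prices from the table above: what a single lost hour of a 400 MW unit is worth in each tier, plus two bracketing splits of 788 lost hours. The splits are assumptions to bound the exposure, not Plant C’s or Plant D’s actual profiles.

```python
# Value of a lost hour under the simplified three-tier price curve above.
CAPACITY_MW = 400
TIERS = {"baseload": 35.0, "shoulder": 150.0, "scarcity": 1500.0}   # $/MWh

for tier, price in TIERS.items():
    print(f"one lost {tier} hour: ${CAPACITY_MW * price:>9,.0f}")

# Bracketing cases for 788 lost hours:
all_cheap = 788 * CAPACITY_MW * TIERS["baseload"]
worst_mix = (188 * TIERS["scarcity"] + 600 * TIERS["baseload"]) * CAPACITY_MW
print(f"788 h, all baseload:             ${all_cheap / 1e6:.1f}M")
print(f"188 h scarcity + 600 h baseload: ${worst_mix / 1e6:.1f}M")
# The same 788 hours can be worth anything from ~$11M to well over $100M.
```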
3.3 Case 3: The Capacity‑Level Collapse. “Available” at Half Power
Plant E is a 500 MW combined‑cycle power station. Its binary availability is reported as 94%. If we read that in the binary, naïve way, we would say that for 94% of the time the turbines are spinning and the plant is contributing to the grid.
But (…) “spinning” is not the same as “delivering rated output.”
Plant E has extensive partial‑output states. Its real operating profile:
| State | Capacity | % of time | Binary status |
|---|---|---|---|
| Full output | 500 MW | 72% | “Available” |
| Gas turbine derate | 400 MW | 10% | “Available” |
| Steam turbine only | 300 MW | 6% | “Available” |
| Minimum stable load | 250 MW | 6% | “Available” |
| Full outage | 0 MW | 6% | “Unavailable” |
Binary availability says 94%. But the expected energy delivery is:
E[C] = 0.72 × 500 + 0.10 × 400 + 0.06 × 300 + 0.06 × 250 + 0.06 × 0 = 433 MW
The energy‑equivalent availability is 433 / 500 = 86.6%, not 94%. That is 7.4 percentage points hidden inside the “available” label.
| Metric | Value |
|---|---|
| Binary availability | 94% |
| Energy-equivalent availability | 86.6% |
| Hidden capacity gap | 7.4 points |
| MWh “lost” inside “available” hours | 142,040 MWh/year |
| At weighted avg. $65/MWh | $9.2M/year |
Those 142,040 MWh are not downtime. They are “available” hours where the plant was delivering 50–80% of nameplate. The availability number counts them as fully operational. The revenue statement does not.
3.4 The Compound Reality
No real plant suffers only one of the three collapses shown above at a time. All three operate simultaneously: failure events have structure, they correlate with market conditions, and many of the “up” hours are derated.
The examples above are conservative because I isolated each mechanism just to keep this article readable. In reality, the three collapses do not add. They multiply. The long outages from Case 1 hit hardest during the high‑demand periods from Case 2, because equipment is pushed hardest exactly when the market needs it most. The derated hours from Case 3 are worst during those same high‑stress hours, when inlet temperatures are highest and fouling has progressed furthest.
So a plant reporting 91% availability might be destroying $20 million through event‑structure blindness, $60 million through time‑context blindness, and $8 million through capacity‑level blindness, simultaneously, silently, and with the full endorsement of a KPI dashboard showing green, simply because… well… we are deeply in love with average percentages! I mean, there is no technical reason to keep availability as the headline number… so I’m deeply convinced it IS because of Love!!
Part 4 — The Integrated Case: Three Plants, One Number, Three Fates
Three chemical plants. Identical specifications. Same market. Same 90% availability.
The NPV gap between them: $136 million over ten years. The availability dashboard showed green for all three, every single quarter.
4.1 The Setup
I’m feeling Greek today, so let’s call the three production facilities Alpha, Beta, and Gamma. Each is rated at 2,000 tonnes per day. Each operates in a market with seasonal prices:
| Quarter | Price ($/tonne) | Context |
|---|---|---|
| Q1 | $520 | Winter heating‑season demand |
| Q2 | $380 | Shoulder |
| Q3 | $580 | Peak industrial demand |
| Q4 | $360 | Shoulder |
- Weighted average: $460/tonne.
- Maximum theoretical annual revenue at 100% utilization: $335.7 million.
- Fixed OPEX: $42 million/year.
- Variable cost: $180/tonne.
- Capital expenditure: $280 million each.
All three report 90% availability. Each has 36.5 days of downtime. A financial analyst looking at the KPI dashboard sees three identical assets. A board presentation shows three green lights. An investment case for a fourth plant uses the “validated” 90% figure to project revenue and compute NPV.
That projection would be wrong by between $76 million and $136 million, depending on which plant’s pattern it accidentally resembled.
4.2 Three Patterns of “90%”
Each plant has a distinct failure signature, invisible to the availability metric but devastatingly visible to the income statement.
Plant Alpha: “The Fragile Workhorse”
48 failure events per year, each averaging 18 hours. The plant that never stops breaking, and never stays broken for long (Yes…yes…I know …I know… we all had that “alpha girlfriend” at college). Maintenance crews are expert firefighters. The CMMS is full of work orders. The availability metric forgives all of it because the total downtime adds up to 90%.
Alpha’s real problem is chronic derating. Years of frequent thermal cycling, catalyst degradation, and instrument drift have left the plant running below nameplate for nearly half its operating hours. When “up”: full capacity only 55% of the time, 82% capacity for 25% (fouling), 70% for 15% (advanced degradation), 55% for 5% (control limitations). Effective capacity factor during uptime: 88.8%.
90% time availability × 88.8% capacity effectiveness = true throughput availability of 79.9%. More than ten points below the headline figure. The availability dashboard shows 90%. The production report shows a plant operating as though it were available only 80% of the time.
Plant Beta: “The Seasonal Victim”
8 failure events per year, each averaging 4.6 days. Fewer events, better root‑cause analysis, longer intervals between failures. When Beta runs, it runs at near‑full capacity: 99.2% effectiveness during uptime. On paper and in practice, Beta is the better‑maintained plant.
Beta’s downtime is 61% more expensive per hour than randomly distributed downtime would be. Its failure mode? (You guessed it!) Cooling system degradation under thermal stress, and heat exchanger fouling that accelerates during high‑throughput campaigns. This means 70% of its downtime concentrates in Q1 and Q3, the two premium‑price quarters. Every day of downtime in Q3 destroys revenue at $580/tonne; the same day in Q4 would cost only $360/tonne.
Additionally, Beta’s 4.6‑day outage duration overwhelms inter‑stage buffers sized for roughly one day of throughput. Each failure event generates approximately 1.5 additional days of system‑wide disruption as upstream stages block and downstream stages starve before the cascade resets. That adds 12 days of effective production loss that appears nowhere in the availability calculation.
Beta’s equivalent naive availability: 84.7%.
Plant Gamma: “The Silent Bleeder”
3 failure events per year. Planned turnaround shutdowns, each lasting about 12 days, deliberately and wisely scheduled in Q2 and Q4 shoulder seasons where revenue impact is minimized. Lowest failure count. Highest MTBF. Near‑perfect maintenance scheduling discipline. By conventional reliability metrics, the best‑managed of the three. Its maintenance team would win industry awards.
Gamma’s destruction mechanism is invisible to every metric except direct production measurement. Progressive degradation (steam turbine blade erosion, compressor fouling, heat exchanger scaling, catalyst deactivation between turnarounds) silently erodes output capacity throughout the operating cycle.
The degradation accelerates during the high‑demand periods when the plant is pushed hardest: effective capacity drops to 83.7% during Q3 (summer peak) and 89.4% during Q1 (winter premium), while holding near 97% during the lower‑value shoulder quarters.
So the “Silent Bleeder” is, in other words, a “perverse inversion”: Gamma loses the most capacity during the hours worth the most money. Its derating is seasonally anti‑correlated with price: the worst possible pattern, and entirely invisible to the availability metric.
Gamma’s equivalent naive availability: 83.0%.
4.3 The Financial Reckoning
| Dimension | Naive 90% Estimate | Plant Alpha | Plant Beta | Plant Gamma |
|---|---|---|---|---|
| Reported availability | 90.0% | 90.0% | 90.0% | 90.0% |
| Failure events/year | — | 48 | 8 | 3 |
| Avg outage duration | — | 0.8 days | 4.6 days | 12.2 days |
| Effective capacity when “up” | 100% (assumed) | 88.8% | 99.2% | 91.4% |
| True throughput availability | 90.0% | 79.9% | 89.3% | 82.3% |
| Downtime in premium quarters | Proportional | Proportional | 70% | 10% |
| Derating in premium quarters | None | Moderate | Minimal | Severe |
| Annual production (tonnes) | 657,000 | 583,088 | 627,744 | 600,742 |
| Annual revenue | $302.1M | $268.2M | $286.1M | $277.6M |
| Fixed OPEX | $42.0M | $42.0M | $42.0M | $42.0M |
| Variable OPEX | $118.3M | $105.0M | $113.0M | $108.1M |
| Maintenance CAPEX | $6.5M | $8.0M | $6.5M | $5.0M |
| Contract penalties | $0 | $0 | $1.6M | $0 |
| Annual free cash flow | $135.4M | $113.2M | $123.0M | $122.5M |
| 10‑year NPV @ 10% WACC | $551.9M | $415.6M | $475.7M | $472.6M |
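For the curious, the NPV row is plain annuity arithmetic: ten equal annual free cash flows discounted at 10%, minus the $280M capital cost. A minimal sketch:

```python
def npv_10yr(annual_fcf_musd, capex_musd=280.0, rate=0.10, years=10):
    """Net present value in $M, assuming a flat annual free cash flow."""
    annuity = sum(1 / (1 + rate) ** y for y in range(1, years + 1))
    return annual_fcf_musd * annuity - capex_musd

for name, fcf in [("Naive 90%", 135.4), ("Alpha", 113.2),
                  ("Beta", 123.0), ("Gamma", 122.5)]:
    print(f"{name:>9}: {npv_10yr(fcf):6.1f} $M NPV")
# Reproduces the $551.9M / $415.6M / $475.7M / $472.6M row to within rounding.
```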
4.4 The Match!
Here we have our three plants and one single “reported availability” of 90%. And here we have a $60 million NPV gap between the best performer (Beta, $475.7M) and the worst (Alpha, $415.6M), both wearing the same 90% badge.
Against the naive projection:
- Plant Alpha destroys $136.3 million in NPV. That is 25% of expected asset value, through chronic derating the availability metric structurally cannot see.
- Plant Beta destroys $76.2 million, about 14% of expected value, lost to seasonal outage clustering and buffer‑cascade propagation.
- Plant Gamma destroys $79.3 million, also about 14% of expected value, lost through progressive capacity erosion that concentrates in the highest‑value operating periods.
No plant reaches the naive projection. The best real outcome is $76 million below what the availability metric implied. The worst is $136 million below. But of course, the availability dashboard showed green for all three, for all ten years.
4.5 What the Board Never Saw
The most dangerous aspect of this analysis is not the magnitude of the gaps. It is that nothing in the standard KPI framework would alert anyone that the gaps exist.
Alpha would show high work‑order volume and short MTTR metrics that many organizations interpret as evidence of good maintenance responsiveness. The derating would appear, if at all, as a footnote in a monthly production variance report attributed to “operational factors.”
Beta would show low failure frequency and high MTBF metrics that would earn praise in any reliability review. The seasonal concentration of outages would be invisible unless someone manually cross‑referenced outage logs with price calendars, and that’s an analysis that standard dashboards do not perform.
Gamma would show the lowest failure count, the highest MTBF, and near‑perfect maintenance scheduling discipline. That is the textbook reliability department. The progressive capacity degradation would be buried in daily production figures that nobody correlates with hourly capacity bands, and certainly not with seasonal price curves.
In each case, the information needed to see the value destruction exists in the plant’s data systems. The production historian has the hourly output. The market desk has the price curves. The CMMS has the outage timestamps.
But the availability metric, by collapsing all of this into a single number, ensures that nobody is required to connect them, and in practice, NOBODY DOES!…
So… am I crazy to ask why on earth we still use availability? Ah! Let me guess… here comes the chorus, the chant: “Because it’s easy!”
Part 5 — Dismantling the Mask: Constructing the Measurement Vector
Parts 1 to 4 had one and only one job: provide extensive examples of how the “scalar availability” number hides the very structures that drive cash and risk. Event patterns. Timing. Capacity levels. The idea of a KPI (as far as I know) is to improve decision-making capacity. Well, my fellow engineers… I have NO CLUE how an average scalar availability serves that purpose!
Now that I’m at peace… and my mind feels calm… this part does the opposite job. I don’t want to get you in trouble anymore; I want to build a valid measurement from the ground up. The goal is not to invent “yet another KPI.” The goal is to find the minimum set of numbers that:
- Preserve event structure.
- Preserve time context.
- Preserve capacity level.
- Translate directly into money and risk.
Not one component more. Not one component less.
You will see five components. Each exists only because it blocks one of the collapses you have already seen in action. Every one of them has a clean mechanical link to dollars.
5.1 The Design Principle: What Must the Replacement Avoid Destroying?
Go back briefly to Part 2. The scalar availability number destroyed information along three axes:
- Event structure – it kept total downtime and discarded how that downtime was distributed: How many times did you go down? How long was each event?
- Time context – it treated every hour as equal: It ignored when downtime coincided with price, demand, or penalties.
- Capacity level – it treated any non‑zero output as “up”: It ignored whether “up” meant 100% of nameplate or 60%.
Part 3 showed that each of these collapses can burn tens of millions per year on its own. Part 4 showed that when they operate together, they do not add, they multiply.
From this, the design constraint is non‑negotiable:
- The replacement cannot be a single number. A scalar always collapses at least one of these dimensions.
- The replacement also cannot be a 47‑line dashboard nobody reads…(people prefer short articles like this one)
What we are looking for is the smallest vector of numbers that:
- Separates event structure, time context, and capacity level.
- Is short enough that a plant manager, a finance director, and a board member can each read it in under a minute… so they can get back to their TikTok videos efficiently!!
The answer is five components.
- One that measures how much you actually produced.
- Two that describe how downtime is structured.
- One that measures how value‑weighted the losses are.
- One that tells you whether things are getting better or worse.
We will take them one by one.
5.2 The Five‑Component Vector: Overview
Here is the vector in one table, before we explain each piece:
| Component | What it measures | Which collapse it prevents |
|---|---|---|
| V₁: Throughput Availability | Actual production ÷ maximum possible production | Capacity‑level collapse |
| V₂: Event Frequency | Number of output‑loss events per period | Event‑structure collapse |
| V₃: Duration Severity | 95th‑percentile single‑event duration | Event‑structure collapse (tail risk) |
| V₄: Value‑Weighted Unavailability | Lost production × price at time of loss ÷ total revenue potential | Time‑context collapse |
| V₅: Trend Slope | Slope of cumulative production shortfall over time | All three (trajectory) |
Now, just read the right‑hand column:
- Event structure information is protected by V₂ and V₃.
- Time context significance is protected by V₄.
- Capacity level is protected by V₁.
- V₅ does not introduce a new dimension; it tells you whether the other four are improving, stable, or degrading.
Now that it is clear what we are protecting and why, we can build each one of those “Vsss.”
5.3 V₁: Throughput Availability: Fixing “Availability of What?”
What V₁ is.
V₁ replaces the question “How often were we up?” with the harder question “How much did we actually produce?”
- Denominator: rated capacity × total calendar hours in the period.
- Numerator: what actually went out the gate. You name it: tonnes, MWh, cubic meters, barrels, taken from the production historian or custody‑transfer meter…(no no no CMMS…why CMMS….forget “CMMSsss”)
No modeling. No interpretation of “up.” No arguing about what the CMMS status should have been. Product either left the plant or it did not.
Why it exists.
Think back to Gamma in Part 4 where we review the “Silent Bleeder.” Gamma reported 90% availability. But once you account for derating during “up” hours, its effective throughput availability was about 83%. That 7‑point gap was not abstract; it was tens of millions in lost revenue.
V₁ is aimed to exists to make that gap impossible to hide. It is not interested in whether the plant was flagged “running.” It is interested in whether the plant delivered production at the rate it should.
The same with Alpha.. the “Fragile Workhorse.” Alpha’s availability was 90%, but derating during uptime left it with an effective capacity factor of 88.8%, giving a true throughput availability of 79.9%. Binary availability said 90%. V₁ says:
- 79.9% of what you could have produced actually left the plant.
That is not a status. That is a missing line on the revenue statement.
How it looks in practice.
For Alpha, Beta, Gamma:
| Plant | Binary availability | V₁ (Throughput Availability) | Gap |
|---|---|---|---|
| Alpha | 90.0% | 79.9% | 10.1 points |
| Beta | 90.0% | 86.0% | 4.0 points |
| Gamma | 90.0% | 82.3% | 7.7 points |
In one row you see:
- The headline “90%” is an illusion of control.
- The real throughput sits 4–10 points lower.
Those gaps are now explicit, in the only unit that matters: product that did or did not leave the gate.
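A minimal sketch of the V₁ computation from an hourly historian export. The toy year below (a plant rated near 83.3 t/h, roughly the 2,000 t/day units of Part 4, with assumed shares of full, derated, and down hours) only exists to show how the binary number and V₁ diverge:

```python
def throughput_availability(hourly_output_t, rated_tph):
    """V1 = actual production / (rated capacity x calendar hours)."""
    return sum(hourly_output_t) / (rated_tph * len(hourly_output_t))

HOURS, RATED = 8760, 83.3                 # ~2,000 t/day nameplate (assumed)
toy_year = ([RATED] * int(0.80 * HOURS)   # 80% of hours at nameplate
            + [50.0] * int(0.14 * HOURS)  # 14% derated to ~60% of nameplate
            + [0.0] * int(0.06 * HOURS))  # 6% fully down

print(f"binary availability ~ 94%, V1 = {throughput_availability(toy_year, RATED):.1%}")
# Binary says 94%; V1 says ~88%: the derated hours stop hiding.
```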
Where Barringer fits.
Barringer’s process reliability work uses Weibull plots of daily or shift output to understand how consistent production is. In that framework:
- V₁ is the central production level.
- The Weibull shape parameter β tells you how tightly the daily output clusters around V₁.
High β means most days deliver close to V₁, meaning stable, predictable production. Low β (≈2–5) means some days are high, some are disastrous, but the average happens to be V₁.
You do not need to use Weibull plots to apply V₁. But if you already work with them, V₁ becomes the obvious anchor:
- The vector reports “how much.”
- Barringer‑style plots explain “how consistently.”
And… the market? Yes! I know, the market COULD push your production to lower limits… but this is a first version of the KPI, and this is a guiding article. So YES, market variability can impact this V₁; the good news is that it lets Us start talking about the sync between market and plant… and that will have to wait for another blog entry.
5.4 V₂ and V₃: Event Frequency and Duration Severity: Restoring Event Structure
Availability’s first crime was to treat “876 hours of downtime” as a single blob, ignoring the dangerous tails…whether that is 4 long events or 36 short ones. V₂ and V₃ put the structure back.
V₂: Event Frequency.
V₂ is the count of output‑loss events in the reporting period.
We define an output‑loss event as any continuous interval where actual production falls below a threshold. Let’s say typically something like 95% of rated capacity, but the site can choose the threshold.
Why below 95%, not just 0%? Because a day at 60% capacity is economically an outage for 40% of your plant, even if the CMMS status says “running.” V₂ counts both full shutdowns and serious deratings.
V₃: Duration Severity.
V₃ is the 95th‑percentile single‑event duration. Not the average. The near worst‑case.
Part 3 showed why this matters:
- Plant A’s short outages were mostly absorbed by buffers.
- Plant B’s 9‑day outages blew through buffers and shut the whole chain down.
The average duration hides this difference; measuring the tail prevents that. The 95th percentile captures exactly the events that:
- Drain surge bins and tanks.
- Trigger contract continuity penalties.
- Create the cascades that empty or block entire systems.
Why the pair is necessary.
Part 2 gave you the identity:
- System 1: failure rate λ, repair rate μ.
- System 2: failure rate 2λ, repair rate 2μ.
Same availability. Different worlds.
One has half as many events, twice as long. The other has twice as many events, half as long.
V₂ and V₃ together expose exactly that:
- V₂ distinguishes “few events” from “many.”
- V₃ distinguishes “short events” from “long.”
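A minimal sketch of how V₂ and V₃ fall out of the same hourly series, using the 95%-of-nameplate threshold suggested above; the toy series is hypothetical:

```python
import numpy as np

def event_stats(hourly_output, rated, threshold=0.95):
    """Return (V2: output-loss events in the period, V3: P95 event duration, h)."""
    below = np.asarray(hourly_output) < threshold * rated
    durations, run = [], 0
    for flag in below:
        if flag:
            run += 1                    # still inside an output-loss event
        elif run:
            durations.append(run)       # event just ended
            run = 0
    if run:
        durations.append(run)           # event still open at period end
    if not durations:
        return 0, 0.0
    return len(durations), float(np.percentile(durations, 95))

# Toy series: mostly at nameplate, two serious deratings and one long outage.
series = [100] * 200 + [60] * 6 + [100] * 300 + [0] * 48 + [100] * 200 + [90] * 4
print(event_stats(series, rated=100))   # -> (3 events, P95 duration ~44 h)
```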
How they look for Alpha, Beta, Gamma.
| Plant | V₂ (Events/year) | V₃ (P95 duration) | Pattern |
|---|---|---|---|
| Alpha | 48 | 32 hours | Frequent, short — “death by a thousand cuts” |
| Beta | 8 | 9.2 days | Infrequent, long — “buffer‑breaker” |
| Gamma | 3 | 13.8 days | Rare, very long — but planned and in shoulder seasons |
The availability number said these plants were identical.
V₂ and V₃ say they live in three different universes. An operations professional can look at that table and know, in seconds, where buffers will be overwhelmed, where crews will burn out, and where contracts will break.
Try it yourself… see the table… stare at the table… you’ll feel the plant… and you’ll say “Aha!”… Yes, my friend, that is “the force” growing inside you!
5.5 V₄: Value‑Weighted Unavailability: Putting Time and Money Back Together
V₄ is the component that directly ties reliability to money. It answers the question availability refuses to ask:
“When did the losses happen, and what were those hours worth?”
Definition.
V₄ = Σₜ (ΔQₜ × Pₜ) / Σₜ (C_rated × Pₜ)
Where:
- ΔQₜ = production shortfall in hour t (rated capacity minus actual output).
- Pₜ = price or contribution margin in hour t.
- Denominator = total revenue potential if the plant had run at rated capacity every hour.
V₄ is the fraction of potential revenue destroyed by unreliability in the period.
Why it matters.
Return to Plant C and Plant D from Part 3 — both at 91% availability, same 788 hours of downtime, but a $74.3 million revenue gap.
- Plant C’s downtime fell mostly in baseload hours at $35/MWh.
- Plant D’s downtime fell in scarcity hours at $1,500–$4,000/MWh.
Same hours lost, completely different value losses.
V₄ captures that directly:
- For Plant C, V₄ would be roughly 5.4%. It lost 5.4% of total revenue potential.
- For Plant D, V₄ would be about 34%. It lost roughly a third of its total revenue potential.
Same total downtime. More than a six‑fold difference in V₄. That difference is the $74.3 million. Availability cannot see it. V₄ cannot avoid it.
How V₄ looks for Alpha, Beta, Gamma.
| Plant | Scalar unavailability (1 − A) | V₄ (Value‑weighted unavailability) | Revenue lost vs. the naive 90% plan |
|---|---|---|---|
| Alpha | 10.0% | 17.8% | $33.9M/year |
| Beta | 10.0% | 14.2% | $16.0M/year |
| Gamma | 10.0% | 15.8% | $24.5M/year |
In every case, V₄ > (1 − A). That is not a coincidence (“now you see me!”): failures and deratings cluster in the hours that are worth the most, so the value‑weighted loss is larger than the time‑weighted one.
V₄ is built to express that anti‑correlation in one number.
Where the prices come from.
You already have the price or margin vector Pₜ:
- In a merchant plant: from the market desk.
- In an offtake/contracted plant: from contract schedules and penalty curves.
- In an integrated operation: from a seasonal margin model in your ERP.
You also have the production history: ΔQₜ comes from your historian or SCADA… (s‑ca‑da… did you notice I didn’t say CMMS?)
V₄ is simply the multiplication of two data sets you already own. The scalar availability metric ensured nobody was required to connect them.
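A minimal sketch of that multiplication. The four-hour toy series is hypothetical, but it shows why one lost scarcity hour can dominate V₄:

```python
def value_weighted_unavailability(actual_output, prices, rated):
    """V4 = sum(shortfall_t * price_t) / sum(rated * price_t)."""
    lost_value = sum((rated - q) * p for q, p in zip(actual_output, prices))
    potential = sum(rated * p for p in prices)
    return lost_value / potential

output = [400, 400, 0, 0]      # MW delivered in each hour
prices = [35, 35, 35, 1500]    # $/MWh in each hour (one scarcity hour)
print(f"V4 = {value_weighted_unavailability(output, prices, rated=400):.1%}")
# Scalar unavailability is 50% (2 of 4 hours lost); V4 is ~96%, because the
# one lost scarcity hour carries almost all of the revenue potential.
```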
5.6 V₅: Trend Slope: Is It Getting Better or Worse?
V₁–V₄ describe where you are now. They say nothing about where you are heading. That is V₅’s job.
V₅ is a single parameter that says:
- Are losses slowing down?
- Are they steady?
- Or are they accelerating?
How we define V₅.
Take the cumulative production shortfall. Yes, the running sum of ΔQₜ over time. Plot it against cumulative calendar time on log‑log axes.
Empirically, in many industrial systems, that plot can be approximated by a power law:
N(t) = k · t^b
Where:
- N(t) = cumulative shortfall at time t.
- k = a scale constant.
- b = slope of the line on the log‑log plot.
This is the same functional form widely used in reliability growth and degradation analysis (often referred to as the Crow‑AMSAA model in failure counting). Here, we apply the same idea to lost production instead of failure counts.
What the slope means.
- b < 1: the rate of production loss is decreasing. The plant is improving. Reliability investments are working.
- b = 1: the rate of production loss is constant. The current pattern will persist.
- b > 1: the rate of production loss is increasing. The plant is degrading. Fouling, corrosion, wear, or operational abuse is outpacing your controls.
V₅ is simply that slope, reported as one number.
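A minimal sketch of extracting V₅: accumulate the shortfall, move to log-log axes, fit a straight line. The two synthetic series are assumptions, one with a constant loss rate and one with an accelerating loss rate:

```python
import numpy as np

def trend_slope(hourly_shortfall):
    """V5 = slope of log(cumulative shortfall) vs log(cumulative time)."""
    cum = np.cumsum(hourly_shortfall)
    t = np.arange(1, len(hourly_shortfall) + 1)
    mask = cum > 0                      # log undefined before the first loss
    b, _intercept = np.polyfit(np.log(t[mask]), np.log(cum[mask]), 1)
    return b

rng = np.random.default_rng(1)
steady = rng.exponential(1.0, 5000)                 # constant loss rate
worsening = steady * np.linspace(0.5, 2.0, 5000)    # loss rate accelerating

print(f"steady plant:    b ~ {trend_slope(steady):.2f}")     # ~1.0
print(f"worsening plant: b ~ {trend_slope(worsening):.2f}")  # >1.0
```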
Why it is essential.
Imagine Gamma again. V₁–V₄ for Gamma are not good, but they are not catastrophic. Without V₅, you might decide you can tolerate them.
Now imagine you compute V₅ for Gamma and find b = 1.4:
- Losses are not just present; they are accelerating between turnarounds… are you going to pay for that?
- Every year you delay intervention, next year’s V₁ will be lower, V₄ will be higher, and the NPV gap will widen faster than before.
Conversely, Alpha might have ugly current values but a V₅ less than 1 (say 0.7):
- The interventions you are paying for (root‑cause work, instrument campaigns, cycle‑reduction, and all that) are working!… there is growing value there!!
- The plant is moving toward a healthier state, things are getting better in a sustained manner, even if the snapshot is still bad today.
V₅ turns V₁–V₄ from static diagnostics into a trajectory. And most importantly, you are measuring a probabilistic situation with a probabilistic tool! You’re not “trapping” a stochastic process inside a lame Excel cell!… That’s how brave asset managers play!
The kink as an audit trail.
When you plot cumulative shortfall over time, any sharp change in slope, a “kink” (no, not those kinks… asset management kinks…) marks something real:
- A new maintenance policy.
- A redesign.
- A change in operating regime.
By fitting slopes before and after each kink, you can quantify exactly how much each intervention changed the trajectory. This is not just technically, emotionally, and personally satisfying; it is what standards like ISO 15663‑3 demand of life‑cycle costing: track your assumptions, decisions, and measured effects so later phases can confirm or challenge them.
5.7 Assembling the Vector So the Board Can Actually See!
Now put the five components together for the three plants from Part 4, plus the naive “90%” view:
| Component | Naive 90% | Alpha | Beta | Gamma |
|---|---|---|---|---|
| V₁ Throughput availability | 90.0% | 79.9% | 86.0% | 82.3% |
| V₂ Events/year | — | 48 | 8 | 3 |
| V₃ P95 event duration | — | 32 hours | 9.2 days | 13.8 days |
| V₄ Value‑weighted unavailability | 10.0% | 17.8% | 14.2% | 15.8% |
| V₅ Trend slope | — | 0.85 | 1.05 | 1.38 |
Again, use your focus… read each column, one plant at a time:
- Alpha: worst V₁ (heavy derating), highest V₂ (chronic firefighting), short V₃ (no single long events), high V₄ (lots of value lost), but V₅ below 1 → trajectory improving. It is ugly now, but your investments are pushing it in the right direction.
- Beta: best V₁ (runs clean when it runs), moderate V₂, very long V₃ (buffer‑breaking outages), moderate V₄ (downtime falls in expensive quarters), V₅ ≈ 1 → stable but structurally mis‑timed. You will not grow out of this; you must fix the timing.
- Gamma: weak V₁ (degradation), lowest V₂ (few events), longest V₃ (long planned outages), high V₄ (derating in premium periods), worst V₅ → losses accelerating. This is the plant boards feel safest about today and should be most worried about for the next decade.
No single number in the vector tells the whole story. Together, they tell you what is wrong, where it hurts, and whether it is getting better or worse.
5.8 The Financial Translation: From Vector to Dollars
The vector becomes decisive when it speaks in money. The translation is straightforward.
Step 1: Revenue baseline.
Compute the maximum possible revenue if the plant hit nameplate every hour:
R_max = Σₜ (C_rated × Pₜ)
For the Part 4 plants, R_max ≈ $335.7 million per year.
Step 2: Apply V₄ directly.
Because V₄ is “fraction of revenue potential destroyed,” annual revenue lost to unreliability is simply:
Revenue lost = V₄ × R_max
No modeling. No forward assumptions. Just measured production and measured prices.
- Alpha: 17.8% × $335.7M ≈ $59.8M/year destroyed.
- Beta: 14.2% × $335.7M ≈ $47.7M/year.
- Gamma: 15.8% × $335.7M ≈ $53.0M/year.
Those numbers explain the NPV gaps you saw in Part 4.
Step 3: Project using V₅.
V₅ tells you whether V₄ is likely to:
- Shrink over time (b < 1).
- Stay roughly constant (b ≈ 1).
- Grow (b > 1).
You can use that slope to project a plausible path for V₄ over the next 5–10 years and build a free‑cash‑flow model. The 10‑year NPVs in Part 4 were built exactly that way: each year’s cash flow adjusted by that year’s V₄, and the trend in V₄ driven by V₅.
You do not need perfect prediction. You need a consistent, explicit rule tied to measured behavior, not wishful thinking.
Step 4: The intervention test.
Now the vector becomes a capital‑allocation tool.
Take Gamma. Engineers propose a mid‑cycle cleaning: a three‑day mini‑turnaround in each premium quarter to arrest fouling and scaling. Cost: $1.2M per intervention, $2.4M per year. Expected effects:
- V₁ improves from 82.3% to 87.5%.
- V₄ drops from 15.8% to 11.2%.
- V₅ flattens from 1.38 to ~1.0 (no further acceleration).
The financial translation:
- V₄ improvement of 4.6 points = 4.6% × $335.7M ≈ $15.4M/year additional captured revenue.
- Cost: $2.4M/year.
- Net benefit: about $13.0M/year.
Over 10 years at 10% discount rate, that is an NPV of roughly $80 million — essentially equal to the NPV Gamma was losing in Part 4.
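The same annuity arithmetic as the Part 4 sketch, applied to the proposed Gamma intervention; the inputs are the figures listed above, and the function just makes the V₄-to-dollars-to-NPV chain explicit:

```python
def intervention_npv(delta_v4, r_max_musd, annual_cost_musd, rate=0.10, years=10):
    """NPV ($M) of an intervention that shaves delta_v4 off value-weighted losses."""
    annual_benefit = delta_v4 * r_max_musd - annual_cost_musd   # ~$13.0M/year
    annuity = sum(1 / (1 + rate) ** y for y in range(1, years + 1))
    return annual_benefit * annuity

print(f"Gamma mid-cycle cleaning NPV ~ ${intervention_npv(0.046, 335.7, 2.4):.0f}M")
# ~ $80M, roughly the NPV Gamma was losing in Part 4.
```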
That is what the vector enables:
- You see where money is leaking (which V is off).
- You quantify how fast it is getting better or worse (V₅).
- You can test proposed interventions in hard cash, not slogans.
The scalar availability number could never do this. It never saw the leak.
5.9 What Changes When You Report the Vector
Once you report these five components regularly, behavior changes. Not because of a workshop. Because the numbers leave no place to hide.
- Maintenance stops optimizing for “uptime” and starts optimizing for V₁ and V₂/V₃. A day at 60% capacity is no longer celebrated as uptime; it is logged as an output‑loss event. The day‑to‑day consistency of production, the equivalent of a high β in Weibull terms, becomes the real reliability badge, not MTBF.
- Planning and operations stop treating all downtime hours as equal. V₄ forces them to place turnarounds and major work where the revenue density is lowest. Beta‑style seasonal failure modes become visible and unacceptable, not “bad luck in winter.”
- Finance stops projecting revenue as “availability × capacity × price” and starts from V₁ and V₄, with V₅ driving the trend. The $76M–$136M NPV gap you saw in Part 4 disappears, not because the plants improved overnight, but because the model finally matches the physics and the market.
The board sees five numbers instead of one. Those five numbers cost them an extra thirty seconds per asset… not too much, I guess, because this prevents tens or hundreds of millions in hidden value destruction per decade.
If anyone argues that five numbers is “too complicated” for a board that routinely reads 50‑page financial models, they are not arguing for simplicity. They are arguing for blindness… (or for free time for more TikToks).
5.10 I know… this is not “that short”!
If you remember nothing else from this section, remember this:
Replace the scalar availability number with a five‑component vector: throughput availability, event frequency, duration severity, value‑weighted unavailability, and trend slope
And you will see, in every reporting period, how much money unreliability is actually destroying, where it is destroying it, and whether the destruction is accelerating or receding.
Part 6 — Closing: Stop Targeting Availability
If you have read this far, you can never again look at “90% availability” as reassurance because:
- You have seen that the same availability number can hide a plant bleeding twenty million dollars more than its twin through long outages that blow through buffers and contracts.
- You have seen that identical availability can mask a seventy‑million‑dollar revenue gap when downtime clusters in scarcity hours instead of baseload hours.
- You have seen that a unit can be “available” at half power, silently eroding more than seven percentage points of energy‑equivalent availability while every dashboard light stays green.
- You have also seen that none of this is mysterious.
…and what has been seen cannot be unseen!!! (Insert dramatic closing music here with your mind!)
Once you stop collapsing away event structure, time context, and capacity level, the economics fall straight out of the data you already own.
A handful of numbers (again):
- throughput availability,
- event frequency,
- duration severity,
- value‑weighted unavailability, and
- trend slope
Are enough to make value creation visible and value destruction impossible to hide.
At that point the question is not whether you can do better. It is whether you are willing to keep aiming at the wrong target. So retire the comfort blanket. Stop targeting availability and start engineering visibility into the shape of your downtime and its consequential impact.
My hope is that you are not walking away from this article with a new acronym, or a blurred idea that “availability is bad”. If that is all you got, I did a terrible job writing this. My hope is that you are walking away with a new suspicion.
You now know that:
- Two plants can both show 90% availability and sit $20.8M/year apart because one dies in short cuts and the other dies in nine‑day outages that blow through every buffer and contract.
- Two power stations can both show 91% availability and sit $74.3M/year apart because one fails in baseload hours and the other fails exactly in scarcity hours.
- A unit can proudly report 94% availability while silently running at 86.6% energy‑equivalent availability: technically “up,” functionally cheating you out of 142,040 MWh.
Those are not edge cases. They are the logical consequences of a metric that averages away exactly the dimensions where value lives: event structure, time context, and capacity level.
Now, (and I know this could be “too much” for some people in love with other relevant measures) I need to put a closing note about the obvious escape route:
“Fine, availability is flawed. But we knew that… that’s why we use OEE, so we are okay!!”
Well…OEE is a better efficiency metric for discrete manufacturing. It multiplies three ratios: availability, performance, and quality. It is useful on a line, for a shift, to see how much of the theoretical capacity you actually used.
It does not solve the problem we just dismantled:
- OEE still uses a scalar availability in its first term.
- OEE is typically reported as a single number per period, so it still does not care when the good units were produced, or when the losses occurred.
- OEE does not know if your worst losses landed in peak‑price weeks or in the cheapest hours of the year.
- OEE does not know if your “performance loss” is hundreds of tiny slow‑downs or a handful of catastrophic collapses.
You can have a beautiful OEE and still:
- Fail every continuity clause in your contracts.
- Miss every high‑margin hour in your market.
- Run “at speed” during the hours nobody cares about, and limp through the ones that define your NPV.
OEE, like availability, is a scorpion body measurement. It tells you how large the scorpion looks. It does not tell you what is in the tail of those averages…and how dangerous that is!
You have something better now.
You have a five‑component vector that:
- Counts what actually left the gate (V₁).
- Exposes how downtime is structured (V₂ and V₃).
- Prices the losses in the hours they actually occurred (V₄).
- Shows whether the pattern is tightening or spiraling (V₅).
It does not ask you to believe in anything. It just refuses to throw away the parts of the distribution that hurt you most.
Let’s Go back to the scorpion.
The body is theater. The tail (statistical spread) is where the neurotoxin lives.
“90% availability” — and every scalar built on top of it, including OEE — is the body. It is what you see on the dashboard. It is big, glossy, and reassuring.
The real risk in your plant is in the tails of the distribution:
- The long outages that drain every buffer.
- The clusters of failures exactly where price and penalties spike.
- The derated hours where you are “up” at 60–80% while the KPI calls it 100%.
You have seen, in hard numbers, what those tails are worth: tens of millions per year, hundreds of millions over asset life. And you have seen that simple, explicit measurements:
- not “cleverness” (← I love this word… it sounds like cleanliness but it’s really dirty),
- not AI,
- no fancy stuff…
just honest accounting of structure, timing, and capacity, are enough to surface them.
At this point the question is no longer technical.
It is this:
- Are you willing to keep signing off on a single scalar that you now know is structurally blind to the very patterns that decide your P&L?
- Or do you want to see the tail of your own distribution before it stings you in the next capital cycle?
Take “90% availability” off the scorecard. Not because it is too harsh, but because it is too kind.
Replace it with the structure of downtime that actually drives cash.
Next time you see a comforting availability or OEE number on a slide, do not look at the body. Ask to see the tails.
