Hundreds of Millions of People Use AI Tools Every Day.
There must be no cooler job in physics than stress testing in materials science, where teams of PhDs figure out how to push materials to their breaking point. Their job is to make things fail on purpose so the rest of us can trust what doesn’t. Stress testing is a feature of many fields, but aerospace is a particularly famous example: aircraft designed to perform at high altitudes, high speeds, and under extreme g-forces must be tested so that neither their structure nor their fundamental materials fail under those conditions. In space, temperatures reach 248 degrees Fahrenheit in direct sunlight and –302 in the shade, a 550-degree swing. Under those conditions, untested materials could warp or become brittle, and such failures are not tolerated.
The reason we stress materials in aerospace is simple: lives depend on it. Every component must perform consistently, even in edge cases, because the consequences of failure are immediate and catastrophic. For decades, we’ve accepted that nothing leaves the ground until we know it can withstand whatever conditions it might face. When we build things that impact human safety, we owe the same rigor. Hidden failures at 35,000 feet are deadly. Hidden failures anywhere are too.
Stress testing against specific threats and edge cases is second nature when we think about aircraft and spacecraft, and the development of artificial intelligence (AI) should make us think the same way. AI is used by hundreds of millions of people every day and has direct, documented impacts on human mental well-being, from children to adults. For organizations, there are documented cases of guardrail and policy failures that have resulted in billions of dollars in losses or fines, to say nothing of reputational damage. But unlike aircraft, we don’t stress test AI before we put it in front of humans. This represents a gap between our traditional cybersecurity practices and what it really takes to secure an AI system: not necessarily from intrusion, but from how the model performs under different conditions. AI stress testing differs from material stress testing only in method, not in intent. Stress testing will save lives and prevent incidents in AI, and we need more of it for the systems that millions of people use every day.
Some Numbers
According to the Air Transport Action Group, 5 billion people were transported by commercial aircraft in 2024. That’s 13.7 million people per day.
ChatGPT alone has 190.6 million daily users. Google’s Gemini has around 35-45 million daily users, and Microsoft Copilot (counting its Bing search engine) has 140 million. If we include other providers like Anthropic and Perplexity, the number of daily users of foundational large language models (LLMs) is in the high hundreds of millions…of humans…PER DAY.
For an industry that moves 13.7 million people per day, we’ve instituted rigorous stress testing procedures and inspections that are rooted in regulation globally. For an industry that touches potentially more than half a billion people per day, we have no such requirements. The need to secure traditional software and software-as-a-service (SaaS) platforms has been recognized for decades, but the same has not been true for AI testing. Stories about mental and physical harm from AI bots continue to hit the headlines, as do stories of corporations abandoning AI projects. If the airline industry had crashes at this rate, there would be a global ground stop.
AI stress testing matters because it prevents physical, financial, and reputational harm for hundreds of millions of people and for the companies that serve them. Organizations that do not stress test risk running afoul of a growing body of state AI and data regulations, in addition to reputational loss and customer harm.
Stress Testing for the .0000001% Chance
In aerospace stress testing, the macro-level issues matter as much as the edge cases. Putting airframes through extreme conditions that the actual airframe may never face is part of the process. Military aircraft in particular are subject to significant stress and must perform even when damaged by enemy fire. In space, huge temperature swings and dust particles can cause real damage and result in catastrophic failure. But it is not just these known unknowns that are tested for; it is the unknown unknowns as well (this was a maxim of aerospace design long before Donald Rumsfeld made the phrase famous). That’s why edge cases are tested. Here is an example of the types of tests that are performed on aircraft:
| Test Type | Simulated Extreme Environment | Purpose |
| --- | --- | --- |
| Mechanical Loading | Launch forces, internal pressure, payload forces | Measures yield strength, ultimate tensile strength, fatigue life, and fracture toughness under tension, compression, and shear. |
| Thermal Cycling/Shock | Rapid, drastic temperature changes | Tests the material’s ability to withstand cycles of extreme heat and cold without cracking, delamination (in composites), or excessive thermal expansion. |
| High Vacuum | The near-perfect vacuum of space | Assesses outgassing, where volatile materials are released. This is critical as outgassed particles can contaminate sensitive surfaces (like optics) elsewhere on the spacecraft. |
| Radiation Exposure | High-energy protons, electrons, and solar/cosmic radiation | Evaluates material degradation like embrittlement, changes in electrical properties, or darkening of optics/coatings. |
| Vibration and Shock | Launch and in-flight maneuvers | Measures the material’s resilience to high-frequency and high-amplitude forces to prevent catastrophic failure or structural instability. |
| Atomic Oxygen (for LEO) | Highly reactive single oxygen atoms in Low Earth Orbit (LEO) | Tests the material’s resistance to chemical erosion caused by high-velocity atomic oxygen. |
Before the physical tests begin, physicists run countless simulations to understand how a material or structure will perform under stress. Those processes evolved to protect human life, and the same logic applies to AI. Stress testing AI means running controlled experiments to see how models behave under both normal and extreme conditions, ensuring they perform reliably no matter what they encounter.
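That simulation step can be sketched in a few lines. The toy Monte Carlo below estimates how often a randomly varying applied load exceeds a part’s randomly varying strength; the distributions and numbers are illustrative assumptions for the sake of the sketch, not real aerospace data:

```python
import random

def estimate_failure_probability(trials: int = 100_000, seed: int = 42) -> float:
    """Toy Monte Carlo: both the applied load and the material's strength
    vary randomly; a failure is any trial where load exceeds strength.
    (Illustrative distributions only -- a real campaign uses measured data.)"""
    rng = random.Random(seed)
    failures = 0
    for _ in range(trials):
        load = rng.gauss(mu=60.0, sigma=10.0)      # e.g., kN of applied force
        strength = rng.gauss(mu=100.0, sigma=8.0)  # e.g., kN the part can bear
        if load > strength:
            failures += 1
    return failures / trials

print(f"estimated failure probability: {estimate_failure_probability():.4%}")
```

Even with a large safety margin between average load and average strength, the overlapping tails produce a small but nonzero failure rate, which is exactly the “.0000001% chance” that simulation campaigns exist to quantify.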
In stress testing AI, there are likewise a number of tests intended to ensure the model performs consistently under macro and edge conditions:
| Test Type | Simulated Environment | Purpose |
| --- | --- | --- |
| Out-of-Distribution (OOD) Data Stress | Production data that has shifted significantly (e.g., a sudden economic shock, new user behavior, or a crisis event not seen in training). | To test the model’s resilience and predictive performance when faced with inputs that deviate from its original training distribution, identifying potential for rapid degradation. |
| Adversarial Attack Stress | A subtle, malicious manipulation of input data (e.g., small, human-imperceptible pixel changes in an image or text perturbations) designed to fool the AI. | To evaluate the model’s security and robustness against deliberate attempts to cause misclassification or unexpected, potentially harmful, behavior. |
| Data Imbalance/Bias Stress | Test datasets with intentionally skewed or missing data for specific demographic or transactional groups, or comparing model performance across sensitive attributes (e.g., race, gender, geography). | To identify and quantify model bias and fairness issues, ensuring the AI performs equitably and reliably across all groups and does not disproportionately harm any of them. |
| Resource/Load Stress | A sudden, massive increase in query volume, input data size, or concurrent user requests, often exceeding the system’s normal operational capacity. | To test the operational resilience, latency, and scalability of the entire AI system (model and infrastructure) under peak-demand or ‘flash-crowd’ conditions. |
| Feature Fragility Stress | Artificially distorting or introducing noise into specific, important input features (e.g., inflating debt-to-income ratios, adding random words to text, or degrading image quality). | To assess the model’s stability and reliance on individual features, revealing if it depends too heavily on unstable or easily manipulated data points. |
| Uncertainty/Calibration Stress | Inputs designed to be ambiguous or borderline cases, or intentionally feeding the model inputs that are known to be far from the decision boundary. | To test the accuracy of the model’s confidence scores (calibration), ensuring the model is not overconfident when it’s wrong or under-confident when it’s right, which is critical for risk management. |
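To make one of these tests concrete, here is a minimal sketch of the feature-fragility row: inject random words into a text input and measure how far the model’s score moves from its baseline. Both `toy_sentiment_model` and `fragility_stress` are hypothetical stand-ins written for this sketch; in practice the model call would be your production system:

```python
import random

def toy_sentiment_model(text: str) -> float:
    """Hypothetical stand-in for a real model: scores positivity 0..1 by keywords."""
    words = text.lower().split()
    pos = sum(w in words for w in ("good", "great", "love"))
    neg = sum(w in words for w in ("bad", "awful", "hate"))
    return 0.5 if pos + neg == 0 else pos / (pos + neg)

def fragility_stress(model, text, noise_words, trials=100, seed=0):
    """Insert one random word per trial and return the largest score shift
    from the unperturbed baseline (0.0 means the model never budged)."""
    rng = random.Random(seed)
    baseline = model(text)
    worst = 0.0
    for _ in range(trials):
        words = text.split()
        words.insert(rng.randrange(len(words) + 1), rng.choice(noise_words))
        worst = max(worst, abs(model(" ".join(words)) - baseline))
    return worst

# Neutral noise should not move a robust model; a single loaded word reveals
# how heavily this toy model leans on individual tokens.
print(fragility_stress(toy_sentiment_model, "I love this great product",
                       ["meanwhile", "blue", "seven"]))  # stays at 0.0
print(fragility_stress(toy_sentiment_model, "I love this great product",
                       ["meanwhile", "awful"]))          # nonzero shift
```

The same harness shape (perturb input, re-query, compare against baseline) generalizes to the out-of-distribution and adversarial rows; only the perturbation function changes.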
When you look at them side by side, the parallels are obvious. AI stress testing and materials stress testing share the same intent; they exist to reveal failure before it happens. Society demands absolute assurance from technologies like aircraft because lives are on the line. Stress testing isn’t a luxury; it’s a catalyst for innovation.
AI deserves the same treatment. Hundreds of millions of people interact with these systems every day, and the evidence of harm is growing. Just as we accepted a zero-failure standard in aerospace, we should do the same for AI. The difference is in the medium. Aerospace engineers use lasers, cryogenics, and force. AI stress testers work at the point of interaction, where humans and models meet. That is where failures hide and where we learn the most. The scale is massive, but that is not an excuse to avoid testing. It is the reason to innovate and make it possible.