Toyota halts production after hard drive fills up: The 6 things you must know.
OUT OF SPACE AND TIME
The seeds of the outage were sown quietly on August 27th, when, according to two anonymous sources quoted by Reuters, something went wrong after IT staff attempted to update the parts ordering system. In a press release, Toyota confirmed that one of the server drives ran into an all-too-common problem:
“During the maintenance procedure, data that had accumulated in the database was deleted and organized, and an error occurred due to insufficient disk space, causing the system to stop. Since these servers were running on the same system, a similar failure occurred in the backup function, and a switchover could not be made.”
The maxed-out drive soon forced the automaker – which pioneered ultra-lean just-in-time parts delivery methods – to halt production. The assembly lines restarted August 30th only after crews transferred the data to a server with a larger drive. The outage idled 28 production lines representing about 30% of Toyota’s worldwide output and resulted in a production loss of 13,000 vehicles.
In its statement, the company said, “Countermeasures have also been put in place by replicating and verifying the situation,” and added that it will review its maintenance procedures. It also apologized to “customers, suppliers, and related parties for any inconvenience.” In a follow-up statement, Toyota confirmed that the outage was not caused by a cyberattack and that it would continue to investigate the cause.
The event exposes a number of weaknesses in the automaker’s internal processes that any organization – not just an automaker – can learn from. Consider the following changes to your own processes when evaluating your operational roadmap:
- Identify and eliminate disaster recovery plan (DRP) weak spots. Toyota’s DRP included a backup system using servers installed on what it describes as “the same system”. This created a critical single point of failure: the fault that stopped the primary servers also stopped the backup, so no switchover was possible. Conduct regular assessments and in-use tests of the DRP to find these weaknesses, then redesign the architecture to remove them.
- Test all solutions. Simulated test-restores of data can help uncover vulnerabilities in the documented plan and give maintenance teams the information they need to bolster recovery procedures. Build as realistic a test environment as possible, and test in isolation to build knowledge and confidence before applying changes to the production environment. Turn this into a regular exercise so that maintenance teams stay trained and ready.
- Prioritize real-time analytics. The best disaster is an avoided one. Toyota’s statement suggests that no one knew beforehand that the server in question was low on disk space. That is a common outcome of legacy thinking that focuses on disaster, failover, and recovery instead of real-time, pre-emptive monitoring and response.
- Double down on disk space monitoring. Databases are at particular risk of catastrophic outages when they run out of disk space, because maintenance operations often need extra working space on top of the data they already hold. Build greater resiliency into operational systems by implementing data streaming analytics to track performance in real time, optimize resource use, speed up the response to known issues, and minimize downtime.
- Implement version control. It isn’t useful only for developers; IT maintenance can benefit from tighter version control as well. Apply it to configuration files, scripts, and any other automated or manual maintenance procedures. Use change management principles to document and track changes from version to version and to build comprehensive rollback procedures. Proactive and repeated communication with the maintenance teams is essential for version control success.
- Document the maintenance plan. A comprehensive plan – including detailed procedures, milestones, timelines, anticipated issues, initial assessments, final validations, and contingency plans – can reduce risk, improve robustness, and help avoid downtime. It also provides critical information in the event of an outage and can bolster compliance efforts in regulated sectors.
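To make the disk space monitoring point concrete, here is a minimal Python sketch of a pre-flight check that a maintenance script could run before it starts reorganizing a database. It is an illustration only, not a description of Toyota’s systems: the names `DiskStatus`, `check_disk`, and `safe_to_run_maintenance`, the `/var/lib/db` path, the 50 GB estimate, and the 20% free-space floor are all assumptions chosen for the example. The only library call used is `shutil.disk_usage` from the Python standard library.

```python
import shutil
from dataclasses import dataclass


@dataclass
class DiskStatus:
    """Snapshot of capacity and free space for one filesystem path."""
    path: str
    total_bytes: int
    free_bytes: int

    @property
    def free_ratio(self) -> float:
        # Fraction of the volume that is currently free (0.0 to 1.0).
        return self.free_bytes / self.total_bytes if self.total_bytes else 0.0


def check_disk(path: str) -> DiskStatus:
    """Read the current usage of the volume that holds `path`."""
    usage = shutil.disk_usage(path)
    return DiskStatus(path=path, total_bytes=usage.total, free_bytes=usage.free)


def safe_to_run_maintenance(
    status: DiskStatus,
    required_bytes: int,
    min_free_ratio: float = 0.20,  # hypothetical safety floor
) -> bool:
    """Return True only if the job's estimated working space fits AND
    the volume would still keep `min_free_ratio` free afterwards."""
    if status.total_bytes <= 0:
        return False
    remaining = status.free_bytes - required_bytes
    if remaining < 0:
        return False
    return remaining / status.total_bytes >= min_free_ratio


if __name__ == "__main__":
    # Hypothetical data directory and working-space estimate.
    status = check_disk("/var/lib/db")
    if not safe_to_run_maintenance(status, required_bytes=50 * 1024**3):
        raise SystemExit(
            f"Aborting: only {status.free_ratio:.0%} free on {status.path}"
        )
    print("Enough headroom; proceeding with maintenance.")
```

Running the same check on the backup server matters just as much as running it on the primary, and the same function can feed an alerting job on a schedule so that low headroom is flagged before a maintenance window rather than during one.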
THE BOTTOM LINE
Toyota’s experience could just as easily be your organization’s, or any organization’s. Not even being the world’s dominant automaker protected the company from the kind of system failure that has haunted IT professionals for as long as IT has existed.
Here at STEP Software, we’re constantly thinking of ways to equip our clients and stakeholders with the tools and the knowledge to run their organizations more effectively. Our upcoming StringNetwork platform can help remove many of the technology limitations that prevent organizations from unleashing their true potential. Head here to be among the first to learn more.