Ensuring IT Resilience: The Importance of Redundancy, Diversity and Due Diligence

Redundancy and Diversity in IT Infrastructure

The recent CrowdStrike update bug, which led to widespread blue screen of death (BSOD) incidents, significantly impacted businesses worldwide. Enterprises and public institutions quickly realized the need for strategies that mitigate risks associated with dependencies on single service providers. One of the key lessons from the CrowdStrike event is the importance of redundancy and diversity in IT infrastructure.

By implementing redundancy and diversity, businesses can prevent single points of failure. Redundancy involves setting up backup systems that can take over if the primary system fails; diversity involves using different operating systems and cloud services to ensure that a problem with one does not bring down the entire operation. Had the companies affected by the CrowdStrike bug employed fallback systems running on different operating systems, they could have avoided the widespread disruptions.

Two takeaways from this:

  1. Regularly update and test backup systems.
    Ensure that backup systems are not only in place but regularly updated and tested to confirm they can seamlessly take over in the event of a primary system failure. It's important to note that testing may not have been able to catch the CrowdStrike issue. The only way to "test" such an incident would be to run multiple CrowdStrike deployments, which is both economically and technically challenging.
  2. Use diverse operating systems and cloud services.
    By employing multiple operating systems and cloud service providers, businesses can avoid being entirely dependent on one system and reduce the risk of widespread disruption from a single point of failure. Implementing this approach requires a comprehensive strategy to ensure continuity for entire business clusters.
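
The fallback idea described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not a production failover design: the provider names and the two stand-in functions (`primary_windows_cluster`, `fallback_linux_cluster`) are hypothetical, and a real deployment would also need health checks, data replication and routing changes.

```python
from typing import Callable, Sequence


class AllProvidersFailed(Exception):
    """Raised when no provider in the chain could handle the request."""


def run_with_failover(
    providers: Sequence[tuple[str, Callable[[], str]]],
) -> tuple[str, str]:
    """Try each provider in order; return (name, result) from the first success."""
    errors = []
    for name, call in providers:
        try:
            return name, call()
        except Exception as exc:  # a real system would catch narrower error types
            errors.append(f"{name}: {exc}")
    raise AllProvidersFailed("; ".join(errors))


def primary_windows_cluster() -> str:
    # Hypothetical primary system; simulates an outage such as a faulty update.
    raise ConnectionError("primary unavailable")


def fallback_linux_cluster() -> str:
    # Hypothetical fallback running on a different operating system or cloud.
    return "order processed"


served_by, result = run_with_failover([
    ("primary-windows", primary_windows_cluster),
    ("fallback-linux", fallback_linux_cluster),
])
print(served_by, result)
```

Because the fallback runs on a different stack, a fault that takes down the primary does not automatically take down the fallback as well, which is the point of combining redundancy with diversity.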

Several businesses have successfully mitigated disruptions through redundancy and diversity. For instance, during a major cloud service outage, companies with a diversified cloud strategy could maintain operations by shifting workloads to unaffected providers. Similarly, financial institutions often use a mix of operating systems to ensure that an update error in one does not impact their entire network.

This approach underscores the importance of preparing for a wide range of potential disruptions.

Comparing Internal and External Threats

Internal threats or "non-intentional" incidents, such as the CrowdStrike update error, can have effects comparable to those of major cyber-attacks like Log4Shell or the SolarWinds supply chain attack. This incident demonstrates that, even without malicious intent, significant disruptions can occur, affecting operations, financial stability and reputation.

The diagram below illustrates this comparison. While the financial impact of internal errors (CrowdStrike, rated 3) may be lower than that of major cyber-attacks (Log4Shell, rated 5), the operational downtime and customer impact can be significantly higher (CrowdStrike, rated 5; Log4Shell and SolarWinds, rated 3). Internal errors can cause considerable downtime and service interruptions, necessitating urgent recovery efforts and impacting customer trust.

Figure 1: Impact comparison of internal and external disruptions

It is crucial to note that avoiding security solutions like CrowdStrike Falcon due to potential update errors is not a viable strategy. Intentional cyber-attacks, which these solutions help prevent, often pose more significant risks. For instance, Log4Shell's ability to allow remote code execution led to extensive data breaches and severe regulatory implications (rated 5 in both categories). The proactive defense provided by robust security solutions outweighs the potential for occasional update errors, emphasizing the importance of maintaining comprehensive cybersecurity measures.

While internal threats can have substantial impacts, the protection against intentional cyber-attacks remains paramount, underscoring the necessity of using advanced security solutions like CrowdStrike Falcon.

How to Select and Evaluate Cybersecurity Providers

The CrowdStrike update bug underscores the importance of thorough care and due diligence when selecting and managing service providers. Relying solely on service providers does not absolve businesses from their obligations to ensure IT operations and security. The table below outlines key areas to consider when evaluating potential service providers, along with actionable steps to guide your evaluation.

Key Area | Description | Action Steps
QA Processes | Evaluate the vendor's quality assurance practices, including automated and manual testing, and validation processes. | Request QA documentation, interview the QA team and review testing protocols and results.
Past Performance | Review the vendor's track record with previous clients, focusing on reliability, issue resolution and customer satisfaction. | Ask for references and case studies, check online reviews and conduct performance audits.
Risk Management Capabilities | Assess the vendor's ability to identify, assess and mitigate risks associated with their services. | Review risk management policies, verify certifications and conduct risk assessment workshops.
Security Protocols | Examine the vendor's security measures, including data encryption, access controls and incident response plans. | Request security policy documentation, conduct penetration testing and review incident response history.
Customer Support | Evaluate the availability, responsiveness and quality of the vendor's customer support. | Test response times through inquiries, review support SLAs and conduct satisfaction surveys with existing clients.
Compliance Standards | Check compliance with industry standards and regulations (e.g., ISO 9000, ISO 27000, GDPR, HIPAA). | Verify compliance certifications and conduct compliance audits.

Table 1: Best practices for selecting and evaluating service providers
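
One way to make the six key areas in Table 1 comparable across vendors is a simple weighted scorecard. The sketch below is an assumption layered on top of the table, not a method the table prescribes: the 1 to 5 rating scale, the equal weights and the sample ratings are all hypothetical placeholders to be replaced with your own criteria.

```python
KEY_AREAS = [
    "QA Processes",
    "Past Performance",
    "Risk Management Capabilities",
    "Security Protocols",
    "Customer Support",
    "Compliance Standards",
]


def weighted_score(ratings: dict[str, int], weights: dict[str, float]) -> float:
    """Return the weighted average rating across all key areas."""
    missing = [area for area in KEY_AREAS if area not in ratings]
    if missing:
        raise ValueError(f"missing ratings for: {', '.join(missing)}")
    total_weight = sum(weights[area] for area in KEY_AREAS)
    return sum(ratings[area] * weights[area] for area in KEY_AREAS) / total_weight


# Hypothetical example: equal weights and made-up ratings on a 1-5 scale.
equal_weights = {area: 1.0 for area in KEY_AREAS}
sample_ratings = dict(zip(KEY_AREAS, [4, 3, 5, 4, 2, 5]))

print(f"{weighted_score(sample_ratings, equal_weights):.2f}")
```

Refusing to score a vendor with a missing rating is deliberate: an unassessed area should block the comparison, not silently count as zero.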

See the ISG Provider Lens™ Cybersecurity – Solutions and Services - Global 2024 report for more details about evaluating and selecting service providers.

Additionally, it's crucial to review the master services agreement (MSA) with service providers carefully. The MSA sets out the contractual terms and protections for both parties, including liability limits, SLAs and termination conditions. Ensuring these agreements include appropriate protections can safeguard against issues like inadequate service delivery or security breaches.

Key MSA considerations:

  1. Liability and indemnification
    Define the vendor's liability in case of failures and include indemnification clauses.
  2. Service level agreements (SLAs)
    Specify performance metrics, uptime guarantees and penalties for unmet standards.
  3. Data security and privacy
    Include terms related to data protection and compliance with regulations.
  4. Termination clauses
    Clarify conditions and implications for terminating the agreement.
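
When negotiating uptime guarantees, it helps to translate a percentage into the downtime it actually permits. The short calculation below is a sketch that assumes a 30-day measurement period by default; the function name and the period are illustrative, and a real SLA defines its own measurement window and exclusions.

```python
def allowed_downtime_minutes(uptime_percent: float, period_days: int = 30) -> float:
    """Return the minutes of downtime an uptime guarantee permits per period."""
    if not 0 <= uptime_percent <= 100:
        raise ValueError("uptime_percent must be between 0 and 100")
    total_minutes = period_days * 24 * 60
    return total_minutes * (1 - uptime_percent / 100)


for guarantee in (99.0, 99.9, 99.99):
    minutes = allowed_downtime_minutes(guarantee)
    print(f"{guarantee}% uptime allows {minutes:.1f} minutes of downtime per 30 days")
```

For example, a 99.9% guarantee over 30 days still permits roughly 43 minutes of downtime, which is worth weighing against the penalties the SLA attaches to unmet standards.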

The CrowdStrike incident underscores the need for thorough management and oversight of service providers. Businesses must assess providers' quality assurance, past performance, risk management, security protocols, customer support and compliance standards. Regular reviews and audits help identify and address vulnerabilities before they impact operations, ensuring a resilient IT infrastructure and mitigating risks from service provider dependencies.

Business Continuity and Disaster Recovery

Reliable business continuity and disaster recovery (BCDR) programs are essential as a "second line of defense." These programs should be well-designed, regularly tested and automated to ensure rapid recovery from incidents, minimizing downtime and business impact. It's crucial to acknowledge that technologies like BitLocker, which encrypt drives to protect data, can complicate recovery processes. In situations like the CrowdStrike update bug, machines protected by BitLocker or similar encryption may need a recovery key entered by hand before repairs can begin, which slows recovery when many devices are affected.

A resilient IT environment requires reducing dependencies, leveraging managed security services, performing due diligence and maintaining effective BCDR programs. Together, these strategies improve protection against internal failures and external threats, ensuring sustainable operations and security in a complex digital landscape.

ISG helps enterprises navigate the rapidly evolving cybersecurity landscape to find right-fit providers and manage their security ecosystems. Contact us to find out how we can help.

About the authors

Doug Saylors

Doug currently leads the ISG Cybersecurity unit and offers expertise in cybersecurity strategy, large-scale transformation projects, infrastructure, digital enablement, relationship management and service delivery. Clients benefit from Doug's years of experience working with global clients in the life sciences, automotive manufacturing, aerospace, banking, insurance, financial services, healthcare, utilities and retail industries, as well as his deep and current knowledge of the service provider market. Doug routinely performs strategy and assessment engagements to help clients select the optimal organizational and operational models to meet their business needs while minimizing security exposure and risk of loss.

Matthias Paletta

Dr. Matthias Paletta is a Director and Technology Modernization Solution Lead, EMEA at ISG.
Tim Merscheid

Tim works as a Consulting Manager for ISG in Germany, where he is mainly responsible for client projects with a cybersecurity focus. He concentrates on management and strategic tasks, such as drafting and evaluating policies and assessing overall maturity levels, and also handles individual topics such as the assessment of networks and communication channels.