Continuous availability of healthcare applications has always been a goal for healthcare IT and is a mandatory requirement for the delivery of clinical applications. Healthcare organizations can be severely impacted by application, hardware or data center outages. The risks to patients and the increased hospital liability can be substantial, and they create the need for an availability strategy that supports the future of healthcare IT.
In this post, Ken Bradberry, SVP of Managed Services Operations and Innovation, gives his perspective on the importance of continuous availability of clinical applications in healthcare organizations.
In the future, most hospitals will no longer have manual procedures to overcome a failed IT system. Caregivers rely totally on online EMR applications; emergency rooms and first responders depend on IT systems. Hospitals will be expected to continue to function without being tasked with the difficulty of back-loading hours of activity when even the smallest outage occurs.
Outages are not just inconveniences caused by a lack of IT availability, but absolute disasters that affect the organization and its patients. Every healthcare provider has several mission-critical applications where an outage would be catastrophic. To mitigate the risk of disaster, more and more healthcare organizations are demanding very high availability targets backed by stringent service level agreements (SLAs) from their IT service providers.
This has driven the IT industry in general to create different availability solutions to address these needs. These solutions largely revolve around specific technology platforms. The challenge in achieving a higher degree of availability has been the application architecture, which creates an environment that is more complex to run and more expensive to deploy and maintain.
This blog post explores the technologies that deliver high-availability solutions that support the goal of continuous availability for critical healthcare applications and how to evaluate them. As a starting point, the IDC defines high availability in four different areas:
- Contains some fault tolerant components. In the event of a system failure, work simply stops.
- Typically found in warm stand-by clusters. When a system fails, user connection is impacted and the transaction is lost.
- The user connection will be maintained in the event of a failure. However, the transaction will be lost. This is commonly found on automatic failover implementations.
- The highest of all implementations, whereby failure is transparent to the user and results in no transaction loss. Such systems have 100% component replication, a feature of fault tolerant servers.
Availability levels are expressed as the percentage of time a system is available over a period, typically a year. For example, the maximum downtime per year at each level is:
99% = 87.6 hours
99.9% = 8.76 hours
99.99% = 52.6 minutes
99.999% = 5.26 minutes
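The arithmetic behind these figures can be checked with a short Python sketch. It assumes a 365-day year (8,760 hours), and the function name `annual_downtime_hours` is made up for illustration.

```python
HOURS_PER_YEAR = 365 * 24  # 8,760 hours; leap years are ignored


def annual_downtime_hours(availability_percent: float) -> float:
    """Return the yearly downtime allowed by an availability target."""
    if not 0 <= availability_percent <= 100:
        raise ValueError("availability must be between 0 and 100")
    return HOURS_PER_YEAR * (1 - availability_percent / 100)


# Print the downtime budget for the four targets listed above.
for target in (99.0, 99.9, 99.99, 99.999):
    hours = annual_downtime_hours(target)
    print(f"{target}% -> {hours:.2f} hours ({hours * 60:.1f} minutes)")
```

Each additional "nine" cuts the yearly downtime budget by a factor of ten, which is why the step from 99.9% to 99.999% is so demanding.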
A few minutes of downtime a year may not sound like a lot, but to put these figures into context, a 10-minute outage affecting an EMR or an operating room procedure can create significant dangers to care delivery and chaos not only in the OR but also throughout the hospital and business office. To drive availability to the highest level, fault-tolerant and clustered systems need to be enhanced to support critical hospital applications, but the real issue is the prevention of any downtime in the first place.
Continuous Availability - Defined
From the outside, a continuously available network closely resembles a conventional network. It consists of servers running applications, databases and networking software. The servers are linked with storage arrays by router-based networks running protocols over hard-wired and wireless connections. In a healthcare environment, interfaces with hospital systems pull in data from radiology, labs and other modalities located throughout the hospital.
The primary difference between a conventional and a continuously available solution is the approach to errors. Continuous availability’s focus is on error prevention. This differs from the prevailing attitude in healthcare networking, where the focus is on recovery from errors and failures.
A recovery-oriented solution assumes downtime, even if it’s only a few minutes during failover from one server to another.
Continuously available systems are built or retrofitted around redundancy and error detection that prevent failure in the first place. Detecting errors at all layers of the IT architecture is paramount to ensuring that error conditions and critical events are escalated in a timely fashion, so that corrective action can be taken before the client is impacted. The technology exists for all layers to be properly monitored and managed, but the entire system must be designed and configured for this purpose so that it scales across the entire network environment. This is called fault tolerant computing. This level of computing can be achieved through managed services by focusing on application architecture and tools, and by paying the right level of attention to detail when designing new solutions or retrofitting existing ones.
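As a rough illustration of escalating error conditions before they become outages, the Python sketch below classifies monitoring readings against warning and critical thresholds. The layer names, metrics and threshold values are hypothetical, and a real deployment would rely on a monitoring platform rather than hand-written checks.

```python
from dataclasses import dataclass


@dataclass
class Reading:
    layer: str          # e.g. "storage", "network", "application"
    metric: str         # hypothetical metric name
    value: float
    warn_at: float      # escalate early, before users are affected
    critical_at: float  # failure is imminent or already happening


def classify(reading: Reading) -> str:
    """Map a reading to "ok", "warning" or "critical"."""
    if reading.value >= reading.critical_at:
        return "critical"
    if reading.value >= reading.warn_at:
        return "warning"
    return "ok"


def escalations(readings):
    """Return (level, reading) pairs needing attention, critical first."""
    flagged = [(classify(r), r) for r in readings]
    flagged = [(level, r) for level, r in flagged if level != "ok"]
    # False sorts before True, so critical items come first.
    return sorted(flagged, key=lambda item: item[0] != "critical")


sample = [
    Reading("storage", "disk_used_pct", 82, warn_at=80, critical_at=95),
    Reading("network", "packet_loss_pct", 0.1, warn_at=1, critical_at=5),
    Reading("application", "queue_depth", 600, warn_at=200, critical_at=500),
]
for level, reading in escalations(sample):
    print(f"{level.upper()}: {reading.layer} {reading.metric}={reading.value}")
```

The warning tier is the part that supports prevention: it gives staff time to act on a trend in any layer before it turns into a failure.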
Problems often arise in hospitals where an attempt has been made to upgrade availability, often by using clustered servers. This additive and complex approach focuses on individual points of failure and does not provide a holistic solution to continuous availability. For example, a failure outside the scope of the cluster, such as in a network interface card or in an application running on the server, will still cause an outage.
Additive approaches, such as server clusters, aren’t going to solve those problems and neither will any other individual network element. Hardening a network means considering the hardware and software as a whole.
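One way to see why hardening a single element is not enough is to multiply the availability of every element a request depends on. The Python sketch below assumes the components fail independently and that all of them must be up at once; the figures are hypothetical and chosen only for illustration.

```python
from math import prod


def serial_availability(component_availabilities):
    """Availability of a chain in which every component must be up."""
    return prod(component_availabilities)


# Hypothetical figures for illustration only, not measurements.
chain = {
    "clustered servers": 0.9999,
    "network": 0.999,
    "storage": 0.999,
    "application": 0.995,
}

overall = serial_availability(chain.values())
print(f"End-to-end availability: {overall:.4%}")
```

Under these assumptions the end-to-end figure is lower than that of the weakest component, so improving the server cluster alone leaves the overall result almost unchanged.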
All potential points of failure between the two must be discovered and hardened against crashes. The components themselves must be of high quality, with built-in management and monitoring functions, so IT staff can anticipate failures and head them off. The HCI Group is exploring ways to automate the management of the application environment to ensure all aspects of the environment can be managed.
Application upgrades, patches and routine maintenance can cause slowdowns and crashes, as can simply plugging an unauthorized laptop into the network.
Hospitals and IT providers must factor together these elements – hardware, software and outside influences – in the design of a continuously available network.
High Availability Through Clustering
Clustering technology is considered to be the most widely used option for achieving higher server availability. A clustered solution is typically a configuration of two or more servers that interoperate with each other to increase availability. This is the typical healthcare vendor solution for high availability. For example, Epic and Cerner use a Linux-based solution to resolve primary server failures.
By contrast, continuously available hardware builds fault tolerance into the hardware itself, transparent to the solution and the user.
Challenges with Clustered Solutions
There are many benefits that come from implementing a clustered solution. Not only do you achieve higher levels of availability, but you can also achieve higher levels of scalability using tools such as load balancing. That said, there are many problems associated with clusters as well.
To start with, you need a support organization that understands clustering and the application environment in order to implement and maintain them. Epic and Cerner architectures are typically configured for active/passive failover. The HCI Group is designing solutions that focus on active/active configurations with replication.
With an application/database-centric approach, failover can be streamlined. Configuring clustering from an application and database layer provides the ability to enable a more concurrent environment. This reduces the time to transfer the production load between hardware platforms.
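The difference in transfer time can be pictured with a small routing sketch in Python. It models only the active/active idea that every healthy node already serves traffic, so losing one node needs no separate failover step. The class and node names are hypothetical, and data replication between nodes, which a real design depends on, is not modeled.

```python
from itertools import cycle


class ActiveActivePool:
    """Round-robin across every healthy node; all nodes serve traffic."""

    def __init__(self, nodes):
        self.healthy = {node: True for node in nodes}
        self._order = cycle(list(nodes))

    def mark(self, node, is_healthy):
        """Record the latest health-check result for a node."""
        self.healthy[node] = is_healthy

    def route(self):
        """Return the next healthy node, skipping failed ones."""
        for _ in range(len(self.healthy)):
            node = next(self._order)
            if self.healthy[node]:
                return node
        raise RuntimeError("no healthy nodes available")


pool = ActiveActivePool(["datacenter-a", "datacenter-b"])
print(pool.route(), pool.route())   # both sites take requests
pool.mark("datacenter-a", False)    # simulate a site failure
print(pool.route(), pool.route())   # traffic continues on the survivor
```

In an active/passive design the standby node would first have to be promoted and take over the production load, which is where the transfer time comes from.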
Overall, clusters bring a lot of complexity to healthcare IT, which often spells more cost initially, as well as over the lifetime of the solution. While clustering can deliver excellent levels of system availability, the collection of published data and the experience of many hospitals indicate that clustering results vary widely.
Creating a successful high-availability environment using clustering requires a combination of careful planning, best-of-breed hardware components, mission-critical level service contracts, and disciplined testing and change management processes on the part of the internal IT staff. Involvement from application vendors and knowledge of the architecture are critical to ensuring a successful deployment.
Fault Tolerant Computing
Fault tolerant solutions are critical to the improved availability of healthcare applications. Hospitals previously forced to compromise availability can now enjoy the benefits of continuous availability through fault tolerance.
Instead of the multiple-box approach used by clusters, fault tolerant technology looks to eliminate single points of failure by using replicated components that continue uninterrupted processing, even in the event of a component malfunction. This can involve the server hardware as well as the data center it is hosted in, moving towards multiple data centers to achieve the desired availability level.
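The effect of replication on availability can be estimated with a standard reliability formula: a replicated component is unavailable only when every replica is down at the same time. The Python sketch below assumes replicas fail independently and that switching between them is instantaneous, which real systems only approximate; the function name is made up for illustration.

```python
def replicated_availability(component_availability: float, replicas: int) -> float:
    """Availability of a component when any one replica can carry the load."""
    if replicas < 1:
        raise ValueError("at least one replica is required")
    if not 0 <= component_availability <= 1:
        raise ValueError("availability must be between 0 and 1")
    # The component is down only if all replicas are down together.
    return 1 - (1 - component_availability) ** replicas


# Hypothetical component that is available 99% of the time on its own.
for count in (1, 2, 3):
    print(f"{count} replica(s): {replicated_availability(0.99, count):.6f}")
```

The same reasoning applies to replicating across data centers, provided the sites do not share a common cause of failure.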
Fault-tolerant I/O is implemented using replicated PCI buses, replicated I/O adapters and replicated devices. Base configurations should include two independent PCI buses with additional buses that can be configured.
Which Way to Go to Obtain Continuous Availability?
Clustering technology can provide a highly available application environment if adequate time, effort and resources are devoted to proper planning, installation, configuration and operation of the clustering solution. The issue is that it leaves too many areas open to error and, with that, the risk of compromised availability.
Errors or shortcuts in selection and configuration of hardware, cluster software configuration and customizing, cluster testing, change management, support contracts, staff training and use of consulting services can all result in a clustering solution that falls far short of delivering the availability levels expected.
Enabling a cluster-aware application environment is paramount to the success of any continuously available application environment. Applications and their supporting infrastructure must be considered with any high-availability environment to ensure that any transfer of service can be accomplished with the least amount of interruption of application services.
The HCI Group can provide advanced recovery solutions for application suites like Epic, McKesson and Meditech via our Managed Services offering, with replicated hardware staged in two different data centers. Even that model requires a new approach to support clients into the future. The HCI Group is committed to the design and architecture of these solutions, and we will continue to research new ways to ensure availability for our healthcare clients.
Additional Managed Service Solutions Resources:
- Check out this blog post about Achieving Cost Savings with Healthcare IT Managed Services
- Read more about HCI's Infrastructure Management Services offerings