Why TempoIQ Moved Off HBase
When we decided to shift from time series storage to sensor analytics, we not only revamped our product, but we also revisited our backend architecture. One of the critical aspects we looked closely at was our storage layer. In our previous system, we used HBase for data storage. HBase is a distributed big data store in the Apache Hadoop family, a set of projects that provides tools for reliable, scalable, distributed computing.
After evaluating our options, we decided to build a proprietary data store instead of continuing with HBase. HBase didn't fit with our architecture goals, and we wanted to address some of the issues we'd seen writing, storing, and reading large-scale sensor data.
To help anyone else evaluating HBase for a sensor analytics application, we wanted to provide some detail behind the switch. Below, we cover the issues with isolation, predictable performance, recovery scenarios, and maintenance that led us to write our own storage engine. Some of these points are specific to our use case, but we wanted to give you a complete picture.
It’s important to note that these are all properties of the storage layer of TempoIQ. Even with our previous system, we built an analytics engine on top of HBase to analyze time series data.
As we re-evaluated HBase, one of the key tenets of our new architecture was data isolation for our customers. The need for isolation was two-fold: 1) to improve security around customer data, and 2) to eliminate the effects of noisy neighbors in a shared environment.
Originally, we had one monolithic HBase cluster that housed all of our customers' data. As with most SaaS applications, each customer's data was separated by application logic, but we were seeing issues when one customer wanted to read or write large amounts of data at once. Because all of our customers shared the same pool of resources, other customers began to see degraded performance. In cloud computing, this is referred to as the noisy neighbor problem. Noisy neighbors are not ideal for any cloud customer, but they are especially troublesome when they affect performance while accessing critical operational data.
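To make that concrete, here is a rough sketch of what application-level separation looks like in a shared cluster. The table name, column family, and row-key layout below are purely illustrative, not our actual schema; the point is that one tenant's heavy scan runs against the same servers as everyone else's traffic.

```java
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.ResultScanner;
import org.apache.hadoop.hbase.client.Scan;
import org.apache.hadoop.hbase.client.Table;
import org.apache.hadoop.hbase.util.Bytes;

// Hypothetical sketch of application-level tenant separation: one shared table,
// with a customer ID prefixed onto every row key. Names are illustrative.
public class TenantScan {
    public static void main(String[] args) throws Exception {
        try (Connection conn = ConnectionFactory.createConnection(HBaseConfiguration.create());
             Table table = conn.getTable(TableName.valueOf("sensor_data"))) {
            // One customer's bulk export: a large scan over their key prefix.
            Scan scan = new Scan().setRowPrefixFilter(Bytes.toBytes("customer42#"));
            long rows = 0;
            try (ResultScanner scanner = table.getScanner(scan)) {
                for (Result ignored : scanner) {
                    rows++; // ... export each row ...
                }
            }
            System.out.println("Exported " + rows + " rows");
            // Nothing in this layout isolates the scan from other tenants: it competes
            // for the same region servers, block cache, and disk I/O as every customer.
        }
    }
}
```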
We knew isolation would be the best way to solve this problem, and one of the options we considered was creating separate HBase clusters for each customer. That option was short-lived, however: spinning up a cluster per customer carries too much server overhead, and configuration management would have become a huge operational burden.
If we were going to create private environments for our customers that could scale up and down, we needed a tool that had less operational overhead.
As we came to a decision about our underlying data store, one major request from our clients was consistent performance. Many of them were running client-facing services on top of TempoIQ, and it was (and is) extremely important for those applications to be highly available and predictable.
The issue with delivering high availability and predictable performance was noticeable performance variability on reads stemming from design decisions in HBase. There are a few key processes that can cause spikes in read latency, specifically compaction and garbage collection.
The first process, compaction, is meant to optimize read operations. Essentially, compaction merges multiple store files (HFiles) in the underlying file system, HDFS, removing redundant entries and reducing the disk seeks needed during reads. In HBase, there are two types of compactions - minor compactions and major compactions. The type that can significantly affect performance is a major compaction.
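To give a sense of the knobs involved, one common mitigation (which reduces but doesn't eliminate the problem) is to disable time-based major compactions and trigger them yourself during off-peak hours. The property and API below are standard HBase; the table name is illustrative.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Admin;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;

// Sketch: take control of major compactions instead of letting HBase schedule them.
public class CompactionControl {
    public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create();
        // The interval between automatic major compactions; 0 disables time-based
        // major compactions. Region servers read this from hbase-site.xml on the
        // cluster; it is set here only to show the knob by name.
        conf.setLong("hbase.hregion.majorcompaction", 0L);

        try (Connection conn = ConnectionFactory.createConnection(conf);
             Admin admin = conn.getAdmin()) {
            // Trigger a major compaction explicitly, e.g. from a job that runs during
            // an off-peak window. "sensor_data" is an illustrative table name.
            admin.majorCompact(TableName.valueOf("sensor_data"));
        }
    }
}
```

Even with this in place, major compactions still have to run eventually, and reads against the affected regions still suffer while they do.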
Major compactions happen under a few different circumstances, but the consequences are always the same - a subset of reads suffers degraded performance while the operation is underway. This occurs because the underlying store files are being rewritten while the compaction runs. As a consequence, users see higher latency and, in extreme conditions, request timeouts.
Viewed through the lens of consistent performance, this is a huge problem. In a large system with evenly distributed keys, a major compaction is almost always underway somewhere in the cluster. If an end user loads their web or mobile application during the process, they can experience high latency.
Another process that can cause latency spikes is garbage collection. This one is slightly more involved, and the HBase community calls the scenario that causes the latency a Juliet Pause. In a Juliet Pause, there is a region server that occasionally performs long-running garbage collection (Juliet) and a master server (Romeo) overseeing Juliet and the rest of the cluster. In HBase, the region server (Juliet) is supposed to check in with the master (Romeo) at regular intervals to tell Romeo that it is still alive. If garbage collection runs too long, all of Juliet's processes are blocked and she cannot check in with Romeo. If Romeo does not receive this signal, he assumes that Juliet is dead and does something drastic: he performs a recovery action that assigns all of Juliet's data responsibilities to different nodes. When Juliet awakes (finishes garbage collection) and signals Romeo, she realizes that he has made a huge mistake and takes her own life (the region server shuts down).
In the case of a Juliet Pause, the data Juliet was responsible for is unreadable from the moment she stops responding until Romeo finishes reassigning her data to other servers. We'll cover this situation, also known as node failure, in the next section.
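How long Romeo waits before declaring Juliet dead is governed by the ZooKeeper session timeout, and it is a genuine trade-off rather than a tunable fix. A rough sketch, using the standard zookeeper.session.timeout property (the value shown is illustrative, and in practice this is set in hbase-site.xml on the cluster):

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;

// Sketch: the knob that controls how long the master waits before declaring a
// region server dead. Values are illustrative, not recommendations.
public class SessionTimeoutTradeoff {
    public static void main(String[] args) {
        Configuration conf = HBaseConfiguration.create();
        // A short timeout detects real crashes quickly but turns a long GC pause
        // into a Juliet Pause: the server is declared dead while it is merely paused.
        // A long timeout tolerates GC pauses but leaves data unreadable for longer
        // after a genuine crash. No setting avoids both failure modes.
        conf.setInt("zookeeper.session.timeout", 90_000); // 90 seconds, a common default
        System.out.println("zookeeper.session.timeout = "
                + conf.getInt("zookeeper.session.timeout", -1) + " ms");
    }
}
```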
Unlike the noisy neighbor problem, compactions and garbage collection make performance unpredictable for customers regardless of whether they are hosted in a shared or single-tenant environment. This posed additional problems for us as we re-architected our system to be consistently performant.
As we mentioned in the last section, the failure of any server causes some subset of data to be unreadable for a period of time. We realized quickly as we scaled that this wasn’t an ideal failure mode for our application.
When a server goes down, the data associated with it still exists in HDFS, but it needs to be reassigned in HBase before it can be read, written, and managed again. HBase does this in a three-step failure and recovery process. In the first step, HBase identifies that a node is down. In the Juliet Pause scenario, the node can't check in, so it is considered dead; in other scenarios, a region server may have actually crashed. In both cases, HBase doesn't know that a node is down until the node misses its check-in, which in some cases can take minutes.
In the second step, HBase replays the log of writes that weren't yet flushed to HFiles so that it can recover the data that was only in memory. To ensure the durability of newly written data, HBase stores each write in memory AND appends it to a write-ahead log (WAL) on disk; that log is what allows recoveries like this one (see the sketch after these steps).
In the third step, HBase reassigns the data responsibilities from the fallen server to other servers in the cluster that can handle the load.
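Step two works because of that write-ahead log. As a rough sketch of what it means for a single write (the table, column family, and row key below are illustrative), the standard Durability setting on a Put controls whether the write is in the log before the call returns:

```java
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.client.Durability;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.client.Table;
import org.apache.hadoop.hbase.util.Bytes;

// Sketch: every Put lands in the in-memory memstore AND the write-ahead log (WAL).
// The WAL is what lets HBase replay un-flushed writes after a region server dies.
public class WalWrite {
    public static void main(String[] args) throws Exception {
        try (Connection conn = ConnectionFactory.createConnection(HBaseConfiguration.create());
             Table table = conn.getTable(TableName.valueOf("sensor_data"))) {
            Put put = new Put(Bytes.toBytes("customer42#device7#20140301T120000"));
            put.addColumn(Bytes.toBytes("d"), Bytes.toBytes("temp"), Bytes.toBytes("21.5"));
            // SYNC_WAL (the usual behavior) means the write reaches the log before the
            // call returns; SKIP_WAL is faster but those writes are lost on a crash.
            put.setDurability(Durability.SYNC_WAL);
            table.put(put);
        }
    }
}
```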
This three-step process allows HBase to recover from a lost node without losing data, but during the first two steps the data is unreadable by client applications. For end users who access that data frequently, this shows up as latency spikes or failed requests.
In reality, larger clusters have nodes failing regularly, and even where node failure is infrequent, this behavior makes it difficult to deliver a highly available and consistently performant service. Some of these issues have improved in newer releases, but as you'll see in the next section, a large HBase cluster with high uptime requirements can't easily take advantage of those improvements.
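From the client's side, the most you can do during that window is bound how long requests hang. A sketch using HBase's standard client retry and timeout properties (the values are illustrative, not recommendations):

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;

// Sketch: bounding how long a client waits while a dead server's regions are
// still being reassigned. Values are illustrative.
public class ClientTimeouts {
    public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create();
        conf.setInt("hbase.client.retries.number", 3);         // attempts before giving up
        conf.setLong("hbase.client.pause", 100L);               // base back-off between retries, ms
        conf.setLong("hbase.client.operation.timeout", 5000L);  // total budget per operation, ms
        try (Connection conn = ConnectionFactory.createConnection(conf)) {
            // Reads against regions that are mid-reassignment now fail fast instead of
            // hanging, but they still fail: the data is simply unreadable until
            // recovery completes.
        }
    }
}
```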
One of the requirements we face when running a sensor analytics application is high uptime. This is extremely important not only from an end-user application perspective, but also from a data collection perspective. We've put a buffer in front of HBase to absorb many of the write issues we may run into, but when customers have tens of thousands of sensors writing data at second or minute intervals, the backlog grows large quickly. As a result, it takes longer for the system to "catch up" to real time, causing performance problems for our customers.
Alternatively, we could run extensive tests of each new release on a cloned cluster with similar size and properties before upgrading, but as a production cluster grows to tens of terabytes, this becomes an extremely expensive proposition. That prevents us from taking advantage of bleeding-edge changes that may implement better compaction algorithms or better recovery behavior, and we are left running our cluster on older, more stable versions of HBase.
All the effects we’ve described above - from compaction issues to garbage collection to node failure - can be minimized if you configure HBase correctly from the beginning. The reality is that it’s extremely difficult to predict the usage of a system one to two years into the future. We’ve been dealing with big data for over 10 years, and even we made a few mistakes along the way.
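One concrete example of an up-front decision that is hard to undo later is how you pre-split tables and lay out row keys. A rough sketch using the newer HBase Admin API (the table name, column family, and split points are illustrative):

```java
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Admin;
import org.apache.hadoop.hbase.client.ColumnFamilyDescriptorBuilder;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.client.TableDescriptorBuilder;
import org.apache.hadoop.hbase.util.Bytes;

// Sketch: pre-splitting a table at creation time. Getting region boundaries and
// key distribution wrong up front is the kind of decision that is hard to change
// later without significant operational work. Names and split points are illustrative.
public class PreSplitTable {
    public static void main(String[] args) throws Exception {
        try (Connection conn = ConnectionFactory.createConnection(HBaseConfiguration.create());
             Admin admin = conn.getAdmin()) {
            byte[][] splitPoints = {
                Bytes.toBytes("customer20#"),
                Bytes.toBytes("customer40#"),
                Bytes.toBytes("customer60#"),
                Bytes.toBytes("customer80#"),
            };
            admin.createTable(
                TableDescriptorBuilder.newBuilder(TableName.valueOf("sensor_data"))
                    .setColumnFamily(ColumnFamilyDescriptorBuilder.of("d"))
                    .build(),
                splitPoints);
        }
    }
}
```

The split points only work well if you guessed your key distribution correctly a year or two in advance, which is exactly the prediction problem described above.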
Additionally, it is worth noting that HBase continues its rapid development. There's no doubt that the project keeps improving its time to recovery and performance characteristics, but taking advantage of those improvements often requires running brand-new versions of HBase with less burn-in or, in the worst case, cherry-picking patches and compiling HBase yourself. Keeping up to date on such a fast-moving project and assessing the risk versus reward of rolling in each improvement is another full-time job, and can be costly for a company to manage.
That said, if conditions change drastically, you can correct some early configuration problems by taking downtime. But again, for a real-time system like a sensor analytics backend, this is extremely costly and in many cases not realistic.
HBase is a great project for a wide range of applications, but using it as the data store for a real-time sensor analytics backend posed challenges that took it out of the running for our long-term solution. Write-heavy patterns like ours caused issues in HBase, especially around isolation, variability in read latency, recovery, and maintenance.
While we ultimately settled on building our own distributed data store, we considered other options along the way. In future posts, we’ll cover why other open source solutions weren’t a good fit for our sensor analytics backend.