5 Alternatives to the Traditional Relational Database
Hadoop is a popular way of managing Big Data – but it is not the only one. Here are five alternatives to the relational database that may belong in your optimal information infrastructure.
The ongoing popularity of the Hadoop approach to data management should not blind us to the opportunities offered by other database technologies – technologies that, when used in the proper mix, can deliver adequate scaling to meet the demands of the endless 50 percent-plus yearly growth in data to be analyzed, along with higher data quality for better business decisions. The new(er) technologies basically can be categorized as follows:
- In-memory databases
- Hadoop and NoSQL
- Virtualized or “federated” databases
- Columnar databases
- Streaming databases
Let’s consider briefly the pros and cons of each, and where each might fit in an optimized information infrastructure.
The core idea of an in-memory database is to assume that most or all storage is “flat” — that is, any piece of data takes about as long to access as any other. Without the endless software to deal with disks and tape, such a database can perform one or two orders of magnitude faster than a traditional relational database, for problems like analyzing stock-market data on the fly.
Moreover, the boundary between “flat” and disk storage is moving upwards rapidly. For example, IBM has just announced plans to deliver mainframes this year with more than a terabyte of “flat” memory per machine. As a result, an increasing percentage of business analytics and other needs can be handled by in-memory technology.
The major database vendors have in-memory technology of their own (e.g., Oracle TimesTen) and offer it as either a low-end solution or lashed together with their relational databases so that users can semi-manually assign the right transaction stream to the right database. An interesting exception is SAP HANA, which is “in-memory-centric,” focusing on in-memory-type transaction schemes and leaving users to link HANA with their relational databases if they wish.
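To make the in-memory idea concrete, here is a minimal sketch in Python. It uses SQLite's `:memory:` mode purely as a stand-in, since SQLite is not one of the products discussed here, and the table name and stock ticks are invented for illustration. The point is only that the whole database lives in RAM, so a query never waits on disk.

```python
import sqlite3

# ":memory:" asks SQLite to keep the entire database in RAM (no disk file).
# SQLite is a stand-in here, not one of the products named in the article.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE ticks (symbol TEXT, price REAL)")

# Hypothetical stock ticks, invented for illustration.
ticks = [("ACME", 10.0), ("ACME", 12.0), ("INIT", 5.0), ("INIT", 7.0)]
conn.executemany("INSERT INTO ticks VALUES (?, ?)", ticks)

# Analyze the data "on the fly": average price per symbol.
avg_by_symbol = dict(
    conn.execute("SELECT symbol, AVG(price) FROM ticks GROUP BY symbol")
)
print(avg_by_symbol)
conn.close()
```

The trade-off is the usual one for this category: the data vanishes when the connection closes unless it is copied to durable storage.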
Hadoop grew out of, and specializes in, the need to apply looser standards to data management (e.g., allowing temporary inconsistency between data copies or related data) in order to maximize throughput. Thus, NoSQL does not mean “no relational allowed” but rather “where relational simply can’t scale adequately.” Over time, this approach has standardized on storing massive amounts of Web/cloud data as files, handled by the Hadoop data-access software. In turn, “enterprise Hadoop” has shown up inside large enterprises as a way of downloading and processing key customer and other social media data in-house.
Several vendors also attempt to encourage such downloads rather than providing on-cloud processing as part of their solutions. Optimum Hadoop solutions tend to be (a) good at allocating tasks between Hadoop and the user’s relational database, and (b) good at minimizing data transport.
The fundamental idea of the virtualized database as offered by vendors such as Composite Software (now owned by Cisco) and Denodo is to provide a “veneer” that looks like a database and allows common SQL-like access to widely disparate data sources (e.g., text/content, video/graphic, relational, or email/texting). Over time, this aim has come pretty close to complete reality, as virtualized databases now offer administration, one-interface development and, of course, dynamically evolving support for most if not all of today’s new data types.
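The “veneer” idea can be sketched in a few lines of Python. This is a toy under stated assumptions, not how Composite or Denodo work internally: the `VirtualDatabase` class, its methods, and the sample customer and ticket data are all hypothetical. It shows one query interface reaching a relational table and a CSV export while both stay where they are.

```python
import csv
import io
import sqlite3

# Source 1: a relational table (in-memory SQLite as a stand-in).
db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE customers (id INTEGER, name TEXT)")
db.executemany("INSERT INTO customers VALUES (?, ?)", [(1, "Ann"), (2, "Bob")])

# Source 2: a CSV export (hypothetical support-ticket data).
TICKETS_CSV = (
    "customer_id,subject\n"
    "1,Login fails\n"
    "1,Billing question\n"
    "2,Slow report\n"
)

def relational_rows():
    """Read rows from the relational source at query time."""
    for cid, name in db.execute("SELECT id, name FROM customers ORDER BY id"):
        yield {"id": cid, "name": name}

def csv_rows():
    """Read rows from the CSV source at query time."""
    for row in csv.DictReader(io.StringIO(TICKETS_CSV)):
        yield {"customer_id": int(row["customer_id"]), "subject": row["subject"]}

class VirtualDatabase:
    """Hypothetical facade: one interface over sources left in place."""

    def __init__(self):
        self._sources = {}

    def register(self, name, row_fn):
        self._sources[name] = row_fn

    def join(self, left, right, left_key, right_key):
        # Build a lookup on the right-hand source, then stream the left.
        index = {}
        for rrow in self._sources[right]():
            index.setdefault(rrow[right_key], []).append(rrow)
        return [
            {**lrow, **rrow}
            for lrow in self._sources[left]()
            for rrow in index.get(lrow[left_key], [])
        ]

vdb = VirtualDatabase()
vdb.register("customers", relational_rows)
vdb.register("tickets", csv_rows)

# One query spanning both sources: every ticket subject filed by Ann.
ann_tickets = [
    row["subject"]
    for row in vdb.join("customers", "tickets", "id", "customer_id")
    if row["name"] == "Ann"
]
print(ann_tickets)
```

A real product would add a SQL parser, a query optimizer and a metadata catalog on top of this pattern; the sketch keeps only the part where callers never need to know which source a row came from.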
One key underestimated feature of virtualized databases is support for data discovery and global metadata repositories. This means that users can now get a much better picture of the range of data that’s in-house than data warehouses ever gave them, plus support for data quality initiatives such as data governance.
And, of course, virtualized databases’ performance optimization has led to excellent “touch first, move if necessary” data processing — eliminating a key problem with Hadoop use. In fact, while virtualized databases will always be used together with other database technologies, it is now hard to find a large enterprise that would not benefit from a virtualized database.
One of the technologies that was unnecessarily discarded when row-oriented relational databases came along is the columnar database. The basic idea is that instead of storing each field in a data store only once, as the relational database does to save storage and hence enhance performance, the columnar database emphasizes storing each value in a field/column only once, and in the smallest possible area. Thus, indexing is columnar technology applied to yes/no-value fields.
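One common way to store each value in a column only once is dictionary encoding, sketched below in Python. Naming that technique is an assumption on our part, since the description above does not specify one, and the sample rows and the `encode_column` helper are invented for illustration.

```python
# Hypothetical row-oriented records, invented for illustration.
rows = [
    {"id": 1, "state": "CA", "status": "active"},
    {"id": 2, "state": "NY", "status": "active"},
    {"id": 3, "state": "CA", "status": "closed"},
    {"id": 4, "state": "CA", "status": "active"},
]

def encode_column(values):
    """Dictionary-encode one column.

    Returns the list of distinct values (each stored once) and a list of
    small integer codes, one per row, pointing into that list.
    """
    dictionary = []
    code_of = {}
    codes = []
    for value in values:
        if value not in code_of:
            code_of[value] = len(dictionary)
            dictionary.append(value)
        codes.append(code_of[value])
    return dictionary, codes

state_dict, state_codes = encode_column(row["state"] for row in rows)
print(state_dict, state_codes)

# A query on one column touches only that column's codes,
# never the other fields of each row.
ca_code = state_dict.index("CA")
ca_count = sum(1 for code in state_codes if code == ca_code)
print(ca_count)
```

With only two distinct states, each row's entry shrinks to a code that fits in a single bit, which is where the “smallest possible area” comes from.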
Relational databases are superior, as of now, for OLTP (online transaction processing) streams involving massive amounts of data updates, as well as queries involving two or fewer fields. However, columnar technology is moving ahead faster now than relational, so we are beginning to see situations in which columnar technology is the most appropriate technology for a query-only data warehouse.
Oracle, IBM and SAP (including both HANA and Sybase IQ) now offer columnar technology, and newer entrants such as EMC (Greenplum) allow users to choose columnar processing. Both IBM and SAP HANA offer integrated in-memory and columnar technology. In fact, IBM's BLU Acceleration (available with IBM DB2) goes one step further and parallelizes transactional operations at the level of the microprocessor core, leading to performance improvements of more than two orders of magnitude in some cases.
As the name implies, streaming databases treat data as a single stream passing under the “head” of the database engine, which must make an immediate decision whether to store it, process it, use it to generate an alert and/or re-route it to some other appropriate data source. Thus, the streaming database is often used as a “rapid-response” approach, unable to bring to bear the full context of a data warehouse in analyzing the next bit of information but far quicker to note obvious important and time-critical data.
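The per-event decision described above can be sketched as a single pass over the stream. This Python sketch is illustrative only: the event fields, the `ALERT_THRESHOLD` value and the three-item window are assumptions, and no real streaming product's API is implied.

```python
from collections import deque

ALERT_THRESHOLD = 100.0  # hypothetical limit that triggers an alert
WINDOW = 3               # how many recent readings the engine remembers

def process_stream(events):
    """Decide, one event at a time, whether to re-route, alert, or store."""
    recent = deque(maxlen=WINDOW)  # the only context the engine keeps
    stored, alerts, rerouted = [], [], []
    for event in events:
        if event["kind"] != "reading":
            rerouted.append(event)  # not ours: hand off to another system
            continue
        recent.append(event["value"])
        if event["value"] > ALERT_THRESHOLD:
            alerts.append(event)    # time-critical: flag immediately
        # Basic "front line" analytics: a rolling average over the window.
        window_avg = sum(recent) / len(recent)
        stored.append({"ts": event["ts"], "avg": window_avg})
    return stored, alerts, rerouted

# Hypothetical events, invented for illustration.
events = [
    {"ts": 1, "kind": "reading", "value": 90.0},
    {"ts": 2, "kind": "log", "value": None},
    {"ts": 3, "kind": "reading", "value": 120.0},
    {"ts": 4, "kind": "reading", "value": 60.0},
]
stored, alerts, rerouted = process_stream(events)
print(stored, alerts, rerouted)
```

The small fixed-size window is the design point: the engine responds quickly because it never consults the full history that a data warehouse would bring to bear.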
Over time, the amount of content stored by streaming databases has grown rapidly, so that basic analytics can be performed “on the front line,” as it were. Still, streaming databases are not viewed as a substitute for any other database technology, but rather as a kind of “intelligent cache” taking some of the load off the back-end database. They are proving quite useful in seeding near real-time business-executive dashboards. Products range from IBM Streams for large enterprises to Software AG Apama, whose “sweet spot” is the needs of smaller firms.
Bottom Line: If Possible, All of the Above
The message of the above analysis is that those who believe that they are optimizing their information infrastructure by choosing relational plus one of the other technologies, or just by handing the job over to the public cloud, are often mistaken. Adding the rest of these in one way or another often leads to order-of-magnitude improvements in analytic deep-dive scaling, not to mention major improvements in the quality and speed of delivery of the data on which business decisions depend.
An optimized information infrastructure means that in many if not most cases, all five of the above technologies will play a role – but not one that goes beyond each technology’s strengths. Finally, these conclusions should remain applicable for the next two to three years, as there seem to be no clear game-changing technologies presently entering the market.
Neither in-house relational nor public-cloud Hadoop is a cure-all. This is one of the rare occasions where “all of the above” is, in many situations, actually the best answer.