Fast data: The next step after big data
VoltDB软件架构师John Hugg认为：相较于传统在数据存储后再分析处理的简单工作机制，也许我们需要调整思路，利用类似Apache Kafka等工具在高速获取数据输入的同时实现同步处理分析。
The way that big data gets big is through a constant stream of incoming data. In high-volume environments, that data arrives at incredible rates, yet still needs to be analyzed and stored.
John Hugg, software architect at VoltDB, proposes that instead of simply storing that data to be analyzed later, perhaps we’ve reached the point where it can be analyzed as it’s ingested while still maintaining extremely high intake rates using tools such as Apache Kafka.
— Paul Venezia 保罗·威尼斯
Less than a dozen years ago, it was nearly impossible to imagine analyzing petabytes of historical data using commodity hardware. Today, Hadoop clusters built from thousands of nodes are almost commonplace. Open source technologies like Hadoop reimagined how to efficiently process petabytes upon petabytes of data using commodity and virtualized hardware, making this capability available cheaply to developers everywhere. As a result, the field of big data emerged.
A similar revolution is happening with so-called fast data. First, let’s define fast data. Big data is often created by data that is generated at incredible speeds, such as click-stream data, financial ticker data, log aggregation, or sensor data. Often these events occur thousands to tens of thousands of times per second. No wonder this type of data is commonly referred to as a “fire hose.”
When we talk about fire hoses in big data, we’re not measuring volume in the typical gigabytes, terabytes, and petabytes familiar to data warehouses. We’re measuring volume in terms of time: the number of megabytes per second, gigabytes per hour, or terabytes per day. We’re talking about velocity as well as volume, which gets at the core of the difference between big data and the data warehouse. Big data isn’t just big; it’s also fast.
The benefits of big data are lost if fresh, fast-moving data from the fire hose is dumped into HDFS, an analytic RDBMS, or even flat files, because the ability to act or alert right now, as things are happening, is lost. The fire hose represents active data, immediate status, or data with ongoing purpose. The data warehouse, by contrast, is a way of looking though historical data to understand the past and predict the future.
Acting on data as it arrives has been thought of as costly and impractical if not impossible, especially on commodity hardware. Just like the value in big data, the value in fast data is being unlocked with the reimagined implementation of message queues and streaming systems such as open source Kafka and Storm, and the reimagined implementation of databases with the introduction of open source NoSQL and NewSQL offerings.
Capturing value in fast data
The best way to capture the value of incoming data is to react to it the instant it arrives. If you are processing incoming data in batches, you’ve already lost time and, thus, the value of that data.
To process data arriving at tens of thousands to millions of events per second, you will need two technologies: First, a streaming system capable of delivering events as fast as they come in; and second, a data store capable of processing each item as fast as it arrives.
在过去几年当中，两种流系统Apache Storm与Apache Kafka获得了广泛认同。作为最初由Twitter团队发起的项目，Storm能够非常可靠地处理每秒高达百万级的数据流。另外由LinkedIn团队发起Kafka项目则是一套具备极高数据吞吐能力的分布式集群消息队列系统。这两大流系统方案为快数据处理提供了坚实的基础。然而Kafka的技术体系自成一家。
Delivering the fast data
Two popular streaming systems have emerged over the past few years: Apache Storm and Apache Kafka. Originally developed by the engineering team at Twitter, Storm can reliably process unbounded streams of data at rates of millions of messages per second. Kafka, developed by the engineering team at LinkedIn, is a high-throughput distributed message queue system. Both streaming systems address the need of processing fast data. Kafka, however, stands apart.
Kafka was designed to be a message queue and to solve the perceived problems of existing technologies. It’s sort of an über-queue with unlimited scalability, distributed deployments, multitenancy, and strong persistence. An organization could deploy one Kafka cluster to satisfy all of its message queueing needs. Still, at its core, Kafka delivers messages. It doesn’t support processing or querying of any kind.
Processing the fast data
Messaging is only part of a solution. Traditional relational databases tend to be limited in performance. Some may be able to store data at high rates, but fall over when they are expected to validate, enrich, or act on data as it is ingested. NoSQL systems have embraced clustering and high performance, but sacrifice much of the power and safety that traditional SQL-based systems offered. For basic fire hose processing, NoSQL solutions may satisfy your business needs. However, if you are executing complex queries and business logic operations per event, in-memory NewSQL solutions can satisfy your needs for both performance and transactional complexity.
Like Kafka, some NewSQL systems are built around shared-nothing clustering. Load is distributed among cluster nodes for performance. Data is replicated among cluster nodes for safety and availability. To handle increasing loads, nodes can be transparently added to the cluster. Nodes can be removed — or fail — and the rest of the cluster will continue to function. Both the database and the message queue are designed without single points of failure. These features are the hallmarks of systems designed for scale.
In addition, Kafka and some NewSQL systems have the ability to leverage clustering and dynamic topology to scale, without eschewing strong guarantees. Kafka provides message-ordering guarantees, while some in-memory processing engines provide serializable consistency and ACID semantics. Both systems use cluster-aware clients to deliver more features or to simplify configuration. Finally, both achieve redundant durability through disks on different machines, rather than RAID or other local storage schemes.
Big data plumbers toolkit
What do you look for in a system for processing the big data fire hose?
Look for a system with the redundancy and scalability benefits of native shared-nothing clustering.
Look for a system that leans on in-memory storage and processing to achieve high per-node throughput.
Look for a system that offers processing at ingestion time. Can the system perform conditional logic? Can it query gigabytes or more of existing state to inform decisions?
Look for a system that isolates operations and makes strong guarantees about its operations. This allows users to write simpler code and focus on business problems, rather than handling concurrency problems or data divergence. Beware of systems that offer strong consistency but at greatly reduced performance.
Systems with these properties are emerging from the NewSQL, NoSQL, and Hadoop communities, but different systems make different trade-offs, often based on their starting assumptions. For organizations that want to act on fast data in real time, these tools can remove much of the complexity involved in understanding data with velocity.
Kafka provides a safe and highly available way to move data between myriad producers and consumers, while offering performance and robustness to put admins at ease. An in-memory database can offer a full relational engine with powerful transactional logic, counting, and aggregation, all with enough scalability to meet any load. More than acting as a relational database, this system should serve as a processing engine complementary to Kafka’s messaging infrastructure.
Whatever your organization’s needs, it’s likely that some combination of these tools can help you do more faster and know more than you know today, often while replacing more fragile or disparate systems.
Fast data: The next step after big data —— was originally published at InfoWorld.com