Fast data: The next step after big data


The way that big data gets big is through a constant stream of incoming data. In high-volume environments, that data arrives at incredible rates, yet still needs to be analyzed and stored.
John Hugg, software architect at VoltDB, proposes that instead of simply storing that data to be analyzed later, perhaps we’ve reached the point where it can be analyzed as it’s ingested while still maintaining extremely high intake rates using tools such as Apache Kafka.

-- Paul Venezia

Less than a dozen years ago, it was nearly impossible to imagine analyzing petabytes of historical data using commodity hardware. Today, Hadoop clusters built from thousands of nodes are almost commonplace. Open source technologies like Hadoop reimagined how to efficiently process petabytes upon petabytes of data using commodity and virtualized hardware, making this capability available cheaply to developers everywhere. As a result, the field of big data emerged.

A similar revolution is happening with so-called fast data. First, let’s define fast data. Big data is often created by data that is generated at incredible speeds, such as click-stream data, financial ticker data, log aggregation, or sensor data. Often these events occur thousands to tens of thousands of times per second. No wonder this type of data is commonly referred to as a “fire hose.”

When we talk about fire hoses in big data, we’re not measuring volume in the typical gigabytes, terabytes, and petabytes familiar to data warehouses. We’re measuring volume in terms of time: the number of megabytes per second, gigabytes per hour, or terabytes per day. We’re talking about velocity as well as volume, which gets at the core of the difference between big data and the data warehouse. Big data isn’t just big; it’s also fast.
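To make the per-time units concrete, here is a minimal arithmetic sketch. The event rate and event size are illustrative assumptions, not figures from the article, and decimal units (1 MB = 10^6 bytes) are assumed throughout.

```python
# Hypothetical workload: both numbers below are illustrative assumptions.
EVENTS_PER_SECOND = 50_000
BYTES_PER_EVENT = 200


def volume_rates(events_per_second: int, bytes_per_event: int) -> dict:
    """Express a steady event stream as volume per unit of time (decimal units)."""
    bytes_per_second = events_per_second * bytes_per_event
    return {
        "MB_per_second": bytes_per_second / 1e6,
        "GB_per_hour": bytes_per_second * 3600 / 1e9,
        "TB_per_day": bytes_per_second * 86400 / 1e12,
    }


rates = volume_rates(EVENTS_PER_SECOND, BYTES_PER_EVENT)
for unit, value in rates.items():
    print(f"{unit}: {value:.3f}")
```

Under these assumptions, a stream of small events that looks modest per second still adds up to most of a terabyte per day.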

The benefits of big data are lost if fresh, fast-moving data from the fire hose is dumped into HDFS, an analytic RDBMS, or even flat files, because the ability to act or alert right now, as things are happening, is lost. The fire hose represents active data, immediate status, or data with ongoing purpose. The data warehouse, by contrast, is a way of looking through historical data to understand the past and predict the future.

Acting on data as it arrives has been thought of as costly and impractical if not impossible, especially on commodity hardware. Just like the value in big data, the value in fast data is being unlocked with the reimagined implementation of message queues and streaming systems such as open source Kafka and Storm, and the reimagined implementation of databases with the introduction of open source NoSQL and NewSQL offerings.

Capturing value in fast data
The best way to capture the value of incoming data is to react to it the instant it arrives. If you are processing incoming data in batches, you’ve already lost time and, thus, the value of that data.

To process data arriving at tens of thousands to millions of events per second, you will need two technologies: First, a streaming system capable of delivering events as fast as they come in; and second, a data store capable of processing each item as fast as it arrives.

Delivering the fast data
Two popular streaming systems have emerged over the past few years: Apache Storm and Apache Kafka. Originally developed by the engineering team at Twitter, Storm can reliably process unbounded streams of data at rates of millions of messages per second. Kafka, developed by the engineering team at LinkedIn, is a high-throughput distributed message queue system. Both streaming systems address the need of processing fast data. Kafka, however, stands apart.

Kafka was designed to be a message queue and to solve the perceived problems of existing technologies. It’s sort of an über-queue with unlimited scalability, distributed deployments, multitenancy, and strong persistence. An organization could deploy one Kafka cluster to satisfy all of its message queueing needs. Still, at its core, Kafka delivers messages. It doesn’t support processing or querying of any kind.
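The "delivers messages, nothing more" idea can be illustrated with a toy, in-memory partitioned log. This is a sketch for intuition only: the class `ToyPartitionedLog` and its methods are hypothetical names invented here, not Kafka's API, and the sketch leaves out persistence, replication, and networking entirely.

```python
import zlib
from collections import defaultdict


class ToyPartitionedLog:
    """Toy in-memory stand-in for a partitioned message queue (not Kafka's API)."""

    def __init__(self, num_partitions: int):
        self.num_partitions = num_partitions
        self.partitions = defaultdict(list)  # partition id -> append-only list

    def partition_for(self, key: str) -> int:
        # Stable hash, so the same key always maps to the same partition.
        return zlib.crc32(key.encode("utf-8")) % self.num_partitions

    def produce(self, key: str, value: str) -> tuple:
        # Append the message and report where it landed.
        pid = self.partition_for(key)
        self.partitions[pid].append((key, value))
        return pid, len(self.partitions[pid]) - 1

    def consume(self, partition: int, offset: int = 0) -> list:
        # Delivery only: messages come back in append order, unprocessed.
        return self.partitions[partition][offset:]


log = ToyPartitionedLog(num_partitions=4)
for key, value in [("sensor-a", "1"), ("sensor-b", "x"),
                   ("sensor-a", "2"), ("sensor-a", "3")]:
    log.produce(key, value)

pid = log.partition_for("sensor-a")
delivered = [v for k, v in log.consume(pid) if k == "sensor-a"]
print(delivered)
```

Any filtering, aggregation, or decision-making has to happen in whatever consumes the log, which is the gap the next section addresses.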

Processing the fast data
Messaging is only part of a solution. Traditional relational databases tend to be limited in performance. Some may be able to store data at high rates, but fall over when they are expected to validate, enrich, or act on data as it is ingested. NoSQL systems have embraced clustering and high performance, but sacrifice much of the power and safety that traditional SQL-based systems offered. For basic fire hose processing, NoSQL solutions may satisfy your business needs. However, if you are executing complex queries and business logic operations per event, in-memory NewSQL solutions can satisfy your needs for both performance and transactional complexity.

Like Kafka, some NewSQL systems are built around shared-nothing clustering. Load is distributed among cluster nodes for performance. Data is replicated among cluster nodes for safety and availability. To handle increasing loads, nodes can be transparently added to the cluster. Nodes can be removed, or fail, and the rest of the cluster will continue to function. Both the database and the message queue are designed without single points of failure. These features are the hallmarks of systems designed for scale.
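A toy model may help show why replication across shared-nothing nodes lets a cluster survive a node failure. The `ToyCluster` class below is a hypothetical illustration, not how Kafka or any particular NewSQL database is implemented; it omits adding nodes, rebalancing, and re-replication after a failure.

```python
import zlib


class ToyCluster:
    """Toy shared-nothing cluster: each node owns its own private dict."""

    def __init__(self, node_names, replicas=2):
        self.nodes = {name: {} for name in node_names}  # node -> private storage
        self.alive = set(node_names)
        self.replicas = replicas

    def owners(self, key):
        # Pick `replicas` consecutive nodes starting at a hashed position.
        names = sorted(self.nodes)
        start = zlib.crc32(key.encode("utf-8")) % len(names)
        return [names[(start + i) % len(names)] for i in range(self.replicas)]

    def put(self, key, value):
        # Write the value to every live replica that owns this key.
        for name in self.owners(key):
            if name in self.alive:
                self.nodes[name][key] = value

    def get(self, key):
        # Read from the first live replica that holds the key.
        for name in self.owners(key):
            if name in self.alive and key in self.nodes[name]:
                return self.nodes[name][key]
        raise KeyError(key)

    def fail(self, name):
        # Simulate a node crash; its data becomes unreachable.
        self.alive.discard(name)


cluster = ToyCluster(["n1", "n2", "n3"], replicas=2)
cluster.put("order-42", "shipped")
primary = cluster.owners("order-42")[0]
cluster.fail(primary)
print(cluster.get("order-42"))  # still served by the surviving replica
```

Because no node shares storage with another, losing one node removes only that node's copies, and the second replica keeps the key readable.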

In addition, Kafka and some NewSQL systems have the ability to leverage clustering and dynamic topology to scale, without eschewing strong guarantees. Kafka provides message-ordering guarantees, while some in-memory processing engines provide serializable consistency and ACID semantics. Both systems use cluster-aware clients to deliver more features or to simplify configuration. Finally, both achieve redundant durability through disks on different machines, rather than RAID or other local storage schemes.

Big data plumber’s toolkit
What do you look for in a system for processing the big data fire hose?

Look for a system with the redundancy and scalability benefits of native shared-nothing clustering.

Look for a system that leans on in-memory storage and processing to achieve high per-node throughput.

Look for a system that offers processing at ingestion time. Can the system perform conditional logic? Can it query gigabytes or more of existing state to inform decisions?

Look for a system that isolates operations and makes strong guarantees about its operations. This allows users to write simpler code and focus on business problems, rather than handling concurrency problems or data divergence. Beware of systems that offer strong consistency but at greatly reduced performance.
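The "processing at ingestion time" item in the checklist above can be sketched as a per-event handler that consults in-memory state before deciding what to do. The class name, the sliding-window rule, and the thresholds are all assumptions made for illustration; a real system would also need durability and concurrency control, which this single-threaded sketch ignores.

```python
from collections import defaultdict, deque


class IngestTimeProcessor:
    """Decide per event, at arrival, using in-memory state from earlier events."""

    def __init__(self, window_seconds: float, max_events: int):
        self.window_seconds = window_seconds
        self.max_events = max_events
        self.recent = defaultdict(deque)  # key -> timestamps in current window

    def on_event(self, key: str, timestamp: float) -> str:
        window = self.recent[key]
        # Query existing state: drop timestamps that fell out of the window.
        while window and timestamp - window[0] >= self.window_seconds:
            window.popleft()
        window.append(timestamp)
        # Conditional logic applied as the event arrives, not in a later batch.
        return "alert" if len(window) > self.max_events else "ok"


processor = IngestTimeProcessor(window_seconds=10.0, max_events=3)
decisions = [processor.on_event("card-1", t) for t in (0.0, 1.0, 2.0, 3.0, 20.0)]
print(decisions)  # ['ok', 'ok', 'ok', 'alert', 'ok']
```

The fourth event triggers the alert the moment it arrives, whereas a batch job over stored data would only surface it after the fact.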

Systems with these properties are emerging from the NewSQL, NoSQL, and Hadoop communities, but different systems make different trade-offs, often based on their starting assumptions. For organizations that want to act on fast data in real time, these tools can remove much of the complexity involved in understanding data with velocity.

Kafka provides a safe and highly available way to move data between myriad producers and consumers, while offering performance and robustness to put admins at ease. An in-memory database can offer a full relational engine with powerful transactional logic, counting, and aggregation, all with enough scalability to meet any load. More than acting as a relational database, this system should serve as a processing engine complementary to Kafka’s messaging infrastructure.

Whatever your organization’s needs, it’s likely that some combination of these tools can help you do more, faster, and know more than you know today, often while replacing more fragile or disparate systems.

Fast data: The next step after big data was originally published at InfoWorld.com

When reposting, please credit: DeepData (深数据) » Fast data: The next step after big data
