CASCADING

Cascading is a thin Java library that sits on top of Hadoop's MapReduce layer. It is not a new text-based query syntax (like Pig) or another complex system that must be installed on a cluster and maintained (like Hive). That said, Cascading is both complementary to and a valid alternative to either application.

The love of writing useful software and helping other developers innovate has been the primary motivator for creating Cascading. Support this project by using it, tweeting about it, blogging about it, releasing extensions, and possibly buying support for you and your team.

Cascading provides a means for defining arbitrarily complex, reusable, and fault-tolerant data processing workflows, and a job planner for rendering those workflows into cluster-executable jobs.

Cascading allows the developer to assemble predefined workflow tasks and tools, collect those workflows into a logical 'unit of work', and efficiently schedule and execute them. These processes can scale horizontally across clusters running in the local datacenter or on Amazon EC2.

Cascading currently relies on Hadoop to provide the storage and execution infrastructure. But the Cascading API insulates developers from the particulars of Hadoop, offering opportunities for Cascading to target different compute frameworks in the future without changes to the original processing workflow definitions.

Those familiar with Hadoop know it is an implementation of the MapReduce programming model. And any developer who has built an application using MapReduce to solve 'real world' problems knows such applications can get complex very quickly. This is further aggravated by the need to 'think' in MapReduce throughout application development.

Thinking in MapReduce is typically unnatural, and it tends to push the developer to constantly 'optimize' the application. This results in harder-to-read code and, likely, more bugs. Further, most real-world problems are a collection of dependent MapReduce jobs; building them all and orchestrating them by hand does not scale well.

Cascading uses a 'pipes and filters' model for defining data processes. It efficiently supports splits, joins, grouping, and sorting, and these are the only processing concepts the developer needs to think in.
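
To make the model concrete, here is a minimal word-count sketch written against the classic Cascading 1.x Java API. The class names and signatures reflect that API generation, and the regex and command-line paths are illustrative assumptions, so treat this as a shape rather than a drop-in program.

    import java.util.Properties;

    import cascading.flow.Flow;
    import cascading.flow.FlowConnector;
    import cascading.operation.aggregator.Count;
    import cascading.operation.regex.RegexGenerator;
    import cascading.pipe.Each;
    import cascading.pipe.Every;
    import cascading.pipe.GroupBy;
    import cascading.pipe.Pipe;
    import cascading.scheme.TextLine;
    import cascading.tap.Hfs;
    import cascading.tap.SinkMode;
    import cascading.tap.Tap;
    import cascading.tuple.Fields;

    public class WordCount {
        public static void main(String[] args) {
            // taps bind the assembly to concrete storage: read raw lines
            // from one HDFS path, write (word, count) lines to another
            Tap source = new Hfs(new TextLine(new Fields("line")), args[0]);
            Tap sink = new Hfs(new TextLine(), args[1], SinkMode.REPLACE);

            // filter: split each incoming line into one tuple per word
            // (the "\\S+" word pattern here is an illustrative assumption)
            Pipe assembly = new Pipe("wordcount");
            assembly = new Each(assembly, new Fields("line"),
                new RegexGenerator(new Fields("word"), "\\S+"));

            // group by word, then count the tuples in each group
            assembly = new GroupBy(assembly, new Fields("word"));
            assembly = new Every(assembly, new Count(new Fields("count")));

            // the planner renders this assembly into MapReduce jobs and
            // wires their intermediate files together
            Flow flow = new FlowConnector(new Properties()).connect(source, sink, assembly);
            flow.complete();
        }
    }

Note that the assembly itself says nothing about map or reduce phases; the developer only describes the pipes, and the planner decides where the job boundaries fall.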

At runtime, Cascading generates the minimum necessary number of MapReduce jobs and executes them in the correct order, either locally or on a Hadoop cluster. Any intermediate files are automatically cleaned up, and if target files already exist and are not stale, the jobs that produce them can optionally be skipped.
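
As a sketch of how flows are grouped and scheduled, the snippet below (again assuming the 1.x API, with hypothetical flow names) connects two flows into a Cascade; Cascading orders them by their tap dependencies and can skip flows whose outputs are already up to date.

    import cascading.cascade.Cascade;
    import cascading.cascade.CascadeConnector;
    import cascading.flow.Flow;

    public class RunCascade {
        // cleanseFlow and reportFlow are hypothetical flows built as in the
        // word-count sketch, with reportFlow reading the tap cleanseFlow writes
        public static void run(Flow cleanseFlow, Flow reportFlow) {
            // the connector orders the flows by their tap dependencies
            Cascade cascade = new CascadeConnector().connect(cleanseFlow, reportFlow);

            // flows whose sinks were created with SinkMode.KEEP are skipped
            // when their output already exists and is not stale
            cascade.complete();
        }
    }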

We firmly believe applications should be built rapidly and designed to be as 'loosely coupled' as possible. Only once an application is working and there are sufficient tests should it be optimized to remove any clear bottlenecks. Cascading supports this philosophy.

Cascading is also well suited to 'ad-hoc' applications and scripts that extract data from a Hadoop filesystem, import data from various remote data sources, or simply let a user poke around in various files and datasets.

Developers may also reuse existing Hadoop MapReduce jobs with Cascading, allowing those jobs to run alongside the MapReduce jobs Cascading generates on the cluster.
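
One way that reuse might look, assuming the 1.x MapReduceFlow wrapper and hypothetical flow names (the raw JobConf setup is the usual Hadoop boilerplate and is taken as given):

    import org.apache.hadoop.mapred.JobConf;

    import cascading.cascade.Cascade;
    import cascading.cascade.CascadeConnector;
    import cascading.flow.Flow;
    import cascading.flow.MapReduceFlow;

    public class ReuseLegacyJob {
        // downstreamFlow is a hypothetical Cascading flow that consumes the
        // output path of the raw MapReduce job
        public static void run(JobConf legacyJobConf, Flow downstreamFlow) {
            // wrap the preexisting, fully configured raw MapReduce job so
            // the planner can treat it like any other flow
            Flow legacyFlow = new MapReduceFlow("legacy-job", legacyJobConf);

            // schedule the wrapped job alongside Cascading-generated jobs
            Cascade cascade = new CascadeConnector().connect(legacyFlow, downstreamFlow);
            cascade.complete();
        }
    }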

Source:

http://www.cascading.org
