ACM Data Mining: 1) Hadoop - Distributed Data Processing 2) Facebook’s Petabyte Scale Data Warehouse Using Hive and Hadoop
January 25, 2010 at 6:30 PM - 8:30 PM
LinkedIn, Mountain View
NEW MEETING DATE & LOCATION!
TITLE 1: “Hadoop: Distributed Data Processing”
Hadoop is an open-source distributed platform designed to economically store and process data using clustered commodity hardware. Hadoop is Apache’s implementation of the MapReduce/GFS frameworks popularized by Google. In this talk we will demystify this powerful platform, and describe how it enables you to consolidate many different data storage and processing needs in an economically scalable cloud resource.
SPEAKER BIOGRAPHY
Dr. Amr Awadallah is Chief Technical Officer and Founder of Cloudera, Inc. Before Cloudera, he was vice president of product intelligence engineering at Yahoo! Inc., where he had worked since June 2000, after Yahoo! acquired his first startup (VivaSmart). Dr. Awadallah received his PhD from Stanford University in 2007 and his BS and MS degrees from Cairo University in 1992 and 1995, respectively.
TITLE 2: “Facebook’s Petabyte Scale Data Warehouse Using Hive and Hadoop”
Hive is an open-source, petabyte-scale data warehousing framework built on top of Hadoop that enables scalable analytics on large data sets using SQL and some language extensions. Scalable analysis on large data sets has been core to the functions of a number of teams at Facebook, both engineering and non-engineering. This talk will highlight how Hive and Hadoop allow us at Facebook to offer a cheap, scalable and flexible infrastructure for different kinds of analysis. We will talk about the architecture, applications and capabilities of this infrastructure, which handles close to 8,000 jobs a day and stores nearly 2.5 PB of compressed data.
SPEAKER BIOGRAPHY
Ashish Thusoo has been with Facebook for the last couple of years and currently manages the Facebook data infrastructure team. He started the Hive project at Facebook along with Joydeep and serves as the project lead for Hive at Apache.
Event Owner: Greg Makowski (Director of Risk Analytics and Policy at CashEdge)