Hadoop: The Definitive Guide, Second Edition (Reprint Edition)
Tom White
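Publication date: June 2011
Pages: 600
Discover how Apache Hadoop can unleash the power of your data. This comprehensive book shows how to build and maintain reliable, scalable, distributed systems with the Hadoop framework, an open source implementation of MapReduce, the algorithm on which Google built its empire. Programmers will learn how to analyze very large datasets, and administrators will learn how to set up and run Hadoop clusters.
This revised edition covers recent changes to Hadoop, including newer additions such as Hive, Sqoop, and Avro. It also provides case studies that show how Hadoop is used to solve specific problems. Looking to get the most out of your data? This is your book.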

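·Use the Hadoop Distributed File System (HDFS) to store very large datasets, and run distributed computations over those datasets with MapReduce
·Become familiar with Hadoop's data and I/O building blocks for compression, data integrity, serialization, and persistence
·Discover common pitfalls and advanced features for writing real-world MapReduce programs
·Design, build, and administer a dedicated Hadoop cluster, or run Hadoop in the cloud
·Use Pig, a high-level query language, for large-scale data processing
·Analyze datasets with Hive, Hadoop's data warehousing system
·Use HBase, Hadoop's database, to handle structured and semi-structured data
·Learn ZooKeeper, a toolkit of coordination primitives for building distributed systems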

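"Congratulations on this opportunity to learn about Hadoop from a master, and to enjoy not only the technology itself but also the author's good sense and plain style."
-- Doug Cutting
Cloudera
Tom White has been an Apache Hadoop committer since 2007. He is a member of the Apache Software Foundation and an engineer at Cloudera. Tom has written for oreilly.com, java.net, and IBM's developerWorks, and speaks at industry conferences.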

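Cloudera is a leading provider of Hadoop-based software and services. Cloudera's Distribution for Hadoop (CDH) is a comprehensive data management platform based on Apache Hadoop. Cloudera provides the tools, platform, and support needed to use Hadoop.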

For readers with programming experience
  Foreword
  Preface
  1. Meet Hadoop
    Data!
    Data Storage and Analysis
    Comparison with Other Systems
    RDBMS
    Grid Computing
    Volunteer Computing
    A Brief History of Hadoop
    Apache Hadoop and the Hadoop Ecosystem
  2. MapReduce
    A Weather Dataset
    Data Format
    Analyzing the Data with Unix Tools
    Analyzing the Data with Hadoop
    Map and Reduce
    Java MapReduce
    Scaling Out
    Data Flow
    Combiner Functions
    Running a Distributed MapReduce Job
    Hadoop Streaming
    Ruby
    Python
    Hadoop Pipes
    Compiling and Running
  3. The Hadoop Distributed Filesystem
    The Design of HDFS
    HDFS Concepts
    Blocks
    Namenodes and Datanodes
    The Command-Line Interface
    Basic Filesystem Operations
    Hadoop Filesystems
    Interfaces
    The Java Interface
    Reading Data from a Hadoop URL
    Reading Data Using the FileSystem API
    Writing Data
    Directories
    Querying the Filesystem
    Deleting Data
    Data Flow
    Anatomy of a File Read
    Anatomy of a File Write
    Coherency Model
    Parallel Copying with distcp
    Keeping an HDFS Cluster Balanced
    Hadoop Archives
    Using Hadoop Archives
    Limitations
  4. Hadoop I/O
    Data Integrity
    Data Integrity in HDFS
    LocalFileSystem
    ChecksumFileSystem
    Compression
    Codecs
    Compression and Input Splits
    Using Compression in MapReduce
    Serialization
    The Writable Interface
    Writable Classes
    Implementing a Custom Writable
    Serialization Frameworks
    Avro
    File-Based Data Structures
    SequenceFile
    MapFile
  5. Developing a MapReduce Application
    The Configuration API
    Combining Resources
    Variable Expansion
    Configuring the Development Environment
    Managing Configuration
    GenericOptionsParser, Tool, and ToolRunner
    Writing a Unit Test
    Mapper
    Reducer
    Running Locally on Test Data
    Running a Job in a Local Job Runner
    Testing the Driver
    Running on a Cluster
    Packaging
    Launching a Job
    The MapReduce Web UI
    Retrieving the Results
    Debugging a Job
    Using a Remote Debugger
    Tuning a Job
    Profiling Tasks
    MapReduce Workflows
    Decomposing a Problem into MapReduce Jobs
    Running Dependent Jobs
  6. How MapReduce Works
    Anatomy of a MapReduce Job Run
    Job Submission
    Job Initialization
    Task Assignment
    Task Execution
    Progress and Status Updates
    Job Completion
    Failures
    Task Failure
    Tasktracker Failure
    Jobtracker Failure
    Job Scheduling
    The Fair Scheduler
    The Capacity Scheduler
    Shuffle and Sort
    The Map Side
    The Reduce Side
    Configuration Tuning
    Task Execution
    Speculative Execution
    Task JVM Reuse
    Skipping Bad Records
    The Task Execution Environment
  7. MapReduce Types and Formats
    MapReduce Types
    The Default MapReduce Job
    Input Formats
    Input Splits and Records
    Text Input
    Binary Input
    Multiple Inputs
    Database Input (and Output)
    Output Formats
    Text Output
    Binary Output
    Multiple Outputs
    Lazy Output
    Database Output
  8. MapReduce Features
    Counters
    Built-in Counters
    User-Defined Java Counters
    User-Defined Streaming Counters
    Sorting
    Preparation
    Partial Sort
    Total Sort
    Secondary Sort
    Joins
    Map-Side Joins
    Reduce-Side Joins
    Side Data Distribution
    Using the Job Configuration
    Distributed Cache
    MapReduce Library Classes
  9. Setting Up a Hadoop Cluster
    Cluster Specification
    Network Topology
    Cluster Setup and Installation
    Installing Java
    Creating a Hadoop User
    Installing Hadoop
    Testing the Installation
    SSH Configuration
    Hadoop Configuration
    Configuration Management
    Environment Settings
    Important Hadoop Daemon Properties
    Hadoop Daemon Addresses and Ports
    Other Hadoop Properties
    User Account Creation
    Security
    Kerberos and Hadoop
    Delegation Tokens
    Other Security Enhancements
    Benchmarking a Hadoop Cluster
    Hadoop Benchmarks
    User Jobs
    Hadoop in the Cloud
    Hadoop on Amazon EC2
  10. Administering Hadoop
    HDFS
    Persistent Data Structures
    Safe Mode
    Audit Logging
    Tools
    Monitoring
    Logging
    Metrics
    Java Management Extensions
    Maintenance
    Routine Administration Procedures
    Commissioning and Decommissioning Nodes
    Upgrades
  11. Pig
    Installing and Running Pig
    Execution Types
    Running Pig Programs
    Grunt
    Pig Latin Editors
    An Example
    Generating Examples
    Comparison with Databases
    Pig Latin
    Structure
    Statements
    Expressions
    Types
    Schemas
    Functions
    User-Defined Functions
    A Filter UDF
    An Eval UDF
    A Load UDF
    Data Processing Operators
    Loading and Storing Data
    Filtering Data
    Grouping and Joining Data
    Sorting Data
    Combining and Splitting Data
    Pig in Practice
    Parallelism
    Parameter Substitution
  12. Hive
    Installing Hive
    The Hive Shell
    An Example
    Running Hive
    Configuring Hive
    Hive Services
    The Metastore
    Comparison with Traditional Databases
    Schema on Read Versus Schema on Write
    Updates, Transactions, and Indexes
    HiveQL
    Data Types
    Operators and Functions
    Tables
    Managed Tables and External Tables
    Partitions and Buckets
    Storage Formats
    Importing Data
    Altering Tables
    Dropping Tables
    Querying Data
    Sorting and Aggregating
    MapReduce Scripts
    Joins
    Subqueries
    Views
    User-Defined Functions
    Writing a UDF
    Writing a UDAF
  13. HBase
    HBasics
    Backdrop
    Concepts
    Whirlwind Tour of the Data Model
    Implementation
    Installation
    Test Drive
    Clients
    Java
    Avro, REST, and Thrift
    Example
    Schemas
    Loading Data
    Web Queries
    HBase Versus RDBMS
    Successful Service
    HBase
    Use Case: HBase at Streamy.com
    Praxis
    Versions
    HDFS
    UI
    Metrics
    Schema Design
    Counters
    Bulk Load
  14. ZooKeeper
    Installing and Running ZooKeeper
    An Example
    Group Membership in ZooKeeper
    Creating the Group
    Joining a Group
    Listing Members in a Group
    Deleting a Group
    The ZooKeeper Service
    Data Model
    Operations
    Implementation
    Consistency
    Sessions
    States
    Building Applications with ZooKeeper
    A Configuration Service
    The Resilient ZooKeeper Application
    A Lock Service
    More Distributed Data Structures and Protocols
    ZooKeeper in Production
    Resilience and Performance
    Configuration
  15. Sqoop
    Getting Sqoop
    A Sample Import
    Generated Code
    Additional Serialization Systems
    Database Imports: A Deeper Look
    Controlling the Import
    Imports and Consistency
    Direct-mode Imports
    Working with Imported Data
    Imported Data and Hive
    Importing Large Objects
    Performing an Export
    Exports: A Deeper Look
    Exports and Transactionality
    Exports and SequenceFiles
  16. Case Studies
    Hadoop Usage at Last.fm
    Last.fm: The Social Music Revolution
    Hadoop at Last.fm
    Generating Charts with Hadoop
    The Track Statistics Program
    Summary
    Hadoop and Hive at Facebook
    Introduction
    Hadoop at Facebook
    Hypothetical Use Case Studies
    Hive
    Problems and Future Work
    Nutch Search Engine
    Background
    Data Structures
    Selected Examples of Hadoop Data Processing in Nutch
    Summary
    Log Processing at Rackspace
    Requirements/The Problem
    Brief History
    Choosing Hadoop
    Collection and Storage
    MapReduce for Logs
    Cascading
    Fields, Tuples, and Pipes
    Operations
    Taps, Schemes, and Flows
    Cascading in Practice
    Flexibility
    Hadoop and Cascading at ShareThis
    Summary
    TeraByte Sort on Apache Hadoop
    Using Pig and Wukong to Explore Billion-edge Network Graphs
    Measuring Community
    Everybody’s Talkin’ at Me: The Twitter Reply Graph
    Symmetric Links
    Community Extraction
  A. Installing Apache Hadoop
  B. Cloudera’s Distribution for Hadoop
  C. Preparing the NCDC Weather Data
  Index
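Title: Hadoop: The Definitive Guide, Second Edition (Reprint Edition)
Author: Tom White
Publisher in China: Southeast University Press
Publication date: June 2011
Pages: 600
ISBN: 978-7-5641-2676-6
Original title: Hadoop: The Definitive Guide, Second Edition
Original publisher: O'Reilly Media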
Tom White
 
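Tom White has been an Apache Hadoop committer since February 2007 and is a member of the Apache Software Foundation. He works at Cloudera, a company that offers Hadoop products, services, support, and training. Before that, Tom was an independent Hadoop consultant who helped many companies set up, use, and extend Hadoop. He has written numerous articles for oreilly.com, java.net, and IBM's developerWorks, and speaks regularly about Hadoop at industry conferences. Tom has a bachelor's degree in mathematics from the University of Cambridge and a master's degree in philosophy of science from the University of Leeds, UK. He currently lives in San Francisco with his family.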
 
 
The animal on the cover of Hadoop: The Definitive Guide is an African elephant. These
members of the genus Loxodonta are the largest land animals on earth (slightly larger
than their cousin, the Asian elephant) and can be identified by their ears, which have
been said to look somewhat like the continent of Africa. Males stand 12 feet tall at the
shoulder and weigh 12,000 pounds, but they can get as big as 15,000 pounds, whereas
females stand 10 feet tall and weigh 8,000–11,000 pounds. Even young elephants are
very large: at birth, they already weigh approximately 200 pounds and stand about 3
feet tall.
African elephants live throughout sub-Saharan Africa. Most of the continent’s elephants
live on savannas and in dry woodlands. In some regions, they can be found in
desert areas; in others, they are found in mountains.
The species plays an important role in the forest and savanna ecosystems in which they
live. Many plant species are dependent on passing through an elephant’s digestive tract
before they can germinate; it is estimated that at least a third of tree species in west
African forests rely on elephants in this way. Elephants grazing on vegetation also affect
the structure of habitats and influence bush fire patterns. For example, under natural
conditions, elephants make gaps through the rainforest, enabling the sunlight to enter,
which allows the growth of various plant species. This, in turn, facilitates more abundance
and more diversity of smaller animals. As a result of the influence elephants have
over many plants and animals, they are often referred to as a keystone species because
they are vital to the long-term survival of the ecosystems in which they live.