Edward Capriolo, Dean Wampler, Jason Rutherglen
Need to move a relational database application onto Hadoop? This comprehensive guide introduces you to Apache Hive, Hadoop's data warehouse platform. You'll quickly learn how to use Hive's SQL dialect, HiveQL, to summarize, query, and analyze large datasets stored in the Hadoop Distributed File System.

· Use Hive to create, alter, and drop databases, tables, views, functions, and indexes
· Customize data formats and storage options, from files to external databases
· Load and extract data from tables — and use queries, grouping, filtering, joins, and other conventional query methods
· Gain best practices for creating user-defined functions (UDFs)
· Learn the Hive patterns you should use and the anti-patterns you should avoid
· Integrate Hive with other data-processing programs
· Use storage handlers for NoSQL databases and other data stores
· Learn the pros and cons of running Hive on Amazon's Elastic MapReduce
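
As a small taste of the HiveQL covered in the book (the table name, schema, and file path below are illustrative, not taken from the text), defining, loading, and querying a table looks much like standard SQL:

```sql
-- Create a managed table over tab-delimited text files (hypothetical schema)
CREATE TABLE employees (
  name   STRING,
  salary FLOAT,
  dept   STRING
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t';

-- Load a local file into the table (path is illustrative)
LOAD DATA LOCAL INPATH '/tmp/employees.txt' INTO TABLE employees;

-- Summarize: average salary per department
SELECT dept, avg(salary) FROM employees GROUP BY dept;
```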

Edward Capriolo is a systems administrator at Media6degrees, a member of the Apache Software Foundation, and a committer on the Apache Hive project.
Dean Wampler is a principal consultant at Think Big Analytics, specializing in big data problems and in tools such as Hadoop and machine learning.
Jason Rutherglen is a software architect at Think Big Analytics, specializing in big data, Hadoop, search, and security.
  1. Chapter 1: Introduction
  2. An Overview of Hadoop and MapReduce
  3. Hive in the Hadoop Ecosystem
  4. Java Versus Hive: The Word Count Algorithm
  5. What’s Next
  6. Chapter 2: Getting Started
  7. Installing a Preconfigured Virtual Machine
  8. Detailed Installation
  9. What Is Inside Hive?
  10. Starting Hive
  11. Configuring Your Hadoop Environment
  12. The Hive Command
  13. The Command-Line Interface
  14. Chapter 3: Data Types and File Formats
  15. Primitive Data Types
  16. Collection Data Types
  17. Text File Encoding of Data Values
  18. Schema on Read
  19. Chapter 4: HiveQL: Data Definition
  20. Databases in Hive
  21. Alter Database
  22. Creating Tables
  23. Partitioned, Managed Tables
  24. Dropping Tables
  25. Alter Table
  26. Chapter 5: HiveQL: Data Manipulation
  27. Loading Data into Managed Tables
  28. Inserting Data into Tables from Queries
  29. Creating Tables and Loading Them in One Query
  30. Exporting Data
  31. Chapter 6: HiveQL: Queries
  32. SELECT … FROM Clauses
  33. WHERE Clauses
  34. GROUP BY Clauses
  35. JOIN Statements
  36. ORDER BY and SORT BY
  39. Casting
  40. Queries that Sample Data
  42. Chapter 7: HiveQL: Views
  43. Views to Reduce Query Complexity
  44. Views that Restrict Data Based on Conditions
  45. Views and Map Type for Dynamic Tables
  46. View Odds and Ends
  47. Chapter 8: HiveQL: Indexes
  48. Creating an Index
  49. Rebuilding the Index
  50. Showing an Index
  51. Dropping an Index
  52. Implementing a Custom Index Handler
  53. Chapter 9: Schema Design
  54. Table-by-Day
  55. Over Partitioning
  56. Unique Keys and Normalization
  57. Making Multiple Passes over the Same Data
  58. The Case for Partitioning Every Table
  59. Bucketing Table Data Storage
  60. Adding Columns to a Table
  61. Using Columnar Tables
  62. (Almost) Always Use Compression!
  63. Chapter 10: Tuning
  64. Using EXPLAIN
  66. Limit Tuning
  67. Optimized Joins
  68. Local Mode
  69. Parallel Execution
  70. Strict Mode
  71. Tuning the Number of Mappers and Reducers
  72. JVM Reuse
  73. Indexes
  74. Dynamic Partition Tuning
  75. Speculative Execution
  76. Single MapReduce MultiGROUP BY
  77. Virtual Columns
  78. Chapter 11: Other File Formats and Compression
  79. Determining Installed Codecs
  80. Choosing a Compression Codec
  81. Enabling Intermediate Compression
  82. Final Output Compression
  83. Sequence Files
  84. Compression in Action
  85. Archive Partition
  86. Compression: Wrapping Up
  87. Chapter 12: Developing
  88. Changing Log4J Properties
  89. Connecting a Java Debugger to Hive
  90. Building Hive from Source
  91. Setting Up Hive and Eclipse
  92. Hive in a Maven Project
  93. Unit Testing in Hive with hive_test
  94. The New Plugin Developer Kit
  95. Chapter 13: Functions
  96. Discovering and Describing Functions
  97. Calling Functions
  98. Standard Functions
  99. Aggregate Functions
  100. Table Generating Functions
  101. A UDF for Finding a Zodiac Sign from a Day
  102. UDF Versus GenericUDF
  103. Permanent Functions
  104. User-Defined Aggregate Functions
  105. User-Defined Table Generating Functions
  106. Accessing the Distributed Cache from a UDF
  107. Annotations for Use with Functions
  108. Macros
  109. Chapter 14: Streaming
  110. Identity Transformation
  111. Changing Types
  112. Projecting Transformation
  113. Manipulative Transformations
  114. Using the Distributed Cache
  115. Producing Multiple Rows from a Single Row
  116. Calculating Aggregates with Streaming
  118. GenericMR Tools for Streaming to Java
  119. Calculating Cogroups
  120. Chapter 15: Customizing Hive File and Record Formats
  121. File Versus Record Formats
  122. Demystifying CREATE TABLE Statements
  123. File Formats
  124. Record Formats: SerDes
  125. CSV and TSV SerDes
  126. ObjectInspector
  127. Think Big Hive Reflection ObjectInspector
  128. XML UDF
  129. XPath-Related Functions
  130. JSON SerDe
  131. Avro Hive SerDe
  132. Binary Output
  133. Chapter 16: Hive Thrift Service
  134. Starting the Thrift Server
  135. Setting Up Groovy to Connect to HiveService
  136. Connecting to HiveServer
  137. Getting Cluster Status
  138. Result Set Schema
  139. Fetching Results
  140. Retrieving Query Plan
  141. Metastore Methods
  142. Administrating HiveServer
  143. Hive ThriftMetastore
  144. Chapter 17: Storage Handlers and NoSQL
  145. Storage Handler Background
  146. HiveStorageHandler
  147. HBase
  148. Cassandra
  149. DynamoDB
  150. Chapter 18: Security
  151. Integration with Hadoop Security
  152. Authentication with Hive
  153. Authorization in Hive
  154. Chapter 19: Locking
  155. Locking Support in Hive with Zookeeper
  156. Explicit, Exclusive Locks
  157. Chapter 20: Hive Integration with Oozie
  158. Oozie Actions
  159. A Two-Query Workflow
  160. Oozie Web Console
  161. Variables in Workflows
  162. Capturing Output
  163. Capturing Output to Variables
  164. Chapter 21: Hive and Amazon Web Services (AWS)
  165. Why Elastic MapReduce?
  166. Instances
  167. Before You Start
  168. Managing Your EMR Hive Cluster
  169. Thrift Server on EMR Hive
  170. Instance Groups on EMR
  171. Configuring Your EMR Cluster
  172. Persistence and the Metastore on EMR
  173. HDFS and S3 on EMR Cluster
  174. Putting Resources, Configs, and Bootstrap Scripts on S3
  175. Logs on S3
  176. Spot Instances
  177. Security Groups
  178. EMR Versus EC2 and Apache Hive
  179. Wrapping Up
  180. Chapter 22: HCatalog
  181. Introduction
  182. MapReduce
  183. Command Line
  184. Security Model
  185. Architecture
  186. Chapter 23: Case Studies
  187. Media6Degrees
  188. Outbrain
  189. NASA’s Jet Propulsion Laboratory
  190. Photobucket
  191. SimpleReach
  192. Experiences and Needs from the Customer Trenches
  193. Glossary
  194. Appendix: References
Original title: Programming Hive
Original publisher: O'Reilly Media
The animal on the cover of Programming Hive is a European hornet (Vespa crabro) and its hive. The European hornet is the only true hornet in North America, introduced to the continent when European settlers migrated to the Americas. This hornet can be found throughout Europe and much of Asia, adapting its hive-building techniques to different climates when necessary.

The hornet is a social insect, related to bees and ants. The hornet’s hive consists of one queen, a few male hornets (drones), and a large quantity of sterile female workers. The chief purpose of drones is to reproduce with the hornet queen, and they die soon after. It is the female workers who are responsible for building the hive, carrying food, and tending to the hornet queen’s eggs.

The hornet’s nest itself is the consistency of paper, since it is constructed out of wood pulp in several layers of hexagonal cells. The end result is a pear-shaped nest attached to its shelter by a short stem. In colder areas, hornets will abandon the nest in the winter and take refuge in hollow logs or trees, or even human houses, where the queen and her eggs will stay until the warmer weather returns. The eggs form the start of a new colony, and the hive can be constructed once again.