Data Analysis with Open Source Tools (Reprint Edition)
Philipp K. Janert
Publication date: June 2011
Pages: 509
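Collecting data is relatively easy, but turning raw information into something useful requires knowing how to extract precisely what you need. With this in-depth book, intermediate to experienced programmers interested in data analysis will learn techniques for working with data in a business environment. You will learn how to look at data to discover what it contains, how to capture those ideas in conceptual models, and then how to feed your understanding back into your organization through business plans, metrics dashboards, and other applications.
Along the way, you will experiment with the concepts through hands-on workshops at the end of each chapter. Above all, you will learn how to think about the results you want to achieve, rather than relying on tools to do the thinking for you.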

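· Use graphics to describe data with one, two, or dozens of variables
· Develop conceptual models using back-of-the-envelope calculations, as well as scaling and probability arguments
· Mine data with computationally intensive methods such as simulation and clustering
· Make your conclusions understandable through reports, dashboards, and other metrics programs
· Understand financial calculations, including the time value of money
· Use dimensionality reduction techniques or predictive analytics to overcome challenges in data analysis
· Become familiar with different open source programming environments for data analysis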

“Finally, a concise reference that helps you understand how to conquer data.”
-- Austin King, Senior Web Developer, Mozilla

“An indispensable resource for aspiring data scientists.”
-- Michael E. Driscoll, CEO/Founder, Dataspora
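The author, Philipp K. Janert, currently provides consulting services for data analysis and mathematical modeling, after earlier careers as a physicist and software engineer. He is the author of “Gnuplot in Action: Understanding Data with Graphs” (Manning Publications) and has written for the O’Reilly Network, IBM developerWorks, and IEEE Software. He holds a Ph.D. in theoretical physics from the University of Washington.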

For readers with programming experience
PREFACE
1 INTRODUCTION
  Data Analysis
  What’s in This Book
  What’s with the Workshops?
  What’s with the Math?
  What You’ll Need
  What’s Missing
PART I Graphics: Looking at Data
2 A SINGLE VARIABLE: SHAPE AND DISTRIBUTION
  Dot and Jitter Plots
  Histograms and Kernel Density Estimates
  The Cumulative Distribution Function
  Rank-Order Plots and Lift Charts
  Only When Appropriate: Summary Statistics and Box Plots
  Workshop: NumPy
  Further Reading
3 TWO VARIABLES: ESTABLISHING RELATIONSHIPS
  Scatter Plots
  Conquering Noise: Smoothing
  Logarithmic Plots
  Banking
  Linear Regression and All That
  Showing What’s Important
  Graphical Analysis and Presentation Graphics
  Workshop: matplotlib
  Further Reading
4 TIME AS A VARIABLE: TIME-SERIES ANALYSIS
  Examples
  The Task
  Smoothing
  Don’t Overlook the Obvious!
  The Correlation Function
  Optional: Filters and Convolutions
  Workshop: scipy.signal
  Further Reading
5 MORE THAN TWO VARIABLES: GRAPHICAL MULTIVARIATE ANALYSIS
  False-Color Plots
  A Lot at a Glance: Multiplots
  Composition Problems
  Novel Plot Types
  Interactive Explorations
  Workshop: Tools for Multivariate Graphics
  Further Reading
6 INTERMEZZO: A DATA ANALYSIS SESSION
  A Data Analysis Session
  Workshop: gnuplot
  Further Reading
PART II Analytics: Modeling Data
7 GUESSTIMATION AND THE BACK OF THE ENVELOPE
  Principles of Guesstimation
  How Good Are Those Numbers?
  Optional: A Closer Look at Perturbation Theory and Error Propagation
  Workshop: The Gnu Scientific Library (GSL)
  Further Reading
8 MODELS FROM SCALING ARGUMENTS
  Models
  Arguments from Scale
  Mean-Field Approximations
  Common Time-Evolution Scenarios
  Case Study: How Many Servers Are Best?
  Why Modeling?
  Workshop: Sage
  Further Reading
9 ARGUMENTS FROM PROBABILITY MODELS
  The Binomial Distribution and Bernoulli Trials
  The Gaussian Distribution and the Central Limit Theorem
  Power-Law Distributions and Non-Normal Statistics
  Other Distributions
  Optional: Case Study—Unique Visitors over Time
  Workshop: Power-Law Distributions
  Further Reading
10 WHAT YOU REALLY NEED TO KNOW ABOUT CLASSICAL STATISTICS
  Genesis
  Statistics Defined
  Statistics Explained
  Controlled Experiments Versus Observational Studies
  Optional: Bayesian Statistics—The Other Point of View
  Workshop: R
  Further Reading
11 INTERMEZZO: MYTHBUSTING—BIGFOOT, LEAST SQUARES, AND ALL THAT
  How to Average Averages
  The Standard Deviation
  Least Squares
  Further Reading
PART III Computation: Mining Data
12 SIMULATIONS
  A Warm-Up Question
  Monte Carlo Simulations
  Resampling Methods
  Workshop: Discrete Event Simulations with SimPy
  Further Reading
13 FINDING CLUSTERS
  What Constitutes a Cluster?
  Distance and Similarity Measures
  Clustering Methods
  Pre- and Postprocessing
  Other Thoughts
  A Special Case: Market Basket Analysis
  A Word of Warning
  Workshop: Pycluster and the C Clustering Library
  Further Reading
14 SEEING THE FOREST FOR THE TREES: FINDING IMPORTANT ATTRIBUTES
  Principal Component Analysis
  Visual Techniques
  Kohonen Maps
  Workshop: PCA with R
  Further Reading
15 INTERMEZZO: WHEN MORE IS DIFFERENT
  A Horror Story
  Some Suggestions
  What About Map/Reduce?
  Workshop: Generating Permutations
  Further Reading
PART IV Applications: Using Data
16 REPORTING, BUSINESS INTELLIGENCE, AND DASHBOARDS
  Business Intelligence
  Corporate Metrics and Dashboards
  Data Quality Issues
  Workshop: Berkeley DB and SQLite
  Further Reading
17 FINANCIAL CALCULATIONS AND MODELING
  The Time Value of Money
  Uncertainty in Planning and Opportunity Costs
  Cost Concepts and Depreciation
  Should You Care?
  Is This All That Matters?
  Workshop: The Newsvendor Problem
  Further Reading
18 PREDICTIVE ANALYTICS
  Introduction
  Some Classification Terminology
  Algorithms for Classification
  The Process
  The Secret Sauce
  The Nature of Statistical Learning
  Workshop: Two Do-It-Yourself Classifiers
  Further Reading
19 EPILOGUE: FACTS ARE NOT REALITY
A PROGRAMMING ENVIRONMENTS FOR SCIENTIFIC COMPUTATION AND DATA ANALYSIS
  Software Tools
  A Catalog of Scientific Software
  Writing Your Own
  Further Reading
B RESULTS FROM CALCULUS
  Common Functions
  Calculus
  Useful Tricks
  Notation and Basic Math
  Where to Go from Here
  Further Reading
C WORKING WITH DATA
  Sources for Data
  Cleaning and Conditioning
  Sampling
  Data File Formats
  The Care and Feeding of Your Data Zoo
  Skills
  Terminology
  Further Reading
INDEX
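Title: Data Analysis with Open Source Tools (Reprint Edition)
Author: Philipp K. Janert
Publisher in China: Southeast University Press
Publication date: June 2011
Pages: 509
ISBN: 978-7-5641-2674-2
Original title: Data Analysis with Open Source Tools
Original publisher: O'Reilly Media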
Philipp K. Janert
 
After previous careers in physics and software development, Philipp K. Janert currently
provides consulting services for data analysis, algorithm development, and mathematical
modeling. He has worked for small start-ups and in large corporate environments, both in
the U.S. and overseas. He prefers simple solutions that work to complicated ones that
don’t, and thinks that purpose is more important than process. Philipp is the author of
“Gnuplot in Action: Understanding Data with Graphs” (Manning Publications), and has
written for the O’Reilly Network, IBM developerWorks, and IEEE Software. He is named
inventor on a handful of patents, and is an occasional contributor to CPAN. He holds a
Ph.D. in theoretical physics from the University of Washington. Visit his company website
at www.principal-value.com.
 
 
The animal on the cover of Data Analysis with Open Source Tools is a common kite, most
likely a member of the genus Milvus. Kites are medium-size raptors with long wings and
forked tails. They are noted for their elegant, soaring flight. They are also called “gledes”
(for their gliding motion) and, like the flying toys, they appear to ride effortlessly on air
currents.
The genus Milvus is a group of Old World kites, including three or four species and
numerous subspecies. These kites are opportunistic feeders that hunt small animals, such
as birds, fish, rodents, and earthworms, and also eat carrion, including sheep and cow
carcasses. They have been observed to steal prey from other birds. They may live 25 to
30 years in the wild.
The genus dates to prehistoric times; an Israeli Milvus pygmaeus specimen is thought to be
between 1.8 million and 780,000 years old. Biblical references to kites probably refer to
birds of this genus. In Coriolanus, Shakespeare calls Rome “the city of kites and crows,”
commenting on the birds’ prevalence in urban areas.
The most widespread member of the genus is the black kite (Milvus migrans), found in
Europe, Asia, Africa, and Australia. These kites are very common in many parts of their
habitat and are well adapted to city life. Attracted by smoke, they sometimes hunt by
capturing small animals fleeing from fires.
The other notable member of Milvus is the red kite (Milvus milvus), which is slightly larger
than the black kite and is distinguished by a rufous body and tail. Red kites are found only
in Europe. They were very common in Britain until 1800, but the population was
devastated by poisoning and habitat loss, and by 1930, fewer than 20 birds remained.
Since then, kites have made a comeback in Wales and have been reintroduced elsewhere
in Britain.