Natural Language Annotation for Machine Learning (Reprint Edition)
James Pustejovsky, Amber Stubbs
Publication date: June 2013
Pages: 344
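“Language annotation is a critical part of natural language processing, yet it is rarely covered in computational linguistics courses. This is the first book to offer a step-by-step guide to annotation, from specification and design through to the use of machine learning algorithms. It is bound to become a standard text for undergraduate and graduate computational linguistics courses.”
Nancy Ide
Professor of Computer Science, Vassar College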

This book is the ideal companion to O’Reilly’s Natural Language Processing with Python.

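It is time to build your own natural language training corpus for machine learning. Whether you work with English, Chinese, or any other natural language, this book walks you step by step through a proven annotation development cycle: the process of adding metadata to your training corpus so that machine learning algorithms work more effectively. You need no programming or linguistics experience to get started.
Through detailed examples at every step, you will learn how the annotation development process helps you model, annotate, train, test, evaluate, and revise your training corpus. You will also see a complete walkthrough of a real annotation project.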

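· Define a clear annotation goal before collecting your dataset (corpus)
· Learn tools for analyzing the linguistic content of your corpus
· Build a model and specification for your annotation project
· Examine the different annotation formats, from basic XML to the Linguistic Annotation Framework
· Create a gold standard corpus suitable for training and testing machine learning algorithms
· Select the machine learning algorithms that will process your annotated data
· Evaluate the test results and revise your annotation task
· Learn how to use lightweight software for annotating texts and adjudicating the annotations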

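James Pustejovsky is a professor at Brandeis University, where he teaches and conducts research in artificial intelligence and computational linguistics in the Computer Science Department.
Amber Stubbs recently received her Ph.D. from Brandeis University, with a focus on annotation methodology. She is currently a postdoctoral associate at SUNY Albany.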
Chapter 1: The Basics
  · The Importance of Language Annotation
  · A Brief History of Corpus Linguistics
  · Language Data and Machine Learning
  · The Annotation Development Cycle
  · Summary
Chapter 2: Defining Your Goal and Dataset
  · Defining Your Goal
  · Background Research
  · Assembling Your Dataset
  · The Size of Your Corpus
  · Summary
Chapter 3: Corpus Analytics
  · Basic Probability for Corpus Analytics
  · Counting Occurrences
  · Language Models
  · Summary
Chapter 4: Building Your Model and Specification
  · Some Example Models and Specs
  · Adopting (or Not Adopting) Existing Models
  · Different Kinds of Standards
  · Summary
Chapter 5: Applying and Adopting Annotation Standards
  · Metadata Annotation: Document Classification
  · Text Extent Annotation: Named Entities
  · Linked Extent Annotation: Semantic Roles
  · ISO Standards and You
  · Summary
Chapter 6: Annotation and Adjudication
  · The Infrastructure of an Annotation Project
  · Specification Versus Guidelines
  · Be Prepared to Revise
  · Preparing Your Data for Annotation
  · Writing the Annotation Guidelines
  · Annotators
  · Choosing an Annotation Environment
  · Evaluating the Annotations
  · Creating the Gold Standard (Adjudication)
  · Summary
Chapter 7: Training: Machine Learning
  · What Is Learning?
  · Defining Our Learning Task
  · Classifier Algorithms
  · Sequence Induction Algorithms
  · Clustering and Unsupervised Learning
  · Semi-Supervised Learning
  · Matching Annotation to Algorithms
  · Summary
Chapter 8: Testing and Evaluation
  · Testing Your Algorithm
  · Evaluating Your Algorithm
  · Problems That Can Affect Evaluation
  · Final Testing Scores
  · Summary
Chapter 9: Revising and Reporting
  · Revising Your Project
  · Reporting About Your Work
  · Summary
Chapter 10: Annotation: TimeML
  · The Goal of TimeML
  · Related Research
  · Building the Corpus
  · Model: Preliminary Specifications
  · Annotation: First Attempts
  · Model: The TimeML Specification Used in TimeBank
  · Annotation: The Creation of TimeBank
  · TimeML Becomes ISO-TimeML
  · Modeling the Future: Directions for TimeML
  · Summary
Chapter 11: Automatic Annotation: Generating TimeML
  · The TARSQI Components
  · Improvements to the TTK
  · TimeML Challenges: TempEval-2
  · Future of the TTK
  · Summary
Chapter 12: Afterword: The Future of Annotation
  · Crowdsourcing Annotation
  · Handling Big Data
  · NLP Online and in the Cloud
  · And Finally...
Appendix: List of Available Corpora and Specifications
  · Corpora
  · Specifications, Guidelines, and Other Resources
  · Representation Standards
Appendix: List of Software Resources
  · Annotation and Adjudication Software
  · Machine Learning Resources
Appendix: MAE User Guide
  · Installing and Running MAE
  · Loading Tasks and Files
  · Saving Files
  · Defining Your Own Task
  · Frequently Asked Questions
Appendix: MAI User Guide
  · Installing and Running MAI
  · Loading Tasks and Files
  · Adjudicating
  · Saving Files
Appendix: Bibliography
  · References for Using Amazon’s Mechanical Turk/Crowdsourcing
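Title: Natural Language Annotation for Machine Learning (Reprint Edition)
Publisher in China: Southeast University Press
Publication date: June 2013
Pages: 344
ISBN: 978-7-5641-4281-0
Original title: Natural Language Annotation for Machine Learning
Original publisher: O'Reilly Media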
James Pustejovsky
 
James Pustejovsky teaches and does research in Artificial Intelligence and Computational Linguistics in the Computer Science Department at Brandeis University, where he is a professor. His main areas of interest include lexical meaning, computational semantics, temporal and spatial reasoning, and corpus linguistics. He is active in the development of standards for interoperability between language processing applications, and led the creation of the recently adopted ISO standard for time annotation, ISO-TimeML. He is currently heading the development of a standard for annotating spatial information in language. More information on publications and research activities can be found at his webpage: pusto.com.
 
 
Amber Stubbs
 
Amber Stubbs completed her Ph.D. in Computer Science at Brandeis University in 2013. Her dissertation focused on creating an annotation methodology to aid in extracting high-level information from natural language files, particularly biomedical texts. She went on to become a Postdoctoral Associate at SUNY Albany, and is now an assistant professor in the School of Library and Information Science and the computer science program at Simmons College in Boston. Her website can be found at http://pages.cs.brandeis.edu/~astubbs/
 
 
The animal on the cover of Natural Language Annotation for Machine Learning is the cockatiel (Nymphicus hollandicus). The scientific name comes from European travelers who found the birds so beautiful that they named them after mythical nymphs. Hollandicus refers to “New Holland,” an older name for Australia, the continent to which these birds are native. In the wild, cockatiels can be found in arid habitats like brushland or the outback, yet they remain close to water. They are usually seen in pairs, though flocks will congregate around a single body of water.

Until six to nine months after hatching, female and male cockatiels are indistinguishable, as both have horizontal yellow stripes on the surface of their tail feathers and a dull orange patch on each cheek. When molting begins, males lose some white or yellow feathers and gain brighter yellow feathers. In addition, the orange patches on the face become much more prominent. In captivity, cockatiels typically live 15–20 years; in the wild, they generally live between 10 and 30 years.

For some time the cockatiel was considered either a parrot or a cockatoo, as scientists and biologists hotly debated which bird it actually was. It is now classified as part of the cockatoo family because it shares that family's biological features: an upright crest, a gallbladder, and powder down (a special type of feather where the tips of barbules disintegrate, forming a fine dust among the feathers).