Tuesday, December 26, 2017

Apache Nutch 1.14 Released: Web Crawler



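Apache Nutch 1.14 has been released. Nutch is a mature, production-ready web crawler. Nutch 1.x enables fine-grained configuration and relies on Apache Hadoop™ data structures, which are well suited to batch processing.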

Changes:

Bug fixes

  • [NUTCH-2071] – A parser failure on a single document may fail crawling job
  • [NUTCH-2235] – Classpath discrepancy with protocol-selenium in deploy mode
  • [NUTCH-2269] – Clean not working after crawl
  • [NUTCH-2295] – Nutch master docker container broken
  • [NUTCH-2297] – CrawlDbReader -stats wrong values for earliest fetch time and shortest interval
  • [NUTCH-2316] – Library conflict with Parser-Tika Plugin and Lib Folder

Improvements

  • [NUTCH-1763] – Improving comments on the Injector Class
  • [NUTCH-2034] – CrawlDB filtered documents counter.
  • [NUTCH-2035] – Regex filter using case sensitive rules.
  • [NUTCH-2046] – The crawl script should be able to skip an initial injection.
  • [NUTCH-2135] – Ant Eclipse build does not include protocol-interactiveselenium
  • [NUTCH-2193] – Upgrade feed parser plugin to use rome 1.5

For the full list of changes, see the release notes.

Download:

Reposted from http://ift.tt/2l1C1BS

The post Apache Nutch 1.14 Released: Web Crawler appeared first on the Linuxeden open source community.

http://ift.tt/2leUFp4
