

A Hadoop HBase WriterPool implementation for the Heritrix crawler

  • File size: 18.37 kB
  • Uploaded: 2021-06-30
  • Downloads: 0
  • Views: 0
  • Cost: 1 point
  • Tags: crawler hadoop heritrix a implementation

Resource Description

What is HBase-Writer?

HBase-Writer is a Java extension to the Heritrix open source crawler. Heritrix is written by the Internet Archive, and HBase-Writer enables Heritrix to store crawled content directly in HBase tables running on the Hadoop Distributed File System (HDFS). By default, HBase-Writer writes the content of each crawled URL into an HBase table as an individual record: each fetched URL is represented by one row key in the table. HBase-Writer can also be extended for custom behavior, such as writing to multiple tables. These HBase tables are in turn directly supported by the MapReduce framework via Hadoop. HBase-Writer's goal is to facilitate fast, large, distributed crawls using Heritrix and to save and manage Web-scale content using HBase.

News

March 14th, 2015: This project has moved to GitHub in anticipation of the Google Code shutdown: https://github.com/OpenSourceMasters/hba
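
The one-row-per-URL layout described above can be illustrated with a small in-memory sketch. This is not HBase-Writer's code and does not use the HBase client API: the class name `RowKeySketch`, the column name `content:raw_data`, and the rule that the row key is simply the trimmed URL are all assumptions made for illustration. A `TreeMap` stands in for the table because HBase also keeps rows sorted by row key.

```java
import java.nio.charset.StandardCharsets;
import java.util.Map;
import java.util.TreeMap;

/** In-memory stand-in for a table that holds one row per fetched URL. */
public class RowKeySketch {
    // Hypothetical column name in "family:qualifier" form; not taken from the source.
    static final String CONTENT_COLUMN = "content:raw_data";

    // Rows sorted by row key, each row mapping column name to cell value.
    private final TreeMap<String, Map<String, byte[]>> rows = new TreeMap<>();

    /** Assumption: the row key is the URL with surrounding whitespace removed. */
    public static String rowKeyFor(String url) {
        return url.trim();
    }

    /** Stores the fetched body under the URL's row key, replacing any earlier fetch. */
    public void write(String url, byte[] body) {
        rows.computeIfAbsent(rowKeyFor(url), k -> new TreeMap<>())
            .put(CONTENT_COLUMN, body);
    }

    /** Returns the stored body for the URL, or null if it was never written. */
    public byte[] read(String url) {
        Map<String, byte[]> row = rows.get(rowKeyFor(url));
        return row == null ? null : row.get(CONTENT_COLUMN);
    }

    public int rowCount() {
        return rows.size();
    }

    public static void main(String[] args) {
        RowKeySketch table = new RowKeySketch();
        table.write("http://example.com/", "<html>a</html>".getBytes(StandardCharsets.UTF_8));
        table.write("http://example.com/about", "<html>b</html>".getBytes(StandardCharsets.UTF_8));
        // Fetching the same URL again overwrites its row instead of adding a new one.
        table.write("http://example.com/", "<html>c</html>".getBytes(StandardCharsets.UTF_8));
        System.out.println("rows: " + table.rowCount());
    }
}
```

In a real deployment the map would be replaced by writes to an HBase table, and the row key scheme would be whatever the writer is configured to use.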