投資検討中: [Solr] Crawling a Site with Nutch

Lucid Imagination published a walkthrough on crawling a site with Nutch and indexing the results in Solr:

Lucid Imagination » Using Nutch with Solr

So I tried it out.

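Nutch provides the crawling side, acting as a web spider. It is built on Hadoop and its distributed file system, but the procedure here does not seem to use that; everything appears to run locally.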

- - - -

1. <a href="http://hudson.zones.apache.org/hudson/job/Nutch-trunk/">Download</a> Nutch and unpack it.

tar xzf apache-nutch-1.0.tar.gz

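2. Configure Solr.

Nutch ships with sample configuration files, such as schema.xml, for use together with Solr.

a. Install schema.xml
Copy schema.xml from apache-nutch-1.0/conf to apache-solr-1.3.0/example/solr/conf.

b. Change the "content" field setting
Edit the content field in the copied schema.xml so that it reads:

<field name="content" type="text" stored="true" indexed="true"/>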

c. Create a request handler for the crawler

Add the following to solrconfig.xml:

<requestHandler name="/nutch" class="solr.SearchHandler" >
<lst name="defaults">
<str name="defType">dismax</str>
<str name="echoParams">explicit</str>
<float name="tie">0.01</float>
<str name="qf">
content^0.5 anchor^1.0 title^1.2
</str>
<str name="pf">
content^0.5 anchor^1.5 title^1.2 site^1.5
</str>
<str name="fl">
url
</str>
<str name="mm">
2<-1 5<-2 6<90%
</str>
<int name="ps">100</int>
<bool name="hl">true</bool>
<str name="q.alt">*:*</str>
<str name="hl.fl">title url content</str>
<str name="f.title.hl.fragsize">0</str>
<str name="f.title.hl.alternateField">title</str>
<str name="f.url.hl.fragsize">0</str>
<str name="f.url.hl.alternateField">url</str>
<str name="f.content.hl.fragmenter">regex</str>
</lst>
</requestHandler>
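The mm value in this handler, 2<-1 5<-2 6<90%, is easy to misread. The following Python sketch re-implements the rule as the dismax documentation describes it; it is not Solr's own code, and min_should_match and _apply are names made up for this illustration. With up to 2 optional clauses all of them must match, with 3 to 5 all but one, with 6 all but two, and above that 90%, rounded down.

```python
def min_should_match(clause_count, spec):
    """Return how many optional clauses must match for a dismax mm spec.

    Illustrative re-implementation of the documented rule, not Solr code.
    """
    result = clause_count
    for part in spec.split():
        if "<" in part:
            bound, rule = part.split("<", 1)
            if clause_count <= int(bound):
                # A conditional rule only applies above its bound,
                # so the result computed so far stands.
                return result
            result = _apply(clause_count, rule)
        else:
            result = _apply(clause_count, part)
    return result


def _apply(clause_count, rule):
    """Evaluate one unconditional rule such as "3", "-2", "90%" or "-25%"."""
    if rule.endswith("%"):
        percent = int(rule[:-1])
        # Percentages round down; a negative percentage is the share
        # of clauses that may be missing.
        amount = clause_count * abs(percent) // 100
        value = clause_count - amount if percent < 0 else amount
    else:
        number = int(rule)
        # A negative number is the count of clauses that may be missing.
        value = clause_count + number if number < 0 else number
    return max(0, min(clause_count, value))


for n in (1, 2, 3, 5, 6, 7, 10):
    print(n, min_should_match(n, "2<-1 5<-2 6<90%"))
```

For 1, 2, 3, 5, 6, 7 and 10 clauses this prints 1, 2, 2, 4, 4, 6 and 9 required matches respectively.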

3. Start Solr.

cd apache-solr-1.3.0/example
java -jar start.jar

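4. Configure Nutch.

a. Replace the entire contents of nutch-site.xml in apache-nutch-1.0/conf with the following.
This file sets the crawler's name, the plugins to enable, and the maximum number of URLs per host in a single fetch list.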

<?xml version="1.0"?>
<configuration>
<property>
<name>http.agent.name</name>
<value>nutch-solr-integration</value>
</property>
<property>
<name>generate.max.per.host</name>
<value>100</value>
</property>
<property>
<name>plugin.includes</name>
<value>protocol-http|urlfilter-regex|parse-html|index-(basic|anchor)|query-(basic|site|url)|response-(json|xml)|summary-basic|scoring-opic|urlnormalizer-(pass|regex|basic)</value>
</property>
</configuration>
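The plugin.includes value is a regular expression over plugin ids. As a rough way to check which plugins a given value enables, here is a small Python sketch. It assumes that the whole plugin id has to match the expression (a substring match is not enough); is_enabled is a helper invented for this example, not part of Nutch.

```python
import re

# The plugin.includes value from nutch-site.xml above.
PLUGIN_INCLUDES = (
    r"protocol-http|urlfilter-regex|parse-html|index-(basic|anchor)"
    r"|query-(basic|site|url)|response-(json|xml)|summary-basic"
    r"|scoring-opic|urlnormalizer-(pass|regex|basic)"
)


def is_enabled(plugin_id, includes=PLUGIN_INCLUDES):
    """Return True if plugin_id is selected by the includes expression."""
    # Assumption: the entire id must match, so "protocol-httpclient"
    # is not enabled just because "protocol-http" is listed.
    return re.fullmatch(includes, plugin_id) is not None


for plugin in ("index-anchor", "protocol-http", "protocol-httpclient", "parse-pdf"):
    print(plugin, is_enabled(plugin))
```

Under that assumption, index-anchor and protocol-http are enabled, while protocol-httpclient and parse-pdf are not.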

b. Replace the contents of regex-urlfilter.txt in apache-nutch-1.0/conf with the following:

-^(https|telnet|file|ftp|mailto):

# skip some suffixes
-\.(swf|SWF|doc|DOC|mp3|MP3|WMV|wmv|txt|TXT|rtf|RTF|avi|AVI|m3u|M3U|flv|FLV|WAV|wav|mp4|MP4|avi|AVI|rss|RSS|xml|XML|pdf|PDF|js|JS|gif|GIF|jpg|JPG|png|PNG|ico|ICO|css|sit|eps|wmf|zip|ppt|mpg|xls|gz|rpm|tgz|mov|MOV|exe|jpeg|JPEG|bmp|BMP)$

# skip URLs containing certain characters as probable queries, etc.
-[?*!@=]

# allow urls in lucidimagination.com domain
+^http://([a-z0-9\-A-Z]*\.)*lucidimagination.com/

# deny anything else
-.
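The rules in this file are evaluated top to bottom, and the first pattern found in a URL decides whether it is accepted (+) or rejected (-). The Python sketch below imitates that behaviour so a rule set can be tried against a few URLs before crawling. It is only an approximation: Nutch uses Java regular expressions, the suffix rule is shortened here for readability, and accepts is a name made up for this example.

```python
import re

# (sign, pattern) pairs in file order. The suffix rule is shortened;
# the other patterns are copied from regex-urlfilter.txt above.
RULES = [
    ("-", r"^(https|telnet|file|ftp|mailto):"),
    ("-", r"\.(gif|GIF|jpg|JPG|png|PNG|pdf|PDF|css|js|JS|zip)$"),
    ("-", r"[?*!@=]"),
    ("+", r"^http://([a-z0-9\-A-Z]*\.)*lucidimagination.com/"),
    ("-", r"."),
]


def accepts(url, rules=RULES):
    """Return True if the first rule whose pattern occurs in url is a '+' rule."""
    for sign, pattern in rules:
        if re.search(pattern, url):
            return sign == "+"
    # No rule matched at all: treat the URL as rejected.
    return False


for url in (
    "http://www.lucidimagination.com/",
    "http://www.lucidimagination.com/search?q=solr",
    "http://www.lucidimagination.com/logo.png",
    "https://www.lucidimagination.com/",
    "http://example.com/",
):
    print(accepts(url), url)
```

Only the first URL is accepted. The others are rejected by the query-character rule, the suffix rule, the scheme rule, and the final catch-all rule respectively.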

5. Create the seed URL list (the starting points of the crawl).

mkdir urls
echo "http://www.lucidimagination.com/" > urls/seed.txt

6. Inject the seed URLs into the Nutch crawldb.

bin/nutch inject crawl/crawldb urls

konpyuta:~/work/nutch tf0054$ bin/nutch inject crawl/crawldb urls
Injector: starting
Injector: crawlDb: crawl/crawldb
Injector: urlDir: urls
Injector: Converting injected urls to crawl db entries.
Injector: Merging injected urls into crawl db.
Injector: done
konpyuta:~/work/nutch tf0054$

7. Generate a fetch list, then fetch the pages and parse their content.

bin/nutch generate crawl/crawldb crawl/segments

konpyuta:~/work/nutch tf0054$ bin/nutch generate crawl/crawldb crawl/segments
Generator: Selecting best-scoring urls due for fetch.
Generator: starting
Generator: segment: crawl/segments/20090416032246
Generator: filtering: true
Generator: jobtracker is 'local', generating exactly one partition.
Generator: Partitioning selected urls by host, for politeness.
Generator: done.
konpyuta:~/work/nutch tf0054$

The command above creates a new segment directory under crawl/segments. At this point it contains only the files that store the URL(s) to be fetched. The following commands take the latest segment directory as a parameter, so store it in an environment variable:

export SEGMENT=crawl/segments/`ls -tr crawl/segments|tail -1`

Start fetching.

bin/nutch fetch $SEGMENT -noParsing

konpyuta:~/work/nutch tf0054$ bin/nutch fetch $SEGMENT -noParsing
Fetcher: Your 'http.agent.name' value should be listed first in 'http.robots.agents' property.
Fetcher: starting
Fetcher: segment: crawl/segments/20090416032246
Fetcher: threads: 10
QueueFeeder finished: total 1 records.
fetching http://www.lucidimagination.com/
-finishing thread FetcherThread, activeThreads=1
-finishing thread FetcherThread, activeThreads=1
-finishing thread FetcherThread, activeThreads=1
-finishing thread FetcherThread, activeThreads=1
-finishing thread FetcherThread, activeThreads=1
-finishing thread FetcherThread, activeThreads=1
-finishing thread FetcherThread, activeThreads=1
-finishing thread FetcherThread, activeThreads=1
-finishing thread FetcherThread, activeThreads=1
-activeThreads=1, spinWaiting=0, fetchQueues.totalSize=0
-activeThreads=1, spinWaiting=0, fetchQueues.totalSize=0
-finishing thread FetcherThread, activeThreads=0
-activeThreads=0, spinWaiting=0, fetchQueues.totalSize=0
-activeThreads=0
Fetcher: done
konpyuta:~/work/nutch tf0054$

Parse the content.

bin/nutch parse $SEGMENT

konpyuta:~/work/nutch tf0054$ bin/nutch parse $SEGMENT
konpyuta:~/work/nutch tf0054$

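Next, update the Nutch crawldb.
The updatedb command stores the URLs newly discovered while fetching and parsing this segment in the crawldb, so that they can be fetched in a later cycle. Nutch also records the state of each fetched page, so the same page is not fetched over and over.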

bin/nutch updatedb crawl/crawldb $SEGMENT -filter -normalize

konpyuta:~/work/nutch tf0054$ bin/nutch updatedb crawl/crawldb $SEGMENT -filter -normalize
CrawlDb update: starting
CrawlDb update: db: crawl/crawldb
CrawlDb update: segments: [crawl/segments/20090416032246]
CrawlDb update: additions allowed: true
CrawlDb update: URL normalizing: true
CrawlDb update: URL filtering: true
CrawlDb update: Merging segment data into db.
CrawlDb update: done
konpyuta:~/work/nutch tf0054$

This completes one crawl cycle.
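Repeating the generate, fetch, parse and updatedb commands from step 7 crawls one level deeper each time. A minimal Python sketch of such a cycle follows, assuming it is run from the Nutch directory and that segment directories are named by timestamp (as in the log above), so that the largest name is the newest. That differs slightly from the ls -tr approach, which sorts by modification time. The function names are invented for this example; the command-line arguments are the ones used in this post.

```python
import os
import subprocess

NUTCH = "bin/nutch"
CRAWLDB = "crawl/crawldb"
SEGMENTS = "crawl/segments"


def latest_segment(segments_dir=SEGMENTS):
    """Return the path of the newest segment directory."""
    # Assumption: names are timestamps such as 20090416032246,
    # so the lexicographically largest name is the newest segment.
    return segments_dir + "/" + max(os.listdir(segments_dir))


def crawl_cycle(run=subprocess.check_call, find_segment=latest_segment):
    """Run one generate / fetch / parse / updatedb cycle and return the segment."""
    run([NUTCH, "generate", CRAWLDB, SEGMENTS])
    segment = find_segment()  # the segment that generate just created
    run([NUTCH, "fetch", segment, "-noParsing"])
    run([NUTCH, "parse", segment])
    run([NUTCH, "updatedb", CRAWLDB, segment, "-filter", "-normalize"])
    return segment
```

Calling crawl_cycle() a few times in a loop, after the inject step, would then crawl a few levels deep. The run and find_segment parameters exist only so that the command sequence can be checked without a Nutch installation.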

8. Create the linkdb.

bin/nutch invertlinks crawl/linkdb -dir crawl/segments

konpyuta:~/work/nutch tf0054$ bin/nutch invertlinks crawl/linkdb -dir crawl/segments

LinkDb: starting
LinkDb: linkdb: crawl/linkdb
LinkDb: URL normalize: true
LinkDb: URL filter: true
LinkDb: adding segment: file:/Users/tf0054/work/nutch-2009-04-15_04-01-57/crawl/segments/20090416032246
LinkDb: done
konpyuta:~/work/nutch tf0054$

9. Send the content of all segments to Solr.

bin/nutch solrindex http://127.0.0.1:8983/solr/ crawl/crawldb crawl/linkdb crawl/segments/*

10. Check that the content is now searchable in Solr.

http://127.0.0.1:8983/solr/nutch/?q=solr&version=2.2&start=0&rows=10&indent=on&wt=json
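The same check can be scripted. The following Python sketch builds the query URL shown above and pulls the url field out of the JSON response. It assumes Solr is running locally with the /nutch handler configured as in step 2 and that the response uses the usual Solr JSON layout (response, then docs). The function names are made up for this example.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen

HANDLER_URL = "http://127.0.0.1:8983/solr/nutch/"


def build_query_url(query, start=0, rows=10, base=HANDLER_URL):
    """Build the request URL for the /nutch handler with JSON output."""
    params = [("q", query), ("version", "2.2"), ("start", start),
              ("rows", rows), ("indent", "on"), ("wt", "json")]
    return base + "?" + urlencode(params)


def result_urls(response):
    """Extract the url field of every document from a parsed Solr response."""
    # Assumed layout: {"response": {"docs": [{"url": "..."}, ...]}}
    return [doc["url"] for doc in response["response"]["docs"]]


def search(query):
    """Query the running Solr instance and return the matching page URLs."""
    with urlopen(build_query_url(query)) as resp:
        return result_urls(json.loads(resp.read().decode("utf-8")))
```

With Solr running, search("solr") should return the URLs of the crawled pages that match; an empty list would suggest that the solrindex step did not send anything.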

- - - - -

Thursday, April 16, 2009
