NCrawler is designed to be extensible, so you can write custom plugins for most aspects of the crawling process. Although extensibility is mostly used to implement custom pipeline steps, you can also write custom thread handling, custom storage, custom robots filtering, custom logging, and custom downloaders.

NCrawler provides out-of-the-box implementations of all of these. Typically you don't need to change any of them, although you can. Some of them come in several variants, for example the storage implementations. Currently the different storage implementations are:

In Memory
Isolated File Storage
File Storage (next release)
MS SQL Server / MS SQL Server Express Database Storage
SQLite storage

With so many storage options, which one should you choose? In general, choose in-memory storage for smaller crawls and disk-based storage for larger crawls. For distributed crawls you would most likely choose database storage.

NCrawler basically needs to keep track of two things that require storage: the queue, which holds the pages it needs to crawl next, and the history, which holds the pages it has already visited. NCrawler only needs implementations of these two interfaces to do this:

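// Tracks the pages that have already been visited.
public interface ICrawlerHistory
{
	long VisitedCount { get; }
	bool Register(string key);
}

// Holds the pages that are waiting to be crawled.
public interface ICrawlerQueue
{
	CrawlerQueueEntry Pop();
	void Push(CrawlerQueueEntry crawlerQueueEntry);
	long Count { get; }
}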
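
To show how little a storage backend has to do, here is a minimal sketch of an in-memory implementation of both interfaces. It is not NCrawler's own code. The interfaces are repeated so the example compiles on its own, and CrawlerQueueEntry is a hypothetical stand-in with a single Url property. The sketch also assumes that Register returns true only the first time a key is seen and that Pop returns null when the queue is empty; the real library may behave differently.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical stand-in for NCrawler's CrawlerQueueEntry (assumption: real members differ).
public class CrawlerQueueEntry
{
    public string Url { get; set; }
}

public interface ICrawlerHistory
{
    long VisitedCount { get; }
    bool Register(string key);
}

public interface ICrawlerQueue
{
    CrawlerQueueEntry Pop();
    void Push(CrawlerQueueEntry crawlerQueueEntry);
    long Count { get; }
}

// Sketch: history backed by a HashSet, guarded by a lock for multi-threaded crawls.
public class SketchInMemoryHistory : ICrawlerHistory
{
    private readonly HashSet<string> _visited = new HashSet<string>();
    private readonly object _sync = new object();

    public long VisitedCount
    {
        get { lock (_sync) { return _visited.Count; } }
    }

    // Assumption: true means the key was new, false means it was already registered.
    public bool Register(string key)
    {
        lock (_sync) { return _visited.Add(key); }
    }
}

// Sketch: first-in, first-out queue of pending pages.
public class SketchInMemoryQueue : ICrawlerQueue
{
    private readonly Queue<CrawlerQueueEntry> _entries = new Queue<CrawlerQueueEntry>();
    private readonly object _sync = new object();

    public long Count
    {
        get { lock (_sync) { return _entries.Count; } }
    }

    // Assumption: returns null when there is nothing left to crawl.
    public CrawlerQueueEntry Pop()
    {
        lock (_sync) { return _entries.Count > 0 ? _entries.Dequeue() : null; }
    }

    public void Push(CrawlerQueueEntry crawlerQueueEntry)
    {
        lock (_sync) { _entries.Enqueue(crawlerQueueEntry); }
    }
}

public static class Program
{
    public static void Main()
    {
        ICrawlerHistory history = new SketchInMemoryHistory();
        ICrawlerQueue queue = new SketchInMemoryQueue();

        // Only enqueue URLs that have not been seen before.
        foreach (string url in new[] { "http://example.com/", "http://example.com/a", "http://example.com/" })
        {
            if (history.Register(url))
            {
                queue.Push(new CrawlerQueueEntry { Url = url });
            }
        }

        Console.WriteLine("Queued: " + queue.Count + ", visited: " + history.VisitedCount);
    }
}
```

A disk or database backend would keep the same two interfaces and replace the HashSet and Queue with persistent storage.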

So NCrawler really does not need to store a lot of information. That means the in-memory implementation can handle quite large web sites: 100,000 pages easily on most machines.
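
A rough back-of-the-envelope estimate illustrates why in-memory storage scales this far. The figures below are assumptions made for illustration, not measurements of NCrawler: an average key length of 100 characters, 2 bytes per character for .NET strings, and about 50 bytes of per-entry overhead.

```csharp
using System;

public static class MemoryEstimate
{
    // Estimated bytes needed to hold the given number of URL keys in memory.
    public static long EstimateBytes(long pages, int averageKeyLength, int bytesPerChar, int overheadPerEntry)
    {
        return pages * ((long)averageKeyLength * bytesPerChar + overheadPerEntry);
    }

    public static void Main()
    {
        // All inputs are illustrative assumptions, not values taken from NCrawler.
        long bytes = EstimateBytes(100000, 100, 2, 50);
        double mebibytes = bytes / (1024.0 * 1024.0);
        Console.WriteLine("Estimated history size: " + mebibytes.ToString("F1") + " MiB");
    }
}
```

Under these assumed figures the history for 100,000 pages needs roughly 24 MiB, which fits comfortably in memory on most machines.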

All of the storage implementations (except In Memory) also support a resume option in case you need to stop a crawl and continue it later. You need to enable this manually; out of the box, resume is disabled.

In the end, you need to decide which storage option is appropriate for your application. There are many parameters to consider, and only you know them all.

Last edited Feb 14, 2010 at 9:33 AM by EsbenCarlsen, version 1

