The simplest example, which crawls http://ncrawler.codeplex.com with one thread (the default) to a maximum depth of 3 (the default), is:
using(Crawler c = new Crawler("http://ncrawler.codeplex.com/", new HtmlDocumentProcessor()))
{
	c.Crawl();
}
The HtmlDocumentProcessor is a pipeline step that parses HTML (using HtmlAgilityPack), extracts links and references, and adds them to the crawler queue. For every download NCrawler makes, it executes a pipeline. A pipeline consists of pipeline steps, each performing a different task. You can build custom steps by implementing the IPipelineStep interface; HtmlDocumentProcessor implements this interface.
public interface IPipelineStep
{
    void Process(Crawler crawler, PropertyBag propertyBag);
}
You specify which pipeline steps NCrawler should execute by passing them as parameters to the constructor of the Crawler class, as shown above.
Now let's make a pipeline step that visually shows us what NCrawler is actually crawling. Let's call it DumperStep, because all it does is write the downloaded URL to the console.
public class DumperStep : IPipelineStep
{
	public void Process(Crawler crawler, PropertyBag propertyBag)
	{
		Console.Out.WriteLine(propertyBag.Step.Url);
	}
}
OK, to let NCrawler execute the step, let's add it to the Crawler constructor:
using(Crawler c = new Crawler("http://ncrawler.codeplex.com/", new HtmlDocumentProcessor(), new DumperStep()))
{
	c.Crawl();
}
So now we get the URLs dumped to the console window, but it is a bit slow. Let's tell NCrawler to use 3 threads:
using(Crawler c = new Crawler("http://ncrawler.codeplex.com/", new HtmlDocumentProcessor(), new DumperStep()))
{
	c.ThreadCount = 3;
	c.Crawl();
}
OK, we only want to crawl the first two levels of the site, so let's tell NCrawler to stop traversing after 2 steps:
using(Crawler c = new Crawler("http://ncrawler.codeplex.com/", new HtmlDocumentProcessor(), new DumperStep()))
{
	c.ThreadCount = 3;
	c.MaxCrawlDepth = 2;
	c.Crawl();
}
OK, great. We don't really want to download images, stylesheets, or scripts, so let's tell NCrawler to exclude them:
using(Crawler c = new Crawler("http://ncrawler.codeplex.com/", new HtmlDocumentProcessor(), new DumperStep()))
{
	c.ThreadCount = 3;
	c.MaxCrawlDepth = 2;
	c.ExcludeFilter = new[] { new RegexFilter(
			new Regex(@"(\.jpg|\.css|\.js|\.gif|\.jpeg|\.png|\.ico)",
				RegexOptions.Compiled | RegexOptions.CultureInvariant | RegexOptions.IgnoreCase)) };
	c.Crawl();
}

It is that easy. NCrawler has many more built-in features (pipeline steps), such as extracting text from PDF, RTF, Word, and Excel documents, and entity extraction, such as e-mail addresses. But the really cool thing is the ability to implement your own logic in the pipeline.
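As a sketch of such custom logic, here is a hypothetical EmailExtractionStep that scans each downloaded document for e-mail addresses with a regular expression. It assumes the PropertyBag exposes the text extracted by an earlier step (such as HtmlDocumentProcessor) through a Text property; the step name and the pattern are illustrative, not part of NCrawler itself.
public class EmailExtractionStep : IPipelineStep
{
	// Illustrative pattern; not a complete RFC 5322 validator
	private static readonly Regex EmailRegex = new Regex(
		@"[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,}",
		RegexOptions.Compiled | RegexOptions.CultureInvariant);

	public void Process(Crawler crawler, PropertyBag propertyBag)
	{
		// Assumption: propertyBag.Text holds the text extracted by an
		// earlier pipeline step such as HtmlDocumentProcessor
		string text = propertyBag.Text;
		if (string.IsNullOrEmpty(text))
		{
			return;
		}

		foreach (Match match in EmailRegex.Matches(text))
		{
			Console.Out.WriteLine("Found e-mail: " + match.Value);
		}
	}
}
Add it to the Crawler constructor after HtmlDocumentProcessor, just like DumperStep above, and it will run for every document NCrawler downloads.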
