
Detecting Bots and crawlers


Posted Date: 25 Oct 2009    Resource Type: Articles    Category: ASP.NET/Web Applications
Author: ABitSmart    Member Level: Diamond
Rating: 1 out of 5    Points: 7



This article discusses a method to detect bot and crawler visits to your site.

Recently, I was developing a model to collect visitor information for our site. The stats showed a mind-boggling 10,000 visits a day. I was stunned, because this was highly unlikely for us. Google Analytics, which I also used for tracking, showed a very different number: 2,000. We are all aware of the arguments about the difference between script-based tracking (Google Analytics) and server-log-based tracking, and I was experiencing exactly that. After reading up on the subject, I became very suspicious of spider and crawler visits. So the journey started to find and categorize the bot/crawler visits.

The easiest method to detect crawlers and bots is to dissect the browser's UserAgent string. It describes the type of web client making the request and contains all the information we want. Examples of UserAgent strings are:

Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 6.0; SLCC1; .NET CLR 2.0.50727)
Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)
Mozilla/5.0 (compatible; Yahoo! Slurp; +http://help.yahoo.com/help/us/ysearch/slurp)

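If you look closely, the text "bot" appears in the second string. So the simplest code snippet to detect a bot would be:

// Case-insensitive search, so "Googlebot" and "msnbot" both match
if (Request.UserAgent != null &&
    Request.UserAgent.IndexOf("bot", StringComparison.OrdinalIgnoreCase) >= 0)
{
    // Bot!!
}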


It looks simple, but it is a horrible way to do it. Some crawlers do not have the text "bot" in their UserAgent string (the Yahoo! Slurp string above is one), so this approach needs a separate check for every keyword that crawlers and bots use.

The next method uses functionality built into the .NET Framework. Yes, it's true: there is a built-in way to find out whether a request has come from a bot/crawler. The Browser object of the HttpRequest has a Crawler property that detects crawlers. Here is how to use it:
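
To make the "one check per keyword" problem concrete, here is a minimal sketch of a keyword-list check. The class name, method name and keyword list are hypothetical and chosen only for illustration; they are not part of the .NET Framework, and the list is far from complete.

```csharp
using System;

public static class UserAgentKeywords
{
    // Hypothetical keyword list for illustration only; a real list would be
    // built from the UserAgent strings seen in your own logs.
    private static readonly string[] BotKeywords = { "bot", "crawler", "spider", "slurp" };

    // Returns true when the UserAgent contains any of the keywords (case-insensitive).
    public static bool LooksLikeBot(string userAgent)
    {
        if (string.IsNullOrEmpty(userAgent))
        {
            return false; // nothing to inspect
        }

        foreach (string keyword in BotKeywords)
        {
            if (userAgent.IndexOf(keyword, StringComparison.OrdinalIgnoreCase) >= 0)
            {
                return true;
            }
        }

        return false;
    }
}
```

In a page this could be called as UserAgentKeywords.LooksLikeBot(Request.UserAgent). Every new crawler still needs a new keyword, which is exactly the maintenance burden described above.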

if (Request.Browser.Crawler)
{
// Bot!!
}


Nice - neat and clean. One call does all!

But wait, it is not quite that simple either. You need a .browser file with rules that identify crawlers. The good news is that the .NET Framework installs a number of these files by default, and they are perhaps kept updated in service packs. Nevertheless, you can still add your own to cover anything that is missing. A .browser file can be added to your application's App_Browsers directory for application-only scope, or to the %SystemRoot%\Microsoft.NET\Framework\version\CONFIG\Browsers directory for machine-wide scope. Note: replace version with the version folder name on your machine.
After adding a file to the machine-wide Browsers directory, it is necessary to run the ASP.NET Browser Registration tool (aspnet_regbrowsers.exe). The tool compiles the browser definitions in the Browsers directory of the version of the .NET Framework that corresponds to the tool version; each version of the .NET Framework has its own copy of the tool. It parses and compiles all system-wide browser definitions into an assembly and installs the assembly in the global assembly cache. If there are errors in the system-wide browser definitions, the tool reports them. The browser capabilities assembly is used by all Web applications on the system. Note that you can also recompile system-wide browser definition files by using the BrowserCapabilitiesCodeGenerator class.
e.g.,
C:\WINDOWS\Microsoft.NET\Framework\version\aspnet_regbrowsers.exe -i
or
Just open the VS command prompt and run, aspnet_regbrowsers.exe -i
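
For illustration, a minimal .browser file might look like the sketch below. The crawler name "ExampleCrawler" and the file name ExampleCrawler.browser are made up for this example; substitute the UserAgent token of the crawler you actually want to flag.

```xml
<!-- ExampleCrawler.browser (hypothetical file name) -->
<browsers>
  <!-- id must be unique; parentID="Default" inherits the base browser definition -->
  <browser id="ExampleCrawler" parentID="Default">
    <identification>
      <!-- match is a regular expression tested against the UserAgent string -->
      <userAgent match="ExampleCrawler" />
    </identification>
    <capabilities>
      <!-- this capability is what Request.Browser.Crawler reports -->
      <capability name="crawler" value="true" />
    </capabilities>
  </browser>
</browsers>
```

With a definition like this in place, a request whose UserAgent contains "ExampleCrawler" should report Request.Browser.Crawler as true.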

I use the Oceans.Browser file in addition to the ones provided with the .NET Framework.

The above method does work, but I have added a small enhancement to improve the detection of bots/crawlers. In addition to the above, I also run a regex to find certain bot-related keywords in the UserAgent. So my final check goes like this:

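// Requires: using System.Text.RegularExpressions;
if (request.Browser.Crawler)
{
    // Bot!! (matched by a .browser definition)
}
else if (!string.IsNullOrEmpty(request.UserAgent))
{
    // Fallback: keywords of crawlers that the .browser files may miss
    Regex regEx = new Regex("slurp|ask|teoma", RegexOptions.IgnoreCase);
    if (regEx.IsMatch(request.UserAgent))
    {
        // Bot!!
    }
}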


Thanks to Peter Bromberg for the regex.
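
The combined check can also be wrapped in a small helper, sketched below. The class and method names are hypothetical, and the Crawler flag is passed in as a plain bool so that the snippet compiles outside ASP.NET. The word boundaries (\b) in the pattern are my own variation, not part of the original regex; they are an assumption meant to stop "ask" from matching inside longer words such as "task".

```csharp
using System;
using System.Text.RegularExpressions;

public static class CrawlerCheck
{
    // Same keywords as the regex above, case-insensitive, with word
    // boundaries added (an assumption of this sketch) to reduce false positives.
    private static readonly Regex BotPattern =
        new Regex(@"\b(?:slurp|ask|teoma)\b", RegexOptions.IgnoreCase | RegexOptions.Compiled);

    // browserSaysCrawler stands in for Request.Browser.Crawler.
    public static bool IsBot(bool browserSaysCrawler, string userAgent)
    {
        if (browserSaysCrawler)
        {
            return true; // a .browser definition already flagged the request
        }

        // Fall back to the keyword regex when a UserAgent is present.
        return !string.IsNullOrEmpty(userAgent) && BotPattern.IsMatch(userAgent);
    }
}
```

In a page this might be called as CrawlerCheck.IsBot(Request.Browser.Crawler, Request.UserAgent). Keeping the Regex in a static field avoids rebuilding it on every request.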

Have fun.

For more details, visit http://abitsmart.com/?p=239


