Detecting Bots and crawlers
This article discusses a method to detect bot and crawler visits to your site.
Recently, I was developing a model to collect visitor information for our site. The stats showed a mind-boggling 10,000 visits a day, which was highly unlikely for us. Google Analytics, which I also used for tracking, showed a different number: 2,000. The gap between script-based tracking (Google Analytics) and server-log-based tracking is well known, and I was experiencing exactly that. After reading up on the subject, I became suspicious of spider and crawler visits, so I set out to find and categorize the bot/crawler visits.
The easiest way to detect crawlers and bots is to dissect the browser's UserAgent string, which describes the web client making the request and contains all the information we want. Examples of UserAgent strings are:
Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 6.0; SLCC1; .NET CLR 2.0.50727)
Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)
Mozilla/5.0 (compatible; Yahoo! Slurp; +http://help.yahoo.com/help/us/ysearch/slurp)
Notice that the text "bot" appears in the second string. So the simplest code snippet to detect a bot would be:
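// UserAgent can be null when the header is missing; IndexOf returns -1 when there is no match.
if (Request.UserAgent != null &&
    Request.UserAgent.IndexOf("bot", StringComparison.OrdinalIgnoreCase) >= 0)
{
    // Bot!!
}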
This looks simple, but it is a poor way to do the job. Some crawlers do not have the text "bot" in their UserAgent string (Yahoo! Slurp above is one example), so you would need a separate check for every keyword the crawlers/bots use.
The next method uses functionality built into the .NET Framework. The Browser object on the HttpRequest has a Crawler property that reports whether the request came from a bot/crawler. Here is how to use it:
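To show what checking each keyword involves, here is a minimal sketch. The class name BotKeywords, the method LooksLikeBot, and the keyword list itself are assumptions made for illustration. Only "bot" and "slurp" come from the UserAgent strings shown above, and a real list would need to be longer and kept up to date.

```csharp
using System;

// Hypothetical helper for illustration only; the keyword list is an assumption.
public static class BotKeywords
{
    private static readonly string[] Keywords = { "bot", "crawler", "spider", "slurp" };

    // Returns true when the UserAgent contains any keyword, ignoring case.
    public static bool LooksLikeBot(string userAgent)
    {
        if (string.IsNullOrEmpty(userAgent))
        {
            return false; // No UserAgent header: nothing to match against.
        }

        foreach (string keyword in Keywords)
        {
            if (userAgent.IndexOf(keyword, StringComparison.OrdinalIgnoreCase) >= 0)
            {
                return true;
            }
        }

        return false;
    }
}
```

In a page, this could be called as BotKeywords.LooksLikeBot(Request.UserAgent). The maintenance cost of the keyword list is exactly the drawback described above.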
if (Request.Browser.Crawler)
{
    // Bot!!
}
Nice - neat and clean. One call does all!
But wait, it is not quite that simple. The Crawler property depends on .browser files that contain the rules for detecting crawlers. The good news is that the .NET Framework installs a number of these files by default, and they are presumably kept up to date through service packs. Nevertheless, you can add your own to cover anything that is missing.
A .browser file can be added to your application's App_Browsers directory for application-only scope, or to the %SystemRoot%\Microsoft.NET\Framework\version\CONFIG\Browsers directory for machine-wide scope. Note: replace version with the version folder name on your machine.
After adding a file to the machine-wide Browsers directory, you must run the ASP.NET Browser Registration tool (aspnet_regbrowsers.exe). The tool compiles the browser definitions in the Browsers directory of the .NET Framework version that corresponds to the tool version; each version of the .NET Framework has its own copy of the tool. It parses and compiles all system-wide browser definitions into an assembly and installs that assembly in the global assembly cache. If there are errors in the system-wide browser definitions, the tool reports them. The resulting browser capabilities assembly is used by all Web applications on the system. Note that you can also recompile system-wide browser definition files by using the BrowserCapabilitiesCodeGenerator class.
For example, run
C:\WINDOWS\Microsoft.NET\Framework\version\aspnet_regbrowsers.exe -i
or just open the VS command prompt and run
aspnet_regbrowsers.exe -i
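For reference, a custom browser definition that flags a crawler might look like the sketch below. The id "ExampleCrawler" and the UserAgent pattern are made-up placeholders, not a real crawler; the match attribute is a regular expression tested against the UserAgent string, and the crawler capability is what Request.Browser.Crawler reads.

```xml
<!-- Hypothetical file, e.g. App_Browsers\ExampleCrawler.browser -->
<browsers>
  <!-- "ExampleCrawler" is a placeholder id; parentID="Default" inherits the base definition. -->
  <browser id="ExampleCrawler" parentID="Default">
    <identification>
      <!-- Regular expression matched against the UserAgent string (placeholder pattern). -->
      <userAgent match="ExampleCrawler" />
    </identification>
    <capabilities>
      <!-- Makes Request.Browser.Crawler return true for matching requests. -->
      <capability name="crawler" value="true" />
    </capabilities>
  </browser>
</browsers>
```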
I use the Oceans.browser file in addition to the ones provided with the .NET Framework.
The above method works, but I have added a small enhancement to improve the detection of bots/crawlers. In addition to the Crawler check, I also run a regex that looks for certain bot-related keywords in the UserAgent. So my final check goes like this:
if (request.Browser.Crawler)
{
    // Bot!!
}
else
{
    if (!string.IsNullOrEmpty(request.UserAgent))
    {
        Regex regEx = new Regex("Slurp|slurp|ask|Ask|Teoma|teoma");
        if (regEx.Match(request.UserAgent).Success)
        {
            // Bot!!
        }
    }
}
Thanks to Peter Bromberg for the regex.
Have fun.
For more details, visit http://abitsmart.com/?p=239