Web Crawler


Write your own Web Crawler.

Simple Web Crawler

Nowadays many web-based projects and websites depend on crawlers.
In particular, a lot of freelance work involves crawling data from websites.

I have written a simple web crawler. It downloads a page from a website, finds the links in it, and lets you follow one of them.

The code for this article follows.

Namespace part


using System;
using System.Net;
using System.IO;


Main code

public class SimpleCrawler {

    public static void Main(string[] args) {
        string Mylink = null;
        string answer;
        int curPoint;   // 1. Holds current location in response

        if (args.Length != 1) {
            Console.WriteLine("Please supply exactly one URL.");
            return;
        }

        string Myuristr = args[0];   // 2. Holds current URI

        try {
            do {
                Console.WriteLine("Connecting to " + Myuristr);

                // 3. Create a WebRequest to the specified URI.
                HttpWebRequest MyHttpWebRequest =
                    (HttpWebRequest) WebRequest.Create(Myuristr);

                Myuristr = null;   // 4. Disallow further use of this URI

                // 5. Send that request and return the response.
                HttpWebResponse MyHttpWebResponse =
                    (HttpWebResponse) MyHttpWebRequest.GetResponse();

                // 6. From the response (not the request), obtain an input stream.
                Stream MyInputString = MyHttpWebResponse.GetResponseStream();

                // 7. Wrap the input stream in a StreamReader.
                StreamReader MyStreamReader = new StreamReader(MyInputString);

                // 8. Read in the entire page.
                string Mystring = MyStreamReader.ReadToEnd();

                curPoint = 0;

                do {
                    // 9. Find the next URI to link to.
                    Mylink = FindMyLink(Mystring, ref curPoint);
                    if (Mylink != null) {
                        Console.WriteLine("Found the link : " + Mylink);
                        Console.Write("Link, More, Quit? ");
                        answer = Console.ReadLine();

                        if (string.Compare(answer, "L", true) == 0) {
                            Myuristr = Mylink;   // follow this link next
                            break;
                        } else if (string.Compare(answer, "Q", true) == 0) {
                            break;
                        } else if (string.Compare(answer, "M", true) == 0) {
                            Console.WriteLine("Searching for another link.");
                        }
                    } else {
                        Console.WriteLine("No link found.");
                        break;
                    }
                } while (Mylink.Length > 0);

                // 10. Close the Response.
                MyHttpWebResponse.Close();
            } while (Myuristr != null);
        }
        catch (Exception exc) {
            Console.WriteLine(exc.Message);
        }
        Console.WriteLine("Terminating Sample Crawler.");
    }

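    // Returns the next absolute link found at or after MystartPoint,
    // or null when there are no more links.
    static string FindMyLink(string MyHtmlstr, ref int MystartPoint) {
        int startPoint, endPoint;
        string Myuri = null;

        // Case-insensitive search for the next href="http... attribute.
        int i = MyHtmlstr.IndexOf("href=\"http", MystartPoint,
                                  StringComparison.OrdinalIgnoreCase);
        if (i != -1) {
            startPoint = MyHtmlstr.IndexOf('"', i) + 1;
            endPoint = MyHtmlstr.IndexOf('"', startPoint);
            if (endPoint != -1) {   // skip an attribute with no closing quote
                Myuri = MyHtmlstr.Substring(startPoint, endPoint - startPoint);
                MystartPoint = endPoint;
            }
        }

        return Myuri;
    }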

}



Explanation of my code, step by step:

1. curPoint holds the current location in the response.
2. Myuristr holds the current URI.
3. WebRequest.Create creates a request for the specified URI.
4. Setting Myuristr to null prevents further use of this URI.
5. GetResponse sends the request and returns the response.
6. GetResponseStream obtains an input stream from the response.
7. The input stream is wrapped in a StreamReader.
8. ReadToEnd reads in the entire page.
9. FindMyLink finds the next URI to link to.
10. Close closes the response.
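
Step 9 finds one link at a time and asks what to do with it. As a minimal sketch, assuming you only want to list every absolute link on a page that has already been downloaded, the same quote-to-quote scan can run in a loop without prompting. The names LinkCollector and CollectLinks and the sample HTML are hypothetical and are not part of the program above.

```csharp
using System;
using System.Collections.Generic;

public static class LinkCollector {

    // Hypothetical helper: returns every href="http..." target in the page,
    // in document order, using the same scan as FindMyLink.
    public static List<string> CollectLinks(string html) {
        var links = new List<string>();
        int pos = 0;

        while (pos < html.Length) {
            // Case-insensitive search for the next absolute link.
            int i = html.IndexOf("href=\"http", pos, StringComparison.OrdinalIgnoreCase);
            if (i == -1) break;

            int start = html.IndexOf('"', i) + 1;   // first character of the URL
            int end = html.IndexOf('"', start);     // closing quote
            if (end == -1) break;                   // no closing quote: stop

            links.Add(html.Substring(start, end - start));
            pos = end + 1;                          // continue after this link
        }
        return links;
    }

    public static void Main() {
        // Sample markup, an assumption for illustration only.
        string html = "<a href=\"http://example.com/a\">A</a> " +
                      "<a HREF=\"https://example.org/b\">B</a> " +
                      "<a href=\"/relative\">skipped</a>";

        foreach (string link in CollectLinks(html)) {
            Console.WriteLine(link);
        }
    }
}
```

With the sample markup this prints http://example.com/a and https://example.org/b. The relative link is skipped because, like FindMyLink, the scan only matches targets that start with http.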

Thanks
Nathan


Comments

Author: Gaurav Aroraa   21 Jun 2010   Member Level: Gold   Points: 1

Nathan,

It's really appreciable code. You did a good job. But does the code work in cases where someone restricts the crawler from checking the words?

Author: Nathan   21 Jun 2010   Member Level: Gold   Points: 1

No Gaurav, it will not.
It is a simple one.
For that we would need to write more code.

Guest Author: Jitendra   06 Nov 2012

For a beginner-level understanding, it is a good sample of a crawler. Good job.

Guest Author: hussain   29 Nov 2012

I have developed a hotel site. Now they have requested a feature: whenever a user posts feedback, a complaint, or a suggestion on any website, it should be fetched and shown directly on our page, so that there is no need to visit all those websites. I think this is a job for a crawler. Can you give any suggestions?

Author: puneet singh   11 Mar 2013   Member Level: Bronze   Points: 0

Sir, I am new to programming. Please help me run the code above; I do not understand it properly.

It crawls the current URL only.


