Web Crawler
Write your own Web Crawler.
Simple Web Crawler
Nowadays, a lot of web-based work and many websites rely on crawlers. In particular, many freelance jobs involve crawling data from websites.
I have written a simple web crawler. It fetches a page from a website and collects the links it contains.
The following is the code segment for this article.
Namespace part
using System;
using System.Net;
using System.IO;
Main code
public class SimpleCrawler {
    public static void Main(string[] args) {
        string Mylink = null;
        string answer;
        int curPoint;            // Holds the current location in the response.

        if (args.Length != 1) {
            Console.WriteLine("Please supply a proper URL");
            return;
        }

        string Myuristr = args[0]; // Holds the current URI.

        try {
            do {
                Console.WriteLine("Connecting to " + Myuristr);

                // Create a WebRequest to the specified URI.
                HttpWebRequest MyHttpWebRequest =
                    (HttpWebRequest)WebRequest.Create(Myuristr);

                // Disallow further use of this URI.
                Myuristr = null;

                // Send that request and obtain the response.
                HttpWebResponse MyHttpWebResponse =
                    (HttpWebResponse)MyHttpWebRequest.GetResponse();

                // From the response, obtain an input stream.
                Stream MyInputStream = MyHttpWebResponse.GetResponseStream();

                // Wrap the input stream in a StreamReader.
                StreamReader MyStreamReader = new StreamReader(MyInputStream);

                // Read in the entire page.
                string Mystring = MyStreamReader.ReadToEnd();

                curPoint = 0;
                do {
                    // Find the next URI to link to.
                    Mylink = FindMyLink(Mystring, ref curPoint);
                    if (Mylink != null) {
                        Console.WriteLine("Found the link : " + Mylink);
                        Console.Write("Link, More, Quit? ");
                        answer = Console.ReadLine();
                        if (string.Compare(answer, "L", true) == 0) {
                            Myuristr = string.Copy(Mylink);
                            break;
                        } else if (string.Compare(answer, "Q", true) == 0) {
                            break;
                        } else if (string.Compare(answer, "M", true) == 0) {
                            Console.WriteLine("Searching for another link.");
                        }
                    } else {
                        Console.WriteLine("No link found.");
                        break;
                    }
                } while (Mylink.Length > 0);

                // Close the response.
                MyHttpWebResponse.Close();
            } while (Myuristr != null);
        }
        catch (Exception exc) {
            Console.WriteLine(exc.Message);
        }
        Console.WriteLine("Terminating Sample Crawler.");
    }

    // Scans MyHtmlstr from MystartPoint for the next href="http..."
    // attribute and returns the URI, or null when no more links exist.
    static string FindMyLink(string MyHtmlstr, ref int MystartPoint) {
        string Myuri = null;
        string Mylowcasestr = MyHtmlstr.ToLower();
        int i = Mylowcasestr.IndexOf("href=\"http", MystartPoint);
        if (i != -1) {
            int startPoint = MyHtmlstr.IndexOf('"', i) + 1;
            int endPoint = MyHtmlstr.IndexOf('"', startPoint);
            if (endPoint == -1)
                return null; // Malformed attribute: no closing quote.
            Myuri = MyHtmlstr.Substring(startPoint, endPoint - startPoint);
            MystartPoint = endPoint;
        }
        return Myuri;
    }
}
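As a quick illustration of FindMyLink, a call like this (the HTML fragment is made up, and the call is assumed to run inside SimpleCrawler, since FindMyLink is private) pulls out the first link and advances the scan position:

int pos = 0;
string html = "<p><a href=\"http://www.example.com/\">home</a></p>";
string link = FindMyLink(html, ref pos);
Console.WriteLine(link);  // prints http://www.example.com/
// pos now indexes the closing quote, so the next call resumes after this link.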
Explanation of my code.
1. Holds current location in response
2. Holds current URI
3. Create a WebRequest to the specified URI.
4. Disallow further use of this URI
5. Send that request and return the response.
6. From the response, obtain an input stream.
7. Wrap the input stream in a StreamReader.
8. Read in the entire page.
9. Find the next URI to link to.
10. Close the response.
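To try the crawler, compile the file and pass a starting URI on the command line (the URL here is just a placeholder):

csc SimpleCrawler.cs
SimpleCrawler.exe http://www.example.com

When a link is found, type L to follow it, M to search the page for another link, or Q to quit.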
Thanks
Nathan
Nathan,
It's really nice code; you did a good job. But does the code work in situations where a site restricts what a crawler may read?
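For example, many sites publish their crawling restrictions in a robots.txt file, which the sample never consults. A minimal check might look like this (a sketch only: it assumes robots.txt sits at the site root, and it ignores User-agent sections and wildcard rules that a real parser would honor):

using System;
using System.Net;

public class RobotsCheck {
    // Downloads /robots.txt and reports whether the given path is
    // covered by any Disallow rule. Treats a missing or unreachable
    // robots.txt as "allowed".
    public static bool IsDisallowed(string host, string path) {
        try {
            using (WebClient client = new WebClient()) {
                string robots =
                    client.DownloadString("http://" + host + "/robots.txt");
                foreach (string line in robots.Split('\n')) {
                    string trimmed = line.Trim();
                    if (trimmed.StartsWith("Disallow:",
                            StringComparison.OrdinalIgnoreCase)) {
                        string rule =
                            trimmed.Substring("Disallow:".Length).Trim();
                        if (rule.Length > 0 && path.StartsWith(rule))
                            return true;
                    }
                }
            }
        } catch (WebException) {
            // No robots.txt (or site unreachable): treat as allowed.
        }
        return false;
    }
}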