Build your own search engine using regular expressions


Posted Date: 03-Mar-2004    Category: .NET Framework


Do you want to build your own web search engine? This C# sample code shows how to download web pages and use regular expressions to parse all hyperlinks from an HTML source.



Have you ever wanted to develop your own search engine to search the pages of your website?

The first thing you need to do is spider through the hyperlinks on each of your pages and store the content of every page in an in-memory collection. In C# you can use a Hashtable to store the URLs and page content as key/value pairs. Searching is then straightforward: look through the stored content for the search term and show the matching URLs.

To build your search index, start by retrieving the content of your root page (homepage) using the following code, where pageUrl is the address of the page and timeoutSeconds is the request timeout in seconds:
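As a rough illustration of the index described above, the sketch below stores pages in a Hashtable and searches them with a case-insensitive substring match. The class name PageIndex, the method Search, and the example.com URLs are assumptions made for this example, and a plain substring match is only one simple way to decide that a page "matches".

```csharp
using System;
using System.Collections;
using System.Collections.Generic;

public static class PageIndex
{
    // Returns the URLs whose stored content contains the term (case-insensitive).
    // PageIndex and Search are hypothetical names used only for this sketch.
    public static List<string> Search(Hashtable pages, string term)
    {
        List<string> matches = new List<string>();
        foreach (DictionaryEntry entry in pages)
        {
            string content = entry.Value as string ?? "";
            if (content.IndexOf(term, StringComparison.OrdinalIgnoreCase) >= 0)
            {
                matches.Add((string)entry.Key);
            }
        }
        // A Hashtable has no defined order, so sort for stable output.
        matches.Sort(StringComparer.Ordinal);
        return matches;
    }

    public static void Main()
    {
        // Key = URL, value = page content (hard-coded here instead of downloaded).
        Hashtable pages = new Hashtable();
        pages["http://example.com/"] = "<html>Welcome to our C# tutorials</html>";
        pages["http://example.com/about"] = "<html>About the team</html>";

        foreach (string url in Search(pages, "tutorials"))
        {
            Console.WriteLine(url); // prints http://example.com/
        }
    }
}
```

On newer versions of .NET, a Dictionary<string, string> would give the same key/value behavior with type safety; Hashtable is used here only because it is the collection the article names.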


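// Downloads pageUrl and returns its HTML, or an empty string on failure.
// The method name GetPageHtml is illustrative. MessageBox requires a
// reference to System.Windows.Forms.
static string GetPageHtml(string pageUrl, int timeoutSeconds)
{
    System.Net.WebResponse response = null;

    try
    {
        // Set up the web request; Timeout is expressed in milliseconds
        System.Net.WebRequest request = System.Net.WebRequest.Create(pageUrl);
        request.Timeout = timeoutSeconds * 1000;

        // Retrieve the response
        response = request.GetResponse();

        // Read the body, assuming the page is UTF-8 encoded
        System.IO.Stream streamReceive = response.GetResponseStream();
        System.Text.Encoding encoding = System.Text.Encoding.GetEncoding("utf-8");
        using (System.IO.StreamReader streamRead = new System.IO.StreamReader(streamReceive, encoding))
        {
            // Return the retrieved HTML
            return streamRead.ReadToEnd();
        }
    }
    catch (Exception ex)
    {
        // An error occurred while downloading; report it and return an empty string.
        MessageBox.Show("Error downloading " + pageUrl + ": " + ex.Message);
        return "";
    }
    finally
    {
        // Close the response if one was opened.
        if (response != null)
        {
            response.Close();
        }
    }
}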


This code retrieves the HTML content of a page. Next, scan the page for hyperlinks and retrieve the content of each linked page. Repeat this recursively until you have covered all pages in your site, keeping track of the URLs you have already visited so that pages linking to each other do not cause an endless loop. Add each URL and its content as a key/value pair to your collection (Hashtable).
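The crawl described above could be sketched roughly as follows. This version uses a queue of pending URLs instead of literal recursion, which visits the same pages without deep call stacks. The names SiteCrawler, Crawl, fetch, extractLinks, and maxPages are all assumptions for illustration: fetch stands in for whatever download routine you use, and extractLinks for whatever link parser you use.

```csharp
using System;
using System.Collections;
using System.Collections.Generic;

public static class SiteCrawler
{
    // Visits startUrl and every page reachable from it, storing url -> content.
    // fetch and extractLinks are caller-supplied hooks (hypothetical names);
    // maxPages is a safety cap so a large site cannot exhaust memory.
    public static Hashtable Crawl(string startUrl,
                                  Func<string, string> fetch,
                                  Func<string, IEnumerable<string>> extractLinks,
                                  int maxPages)
    {
        Hashtable pages = new Hashtable();
        Queue<string> pending = new Queue<string>();
        pending.Enqueue(startUrl);

        while (pending.Count > 0 && pages.Count < maxPages)
        {
            string url = pending.Dequeue();
            if (pages.ContainsKey(url)) continue; // already indexed; prevents endless loops

            string html = fetch(url);
            pages[url] = html;

            foreach (string link in extractLinks(html))
            {
                if (!pages.ContainsKey(link)) pending.Enqueue(link);
            }
        }
        return pages;
    }

    public static void Main()
    {
        // A tiny in-memory "site" standing in for real downloads.
        Dictionary<string, string[]> site = new Dictionary<string, string[]>();
        site["/"] = new string[] { "/a", "/b" };
        site["/a"] = new string[] { "/", "/b" };
        site["/b"] = new string[] { "/a" };

        // The fake page content is "page " + url, so Substring(5) recovers the url.
        Hashtable pages = Crawl("/", url => "page " + url,
                                html => site[html.Substring(5)], 100);
        Console.WriteLine(pages.Count); // prints 3
    }
}
```

Passing the download and link-extraction steps in as delegates keeps the loop easy to try out without network access; in a real crawler they would wrap the download code and the link-matching regular expression.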

You can use the following regular expression to retrieve all hyperlinks from the page content:


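// Requires: using System.Text.RegularExpressions;
// Captures the href value whether it is double-quoted, single-quoted, or unquoted.
Regex regex = new Regex("href\\s*=\\s*(?:\"(?<1>[^\"]*)\"|'(?<1>[^']*)'|(?<1>[^\\s>]+))",
    RegexOptions.IgnoreCase | RegexOptions.Compiled);

// html holds the page source downloaded earlier
for (Match match = regex.Match(html); match.Success; match = match.NextMatch())
{
    MessageBox.Show(match.Groups[1].ToString());
}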
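The href values captured this way are often relative (for example "../about.html") or point to other sites, so they usually need to be normalized before being downloaded. The sketch below shows one possible approach using System.Uri; the names LinkResolver and ResolveSameSite are assumptions for this example, and restricting the crawl to the same host is a design choice rather than a requirement.

```csharp
using System;

public static class LinkResolver
{
    // Converts an href found on pageUrl into an absolute URL, or returns null
    // when the link is unusable, is not http/https, or points to another host.
    // LinkResolver and ResolveSameSite are hypothetical names for this sketch.
    public static string ResolveSameSite(string pageUrl, string href)
    {
        Uri baseUri = new Uri(pageUrl);
        Uri absolute;
        if (!Uri.TryCreate(baseUri, href, out absolute)) return null;

        // Skip mailto:, javascript:, and similar non-web links.
        if (absolute.Scheme != Uri.UriSchemeHttp && absolute.Scheme != Uri.UriSchemeHttps) return null;

        // Stay inside the site being indexed.
        if (!string.Equals(absolute.Host, baseUri.Host, StringComparison.OrdinalIgnoreCase)) return null;

        // Drop the #fragment so the same page is not indexed twice.
        return absolute.GetLeftPart(UriPartial.Query);
    }

    public static void Main()
    {
        string page = "http://example.com/docs/index.html";
        Console.WriteLine(ResolveSameSite(page, "../about.html#team")); // http://example.com/about.html
        Console.WriteLine(ResolveSameSite(page, "http://other.example.org/") == null); // True
    }
}
```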

