
Build your own search engine using regular expressions


Posted Date: 03 Mar 2004    Resource Type: Articles    Category: .NET Framework
Author: Tony John    Member Level: Diamond



Have you ever wanted to build your own search engine to search the pages of your website?

The first step is to spider through the hyperlinks on each of your pages and store each page's content in an in-memory collection. In C#, a Hashtable works well for this: use the URL as the key and the page content as the value. Searching is then easy: look through the Hashtable's values and show the URLs of the pages that match.
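As a minimal sketch of the search step, assuming the Hashtable maps each URL (string) to its page content (string), a case-insensitive substring search might look like the following. The class name PageIndex and the method name Search are hypothetical, chosen only for illustration.

```csharp
using System;
using System.Collections;
using System.Collections.Generic;

// Hypothetical helper: searches an in-memory index of URL -> page content.
public static class PageIndex
{
    // Returns the URLs whose stored content contains the term (case-insensitive).
    // Assumes every key and value in the Hashtable is a string.
    public static List<string> Search(Hashtable pages, string term)
    {
        List<string> matches = new List<string>();
        foreach (DictionaryEntry entry in pages)
        {
            string content = (string)entry.Value;
            if (content.IndexOf(term, StringComparison.OrdinalIgnoreCase) >= 0)
            {
                matches.Add((string)entry.Key);
            }
        }
        return matches;
    }
}
```

A Hashtable does not keep its entries in any particular order, so the matching URLs may come back in any order. Sort them if the display order matters.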

To build your search index, start by retrieving the content of your root page (homepage) using the following code:


// Requires "using System;" and "using System.Windows.Forms;" (for MessageBox).
// The method name and parameters are an illustrative wrapper around the code:
// pageUrl is the address to fetch, timeoutSeconds is the request timeout.
private string GetPageContent(string pageUrl, int timeoutSeconds)
{
    System.Net.WebResponse response = null;

    try
    {
        // Set up the web request
        System.Net.WebRequest request = System.Net.WebRequest.Create(pageUrl);
        request.Timeout = timeoutSeconds * 1000; // Timeout is in milliseconds

        // Retrieve the response
        response = request.GetResponse();

        System.IO.Stream streamReceive = response.GetResponseStream();
        System.Text.Encoding encoding = System.Text.Encoding.UTF8;
        System.IO.StreamReader streamRead = new System.IO.StreamReader(streamReceive, encoding);

        // Return the retrieved HTML
        return streamRead.ReadToEnd();
    }
    catch (Exception ex)
    {
        // An error occurred while fetching the page: report it and return an empty string.
        MessageBox.Show("Error: " + ex.Message);
        return "";
    }
    finally
    {
        // Close the response if one was opened.
        if (response != null)
        {
            response.Close();
        }
    }
}


This code retrieves the HTML content of a page. Next, scan the page for hyperlinks and retrieve the content of each linked page. Repeat this recursively until you have covered every page in your site, adding each URL and its content to your collection (Hashtable) as a key-value pair. Keep track of the URLs you have already visited so that pages linking to each other do not send the spider into an endless loop.
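The crawl described above could be sketched roughly as follows. This is one possible shape, not a definitive implementation: it uses a queue rather than literal recursion, and the names SiteCrawler, BuildIndex, fetch, extractLinks, and maxPages are all assumptions introduced for illustration. The caller supplies fetch (URL to HTML) and extractLinks (page URL and HTML to the absolute URLs found on that page, ideally limited to your own site).

```csharp
using System;
using System.Collections;
using System.Collections.Generic;

// Hypothetical crawler: builds a Hashtable of URL -> page content.
public static class SiteCrawler
{
    public static Hashtable BuildIndex(
        string rootUrl,
        Func<string, string> fetch,                               // URL -> HTML
        Func<string, string, IEnumerable<string>> extractLinks,   // (page URL, HTML) -> links
        int maxPages)                                             // safety limit on index size
    {
        Hashtable pages = new Hashtable();
        Queue<string> pending = new Queue<string>();
        pending.Enqueue(rootUrl);

        while (pending.Count > 0 && pages.Count < maxPages)
        {
            string url = pending.Dequeue();
            if (pages.ContainsKey(url))
            {
                continue; // already indexed, so skip it to avoid loops
            }

            string html = fetch(url);
            pages[url] = html;

            // Queue every link that has not been indexed yet.
            foreach (string link in extractLinks(url, html))
            {
                if (!pages.ContainsKey(link))
                {
                    pending.Enqueue(link);
                }
            }
        }
        return pages;
    }
}
```

Passing the fetch and link-extraction steps in as delegates keeps the loop easy to try out against a small in-memory set of pages before pointing it at a live site.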

You can use the following regular expression to retrieve all hyperlinks from the page content:


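// Requires "using System.Text.RegularExpressions;".
// Matches href="value" (quoted) or href=value (unquoted); the link target is captured in group 1.
Regex regex = new Regex("href\\s*=\\s*(?:\"(?<1>[^\"]*)\"|(?<1>\\S+))",
    RegexOptions.IgnoreCase | RegexOptions.Compiled);

// Walk through every match in the page's HTML and show the captured link.
for (Match match = regex.Match(html); match.Success; match = match.NextMatch())
{
    MessageBox.Show(match.Groups[1].ToString());
}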
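The links captured this way are often relative (for example, /about.html) or point to other sites. As a hedged sketch, assuming you only want pages on the same host as the page being scanned, the matches could be collected into a list of absolute URLs like this. The names LinkExtractor and ExtractLinks are hypothetical; the regular expression is the one shown above.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

// Hypothetical helper: collects same-host links from a page as absolute URLs.
public static class LinkExtractor
{
    private static readonly Regex HrefRegex = new Regex(
        "href\\s*=\\s*(?:\"(?<1>[^\"]*)\"|(?<1>\\S+))",
        RegexOptions.IgnoreCase | RegexOptions.Compiled);

    public static List<string> ExtractLinks(string pageUrl, string html)
    {
        Uri baseUri = new Uri(pageUrl);
        List<string> links = new List<string>();

        for (Match match = HrefRegex.Match(html); match.Success; match = match.NextMatch())
        {
            // Resolve relative links against the URL of the page they came from.
            Uri absolute;
            if (!Uri.TryCreate(baseUri, match.Groups[1].Value, out absolute))
            {
                continue; // not a usable URL
            }
            if (absolute.Host != baseUri.Host)
            {
                continue; // link leaves the site (or has no host, e.g. mailto:)
            }

            // Drop any #fragment so the same page is not listed twice.
            string url = absolute.GetLeftPart(UriPartial.Query);
            if (!links.Contains(url))
            {
                links.Add(url);
            }
        }
        return links;
    }
}
```

Comparing hosts is a simple way to stay within one site. A site served from several host names (for example, with and without "www") would need a looser check.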


