How to parse Google XML sitemaps in .NET


Do you run a website that has a Google sitemap? Are you looking for C# or VB.NET code samples to load and parse Google sitemap files? Find some examples here.

I was writing a C# app to detect copied articles. The tool reads the Google sitemap file of any website or blog, parses the URLs in it, and then searches the web to see whether other websites have copied the same content.

I looked online for ready-made C# or VB.NET code samples to save some time. Unfortunately, I could not find any. So I decided to spend a few minutes writing the code on my own.

Here is the C# code I wrote to parse Google sitemap files. This sample loads the sitemap from its URL, parses the URL entries in the file, and returns all the URLs as a single appended string. You may decide to do something else with the URLs after parsing them.

Code to parse a Google sitemap file



// Namespaces needed by this method (place these at the top of the file).
using System.Text;
using System.Xml;

private string ParseSitemapFile(string url)
{
    XmlDocument sitemapXmlDoc = new XmlDocument();

    // Load the sitemap file from the sitemap URL.
    sitemapXmlDoc.Load(url);

    StringBuilder sitemapContent = new StringBuilder();

    // Iterate through the top-level nodes and find the "urlset" node.
    foreach (XmlNode topNode in sitemapXmlDoc.ChildNodes)
    {
        if (topNode.Name.ToLower() == "urlset")
        {
            // Register the sitemap namespace under the prefix "ns",
            // so that XPath queries can match the namespaced nodes.
            XmlNamespaceManager nsmgr = new XmlNamespaceManager(sitemapXmlDoc.NameTable);
            nsmgr.AddNamespace("ns", topNode.NamespaceURI);

            // Get all "url" nodes and iterate through them.
            XmlNodeList urlNodes = topNode.ChildNodes;
            foreach (XmlNode urlNode in urlNodes)
            {
                // Get the "loc" node and retrieve its inner text.
                XmlNode locNode = urlNode.SelectSingleNode("ns:loc", nsmgr);
                string link = locNode != null ? locNode.InnerText : "";

                // Add the URL to our string builder, followed by a line break.
                sitemapContent.AppendLine(link);
            }
        }
    }

    return sitemapContent.ToString();
}
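
If you would rather work with the URLs one by one than as a single appended string, the same idea can be sketched with LINQ to XML. This is only a minimal sketch and not the code used in the tool described above: the SitemapUrls class and its Extract method are names made up for this example, and it assumes the root element is "urlset", as in a standard sitemap file.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Xml.Linq;

// Hypothetical helper for illustration; not part of the original sample.
public static class SitemapUrls
{
    // Returns every <loc> value found directly under a <url> element.
    public static List<string> Extract(XDocument sitemap)
    {
        XElement root = sitemap.Root;

        // Assumption: anything whose root is not "urlset" is not a sitemap.
        if (root == null ||
            !string.Equals(root.Name.LocalName, "urlset", StringComparison.OrdinalIgnoreCase))
        {
            return new List<string>();
        }

        // Reuse the namespace declared on the root element,
        // just as the XmlNamespaceManager version does.
        XNamespace ns = root.Name.Namespace;

        return root.Elements(ns + "url")
                   .Select(url => url.Element(ns + "loc"))
                   .Where(loc => loc != null)          // skip entries without <loc>
                   .Select(loc => loc.Value.Trim())
                   .ToList();
    }
}
```

It could be called as SitemapUrls.Extract(XDocument.Load(url)), where url is the sitemap URL. Unlike the method above, this sketch skips entries that have no "loc" node instead of adding an empty line for them.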


Here is the content of a sample sitemap file, in case you are not familiar with Google sitemap files:

<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
<url>
<loc>http://www.techulator.com/articles/Disk-Defragmentation.aspx</loc>
<lastmod>2012-04-25</lastmod>
<changefreq>weekly</changefreq>
<priority>0.8</priority>
</url>
<url>
<loc>http://www.techulator.com/articles/Google-latest-update.aspx</loc>
<lastmod>2012-04-25</lastmod>
<changefreq>weekly</changefreq>
<priority>0.8</priority>
</url>
</urlset>
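
Each "url" entry in the sample above also carries "lastmod", "changefreq" and "priority" values, which the parsing code in this article does not read. If you need them, something along these lines might work. Treat it as a rough sketch: the SitemapEntry and SitemapEntries names are invented for this example, and it assumes the sitemap content is already available as a well-formed XML string.

```csharp
using System;
using System.Collections.Generic;
using System.Globalization;
using System.Linq;
using System.Xml.Linq;

// Hypothetical type for illustration; holds the values of one <url> entry.
public sealed class SitemapEntry
{
    public string Loc { get; set; }
    public string LastMod { get; set; }
    public string ChangeFreq { get; set; }
    public double? Priority { get; set; }
}

// Hypothetical helper for illustration; not part of the original sample.
public static class SitemapEntries
{
    public static List<SitemapEntry> Parse(string xml)
    {
        XElement root = XDocument.Parse(xml).Root;
        XNamespace ns = root.Name.Namespace;

        // The (string) cast returns null when the element is missing.
        return root.Elements(ns + "url").Select(url => new SitemapEntry
        {
            Loc = (string)url.Element(ns + "loc"),
            LastMod = (string)url.Element(ns + "lastmod"),
            ChangeFreq = (string)url.Element(ns + "changefreq"),
            Priority = ParsePriority((string)url.Element(ns + "priority"))
        }).ToList();
    }

    // Parses values such as "0.8" regardless of the machine's culture settings.
    private static double? ParsePriority(string text)
    {
        double value;
        bool ok = double.TryParse(text, NumberStyles.Float, CultureInfo.InvariantCulture, out value);
        return ok ? value : (double?)null;
    }
}
```

The "lastmod" value is kept as a string here because sitemap files may contain either a plain date or a full timestamp; convert it to a DateTime only if your sitemaps use a format you know.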


Comments

Author: Md Aneesuddin Arif   30 Apr 2012   Member Level: Silver   Points: 0

Indeed Good Info.


