How to parse Google XML sitemaps in .NET
Do you run a website that has a Google sitemap? Are you looking for C# or VB.NET code samples to load and parse Google sitemap files? Find an example here.
I was writing a C# app to detect copied articles. The tool takes the Google sitemap file of any website or blog, parses the URLs in it, and then searches the web to see whether other websites have copied the same content.
I looked online for ready-made C# or VB.NET code samples that would save me some time. Unfortunately, I could not find any readily available examples, so I decided to spend a few minutes writing the code on my own.
Here is the C# code I wrote to parse Google sitemap files. This sample loads the sitemap from its URL, parses the URLs in the sitemap file, and returns all of them appended into a single string. You may decide to do something else with the URLs after parsing them.
Code to parse Google sitemap file
// Required namespaces (place these at the top of the file):
using System.Text;
using System.Xml;

private string ParseSitemapFile(string url)
{
    XmlDocument sitemapDoc = new XmlDocument();
    // Load the sitemap file from the sitemap URL.
    sitemapDoc.Load(url);
    StringBuilder sitemapContent = new StringBuilder();
    // Iterate through the top-level nodes and find the "urlset" node.
    foreach (XmlNode topNode in sitemapDoc.ChildNodes)
    {
        if (topNode.Name.ToLower() == "urlset")
        {
            // Use the namespace manager so that we can fetch nodes using the namespace.
            XmlNamespaceManager nsmgr = new XmlNamespaceManager(sitemapDoc.NameTable);
            nsmgr.AddNamespace("ns", topNode.NamespaceURI);
            // Get all "url" nodes and iterate through them.
            XmlNodeList urlNodes = topNode.ChildNodes;
            foreach (XmlNode urlNode in urlNodes)
            {
                // Get the "loc" node and retrieve its inner text.
                XmlNode locNode = urlNode.SelectSingleNode("ns:loc", nsmgr);
                string link = locNode != null ? locNode.InnerText : "";
                // Add to our string builder, one URL per line.
                sitemapContent.AppendLine(link);
            }
        }
    }
    return sitemapContent.ToString();
}
Here is the content of a sample sitemap file, in case you are not familiar with Google sitemap files:
<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url>
    <loc>http://www.techulator.com/articles/Disk-Defragmentation.aspx</loc>
    <lastmod>2012-04-25</lastmod>
    <changefreq>weekly</changefreq>
    <priority>0.8</priority>
  </url>
  <url>
    <loc>http://www.techulator.com/articles/Google-latest-update.aspx</loc>
    <lastmod>2012-04-25</lastmod>
    <changefreq>weekly</changefreq>
    <priority>0.8</priority>
  </url>
</urlset>
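If you would rather get the URLs back as a list than as one long string, the same idea can be sketched with LINQ to XML. This is only an illustration and not the code from my tool: the class name SitemapSketch and the method ExtractUrls are made up for this example, and it assumes the sitemap XML is already in memory as a well-formed string (for example, after downloading it yourself).

```csharp
using System.Collections.Generic;
using System.Linq;
using System.Xml.Linq;

// Hypothetical helper class, named only for this illustration.
public static class SitemapSketch
{
    // Returns every non-empty <loc> value found directly under the
    // <url> elements of the given sitemap XML text.
    public static List<string> ExtractUrls(string sitemapXml)
    {
        XDocument doc = XDocument.Parse(sitemapXml);

        // Reuse whatever namespace the root element declares, so the
        // lookup does not hard-code the sitemaps.org schema URI.
        XNamespace ns = doc.Root.Name.Namespace;

        return doc.Root
            .Elements(ns + "url")
            // Casting a missing element to string yields null.
            .Select(url => (string)url.Element(ns + "loc"))
            .Where(loc => !string.IsNullOrWhiteSpace(loc))
            .Select(loc => loc.Trim())
            .ToList();
    }
}
```

Passing a well-formed copy of the sample sitemap shown above to SitemapSketch.ExtractUrls should give back a list with the two techulator.com article URLs. Unlike the XmlDocument version, this sketch skips "url" entries that have no "loc" value instead of adding an empty line for them.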