
XML Sitemaps
A sitemap is a representation of a website's architecture. For visitors, it shows which pages are available, how they are connected and how the site is organised for navigation. For search engines, it lists the pages that are available for crawling.
Good sitemaps help humans find what they are looking for and help search engines orient themselves and manage their crawl activity. A sitemap gives the spider a quick guide to the structure of your website and to what has changed since its last visit. Sitemaps are particularly beneficial on websites:
o Where some areas are not accessible through the user interface
o Where webmasters use AJAX, Flash or other rich Internet application (RIA) content that search engines do not process.
History of Sitemap
1. Google first introduced Sitemaps 0.84 in June 2005 so that web developers could publish lists of links from across their sites. Engineering Director Shivakumar wrote on the Google blog: “We’re undertaking an experiment called Google Sitemaps that will either fail miserably, or succeed beyond our wildest dreams, in making the web better for webmasters and users alike. It’s a beta “ecosystem” that may help webmasters with two current challenges: keeping Google informed about all of your new web pages or updates, and increasing the coverage of your web pages in the Google index. Initially, we plan to use the URL information webmasters supply to further improve the coverage and freshness of our index. Over time that will lead to our doing an even better job of delivering more search results from more websites.
This project doesn’t just pertain to Google, either: we’re releasing it under the Attribution/Share Alike Creative Commons license so that other search engines can do a better job as well. Eventually we hope this will be supported natively in webservers (e.g. Apache, Lotus Notes, IIS). But to get you started, we offer Sitemap Generator, an open source client in Python to compute sitemaps for a few common use cases. Give it a whirl and give us your feedback.”
2. Google, MSN and Yahoo announced joint support for the Sitemaps protocol in November 2006. The schema version was changed to “Sitemap 0.90”, but no other changes were made.
3. In April 2007, Ask and IBM announced support for Sitemaps. In addition, Google, Yahoo and Microsoft announced auto-discovery of sitemaps through robots.txt.
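Auto-discovery relies on a "Sitemap:" line in robots.txt that gives the full URL of the sitemap. As a rough sketch, Python's standard urllib.robotparser module (its site_maps() method needs Python 3.8 or later) can read that line back. The robots.txt content and the helper name discover_sitemaps are made up for illustration.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content; the "Sitemap:" line advertises the sitemap URL.
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
Sitemap: http://www.example.com/sitemap.xml
"""


def discover_sitemaps(robots_txt):
    """Return the sitemap URLs declared in a robots.txt document."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    # site_maps() returns None when no Sitemap line is present.
    return parser.site_maps() or []


print(discover_sitemaps(ROBOTS_TXT))
```

The Sitemap line is independent of the User-agent groups, so it can appear anywhere in the file.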
XML Sitemap Format
The sitemap protocol consists of XML tags. All data values in a sitemap must be entity escaped (described below). The file itself must be UTF-8 encoded. The sitemap must:
1. Begin with an opening <urlset> tag and end with a closing </urlset> tag.
2. Specify the namespace (protocol standard) within the <urlset> tag.
3. Include a <url> entry for each URL as a parent tag.
4. Include a <loc> child entry for each <url> parent tag.
All other tags are optional and their usage may vary among search engines.
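The required structure can be sketched with Python's standard xml.etree.ElementTree module. This is only an illustration: the function name build_sitemap and the example URLs are assumptions, and the xml_declaration argument needs Python 3.8 or later. The serialiser escapes special characters in data values and the output is UTF-8 encoded.

```python
import xml.etree.ElementTree as ET

# Namespace of the Sitemap 0.90 protocol.
SITEMAP_NS = "http://www.sitemaps.org/schemas/sitemap/0.9"


def build_sitemap(urls):
    """Return a UTF-8 encoded sitemap document (bytes) for the given URLs."""
    # <urlset> wraps the file and carries the namespace.
    urlset = ET.Element("urlset", {"xmlns": SITEMAP_NS})
    for address in urls:
        # One <url> parent per page, each with a required <loc> child.
        url = ET.SubElement(urlset, "url")
        loc = ET.SubElement(url, "loc")
        loc.text = address  # "&", "<" and ">" are escaped on serialisation
    return ET.tostring(urlset, encoding="utf-8", xml_declaration=True)


sitemap = build_sitemap([
    "http://www.example.com/",
    "http://www.example.com/catalog?item=12&desc=vacation",
])
print(sitemap.decode("utf-8"))
```

In the printed document the ampersand in the second URL appears as &amp;, which is the entity escaping the protocol requires.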
XML Tag Definitions
1. urlset – This tag is required. Encapsulates the file and references the current protocol standard.
2. url – This tag is required. Parent tag for each entry.
3. loc – This tag is required. It states the URL of the webpage. It must begin with a protocol (such as http) and end with a trailing slash if your web server requires one. It must be less than 2,048 characters.
4. lastmod – This tag is optional. It defines the date of last modification of the file. The date should be in W3C Datetime format; the time portion may be omitted, leaving YYYY-MM-DD.
5. changefreq – This tag is optional. It indicates how frequently the page is likely to change. It gives search engines general information and does not compel them to crawl the page at that frequency. The valid values for it are:
o always
o hourly
o daily
o weekly
o monthly
o yearly
o never
6. priority – This tag is optional. It describes the priority of a URL relative to other URLs on the same website. Its value ranges from 0.0 to 1.0. Setting priorities does not influence the rankings of URLs in the search engine result pages.
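The tag definitions can be turned into a small validating helper. The sketch below is one possible reading of the rules; the function name make_url_entry and the sample values are hypothetical. It builds a single url entry and rejects values that fall outside the ranges listed above.

```python
import xml.etree.ElementTree as ET
from datetime import date

VALID_CHANGEFREQ = {"always", "hourly", "daily", "weekly", "monthly", "yearly", "never"}


def make_url_entry(loc, lastmod=None, changefreq=None, priority=None):
    """Build one <url> element, checking the constraints on each tag."""
    if "://" not in loc:
        raise ValueError("loc must begin with a protocol such as http")
    if len(loc) >= 2048:
        raise ValueError("loc must be less than 2048 characters")
    url = ET.Element("url")
    ET.SubElement(url, "loc").text = loc
    if lastmod is not None:
        # date.isoformat() yields YYYY-MM-DD, one of the W3C Datetime forms.
        ET.SubElement(url, "lastmod").text = lastmod.isoformat()
    if changefreq is not None:
        if changefreq not in VALID_CHANGEFREQ:
            raise ValueError(f"invalid changefreq: {changefreq!r}")
        ET.SubElement(url, "changefreq").text = changefreq
    if priority is not None:
        if not 0.0 <= priority <= 1.0:
            raise ValueError("priority must be between 0.0 and 1.0")
        ET.SubElement(url, "priority").text = f"{priority:.1f}"
    return url


entry = make_url_entry("http://www.example.com/", date(2007, 4, 11), "monthly", 0.8)
print(ET.tostring(entry, encoding="unicode"))
```

Only loc is mandatory here; the optional tags are written only when a value is supplied, which mirrors how the protocol treats them.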
Entity Escaping
As described above, the sitemap must be UTF-8 encoded. In addition, all data values must use entity escape codes for the following characters:
o Ampersand (&) – &amp;
o Single Quote (') – &apos;
o Double Quote (") – &quot;
o Greater Than (>) – &gt;
o Less Than (<) – &lt;
Sitemap Index Files
There are two limits to keep in mind when creating a sitemap:
1. The sitemap must not contain more than 50,000 URLs.
2. It must not be larger than 50 MB (52,428,800 bytes); the protocol originally set this limit at 10 MB.
The sitemap may be compressed with gzip, but it must stay within the size limit when uncompressed. If a site has more than 50,000 URLs, create multiple sitemap files and then list each of them in a sitemap index file. The sitemap index file must:
1. Not list more than 50,000 sitemaps
2. Not be larger than 50 MB
The sitemap index file must also:
1. Begin with an opening <sitemapindex> tag and end with a closing </sitemapindex> tag.
2. Include a <sitemap> entry for each sitemap as a parent tag.
3. Include a <loc> child entry for each <sitemap> parent tag.
The optional <lastmod> tag is also available for sitemap index files.
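Splitting a large site and listing the parts in an index file can be sketched as follows. The helper names, the sitemapN.xml naming scheme and the page URLs are assumptions made for the example. The element names sitemapindex, sitemap, loc and lastmod are the ones the protocol defines.

```python
import xml.etree.ElementTree as ET

SITEMAP_NS = "http://www.sitemaps.org/schemas/sitemap/0.9"
MAX_URLS_PER_SITEMAP = 50_000


def chunk_urls(urls, size=MAX_URLS_PER_SITEMAP):
    """Split a list of URLs into consecutive groups of at most `size` items."""
    return [urls[i:i + size] for i in range(0, len(urls), size)]


def build_sitemap_index(sitemap_urls, lastmod=None):
    """Return a UTF-8 encoded sitemap index (bytes) listing the given sitemap files."""
    index = ET.Element("sitemapindex", {"xmlns": SITEMAP_NS})
    for address in sitemap_urls:
        sitemap = ET.SubElement(index, "sitemap")
        ET.SubElement(sitemap, "loc").text = address
        if lastmod is not None:
            ET.SubElement(sitemap, "lastmod").text = lastmod
    return ET.tostring(index, encoding="utf-8", xml_declaration=True)


# 120,000 hypothetical pages need three sitemap files.
pages = [f"http://www.example.com/page{n}.html" for n in range(120_000)]
chunks = chunk_urls(pages)
names = [f"http://www.example.com/sitemap{n}.xml" for n in range(1, len(chunks) + 1)]
index = build_sitemap_index(names, lastmod="2007-04-11")
print(len(chunks), [len(chunk) for chunk in chunks])
```

The sketch checks only the URL count. A real generator would also need to measure the uncompressed size of each file against the size limit.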
Sitemap File Location
The location of a sitemap determines the set of URLs that can be included in that sitemap. A sitemap file located at http://www.example.com/xyz/sitemap.xml can include any URLs starting with http://www.example.com/xyz/ but cannot include URLs starting with http://www.example.com/images/. Therefore, it is strongly recommended to place the sitemap file in the root directory of the web server, i.e. at http://www.example.com/sitemap.xml.
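The location rule can be expressed as a short check. Under the protocol, a sitemap covers only URLs at or below its own directory, on the same host and protocol. The function name allowed_in_sitemap is hypothetical and the sketch ignores details such as port normalisation.

```python
import posixpath
from urllib.parse import urlsplit


def allowed_in_sitemap(sitemap_url, page_url):
    """Return True if page_url lies in or below the directory holding the sitemap."""
    sm, page = urlsplit(sitemap_url), urlsplit(page_url)
    # The page must use the same protocol and host as the sitemap.
    if (sm.scheme, sm.netloc) != (page.scheme, page.netloc):
        return False
    # "/xyz" for "/xyz/sitemap.xml", "/" for "/sitemap.xml".
    base_dir = posixpath.dirname(sm.path)
    if not base_dir.endswith("/"):
        base_dir += "/"
    # Treat a bare host such as http://www.example.com as the root path.
    return (page.path or "/").startswith(base_dir)


nested = "http://www.example.com/xyz/sitemap.xml"
root = "http://www.example.com/sitemap.xml"
print(allowed_in_sitemap(nested, "http://www.example.com/xyz/page.html"))
print(allowed_in_sitemap(nested, "http://www.example.com/images/logo.png"))
print(allowed_in_sitemap(root, "http://www.example.com/images/logo.png"))
```

A sitemap placed at the root passes the check for every path on the host, which is why the root directory is the recommended location.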
The most important thing to keep in mind is that a sitemap file helps with indexing, not with ranking. It was developed to tell crawlers which URLs on the website should be crawled so that those pages can be indexed. It does nothing to boost the website's rankings in the search engine results pages.