The Road to Better Site Indexing: Episode 3, Sitemaps (based on a true story)


In our prior episodes:
The crawler known as “Bot” travels across the web, moving from page to page and site to site by following links he discovers along the way. But Bot isn’t the type to let himself be led about aimlessly. He tries to imitate his hero Humphrey Bogart, who never shied away from a tangled web yet always managed to stay on the right track.

But being a perfectionist, Bot wasn’t entirely satisfied with his own method. Was he overlooking a significant thread? Leaving an important page unturned? He had a hunch he could do better.

Leaving important content in the dustbin of unindexed pages was just the sort of slip-up that really peeved Bot’s equally perfectionist client Betty, a.k.a. “The Webmaster.” Betty had specifically called on Bot to crawl her entire site, and Bot had missed several pages.

To get their relationship back on the right track, Bot had an idea: he would ask Betty to tell him flat out everything she wanted him to know about her site. And being a guy always in the know, Bot knew just what tool Betty could use to set the record straight: a sitemap.
He proposed; she accepted.

Now Betty can rest easy knowing all the content she wants to share with the world will be indexed. And just what is this handy tool known as a sitemap?
It’s actually not much more than a laundry list of links. Constructing one is a snap. You simply create a text file listing the URLs you want indexed, along with any key facts you want Bot to know (like how often a file is updated). Place it wherever you’d like, for example at the root of your web site (http://www.example.com/sitemap.xml), and give Bot its location in your robots.txt file.
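
For the robots.txt part, the Sitemaps protocol uses a single "Sitemap:" line containing the full URL of the file. Here is a minimal Python sketch of how Betty might produce it; the function name and the default "allow everything" rules are assumptions made up for illustration, not anything Bot requires.

```python
def robots_txt_with_sitemap(sitemap_url, existing_rules="User-agent: *\nDisallow:\n"):
    """Return robots.txt content that advertises where the sitemap lives.

    `existing_rules` is a hypothetical default that lets every crawler in;
    pass your own rules if you already have a robots.txt.
    """
    if not sitemap_url.startswith(("http://", "https://")):
        raise ValueError("the Sitemap line expects a full URL")
    # Keep the existing rules, then add the Sitemap line after a blank line.
    return existing_rules.rstrip("\n") + "\n\n" + "Sitemap: " + sitemap_url + "\n"


if __name__ == "__main__":
    print(robots_txt_with_sitemap("http://www.example.com/sitemap.xml"))
```

The "Sitemap:" line is independent of any "User-agent" group, so it can sit anywhere in the file.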

Sitemaps can be written in XML (the preferred method), or communicated via syndication feeds or simple text files. A sitemap in XML looks something like this:

<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url>
    <loc>http://www.example.com/</loc>
    <lastmod>2005-01-01</lastmod>
    <changefreq>monthly</changefreq>
    <priority>0.8</priority>
  </url>
  <url>
    <loc>http://www.example.com/catalog?item=12&amp;desc=vacation_hawaii</loc>
    <changefreq>weekly</changefreq>
  </url>
  <url>
    <loc>http://www.example.com/catalog?item=83&amp;desc=vacation_usa</loc>
  </url>
</urlset>
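
If you would rather not type the XML by hand, a short script can generate it. The sketch below uses Python's standard xml.etree.ElementTree module; the function name build_sitemap and the list-of-dicts input are assumptions chosen for this example. One practical benefit: ampersands in URLs must be written as &amp; inside XML, and the library takes care of that escaping.

```python
import xml.etree.ElementTree as ET

SITEMAP_NS = "http://www.sitemaps.org/schemas/sitemap/0.9"
OPTIONAL_TAGS = ("lastmod", "changefreq", "priority")


def build_sitemap(entries):
    """Build a sitemap string from a list of dicts.

    Each dict needs a 'loc' key; 'lastmod', 'changefreq' and 'priority'
    are optional. (This input shape is a hypothetical choice.)
    """
    urlset = ET.Element("urlset", {"xmlns": SITEMAP_NS})
    for entry in entries:
        url = ET.SubElement(urlset, "url")
        ET.SubElement(url, "loc").text = entry["loc"]
        for tag in OPTIONAL_TAGS:
            if tag in entry:
                ET.SubElement(url, tag).text = str(entry[tag])
    body = ET.tostring(urlset, encoding="unicode")
    return '<?xml version="1.0" encoding="UTF-8"?>\n' + body


if __name__ == "__main__":
    print(build_sitemap([
        {"loc": "http://www.example.com/", "lastmod": "2005-01-01",
         "changefreq": "monthly", "priority": 0.8},
        {"loc": "http://www.example.com/catalog?item=12&desc=vacation_hawaii",
         "changefreq": "weekly"},
    ]))
```

The output is a single line of XML after the declaration; crawlers don't care about indentation, only that the file is well-formed.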

You can visit http://www.sitemaps.org/ for all the details. It’s the official site of the Sitemaps protocol, which was first proposed by Google, then fleshed out through discussions with MSN, Yahoo and Ask. It’s now the standard adopted by Google, Yahoo, Ask, and, as of July 2007, Exalead.
But bad guys, consider yourselves forewarned: Bot knows that not every webmaster is as straight up as Betty. He stays a step ahead of nefarious sitemap tricks, checking out every list of links spun his way and skipping right over bum lists.
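
Bot keeps his actual methods to himself, so the following is only a guess at the kind of basic sanity check a crawler could run, written in Python. It relies on two rules from the Sitemaps protocol (at most 50,000 URLs per file, and URLs on the same host as the sitemap itself); the function name usable_urls and the choice to return an empty list for a bad file are assumptions for illustration.

```python
import xml.etree.ElementTree as ET
from urllib.parse import urlsplit

NS = "{http://www.sitemaps.org/schemas/sitemap/0.9}"
MAX_URLS = 50000  # per-file limit set by the Sitemaps protocol


def usable_urls(sitemap_xml, sitemap_url):
    """Return the URLs worth crawling, or [] if the whole list looks bad."""
    try:
        root = ET.fromstring(sitemap_xml)
    except ET.ParseError:
        return []  # not well-formed XML: skip the whole list
    if root.tag != NS + "urlset":
        return []  # wrong root element or namespace
    locs = [(el.text or "").strip() for el in root.iter(NS + "loc")]
    if len(locs) > MAX_URLS:
        return []  # over the protocol's size limit
    host = urlsplit(sitemap_url).netloc.lower()
    # Keep only URLs that live on the same host as the sitemap.
    return [u for u in locs if urlsplit(u).netloc.lower() == host]
```

A real crawler would certainly look at much more than this (content quality, redirects, and so on); the sketch only shows the structural checks.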

Sébastien