The Road to Better Site Indexing – Introduction and Episode 1

The question that usually follows “How
can I make my site appear at the top of search engine results?” is “Why don’t
search engines index all my pages?”

First, you should know that pages
accessible only through JavaScript or through form submissions are not
reachable by search engines and therefore cannot be indexed. And a search
engine has no way of knowing whether it is missing pages on a site, or
whether the missing page count is 10 or 10,000 (outside of site maps, which I
will discuss in a future post).
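
To make this concrete, here is a minimal sketch (in Python, using only the standard library) of how a crawler “sees” a page: it parses the raw HTML and collects ordinary <a href> links. Anything reachable only through JavaScript or a form submission never shows up in that list, so it is never followed. The URLs in the snippet are invented for illustration.

```python
# Minimal sketch: a crawler parses raw HTML and keeps only <a href="..."> targets.
# Links built by JavaScript or reached only via form submission never appear here,
# so they are never followed and never indexed.
from html.parser import HTMLParser

class LinkExtractor(HTMLParser):
    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.links.append(href)

page = """
<a href="/products/wrenches">Wrenches</a>
<span onclick="loadPage('/products/hammers')">Hammers</span>
<form action="/search" method="post"><input name="q"></form>
"""

parser = LinkExtractor()
parser.feed(page)
print(parser.links)  # ['/products/wrenches'] -- only the plain link is discoverable
```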

Next, let’s refresh ourselves on the
fundamental methods search engines use to find the pages they index: 1) they
follow a submission made by a human being (0.0001% of cases), or 2) they follow
a link from another page. Therefore, if there is a link to a given page, the
probability that it will be indexed is high. Conversely, a personal site with
no external links to it has little chance of being indexed by a search engine.
So more links are always better, right?

Not necessarily. It should be understood that from the point of view of
a search engine, a risk arises not from a dearth of links, but from too many.
Why? Because search engines seek to provide the most relevant results for
visitors, returning pages with the content most likely to match visitors’ needs
and expectations. A site that arrived at the top of the results solely because
there were tens of thousands of links to it would not pass this test. In fact,
an overabundance of external links may indicate a “spamming” campaign aimed at
search engines and be an indicator of poor site quality.

Here are two cases of what we’ll call legitimate ‘overabundance,’ an overabundance of links due to valid, non-spamming factors that can be properly managed by search engines.



Case 1: User Sessions

When you visit
an e-commerce site, unique “session” information will often be assigned to your
computer. This information uniquely identifies your particular connection and
visit. It may include, for example, a unique ID for your computer and a code
for your browser version or geographic location.

This session information tracks your movements,
preferences and selections as you navigate a site. This is not for nefarious
ends, but is rather used to perform practical tasks like maintaining items in
your shopping cart, showing prices in your local currency or displaying a list
of products you’ve viewed. This session information is most often added to the end
of the URL (web address) for every page you visit.

For instance, say you are visiting Amazon.com and you
navigate to a Stanley wrench set. The URL displayed in your browser is:

http://www.amazon.com/Stanley-92-716-Combination-Wrench-22-Piece/dp/B000JPUCT0/ref=sr_1_7/002-6118145-0432018?ie=UTF8&s=hi&qid=1181650669&sr=1-7

Only the first part of the URL,

http://www.amazon.com/Stanley-92-716-Combination-Wrench-22-Piece/dp/B000JPUCT0/,

is needed to locate the product information for this wrench set. The rest,

“ref=sr_1_7/002-6118145-0432018?ie=UTF8&s=hi&qid=1181650669&sr=1-7”

is session information for your particular visit.

A search engine may come across thousands of links
like the longer address, each of which may appear different because unique session
information is appended, and because each may show different user-dependent
content such as navigation history, promotions, or recommended products. But
any search engine worth its salt can distinguish the repetitive session data
from the essential URL, and will know this is not a case of spamming.
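
As a rough illustration (a sketch, not how any particular engine actually works), here is the kind of normalization a crawler might apply to collapse session-laden URLs into one canonical address. The split rule below, dropping everything from the “ref=” segment onward along with the query string, is tailored to this Amazon example.

```python
# Rough sketch: reduce a session-laden URL to a canonical address before indexing.
# The "ref=" split and query-string removal are tailored to the Amazon example
# above; a real engine learns such patterns per site rather than hard-coding them.
from urllib.parse import urlsplit

def canonicalize(url: str) -> str:
    parts = urlsplit(url)
    path = parts.path.split("ref=")[0]               # drop the tracking segment
    return f"{parts.scheme}://{parts.netloc}{path}"  # and the whole query string

long_url = ("http://www.amazon.com/Stanley-92-716-Combination-Wrench-22-Piece"
            "/dp/B000JPUCT0/ref=sr_1_7/002-6118145-0432018"
            "?ie=UTF8&s=hi&qid=1181650669&sr=1-7")

print(canonicalize(long_url))
# -> http://www.amazon.com/Stanley-92-716-Combination-Wrench-22-Piece/dp/B000JPUCT0/
```

Thousands of variants of the long address all collapse to that single canonical URL, which is the one worth indexing.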


Case 2: Calendar Menus

Some sites let you navigate through their
content by clicking on a calendar. For example, you may be able to peruse news
articles or events on a site by choosing a date or date range.

Such menus generate links like:
http://www.ecvd.eu/index.php?option=com_events&task=view_month&Itemid=32&year=2011&month=09&day=12

A competent search engine will know which
of these types of links returns valid content and which does not, and what
baseline URL should be included in a search index. In other words, having a
zillion external links for events on dates from 1950 to 2060 for a site with ten
events will definitely not boost that site’s ranking ;-).
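
To sketch how that might work (purely illustrative: the parameter names come from the example above, and a real crawler would discover such patterns on its own rather than rely on a fixed list), URLs that differ only in their date parameters can be collapsed to a single key, so one calendar page gets indexed instead of an endless range of dates:

```python
# Hedged sketch: group URLs that differ only in date parameters under one key.
# The parameter names (year/month/day) match the example above; real crawlers
# infer such patterns statistically rather than from a fixed list.
from urllib.parse import urlsplit, parse_qsl, urlencode

DATE_PARAMS = {"year", "month", "day"}

def calendar_key(url: str) -> str:
    """Collapse a URL to a key that ignores its date parameters."""
    parts = urlsplit(url)
    stable = [(k, v) for k, v in parse_qsl(parts.query) if k not in DATE_PARAMS]
    return f"{parts.scheme}://{parts.netloc}{parts.path}?{urlencode(stable)}"

urls = [
    "http://www.ecvd.eu/index.php?option=com_events&task=view_month&Itemid=32&year=2011&month=09&day=12",
    "http://www.ecvd.eu/index.php?option=com_events&task=view_month&Itemid=32&year=2011&month=10&day=01",
    "http://www.ecvd.eu/index.php?option=com_events&task=view_month&Itemid=32&year=2060&month=01&day=01",
]

# All three collapse to one key, so the crawler can index a single calendar
# page instead of following every possible date.
print({calendar_key(u) for u in urls})
```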

Now you may say these two cases look like
easy ones for a search engine to manage, and you’d be right. The real
difficulties arise from the following three cases, because (scoop!) there are
unscrupulous people out there ready to do anything to improve their search
engine ranking. You’ve most likely encountered their handiwork when using a
search engine other than Exalead.

You run your search and click on a page you
think is relevant, only to encounter an endless list of meaningless links or
keywords, a pastiche of content “borrowed” from other more relevant sites, or
an endless loop of promising links that ultimately go nowhere.

These types of pages are generated by the
folks at the top of our list of ballot-box stuffers, those trying to improve
their search engine rank through:

* Link farms and keyword stuffing,

* Content scraping, including the abuse of
RSS Feeds, and

* Creating content labyrinths.

We’ll be covering these tactics in upcoming
episodes. In the meantime, you can see why search engines may need to limit the
number of pages they index for a site. This ‘quota’ is determined based on the
site’s reputation, the duplication of its content, and a thousand other
parameters, all factored in an attempt to keep the game honest so web searchers
get the most relevant search results possible.
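
Just to give a feel for the idea (a toy heuristic, nothing like Exalead’s actual formula), such a quota might grow with a site’s reputation and shrink with the share of duplicate content:

```python
# Toy heuristic only (not any engine's real formula): a per-site page quota that
# grows with reputation and shrinks as the share of duplicate content rises.
def page_quota(reputation: float, duplicate_ratio: float,
               base: int = 1_000, cap: int = 1_000_000) -> int:
    """reputation and duplicate_ratio are assumed to lie in [0, 1]."""
    quota = base * (1 + 999 * reputation) * (1 - duplicate_ratio)
    return int(min(max(quota, base), cap))

print(page_quota(reputation=0.9, duplicate_ratio=0.1))  # well-known site, little duplication
print(page_quota(reputation=0.1, duplicate_ratio=0.8))  # obscure site, mostly duplicates
```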


Sebastien, Head Chef, Web Team
