SEO: Indexation and Crawlability
YNOT EUROPE – Before a website can rank for a keyword in any search engine’s results, the pages containing the keyword must be indexed by a piece of software search engines call a “spider” or a “crawler.” Every so often, search engine spiders visit every site in their databases, “crawl” all over every page, and attempt to follow all the text-based links. If a new page is detected, it is checked for relevant content, and if deemed acceptable, the crawler will add the page to its index (database).
The more pages with unique, keyword-rich content you can get a search engine to index, the better the chance that at least some of the pages will rank high in the results returned to end-users in response to their queries. Better page rankings mean increased opportunity to attract visitors who are interested in the site’s content and services.
Identifying indexed pages
To establish how many pages of your site are indexed in Google, Yahoo! and Bing, query each with the search string site:www.mywebsitename.com. In response, each search engine will return a list of every page on your site that is included in its database. A total will appear somewhere near the top of the page, just below the search box. As with other search results, the pages that appear first are the most highly ranked pages on your site. Are any important pages missing? Is the list in the order you expected (usually homepage first)? If the search results aren’t what you expected to find, it’s time to make adjustments to the way your pages appear to search engines.
To improve crawlability and indexation:
• Routinely check for broken links. Broken links usually develop when old pages are removed or a site’s file structure changes, and the developer or webmaster then forgets to clean up the links pointing to the missing pages.
For a search engine spider, a broken link is a dead end. If the missing page returns a 404 error (“page not found”), the search engine will identify the page as non-existent. The longer the page linking to the 404 error remains unchanged, the lower the spider will rate the page’s “value.” Spiders want to spare site visitors the experience of wasting time on pages that go nowhere.
Tools like the W3C’s Link Checker can help webmasters ensure their pages don’t contain broken links.
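For webmasters who prefer to script the check, here is a minimal sketch in Python of what a link checker does: pull every link out of a page, request each one, and report those that fail. The names LinkExtractor, http_status and find_broken_links are made up for this illustration, and treating any status of 400 or above as "broken" is an assumption, not a rule taken from any search engine.

```python
from html.parser import HTMLParser
from urllib.error import HTTPError
from urllib.parse import urljoin
from urllib.request import Request, urlopen


class LinkExtractor(HTMLParser):
    """Collects the href value of every <a> tag in a page."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)


def http_status(url):
    """Return the HTTP status code for url, or None if it cannot be reached."""
    try:
        with urlopen(Request(url, method="HEAD"), timeout=10) as response:
            return response.status
    except HTTPError as err:
        return err.code  # e.g. 404 for a missing page
    except OSError:
        return None  # DNS failure, refused connection, timeout


def find_broken_links(html, base_url, get_status=http_status):
    """Return (absolute_url, status) for every link that looks broken."""
    parser = LinkExtractor()
    parser.feed(html)
    broken = []
    for href in parser.links:
        absolute = urljoin(base_url, href)
        if not absolute.startswith(("http://", "https://")):
            continue  # skip mailto:, javascript: and similar links
        status = get_status(absolute)
        if status is None or status >= 400:  # assumed definition of "broken"
            broken.append((absolute, status))
    return broken
```

The get_status argument exists so the function can be tried out without touching the network. Some servers reject HEAD requests, so a real checker may need to fall back to GET.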
• HTML sitemaps play an important part in the indexation process. Include a link on your homepage to a complete and frequently updated sitemap on which search engine spiders can find links to every page on your site. As the site grows, be sure to add the new pages to the sitemap.
Google recommends keeping the links on a given page to fewer than 100, so if your site is unusually deep, you may want to break your sitemap into several pages, each containing fewer than 100 links.
If you have a small website, you can link to every other page on the site in a “footer” at the bottom of your homepage. An example of a nice bottom-of-the-page sitemap can be found at Playboy.com.
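For larger sites, splitting a long list of links into sitemap pages of fewer than 100 links each is easy to automate. The sketch below assumes the links are already collected in a Python list. The function name paginate_links and the default of 99 links per page are illustrative choices, and any navigation links on each sitemap page would count toward the total as well.

```python
def paginate_links(links, max_per_page=99):
    """Split links into consecutive pages of at most max_per_page links each."""
    if max_per_page < 1:
        raise ValueError("max_per_page must be at least 1")
    return [
        links[start:start + max_per_page]
        for start in range(0, len(links), max_per_page)
    ]
```

Each inner list can then be rendered as one HTML sitemap page, with the pages linked to one another so a spider can reach all of them.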
• XML sitemaps also attract spiders and give them something meaty to chew on. Like HTML sitemaps, XML sitemaps must be updated frequently in order to remain current.
Fortunately, a free online tool called GSite Crawler makes creating XML sitemaps quick and easy. Once the sitemap is ready, upload it to your server so that it is reachable at a URL like www.mywebsitename.com/sitemap.xml.
For more about building XML sitemaps, visit Sitemaps.org, which created the standard recognized by major search engine spiders.
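To show what such a file contains, here is a minimal sketch that builds a sitemap with Python’s standard library. It uses only the urlset, url, loc and lastmod elements and the namespace defined by the Sitemaps.org protocol; the function name build_sitemap and the example addresses in the usage note are hypothetical.

```python
import xml.etree.ElementTree as ET

# Namespace defined by the Sitemaps.org protocol.
SITEMAP_NS = "http://www.sitemaps.org/schemas/sitemap/0.9"


def build_sitemap(pages):
    """Build sitemap XML from (url, lastmod) pairs.

    lastmod is a 'YYYY-MM-DD' string, or None to leave the element out.
    """
    urlset = ET.Element("urlset", xmlns=SITEMAP_NS)
    for loc, lastmod in pages:
        url = ET.SubElement(urlset, "url")
        ET.SubElement(url, "loc").text = loc  # special characters are escaped on output
        if lastmod:
            ET.SubElement(url, "lastmod").text = lastmod
    body = ET.tostring(urlset, encoding="unicode")
    return '<?xml version="1.0" encoding="UTF-8"?>\n' + body
```

Calling build_sitemap([("http://www.mywebsitename.com/", "2010-01-15")]) and saving the result as a UTF-8 file named sitemap.xml gives a file ready to upload.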
• Watch your directory depth. The most effective URLs contain the fewest slashes, indicating the pages lie close to the site’s “surface.” A too-deep URL might look like this: www.mysitename.com/products/codebarrer/serv/30430535/sex.html. A much more effective URL would look like this: www.mysitename.com/sex.html. Confining directories to major subsections not only makes spiders happier, but it also eases the creation and updating of sitemaps.
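One way to audit a site for overly deep pages is to count the directories in each URL. The sketch below does that with Python’s standard library. The function name directory_depth is invented for this example, and what counts as "too deep" is left to the reader, since no fixed threshold is given here.

```python
from urllib.parse import urlsplit


def directory_depth(url):
    """Count the directories between the site root and the page itself."""
    if "://" not in url:
        url = "http://" + url  # urlsplit needs a scheme to separate host from path
    path = urlsplit(url).path
    segments = [part for part in path.split("/") if part]
    if segments and not path.endswith("/"):
        segments.pop()  # the final segment is the page file, not a directory
    return len(segments)
```

With the two example URLs above, the deep one has a depth of 4 and the shallow one a depth of 0.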
• Avoid using long URLs. Crawlers can find it difficult to read and understand long, parameter-filled URLs, especially when they contain stop characters (? # &) or tracking codes and navigational parameters. Always use hyphens, not underscores, to separate words, and use clearly named URLs that include the keyword(s) targeted for the page. A good short URL might look like www.mysitename.com/sex-is-good.html; its less-apt cousin might resemble www.mysitename.com/sex_is_good.php?withchains&whips.
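These URL rules are simple enough to check in code. The following sketch turns a page title into a hyphenated file name and flags the problems described above in an existing URL. Both function names, slugify and url_warnings, are hypothetical, and the warnings cover only underscores, query strings and fragments.

```python
import re
from urllib.parse import urlsplit


def slugify(title):
    """Lowercase the title and join its words with hyphens."""
    words = re.findall(r"[a-z0-9]+", title.lower())
    return "-".join(words)


def url_warnings(url):
    """List the crawler-unfriendly features found in url."""
    if "://" not in url:
        url = "http://" + url  # urlsplit needs a scheme to separate host from path
    parts = urlsplit(url)
    warnings = []
    if "_" in parts.path:
        warnings.append("underscores in path")
    if parts.query:
        warnings.append("query string")
    if parts.fragment:
        warnings.append("fragment")
    return warnings
```

For instance, slugify("Sex Is Good") gives "sex-is-good", and the less-apt example URL above is flagged for both its underscores and its query string.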
• Avoid using JavaScript, Flash and image maps for links. Search engines generally only read HTML source code. The big engines are making advances into indexing JavaScript and Flash, but they still have a long way to go. Spiders will not always follow unfriendly navigation or linking structures, leaving some pages un-indexed.
• Give users and spiders breadcrumbs. Providing navigational assistance to users and crawlers via a “breadcrumb” navigation system not only improves users’ ability to find what they seek but also provides spiders with another clue to keyword relevance. Breadcrumbs, usually ensconced near the top of a web page, show the path to the current location using text-based links. In the example Home: > Affiliate > Adult Affiliate > Payouts, each new term in the chain would be a hyperlink leading back to the page from which the user or spider came.
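A breadcrumb trail like the one above is just a chain of plain text links, which makes it easy to generate. The sketch below builds the HTML from a list of (label, URL) pairs. The function name breadcrumb_html and the paths in the usage note are made up for illustration, and leaving the current page (the last item) unlinked is a common convention assumed here, not a requirement.

```python
from html import escape


def breadcrumb_html(trail):
    """Render (label, url) pairs, homepage first, as a breadcrumb trail."""
    parts = []
    last = len(trail) - 1
    for index, (label, url) in enumerate(trail):
        if index == last:
            parts.append(escape(label))  # current page: plain text, no link
        else:
            parts.append('<a href="%s">%s</a>' % (escape(url), escape(label)))
    return " &gt; ".join(parts)
```

Passing [("Home", "/"), ("Affiliate", "/affiliate/"), ("Payouts", "/affiliate/payouts.html")] produces two text links followed by the unlinked word Payouts, separated by “>” characters.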
• Source code formatting is important. Eliminate extra white space and blank lines between code snippets. Extra spaces take extra time for spiders to crawl, and that can translate into “points off” for a presumed sloppiness in the appearance or organization of the pages. Some HTML code editors incorporate an “apply source formatting” tool. Use it.
• Delete redundant code. Removing unnecessary characters from code reduces the time it takes for a search engine to crawl a page. Like unformatted source code, pointless code comments and redundant HTML tags also can convince spiders the site isn’t optimized for user experience. As a bonus, removing redundant and extraneous code can make pages load more quickly.
• Call external files instead of embedding code. Embedding large amounts of JavaScript and cascading style sheets (CSS) in the source code of web pages clogs up the indexation process for spiders and increases page-load time. Keep all scripts and styles in external files and “call them in” — but make sure the calls go to correct paths.
• Frequently offer fresh content. Regularly updating pages with fresh content will help ensure search engine spiders return on a regular basis to index and cache content. Any newer pages linked to from frequently updated pages also will be indexed more quickly. News articles, forums and other user-generated pages are a perfect solution for adding fresh content, especially if the pages are rich in keywords. Older pages may be stored in an archive, which in turn adds more keyword-rich content to the site.
• Don’t duplicate content. Website pages that contain identical content or in other ways appear too similar to each other may not be indexed. Search engines employ duplicate content filters to weed out replicas before pages are added to the index. For the best overall rankings, ensure every page on a site contains content not easily found elsewhere on your site or the World Wide Web.
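Search engines do not publish how their duplicate content filters work, but you can get a rough idea of how similar two of your own pages are before a spider does. The sketch below compares the overlapping three-word sequences ("shingles") of two texts and reports a score between 0.0 and 1.0. This is a common textbook technique swapped in for illustration, not the method any search engine is known to use, and the function names and the shingle size of three are assumptions.

```python
import re


def shingles(text, size=3):
    """Return the set of overlapping word sequences of length size."""
    words = re.findall(r"\w+", text.lower())
    if len(words) < size:
        return {tuple(words)} if words else set()
    return {tuple(words[i:i + size]) for i in range(len(words) - size + 1)}


def similarity(text_a, text_b, size=3):
    """Jaccard similarity of two texts: shared shingles divided by all shingles."""
    a = shingles(text_a, size)
    b = shingles(text_b, size)
    if not a and not b:
        return 1.0  # two empty texts are treated as identical
    return len(a & b) / len(a | b)
```

Scores close to 1.0 suggest two pages share most of their wording and may be worth rewriting or merging. Where to draw the line is a judgment call.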
• Use backlinks with care. Links to your site from other websites can help increase a page’s “popularity” score, but not all backlinks are created equal. Search engines discount paid links, and links from unreliable sources actually count against a page’s rank. Although no one can control who links to them and why, try to obtain high-quality links from trustworthy sites with content relevant to the page(s) you want indexed.
Search engine spiders have billions of pages to get through on the web. To improve speed and efficiency and conserve resources, some search engines may program their spiders to give up or return later if they encounter difficulties crawling a page. Applying the recommendations above should help you avoid problems with indexation of your site.
This column was contributed to YNOT Europe by the SEO staff at RIVCash.com.