For the last couple of weeks we’ve been looking at information architecture with a focus on making your site more usable for real people. Today I want to continue that discussion, this time focusing on how to make your site more usable for search engines.
If you remember, I mentioned that how we structure content can help people better understand what your site is about and help them find what they want more quickly and easily. The same is true of search engines.
Today I want to talk about the latter: how search engines crawl and index your content and what you can do to make it easier for them. Next week we’ll continue with a discussion of siloing or theming, which is a way to structure your content to help search engines better understand what it’s about and ideally help your content rank better for different keyword themes.
Depth of Site Structure and Search Engine Spiders
This weekend a friend and I went for a hike on one of the hundreds of hiking trails in the mountains where we live. There are a limited number of starting points to discover those trails. You can start at one park and begin hiking one trail. When you see another trail off the one you’re on, you can take the new trail or continue on the same one. The further you walk, the more trails you’ll discover.
At some point you’ll likely get tired and come home. The next time you go for a hike you might start at the same park and that same trail and begin walking and exploring again. If you do you’ll encounter some of the same trails from your last walk as well as some new trails. Again you’ll tire at some point and go home likely having found a few more trails than you knew about after your first visit.
There’s more than one park in town, more than one starting point to find different trails. With each new starting point you’ll find new trails and sometimes even old trails you discovered starting from other parks. Some trails in the mountains you’ll never find, but the more often you hike and the more places you start out from, the more you’ll find.
How Search Spiders Crawl and Index Your Content
Search spiders or robots find content on the web in much the same way we might discover new trails in the mountains, except they follow links, and what they find at the end of those links might look very different than it did the last time they were there.
You and I will start out on a trail we know about and from there discover new trails connected to the one we’re on. Search engines will start out on a page they know about and discover new pages that are connected to the ones they visit.
In much the same way you and I will discover more trails by varying where we start our hike, search engines find more pages by varying where they start. And just as you and I will tire and stop hiking for the day, search engine spiders won’t continue crawling forever at any one time. There’s a limit to how deep they’ll crawl a site on any visit.
If we think about the above we’re left with 3 ways a search engine might crawl and index more of your site.
- More entry points
- A deeper crawl
- Pages closer to the most common entry points
The first two above are a function of the links pointing into your site. We create more entry points by having other sites link to as many pages of our sites as possible. If they all link to the home page of our site we have one entry point. If they link to a variety of pages across our site we have many entry points.
More links flowing into your site generally means your site has more link equity. For Google that link equity is PageRank (PR), and Google has said that the more PR a site has, the deeper it will crawl that site.
The last item above is where information architecture comes in. The closer we can make a page to one of the starting points of a crawl, the more likely that page will be found and consequently indexed. A shallower structure thus becomes a goal for increasing the number of pages indexed on your site.
We learned a few weeks ago about the principle of choices, the idea that the more options we provide, say in a menu, the harder it is for people to choose one of the options. The principle of choices pushes us toward a deeper structure. It wants us to create top level navigation with fewer links, since that’s easier for real people. It pushes pages away from those starting point pages.
Using Sitemaps To Speed Indexing
You’ve likely heard about sitemaps. Sitemaps offer a solution to the above problem of wanting both a deeper and shallower content structure. There are two kinds of sitemaps.
- html sitemaps are page(s) on your site that link to all the other pages on your site.
- xml sitemaps are files you submit to search engines to tell them about all the pages on your site.
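For reference, an xml sitemap follows the sitemaps.org protocol and is little more than a list of URLs. The URLs and date below are placeholders:

```xml
<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <!-- one <url> entry per page you want search engines to know about -->
  <url>
    <loc>http://example.com/</loc>
    <lastmod>2011-06-01</lastmod>
  </url>
  <url>
    <loc>http://example.com/about/</loc>
  </url>
</urlset>
```

Once created, the file is typically submitted through each search engine’s webmaster tools or referenced in your robots.txt file with a Sitemap: line.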
An xml sitemap is really a backup plan. There’s no guarantee a search engine will crawl all the links in your xml sitemap. Think of it as a supplement to a good site structure. Google says as much on their About Sitemaps page:
“Google doesn’t guarantee that we’ll crawl or index all of your URLs. However, we use the data in your Sitemap to learn about your site’s structure, which will allow us to improve our crawler schedule and do a better job crawling your site in the future.”
I generally don’t use an xml sitemap and I’ve never had a problem with search engines discovering my pages. That’s not to say you shouldn’t create and submit an xml sitemap, but rather that if your site is structured well to allow crawling the xml sitemap isn’t really necessary.
An html sitemap is one you create on your site. It’s a page like any other that links to all your other pages. If you place a link to your html sitemap on every page of your site, say in the footer, then your html sitemap is never more than one click away from any page on your site. Since that sitemap links to every other page on your site, it also means no page is ever more than two clicks away from any other page on your site.
As your site grows it becomes more difficult to link to every page from a single page, so your sitemap begins to have its own structure. Perhaps your top level sitemap page links to several additional sitemaps (one for each section of your site) that then link directly to your pages. Each page is now never more than 3 clicks away from any other page.
In other words, your site structure can be as deep as you want, and through a sitemap linked to from every page, you still leave a shallow structure for search engines to find your content.
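As a rough sketch of the idea (the file names and sections here are hypothetical), the footer link and a top level sitemap page might look like:

```html
<!-- in the footer of every page: the sitemap is always one click away -->
<a href="/sitemap/">Site Map</a>

<!-- on /sitemap/: a top level sitemap linking to one sitemap per section,
     each of which links directly to that section's pages -->
<ul>
  <li><a href="/sitemap/articles/">Articles</a></li>
  <li><a href="/sitemap/products/">Products</a></li>
</ul>
```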
Before leaving sitemaps I want to mention video sitemaps. These are xml sitemaps specific to your video content. Because the content inside a video can’t be crawled, they’re probably more useful than regular xml sitemaps. The short video below will tell you what Google wants to see in your video sitemap. Again though, if you can include all those things directly on your site, you probably don’t need to submit a video sitemap.
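As a hedged sketch of what such a file can look like (the URLs here are placeholders, and Google’s documentation lists the full set of required and optional tags):

```xml
<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9"
        xmlns:video="http://www.google.com/schemas/sitemap-video/1.1">
  <url>
    <loc>http://example.com/videos/my-video/</loc>
    <video:video>
      <video:thumbnail_loc>http://example.com/thumbs/my-video.jpg</video:thumbnail_loc>
      <video:title>A descriptive title</video:title>
      <video:description>A short description of what the video covers</video:description>
      <video:content_loc>http://example.com/video/my-video.flv</video:content_loc>
      <video:duration>120</video:duration>
    </video:video>
  </url>
</urlset>
```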
A couple of times above I’ve mentioned that I don’t think xml sitemaps, including video sitemaps, are necessary. That doesn’t mean you shouldn’t use them. They certainly aren’t going to hurt and they may very well help search engines find pages that are difficult for them to find during a normal crawl.
What I want you to understand is that xml sitemaps are a supplement to, not a replacement for, a good site structure. It’s better to have search engines find your content by following links on your site than to rely on them following the xml you submit.
Eliminate Duplicate Content
Google has said that Googlebot “works on a budget”: if you keep it busy crawling huge files or waiting for your page to load or following duplicate content URLs, you might be missing the chance to show it your other pages.
Duplicate content might be two completely different pages with the same content, or it might be the same page accessed through two different URLs. The latter happens a lot with content management systems, where there aren’t static pages of content, but rather code that determines which content to pull from the database depending on different conditions.
It’s very possible the same content can be accessed by going through your main navigation or through a tag cloud or through internal site search. The URLs might end up looking like:
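As an illustration, the three URLs might look something like this (the paths are hypothetical, but typical of many content management systems):

```text
http://example.com/products/blue-widget/      <- via the main navigation
http://example.com/tag/widgets/blue-widget/   <- via a tag cloud
http://example.com/index.php?page_id=123      <- via internal site search
```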
The resulting page is the same in all 3 cases, but to a search engine it’s 3 different pages. You only want one of those URLs crawled and indexed. If all 3 are indexed you’re competing with yourself for the same traffic, which ends up leading to less overall traffic. You also leave it up to the search engines to determine which is the best page (URL) to show.
And if you allow search engines to crawl all 3 URLs, it may take them longer to find the one you want while they’re crawling the ones you don’t.
There are a variety of solutions to the above.
- Meta information like noindex and nofollow
- Canonical tags
- 301 redirects
- Robots.txt to block crawling of certain pages
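To give a feel for the first two, here’s a hedged sketch of what the markup looks like; the URL is a placeholder, and which page gets which treatment depends on your site:

```html
<head>
  <!-- canonical link element: tells search engines which URL is the
       preferred version of this page -->
  <link rel="canonical" href="http://example.com/products/blue-widget/" />

  <!-- meta robots: asks search engines not to index this particular page,
       while still following the links on it -->
  <meta name="robots" content="noindex, follow" />
</head>
```

You’d generally use one or the other on a given page: the canonical link element consolidates duplicate URLs onto a preferred one, while noindex simply keeps a page out of the index.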
Each of the above is worthy of one or more posts on its own so instead of trying to give you all the details here I’ll offer some resources for more information below.
The main thing I want you to understand from this section is that you need to be aware of the structure of your site and how many different ways (URLs) there are to access the same content. Realize that while you and your visitors understand they’re looking at the same page, search engines don’t and you need to help them understand a little more.
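For the other two solutions: a robots.txt file can block crawling of whole sections with rules like `User-agent: *` followed by `Disallow: /search/`, and a 301 redirect permanently points one URL at another. On Apache, a common .htaccess sketch is redirecting the non-www domain to the www version (swap in your own domain):

```apache
# .htaccess: permanently (301) redirect example.com to www.example.com
RewriteEngine On
RewriteCond %{HTTP_HOST} ^example\.com$ [NC]
RewriteRule ^(.*)$ http://www.example.com/$1 [R=301,L]
```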
- Specify your canonical
- Learn about the Canonical Link Element in 5 minutes
- Learn More about the Canonical Link Element
- Google, Yahoo & Microsoft Unite On “Canonical Tag” To Reduce Duplicate Content Clutter
- Canonical URL Tag – The Most Important Advancement in SEO Practices Since Sitemaps
- Dispelling a Persistent Rel Canonical Myth
- Canonical URL’s for WordPress
- Canonical URL links
- When NOT To Use Canonical URL Links
- A Standard for Robot Exclusion
- Get yourself a smart robots.txt
- Learn more about robots.txt
- Robots.txt from SEOmoz Knowledgebase
- URL Rewrites & Redirects: The Gory Details (Part 1 of 2)
- URL Rewrites & Redirects: The Gory Details (Part 2 of 2)
- Guide to Applying 301 Redirects with Apache
- The anatomy of a server sided redirect: 301, 302 and 307 illuminated SEO wise
- Using htaccess Files for Pretty URLS
The way you structure your content plays a part in how well your content gets crawled and indexed. If you want a search engine to list one of your pages in their results, the search engine first needs to find that page. It’s important that we make it easier for spiders to find all of the pages we want indexed.
Fortunately most of the ways you help search engines find your content also help real people find that same content. A sitemap, for example, can serve as a great backup to your main navigation and can be organized in a way that makes it a table of contents for your entire site. Shorter click paths mean people as well as spiders can get to your content more quickly.
Sometimes though, we need to understand the difference in how people and search engines see things. Real people won’t have any problem with multiple URLs pointing to the same content. If anything it likely makes it easier for them. Search engines on the other hand still get confused by “duplicate content” and you need to be aware of that so you can help make things clearer for them.
Next week we’ll look beyond crawling and indexing and talk about siloing or theming your content. The idea is to develop the structure of your content in a way to help reinforce the different keyword themes on your site and in the process help your pages rank better for keyword phrases around those themes.
If you liked this post, consider buying my book Design Fundamentals