Problems With WordPress Posts Going Supplemental In Google’s Index

For most of the time I’ve been running this blog it’s been gaining visibility in search engines. Most posts will pick up a few searchers in the long tail, and as more posts have been added the blog’s search traffic has increased. That is, until a month or so ago, when I discovered what might be a problem with the way Google indexes WordPress blogs, one that caused most of my posts to go into the supplemental index.

As I said, this blog was gaining visibility in search results, mainly Google, and then all of a sudden nearly all of that search traffic stopped. At first I thought little of it, since search results can vary from day to day, especially for keyphrases in the long tail. It was also possible my posts simply weren’t ranking as well as they had been. Not something I wanted to see, but certainly possible. As the days went by, though, and it kept happening to every post, I figured there was a sitewide or blogwide problem.

I’m not particularly aggressive with my optimization, and as near as I could tell I hadn’t done anything I would expect to cause problems. Eventually, as I dug further, I saw that nearly every post here had gone supplemental. When I did a site: search for the site, the results went supplemental after about the first 20 listings. I still didn’t do too much, since I don’t think it’s good to overreact to changes in search. Sometimes things change and need to settle a bit before you can really know the best thing to do. I did continue to investigate.

The Solution To My Supplemental Problems

Late last week I discovered two threads on the WebmasterWorld forum that seemed to deal with the issue. The first, WordPress and dup content issues, talks about the potential for having the same post listed under more than one URL. This can happen if you publish full posts on your index page, so that the same post also shows up on its own page and on category pages. All that content might be seen as duplicate by Googlebot, causing many pages to go supplemental and others to drop completely off Google’s radar.

The solutions generally come down to one of two options. One is to use a little PHP and WordPress magic to tell Googlebot not to index certain URLs via dynamically generated noindex meta tags. The other is to use 301 redirects in your .htaccess file to get things in order. I’ll let you read through the thread for the details of how to achieve both, as several different methods are offered.
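
To give a sense of the first option, here is a minimal sketch (my own illustration, not the exact code from the thread) of a conditional noindex tag you could print from your theme’s header.php, inside the head section. Which conditionals you check depends on where the duplication is actually showing up on your blog:

<?php
// Sketch only: ask search engines not to index archive-style pages
// (category, date, and search listings, plus paged archives) where
// the same post content also appears.
if ( is_archive() || is_search() || is_paged() ) {
    echo '<meta name="robots" content="noindex,follow" />' . "\n";
}
?>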

This didn’t seem like the solution for me, since all my posts other than the most recent are published as excerpts rather than in full. While there is some overlap of content, it didn’t seem likely there would be enough to cause duplicate content issues.

The other thread I found, Google indexing /feed URLs, discusses how Googlebot may index the feed for a post instead of the post itself and place the feed in the supplemental index. This is the problem I’ve been seeing, since the pages listed as supplemental for this blog all seemed to end in something like /feed/.

The solution offered in the thread was a simple one. In my robots.txt I disallowed Googlebot from URLs ending in /feed/, /feed/rss/, and /trackback/, since I had seen each listed as a supplemental result. Because Googlebot responds to wildcard characters in the robots.txt file, the rules were trivial to add.


User-agent: Googlebot
Disallow: /*/feed/$
Disallow: /*/feed/rss/$
Disallow: /*/trackback/$

The asterisk matches anything and the dollar sign matches the end of the URL. So each of the above lines tells Googlebot not to index URLs ending in /feed/, /feed/rss/, or /trackback/.

I added those lines just a few days ago and have already begun to see search traffic from Google coming back to the blog. That same site: search is now showing supplemental results starting at the 38th URL, and more of the actual posts are being listed in the regular index. I would imagine before too long the rest of the supplemental results will be gone.

Other Possible Reasons For The Recent Improvement

I have made a few other changes to the blog recently, so there might be other reasons why things are returning to where they were. If you remember, I created a blog sitemap and also experimented with reducing the click distance to some posts by adding them to the main navigation for the blog. The sitemap, too, should reduce the click distance for all the posts here.

While it’s possible either could be the reason for the reduction in supplemental results I’m inclined to believe it was the addition to the robots.txt file that’s remedying things. In part because of the timing, but mostly because I had been seeing the feeds indexed as supplemental.

If you are running a WordPress blog you should read the two threads I linked to above. It was about six months after starting this blog before the issue occurred here, so while it may not be happening to you at the moment, it might at some point. Quite honestly this seems to me like a problem with Googlebot, since most blog owners could run into it with some pretty standard WordPress settings. But Googlebot’s fault or not, there is a solution. Keep a watch on your own blog, WordPress or not, as I don’t think this issue needs to be specific to WordPress itself; it’s more a problem with the way Googlebot indexes feeds.

I’ll keep you updated on the progress with the posts here coming out of the supplemental index. Hopefully in the very near future all the supplemental results will be gone.


39 comments

  1. Nice article and I am sure it is helpful to many people. I do, however, have a good simple solution which I use. It is called an intro post. Use the break in the WordPress editor so that the home page, category pages, feeds, etc. only have a snippet of the article and the rest is on the full article page. This takes care of the problem. It could be a problem if you have short articles, however, but then short articles are less likely to get many long-tail searches.

  2. Glad you like the article Randy. I do use just the snippet too on my posts. Only the most recent post is published in full on the main page of the blog. I think that can solve many of the duplicate content issues, but I don’t think it addresses the problem of the feeds getting indexed.

     The feed of any post will always show duplicate content to the post itself; the only difference will be in the code around the content. I don’t know why Google chooses to index the feed. It would seem like they should easily be able to realize they shouldn’t, but in some cases they do.

    I agree with breaking up the posts on most pages so they only show the first paragraph or so for a few reasons, but I do think the issues I had were from something different.

    By the way I like your blog.

  3. I agree. There are times when dividing isn’t the best solution. Your points were good. Even some things I plan to look into further…
    Glad you like my blog!

  4. Col, you’d want to add the code to your robots.txt file. You should also add the Googlebot-specific code before your declarations for all robots. If Google sees User-agent: * before it sees User-agent: Googlebot, then Googlebot will follow your declarations for all robots, and I believe it will ignore what comes after.

    So your robots.txt should be:

    User-agent: Googlebot
    all your declarations for Googlebot

    User-agent: *
    your declarations for all other robots

     I think you may also need to repeat all the declarations you want for Google, and not just these new specific declarations about the feeds. I didn’t actually do that myself, since I wasn’t aware of it when I added the new declarations about the feed, but I believe Googlebot will see what you specifically have for it and then not look further into what you specify for all robots.

     I’m not 100% sure on that, but I think it means you may need to have many things (the stuff you want both Googlebot and the other bots to follow) listed twice in your robots.txt file: once for Googlebot and then once more for all other robots.

    I hope that makes sense. Maybe I need to update this post with a full sample robots.txt file or write a new post on creating one.
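
     In the meantime, here is a rough skeleton of what I mean. Treat the /wp-admin/ line as a placeholder for whatever you actually want every robot to skip; it’s repeated under Googlebot for the reason above, and the feed rules from the post sit in the Googlebot section:

     User-agent: Googlebot
     Disallow: /wp-admin/
     Disallow: /*/feed/$
     Disallow: /*/feed/rss/$
     Disallow: /*/trackback/$

     User-agent: *
     Disallow: /wp-admin/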

  5. Or, alternatively, you could just give the link to your robots.txt, which I would assume is http://www.yellowhousehosting.com/robots.txt :) I’ve had similar problems with one of my sites (www.jaisaben.com), which is Joomla based and I haven’t been able to resolve just yet. I’ve also recently posted about supplementals and the issues I’ve been having at my blog, http://www.utheguru.com, and, as always with a new site, would love some traffic to kick it off!

    I’m still uncertain about whether to exclude /feed etc in robots.txt – it’s a hard one.

    It’s incredible how many people don’t even notice the problem – On one of the forums I frequent, I came across a post from a lady with the URL http://www.writersmuster.com. Her name is Christice. She has a fantastic site, lots of members and heaps of pages and inlinks (3553 inlinks!!!), and yet I noticed when I went there that she has NO pagerank, and only 7 out of her 2990 pages are indexed, the rest are supplemental.

     Whoa! What’s going on there? I’d give my eye teeth for that kind of exposure, and yet a few simple mistakes are probably costing her hundreds of dollars a month in revenue (if not thousands).

    Cheers,

    M


  6. Thanks for the info Matthew. I think you’re right that people don’t notice the problem right away. I know I didn’t, and I was actively looking and have an idea of how all this stuff works. I wouldn’t think the average site owner would even realize there is a problem.

     I’m not sure if this issue is just a WordPress one or if the solution would also help with your Joomla blog. I would think it would, but I can’t say I have first-hand experience.

    I can’t see any reason though why you’d want the feeds indexed. Assuming a feed did rank well in results and someone clicked on it I would imagine the immediate response would be to find the back button. No one wants to read an xml file when they’re looking for content.

     I took a quick look at the writersmuster site and I think the issue there is that every link has a PHPSESSID as a parameter. Search engines have trouble with IDs like that because they can easily end up indexing the same page many times. Even if they do index the pages, those copies are likely to be seen as duplicate content, and with the number of duplicate pages that high, the entire site could find itself in the supplemental index.

     Generally you don’t want to use session IDs until someone logs in to your site. I’d bet removing the session IDs from the links would get most of the site indexed quickly.
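
     For what it’s worth, on a plain PHP site the usual fix is a configuration change rather than editing the links by hand. Assuming the IDs are coming from PHP’s own trans-sid feature (a guess on my part, since forum software sometimes handles sessions itself), something like this in php.ini would stop them:

     ; stop PHP from appending PHPSESSID to URLs
     session.use_trans_sid = 0
     ; track sessions with cookies only
     session.use_only_cookies = 1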

  7. Thanks Steven!

     It’s fantastic to get real advice from a real person – I was worried you thought I might be spamming! Personally, I allow one-way links (i.e., I don’t activate ‘nofollow’ on links) from my commenters, as I am one of the radical few who believe Google actually rewards outbound links.

     I’ll pass on your advice to Christice (who, it turns out, is a man, I think :) ).

    By the way, happy new year from a balmy Brisbane (Australia) night – it’s about 90 degrees F here… :) But I know you guys hate hot weather :D

    M

  8. Glad to help Matthew and no I didn’t think your comments were spam.

    I agree that outbound links can be rewarded, but it depends on the links. nofollow is meant as a way to let those search engines who pay attention to the tag know that you haven’t editorially approved some links.

    I think it’s a good practice with blog comments since you never know when someone may leave a comment that links back to a ‘bad neighborhood’ or when real spam might get through. It’s one thing to have those sites link to you, but another to link back to them.
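
     For anyone reading along who hasn’t used it, nofollow is just an attribute on the link itself, something like the following, with example.com standing in for the linked site:

     <a href="http://example.com/some-page/" rel="nofollow">link text</a>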

    You’re making me jealous with the weather there. It’s been an unusually wintery winter here in Boulder. Normally winters here are mild, but we’ve had several weeks of snow storms and this weekend we saw single digit temperatures. I’m already looking forward to spring.

  9. Nice article. Got here through your comment on seobook.

    I checked your robots.txt and noticed that on line 15 (if I counted correctly) it reads “\disallow”

    I’m no robots.txt specialist but is this a typo?

    cheers

  10. Thanks Steven, great information. I started a WordPress blog not too long ago and it’s good to be able to prevent future problems like getting your pages into the supplemental index of Google. I also got to your site through Aaron Wall’s blog.

  11. Glad I could help Michael. I know what I did isn’t the solution for all WordPress blogs that have gone supplemental, but it does seem like a common issue. It feels good to see your pages coming out of the supplemental index so fast doesn’t it?

  12. I have found more rules for robots.txt in another article; the whole list goes as follows:

    Disallow: /wp-
    Disallow: /search
    Disallow: /feed
    Disallow: /comments/feed
    Disallow: /feed/$
    Disallow: /*/feed/$
    Disallow: /*/feed/rss/$
    Disallow: /*/trackback/$
    Disallow: /*/*/feed/$
    Disallow: /*/*/feed/rss/$
    Disallow: /*/*/trackback/$
    Disallow: /*/*/*/feed/$
    Disallow: /*/*/*/feed/rss/$
    Disallow: /*/*/*/trackback/$

     Have you shortlisted only some of them? Any reason?

  13. muztagh, thanks for listing these other rules. I was only listing the ones I used, but there are many more you could add to your robots.txt file depending on the particular issues you might be having.

    Most of these are really the same thing I did, but the site uses a different directory structure, which is why you’ll see more than one /* in front. Each one of those is for another level deeper in your site structure.

    I think having several /* might be redundant though since the wildcard (*) should really take care of as many levels of sub-folders as you have. The wildcard is a substitute for anything so something like:

    /*/feed/$

     means: starting at the root directory (/), match everything (*) until you get to a directory called feed (/feed/). The $ at the end means to match the end of the URL.

    The three rules I used should cover all but the first two rules you have listed. They’re just a different way to write the same thing.

  14. I found this to be extremely helpful!

     It would appear from my findings that Google blog search http://blogsearch.google.com actually will allow the duplicate posts to be indexed; however, the posts will be systematically dropped from Google’s normal search. I had a case in point with an individual post that was indexed one day and gone the next. That post still remains in the blog search.

    Cheers,
    Justin Frost

  15. Thanks for the info Justin. I guess it makes sense, too, that the posts would stay in the blog search since the feeds are being indexed. I believe the results there are time sensitive, though, so it’s probably still a good idea to make sure they stay in the general index as well.

     I’ll have to look and see how recent posts do in the blog search as compared to how they rank later in the general search results. I suppose it’s possible that while the posts here are indexed and get traffic from general search, they aren’t indexed in the blog search. If that’s the case then Google should do something to ensure that a page can be indexed in both and not have it be an either/or.

  16. Sometimes it’s amazing what just a few small changes can do. It happens a lot where duplicate content and the supplemental index is concerned. In many ways it can be likened to the bursting of a dam and the subsequent flood. Often the pages that are being filtered as being duplicates would rank for many phrases, but the duplicate issue is keeping them from appearing in results. As soon as those pages are seen as unique they can all start ranking again.

    If you only have a few pages that have gone supplemental you probably won’t see a big difference, but if you have a few hundred pages the new traffic could be very noticeable.

  17. Thanks for the article Steven. I just moved my free WordPress blog to my own host and am in the process of learning about SEO. I’m doing all these SEO tweaks to have an SEO-friendly site.

    I’m just wondering whether or not archives and category pages should be added to the robots.txt file, as Google might see them as duplicates?

  18. Glad you liked the post. Good decision moving to your own hosted WordPress blog. You won’t regret the move. With the archives and category pages, it depends somewhat on how you set up the structure of the blog. Chances are you will want to block some pages or they’ll be seen as duplicates. I actually need to do that here, since Google is picking up some of the category pages as duplicates.
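
     As a very rough sketch, and only a sketch since the right paths depend on your permalink structure, the kind of lines you might add to robots.txt would look like:

     User-agent: *
     Disallow: /category/
     Disallow: /page/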

    Michael Gray has a really good video on making WordPress search engine friendly. A lot of good info on best practices for structuring things.

     I see you already set up permalinks for your URLs and grabbed a plugin so you have unique meta descriptions. Both are good. Make sure you do a site: search from time to time to see if some pages on the blog have gone supplemental. If some have, you’ll probably see a pattern, and then you’ll want to add a couple more lines to the robots.txt.

     The two WebmasterWorld threads I linked to in the post should have all the details. I think Google may also be implementing a remove URL feature through Webmaster Tools, so you’ll be able to remove duplicate pages that way as well. I think they’re planning on adding that at some point, at least.

  19. Good point Colin. I use the same plugin, though in lazy mode where it picks up the first whatever number of characters of the actual post. I noticed an improvement in Yahoo rankings and traffic too shortly after adding it.

  20. I used the following lines in my robots.txt file but did not see any results in a week.
    Disallow: /*/feed/$
    Disallow: /*/feed/rss/$
    Disallow: /*/trackback/$

    I discovered my feed had the index.php in the URL so I changed the paths as follows. Does anyone know if this is correct? When I type http://www.MyBlog.com/blog/index.php/feed in a browser window I see XML data.

    User-agent: Googlebot
    Disallow: /blog/index.php/feed/$
    Disallow: /blog/index.php/feed/rss/$
    Disallow: /blog/index.php/trackback/$
    User-agent: *
    Disallow: /blog/index.php/wp-
    Disallow: /blog/index.php/feed/
    Disallow: /blog/index.php/trackback/
    Disallow: /blog/index.php/rss/
    Disallow: /blog/index.php/comments/feed/
    Disallow: /blog/index.php/page/
    Disallow: /blog/index.php/date/
    Disallow: /blog/index.php/comments/
    Disallow: /rsscb

  21. Glenn, I would think the first three lines that didn’t work would still pick up the ones you’re trying now. The /*/ should pick up anything, so the commands are just telling Googlebot not to index anything that ends in feed/, rss/, or trackback/.

     When you do a site: command at Google for your blog, do the supplemental results end in feed/, rss/, and trackback/? If not, then you might be having a different issue.

    Feel free to post a link to your blog here or email me if you want. I’ll be happy to take a look.

  22. Thanks for the post. My problem is my Blogger blog. It has been a PR0 forever and I can’t figure out why. Is this different from being filed in the supplemental index? (I don’t see the supplemental designation when I check, but I am pretty new to the techie side of SEO & blog stuff.)

    The funny thing is I have gotten to #1 on a few Google searches yet still no PR value? I guess I have a lot to learn. (I mainly blog for the fun of it.)

    • I think what you’re seeing is something different than what I described here.

      PR comes from links pointing into your page. It’s also important to note that Google only updates the PR we see in the toolbar every few months. They just updated over the weekend so you may be seeing PR for your blog now.

      Google and WordPress have also changed things in regards to the feeds and the supplemental index since I wrote this post. While I still think it important to eliminate duplicate content from your site I don’t know that what I described here is the same problem it was a couple years ago.

  23. Just last night I was looking for a good article on WordPress robots.txt files, and now I’ve found it. The best article I found last night said robots.txt was not the proper way to eliminate duplicate content. They suggested noindex meta tags instead. What are your thoughts on this?

    • First, keep in mind this post is a few years old now and things do change.

      I’ve seen people suggesting robots.txt not being the best way to eliminate duplicate content too. There was a video by Matt Cutts a couple years back talking about the different ways Google might find urls on your site. None of the methods for blocking certain pages from getting indexed was 100% effective and Matt suggested trying a few to make sure the page didn’t get crawled.

      Some methods would be robots.txt, the meta noindex tag, and the rel=canonical link. I think you have to look at what each does in order to decide which is best to use, and I think it’s OK to combine techniques.

      robots.txt is a quick way to tell search engines not to crawl everything inside a directory, for example, but if other sites are linking to those pages the URLs might still end up in the index. Adding a noindex should ensure the page doesn’t get indexed.
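
      For reference, both of those are just tags that go in the page’s head section, something like this, where the URL is only a placeholder:

      <meta name="robots" content="noindex,follow" />
      <link rel="canonical" href="http://example.com/original-page/" />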

  24. Hi Steven, (must be a good guy, you have my name)

    My first time here and have just spent 15 minutes reading various bits & pieces ;-)

     Anyway I’m just a little curious & have to ask: do you use the All In One SEO plugin? Because surely many of the issues mentioned here are dealt with by AIOSP.

     On the subject of people not realising there is a problem: a simple check of their sitemap in Google Webmaster Tools would very easily confirm whether there were issues with a site. E.g. I currently have 200 pages submitted in my sitemap and around 170+ currently indexed (which is about the norm), but if anyone were to find that the majority of their pages were not indexed, that would be a sure sign of issues that need dealing with.

     Finally, just a few pointers to help beat duplicate content:
     1. Always use excerpts on your home / index page.
     2. Always use excerpts / summaries in your feeds; this also helps prevent content theft.
     3. Always use a top quality SEO plugin such as All In One SEO or Total SEO.
     4. If you have any doubt about whether a page/post/article is original, just copy all the text and feed it to Plagium.com; it is free to use and will check the entire web for any duplicates of the text you submit.

    Hope that helps

    Steve (Taffy)

     • Thanks Steve, I do use All in One now, but I don’t think I was when I wrote this post. I’m not even sure All in One existed when I wrote this post; it has been 4 years, after all. I think the post even predates XML sitemaps. In all honesty, much of this post probably isn’t all that relevant today.

       However, all your suggestions are relevant and they’re all good too. I don’t use an XML sitemap because I haven’t felt the need, though I may start using one in the future. All the rest of your suggestions I do follow.

      Maybe I should update this post one of these days or at least mention at the top how old it is and that much of it may no longer be an issue, especially with all the tools available today.

  25. Hello,

     Really informative post, but I have one question. I have a robots.txt file and I have added rules to prevent Google from indexing feed URLs, but it looks like Google isn’t getting it. As a result, every one of my post feed URLs is in Google’s index; even the wp-admin one is there too.

     Can you please look into my issue? Anxiously waiting for your reply.

    Muhammad Qasim
    genius-tips.com

    • Hi Muhammad. I take it you received my email about this. As I mentioned I wasn’t aware this was still an issue, but I do see your feed being indexed.

      I don’t really do anything to block the feeds from Google, unless one of the plugins I’m using is doing it for me.

      Have you been able to sort this out?
