Wednesday, May 16, 2012
Duplicate content means a site has two or more posts with the same title, the same content, or both, and Google doesn't like it at all. It can come up when a publisher posts an article and then changes the title or content after Google has already indexed it. Or the publisher has activated the "Archive" function on the site. Or the publisher runs a mobile version of the site. Or anything else that changes the URL of the original content, for example from http://j-smith-site.blogspot.com/example.html to http://j-smith-site.blogspot.com/example.html?m=0. In short, one post is reachable under two or more different URLs on the same site.
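To see why those two addresses count as the same post, here is a minimal Python sketch that strips the "m" query parameter before comparing URLs. The function name canonical_url and the choice to ignore only "m" are my own assumptions for illustration, not something Blogger or Google provides.

```python
from urllib.parse import urlsplit, urlunsplit, parse_qsl, urlencode

def canonical_url(url, ignored_params=("m",)):
    """Return the URL with the ignored query parameters and any fragment removed."""
    parts = urlsplit(url)
    # Keep every query parameter except the ones we treat as "same content".
    kept = [(key, value)
            for key, value in parse_qsl(parts.query, keep_blank_values=True)
            if key not in ignored_params]
    return urlunsplit((parts.scheme, parts.netloc, parts.path, urlencode(kept), ""))

original = "http://j-smith-site.blogspot.com/example.html"
variant = "http://j-smith-site.blogspot.com/example.html?m=0"
print(canonical_url(variant) == original)  # True
```

Two URLs that reduce to the same canonical form are candidates for being reported as duplicates.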

How To Check Whether A Site Has Duplicate Content?

Just go to Google Webmaster Tools (www.google.com/webmasters) > click on the site you want to check > Optimization > HTML Improvements. There you'll see whether your site has any problems with duplicate content.
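If you already have a list of your pages and their titles, you can also look for repeated titles yourself. This is only a rough Python sketch of the idea: the function name find_duplicate_titles and the sample pages are made up for illustration, and it is not how Google's report is produced.

```python
from collections import defaultdict

def find_duplicate_titles(pages):
    """Group (url, title) pairs by title and keep titles used by more than one URL."""
    urls_by_title = defaultdict(list)
    for url, title in pages:
        # Compare titles case-insensitively and ignore surrounding whitespace.
        urls_by_title[title.strip().lower()].append(url)
    return {title: urls for title, urls in urls_by_title.items() if len(urls) > 1}

# Hypothetical sample data; replace it with your own list of pages.
pages = [
    ("http://j-smith-site.blogspot.com/example.html", "Example Post"),
    ("http://j-smith-site.blogspot.com/example.html?m=0", "Example Post"),
    ("http://j-smith-site.blogspot.com/other.html", "Other Post"),
]

for title, urls in find_duplicate_titles(pages).items():
    print(title, urls)
```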

How To Clean Up And Handle This Problem?

1. Remove The Content: Download or look through all the URLs shown on the HTML Improvements page in Google Webmaster Tools, then open a new browser tab. In the new tab, go to Google Webmaster Tools (www.google.com/webmasters) again > click on the site you want to clean up > Optimization > Remove URLs, then click "Create A New Removal Request" and follow the instructions. You'll have to decide whether to remove the content from the cached page only or from the cached page and search; just select "cached page and search". But note: remove the duplicate, not the original content. After that you'll have to wait before it takes effect. The error information on your "HTML Improvements" page won't clear right away; it could take a month or maybe longer. It took about a month for my site.

2. Handle The Problem: After you remove all the posts with the duplicate content issue, you'll have to deal with the cause so it doesn't come back. There may be several ways to do this, but I know just two. First, you can use Webmaster Tools, and second, you can create a "robots.txt" file. Here we go.

*Using Webmaster Tools To Prevent Google (Specifically The Googlebot Crawler) From Indexing Specific Pages On Your Site

Go to Webmaster Tools > click on your site > Configuration > URL Parameters > Add Parameter. You'll find a field where you enter the parameter you want Google to stop indexing. And here is the problem, or maybe my problem: I don't really know what to put there. The only parameter I know is "m", which on a Blogger blog covers URLs ending with "?m=1", "?m=0", and the like. So if your Blogger blog has this problem, enter the single letter "m" in the field and select "Yes, changes or reorders" from the dropdown list below it. After that, choose "Narrows" from the next dropdown list and pick the "No URLs" option below it. However, if you have a different problem, then I'm sorry, I don't know any other parameter to suggest.

*Using A Robots.txt File To Prevent Google From Indexing Specific Pages On Your Site

Not long ago this may have been a problem for you, because there was no way to add a robots.txt file to your site, neither from Google Webmaster Tools nor from Blogger itself. Not anymore. A few weeks ago, maybe a month, Blogger added some new features to keep you pleased with your blog. One of them lets you set your own robots.txt file, which you can use to handle your duplicate content issue. All you have to do is activate the feature and fill it with a few short rules, which in your case prevent Google from crawling/indexing specific pages/URLs. Just go to your Blogger dashboard, open Settings, and click "Search Preferences". Then click "Edit" next to "Custom robots.txt" and, for the duplicate content issue, fill in the field with the rules below:

User-agent: Mediapartners-Google
Disallow:

User-agent: *
Disallow: /search
Disallow: /*.html
Allow: /*.html$
Allow: /

Sitemap: http://j-smith-site.blogspot.com/feeds/posts/default?orderby=updated


Now click save and you are done.

The Disallow: /*.html and Allow: /*.html$ rules let a web crawler that understands the * and $ wildcards (Googlebot does) crawl only URLs that end in ".html". Any longer URL, such as .html?m=0 or .html?comments, is blocked.
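To check how these rules play out on a few URLs, here is a small Python sketch of wildcard matching. It assumes the precedence Google documents for its own crawler, as I understand it: the longest matching pattern wins, and Allow wins a tie. The helper names are mine, and other crawlers may behave differently. Python's built-in urllib.robotparser is not used because it does not handle the * and $ wildcards.

```python
import re

def _pattern_to_regex(pattern):
    # '*' matches any run of characters; a trailing '$' anchors the end of the URL.
    anchored = pattern.endswith("$")
    if anchored:
        pattern = pattern[:-1]
    body = ".*".join(re.escape(piece) for piece in pattern.split("*"))
    return re.compile("^" + body + ("$" if anchored else ""))

def is_allowed(path, rules):
    """rules is a list of ("allow" | "disallow", pattern) pairs.

    Assumed precedence: the longest matching pattern wins, and allow wins ties.
    """
    best = None  # (pattern length, is_allow) of the strongest match so far
    for kind, pattern in rules:
        if not pattern:
            continue  # an empty Disallow matches nothing
        if _pattern_to_regex(pattern).match(path):
            candidate = (len(pattern), kind == "allow")
            if best is None or candidate > best:
                best = candidate
    return True if best is None else best[1]

RULES = [
    ("disallow", "/search"),
    ("allow", "/"),
    ("disallow", "/*.html"),
    ("allow", "/*.html$"),
]

for path in ["/", "/example.html", "/example.html?m=0", "/search/label/seo"]:
    print(path, is_allowed(path, RULES))
```

Under these assumptions the home page and /example.html come out as allowed, while /example.html?m=0 and anything under /search come out as blocked.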

Every Blogger blog has a robots.txt file by default, located at "Your Blog URL/robots.txt". Here is mine: John Smith's Blog | Robots.txt
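If you'd rather read the file from a script than from the browser, a minimal Python sketch could look like this. The function names are hypothetical, and fetch_robots_txt makes a real network request, so it is only defined here and not called.

```python
from urllib.parse import urljoin
from urllib.request import urlopen

def robots_txt_url(blog_url):
    """Build the robots.txt address for a blog from any URL on that blog."""
    return urljoin(blog_url, "/robots.txt")

def fetch_robots_txt(blog_url, timeout=10):
    """Download robots.txt and return it as text (this makes a network request)."""
    with urlopen(robots_txt_url(blog_url), timeout=timeout) as response:
        return response.read().decode("utf-8", errors="replace")

print(robots_txt_url("http://j-smith-site.blogspot.com/"))
# http://j-smith-site.blogspot.com/robots.txt
```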

3. Lastly, the feature that will give your posts duplicate content is "Archive". An archive generates a URL similar to this one: http://j-smith-site.blogspot.com/01_01_2012.html. To handle the problem, you can place a robots meta tag in the header of your template. Here it is:

<meta content='noarchive' name='robots'/>

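Or the better one, which keeps archive pages out of the index altogether (noarchive only stops search engines from showing a cached copy):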

<b:if cond='data:blog.pageType == "archive"'><meta content='noindex' name='robots'/></b:if>

Also include this (an additional meta tag for static pages):

<b:if cond='data:blog.pageType == "static_page"'><meta content='noindex' name='robots'/></b:if>

You can also turn the archive feature off in your Blogger settings. You'll have to find it yourself because I forgot exactly where it is.
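Once the meta tags are in your template, you may want to confirm that an archive page really carries them. Here is a minimal Python sketch that reads the robots meta tag out of a page's HTML. The class and function names are my own, and the sample HTML is made up. In practice you would pass in the source of one of your archive pages.

```python
from html.parser import HTMLParser

class RobotsMetaParser(HTMLParser):
    """Collect the directives from every <meta name='robots'> tag in a page."""

    def __init__(self):
        super().__init__()
        self.directives = set()

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        attributes = dict(attrs)
        if (attributes.get("name") or "").lower() == "robots":
            content = attributes.get("content") or ""
            # A content value may hold several comma-separated directives.
            self.directives.update(
                part.strip().lower() for part in content.split(",") if part.strip()
            )

def robots_directives(html):
    parser = RobotsMetaParser()
    parser.feed(html)
    return parser.directives

# Hypothetical page source for illustration.
sample = "<html><head><meta content='noindex' name='robots'/></head><body></body></html>"
print(robots_directives(sample))  # {'noindex'}
```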

That's all, guys. With all the tricks above in place, your posts shouldn't get duplicate content issues anymore, or at least 99% of them won't. Perhaps..

Updated: Perfect Way To Get Rid Of Duplicate Content Issue

11 comments:

  1. How do I add robots.txt and a sitemap to my blog site?

  2. I told you: just go to your Blogger dashboard, open Settings, and click "Search Preferences". Then click "Edit" next to "Custom robots.txt" and paste in your robots.txt rules.

    Example robots.txt and sitemap.

    User-agent: *
    Disallow: /search
    Allow: /

    Sitemap: http://j-smith-site.blogspot.com/feeds/posts/default?orderby=updated

  3. Dear sir, I have the same problem with my Blogspot site. When I checked HTML Suggestions I found 1056 duplicate tags, for example www.example.blogspot.com
    http://example.blogspot.com/html?m=0

    All the URLs end with ?m=0,
    so I removed every URL ending with ?m=0 using the Remove URLs tool in Webmaster Tools.
    After that I added the m parameter, selected "Yes, changes or reorders" from the dropdown list, and chose "No URLs".

    But sir, I did not select the "Narrows" option. Is it important that I select this option as well?



  4. @jay kay: That doesn't matter, brother. It's just something like a reason you give to Google.

    By the way, I have a better idea for handling the duplicate content issue. Check this post: New Methode To Remove Duplicate contents

  5. Hi
    If you modify the robots.txt like you say, I believe that the homepage of your blog will no longer be indexed.

    Replies
    1. Then the question is, "why do you want your homepage to be indexed by Google?" You know, your homepage content changes every time you update your blog, and Google doesn't index content within a day; it may take 3 days or maybe a week.

      That said, not indexing a page doesn't mean that Google doesn't crawl it. And the most important thing is that Google still includes your homepage in search results. That's it..

  6. Hello;

    I've been looking for this solution. Thank you for sharing it here. Is it OK to set up only GWMT and not edit robots.txt? Or is it a must to set up both?

    Replies
    1. I suggest you set up your robots.txt. I don't know about GWMT, but robots.txt has many uses, like telling Google not to crawl a page, keeping your blog away from duplicate content issues, and many more.

  7. It confuses me. How could it be available in Google search if we made it noindex?

    Replies
    1. Well then.. :)

      1. Indexed: means that Google has made a copy of a page and saved it on its servers. Try searching for something on Google, then click "cached page" on one of the results. It won't show you the current page, but the older version that Google captured in the past.

      2. Followed: Do you know anything about "Dofollow and Nofollow" blogs? It's close to that. It decides whether Google should count the outbound links in your blog (links in a page that point to pages outside your blog) as nofollow or dofollow links.

      3. No Archive: I don't know how to explain it, but the archive creates a lot of duplicate content. It has a URL like this one: http://j-smith-site.blogspot.com/06_2013.html

      Am I clear?

    2. And anyway, what is your homepage for? Everything you have on your homepage, you also have on the post pages, right?

      Just look at mine, everything is okay, right? I set up my blog just like I wrote in my post..
