Google Ranking Scraped Material on Top?
Today I did a bit of research around the Internet to see how many people were scraping my articles. Needless to say, there was a whole bunch of them. The interesting thing, however, was discovering that when I performed narrow search queries using sentences contained in my articles, Google would return the scraper site on top and my original content in the supplemental results.
For example, I searched for “the user should be able to see all the navigation options straight way”, quotation marks included to make it an exact match. That sentence comes from my most popular article, 43 Web Design Mistakes You Should Avoid.
Google found 1 page with that sentence, and it was a scraper site violating copyright. The original article was in the supplemental pages. The same pattern was found on pretty much all my popular articles.
Maybe the problem was specific to my site, I thought, so I decided to test with some other authority blogs. Next I searched for “Tip Jars and donation buttons have been a part of blogging for years now”, which is a sentence from one of the most popular posts on Problogger, How Bloggers Make Money from Blogs. This time four results appeared, and all four of them were scraper sites. The original entry was on the third supplemental page.
Finally I searched for “Instead of using the throwaway plastic utensils available at work”, which comes from an article on LifeHacker. You can’t have much more authority than these guys, so this was the acid test. Guess what: once again Google returned 4 pages, and all of them were scraper sites.
Try to perform a search for exact sentences coming from your popular articles. Are the scraper sites ranking above your original content? Do you know why Google behaves that way or how one could fix the problem?
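If you want to repeat the test for several sentences, a small helper can build the exact-match query for you. This is only a sketch: the function name is made up for illustration, and it assumes the usual `q` query parameter on Google's search URL, with the sentence wrapped in quotation marks as described above.

```python
from urllib.parse import urlencode


def exact_match_query_url(sentence: str) -> str:
    """Build a search URL that looks for the sentence as an exact phrase.

    The surrounding double quotes are what make it an exact-match query.
    """
    quoted = f'"{sentence}"'
    return "https://www.google.com/search?" + urlencode({"q": quoted})


if __name__ == "__main__":
    print(exact_match_query_url(
        "Instead of using the throwaway plastic utensils available at work"
    ))
```

Open the printed URL in a browser and compare where the original article and the copies show up.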
Update: Maybe it was not clear in the post, but this problem happens only when I search for exact sentences contained in the articles. If I search for one or two keywords related to the title, my articles still rank on the first page of results.
30 Responses to “Google Ranking Scraped Material on Top?”
I see this from two different points of view. One is anger. I quickly found a few of my oldest posts copied onto other blogs, without even a note of where the content was taken from, pictures included; one copy uses all 19 images from my original (a recipe for apple strudel with pictures of each step). The other point of view, as my site covers a very small niche, is a feeling of: hey, I am the authority, others think my work is worth copying.
But well, I found my site always to be on top of the few listings, above the derivatives. Thanks for DBT, Bernhard.
One solution: boycott Google with robots.txt and META tags to keep Google from indexing new content. It may sound extreme, but if the content lands in supplemental results while scrapers earn cold hard cash, what is the difference? You are not earning any money! And if enough bloggers did this, Google may get the point. Their business model will fail without new content from Joe and Jane User, and if Joe and Jane User can only find scraper sites, then Google’s validity will lessen.
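For anyone curious what such a boycott would look like in practice, here is a minimal sketch. The robots.txt content is a hypothetical example, not taken from any real site, and the check uses Python's standard `urllib.robotparser` to confirm how a rule set like this would be read. The META tag equivalent would be something like `<meta name="googlebot" content="noindex">` in each page's head.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt: shut out Google's crawler, allow everyone else.
ROBOTS_TXT = """\
User-agent: Googlebot
Disallow: /

User-agent: *
Disallow:
"""


def can_crawl(user_agent: str, url: str, robots_txt: str = ROBOTS_TXT) -> bool:
    """Return True if the given crawler may fetch the URL under these rules."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    page = "http://example.com/my-new-post"  # placeholder URL
    print("Googlebot allowed:", can_crawl("Googlebot", page))
    print("Other bots allowed:", can_crawl("SomeOtherBot", page))
```

Note that robots.txt is only honored by well-behaved crawlers, so it would not keep the scrapers themselves out.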
You are not the only one who wonders about this. I would Google some of my own posts. (My site is a quasiblog.) Google returned a page from my site linking to the post, but the post itself would be buried in supplemental results, if the post was returned at all. This made absolutely no sense to me! Both pages were on my own site! It would seem natural that the post, not the page linking to the post, would be the one Google searchers would want returned. Duh.
I may open a new site on a different topic. But the unthinkable has occurred to me: shutting out all search engine spiders. I am tired of playing the SEO game. I am tired of rarely hearing from a soul who browses my site. The hits are inching upward towards 300 per day, but I get almost no feedback. And people are stealing my work.
I want to interact with real people from my own region for a change. So I might use only social networking to promote my new site.
This is the kind of thing that Google really has to focus on if they don’t want to lose what little trust people have left in them.
If their main product – their search engine – doesn’t work correctly and bring the most relevant searches to the end user, what does that say about the rest of their stable of products?
I’ve been a little suspicious of their adsense tracking for some time now. MyBlogLog is telling me that I’m getting 10-20 clicks a day on my adsense ads, but Google tells me I’m getting 4-8? And then Google e-mails me back and tells me that tracking clicks on the ads is against the adsense TOS?
With 16000 employees now, is Google losing it? Or are they going to come out with something big to wow everyone?
Anonymous, as I stated in the update to the post, this problem only happens for exact sentences. I still rank pretty well for standard keyword searches, so boycotting Google is not an option (they send a pretty generous amount of traffic my way).
I had this problem as well with a popular post. Yahoo figured it out, MSN figured it out, but Google was very slow to the party: it took about 2 weeks to determine I was the originating source (even after 3000 inbound links).
Robert Scoble had this gripe about Google last year too after a splogger continually ranked higher than him for his own content.
I’m no SEO expert and I also have problems with scraped content. The following is just my theory but it could explain the SE behavior you described above. You probably noticed that a good deal of the results returned were subdomain sites sitting under blogspot. Actually some 70% of blogs on blogspot are splogs – automatically generated blogs with scraped content. There are thousands and thousands of them. (somebody did research on this, don’t remember the source)
I think there may be two main reasons why splogs outrank the original content:
1) blogspot passes a lot of weight on to its subdomains. And if the subdomains are older than the original content site, that gives them even more trust from SEs. It used to be easy to take over abandoned accounts (on different blog networks) that had good PR, and some were quite old, so many spammers took advantage of that (on a massive scale, remember they automate everything).
2) when an original content site gets pinged it attracts all kinds of bots, but blog bots and scraper bots get there first. The regular SE bots that index your pages are much slower at getting to your site. The scraper bots immediately scrape your original content, ping their own feeds and, through black hat techniques, trigger the SE bots to come and index their site BEFORE the site with the original content gets indexed. Combine this with an older domain that sits under blogspot (or a similar high-trust blog network) and SEs will think the original is duplicate content. At least for a while.
Now, the good thing is that scraped sites should only outrank you on long tail keywords. For example, if you type “web design mistakes” you rank on the first page. I think this is because you beat the splogs with a diverse and solid backlink count. So how come you don’t beat splogs in the long tail? I don’t know, but my guess is that SE algos have a good share of holes that are not easily closed. And if you experiment with SEO long enough and have a good eye for detail you will spot some of them eventually, and I think that is what many black hatters do.
Bernhard made a good point (Comment #1, in case anyone is interested), and I agree with both sides of it.
I run an intermediary blog (think Drudge on a much [um, much] smaller scale), and am always conscious of crediting the true source of an article. More to the point, I like to keep my excerpt of any article I’m pointing people to down to a few sentences, or a paragraph if need be.
Perhaps it’s a bit different with an intermediary blog. But the way I rationalize it is that if a sentence search happens to pull up my website, they’re getting only a small preview before being directed to the original site. Sort of a, “I’m so glad I came up in your search, now follow me to the real ‘meat and potatoes’ as I point you towards the source page.”
But how can you tell how many keyword searches do NOT return your site because a scraper’s site returns your content? It would be difficult to assess this. You can track referrer data. You can track which page on your blog is viewed first. You can track keyword search hits. Have you combined all three? Which of your pages rarely or never gets a referral from Google? What keywords does that page have that match keywords on pages that do get Google referrals? What if you tagged out pages that never get a Google hit so that Google stops indexing them? Would that improve your site’s rank?
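One way to start on the questions above is to list the pages that never receive a Google referral. The sketch below assumes you can export your visit log as pairs of landing page and referrer URL; the function name and the sample data are invented for illustration. It only matches google.com hosts, so country domains would need to be added.

```python
from urllib.parse import urlparse


def pages_without_google_referrals(hits):
    """Given (landing_page, referrer_url) pairs, return the pages that
    never had a visit referred by a google.com host, sorted by name."""
    all_pages = set()
    google_pages = set()
    for page, referrer in hits:
        all_pages.add(page)
        host = urlparse(referrer).hostname or ""
        if host == "google.com" or host.endswith(".google.com"):
            google_pages.add(page)
    return sorted(all_pages - google_pages)


if __name__ == "__main__":
    # Made-up sample log entries.
    sample_hits = [
        ("/strudel-recipe", "http://www.google.com/search?q=apple+strudel"),
        ("/strudel-recipe", ""),
        ("/about", "http://search.yahoo.com/search?p=about"),
        ("/old-post", ""),
    ]
    print(pages_without_google_referrals(sample_hits))
```

From there you could compare the keywords on those pages with the keywords on pages that do get Google hits.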
True, a deeper analysis would be needed to evaluate the issue.
I apologize if I seem like I am harping. I keep commenting because this issue affects my own site. I assume most scrapers scrape RSS feeds. It would be easier to differentiate RSS code from the blog’s template code…what is content from what is not. If David is correct, and the scrapers steal your content, post it, ping it, and have it up and indexed before your post is indexed on your site, then what if you implement an RSS time delay? Upload the content of your blog post a day or two before you release it on the feed. Google will index it before the typical scraper gets his hands on it. That is a stopgap measure, but it might be a long-term solution. I think I may try it on my site.
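The RSS time delay idea could be sketched roughly like this. It assumes your blog software lets you filter which posts go into the feed; the post structure, the function name and the one-day default are assumptions made for the example, not features of any particular blogging platform.

```python
from datetime import datetime, timedelta, timezone


def feed_items(posts, now, delay=timedelta(days=1)):
    """Return only the posts that have been live on the site for at least
    `delay`, so the feed lags behind the site itself."""
    return [post for post in posts if now - post["published"] >= delay]


if __name__ == "__main__":
    # Made-up posts with timezone-aware publication times.
    posts = [
        {"title": "Older post",
         "published": datetime(2007, 5, 1, 9, 0, tzinfo=timezone.utc)},
        {"title": "Fresh post",
         "published": datetime(2007, 5, 3, 9, 0, tzinfo=timezone.utc)},
    ]
    now = datetime(2007, 5, 3, 12, 0, tzinfo=timezone.utc)
    for post in feed_items(posts, now):
        print(post["title"])
```

Whether a day is long enough depends on how quickly the search engines crawl your site, so the delay is kept as a parameter.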
Definitely report each site to Google’s spam team through your Google Webmaster Console. They will take each spam report seriously.
Now if tons of sites are doing this to you and outranking you, you may have your work cut out for you. How often is the search bot crawling your site? If scraped content is getting indexed before you, a Google sitemap pinged through Google Webmaster Central might help.
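If you do not have a sitemap yet, a basic one is easy to generate. The sketch below follows the public sitemaps.org format (a `urlset` containing `url` entries with `loc` and `lastmod`); the URLs and dates are placeholders, and submitting the file is still done by hand through the webmaster tools.

```python
import xml.etree.ElementTree as ET

SITEMAP_NS = "http://www.sitemaps.org/schemas/sitemap/0.9"


def build_sitemap(entries):
    """Build sitemap XML from (url, last_modified_date) pairs.

    Dates are expected as YYYY-MM-DD strings.
    """
    urlset = ET.Element("urlset", xmlns=SITEMAP_NS)
    for loc, lastmod in entries:
        url = ET.SubElement(urlset, "url")
        ET.SubElement(url, "loc").text = loc
        ET.SubElement(url, "lastmod").text = lastmod
    return ET.tostring(urlset, encoding="unicode")


if __name__ == "__main__":
    # Placeholder entries for illustration.
    print(build_sitemap([
        ("http://example.com/", "2007-05-03"),
        ("http://example.com/my-new-post", "2007-05-03"),
    ]))
```

Regenerating the file whenever you publish gives the crawler a fresh `lastmod` to notice.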
Another thing to look into is your domain itself: is there any reason your site might be penalized, such as selling text links or taking part in random link exchanges?
Regarding blogspot, I disagree that the domain flows page rank to subdomains — subdomains are treated as different websites in search engines’ eyes.
It happened to my blog too. I did similar research and got the same results. The good thing is people are lazy and usually won’t search for a long list of keywords.
With a new blog like mine, I too face the same problem with splogs, and some of my posts have ended up on questionable sites. It is simply very frustrating and I am worried that I might be penalised for content duplication.
I need a solution too.
Reporting these sites for scraping is a pain. Google will ask you to submit a DMCA take down notice.
As most of these sites are “made for Adsense” it can often be more effective to simply get them kicked out of Adsense. Remove the money, remove the motivation. Look for the “Ads by Google” link on the Adsense ads shown on these sites. You’ll be able to click on that and find a “Send Google your thoughts on the site or the ads you just saw” link to report crappy scraper sites.
That often works faster than reporting them for scraping. 😉
Good tip, Beal. I clicked on “Ads by Google” on one site that scraped my content to report the violation. I’ll see if Google does anything about it.
Good one Andy, I will for sure try that.
Dave Starr — ROI Guy
Glad I read to the bottom. Indeed a good tip from Beal. At least reporting the situation to Google in that way is better than just stewing in anger … which I can well understand.
I can easily understand how Google is not ready to be the plagiarism police of the world, but surely they can do something about not ranking purloined content above the originator.
One reason I can see that prevents effectively policing the date of first publication issue is that there is no time standard … everyone runs their web/blog servers on whatever time standard they feel like. So when Google finds the content on server A and later on server B, how does Google know which one is the first publisher without very complex time/date calculations … only to be thrown off if the copyist runs his/her server at an earlier time.
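The time zone part of this problem is the easier half. As a rough illustration, if every server reported a timestamp together with its UTC offset, the comparison would come down to a few lines. The site names and times below are invented, and the sketch deliberately ignores the harder half raised above: it trusts whatever timestamp each server claims, so a copyist who backdates a post would still win.

```python
from datetime import datetime, timedelta, timezone


def first_published(stamps):
    """Given {site: timezone-aware datetime}, return the site whose
    timestamp is earliest once everything is converted to UTC."""
    return min(stamps, key=lambda site: stamps[site].astimezone(timezone.utc))


if __name__ == "__main__":
    stamps = {
        # 14:00 local time at UTC+8 is 06:00 UTC.
        "original.example": datetime(
            2007, 5, 1, 14, 0, tzinfo=timezone(timedelta(hours=8))),
        # 08:00 local time at UTC-5 is 13:00 UTC.
        "scraper.example": datetime(
            2007, 5, 1, 8, 0, tzinfo=timezone(timedelta(hours=-5))),
    }
    print(first_published(stamps))
```

Going by the wall clock alone, the copy at 08:00 looks older than the original at 14:00; in UTC the order is reversed.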
There should be some sort of digital signature that could “brand” material as it’s published … assigned by a trusted third-party … similar to the way we get third-party certificates for SSL transactions.
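To make the idea a little more concrete, here is a toy sketch of such a “brand”. No such service is described in this thread, so everything here is hypothetical: the function names, the key, and the scheme itself. It uses an HMAC from Python's standard library as a stand-in for a real signature; an actual third-party scheme would use public-key signatures so that anyone could verify a stamp without holding the authority's secret.

```python
import hashlib
import hmac


def issue_stamp(content: str, timestamp: str, authority_key: bytes) -> str:
    """The (hypothetical) authority binds a content hash to a timestamp."""
    digest = hashlib.sha256(content.encode("utf-8")).hexdigest()
    message = f"{digest}|{timestamp}".encode("utf-8")
    return hmac.new(authority_key, message, hashlib.sha256).hexdigest()


def verify_stamp(content: str, timestamp: str, stamp: str,
                 authority_key: bytes) -> bool:
    """Check that the stamp matches this exact content and timestamp."""
    expected = issue_stamp(content, timestamp, authority_key)
    return hmac.compare_digest(expected, stamp)


if __name__ == "__main__":
    key = b"authority-secret"  # placeholder key held by the third party
    article = "Tip jars have been a part of blogging for years."
    stamp = issue_stamp(article, "2007-05-01T06:00:00Z", key)
    print(verify_stamp(article, "2007-05-01T06:00:00Z", stamp, key))
    # A copy claiming an earlier time does not verify with the same stamp.
    print(verify_stamp(article, "2007-04-30T06:00:00Z", stamp, key))
```

The point of the design is that the timestamp comes from the authority, not from the publisher's own server clock.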
I have a couple of blogs running, and as soon as I hit the ‘Publish’ button, a couple of sites would be copying whatever I have to their blogs, either by feeds or whatever.
It is really annoying, but you can’t seem to stop it. Emailing them won’t help either and there’s not really a cybercop pointing guns at them not to plagiarize.
Sadly, this isn’t the worst that Google has done in this area. In some cases, the authorized content ranks dead last or not at all:
Google has a very serious problem here and they are in denial about it. This could, down the road, cost them dearly with their devoted searchers.
What you are witnessing is the Achilles Heel of Google and all search engines – they do NOT have the ability to determine content’s original owner.
This has always been the case, but it only became readily apparent when RSS feeds began to dominate the scene. Feeds give spammers an easy way to rip your content.
See that orange “Feed” button up in your browser bar? That’s a big flashing sign that says “Please Steal My Stuff.”
It’s a very sad situation, but it won’t be changing any time soon. Machines are only so smart.
It is happening to me. My site has been successful in its niche, and now the quality posts I make, which take hours, even days, to create and refine, are instantly stolen and posted on a scraper site under their name.
The whole thing is really messed up, and Google have lost control of it.
You would think it is VERY EASY to define the original content of word for word copyright infringement: Simply include in Google’s algorithm WHICH ONE WAS PUBLISHED FIRST! FFS.
Referring to Andy Beal’s post above, you can find Google’s article about scraping sites and DMCA reporting here, http://www.google.com/support/webmasters/bin/answer.py?hl=en&answer=66359 . Google also refers you to their Webmaster’s Discussion Forum. I haven’t tried this so I don’t know if they respond well there.