Google Indexing Scrapers First?
Yesterday I published a guest post from Abhijeet Mukherjee titled Do You Know Your Visitors? 5 Points to Consider. A couple of hours later Abhijeet messaged me on Gtalk to let me know that Google was not indexing my backlinks to his blog, but rather the link from a scraper site that had copied part of the post.
This made me curious, so I went to check for myself. The first thing I wanted to know was whether Google had already indexed my post. I copied one sentence from the post and searched for it in Google, with quotation marks to find only exact matches. The result was pretty surprising: Google had already indexed 2 scraper sites, but my original post was not in their index yet, as the image below illustrates:
I repeated the search query today, and my post is now showing in the first position. Regardless, I find it pretty weird that Google would index scraped material first and only afterwards the original source. The same thing was happening with the indexation of the backlinks.
Anyone know what could be the cause for this flaw?
49 Responses to “Google Indexing Scrapers First?”
That Google Daniel…! As usual or maybe not…
For example, I searched the name of a quite famous person and the first results I got were some stupid nicknames on Hi5, and only after the 5th or 6th listing did I get what I was looking for!
I believe google lists sites by the order of the links coming into that site and by the popularity of those sites that link to the site in question. Therefore, if the scraper site has more “popular sites” linking to it, then it will probably be listed first.
Omition, that should not be the case. Daily Blog Tips has around 300,000 backlinks, as counted by Yahoo.
Google probably labels the scraper site as a mid-ranking directory/news site. Frequently updated information streams require faster indexing than company or personal sites. Once initial trust has been established (Blogstrings shows PR4 for me), new info is just immediately included without analyzing the relationship between source and distributor.
This shouldn’t be anything surprising, many companies, which publish a press release on their own site and distribute it via relevant services, are often out-ranked by bigger aggregator sites.
This can’t be a sustainable trend though. After some time the web would be flooded with RSS-recirculating sites and nobody would create new content. This can’t happen, can it?
Jarkko, if that is the case I think their system is flawed.
I assume it would be OK for that to happen if Techmeme got indexed before a small site, but having a PR4 scraper site being indexed before a PR7 which is also updated daily is over the edge.
And how would a scraper site have ‘popular sites’ linking to them? It’s more likely the other way around 😉
The first reason I can think of is that scrapers post several articles every day, and on each visit, if Googlebot sees more content, it’ll gradually increase its crawling rate. Since DBT (1 post/day) probably publishes less than that scraper, Google may have indexed it later.
@Sumesh, crawling rate should also be related to overall domain trust and backlinks right?
I’d thought of domain authority (DBT is), but then the previous comment was the only answer I could find 😉
I’d be curious to know if anyone else has a better explanation. I’m subscribing to the comment feed.
Daniel, yes, I understand your point. Now it could be that the G-algo is slightly flawed and even a small aggregator site can at least temporarily beat its source.
Scraper sites are still a relatively new phenomenon. Like with blogs, it took Google quite a while to get them under control. In time this sort of attention arbitrage will certainly play out.
…Although, I still get many Google Alerts including links to pure spam blogs. Now that I come to think of it the share of spam might have been growing slightly…
Life is Colourful
That might be because of some weird mistake in the algorithm. Most of the time, Google takes a little while to find out which is the reliable resource, and the scraper site must have been on top in terms of update frequency and so on. So it is easy to guess that Google sometimes gives importance to sites that are updated with a high posting frequency [which is possible with scraping] even though they are scrapers.
Daniel, you are so right. Same thing’s happened to me too. Sites that are copying my content are being indexed before mine. This is really weird. Google should do something about it.
I’m pretty sure having a sitemap and utilizing update notifications would increase the speed at which Google indexes your site. With the notification, Google would know about the post prior to the scrapers hitting it, putting you at the top quicker…
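For anyone who wants to try the sitemap half of this suggestion, here is a minimal sketch of generating a sitemap file with a `lastmod` date per post. The namespace is the one defined by the sitemaps.org protocol; the `build_sitemap` helper and the example.com URLs are made up for illustration, and the notification step is left out because it depends on whichever service you ping.

```python
import xml.etree.ElementTree as ET

# Namespace defined by the sitemaps.org protocol.
SITEMAP_NS = "http://www.sitemaps.org/schemas/sitemap/0.9"


def build_sitemap(entries):
    """Return sitemap XML for an iterable of (url, lastmod) pairs.

    lastmod is expected as a YYYY-MM-DD string.
    """
    urlset = ET.Element("urlset", xmlns=SITEMAP_NS)
    for url, lastmod in entries:
        node = ET.SubElement(urlset, "url")
        ET.SubElement(node, "loc").text = url
        ET.SubElement(node, "lastmod").text = lastmod
    return ET.tostring(urlset, encoding="unicode")


# Hypothetical posts; replace with your own permalinks and dates.
sitemap_xml = build_sitemap([
    ("https://example.com/new-post/", "2007-10-10"),
    ("https://example.com/older-post/", "2007-10-09"),
])
print(sitemap_xml)
```

Regenerating the file whenever a post is published keeps `lastmod` current, which is the signal this approach relies on. Whether it actually beats a scraper to the index is not something the sketch can promise.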
Abhijeet from Jeet Blog
@Daniel Thanks for this post. It’s really scary because Google is the most powerful search engine and this is certainly not expected of it.
@Sumesh This has happened with me twice. My recent article at MakeUseOf.com which hit the Digg front page is in the omitted results in Google !!! There is no pingback or linkback from MakeUseof.com to me or my blog!!! Instead it is from a scraper !
Can you believe this? I was shocked when Google considered that article at Makeuseof which has hit the Digg front page and has huge number of backlinks as secondary.
As I write this, if I do a search for link:http://www.jeetblog.com in Google Blog Search for the past 2-3 days then the Makeuseof article is in the omitted results !!!
Google are you listening ?
Doesn’t google have hundreds (thousands?) of GoogleBots running around the web? It’s very likely that even though your site had been crawled by a GoogleBot, that Googlebot’s data hadn’t been absorbed by the larger Google search engine. This is a disadvantage of a distributed system like Google.
I’d be curious: check your web server log to see what time Google accessed the article you’re writing about.
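A rough sketch of that log check, assuming the server writes the common Apache/Nginx "combined" log format. The `googlebot_hits` helper, the sample log lines, and the post path are all invented for illustration. The user-agent string can be spoofed, so treat a match as a hint rather than proof.

```python
import re
from datetime import datetime

# Pattern for the "combined" access log format (an assumption; adjust it
# if your server logs differently).
LOG_LINE = re.compile(
    r'^(?P<ip>\S+) \S+ \S+ \[(?P<time>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<path>\S+) [^"]*" (?P<status>\d{3}) \S+ '
    r'"[^"]*" "(?P<agent>[^"]*)"'
)


def googlebot_hits(lines, path):
    """Return sorted timestamps of requests for `path` whose user agent mentions Googlebot."""
    hits = []
    for line in lines:
        match = LOG_LINE.match(line)
        if not match:
            continue  # skip lines that are not in the expected format
        if match.group("path") == path and "Googlebot" in match.group("agent"):
            hits.append(
                datetime.strptime(match.group("time"), "%d/%b/%Y:%H:%M:%S %z")
            )
    return sorted(hits)


# Made-up sample lines standing in for a real access log.
SAMPLE_LOG = [
    '66.249.66.1 - - [10/Oct/2007:13:55:36 +0000] "GET /do-you-know-your-visitors/ HTTP/1.1" '
    '200 5120 "-" "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"',
    '203.0.113.7 - - [10/Oct/2007:14:02:11 +0000] "GET /do-you-know-your-visitors/ HTTP/1.1" '
    '200 5120 "-" "Mozilla/5.0 (Windows; U; Windows NT 5.1) Firefox/2.0"',
]

hits = googlebot_hits(SAMPLE_LOG, "/do-you-know-your-visitors/")
for ts in hits:
    print(ts.isoformat())
```

Comparing the first timestamp against the time the post went live would show whether the crawl happened before the scraper copies appeared in the index.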
Definitely sounds very backwards to me. Sorry I cannot shed any light on it for you, but I’m glad it didn’t take longer to get fixed and get you in the #1 spot.
Tea Party Girl
Well that explains why my latest back-link is from a “AP” story about Myanmar…
Firstly, I’ve heard getting tons of backlinks from Digg is not necessarily a great thing anymore…I remember reading somewhere that Google puts less weight on that kind of super build up of links in such a short period.
Secondly, everyone should be using the RSS footer plugin, which was actually recommended by someone at Google I think. It puts a link in your RSS feed after each post stating the original source of that post. Scrapers usually grab the entire feed content, so the link will more than likely be in their post also.
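The idea behind such a footer can be sketched in a few lines. This is not the plugin's actual code, just an illustration of appending an attribution link to each feed item before it is published. The `add_source_footer` function, its wording, and the example values are assumptions.

```python
from html import escape


def add_source_footer(content_html, post_title, post_url, blog_name):
    """Append a link back to the original post to a feed item's HTML body."""
    footer = '<p>Originally published as <a href="{url}">{title}</a> on {blog}.</p>'.format(
        url=escape(post_url, quote=True),  # escape so the URL is safe inside the attribute
        title=escape(post_title),
        blog=escape(blog_name),
    )
    return content_html + "\n" + footer


# Hypothetical post data for illustration.
item_body = add_source_footer(
    "<p>Post body goes here.</p>",
    "Google Indexing Scrapers First?",
    "https://example.com/google-indexing-scrapers-first/",
    "Example Blog",
)
print(item_body)
```

Since scrapers usually republish the whole feed item, the link travels with the copied text and points back to the original.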
Maybe a bit off topic, but what do you want to do about those scraper sites? I’m asking because there’s one scraping mine in exactly the same way.
@Life is Colourful, could be the case.
@Michael Clark, yes I would also assume they have hundreds of bots going around. I will check the server logs to see if I can identify when they came yesterday.
@Aseem, true your first point to some extent, but what does it have to do with this post 🙂 ?
I have the RSS footer plugin as well; right now I am using it to display sponsor and partner messages.
@banji, if they are scraping only 20% or less it would be hard to take them down, because this could be considered fair use.
If they are scraping 100% of the content though just send a DMCA.
If this is the case then their system definitely is flawed. I can’t see why indexing a scraper site first would be beneficial at all to the person who originally wrote the post. I say Google needs to fix this; being the top search engine, this shouldn’t be happening.
Wow, scrapers work fast.
Daniel, read comment #13 from Abhijeet…that’s why I had mentioned Digg. He was talking about a scraper blog doing well even though the MakeUseOf post was the one that was Dugg…
Daniel, this has been a major problem with Google for at least a year, maybe longer. Scrapers routinely out-rank original content when doing a text snippet search, and it is not just an “initial” ranking problem: it can and will happen when your content has been live on the net for literally years.
I’ve had articles that were published years ago be out-ranked by scrapers. Luckily, you have a PR7 site, so this effect will be much less on you; however, PR5 and below sites that produce original content are constantly pummeled by this. AND their listing in the SERPs is often hidden below the “for more results…” link at the very bottom. It is very disheartening.
It appears that Google has simply given up fighting scrapers and decided to just index everything and let their algo decide who goes on top with no regard to who originally produced the text.
Unfortunately, Google has lost the battle with scrapers.
I’ve been noticing this for a while. Another issue along the same lines: republishing posts on places like Zimbio will bump your post out of its position in the SERPs for a few days and replace it with the Zimbio post. And this is after Google has already indexed your post.
What Michael Clark says is correct. The search bot crawling the scraper site is (most likely) not the same bot that crawls dailyblogtips. Upon the first crawl, the scraper site had original content not seen elsewhere, which ranked it high. And then, as soon as the dailyblogtips bot came by, it saw the same content and sent you to the top as being the authority and more trusted site.
It happens all the time with press releases. Scrapers get high rankings for the first few minutes to hours and by the end of the day we are up top and they can’t even be found anymore.
It is something the scraper sites take advantage of, otherwise they wouldn’t be around, but does anyone have a better method for Google to use?
Looks like you might have some influence on Google because I just checked and DBT is listed in the number 1 spot where it should be. Maybe it just takes some time for things to work their way through.
Still I’m sure it ticks you off that others are copying your work for their gain but in the end I think it all works out.
Logic would suggest if there are more incoming links to the scraper sites, then google’s bots will find these sites before yours. Then when they finally do get to yours they acknowledge yours as the original, and list it as so.
Not John Chow
I am sure it is not worth your time to chase after the scrapers. Google should be dealing with them by lowering their pagerank.
It happens every day, and it has a logical explanation. The scrapers “produce” much more content than a normal blog. To the Google spider, the scraper site is updated more frequently, which is why the spiders send its information to the index first.
But when the 2 (or more) copies are evaluated by the algorithm, Google determines which is the original one. As a result, the copy has only a short-lived success in the SERPs.
A lot of scrapers are finding my lawn mower site lately and dumping the actual site http://www.bigmowers.com right off the front page.
SEO and WordPress Design
The reason is simple. Scraper sites scrape several sites, which means they update several times a day, compared to a regular blog which updates usually once a day.
The more frequently a site is updated the more frequently it is crawled by the search engines.
And also the fact that your site is ranked above scraper sites doesn’t mean that Google considers you the original source. For example Technorati blog pages rank usually higher than the original posts.
Which sounds suspiciously like what someone said above
@Loius Gross, exactly, I also think they would need to balance the algo here.
@Aseem, got it.
@Joe, yeah apart from the indexing time I also noticed that even after 1 year scraper sites were still outranking me for very narrow search queries. I wrote about it here:
@Stephan, I suppose the same could happen with Squidoo and company, but here you have the authority factor.
@Erik, I think they should delay the indexing of pages for some minutes, so as to get a better picture of what is going on.
@Ben, yeah thankfully after a while it gets fixed, but it is annoying nonetheless, and I suspect the effect could last longer for blogs with less juice.
@Karl, it would be difficult to find a scraper site with many backlinks.
@Ivan, that is not always the case, the system is still flawed for exact search matches, check my link above.
@SEO and WordPress Design (what a name…), Yeah the timing issue is clear now, still I think it is flawed.
Karl, there are already some sophisticated blog comment scrapers, advanced to the degree that they slightly morph original wording. 😉
Sorry if this is covered already, I haven’t read all the comments. Next time it happens, let google know via the dissatisfied link at the bottom of the search result. But seeing that you got to number one eventually, I don’t know if there is anything they can do.
I have a blog I post to a few times a month, last time I checked, it took a few minutes before my post got in the index.
Same problem but with social bookmarking: in the SERPs I see my blog’s post listed under Technorati or Digg…
i did not know about it..thanks for the info guys..It can help me a lot..
We are having the exact same problem with one of our sites. In fact, we think that we may have been penalized because of it. In some cases, our content is never listed, but the scraper’s copy still stays in Google. It is very frustrating to have scrapers get their pages listed before your original content is listed.
Yes, I completely agree with SEO and WordPress Design, but generally new content is indexed above already-indexed sites with similar content for only a few days. Later on, as soon as the traffic and backlinks to that scraper’s content diminish, it gradually drops back to its original position.
Take Ezine Articles as an example. As soon as you publish a new article it is indexed at the top for a few days, and after that it is nowhere.
I think it might have more to do with how often you update your site. The Google Search Appliance that many companies use to index content on their intranets automatically determines how often to crawl a URL based on how often that URL has changed in the past. I don’t know if this is what is happening in your case, but it’s likely that the content on the scraper sites changes more often than the content on your site, so the Googlebots don’t check your site as often for changes.
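To make the idea of change-based crawl scheduling concrete, here is a toy model. It is not how the Google Search Appliance or Googlebot actually works; the halving and doubling rule, the bounds, and the function names are all assumptions chosen only to illustrate the effect.

```python
def next_interval(current_hours, changed, min_hours=1.0, max_hours=168.0):
    """Shrink the revisit interval when the page changed, grow it when it did not."""
    if changed:
        proposed = current_hours / 2
    else:
        proposed = current_hours * 2
    # Keep the interval within the configured bounds.
    return max(min_hours, min(max_hours, proposed))


def simulate(change_history, start_hours=24.0):
    """Apply next_interval over a list of 'did the page change?' observations."""
    interval = start_hours
    for changed in change_history:
        interval = next_interval(interval, changed)
    return interval


# A site that changed on every visit vs. one that mostly did not.
busy_site = simulate([True, True, True, True, True])
quiet_site = simulate([False, False, True, False])
print(busy_site, quiet_site)
```

Under these made-up rules, the site that changes on every visit ends up being revisited hourly while the quieter one drifts out to every few days, which is the gap this comment is describing.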
For some reason, Google stopped indexing my posts and I can’t figure out what’s wrong. The only thing that Google indexes are category and page views, but none of the individual posts / permalinks show up in any of the search results. Even if I use very specific word-for-word with quotation search, nothing shows up. This only happened in the last 1-2 weeks and it’s driving me crazy.
Do you have any idea why that might be? Other search engines like Yahoo, Live, Ask, etc… work fine.
Nice tips from the contributors to this site. I have learned something new about how Google indexes pages, especially about submitting sitemaps, as this will speed up the indexing of your pages.
I would say it’s because the scraper sites are being indexed more often by Google because of their constant traffic. That is also why your site is now placed where it was supposed to be and the scraper site has been removed. I think Google’s technology is just not good enough to tell which sites really get heavy traffic and which sites are just scraping material and getting the high ranking in the beginning.
Comments are closed.