Content scrapers are gunning for your website

Posted on 7 Nov 2016 by The Manufacturer

You worked hard on your website: spending hours perfecting the layout and design, crafting language and writing content with care, and poring endlessly over edits to make sure everything came out just right.

Finally, your hard work has paid off. You have a beautiful, professional website with content you are proud to show prospective customers.

But lurking elsewhere, a less creative web developer is looking to steal your ideas and pass them off as their own. And we aren’t just talking about plagiarism – oh no. This is much more sophisticated and damaging to your brand. We are talking about “content scraping.”

Content scraping is an illegal attempt to steal original content from a legitimate website, without the owner’s permission, using automated software. This allows the cyber thief to pass off your work as their own on a completely different website.

Aside from being unethical, content scraping can have damaging effects on your brand, including lowered SEO effectiveness, decreased web traffic, loss of sales, diminished visitor engagement, slowed page load times, inaccurate visitor analytics, decreased advertising revenue and deflated brand awareness, not to mention pricey legal battles over duplicated content or copyright infringement.

Content scraping is a subset of web scraping, a software technique for extracting website information. Primary targets of web scrapers include airlines, real estate listings, e-commerce sites, bloggers, digital publishers, directories and classifieds, and financial websites. As you can tell, it’s a pretty big problem.

In fact, 46% of all web traffic is made up of bots. While motivations for using these bots vary (and heads up, almost none of them are good), 38% of companies that engage in web scraping do so to obtain content. Not surprisingly, you don’t have to be a skilled hacker to steal content, as there is plenty of scraping software available online.

With such a seemingly ubiquitous problem, how are organizations expected to defend themselves against content scraping?

Battling the bots

Fear not. There are several solutions that may help block web scraping bots to better protect your website and content. There are basic hygiene measures and advanced techniques website owners should consider. The former include:

  • IP address tracking and blocking: You can’t stop scrapers if you don’t know who they are. Track IP addresses and block the bad ones.
  • Create rate limits from individual IP addresses: Legitimate users won’t access your website thousands of times per second.
  • Require login access or CAPTCHA: You’ve seen this technique before. While bots are good at automated infiltration, they are bad at decoding images. CAPTCHA is something your visitors are likely used to seeing, and something most bots just can’t figure out.
  • Embed your written content inside an image or PDF: Since scrapers typically search URLs using an automated program, you can protect your content by placing your text inside an image or PDF.
  • Capture bad bots with honey pots: Outsmart scrapers by building links that are enticing to bots but unlikely for customers to click. When a visitor follows these links, especially hundreds of times, you can be sure it is nothing but trouble and block it.
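
To make the first, second and last of these measures concrete, here is a minimal Python sketch that combines a per-IP rate limit with a honey pot blocklist. Everything in it is an assumption made for illustration: the class name ScraperGuard, the honey pot path /internal/full-archive, and the default limit of 100 requests per 60 seconds are hypothetical and do not come from any particular product. A real deployment would normally sit in a web server, firewall or CDN rather than in application code, and because many legitimate users can share one IP address, blocking by IP alone is a blunt instrument.

```python
import time
from collections import defaultdict, deque


class ScraperGuard:
    """Toy per-IP guard: sliding-window rate limit plus a honey pot blocklist.

    All names and default values here are hypothetical, chosen for illustration.
    """

    def __init__(self, max_requests=100, window_seconds=60.0,
                 honeypot_paths=("/internal/full-archive",)):
        self.max_requests = max_requests          # requests allowed per window
        self.window_seconds = window_seconds      # length of the sliding window
        self.honeypot_paths = set(honeypot_paths)  # links no human should click
        self.blocked = set()                      # IPs that touched a honey pot
        self.history = defaultdict(deque)         # ip -> recent request times

    def allow(self, ip, path, now=None):
        """Return True if the request should be served, False if refused."""
        now = time.monotonic() if now is None else now

        # IP blocking: refuse anything from an address already flagged.
        if ip in self.blocked:
            return False

        # Honey pot: a request for a trap link flags the address for good.
        if path in self.honeypot_paths:
            self.blocked.add(ip)
            return False

        # Rate limit: forget requests older than the window, then count.
        recent = self.history[ip]
        while recent and now - recent[0] >= self.window_seconds:
            recent.popleft()
        if len(recent) >= self.max_requests:
            return False

        recent.append(now)
        return True
```

In use, the application would call guard.allow(ip, path) before serving each request and return an error page when it comes back False. Note that in this sketch an address that exceeds the rate limit is only refused until its window clears, while an address that touches a honey pot link stays blocked.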

The most effective answer lies in using products or services that can differentiate and block automated adversaries masquerading as human visitors. These services protect your web application on your behalf so you don’t have to worry about whether it’s a human being on the other side of your application, or a machine pretending to be a human.

Modern threat detection and cyber defense rely on an appropriate understanding of user behavior coupled with an intimate understanding of automated and malicious attacks. By combining these two techniques, we are finally able to block scrapers in real time, before they can inflict harm on your brand.

There are, of course, reactive methods for dealing with stolen content, such as the threat of a lawsuit. The Digital Millennium Copyright Act (DMCA) provides for takedown notices, and cease and desist letters are another option, but it is much more effective to stop these attacks before they ever happen, rather than scouring the World Wide Web for any and all incidents of web scraping.

Protect your content

In the end, it is much better to be preemptive in the fight against content scrapers. Educate yourself about web scraping attacks and ensure you don’t lose what is yours.