How to prevent web scraping in WordPress

10/09/2021

Website content theft, also known as web scraping, is growing across the internet. Many website owners are frustrated to see the effort and brainpower they invested stolen by others, who sometimes even outrank the original website with that content. Today, Lucid Gen will share a few tips to prevent web scraping, making thieves so bored that they want to give up.

Table of contents
  1. Need to read before deciding
    1. What is web scraping?
    2. Why do they steal your website content?
    3. How do they steal website content?
    4. Potential benefits of stolen content
    5. Should I use a manual anti-copy plugin?
  2. How to prevent web scraping
    1. Block IP of web scraping bot
    2. Delayed RSS Feed update
    3. Shorten content in RSS Feed
    4. Random class in HTML content page
    5. Add more internal links in content
    6. Add watermark (logo) to images
    7. Use DMCA and report content scraping sites
    8. Ask Google to index the article as soon as you finish writing it
    9. Let Google announce the pages with your content
    10. Use proper names instead of “I”
  3. Epilogue

Need to read before deciding

When your website is first scraped, you will probably be anxious to find a way to stop it completely. But you should know that a “radical” cure is not possible. Please read the content below to understand the nature of the problem and choose an appropriate course of action.

What is web scraping?

Web scraping is the process of automatically collecting and extracting data from a website on behalf of the person running the scraping tool. The intent is usually legitimate, such as gathering information, but many bad actors take advantage of the technique to steal other people’s website content.

This work is done by a web scraping bot that runs continuously, every day, hour and minute, to discover the latest content quickly and bring it back to its operator. These bots are hard for the average user to spot, but rest assured, Lucid Gen will show you how to spot them.

Why do they steal your website content?

There are many reasons, but the most common is “don’t want to work but want to eat”. They want traffic but are too lazy to brainstorm and write articles, and have no money to hire good content staff, so they take the shortcut of stealing other people’s work.

The second type are students of “teachers” who sell courses on making money with an “Auto blog”. To put it bluntly, you create a website, install some plugins that steal the content of other websites, and wait for money from Adsense. But, as you know, teachers like these keep for themselves anything that still feeds their families; what they teach are get-rich methods that are almost out of date. If you come across such courses, walk away. Money is not that easy to earn.

There are a few other types. But rest assured, the ending is usually not good. Google rates a website that specializes in scavenging other websites’ content very low, and sometimes does not even approve it for Adsense. Copying for a long time becomes a habit: such people lose the ability to create anything themselves and will always trail behind you.

How do they steal website content?

According to Lucid Gen, there are currently 3 common types of website content theft. I will list them in order from easiest to most difficult to prevent.

Scraping content via RSS Feed

Web scraping bots visit your RSS Feed URLs to discover the latest posts. They then open each new post, extract its content from your website and bring it back to their server. RSS Feed URLs usually look like this:

  • https://domain.com/feed (the WordPress default)
  • https://domain.com/feeds
  • https://domain.com/rss

This type of website content theft is the most straightforward to prevent: you only need to delay the RSS Feed update and block the IPs of the web scraping bots. Lucid Gen explains both in detail further down.

Scraping content via HTML

Web scraping bots visit your blog and category pages and analyze the HTML to discover new posts. They then open each new article, extract the content and bring it back to their servers.

With this type of website content theft, the bots never touch the RSS Feed URLs, so they are harder to detect. But there is still a way, and Lucid Gen explains the prevention measures further down.

Manual content stealing

In this form, website content thieves are more “industrious” and “Copy Paste” by hand. No tool can prevent this. The only way is to “work harder” than the thief: add more internal links, add a watermark to images, use your own name more often…

Editing the content, changing the name and removing links is easy for them (though sometimes they forget), but recreating images to match the content without the watermark is quite tiring.

Potential benefits of stolen content

Lucid Gen wants to give you a different perspective on the matter. In many cases, having content stolen also benefits the original website.

Google appreciates your website

A website that other websites copy must have good, valuable content for users. Google can usually tell which page is the original article and which is the copy. So, if they steal your website’s content but do not outrank you, don’t worry too much.

Keep watching for a while and let them keep stealing until you see signs that they are overtaking your rankings, then report them. Imagine that 30-50% of their website content is copied from you: reporting that 30-50% in one sweep is always better than reporting 1-2 posts, right?

Get backlinks from competitors

Some sites copy your content but keep your internal links intact, either because they forgot to delete them or because they left them on purpose as a courtesy. Either way, you gain backlinks without doing anything.

If the copy does not link back to your website, try asking them to add a link to your original article. If they agree, you have a backlink. If they refuse, you can report them.

Report: file a copyright infringement report with DMCA and Google. Pages that are successfully reported suffer heavy consequences for their ranking on Google Search.

Should I use a manual anti-copy plugin?

In Lucid Gen’s opinion, you shouldn’t, for several reasons:

  • 99% of your traffic is normal users and only 1% is manual thieves. Don’t let this 1% make life difficult for the other 99%. Sometimes people need to copy content or images from your website for good purposes such as learning or teaching friends.
  • Lucid Gen himself feels uncomfortable on pages that block copying and right-clicking. The owners of these websites come across as a bit selfish.
  • Anyone can install a browser utility like Enable Copy and bypass the copy ban. Ordinary people may not know this, but people who specialize in scraping content certainly do.
  • It makes your website heavier.

So don’t install a content-blocking plugin. It doesn’t work against thieves and only annoys users.

How to prevent web scraping

Having skimmed the section above, you should now understand the problem of website content theft. Choose the solutions that match how seriously the problem affects your website. Or, best of all, use every solution at the same time.

Lucid Gen cannot promise that these measures will solve the problem “thoroughly”. But if you apply all of them, content theft on your website will surely decrease. If new effective measures appear, I will update this article.

Block IP of web scraping bot

To do this, first see the guide linked below on how to install Wordfence Premium. We will have Wordfence record the IP, Hostname and User-agent of every visitor to your website. From that log, you filter out the web scraping bots and block them.

How to set up Wordfence Security and activate Premium for free

Step 1: Turn on Live Traffic logging. Go to Wordfence > Tools and configure it as follows.

  • Amount of Live Traffic data to store (the number of access log entries): 500-5000 depending on your website traffic; a number equal to about 1/4 of your traffic is a reasonable choice.
  • Maximum days to keep Live Traffic data: 7-14 days.
  • Traffic logging mode: ALL TRAFFIC.
Use Wordfence to block the IPs of web scraping bots that are scraping your website’s content

Step 2: Filter out the scraping bots so you can block them. Click Show Advanced Filters > select URL > contains > feed to see which web scraping bots have accessed your RSS Feed URL.

Web scraping bots have the following identifying characteristics:

  • The User-agent is usually a bot; if it contains the word “bot”, it certainly is. Some content scraping tools can present a human-looking User-agent (like an ordinary visitor); that case is more complicated and is covered further down.
  • They visit your website at very regular intervals, for example every 5, 10, 15, 20 or 25 minutes.
  • The Hostname or User-agent contains words such as feed, content, newspaper…

Take care not to confuse them with friendly bots:

  • Googlebot has a Hostname of the form crawl-X.googlebot.com, where X matches the bot’s IP. Any hostname that contains the word “google” but is not under googlebot.com may be fake.
  • Bots from sites where you have created bookmarks or backlinks usually carry that site’s name or domain in the bot name. Remember to compare it against the sites where you created those bookmarks or backlinks.
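
The Googlebot hostname rule can be checked mechanically. Below is a minimal Python sketch; it is not part of Wordfence, the function name hostname_matches_ip is made up for illustration, and the sample hostname and IPs are only examples. It checks that a hostname has the crawl-X.googlebot.com form and that X spells out the visitor's IP. Treat it as a first filter only: a hostname seen in a log can be confirmed further with a reverse and forward DNS lookup.

```python
import ipaddress
import re

# Hostname form described above: crawl-A-B-C-D.googlebot.com for IP A.B.C.D
GOOGLEBOT_HOST = re.compile(r"^crawl-(\d+)-(\d+)-(\d+)-(\d+)\.googlebot\.com$")


def hostname_matches_ip(hostname: str, ip: str) -> bool:
    """Return True if the hostname is crawl-X.googlebot.com and X equals the IP."""
    match = GOOGLEBOT_HOST.match(hostname.strip().lower().rstrip("."))
    if not match:
        return False
    try:
        embedded = ipaddress.ip_address(".".join(match.groups()))
        return embedded == ipaddress.ip_address(ip)
    except ValueError:
        # Either the hostname digits or the given IP are not a valid address.
        return False


if __name__ == "__main__":
    print(hostname_matches_ip("crawl-66-249-66-1.googlebot.com", "66.249.66.1"))
    print(hostname_matches_ip("google-crawler.example.net", "66.249.66.1"))
```

A visitor that fails this check while claiming to be Google in its User-agent is a reasonable candidate for a closer look in Live Traffic.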

Now click the BLOCK IP button to block these web scraping bots. Note down their identifiers, such as IP range, Hostname and User-agent, for the more advanced steps that follow.

Block IP and find common characteristics of web scraping bots that are scraping your website content

Step 3: Add blocking rules for the web scraping bots, using the identifiers collected in step 2. Go to Wordfence > Blocking > Custom Pattern and configure it as follows.

Note: Enter only one of IP Address Range, Hostname or User-agent in each blocking rule. If you fill in all 3, a visitor is blocked only when all 3 characteristics match.

  • Block Reason: a generic name that is easy to remember, for example “Web scraping bot”.
  • IP Address Range: content theft tools often change IPs, so block the whole range by changing the last number to 0/24. For example, if you blocked 192.168.200.200, the bot can switch to 192.168.200.201 and keep scraping your content, so block 192.168.200.0/24 instead.
  • Hostname and User-agent: enter *keyword*. For example, if a bot’s Hostname or User-agent often contains the word “newspaper”, enter *newspaper*. The 2 asterisks mean the rule matches no matter what comes before or after the keyword.
Add a command to automatically block web scraping bots from scraping your website’s content
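
To make the rule semantics from step 3 concrete, here is a rough Python sketch of how such a pattern could be evaluated. It models the behavior described in the note (every filled-in field must match) and is an assumption, not Wordfence's actual code; BlockRule, to_slash_24 and is_blocked are hypothetical names.

```python
import ipaddress
from dataclasses import dataclass
from fnmatch import fnmatchcase
from typing import Optional


@dataclass
class BlockRule:
    reason: str
    ip_range: Optional[str] = None    # CIDR range, e.g. "192.168.200.0/24"
    hostname: Optional[str] = None    # wildcard pattern, e.g. "*newspaper*"
    user_agent: Optional[str] = None  # wildcard pattern


def to_slash_24(ip: str) -> str:
    """Turn one IPv4 address into the /24 range that contains it."""
    return str(ipaddress.ip_network(f"{ip}/24", strict=False))


def is_blocked(rule: BlockRule, ip: str, hostname: str, user_agent: str) -> bool:
    """A visitor is blocked only if every field set on the rule matches."""
    checks = []
    if rule.ip_range:
        checks.append(ipaddress.ip_address(ip) in ipaddress.ip_network(rule.ip_range))
    if rule.hostname:
        checks.append(fnmatchcase(hostname.lower(), rule.hostname.lower()))
    if rule.user_agent:
        checks.append(fnmatchcase(user_agent.lower(), rule.user_agent.lower()))
    return bool(checks) and all(checks)


if __name__ == "__main__":
    rule = BlockRule("Web scraping bot", ip_range=to_slash_24("192.168.200.200"))
    print(rule.ip_range)
    print(is_blocked(rule, "192.168.200.201", "", ""))
```

The sketch also shows why one field per rule is the safer habit: a rule that sets both an IP range and a User-agent pattern stops matching as soon as the bot changes either one.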

So what can you do about web scraping bots whose Hostname and User-agent look like a normal visitor’s?

  1. You know which websites steal your content, right? Find the IPs of those websites and block the whole IP range. WordPress plugins that steal website content help you here, since they typically fetch from the same server that hosts the thief’s website. From time to time, check whether these websites have moved to a new server so you can add the new IP range to your blocking rules. Thieves will not buy many servers just to steal your content; that money would be better spent on hiring good writers.
  2. Go by access frequency. As mentioned in the identification section, web scraping bots return at fixed intervals, such as every 5, 10, 15, 20 or 25 minutes. If you find an IP with such a pattern, block its IP range. When you suspect an IP in the Live Traffic section, click SEE RECENT TRAFFIC to check whether all of that IP’s traffic looks like a bot. I have not found a better way so far, but I believe it is possible to count the IPs that visit many times during a day; the techniques used against Google Ads click fraud are a good place to look for a solution.
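
The frequency idea in point 2 can be sketched in a few lines of Python, assuming you have exported the access log as (IP, timestamp) pairs. The function name find_periodic_ips and the thresholds (at least 5 hits, at most 30 seconds of jitter) are illustrative assumptions to tune on your own traffic, not values taken from Wordfence.

```python
from collections import defaultdict
from datetime import datetime, timedelta
from statistics import mean, pstdev


def find_periodic_ips(visits, min_hits=5, max_jitter_seconds=30.0):
    """Return {ip: average gap in seconds} for IPs that visit at near-constant intervals.

    visits is an iterable of (ip, datetime) pairs, in any order.
    """
    by_ip = defaultdict(list)
    for ip, stamp in visits:
        by_ip[ip].append(stamp)

    suspects = {}
    for ip, stamps in by_ip.items():
        if len(stamps) < max(min_hits, 2):
            continue  # too few visits to judge a rhythm
        stamps.sort()
        gaps = [(b - a).total_seconds() for a, b in zip(stamps, stamps[1:])]
        # A bot on a timer has almost identical gaps; a person does not.
        if pstdev(gaps) <= max_jitter_seconds:
            suspects[ip] = mean(gaps)
    return suspects


if __name__ == "__main__":
    start = datetime(2021, 9, 10, 8, 0, 0)
    # Made-up sample data: one visitor every 10 minutes, one irregular visitor.
    bot = [("203.0.113.7", start + timedelta(minutes=10 * i, seconds=i % 3)) for i in range(8)]
    human = [("198.51.100.20", start + timedelta(minutes=m)) for m in (0, 1, 2, 45, 47, 300)]
    print(find_periodic_ips(bot + human))
```

Any IP this flags still deserves a manual look with SEE RECENT TRAFFIC before you block its range, since uptime monitors and friendly bots also visit on a timer.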

Delayed RSS Feed update

Lucid Gen finds this method simple but effective against scraping through the RSS Feed. The goal is to make the thief’s copy get indexed after yours, so that by the time it is indexed, Google already knows it is a copy of your article.

Insert this code into your theme’s functions.php file. Edit the number and the unit to set the RSS Feed delay you want. The example below delays the feed by 12 hours. If your website is indexed slowly, you can increase the delay to a few days.

// Delay RSS Feed by LucidGen.com
function publish_later_on_feed( $where ) {
    global $wpdb;

    // Only alter the queries that build an RSS feed.
    if ( is_feed() ) {
        $now = gmdate( 'Y-m-d H:i:s' ); // current time in GMT

        $wait = 12;     // integer: how long to hold posts back
        $unit = 'HOUR'; // MINUTE, HOUR, DAY, WEEK, MONTH, YEAR

        // Keep only posts whose age in $unit is greater than $wait.
        $where .= " AND TIMESTAMPDIFF($unit, $wpdb->posts.post_date_gmt, '$now') > $wait ";
    }
    return $where;
}
add_filter( 'posts_where', 'publish_later_on_feed' );
Delaying RSS Feed updates makes the website content stealers index slower

Lucid Gen knows you’ll be wondering, “wouldn’t it be better to delay by a few months or turn off the RSS Feed altogether?” (I used to think so too).

But you should not do that. As the saying from the movies goes, “pull one vine and the whole forest shakes”. Let the thief keep using the simple method, which you can counter. If they cannot reach the RSS Feed, or notice that your website has many new articles that the feed does not show, they may become suspicious and switch to a more “VIP PRO” method, and then you will have a harder time.

Remember the goal of this approach: sites that copy your content through the RSS Feed should get indexed after you.
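
One way to confirm the delay is working is to look at the newest pubDate in the feed. The Python sketch below assumes a standard RSS 2.0 feed like the one WordPress produces; the sample feed, the function names and the 12-hour threshold are illustrative only. In practice you would download https://domain.com/feed and pass its text in.

```python
import xml.etree.ElementTree as ET
from datetime import datetime, timedelta, timezone
from email.utils import parsedate_to_datetime

# Made-up feed for illustration; a real feed has many more elements.
SAMPLE_FEED = """<rss version="2.0"><channel>
  <title>Example blog</title>
  <item><title>Older post</title><pubDate>Thu, 09 Sep 2021 02:00:00 +0000</pubDate></item>
  <item><title>Newer post</title><pubDate>Thu, 09 Sep 2021 20:00:00 +0000</pubDate></item>
</channel></rss>"""


def newest_item_age(feed_xml: str, now: datetime) -> timedelta:
    """Return how old the most recent feed item is at the moment `now`."""
    root = ET.fromstring(feed_xml)
    dates = [
        parsedate_to_datetime(el.text.strip())
        for el in root.iter("pubDate")
        if el.text and el.text.strip()
    ]
    if not dates:
        raise ValueError("feed has no pubDate elements")
    return now - max(dates)


def delay_is_respected(feed_xml: str, now: datetime, wait=timedelta(hours=12)) -> bool:
    """True if even the newest item in the feed is older than the configured delay."""
    return newest_item_age(feed_xml, now) > wait


if __name__ == "__main__":
    now = datetime(2021, 9, 10, 9, 0, tzinfo=timezone.utc)
    print(newest_item_age(SAMPLE_FEED, now), delay_is_respected(SAMPLE_FEED, now))
```

If the check fails right after you publish, a caching plugin or CDN may be serving an old copy of the feed, or the snippet may not be active in the current theme.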

Shorten content in RSS Feed

This method is old; web scraping bots now open the article itself to steal content rather than relying on the RSS Feed alone. But you should still set it up for completeness. Go to Settings > Reading and select Excerpt mode for the RSS Feed.

Shortening the content in the RSS Feed is the old way to prevent web scraping

Random class in HTML content page

I learned this method from comments by senior members of a group, but I have not managed to implement it myself. Randomizing the classes in the HTML is not difficult, but the CSS has to be randomized to match, which is the tricky part. I will research it and cover it in a future post. This method is a strong defense against scraping content from HTML; Facebook and Google, for example, also use random classes.
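
As a starting point for that research, here is a small Python sketch of the core idea: generate one random name per class and apply the same mapping to both the HTML and the CSS so they stay in sync. It is not a finished WordPress solution. It assumes you can post-process the final HTML and CSS as strings; build_class_map, rewrite_html and rewrite_css are hypothetical names, and the regular expressions are deliberately simple (double-quoted class attributes and plain .class selectors only).

```python
import re
import secrets


def build_class_map(class_names):
    """Map each real class name to a fresh random one, e.g. 'c3f9a1b2e'."""
    return {name: "c" + secrets.token_hex(4) for name in class_names}


def rewrite_html(html, class_map):
    """Swap class names inside class="..." attributes; unmapped names are kept."""
    def swap(match):
        names = match.group(1).split()
        return 'class="' + " ".join(class_map.get(n, n) for n in names) + '"'
    return re.sub(r'(?<![\w-])class="([^"]*)"', swap, html)


def rewrite_css(css, class_map):
    """Swap .name selectors using the same mapping as the HTML."""
    def swap(match):
        name = match.group(1)
        return "." + class_map.get(name, name)
    return re.sub(r"\.([A-Za-z_][\w-]*)", swap, css)


if __name__ == "__main__":
    html = '<div class="entry-content wide"><p class="lead">Hello</p></div>'
    css = ".entry-content p.lead { margin: 1.5em; }"
    mapping = build_class_map(["entry-content", "lead"])
    print(rewrite_html(html, mapping))
    print(rewrite_css(css, mapping))
```

Because the mapping changes on every run, a scraper that locates the article body by a fixed selector such as .entry-content stops finding it. Any JavaScript that refers to the old class names would need the same mapping, which this sketch does not handle.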

Add more internal links in content

This is easy to do: when writing articles, insert plenty of internal links related to the main content. The main purpose is to let readers find more information in other articles that support the main one. The secondary purpose is to reduce the quality of the content once it has been stolen.

After stealing, thieves often delete your internal links. So imagine parts of the original article that direct the reader to another article for more information, while the stolen article lacks those links. Readers will be annoyed and may recognize it as stolen content. In general, the internal linking of the stolen article is weaker than that of the original, so it cannot match the SEO support the original gets.
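
To see how well a draft is linked, or how many of your links survived in a copy, you can count the links that point at your own domain. The following Python sketch uses only the standard library; links_to_host and the example domains are hypothetical.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin, urlparse


class _LinkCollector(HTMLParser):
    """Collect every <a href> on a page as an absolute URL."""

    def __init__(self, page_url):
        super().__init__()
        self.page_url = page_url
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        href = dict(attrs).get("href")
        if href:
            # Relative links resolve against the page they appear on.
            self.links.append(urljoin(self.page_url, href))


def links_to_host(html, page_url, host):
    """Return the links in `html` (served at `page_url`) that point at `host`."""
    collector = _LinkCollector(page_url)
    collector.feed(html)
    collector.close()
    host = host.lower()
    return [u for u in collector.links if (urlparse(u).hostname or "").lower() == host]


if __name__ == "__main__":
    article = (
        '<p><a href="/other-post/">more</a> '
        '<a href="https://example.org/x">source</a> '
        '<a href="https://myblog.example/seo/">seo guide</a></p>'
    )
    print(links_to_host(article, "https://myblog.example/post/", "myblog.example"))
    print(links_to_host(article, "https://copycat.example/post/", "myblog.example"))
```

The second call illustrates a side effect worth knowing: a relative link such as /other-post/ resolves to the copier's own domain once the HTML is pasted there, so only links written with your full domain still point back to you in a copy.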

Add watermark (logo) to images

If you pay attention, you will see that every image on the Lucid Gen website carries a watermark. If a thief uses your original image, they are effectively promoting your website for you. It takes little effort; see the guide on inserting logos into photos in bulk to do it quickly.

Lucid Gen wants you to note one thing: “don’t put the logo in a corner”. Thieves will either cover it with their own, larger logo or crop away the part that contains it. Place the logo where it does not bother users much but where thieves cannot hide it. Lucid Gen places it that way, and so do ShutterStock and Freepik.

How to add watermark to photo in bulk

Use DMCA and report content scraping sites

Many people say DMCA is useless: thieves still steal content as usual, and you give DMCA a free backlink. That sounds reasonable, but it doesn’t convince me.

  1. You give DMCA a backlink: DMCA gives you a backlink too. You can change the code to nofollow if you like, but with nofollow, how will Google’s bot pass through to pick up the backlink for you? So don’t be too stingy.
  2. Your content still gets stolen: Yes, DMCA only helps you at the reporting stage.
  3. You can report to Google without DMCA: Not quite. In some cases Google asks for more evidence, and the easiest thing to send is the DMCA link. I once reported 180 URLs that copied Lucid Gen’s content in 1 application; Google asked for more evidence, I sent the DMCA link, and those 180 URLs disappeared from the search results. DMCA helps in other ways too (such as reporting to a hosting provider or filing abroad). Most basically, it shows Google that your article was protected earlier than the scraper’s copy, and if the scraper has no DMCA at all, they lose.

In short: DMCA is recommended. The free plan is fine, and upgrading to Pro is better (Pro signals to thieves and Google that you are “not to be trifled with”, and adds a few other features). Right after posting an article, click the DMCA badge manually at least 2 times so that it creates a protection page for the new article.

I won’t go into DMCA in detail in this article. Go to dmca.com to create an account > get the DMCA code > go back to WordPress > Appearance > Widgets and add the DMCA code to the footer of your website.

Go to Appearance > Widgets to add the DMCA code to the footer

If you cannot negotiate with the copier to remove the offending content or add a backlink to you, file a report with Google as follows. Showing a little leniency first may well earn you a backlink.

Step 1: Visit the Google Copyright Removal page and fill in the following information.

  • Name: your first name.
  • Last name: your last name.
  • Company name: can be left blank.
  • The copyright holder you represent: yourself. Tick the confirmation box.
  • Email address: your email, preferably the Gmail account that manages Google Search Console for your website.
  • Country/Region: your country (Vietnam in Lucid Gen’s case).
  • Is the submitted information related to unauthorized streaming of an upcoming live event: No.
Fill in basic information to send a report to Google

Step 2: Fill in the URLs that copy your content. This part has 3 fields.

  • Identify and describe the copyrighted work: paste the copied piece of content here, optionally with the URL of your image, up to 500 characters.
  • Where we can see a licensed sample of your work: the URL of the original content on your website.
  • Location of the infringing material: the URL of the copied content on the thief’s website.
Fill in information about stolen content to report to Google
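
Before filing, it can help to measure how much of your article really appears on the suspect page, so you report the strongest cases first. Here is a rough Python sketch using the standard difflib module; copied_fraction and the 20-character minimum block size are assumptions for illustration, and both texts are expected as plain text with the HTML already stripped.

```python
import re
from difflib import SequenceMatcher


def normalize(text):
    """Lowercase and collapse whitespace so formatting differences don't matter."""
    return re.sub(r"\s+", " ", text).strip().lower()


def copied_fraction(original, suspect, min_block=20):
    """Fraction of `original` (0.0 to 1.0) found in `suspect` in runs of at least min_block characters."""
    a, b = normalize(original), normalize(suspect)
    if not a:
        return 0.0
    matcher = SequenceMatcher(None, a, b, autojunk=False)
    matched = sum(
        block.size for block in matcher.get_matching_blocks() if block.size >= min_block
    )
    return matched / len(a)


if __name__ == "__main__":
    mine = "Delay the RSS feed so that your post is indexed before any copy appears."
    theirs = "Intro by the thief. " + mine + " Outro."
    print(copied_fraction(mine, theirs))
```

SequenceMatcher compares character by character and slows down on very long pages, so for full articles it is more practical to compare paragraph by paragraph.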

Step 3: Tick all the statements below, enter the filing date and your full name as the signature, then submit.

Check the terms and conditions boxes, enter the date and signature and send the report to Google

You can track the result of the complaint on the Legal Removal Dashboard page, and watch your inbox in case Google emails you asking for more information.

Ask Google to index the article as soon as you finish writing it

This is important: notify Google of your new article as soon as you publish it. Go to Google Search Console > paste the new article URL into the search box > inspect the URL > request indexing.

Ask Google to index the article as soon as you finish writing it

If you are using WordPress, you can use the Instant Indexing for Google plugin to submit the indexing request automatically when you click the Publish button.

Instant Indexing for Google

Let Google announce the pages with your content

Google has many interesting utilities that we have yet to explore. You can ask Google Alerts to notify you whenever a new URL containing your chosen text appears in the search results.

Step 1: Visit the Google Alerts page > type a sentence from your article into the search box and press Enter.

Have Google notify when someone steals your website content

Step 2: Set up the notification options as follows.

  • Frequency: Immediately / Up to once per day
  • Source: Auto
  • Language: Vietnamese
  • Region: All regions
  • Quantity: All results
  • Send to: your email
Set up notification mode for Google Alerts
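
Choosing the sentence for step 1 can be semi-automated. The Python sketch below picks the longest sentence of an article, trims it to a handful of words and wraps it in double quotes so that the alert looks for the exact phrase. The “longest sentence” heuristic, the 12-word limit and the name pick_alert_phrase are assumptions for illustration; picking a distinctive sentence by hand works just as well.

```python
import re


def pick_alert_phrase(article_text, max_words=12):
    """Return a quoted exact-match phrase taken from the longest sentence."""
    sentences = re.split(r"(?<=[.!?])\s+", article_text.strip())
    longest = max(sentences, key=lambda s: len(s.split()))
    # Drop quote characters so they cannot break the quoted query.
    words = re.sub(r'["“”]', "", longest).split()[:max_words]
    return '"' + " ".join(words).rstrip(".!?,;:") + '"'


if __name__ == "__main__":
    text = (
        "Short one. This is the much longer sentence that nobody else "
        "would write by accident. End."
    )
    print(pick_alert_phrase(text))
```

Sentences that contain your brand name or an unusual turn of phrase make the best alerts, because they rarely appear on unrelated pages.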

Use proper names instead of “I”

Use your own name or your website’s brand name more often, instead of only personal pronouns like “I”. If thieves forget to change them, or change them but miss a few, readers will realize that the content was taken from your website.

Epilogue

In short, the passive measures are: as soon as you finish writing an article, create its DMCA protection link and submit it to Google for indexing; insert internal links into the content and watermarks into the images… The proactive measures are: block IPs, delay RSS Feed updates, randomize classes, and report to DMCA and Google. I believe that applying this set of measures will significantly reduce content theft on your website.

Trần Ngọc Minh Hiếu

I am currently working as a Data Analyst; before that, I worked in Digital Marketing. Blogging is a joy, helping me share my knowledge and experiences from life and work.

© 2019 Lucid Gen with by Tran Ngoc Minh Hieu DMCAProtected
