-
Depending on your needs, it can be very complicated or very simple. I used to do collection myself, and I was nowhere near an expert, yet I could still handle more than 80% of H5 pages, web pages, and apps.
To sum up, every site differs in difficulty, but 99% of data can be captured. Even the biggest crawler in China cannot be kept out completely unless you shut down the server and delete the data; otherwise there are simply too many ways to collect, and it comes down to nothing more than cost.
Anti-crawling also has its costs, including losing users because of a degraded experience, or restricting access to the internal network only. If your data is valuable, it is recommended to find a capable engineer to put some protection in place; the basic measures are covered above.
To sum up: anti-crawling can only stop the honest, not the determined. There are many ways to raise the difficulty, but they only increase the cost of collection.
-
First of all, you have to be proficient at collection yourself. I have been doing this for many years; here is my experience.
1. Anti-search-engine settings: these only stop well-behaved spiders. Against rogue spiders this method does not work, so see method 2.
2. This one costs some server resources. Whether the visitor is a spider, a human, or a machine, the browser headers can be forged, the IP can be forged, and cookies can be forged, so use a database to record every visit, storing only $_SERVER['REMOTE_ADDR']. Then clean the data regularly with cron and count visits per IP; for example, clear the table every 5 minutes, and if an IP has more than 100 visits in that window, add it to the deny list and let Apache block it. 100 is obviously not a normal user's visit count; of course, the threshold should be tuned to your site and to keep performance acceptable. Search-engine IPs such as Google's can still be identified, so whitelist those and review everything else. This is how I prevented collection, and nobody has scraped my hundreds of thousands of records. Of course, there is still a way around it: the collector has to use high-anonymity proxies, and each proxy IP can only collect up to the limit I set, like the 100 above.
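A minimal sketch of the cron-side logic described above, written in Python (the original setup is PHP + MySQL + Apache); the table name, column names, file paths, and the threshold of 100 are assumptions:

```python
# Sketch of the cron job: count visits per IP over the last 5 minutes and
# append heavy hitters to an Apache deny list. Names and paths are assumptions.
import sqlite3

THRESHOLD = 100                       # "clearly not a normal user" per 5-minute window
WHITELIST = {"66.249.66.1"}           # example: known search-engine spider IPs

conn = sqlite3.connect("visits.db")   # stand-in for the database that logs REMOTE_ADDR
rows = conn.execute(
    "SELECT remote_addr, COUNT(*) FROM visits GROUP BY remote_addr"
).fetchall()

with open("deny.conf", "a") as deny_list:
    for ip, hits in rows:
        if hits > THRESHOLD and ip not in WHITELIST:
            deny_list.write(f"Deny from {ip}\n")   # Apache 2.2-style deny line

# Clear the table so the next 5-minute window starts fresh, as in the original setup.
conn.execute("DELETE FROM visits")
conn.commit()
conn.close()
```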
-
This is the second anti-collection method: insert HTML comments with a recurring, recognizable pattern at the head and tail of the article body page.
Of course, these can be stripped with a regular expression, but it is enough to deal with an ordinary collection tool.
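For illustration, a minimal sketch of why this is weak: the repeated markers strip out with one regular expression (the comment text here is invented):

```python
# The inserted head/tail markers are easy to remove with a regular expression,
# which is why this method only deters the most basic collection tools.
import re

html = "<!--anti-copy-head-->Chapter 1: the actual story text<!--anti-copy-tail-->"
clean = re.sub(r"<!--anti-copy-(?:head|tail)-->", "", html)
print(clean)  # Chapter 1: the actual story text
```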
The third anti-collection method: mix interference headings and random text into the page, for example:
Heading 1 and Heading 2.
Title 3 and Title 4.
-
For example, malicious collection can make the CPU spike suddenly and the server unstable, pages get spoofed, and the network behaves abnormally.
When certain IPs collect maliciously, it not only slows down access for legitimate users but also pointlessly increases CPU, memory, and bandwidth consumption, and can even crash the server.
This wastes a lot of money spent on resources and consumes a lot of manpower troubleshooting problems.
Automatic actions. The plug-in simplifies operation by automatically blocking IPs whose collection rate exceeds the configured limit, or IPs within a configured range, and it can allow collection again after a set period of time.
Intuitive data. Statistics are shown as intuitive charts, so the administrator can see the situation at a glance; no visit is missed, and the numbers of abnormal users and abnormal collections are displayed clearly.
Powerful logs. The log function gives a clear view of the access records for any time and any IP.
-
Other headings or content can also be mixed in, for example:
Random content 1, welcome to the News Software Information Network, random content 2
That is: add random strings to the head and tail of the article body, or to the head and tail of the list.
-
It is possible to temporarily prevent others from harvesting your own web pages, but it is not a cure.
There are many ways to do this:
1. At the web server level, directly block IP addresses that make a large number of requests in a short period of time.
2. At the script level, do the same; a minimal sketch follows.
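A minimal sketch of the script-level version, assuming a per-IP sliding window; the window length and limit are placeholder values, and a real deployment would keep the counters in shared storage such as Redis rather than in process memory:

```python
# Per-IP sliding-window counter: deny an IP once it exceeds MAX_REQUESTS
# within WINDOW_SECONDS. Thresholds here are illustrative only.
import time
from collections import defaultdict, deque

WINDOW_SECONDS = 60
MAX_REQUESTS = 100
_hits = defaultdict(deque)

def allow(ip):
    """Return False once an IP exceeds MAX_REQUESTS within the window."""
    now = time.time()
    q = _hits[ip]
    while q and now - q[0] > WINDOW_SECONDS:
        q.popleft()          # drop timestamps that fell out of the window
    if len(q) >= MAX_REQUESTS:
        return False
    q.append(now)
    return True
```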
-
Plagiarism and copying are commonplace on the Internet, and I have collected other sites' content myself. Some sites do try to prevent collection, and the principle is fairly simple: when the site detects a collector such as 火车头 (LocoySpider), the program can tell that the pages are not being clicked by a human, because the software requests them far too fast. The program can then block the collector's IP address so it cannot fetch the content; that is one way to prevent bulk collection. There is also manual collection, where someone simply goes to another site and copies and pastes; this is the hardest to stop. You can use JavaScript to make it harder, for example by disabling copy and paste or even blocking "view source". There are plenty of such scripts online, but to be honest they still cannot completely eliminate plagiarism.
Some collection software is so powerful that your pages are scraped almost the moment they go live.
-
There are many ways to prevent the content of your own web pages from being captured.
Method 1: Add a watermark to the images in your content.
Method 5: Use JS to encrypt the web page content.
This method shows up on the odd individual site and is very heavy-handed. Disadvantage: search engine crawlers cannot read the encrypted content either, so it kills off all collectors and spiders alike. For a webmaster who truly hates both search engines and collectors and can pull it off, fine; nobody will be able to collect from you.
Method 6: Have the site use different templates at random.
Analysis: Because the collector locates the content it wants based on the page structure, rotating between two templates invalidates its collection rules, which is nice, and it has no effect on search engine crawlers.
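A minimal sketch of random template rotation, using Flask as an example server; the route and template names are assumptions, and the two templates are meant to render the same data with different HTML structure so that fixed collection rules break:

```python
# Serve the same article data through a randomly chosen template so that
# XPath/CSS-based collection rules written against one layout stop matching.
import random
from flask import Flask, render_template

app = Flask(__name__)
TEMPLATES = ["article_layout_a.html", "article_layout_b.html"]  # assumed templates

@app.route("/article/<int:article_id>")
def article(article_id):
    data = {"id": article_id, "title": "...", "body": "..."}    # placeholder data
    return render_template(random.choice(TEMPLATES), **data)
```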
Method 7: Use scripting languages to do pagination (hide pagination).
Analysis: Again, search engine crawlers will not execute the scripts behind this kind of hidden pagination, which hurts how well the pages get indexed. However, when the collector writes collection rules they have to analyze the target page's source anyway, and anyone who knows a little about scripting will find the real pagination URLs.
What the crawler does (I should really say the collector): it has to analyze your page source anyway, so working out the hidden pagination takes hardly any extra time.
What the collector does: reduce the number of requests per unit of time, accepting lower collection efficiency.
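A minimal sketch of that throttling, with assumed URLs, delay range, and User-Agent; random pauses keep the per-minute request count low and less regular:

```python
# Fetch pages slowly enough to stay under a per-IP frequency limit.
import random
import time
import urllib.request

urls = [f"https://example.com/article/{i}" for i in range(1, 6)]   # assumed targets

for url in urls:
    req = urllib.request.Request(url, headers={"User-Agent": "Mozilla/5.0"})
    with urllib.request.urlopen(req, timeout=10) as resp:
        html = resp.read().decode("utf-8", errors="ignore")
    time.sleep(random.uniform(2, 6))   # random pause between requests
```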
-
First of all, anyone who works in data mining or big data analysis must firmly believe that no site in the world is absolutely scrape-proof. In other words, as long as a site is online, there is some way to crawl its data. Even when a site has protection in place, hold on to that conviction.
Use a proper IP pool and rotate IPs constantly. This keeps resetting your identity and the cookies tied to it, and gets you past at least IP blocking and similar protection measures. An IP pool is something every data collector must have.
This is the first essential of data crawling.
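A minimal sketch of rotating through an IP pool with the standard library; the proxy addresses and target URL are assumptions, and building a fresh opener per request also drops any cookies tied to the previous identity:

```python
# Pick a random proxy from the pool for every request.
import random
import urllib.request

PROXY_POOL = ["http://10.0.0.1:8080", "http://10.0.0.2:8080", "http://10.0.0.3:8080"]

def fetch(url):
    proxy = random.choice(PROXY_POOL)
    opener = urllib.request.build_opener(
        urllib.request.ProxyHandler({"http": proxy, "https": proxy})
    )
    opener.addheaders = [("User-Agent", "Mozilla/5.0")]
    with opener.open(url, timeout=10) as resp:
        return resp.read().decode("utf-8", errors="ignore")

html = fetch("https://example.com/list?page=1")   # assumed target page
```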
Write a good crawler system and a good set of rules. On top of a solid crawler system, the rules need to be able to tell intelligently when they have been blocked, or you need several equivalent rule sets that collect the same data from different angles. That way you can handle network problems efficiently and handle the data-parsing side efficiently as well.
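A minimal sketch of "detect the block, then fall back to another rule set"; the block markers and regexes are invented for illustration:

```python
# Try several equivalent extraction rules and notice when a response looks
# like a block page rather than real content.
import re

BLOCK_MARKERS = ("verify you are human", "access denied", "captcha")
RULES = [
    re.compile(r'<h1 class="title">(.*?)</h1>', re.S),    # rule for layout A
    re.compile(r'<div id="headline">(.*?)</div>', re.S),  # rule for layout B
]

def extract_title(html, status):
    if status in (403, 429) or any(m in html.lower() for m in BLOCK_MARKERS):
        return None   # looks blocked: switch proxy or slow down before retrying
    for rule in RULES:
        match = rule.search(html)
        if match:
            return match.group(1).strip()
    return None       # no rule matched: the page structure probably changed
```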
Avoid visual data acquisition. Protection is usually aimed at the rendered page and at conventional crawling, so during collection try not to scrape what the page displays. Instead, intercept the data in transit: capture the requests and packets the page itself sends, and pull the data straight out of them for capture and mining.
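A minimal sketch of that idea: read the data from the underlying request rather than the rendered page. The JSON endpoint and its field names are assumptions; in practice you find them by watching the network traffic the page generates:

```python
# Call the data endpoint the page itself uses instead of parsing rendered HTML.
import json
import urllib.request

api_url = "https://example.com/api/articles?page=1"   # hypothetical endpoint
req = urllib.request.Request(api_url, headers={"User-Agent": "Mozilla/5.0"})
with urllib.request.urlopen(req, timeout=10) as resp:
    payload = json.load(resp)

for item in payload.get("items", []):                  # hypothetical field names
    print(item.get("title"), item.get("url"))
```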
-
3. Open the site frequently and watch whether it redirects unexpectedly, to determine whether there is operator (ISP) hijacking.
What should you do if the site is hijacked?
1. The domain name has wildcard (pan) resolution.
Turn off wildcard resolution: go into the domain management console, open the domain's resolution records, find the record whose host is *, and delete it.
2. Hacker hijacking.
Compare against your backup files, find the modified files, and clean out the Trojan (note: make backups a habit, at least once a week).
3. The browser is hijacked.
4. Operator (ISP) hijacking.
This kind of hijacking is the hardest to deal with and is usually the one people actually run into; handling it brings us to HTTPS encryption.
1) HTTPS requires applying for a certificate from a CA; free certificates are relatively few, and most cost money.
2) HTTP is the hypertext transfer protocol and transmits information in plaintext, while HTTPS adds SSL encryption for secure transmission.
3) HTTP and HTTPS use completely different connection methods and different ports: the former uses 80, the latter 443.
4) An HTTP connection is simple and stateless; HTTPS is built from SSL plus HTTP and provides encrypted transmission and identity authentication, making it more secure than HTTP.
For typical operator hijacking, enabling HTTPS encryption can cut hijacking by roughly 90%.
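As a small illustration of why this helps, once HTTPS is on you can check from any client that the certificate actually served belongs to your domain, which is one way to spot tampering in transit; a minimal sketch with an example domain:

```python
# Fetch and print the certificate a server presents for a given domain.
import socket
import ssl

host = "example.com"                         # example domain
ctx = ssl.create_default_context()           # uses the system trust store
with socket.create_connection((host, 443), timeout=10) as sock:
    with ctx.wrap_socket(sock, server_hostname=host) as tls:
        cert = tls.getpeercert()

print(cert["subject"])                       # who the certificate was issued to
print(cert["notAfter"])                      # expiry date
```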
-
Functions of an HTTPS certificate
1) Encrypted transmission.
When a user accesses the site over HTTP, traffic between the browser and the server is transmitted in plaintext, which means the passwords, account numbers, transaction records, and other confidential information the user enters are all in plaintext and can be leaked, stolen, tampered with, or abused by attackers at any time.
What does an SSL certificate do here? After the certificate is installed and the site is accessed over HTTPS, an "SSL encrypted channel" (the SSL protocol) is established between the client browser and the server, providing strong two-way encrypted transmission and preventing the data from being leaked or tampered with.
2) Verify the real identity of the server.
What does an SSL certificate do here? Phishing sites are rampant; how can users tell whether a site is a phishing site or the real thing? After a globally trusted SSL certificate is deployed, the browser's built-in security mechanisms check the certificate status in real time and show the verified identity to the user, so users can easily confirm the site's real identity and avoid phishing.
How to get an HTTPS certificate
A secure and trustworthy SSL certificate must be applied for from a CA (Certificate Authority) and is issued only after a rigorous review.