How to deal with crawlers crawling HTTPS sites

Technology · Updated on 2024-05-27
3 answers
  1. Anonymous user · 2024-02-11

    1. How the spider identifies HTTPS sites.

    1) By the hyperlinks in web pages: the spider discovers pages through hyperlinks, and if a hyperlink uses https, the linked page is treated as an HTTPS site (see the sketch after this list).

    2) By the submission method at the webmaster platform's submission entrance: with active submission, for example, if the links submitted in the file are https, they will be discovered in HTTPS form.

    4) By the link's historical status. This method is mainly used for error correction: if https is extracted by mistake, two things can happen. One, the https version is inaccessible and the crawl fails; two, even if the crawl succeeds, the result may not be what the webmaster wants. So a certain amount of error correction is applied.
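
    A minimal sketch of method 1), assuming the hyperlinks have already been extracted from a page (the URLs are placeholders):

    // Classify extracted hyperlinks by protocol, as in method 1).
    // `links` stands in for absolute URLs pulled out of a crawled page.
    const links = ['https://example.com/a', 'http://example.com/b'];

    for (const link of links) {
      const { protocol } = new URL(link); // 'https:' or 'http:'
      console.log(link, protocol === 'https:' ? '-> HTTPS site' : '-> HTTP site');
    }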

    2. Crawling of HTTPS links.

    Two setups are common now. The first is pure HTTPS crawling, where the site has no HTTP version at all; the second is redirecting from HTTP to HTTPS. Both can be crawled normally, with the same effect as crawling HTTP.
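
    A minimal sketch of a fetch that handles both setups with Node's built-in modules (the URL and function names are placeholders): it follows an http-to-https redirect when one exists, and a pure-HTTPS site simply never takes the redirect branch.

    // Fetch a page; follow the redirect if the http version points at https.
    const http = require('http');
    const https = require('https');

    function fetch(url, cb) {
      const mod = url.startsWith('https:') ? https : http;
      mod.get(url, (res) => {
        const { statusCode, headers } = res;
        if (statusCode >= 300 && statusCode < 400 && headers.location) {
          res.resume(); // discard the redirect body
          return fetch(headers.location, cb); // e.g. http -> https
        }
        let body = '';
        res.on('data', (chunk) => (body += chunk));
        res.on('end', () => cb(body));
      });
    }

    fetch('http://example.com/', (body) => console.log(body.length));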

    3. HTTPS display.

    For HTTPS pages, there is a clear HTTPS indicator on the results display side.

  2. Anonymous user · 2024-02-10

    The Octopus Collector can crawl data from HTTPS sites. When you set up a collection rule, you can choose the HTTPS protocol for data collection. The Octopus Collector handles the HTTPS site's certificate verification automatically, ensuring the security and accuracy of the data.
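
    For comparison, this is roughly what default certificate verification looks like in a hand-rolled Node.js crawler; it is a sketch, not the Octopus Collector's actual API:

    // By default, https.get verifies the server certificate against the
    // system CA store and fails the request if verification does not pass.
    const https = require('https');

    https
      .get('https://example.com/', (res) => {
        console.log('certificate accepted, status:', res.statusCode);
        res.resume();
      })
      .on('error', (err) => {
        // self-signed, expired, or mismatched certificates end up here
        console.error('TLS/request error:', err.message);
      });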

  3. Anonymous user · 2024-02-09

    No, but I'm guessing you're using the wrong module.

    const http = require('http'); // <-- this module only handles http://, not https://
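
    A minimal sketch of the likely fix, assuming the goal is to request an https:// URL with only built-in modules:

    // Use the https module for https:// URLs.
    const https = require('https');

    https.get('https://example.com/', (res) => {
      let body = '';
      res.on('data', (chunk) => (body += chunk));
      res.on('end', () => console.log(body.slice(0, 200)));
    });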

