-
Web crawlers and viruses are two completely different concepts. A web crawler is a technology for automatically obtaining information from the Internet: a program simulates the way a human visits web pages in a browser and automatically scrapes the data on those pages. A virus, by contrast, is a type of malware that damages and harms a computer system.
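In practice, "simulating a browser visit" usually just means sending an HTTP request with browser-like headers and reading back the HTML. A minimal Python sketch of the idea, assuming the third-party requests library is installed and using a placeholder URL:

```python
import requests

headers = {
    # Present a browser-like User-Agent so the request resembles a normal visit.
    "User-Agent": "Mozilla/5.0 (compatible; example-crawler/0.1)"
}
response = requests.get("https://example.com", headers=headers, timeout=10)
response.raise_for_status()   # fail loudly on HTTP errors
print(response.text[:200])    # first 200 characters of the page HTML
```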
Web crawlers are legitimate data-harvesting tools, while viruses are illegal malware. Octopus Collector is a full-featured, easy-to-operate Internet data collector with wide coverage. If you need to collect data, Octopus Collector offers intelligent page recognition and flexible custom collection rules to help you obtain the data you need quickly. To learn more about its features and cooperation cases, please visit the official website.
-
The two are unrelated. A crawler generally means a web crawler: a program or script that automatically scrapes web-page information according to certain rules. A Trojan horse, by contrast, is a kind of computer virus: a malicious piece of code with special functions hidden inside a normal program, a backdoor capable of destroying and deleting files, sending out passwords, logging keystrokes, launching DoS attacks, and so on.
-
A web crawler, also called a web spider, is a type of web robot: a program or script used to browse the World Wide Web automatically. Crawlers can verify hyperlinks, and sites such as web search engines use crawler software to update their own web content or their indexes of other sites. Because the crawling process consumes the target system's resources, a crawler that visits a large number of pages needs to take planning, load, and so on into account, as sketched below.
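As an illustration of that kind of planning, here is a minimal Python sketch of crawl "politeness": it checks the site's robots.txt and pauses between requests. The requests library, the URLs, and the 2-second delay are all assumptions made for the example:

```python
import time
import urllib.robotparser

import requests

rp = urllib.robotparser.RobotFileParser()
rp.set_url("https://example.com/robots.txt")
rp.read()                      # fetch and parse the site's robots.txt

for url in ["https://example.com/a", "https://example.com/b"]:
    if not rp.can_fetch("example-crawler", url):
        continue               # skip pages the site disallows
    requests.get(url, timeout=10)
    time.sleep(2)              # fixed delay to limit load on the server
```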
General-purpose web crawlers.
A general-purpose web crawler, also known as a scalable web crawler, expands its crawl from a few seed URLs to the entire web, mainly collecting data for portal-site search engines and large web service providers. For commercial reasons, their technical details are rarely published. These crawlers cover a huge range and number of pages, so they demand high crawling speed and large storage space, while the order in which pages are crawled matters relatively little; and because there are far too many pages to refresh, they usually work in parallel, though a single refresh pass still takes a long time.
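As a toy illustration of that parallel operation, the sketch below fetches several seed URLs concurrently with a thread pool. The seed list, pool size, and use of the requests library are assumptions of the example, not details of any real search-engine crawler:

```python
from concurrent.futures import ThreadPoolExecutor

import requests

def fetch(url):
    """Download one page and return (url, HTTP status code)."""
    resp = requests.get(url, timeout=10)
    return url, resp.status_code

seeds = ["https://example.com/1", "https://example.com/2", "https://example.com/3"]

# Fetch all seed pages in parallel with a small pool of worker threads.
with ThreadPoolExecutor(max_workers=3) as pool:
    for url, status in pool.map(fetch, seeds):
        print(url, status)
```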
The structure of a general-purpose web crawler can be roughly divided into several parts: a page-fetching module, a page-analysis module, a link-filtering module, a page database, a URL queue, and an initial URL set. To improve efficiency, general-purpose web crawlers adopt certain crawling strategies; the commonly used ones are the depth-first strategy and the breadth-first strategy.
1) Depth-first strategy: the basic method is to follow links in order of increasing depth, visiting the next level of links on each page until no deeper level can be reached. After completing one crawling branch, the crawler returns to the previous link node and searches its other links; when all links have been traversed, the crawl ends. This strategy suits vertical search or site-internal search, but on sites whose page content is deeply nested it can waste a great deal of resources.
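A minimal sketch of this depth-first order, assuming the third-party requests and beautifulsoup4 libraries; the seed URL and the depth cap are illustrative:

```python
from urllib.parse import urljoin

import requests
from bs4 import BeautifulSoup

visited = set()

def crawl_dfs(url, depth, max_depth=3):
    if depth > max_depth or url in visited:
        return                      # stop this branch, back up to the caller
    visited.add(url)
    html = requests.get(url, timeout=10).text
    for a in BeautifulSoup(html, "html.parser").find_all("a", href=True):
        # Follow each link all the way down before trying its siblings.
        crawl_dfs(urljoin(url, a["href"]), depth + 1, max_depth)

crawl_dfs("https://example.com", depth=0)
```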
2) Breadth-first strategy: pages are crawled in order of the depth of their directory level, with pages at shallower levels crawled first. When all pages at one level have been crawled, the crawler moves down to the next level. This strategy effectively controls the crawl depth, avoids the problem of a crawl that cannot terminate on an infinitely deep branch, and is easy to implement without storing many intermediate nodes; its drawback is that it takes a long time to reach pages at deep directory levels.
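The same toy crawl in breadth-first order, using an explicit URL queue so that shallower pages are always fetched before deeper ones; again, the libraries, seed URL, and depth cap are assumptions of the sketch:

```python
from collections import deque
from urllib.parse import urljoin

import requests
from bs4 import BeautifulSoup

def crawl_bfs(seed, max_depth=3):
    visited = {seed}
    queue = deque([(seed, 0)])          # (url, depth) pairs
    while queue:
        url, depth = queue.popleft()    # shallowest pending page first
        html = requests.get(url, timeout=10).text
        if depth == max_depth:
            continue                    # crawl this level but go no deeper
        for a in BeautifulSoup(html, "html.parser").find_all("a", href=True):
            link = urljoin(url, a["href"])
            if link not in visited:
                visited.add(link)
                queue.append((link, depth + 1))

crawl_bfs("https://example.com")
```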
Reptiles are vertebrates. Also called reptilians, they are amniotes in the superclass Tetrapoda, and the name is commonly applied to all sauropsids and synapsids other than birds and mammals, including turtles, snakes, lizards, crocodilians, the extinct dinosaurs, and the mammal-like reptiles. >>>More
Octopus Collector is an Internet data collector that can be used easily without programming knowledge. If you want to write a web crawler in PHP yourself, you can refer to the following steps: 1 >>>More
There are various types of web crawlers in Python, including library-based crawlers and framework-based crawlers. A library-based crawler uses Python's HTTP request libraries (e.g., requests) and parsing libraries (e.g., BeautifulSoup) to send requests and parse page content. This kind of crawler is relatively simple to develop and suits small-scale data collection tasks. >>>More
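A minimal example of such a library-based crawler, using requests to fetch a page and BeautifulSoup to parse it; the URL and the choice of <h2> tags are placeholders:

```python
import requests
from bs4 import BeautifulSoup

resp = requests.get("https://example.com", timeout=10)
resp.raise_for_status()

soup = BeautifulSoup(resp.text, "html.parser")
for heading in soup.find_all("h2"):        # e.g. collect all <h2> titles
    print(heading.get_text(strip=True))
```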
The background of web crawlers is that, with the development of the Internet and the growth of data, people needed to obtain all kinds of information on the Internet more quickly and efficiently. Traditional manual methods could not meet this need, so web crawling technology came into being. Web crawlers automatically visit web pages and scrape the data in them, which greatly improves the efficiency and accuracy of data acquisition. >>>More
A web crawler (also known as a web spider or web robot; in the FOAF community, more often called a web chaser) is a program or script that automatically scrapes information from the World Wide Web according to certain rules. Less commonly used names include ant, auto-indexer, emulator, and worm. >>>More