How to write a distributed crawler in Python

Updated on 2024-03-16
5 answers
  1. Anonymous user, 2024-02-06

    618IP proxy software has IP-switching and automatic IP-rotation functions.

  2. Anonymous user, 2024-02-05

    It's a bunch of computers against one computer.

    For example, suppose you use host C to crawl site S. S notices you are pulling data too fast, decides no human could be operating like that, and blocks your IP. Frustrating, right? This is where the distributed crawler comes in: computers c1, c2, c3, ..., cN crawl S together, each taking a share of the task so that its own crawl frequency stays low. But because the N computers crawl in parallel, the overall throughput is impressive, and you simply merge the data they bring back.
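    A minimal sketch of that idea, assuming a Redis server on localhost plus the redis-py and requests packages; the key names crawl:urls and crawl:pages are made up for illustration:

```python
# Each of the N machines runs worker(); they all pop from one shared
# Redis list, so the work divides itself across the cluster.
import redis
import requests

r = redis.Redis(host="localhost", port=6379)

def worker():
    while True:
        # BLPOP blocks until a URL is available; None means the queue
        # stayed empty past the timeout, so this worker can stop.
        item = r.blpop("crawl:urls", timeout=30)
        if item is None:
            break
        _, url = item
        resp = requests.get(url.decode(), timeout=10)
        r.hset("crawl:pages", url, resp.text)  # assemble results centrally

# Seed the queue once from any machine, then start worker() on each node.
r.rpush("crawl:urls", "https://example.com/page1", "https://example.com/page2")
worker()
```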

    Distributed is a good thing.

    For another example, suppose a big player wants to pump up a stock price. That takes a large amount of capital, and if the funds are concentrated on one computer, the exchange records the IP and the regulator (the "ZJH") can easily trace it later. So the big player naturally operates in a distributed way: spread the funds across n computers and n accounts, keeping each account below the regulator's red line, then use a distributed program to coordinate the trading of the n computers and n accounts in concert.

    That way, what outsiders see is n small retail investors ("leeks") on n computers, and it is hard to tell that they are acting together; only the master program knows they are.

  3. Anonymous user, 2024-02-04

    Personally, I think the following 4 libraries are enough for a novice learning to crawl web pages with Python (strictly speaking you can do without the fourth one, except in some special cases):

    2. Parsing web pages: those familiar with jQuery can use pyquery.

    3. Use requests to submit various types of requests, with support for redirects, cookies, and so on.

    4. Use Selenium to simulate user-like operations in a browser and handle pages generated dynamically by JavaScript.

    Each of these libraries has its own role; together they can crawl all kinds of web pages and parse them. For specific usage, refer to their official manuals (linked above).
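    As a quick illustration of two of these libraries working together, here is a minimal sketch; the URL is a placeholder, not something from the original answer:

```python
# Fetch a page with requests, then parse it with pyquery's
# jQuery-style selectors.
import requests
from pyquery import PyQuery as pq

resp = requests.get("https://example.com", timeout=10)
resp.raise_for_status()  # fail loudly on HTTP errors

doc = pq(resp.text)
title = doc("title").text()                       # like $("title").text()
links = [a.attrib.get("href") for a in doc("a")]  # all link targets
print(title, links)
```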

    Doing things calls for a goal-driven state. If you don't have anything in particular to scrape, novices can start learning with this challenge site**.

    At present it has been updated to the fifth level; after passing the first four levels, you should have mastered the basic operations of these libraries.

    If you really can't get through, look at the solutions here. The fourth level requires parallel programming (a serial program would take far too long to complete it). The fourth and fifth levels have only just been released, and their solutions are not yet out...

    After learning these basics, picking up scrapy, a powerful crawler framework, will go much more smoothly. Here is its introduction in Chinese.
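    For a taste of scrapy, a minimal spider sketch; the site and selector are placeholders:

```python
# Run with: scrapy runspider demo_spider.py
import scrapy

class DemoSpider(scrapy.Spider):
    name = "demo"
    start_urls = ["https://example.com"]

    def parse(self, response):
        # CSS selectors work much like the jQuery/pyquery ones above.
        for href in response.css("a::attr(href)").getall():
            yield {"link": response.urljoin(href)}
```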

  4. Anonymous user, 2024-02-03

    Starting from a crawler's basic needs:

    1. Fetching. You don't necessarily have to use Python's urllib, but you should learn it if you never have. A better alternative is a friendlier, more mature third-party library such as requests; a Pythonista who doesn't know the ecosystem's libraries has learned the language in vain. Fetching basically means pulling pages back.
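    The same fetch done both ways, as a sketch; the URL is a placeholder:

```python
# Standard-library urllib vs. the friendlier third-party requests.
from urllib.request import urlopen
import requests

url = "https://example.com"

# urllib: ships with Python, but the API is lower-level.
with urlopen(url, timeout=10) as f:
    html_a = f.read().decode("utf-8", errors="replace")

# requests: handles encodings, redirects, and cookies for you.
html_b = requests.get(url, timeout=10).text
```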

    Once you go deeper into it, you will find there is much more to deal with.

  5. Anonymous user, 2024-02-02

    "Getting started" is a good motivation, but it can be slow. If you have a project in your hand or in your head, then you will be driven by goals in practice, rather than learning slowly like a learning module.

    Besides, if every knowledge point in a body of knowledge is a vertex in a graph and the dependencies between them are edges, then that graph is certainly not a directed acyclic graph, because the experience of learning A can help you learn B and vice versa. So you don't need to ask how to "get started": no such single "entry point" exists!

    What you need to learn is how to build something sizable; in the process you will quickly pick up whatever you need. Of course, you could argue that you need to know Python first, otherwise how can you learn to write a crawler in it? In fact, you can learn Python in the very process of building the crawler:

    Seeing that many of the earlier answers talk about the "technique" (what software to use and how to crawl), let me talk about the "Tao" alongside the "technique": how crawlers work and how to implement one in Python.

    To make a long story short, you need to learn:

    1. The basics of how a crawler works (see the sketch after this list).

    2. A basic HTTP scraping tool: Scrapy.

    3. Bloom filter: see "Bloom Filters by Example".
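    Here is the promised sketch of the basic working principle: breadth-first search over links, with a queue of URLs to visit and a de-duplication set (which a Bloom filter would replace at large scale). The seed URL and the page cap are illustrative assumptions:

```python
# BFS crawler skeleton: pop a URL, fetch it, enqueue unseen links.
from collections import deque
from urllib.parse import urljoin
import requests
from pyquery import PyQuery as pq

seed = "https://example.com"
queue = deque([seed])
seen = {seed}  # at scale, swap this set for a Bloom filter

fetched = 0
while queue and fetched < 50:  # small cap so the sketch terminates
    url = queue.popleft()
    html = requests.get(url, timeout=10).text
    fetched += 1
    # store(url, html) would go here
    for a in pq(html)("a"):
        nxt = urljoin(url, a.attrib.get("href", ""))
        if nxt.startswith("http") and nxt not in seen:
            seen.add(nxt)
            queue.append(nxt)
```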

    If you need to scrape web pages at scale, you need to learn about distributed crawlers. It's not that mysterious: you just need to learn how to maintain a distributed queue that all the machines in the cluster can share effectively. The simplest implementation:

    The combination of rq and Scrapy: darkrho/scrapy-redis on GitHub.
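    A minimal settings sketch, assuming the current scrapy-redis package (the descendant of the project named above) and a local Redis server:

```python
# settings.py — route scheduling and de-duplication through Redis so
# every machine in the cluster shares one queue.
SCHEDULER = "scrapy_redis.scheduler.Scheduler"
DUPEFILTER_CLASS = "scrapy_redis.dupefilter.RFPDupeFilter"
REDIS_URL = "redis://localhost:6379"
SCHEDULER_PERSIST = True  # keep the queue and dupefilter across restarts
```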

    Post-processing: web page extraction (grangier/python-goose on GitHub), storage (MongoDB).

