Which Python crawler framework is better? (Zhihu)

Technology · Updated 2024-02-24
5 answers
  1. Anonymous user, 2024-02-06

    1. Scrapy: an application framework written for crawling websites and extracting structured data. It can be used in a range of programs for data mining, information processing, or archiving historical data, and with this framework you can easily scrape all kinds of data (a minimal spider sketch follows this list).

    2. Pyspider: a powerful web crawler system implemented in Python. You can write scripts in its browser interface, schedule functions, and watch crawl results in real time; it uses common databases on the backend to store crawl results, and also supports scheduled tasks and task priorities.

    3. Portia: an open-source visual crawler tool that lets you crawl without any programming knowledge; simply annotate the pages you are interested in, and it creates a spider to scrape data from similar pages.

    4. Beautiful Soup: a Python library that extracts data from HTML or XML files. It provides idiomatic ways of navigating, searching, and modifying documents through your favorite parser, and it can save you hours or even days of work.

    5. Grab: a Python framework for building web scrapers. With Grab you can create scraping tools of varying complexity, from simple five-line scripts to complex asynchronous crawlers that handle tens of thousands of pages. Grab provides an API for performing network requests and processing the received content.

    6. Cola: a distributed crawler framework. Users only need to write a few specific functions, without worrying about the details of distributed operation; tasks are automatically distributed across multiple machines, and the whole process is transparent to the user.
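
    To make the Scrapy entry concrete, here is a minimal spider sketch. It is illustrative only: the target site (quotes.toscrape.com, a public scraping sandbox) and the CSS selectors are assumptions, not part of the original answer.

    ```python
    import scrapy


    class QuotesSpider(scrapy.Spider):
        """A minimal Scrapy spider: fetch one page and yield structured items."""
        name = "quotes"
        # Hypothetical target: a public sandbox site for scraping practice.
        start_urls = ["https://quotes.toscrape.com/"]

        def parse(self, response):
            # Extract structured data with CSS selectors.
            for quote in response.css("div.quote"):
                yield {
                    "text": quote.css("span.text::text").get(),
                    "author": quote.css("small.author::text").get(),
                }
    ```

    Save it as quotes_spider.py and run "scrapy runspider quotes_spider.py -o quotes.json" to get the scraped items as JSON.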

  2. Anonymous user, 2024-02-05

    Scrapy. It is well known and integrates many common crawler needs. Cons: it cannot load JS.

  3. Anonymous user, 2024-02-03

    Beautiful Soup. It is well known and integrates many common crawler needs. Cons: it cannot load JS.

    Scrapy. It looks like a powerful crawler framework that meets the needs of simple page crawling (for example, when URL patterns are known explicitly). With this framework you can easily scrape data such as Amazon product listings.

    But for slightly more complex pages, such as Weibo pages, this framework does not meet the needs.

    Mechanize. Pros: it can load JS. Cons: the documentation is severely lacking. However, with the official examples and some trial and error, it is still barely usable.
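
    Since the documentation is lacking, a minimal form-submission sketch may help; the URL and the form field name "q" are hypothetical, assumed only for illustration.

    ```python
    import mechanize

    br = mechanize.Browser()
    br.set_handle_robots(False)           # ignore robots.txt for this sketch
    br.open("http://example.com/search")  # hypothetical page with a search form
    br.select_form(nr=0)                  # select the first form on the page
    br["q"] = "python crawler"            # fill the (assumed) field named "q"
    response = br.submit()
    print(response.read()[:200])
    ```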

    Selenium. This is a driver that controls a real browser; through this library you can directly drive the browser to complete certain operations, such as entering a captcha.
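
    A short sketch of what driving the browser looks like; the login URL and field names are hypothetical, and it assumes a local ChromeDriver is installed.

    ```python
    from selenium import webdriver
    from selenium.webdriver.common.by import By

    driver = webdriver.Chrome()              # requires ChromeDriver locally
    driver.get("https://example.com/login")  # hypothetical login page
    driver.find_element(By.NAME, "username").send_keys("user")
    # A human can type the captcha into the live browser window;
    # the script simply waits before continuing.
    input("Solve the captcha in the browser, then press Enter...")
    driver.find_element(By.NAME, "submit").click()
    print(driver.page_source[:200])
    driver.quit()
    ```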

    Cola. A distributed crawler framework. The overall design of the project is somewhat poor and the coupling between modules is high, but it is worth studying.

    Here is some of my practical experience:

    For simple requirements, such as information with fixed patterns, almost any approach will work.

    For more complex requirements, such as crawling dynamic pages, handling state transitions, dealing with anti-crawler mechanisms, or achieving high concurrency, it is hard to find a library that meets the needs; much of it you have to write yourself.
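
    As one example of that "write it yourself" layer, here is a sketch of a hand-rolled fetch helper on top of requests, with a browser-like header, throttling, and retries; the header value and retry policy are assumptions for illustration.

    ```python
    import time

    import requests

    HEADERS = {"User-Agent": "Mozilla/5.0 (example crawler)"}  # assumed UA

    def fetch(url, retries=3, delay=2.0):
        """Fetch a URL with custom headers, throttling, and simple retries."""
        session = requests.Session()
        session.headers.update(HEADERS)
        for attempt in range(retries):
            try:
                resp = session.get(url, timeout=10)
                resp.raise_for_status()
                return resp.text
            except requests.RequestException:
                time.sleep(delay * (attempt + 1))  # back off, then retry
        raise RuntimeError(f"gave up on {url} after {retries} attempts")
    ```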

    As for what the asker mentioned:

    Also, what are the advantages of using an existing Python crawler framework over using the built-in libraries directly? Writing crawlers in Python itself is already very simple.

    Third-party libraries can do things the built-in libraries can't do, or can't do easily; that's all. Also, crawling is not inherently simple; it depends entirely on the requirements and has nothing to do with Python.
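
    To illustrate, the same GET request with the built-in urllib versus the third-party requests library; both sketches fetch example.com and are functionally equivalent.

    ```python
    # Built-in urllib: works, but is more verbose for everyday cases.
    from urllib.request import Request, urlopen

    req = Request("https://example.com", headers={"User-Agent": "demo"})
    with urlopen(req) as resp:
        html = resp.read().decode("utf-8")

    # Third-party requests: the same result with less ceremony, plus
    # sessions, connection pooling, and friendlier error handling.
    import requests

    html = requests.get("https://example.com",
                        headers={"User-Agent": "demo"}, timeout=10).text
    ```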

  4. Anonymous user, 2024-02-02

    Scrapy: an application framework written for crawling data and extracting structured data. It can be used in a range of programs for data mining, information processing, or archiving historical data; with this framework you can easily scrape data such as Amazon product listings.

    PySpider: a powerful web crawler system implemented in Python. You can write scripts in its browser interface, schedule functions, and view crawl results in real time; it uses common databases on the backend to store crawl results, and also supports scheduled tasks and task priorities.

    Portia: an open-source visual crawling tool that lets you crawl websites without any programming knowledge; simply annotate the pages you are interested in, and Portia will create a spider to extract data from similar pages.

    Beautiful Soup: a Python library that extracts data from HTML or XML files. It can save you hours or even days of work by letting you navigate, search, and modify documents through your favorite parser.
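
    A small sketch of that navigate-search-modify workflow; the HTML snippet is made up for illustration.

    ```python
    from bs4 import BeautifulSoup

    html = """
    <html><body>
      <h1>Frameworks</h1>
      <ul>
        <li><a href="https://scrapy.org">Scrapy</a></li>
        <li><a href="https://github.com/binux/pyspider">PySpider</a></li>
      </ul>
    </body></html>
    """

    soup = BeautifulSoup(html, "html.parser")  # pick any installed parser
    print(soup.h1.string)                      # navigate: "Frameworks"
    for link in soup.find_all("a"):            # search: every <a> tag
        print(link["href"], link.get_text())
    ```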

  5. Anonymous user, 2024-02-01

    1. Scrapy

    Scrapy is a relatively mature Python crawler framework: a fast, high-level information crawling framework developed in Python that can efficiently crawl web pages and extract structured data.

    Scrapy has a wide range of applications, such as crawler development, data mining, data monitoring, and automated testing.

    2. Pyspider

    Pyspider is a powerful web crawler framework written in Python by a Chinese developer. Its main features are as follows (a handler sketch follows the list):

    1. A powerful WebUI, including a script editor, task monitor, project manager, and result viewer;

    2. Support for multiple databases, including MySQL, MongoDB, Redis, SQLite, Elasticsearch, and PostgreSQL (via SQLAlchemy);

    3. Support for RabbitMQ, Beanstalk, Redis, and Kombu as message queues;

    4. Support for task priorities, scheduled tasks, retry on failure, and more;

    5. Support for distributed crawlers.
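
    A handler in the style of pyspider's quickstart example, meant to be pasted into the WebUI script editor; the seed URL is an assumption.

    ```python
    from pyspider.libs.base_handler import *


    class Handler(BaseHandler):
        crawl_config = {}

        @every(minutes=24 * 60)          # scheduled task: re-run once a day
        def on_start(self):
            self.crawl("https://scrapy.org/", callback=self.index_page)

        @config(age=10 * 24 * 60 * 60)   # treat a page as fresh for 10 days
        def index_page(self, response):
            for each in response.doc('a[href^="http"]').items():
                self.crawl(each.attr.href, callback=self.detail_page)

        def detail_page(self, response):
            return {"url": response.url, "title": response.doc("title").text()}
    ```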

    3. Crawley

    Crawley crawls website content at high speed, supports relational and non-relational databases, and can export data to JSON, XML, and other formats.
