How to deduplicate data crawled with Python? Let's talk about the specific algorithms

Technology · Updated 2024-03-11
3 answers
  1. Anonymous user, 2024-02-06

    It depends on the specific problem; look at what data you are scraping.

    A good approach is to find a field whose value can distinguish duplicates. For example, on a Q&A site such as Zhihu, every question has a corresponding ID; the ID of the original poster's question here is 181730605611341844. During crawling, keep every question ID that has already been crawled in a set(); if the ID of the question about to be crawled is already in the set, skip it, otherwise continue, as in the sketch below.
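
    A minimal sketch of this approach, using an in-memory set() and treating the question IDs as strings (the name should_crawl and the second ID are illustrative):

```python
# Keep every question ID that has already been crawled in a set
seen_ids = set()

def should_crawl(question_id):
    """Return True and remember the ID if it has not been seen before."""
    if question_id in seen_ids:
        return False          # already crawled, skip it
    seen_ids.add(question_id)
    return True

for qid in ["181730605611341844", "181730605611341844", "999999999999999999"]:
    if should_crawl(qid):
        print("crawl question", qid)
    else:
        print("skip duplicate", qid)
```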

    I don't know which database the original poster uses, but in the database design you can also add constraints (for example, a unique key) to guarantee the uniqueness of the data.
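
    For instance, with SQLite a PRIMARY KEY constraint plus INSERT OR IGNORE keeps duplicate rows out of the table; this is only a sketch assuming a simple questions table, and other databases have equivalent constructs:

```python
import sqlite3

conn = sqlite3.connect("crawl.db")
conn.execute("""
    CREATE TABLE IF NOT EXISTS questions (
        question_id TEXT PRIMARY KEY,   -- uniqueness enforced by the database
        title       TEXT
    )
""")
# INSERT OR IGNORE silently drops rows whose question_id already exists
conn.execute(
    "INSERT OR IGNORE INTO questions (question_id, title) VALUES (?, ?)",
    ("181730605611341844", "example question"),
)
conn.commit()
conn.close()
```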

  2. Anonymous user, 2024-02-05

    In Python, there are several common ways to deduplicate crawled data:

    1. Use a set for deduplication:

    Store the crawled data in a set, which automatically removes duplicate elements. This can be done as follows:

```python
data = [1, 2, 3, 4, 1, 2, 5]
unique_data = set(data)   # {1, 2, 3, 4, 5}
```

    2. Use a dictionary for deduplication: use the crawled items as dictionary keys; the values can be anything. Since dictionary keys are unique, duplicate data is removed automatically.

    This can be done as follows:

```python
data = [1, 2, 3, 4, 1, 2, 5]
unique_data = {}.fromkeys(data).keys()   # dict_keys([1, 2, 3, 4, 5])
```

    3. Use a list comprehension for deduplication: iterate over the crawled data with a list comprehension and keep only items that have not appeared earlier. This can be done as follows:

```python
data = [1, 2, 3, 4, 1, 2, 5]
unique_data = [x for i, x in enumerate(data) if x not in data[:i]]   # [1, 2, 3, 4, 5]
```

    Which of these methods to use depends on your specific needs and the size of the data. Octopus Collector is a full-featured, easy-to-use Internet data collector with broad coverage; it can help users quickly obtain the data they need, and it provides rich tutorials and help documents for solving various data-collection problems.

  3. Anonymous user, 2024-02-04

    When a site's data is updated, incremental crawling is needed, usually in one of the following situations:

    Case 1: hash the content of the particular page, after first removing the parts that change dynamically (for example, some pages contain a captcha or a date). The program runs on a schedule, and at the start of each run it checks whether the page's hash has changed since the last crawl; only if it has changed does crawling start, as in the sketch below.
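
    A rough sketch of this idea, assuming the requests library is available and using a crude rule (stripping runs of digits) to drop dynamic parts such as dates; the function name is illustrative:

```python
import hashlib
import re

import requests

def page_fingerprint(url):
    """Fetch a page and hash its content after stripping parts that change
    between requests (here, runs of digits such as dates -- a crude rule)."""
    html = requests.get(url, timeout=10).text
    stable = re.sub(r"\d+", "", html)
    return hashlib.sha1(stable.encode("utf-8")).hexdigest()

# Compare against the fingerprint saved by the previous scheduled run:
# if page_fingerprint(url) != last_saved_hash:
#     crawl(url)
```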

    Case 2: hash the content of the entry page and store the hash values of the page URLs. If the hash of the entry page changes, fetch the new list of page URLs. This requires URL deduplication, which is similar to data deduplication, and a Redis set type can be used for it.

    A Redis set does not allow duplicate members to be added: adding a duplicate returns 0 and the add fails. We store the whole URL list in a Redis set, and when the entry page changes, we deduplicate the page URLs and crawl only the newly added pages.
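
    A small sketch of URL deduplication with a Redis set, assuming a local Redis server and the redis-py client; the key name crawled:urls is an assumption:

```python
import redis

r = redis.Redis(host="localhost", port=6379, db=0)

def is_new_url(url):
    """sadd returns 1 when the URL was not yet in the set, 0 when it was."""
    return r.sadd("crawled:urls", url) == 1

for url in ["http://example.com/page/1", "http://example.com/page/1"]:
    if is_new_url(url):
        print("crawl", url)
    else:
        print("skip duplicate", url)
```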

    There are two common methods for deduplicating results: a Bloom filter and a Redis set.

    Of the two, the Bloom filter works by writing to a file, so multiple processes need synchronization and mutual exclusion, which is cumbersome; it is not recommended when using multiple threads or processes. Writing the file is also a disk I/O operation that takes time, so writes can be accumulated and flushed in batches, or a context manager can be used to write once when the program ends or exits abnormally.
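
    For illustration, a minimal in-memory Bloom filter; this sketch omits the file persistence and locking described above, and the class name is an assumption:

```python
import hashlib

class SimpleBloomFilter:
    """Minimal Bloom filter backed by a bytearray (illustrative only)."""

    def __init__(self, size_bits=1 << 20, num_hashes=5):
        self.size = size_bits
        self.num_hashes = num_hashes
        self.bits = bytearray(size_bits // 8 + 1)

    def _positions(self, item):
        # Derive several bit positions per item by salting the hash
        for i in range(self.num_hashes):
            digest = hashlib.md5(f"{i}:{item}".encode("utf-8")).hexdigest()
            yield int(digest, 16) % self.size

    def add(self, item):
        for pos in self._positions(item):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def __contains__(self, item):
        return all(self.bits[pos // 8] & (1 << (pos % 8))
                   for pos in self._positions(item))

bf = SimpleBloomFilter()
bf.add("http://example.com/page/1")
print("http://example.com/page/1" in bf)   # True
print("http://example.com/page/2" in bf)   # False (almost certainly)
```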

    A context manager executes its __enter__ method before the body of the with block runs, and its __exit__ method when the body finishes or exits abnormally; a context manager can also be used to time how long the program runs, as in the sketch below.
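
    A sketch of such a timing context manager (the class name Timer is an assumption):

```python
import time

class Timer:
    """Context manager: __enter__ runs before the block, __exit__ runs
    when the block finishes normally or exits via an exception."""

    def __enter__(self):
        self.start = time.time()
        return self

    def __exit__(self, exc_type, exc_value, traceback):
        print(f"Elapsed: {time.time() - self.start:.2f}s")
        return False   # do not suppress exceptions

with Timer():
    time.sleep(0.5)    # stand-in for the crawling work
```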

    Deduplication with a Redis set, by contrast, supports both multi-threading and multi-processing.

    Taking advantage of this property of Redis sets, create a set in Redis and add the SHA1 value of each piece of data to it: if the add succeeds it returns 1, meaning there is no duplicate; if the add fails it returns 0, meaning the data is already in the set.

    Usage steps: 1. Create a Redis connection pool. 2. Check for duplicates.

    The example below sketches such an interface.
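
    A minimal sketch of such an interface, assuming a local Redis server and the redis-py client; the class and key names are assumptions:

```python
import hashlib

import redis

class RedisDedup:
    """Step 1: create a Redis connection pool. Step 2: check duplicates by
    adding the SHA1 of each record to a Redis set."""

    def __init__(self, key="dedup:sha1", host="localhost", port=6379, db=0):
        pool = redis.ConnectionPool(host=host, port=port, db=db)
        self.client = redis.Redis(connection_pool=pool)
        self.key = key

    def is_duplicate(self, record: str) -> bool:
        sha1 = hashlib.sha1(record.encode("utf-8")).hexdigest()
        # sadd returns 1 if the value is new, 0 if it already exists
        return self.client.sadd(self.key, sha1) == 0

dedup = RedisDedup()
print(dedup.is_duplicate("some crawled item"))   # False on first sight
print(dedup.is_duplicate("some crawled item"))   # True the second time
```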
