-
The Train Collector (LocoySpider) is a copying tool: once you have written the corresponding rules, you can quickly collect a large amount of other people's content in a short time and then publish it to your own website. To put it bluntly, it is stealing.
-
Summary: As most people understand it, it is not illegal to use a train collector to gather train information, because train information is public and does not involve personal privacy. Note, however, that laws and regulations differ between countries and regions; if some regulations explicitly prohibit the use of such collectors, you need to follow the local rules to avoid breaking the law.
At the same time, when using a collector near a railway line or station, pay attention to your own safety and avoid dangerous incidents.
-
The Train Collector gathers information in two steps: first collect the URLs, then collect the content. Once the software has the URLs it can visit each page, but a page contains far more information than you want and the software does not know which parts to take, so in the content step you write rules that tell it exactly what to extract.
1. Collecting URLs.
The product information on the web pages is exactly what we want to collect; in other words, the product pages are the target.
Then click the Test button to check that the information you filled in is correct:
After the test passes, we expand the address range. So far we have only taken one list page of article URLs, and there are other list pages to collect; they are reachable through the pagination. We observe the link format of these pages, find the pattern, and then fill in the URL rule in batches, as sketched below.
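For intuition, here is a minimal Python sketch of what the batch URL rule does; the domain and list-page pattern are hypothetical, invented for illustration.

```python
# Minimal sketch of batch URL generation, mimicking the collector's
# batch URL rule. The domain and page pattern are hypothetical.
BASE = "http://example.com/category/list_{n}.html"

def expand_list_pages(first: int, last: int) -> list[str]:
    # Substitute the page-number variable into the pattern, the way
    # the batch rule fills in a numeric parameter.
    return [BASE.format(n=n) for n in range(first, last + 1)]

if __name__ == "__main__":
    for url in expand_list_pages(1, 5):
        print(url)  # list_1.html ... list_5.html
```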
2. Collecting content.
After the processing above, the links to the target product pages can now be collected, so we move on to collecting the content.
After deciding what content to collect, we start writing the collection rules. The Train Collector extracts content from the source code of the collected web page, so we open the source of the product page and find the location of the information we want to collect. For example, the description field:
Find the location of the description. Once it is found, filling in the collection rule is simple: just fill in the start string and end string surrounding the collection target in the corresponding fields of the rule. Here we choose the text around the description:
the text immediately before the target is the start string, and the text immediately after it is the end string. It is important to note that the start string must be unique on this page and also present on the other product pages: unique on this page so the software can find the location to be collected, and common to the other pages so the software can collect data from them as well.
Filling the rule in does not mean it will collect correctly; it still needs to be tested, and useless data needs to be excluded, which can be done with HTML-tag exclusion and content exclusion. After the test succeeds, the label is done.
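As a rough illustration (not the collector's actual code), a start-string/end-string rule plus HTML-tag exclusion behaves something like this Python sketch; the page snippet and strings are made up.

```python
import re

def extract_between(html: str, start: str, end: str) -> str | None:
    # Return the text between `start` and the next `end`,
    # like a label's start-string/end-string rule.
    i = html.find(start)
    if i == -1:
        return None
    i += len(start)
    j = html.find(end, i)
    return None if j == -1 else html[i:j]

def strip_tags(fragment: str) -> str:
    # Crude stand-in for the collector's HTML-tag exclusion filter.
    return re.sub(r"<[^>]+>", "", fragment).strip()

page = '<meta name="description" content="A sample product">'
raw = extract_between(page, 'content="', '">')
print(strip_tags(raw or ""))  # -> A sample product
```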
Where the surrounding text is not common to all pages, we use wildcards: the (*) wildcard stands for arbitrary text, and the address to be collected is represented by parameters (variables).
Finally, we change this passage to (*)compare prices(*)product details, fill it into the module, and test whether it succeeds.
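One way to read the (*) wildcard is "match anything here". A small sketch, assuming it behaves like a non-greedy regex wildcard:

```python
import re

def wildcard_to_regex(rule: str) -> re.Pattern:
    # Escape the literal parts of the rule and turn each (*)
    # into a non-greedy "match anything" regex.
    parts = [re.escape(p) for p in rule.split("(*)")]
    return re.compile(".*?".join(parts), re.S)

pattern = wildcard_to_regex("(*)compare prices(*)product details")
html = "<div>compare prices</div><div>product details</div>"
print(bool(pattern.search(html)))  # -> True
```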
If the test fails, it means what you filled in does not meet the uniqueness and commonality criteria, and you need to debug it. After the test succeeds, you can save the module and move on to creating the labels.
Making labels here is the same as above: find the location of the information you want to collect, fill in the start and end strings, and set up the filtering. The only difference is that you have to select the module you just made in the "page it belongs to" option. I won't repeat the steps here; the results are shown directly.
This completes the labels. Click Update, untick the publish option, and you can run the collection task.
-
Thank you, elife sister. Your Train Collector skills seem very good; I've seen a lot of collection posts with your replies. Which version are you using? I'm using the version :$
-
It turns out that closing that X brings up the built-in browser.
-
Everyone, what should I do if my Train Collector keeps reporting "wrong format"? :'(
-
Obtaining the software:
Search for "Train Collector" in a search engine and go to the corresponding official site to get the download address for the latest version of the program. Of course, you can also quietly grab the latest version from the network-disk address provided:
Install and run the "Train Collector" program, and in the pop-up login screen click the "Login" button directly to log in to the free version.
In the program's main interface, click the "New" drop-down arrow and select the "Task" item.
In the pop-up window, enter the task name and click the "Add" button to the right of the starting-URL column.
The next, extremely important step is to analyze the URLs of the site to be collected: examine the URL of each article on the target site, find the pattern, and fill it in as shown in the figure.
Then switch to the "Step 2: Capture Content Rules" tab, where we need to analyze the page content. This example uses Sogou Browser: right-click the web page you want to analyze and select "Inspect Element" from the pop-up menu.
In the "Development Mode" interface, click the "Select an element in the page to perspective" button, and then click the "Title" content, then the label corresponding to the title will be displayed in the "Developer" window, in this case "h2"."。
Next, in the "Capture Content Rules" interface, click the "Add" button to add a "Title" item, or double-click the existing "Title" item to modify it. In the pop-up interface, select "capture before and after" and set the leading and trailing strings respectively.
Use the same method to add rules for the other content to be captured.
Finally, check the content to be collected in the task list and click the "Start" button; the collector will gather the web content from the target site according to the rules. The whole flow is sketched below.
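Putting the two steps together, here is a minimal standard-library Python sketch of what the configured task does end to end: expand the list-page URLs, pull out the article links, then cut the title out of each page around the h2 tag. The URL pattern and markup are hypothetical.

```python
import re
import urllib.request

LIST_PATTERN = "http://example.com/news/list_{n}.html"  # hypothetical

def fetch(url: str) -> str:
    # Download a page, tolerating odd encodings.
    with urllib.request.urlopen(url, timeout=10) as resp:
        return resp.read().decode("utf-8", errors="replace")

def article_links(list_html: str) -> list[str]:
    # Step 1 analogue: collect article URLs from a list page.
    return re.findall(r'href="(http://example\.com/news/\d+\.html)"', list_html)

def title_of(article_html: str) -> str | None:
    # Step 2 analogue: capture before and after the h2 tag.
    m = re.search(r"<h2[^>]*>(.*?)</h2>", article_html, re.S)
    return m.group(1).strip() if m else None

if __name__ == "__main__":
    for n in range(1, 3):                       # two list pages
        for link in article_links(fetch(LIST_PATTERN.format(n=n))):
            print(link, "->", title_of(fetch(link)))
```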
-
Let me talk about how I do collection. There are two main approaches on my side. The first: for a regular site whose content is very complete, find a collection source first and then crawl the whole site's data. Note that if the source site is large, this is very time-consuming. Counting the Train Collector's ten processes, each process can open ten threads, so one collector can run at most 100 threads. The average collection time per chapter is about one second (including the average time spent collecting lists). For a site with 100,000 books, roughly 50 million-plus chapters, the data collection takes about a week, and that is when your server configuration is relatively good. Then there is publishing, and publishing cannot be multi-threaded, so the time has to double at least; in total it comes to more than two months. This is also why some people say Train Collector collection is slow.
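The time estimate in that paragraph can be checked with a quick back-of-the-envelope calculation; this sketch just reuses the figures from the post above (100 threads, about one second per chapter, about 50 million chapters):

```python
# Rough check of the collection-time estimate from the post.
threads = 10 * 10            # 10 processes x 10 threads
chapters = 50_000_000        # ~100,000 books x ~500 chapters
seconds_per_chapter = 1.0    # average, including list collection

collect_seconds = chapters * seconds_per_chapter / threads
print(f"collection: ~{collect_seconds / 86_400:.1f} days")  # ~5.8 days
# Publishing is single-threaded per the post, so it takes far longer.
```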
Once the old content has been collected, you then collect the daily updates, in the same way as the second approach.
The second approach is to collect only the daily updates directly and not collect the older books. In this case the speed is much faster and the data is usable right away. The Train Collector's scheduled task triggers the collection automatically.
This is the traditional way of collecting with the Train Collector.
I studied the Train Collector for a month and found a more suitable way to collect quickly: after testing from multiple angles and with multiple clients, collecting 100,000 books and publishing them took about two days.
The exact time depends somewhat on the server configuration, for example hard-disk read/write speed and network bandwidth (bandwidth can be ignored when the collector runs on the server). Testing on a 2-core, 4 GB US server, 100,000 books took about two days plus a few hours: a few hours to collect and about two days to publish. After that it updates on a schedule every day.
-
What's so laborious about this? With multi-threading, I just left the collector running and went to sleep.
-
6. The first ten chapters of text and the remarks for chapters 33-80 are preserved in the Bian Collection, which Bian Yiwen sold for 10,000 yuan in Shanghai in 2006.
-
Text, images, Flash files, forum attachments, and software-site resources can all be captured in one net. Powerful content-collection and data-import functions let you publish any web data you collect to a remote server or CMS system, or save it as a local file or to an Access, MySQL, or MS SQL Server database. No matter what system you run, you can use the Train Collector.
Of course, the program is not just for collecting a few articles. With it, you can automatically track frequently updated information, such as domain-expiration data or the latest news. You can also use it as a forum spam poster or thread-bumping machine to flood a board, provided you have built the publishing module.
You can also think of it as a batch download tool for web pages and files; the program's download function is not inferior to some mainstream download tools. When you use it to send data, you can achieve even more complex functions. LocoySpider is a powerful and easy-to-use professional collection program; its content-collection and data-import functions can publish any web data you collect to a remote server through a custom CMS module. Whatever system you have, you can use the Train Collector. The module files that ship with the system support:
Wind News articles, Dongyi articles, Dynamic Network forums, phpwind forums, Discuz forums, phpcms articles, PHPArticle, LeadBBS articles, Magic forums, DedeCMS articles, XYDW articles, Shocking Cloud articles, and so on. For other CMS modules, you can refer to these and modify them yourself, or go to the official site to discuss making them. You can also use the system's data-export function, together with the system's built-in tags, to export the fields of the collected data into corresponding tables in any local Access, MySQL, or MS SQL Server database.
-
A publishing module, also called a publishing rule, usually means a database publishing module or a web publishing module. A publishing module is the set of settings used in the software when the collected data needs to be published to a destination (e.g., a specified database or website).
These settings can be saved as a file and imported into the collector. A database publishing module file has the suffix .jhc; a web publishing module file has the suffix .cwr. Both collection rules and publishing modules can be exported from the collector, or imported into it for use. The collection rule is responsible for scraping the data from the web pages, and the publishing module is responsible for publishing the collected data to the website. It follows that writing and modifying collection rules depends on the site being collected, while writing and modifying a publishing module depends on the site where the data will be published.
For example, collecting data from different columns of various sites and publishing it to one section (channel) of the same site requires multiple collection rules and one publishing module; collecting data from one column and publishing it to different systems requires one collection rule and multiple publishing modules. Note that "collection rules" here means the URL-collection and content-scraping settings. The mapping is sketched below.
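A tiny Python sketch of that rule-to-module cardinality, with hypothetical task and module names:

```python
# Case 1: several collection rules publish through one module
# (different source columns -> one section of one site).
many_rules_one_module = {
    "rules": ["news-column", "tech-column", "sports-column"],
    "publish_modules": ["my-site-channel"],
}

# Case 2: one collection rule publishes through several modules
# (one source column -> several target systems).
one_rule_many_modules = {
    "rules": ["news-column"],
    "publish_modules": ["discuz-forum", "dedecms-site", "local-csv"],
}

print(many_rules_one_module, one_rule_many_modules, sep="\n")
```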
Publishing data means delivering the collected data to a specified destination; the Train Collector supports four publishing methods.
Method 1: Web online publishing to the website. This method is similar to manually adding data in the site's admin backend: the collector submits the data to the site's backend program, which processes it, and usually the backend program stores the data in the database.
Method 2: Save as a local file. The collected data is saved to a local file; the collector supports the TXT, CSV, and HTML formats.
Method 3: Import into a custom database. The collected data is exported from the software's built-in database into another database over a direct connection; the collector can connect to MySQL, Access, Oracle, and MS SQL Server databases.
Method 4: Save as a local SQL file (INSERT statements). The collected data is exported and saved as INSERT statements, which can later be run in a database management tool to insert the data. So the collector can not only collect and publish data in one go, it can also publish previously collected data later.
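As a rough illustration of methods 2 and 4 (not the collector's own output format), here is a Python sketch that writes collected rows to a CSV file and also emits them as INSERT statements; the table name, columns, and rows are hypothetical.

```python
import csv

# Hypothetical collected rows: (title, content).
rows = [
    ("First article", "Body text one"),
    ("Second article", "Body text two"),
]

# Method 2 analogue: save as a local file (CSV format).
with open("collected.csv", "w", newline="", encoding="utf-8") as f:
    writer = csv.writer(f)
    writer.writerow(["title", "content"])
    writer.writerows(rows)

# Method 4 analogue: save as a local SQL file (INSERT statements).
with open("collected.sql", "w", encoding="utf-8") as f:
    for title, content in rows:
        t, c = title.replace("'", "''"), content.replace("'", "''")
        f.write(f"INSERT INTO articles (title, content) VALUES ('{t}', '{c}');\n")
```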
Batch replacement is supported: batch processing can be done through SQL statements or in the text boxes.
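And a tiny sketch of batch replacement through an SQL statement, using Python's built-in sqlite3 and SQLite's REPLACE() function as a stand-in for the collector's SQL batch processing; the table and data are hypothetical.

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE articles (content TEXT)")
con.execute("INSERT INTO articles VALUES ('visit old-site.com for more')")

# One SQL statement rewrites every matching row in the table.
con.execute(
    "UPDATE articles SET content = REPLACE(content, ?, ?)",
    ("old-site.com", "new-site.com"),
)
print(con.execute("SELECT content FROM articles").fetchone()[0])
# -> visit new-site.com for more
```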