-
Traditional statistical methods for data mining include regression analysis, principal component analysis, and cluster analysis.
Other commonly used methods, beyond traditional statistics, include fuzzy sets, rough sets, and support vector machines.
Data mining is the process of algorithmically searching large amounts of data for hidden information. It is usually associated with computer science and draws on many methods, such as statistics, online analytical processing, information retrieval, machine learning, expert systems, and pattern recognition. Today people want to analyze massive amounts of data in depth and discover and extract the information hidden in it so they can make better use of it, and it is precisely this need that gave rise to data mining technology.
There are many legitimate uses for data mining, such as finding the relationship between a drug and its side effects in a database of patients. Such a relationship may appear in only a handful of cases out of 1,000 people, but pharmacology-related programs can use this method to reduce the number of patients who suffer adverse drug reactions and potentially save lives.
For studying data mining, the relevant CDA data engineer courses are worth considering: they cover both the horizontal skills needed to manage the data mining process and the vertical skills needed to master data mining algorithms. Students are expected to start from the root causes of data governance, explore business problems through digital working methods, and choose business process optimization tools or algorithmic tools via proximate-cause analysis and macro root-cause analysis, rather than simply "tuning an algorithm package whenever a problem appears".
-
There are several statistical methods commonly used in data mining:
Traditional statistical methods include regression analysis, principal component analysis, cluster analysis, and so on.
Methods beyond traditional statistics: fuzzy sets, rough sets, support vector machines.
-
Neural network approach
Neural networks have attracted more and more attention in recent years because of their robustness, self-organization and adaptability, parallel processing, distributed storage, and high fault tolerance, properties that make them well suited to solving data mining problems.
Genetic algorithms
Genetic algorithms are search and optimization methods that imitate biological evolution (selection, crossover, and mutation); they are also used in data mining, for example to search for good rule sets or feature subsets.
Decision tree method
Decision trees are commonly used in predictive models: they classify large amounts of data in a goal-directed way to find valuable, hidden information. Their main advantages are a simple description and fast classification, which makes them particularly suitable for large-scale data processing.
Rough set method
Rough set theory is a mathematical tool for studying imprecise and uncertain knowledge. The rough set approach has several advantages: it requires no additional information, it simplifies the representation space of the input information, and its algorithms are simple and easy to implement. The object a rough set method processes is an information table, similar to a two-dimensional relational table.
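As a minimal sketch of the idea above (not from the original answer), the following computes the lower and upper approximations of a target set from a small information table; the table, attribute names, and target set are invented for illustration.

```python
from collections import defaultdict

# A toy information table: each row describes an object by attribute values.
# The attributes, values, and target concept below are hypothetical examples.
table = {
    "o1": {"color": "red",  "size": "big"},
    "o2": {"color": "red",  "size": "big"},
    "o3": {"color": "blue", "size": "small"},
    "o4": {"color": "red",  "size": "small"},
}
target = {"o1", "o4"}          # the concept we want to approximate
attrs = ["color", "size"]      # attributes defining indiscernibility

# Group objects that are indiscernible (identical on the chosen attributes).
blocks = defaultdict(set)
for obj, row in table.items():
    blocks[tuple(row[a] for a in attrs)].add(obj)

lower, upper = set(), set()
for block in blocks.values():
    if block <= target:        # block entirely inside the concept
        lower |= block
    if block & target:         # block overlaps the concept
        upper |= block

print("lower approximation:", lower)   # certainly in the concept
print("upper approximation:", upper)   # possibly in the concept
```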
Covering positive examples and excluding negative examples
The idea is to find rules by covering all positive examples while excluding all negative examples. First, pick any seed from the set of positive examples and compare its selectors (attribute-value pairs) with the negative examples one by one: if a selector is compatible with the negative example's value for that field, it is discarded; if it is not, it is retained.
Working through all the positive seeds in this way yields a rule for the positive class (a conjunction of the retained selectors).
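Below is a minimal, simplified sketch of that procedure, not the exact algorithm from the answer: starting from a seed positive example, it keeps only the selectors that are incompatible with every negative example, and their conjunction forms a candidate rule. The example data is made up.

```python
# Toy data: each example is a dict of attribute -> value.
positives = [
    {"outlook": "sunny", "humidity": "normal", "wind": "weak"},
    {"outlook": "overcast", "humidity": "high", "wind": "weak"},
]
negatives = [
    {"outlook": "sunny", "humidity": "high", "wind": "strong"},
    {"outlook": "rainy", "humidity": "high", "wind": "strong"},
]

def rule_from_seed(seed, negatives):
    """Keep selectors (attribute = value) from the seed that are incompatible
    with every negative example; their conjunction is the candidate rule."""
    retained = {}
    for attr, value in seed.items():
        if all(neg.get(attr) != value for neg in negatives):
            retained[attr] = value   # incompatible with every negative: keep
        # otherwise the selector matches some negative: discard it
    return retained

for seed in positives:
    rule = rule_from_seed(seed, negatives)
    print("seed:", seed)
    print("rule:", " AND ".join(f"{a} = {v}" for a, v in rule.items()) or "(empty)")
```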
Methods of statistical analysis
There are two kinds of relationships between database fields: functional relationships and correlation relationships. Both can be analyzed with statistical methods, that is, by applying statistical principles to the information in the database, including descriptive statistics, regression analysis, correlation analysis, difference analysis, and so on.
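As a small illustration of the correlation and regression analyses mentioned above (the data below is synthetic and purely for the example), the following computes a correlation coefficient and fits a simple linear regression with NumPy.

```python
import numpy as np

# Synthetic data: advertising spend (x) and sales (y); the values are made up.
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
y = np.array([2.1, 4.3, 6.2, 8.1, 9.8, 12.2])

# Correlation analysis: Pearson correlation coefficient between x and y.
r = np.corrcoef(x, y)[0, 1]

# Regression analysis: least-squares fit of y = a * x + b.
a, b = np.polyfit(x, y, deg=1)

print(f"correlation r = {r:.3f}")
print(f"regression: y = {a:.2f} * x + {b:.2f}")
```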
Fuzzy set method
That is, fuzzy set theory is used for fuzzy evaluation, fuzzy decision-making, fuzzy pattern recognition, and fuzzy cluster analysis of practical problems. The more complex a system is, the fuzzier it tends to be, and fuzzy set theory uses degrees of membership to describe the "both this and that" nature of fuzzy things, rather than forcing an either/or judgment.
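A minimal sketch of the membership idea (the membership function and thresholds below are invented for illustration): instead of a crisp yes/no, each value gets a degree of membership between 0 and 1.

```python
def membership_tall(height_cm: float) -> float:
    """Hypothetical membership function for the fuzzy set 'tall'.
    Below 160 cm membership is 0, above 190 cm it is 1, linear in between."""
    if height_cm <= 160:
        return 0.0
    if height_cm >= 190:
        return 1.0
    return (height_cm - 160) / (190 - 160)

for h in (155, 170, 180, 195):
    print(f"height {h} cm -> membership in 'tall' = {membership_tall(h):.2f}")
```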
-
The main methods of data mining are as follows:
1. Classification mining methods. Classification mining mainly uses decision trees; it is an efficient mining approach that occupies an important position among data mining methods.
To test and classify data more accurately, decision tree algorithms are used. Typical methods include the ID3 algorithm, which is practical and well suited to large-scale data processing, and the KNN algorithm, which is computationally heavier but works on many kinds of data (a sketch illustrating this method and the clustering method below appears after this list).
2. Cluster analysis mining methods. Cluster analysis is mainly used for grouping samples and indicators; it is a typical statistical method and is widely used in business.
Depending on the object of application, clustering methods can be divided into four families: grid-based clustering, hierarchy-based clustering, density-based clustering, and model-based clustering.
3. Prediction methods. These are mainly used for predicting outcomes and for mining continuous numerical data. The traditional approaches are mainly divided into:
time series methods, regression model analysis, and grey system model analysis. At present, prediction mainly uses neural network and support vector machine algorithms to analyze the data and, at the same time, predict the trend of future data.
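As a hedged sketch of methods 1 and 2 above (referenced from the list): a decision tree classifier and a k-means clustering run on a small synthetic dataset, assuming scikit-learn is installed; the dataset and parameters are chosen only for illustration.

```python
from sklearn.datasets import make_blobs
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.cluster import KMeans
from sklearn.metrics import accuracy_score

# Synthetic 2-D data with three groups (purely illustrative).
X, y = make_blobs(n_samples=300, centers=3, random_state=0)

# 1. Classification mining: fit a decision tree on labeled data.
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
tree = DecisionTreeClassifier(max_depth=3, random_state=0).fit(X_train, y_train)
print("decision tree accuracy:", accuracy_score(y_test, tree.predict(X_test)))

# 2. Cluster analysis mining: group the same points without using the labels.
kmeans = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X)
print("k-means cluster sizes:", [int((kmeans.labels_ == k).sum()) for k in range(3)])
```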
-
Statistical data are data that describe the characteristics, scale, structure, level, and other indicators of natural and economic elements within a given geographic area; they are the basic material for qualitative, locational, and quantitative statistical analysis. Statistical yearbooks are a familiar example. So what are the common methods of collecting statistical data?
1. Census: Census is a one-time comprehensive survey specially organized for a specific purpose, which is used to collect comprehensive information on important national conditions, national strength and resources, and provide a basis for formulating plans, guidelines and policies.
2. Sampling survey: Sampling is the most widely used survey method in practice. It is a non-comprehensive method that randomly selects some units from the survey population as a sample and infers the quantitative characteristics of the whole population from the sample results.
3. Statistical reporting: Statistical reporting is a survey method built on comprehensive investigation. Statistical forms are issued from the top down by the competent departments, in accordance with statistical laws and regulations and through administrative channels, and the data are then compiled and reported from the bottom up by enterprises and institutions to provide basic statistical information.
4. Key-point survey: A key-point survey is a specially organized, non-comprehensive survey that selects individual or some key units from the population in order to understand the basic situation of the whole.
5. Typical-case survey: A typical-case survey is also a specially organized, non-comprehensive survey. Based on the purpose and requirements of the research, and after an overall analysis of the population, representative typical units are deliberately selected for in-depth, detailed investigation in order to understand the essential characteristics, causal relationships, and patterns of development and change of the subject.
These are the main statistical survey methods. Not every method suits every situation; the choice depends on the matter at hand. I hope this is useful to you!
-
1. Naive Bayes
Naive Bayes (NB) is a generative model (it models the joint probability distribution of features and classes), and the computation is very simple: essentially a set of counts. NB makes a conditional independence assumption, i.e., the features are distributed independently of one another given the class. Because of this, a naive Bayes classifier converges faster than a discriminative model such as logistic regression, so it needs less training data.
Even when the conditional independence assumption does not hold, the NB classifier still performs well in practice. Its main disadvantage is that it cannot learn interactions between features; in terms of the R in mRMR (minimum Redundancy Maximum Relevance), this is feature redundancy.
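A minimal scikit-learn sketch of a naive Bayes classifier (assuming scikit-learn is available; the dataset is just an illustrative built-in one):

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB

# Gaussian NB: per-class means and variances are all that is estimated.
X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

nb = GaussianNB().fit(X_train, y_train)
print("naive Bayes test accuracy:", nb.score(X_test, y_test))
```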
2. Logistic regression
Logistic regression is a classification method and a discriminative model. There are many ways to regularize it (L1, L2, and even L0 penalties), and you do not have to worry about whether features are correlated the way you do with naive Bayes. Compared with decision trees and SVMs, you also get a reasonable probabilistic interpretation, and you can even update the model easily with new data (using an online gradient descent algorithm). It is a good choice when you need a probabilistic framework (for example, to adjust classification thresholds, express uncertainty, or obtain confidence intervals), or when you want to quickly fold more training data into the model later.
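A hedged sketch of the online-update point above, using scikit-learn's SGDClassifier with a logistic loss (the "log_loss" name assumes a recent scikit-learn, roughly 1.1 or later); the data and batch sizes are invented for illustration:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import SGDClassifier

# Synthetic binary classification data, fed to the model in mini-batches
# to mimic new data arriving over time.
X, y = make_classification(n_samples=1000, n_features=10, random_state=0)

clf = SGDClassifier(loss="log_loss", penalty="l2", random_state=0)
classes = np.unique(y)

for start in range(0, len(X), 200):
    X_batch, y_batch = X[start:start + 200], y[start:start + 200]
    clf.partial_fit(X_batch, y_batch, classes=classes)   # online update

print("probability estimates for 3 samples:")
print(clf.predict_proba(X[:3]).round(3))
```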
3. Linear regression
Linear regression is used for regression, unlike logistic regression, which is used for classification. The basic idea is to optimize a least-squares error function, for example with gradient descent.
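A small sketch of least-squares linear regression fitted by gradient descent, as described above (synthetic data, NumPy only; the learning rate and iteration count are arbitrary choices):

```python
import numpy as np

# Synthetic data: y is roughly 3x + 2 plus noise.
rng = np.random.default_rng(0)
x = rng.uniform(0, 10, size=200)
y = 3.0 * x + 2.0 + rng.normal(0, 1, size=200)

w, b = 0.0, 0.0          # slope and intercept
lr = 0.01                # learning rate

for _ in range(2000):
    err = (w * x + b) - y
    # Gradients of the mean squared error with respect to w and b.
    w -= lr * (2 * (err * x).mean())
    b -= lr * (2 * err.mean())

print(f"fitted: y = {w:.2f} * x + {b:.2f}")   # should be close to 3x + 2
```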
4. Nearest Neighbor Algorithm - KNN
KNN is the nearest-neighbor algorithm. Its main steps are: compute the distance between the test sample and every training sample (common distance measures include the Euclidean distance, the Mahalanobis distance, and so on); sort all of these distances; select the k samples with the smallest distances; and vote using the labels of these k samples to obtain the final class. How to choose a good value of k depends on the data.
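A minimal hand-rolled version of exactly those steps (Euclidean distance, majority vote; the toy points are made up):

```python
from collections import Counter
import math

def knn_predict(train_X, train_y, query, k=3):
    """Classify `query` by majority vote among its k nearest training points."""
    # 1. Compute the distance from the query to every training sample.
    dists = [(math.dist(x, query), label) for x, label in zip(train_X, train_y)]
    # 2-3. Sort by distance and keep the k closest samples.
    k_nearest = sorted(dists, key=lambda d: d[0])[:k]
    # 4. Vote with the labels of the k nearest samples.
    votes = Counter(label for _, label in k_nearest)
    return votes.most_common(1)[0][0]

train_X = [(1, 1), (1, 2), (2, 1), (6, 6), (7, 6), (6, 7)]
train_y = ["A", "A", "A", "B", "B", "B"]
print(knn_predict(train_X, train_y, query=(2, 2), k=3))   # expected: "A"
```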
5. Decision tree
One of the most important steps in building a decision tree is choosing which attribute to split on, so pay attention to the formula for information gain and understand it thoroughly.
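A short sketch of the information-gain calculation referred to above (the toy labels and split are invented): information gain is the parent node's entropy minus the weighted entropy of the child nodes.

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy of a list of class labels."""
    total = len(labels)
    return -sum(
        (c / total) * math.log2(c / total) for c in Counter(labels).values()
    )

def information_gain(parent_labels, child_label_groups):
    """Entropy of the parent minus the weighted entropy of the children."""
    total = len(parent_labels)
    weighted = sum(
        len(child) / total * entropy(child) for child in child_label_groups
    )
    return entropy(parent_labels) - weighted

# Toy example: 14 samples split by one attribute into two branches.
parent = ["yes"] * 9 + ["no"] * 5
children = [["yes"] * 6 + ["no"] * 1, ["yes"] * 3 + ["no"] * 4]
print(f"information gain = {information_gain(parent, children):.3f}")
```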
6. SVM (support vector machine)
SVMs achieve high accuracy and come with good theoretical guarantees against overfitting. Even if the data is linearly inseparable in the original feature space, a suitable kernel function makes them work very well, and they are especially popular for text classification problems, which are often extremely high dimensional. The downsides are the memory consumption, the difficulty of interpretation, and the somewhat fiddly training and parameter tuning; random forests avoid exactly these shortcomings and are often more practical.
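A hedged scikit-learn sketch of the kernel point above: a linear SVM struggles on data that is not linearly separable, while an RBF-kernel SVM handles it (the make_circles dataset and parameters are chosen only for illustration).

```python
from sklearn.datasets import make_circles
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

# Concentric circles: not linearly separable in the original 2-D space.
X, y = make_circles(n_samples=400, noise=0.1, factor=0.4, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

linear_svm = SVC(kernel="linear").fit(X_train, y_train)
rbf_svm = SVC(kernel="rbf", C=1.0, gamma="scale").fit(X_train, y_train)

print("linear kernel accuracy:", linear_svm.score(X_test, y_test))
print("RBF kernel accuracy:   ", rbf_svm.score(X_test, y_test))
```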
-
The four basic methods of data mining are as follows. Data mining is a key technology in today's Internet industry: it gives enterprises deeper insight into market trends and can effectively analyze customer behavior in order to reach convincing, decision-ready conclusions.
To put it simply, it is the technical process of identifying and understanding data in order to tap its potential value. Data mining is also a way of discovering deep patterns, unknown structures, outliers, and other useful information hidden in the data. There are four basic methods of data mining: association rule mining, classification analysis, clustering, and anomaly detection.
For association rule mining, the most commonly used algorithms are Apriori and FP-Growth, which find frequent itemsets and derive rules from them, such as "a customer who buys a TV is likely to buy a TV bracket". Classification analysis measures how strongly variables influence one another; it mainly includes regression analysis and decision tree analysis and is used to identify relationships between continuous and categorical attributes, for example "how does a TV affect people's buying behavior?". Clustering is an unsupervised technique, usually handled by algorithms such as k-means, EM, and DBSCAN; its task is to divide large amounts of data into groups in order to reveal unknown hidden structure, for example "TV buyers can be divided into groups with common characteristics".
Anomaly detection identifies abnormal values according to some specific measure on the data; commonly used techniques include density-based clustering and sampling-based detection, and it can help businesses spot unexpected, sudden changes in the data, for example "why TV sales suddenly stopped". In conclusion, data mining is a technology for extracting valuable findings and insights from data; its four basic methods are association rule mining, classification analysis, clustering, and anomaly detection, and it is an important tool for enterprises to explore business opportunities and build competitive advantage. Only by using these basic methods sensibly can enterprises obtain genuinely useful market information and thus gain a competitive edge.
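As a minimal, hedged sketch of the association-rule idea (a hand-rolled frequent-pair count rather than a full Apriori or FP-Growth implementation; the transactions are made up):

```python
from collections import Counter
from itertools import combinations

# Toy transactions: each is the set of items in one shopping basket.
transactions = [
    {"tv", "bracket", "hdmi cable"},
    {"tv", "bracket"},
    {"tv", "hdmi cable"},
    {"laptop", "mouse"},
    {"tv", "bracket", "soundbar"},
]
min_support = 0.4   # a pair must appear in at least 40% of baskets

pair_counts = Counter()
for basket in transactions:
    for pair in combinations(sorted(basket), 2):
        pair_counts[pair] += 1

n = len(transactions)
for pair, count in pair_counts.items():
    support = count / n
    if support >= min_support:
        print(f"{pair[0]} & {pair[1]}: support = {support:.2f}")
```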
-
1. Memory-based reasoning: the main idea is to use known cases to predict some properties of future cases.
2. Market basket analysis.
3. Decision trees: decision trees are strong at both classification and prediction.
4. Genetic algorithms: genetic algorithms learn by simulating the process of evolution.
5. Cluster detection techniques: genetic algorithms, neural networks, and statistical cluster analysis can all be used for this.
6. Link analysis.
8. Artificial neural networks: an iterative learning method in which a series of examples is presented to the network so that it can generalize a pattern sufficient to distinguish them.
9. Discriminant analysis: often applied to classification problems.
10. Logistic regression analysis: a good alternative when the groups in a discriminant analysis do not satisfy the normality assumption.
The CDA data analyst course uses scenario-based teaching to develop students' practical data mining ability: within business scenarios designed by the lecturer, the lecturer keeps raising business questions and the students think through and work out solutions step by step, which helps them build data mining skills that genuinely solve business problems. This teaching style encourages independent thinking and initiative, so that the skills and knowledge students acquire can quickly become abilities they apply flexibly in different scenarios.