What is the k-nearest neighbor algorithm? An introduction to the k-nearest neighbor (KNN) algorithm

Technology | Updated on 2024-04-12
8 answers
  1. Anonymous user, 2024-02-07

    The K-nearest neighbor (KNN) classification algorithm is a theoretically mature method and one of the simplest machine learning algorithms. Its idea is that if most of the k samples most similar to a given sample in the feature space (i.e., its nearest neighbors) belong to a certain category, then the sample also belongs to that category.

    In the KNN algorithm, the selected neighbors are all objects that have already been correctly classified. In the classification decision, the method uses only the category of the nearest sample, or of the few nearest samples, to determine the category of the sample to be classified. Although the KNN method relies on the limit theorem in principle, the category decision involves only a very small number of adjacent samples.

    Since the KNN method relies mainly on a limited number of surrounding neighbors rather than on a discriminant over class domains to determine the category, it is better suited than other methods to sample sets whose class domains cross or overlap heavily.

    The KNN algorithm can be used not only for classification but also for regression. The value of a sample can be obtained by finding its k nearest neighbors and assigning it the average of those neighbors' attribute values. A more useful approach is to give neighbors at different distances different weights on the result, for example making the weight inversely proportional to the distance.
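
    Below is a minimal numpy sketch of this weighted-average idea. The data, the function name knn_regress, and the small constant added to the distances are illustrative assumptions, not part of the original answer.

```python
import numpy as np

def knn_regress(X_train, y_train, x_query, k=3):
    """Predict a value for x_query as the inverse-distance-weighted
    average of its k nearest neighbors (toy illustration only)."""
    dists = np.linalg.norm(np.asarray(X_train, dtype=float) - np.asarray(x_query, dtype=float), axis=1)
    nearest = np.argsort(dists)[:k]
    # Inverse-distance weights; the small constant avoids division by zero
    # when a neighbor coincides exactly with the query point.
    w = 1.0 / (dists[nearest] + 1e-12)
    return float(np.sum(w * np.asarray(y_train, dtype=float)[nearest]) / np.sum(w))

# Made-up 1-D example: predict y at x = 2.5 from four training samples.
X = [[1.0], [2.0], [3.0], [4.0]]
y = [1.1, 1.9, 3.2, 4.1]
print(knn_regress(X, y, [2.5], k=3))
```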

    One of the main drawbacks of this algorithm for classification appears when the samples are unbalanced, for example when one class has a very large sample size and the others are small: when a new sample is entered, the large class may make up the majority of its k neighbors. The algorithm only counts the "closest" neighbor samples, so if a certain class has a large number of samples, then either those samples are not close to the target sample, or they are very close to it; either way, the quantity alone should not decide the result.

    This can be improved by using weights: neighbors at small distances from the sample are given large weights.

    Another disadvantage of this method is that it is computationally intensive: for each sample to be classified, the distance to every known sample must be computed in order to find its k nearest neighbors. A common remedy is to edit the known sample points in advance and remove those that contribute little to classification. The algorithm is better suited to automatic classification of class domains with large sample sizes; class domains with small sample sizes are more easily misclassified by it.

  2. Anonymous user, 2024-02-06

    I've done it before, and now I forgot about it!

    Looks like it's in pattern recognition!

    It should be the closest category.

  3. Anonymous user, 2024-02-05

    Introduction to the K-Nearest Neighbor (KNN) algorithm: for an unknown sample, we can judge its category based on the categories of the k samples closest to it.

    As an example, for an unknown sample green circle, we can select the nearest 3 samples, which contain 2 red triangles and 1 blue square, then we can judge that the green circle belongs to the category of red triangles.

    We can also take the nearest 5 samples, which contain 3 blue squares and 2 red triangles, then we can determine that the green circle belongs to the category of blue squares.
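
    A small sketch that reproduces this picture with sklearn's KNeighborsClassifier. The coordinates below are made up so that the query's 3 nearest neighbors are 2 triangles and 1 square while its 5 nearest are 2 triangles and 3 squares, matching the example; they are not from the original answer.

```python
from sklearn.neighbors import KNeighborsClassifier

# Made-up 2-D coordinates for the toy picture described above.
X = [[0.5, 0.7], [-0.6, 0.5], [2.0, 2.0],                 # red triangles
     [0.9, -0.3], [1.0, 1.0], [-1.0, -0.9], [2.5, -1.5]]  # blue squares
y = ['triangle', 'triangle', 'triangle',
     'square', 'square', 'square', 'square']
green_circle = [[0.0, 0.0]]  # the unknown sample

print(KNeighborsClassifier(n_neighbors=3).fit(X, y).predict(green_circle))  # ['triangle']
print(KNeighborsClassifier(n_neighbors=5).fit(X, y).predict(green_circle))  # ['square']
```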

    Documentation

    Let's explain the parameters in the knn algorithm:

    'n_neighbors': the number of neighbors used as reference, 5 by default. You can also specify the value yourself, but a larger n_neighbors does not necessarily give a better classification result; the best value has to be found by validation.

    'weights': the distance-weighting parameter, 'uniform' by default.

    'uniform': uniform weighting; every point in the neighborhood carries the same weight. To put it simply, every point is equally important.

    'distance': the weight is proportional to the reciprocal of the distance, so points that are closer are more important and have a greater influence on the result.

    'algorithm': the method used to compute the nearest neighbors, 'auto' by default.

    'auto': automatically selects the most appropriate method based on the data passed to fit.

    'ball_tree': the tree-based BallTree algorithm.

    'kd_tree': the tree-based KDTree algorithm.

    'brute': brute-force search.

    'leaf_size': the leaf size, 30 by default. This parameter only matters when algorithm is 'ball_tree' or 'kd_tree'.

    'p': the Minkowski distance parameter; p = 1 selects Manhattan distance, p = 2 selects Euclidean distance.

    'n_jobs': the number of processors used, 1 by default. With n_jobs = -1, all processors are used for the computation.
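
    Putting these together, here is a sketch of constructing the classifier (this is sklearn's KNeighborsClassifier; the defaults shown may differ slightly between sklearn versions):

```python
from sklearn.neighbors import KNeighborsClassifier

# Every parameter discussed above written out explicitly; in practice you
# usually override only n_neighbors and weights.
clf = KNeighborsClassifier(
    n_neighbors=5,       # number of neighbors taking part in the vote
    weights='uniform',   # or 'distance' for inverse-distance weighting
    algorithm='auto',    # or 'ball_tree', 'kd_tree', 'brute'
    leaf_size=30,        # only used by the tree-based algorithms
    p=2,                 # 1 = Manhattan distance, 2 = Euclidean distance
    n_jobs=1,            # -1 uses all available processors
)
```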

    4. Application case demonstration

    The following takes the handwritten digit recognition dataset that comes with the sklearn library as an example to test the knn algorithm. In the previous chapter, we briefly covered the general steps of machine learning: load the dataset - train the model - predict the results - save the model.

    In this chapter, we will follow this step.

    First, load the handwritten digit recognition dataset.
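
    A minimal sketch of those steps on this dataset; the 25% test split, the random seed, and the commented-out joblib save are assumptions for illustration.

```python
from sklearn.datasets import load_digits
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Load the handwritten-digit dataset bundled with sklearn (8x8 grayscale images).
digits = load_digits()
X_train, X_test, y_train, y_test = train_test_split(
    digits.data, digits.target, test_size=0.25, random_state=0)

knn = KNeighborsClassifier(n_neighbors=5)           # build the model
knn.fit(X_train, y_train)                           # train it
print("test accuracy:", knn.score(X_test, y_test))  # evaluate the results

# Saving the model is commonly done with joblib (assumed installed):
# import joblib; joblib.dump(knn, "knn_digits.joblib")
```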

    5. Methods of the model

    Each model has some unique attribute methods (the skills of the model, what it can do), let's take a look at the commonly used attribute methods of the knn algorithm.
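
    A sketch of those methods, assuming the fitted knn model and the X_test, y_test split from the digits example above:

```python
# Assuming `knn`, `X_test`, `y_test` from the digits sketch above.
pred = knn.predict(X_test)              # predicted class labels
proba = knn.predict_proba(X_test)       # per-class vote fractions for each sample
acc = knn.score(X_test, y_test)         # mean accuracy on the test set
dist, idx = knn.kneighbors(X_test[:1])  # distances and indices of the k nearest training points
```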

    Advantages and disadvantages of the algorithm

    Advantages: simple, works reasonably well, and is suitable for multi-class problems.

    Disadvantages: low efficiency, because the distance from the sample to every training point has to be computed and then sorted; the efficiency drops further as the sample size grows.

  4. Anonymous user, 2024-02-04

    The k-nearest neighbor algorithm is an instance-based machine learning algorithm that is mainly used for classification and regression problems. The idea is to find the k known instances that are closest to the new instance and use their labels (for classification problems) or values (for regression problems) to make a prediction. There are three basic elements to consider when using the k-nearest neighbor algorithm:

    1. Distance measurement method.

    The distance metric is the method used to calculate the distance between a new instance and a known instance. Common distance measures include Euclidean distance, Manhattan distance, Chebyshev distance, and so on. When choosing a distance measurement method, you need to choose it based on the characteristics of the specific problem and the attributes of the data.
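
    A small numpy sketch of the three metrics just mentioned, on two made-up feature vectors:

```python
import numpy as np

a = np.array([1.0, 2.0, 3.0])
b = np.array([4.0, 0.0, 3.5])

euclidean = np.sqrt(np.sum((a - b) ** 2))  # straight-line distance
manhattan = np.sum(np.abs(a - b))          # sum of absolute coordinate differences
chebyshev = np.max(np.abs(a - b))          # largest single-coordinate difference
print(euclidean, manhattan, chebyshev)
```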

    2. k value selection.

    The k value refers to how many of the nearest known instances take part in the prediction. Typically, a value that is too small causes the model to overfit, while a value that is too large causes it to underfit. Therefore, the k value needs to be tuned.

    3. Dataset selection and preprocessing.

    The k-nearest neighbor algorithm has high requirements for the quality and quantity of the dataset, so it is necessary to select the appropriate dataset for training and testing. At the same time, before using the k-nearest neighbor algorithm, some preprocessing work needs to be carried out, such as data cleaning, missing value processing, feature selection and dimensionality reduction.
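
    Because the distances in KNN are sensitive to feature scale, standardizing the features is a common preprocessing step. A hedged sketch using sklearn's Pipeline, where X_train, y_train, X_test, y_test are assumed to be an existing split:

```python
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Scale every feature to zero mean / unit variance before the distance
# computation so that features on large scales do not dominate the metric.
model = make_pipeline(StandardScaler(), KNeighborsClassifier(n_neighbors=5))
# model.fit(X_train, y_train)
# print(model.score(X_test, y_test))
```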

    In addition to the above three basic elements, the k-nearest neighbor algorithm also needs to consider other factors, such as choosing a classification or regression problem, using the weighted average method, etc. Considering these factors, we can derive a more complete process of k-nearest neighbor algorithm, which can be better applied to practical problems.

  5. Anonymous user, 2024-02-03

    K-value selection questions, Dr. Li Hang's book Statistical Learning Methods says:

    Roughly speaking, a smaller k reduces the approximation error (the fit to the training data is tighter) but increases the estimation error, so the prediction becomes very sensitive to noisy neighboring points; a larger k has the opposite effect.

    In practical applications, the k value is generally taken to be a relatively small value, and cross-validation (in simple terms, splitting the training data into a training set and a validation set) is used to select the optimal k value.

    For example, divide the data into 4 parts and use one of them as the validation set. Training and testing are then repeated 4 times, each time with a different part as the validation set. The results of the 4 models are averaged to give the final result. This is also known as 4-fold cross-validation.
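
    A sketch of this selection with sklearn's GridSearchCV; the range of k values tried is an arbitrary choice, and X_train, y_train are assumed to be the training data:

```python
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsClassifier

# Try a range of k values with 4-fold cross-validation, as in the example
# above, and keep the value with the best average validation score.
search = GridSearchCV(
    KNeighborsClassifier(),
    param_grid={'n_neighbors': list(range(1, 21))},
    cv=4,
)
# search.fit(X_train, y_train)
# print(search.best_params_, search.best_score_)
```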

    According to knn, for every point we want to predict we need to compute the distance from each point in the training dataset to that point and then select the k closest points for voting. When the dataset is large, the computational cost is very high: for a dataset with n samples and d features the complexity is O(dn²). A kd tree is one way of organizing the training data so that this cost can be reduced.

    There are 2 key questions to ask when building a KD tree:

    1) Which dimension of the vector should be chosen for the split? A dimension can be chosen at random or cycled through in order, but a better approach is to split on the dimension in which the data are most spread out (the degree of dispersion can be measured by the variance).

    2) How to divide the data? A good partitioning method can make the constructed tree more balanced, and the median can be selected each time to divide it.

    Construction example

    Given a two-dimensional spatial dataset T, construct a balanced kd tree.
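
    A minimal recursive builder following the two rules above (split on the most dispersed dimension, split at the median). The node layout and the sample points are made-up illustrations, since the original answer's data were omitted:

```python
import numpy as np

def build_kdtree(points):
    """Recursively build a kd-tree node as a dict with 'point', 'axis', 'left', 'right'."""
    points = np.asarray(points, dtype=float)
    if len(points) == 0:
        return None
    # Rule 1: split on the dimension where the data are most spread out (largest variance).
    axis = int(points.var(axis=0).argmax())
    # Rule 2: split at the median along that dimension so the tree stays balanced.
    points = points[points[:, axis].argsort()]
    mid = len(points) // 2
    return {
        "point": points[mid],
        "axis": axis,
        "left": build_kdtree(points[:mid]),
        "right": build_kdtree(points[mid + 1:]),
    }

# A small made-up two-dimensional dataset.
tree = build_kdtree([[2, 3], [5, 4], [9, 6], [4, 7], [8, 1], [7, 2]])
```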


  6. Anonymous user, 2024-02-02

    The key elements in the k-nearest neighbor algorithm are: the selection of k-value, the measurement of neighbor distance, and the formulation of classification decisions.

    1. K-value selection: the k-nearest neighbor algorithm has obvious advantages (it is simple to use and highly interpretable), but it also has shortcomings. For example, "majority voting" can be flawed when the class distribution is skewed, because samples of the more frequent classes tend to dominate the prediction for the test point. In other words, the selection of the k value is very important.

    2. A measure of neighbor distance:

    Without quantification there is no way to measure distance or proximity. In order to work out which neighbors are "near" and which are "far", the k-nearest neighbor algorithm requires that all the features of a sample can be quantified and compared. If some features of the sample data are non-numerical, a way must be found to quantify them as well.

    For example, colors: different colors (such as red, green, blue) are non-numerical types, and there seems to be no distance between them. However, if you convert a color (a non-numerical type) to a grayscale value (a numerical type in the range 0-255), then you can calculate the distance (or difference) between different colors.
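
    One common way to do this quantification is the ITU-R BT.601 luma formula; the helper name and the example colors below are illustrative assumptions:

```python
def grayscale(rgb):
    """Map an (R, G, B) color to a single 0-255 value using the common
    ITU-R BT.601 luma weights."""
    r, g, b = rgb
    return 0.299 * r + 0.587 * g + 0.114 * b

red, green = (255, 0, 0), (0, 255, 0)
# Once quantified, the "distance" between two colors is just a difference.
print(abs(grayscale(red) - grayscale(green)))  # roughly 73.4
```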

    3. Classification decision-making:

    Essentially, a classifier is a mapping function from a feature vector to a category. The classification process of the k-nearest neighbor algorithm is roughly as follows: (1) calculate the Euclidean distance between the sample to be tested and each sample in the training set; (2) Sort each distance from small to large; (3) The first k samples with the shortest distance are selected, and the voting rule of "minority obeys majority" is adopted for classification tasks.

    For the regression task, the average of the k neighbors can be used as the predicted value.
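
    A minimal from-scratch sketch of steps (1)-(3) above; the function name and interface are illustrative assumptions:

```python
import numpy as np
from collections import Counter

def knn_classify(X_train, y_train, x, k=5):
    """Steps (1)-(3) above: Euclidean distances, ascending sort, majority vote.
    (For a regression task, return np.mean(np.asarray(y_train)[nearest]) instead.)"""
    dists = np.linalg.norm(np.asarray(X_train, dtype=float) - np.asarray(x, dtype=float), axis=1)  # (1)
    nearest = np.argsort(dists)[:k]                                                                # (2)
    return Counter(np.asarray(y_train)[nearest]).most_common(1)[0][0]                              # (3)
```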

  7. Anonymous user, 2024-02-01

    Hello dear! The nearest neighbor algorithm (KNN), also known as the k-nearest neighbor algorithm, is an active area of machine learning research. The simplest brute force algorithm is more suitable for small data samples.

    The model used by the k-nearest neighbor algorithm actually corresponds to a partition of the feature space. The knn algorithm can be used not only for classification, but also for regression. It has been widely used in artificial intelligence and machine learning, for example in character recognition, text classification and image recognition.

    Dynamic time warping (DTW) is a measure of the similarity between two time series; a distinguishing feature is that the two series may have different lengths. We hope this helps!

  8. Anonymous user, 2024-01-31

    1) Simple, easy to understand and implement, no need to estimate parameters.

    2) Zero training time. There is no explicit training step: unlike other supervised algorithms, which fit a model on the training set (i.e., fit a function) and then use that model to classify the validation or test set, knn simply stores the samples and does its work only when test data arrive, so its training time is zero.

    3) KNN can deal with classification problems, and at the same time, it can naturally deal with multi-classification problems, which is suitable for classifying rare events.

    4) Particularly suitable for multi-modal problems (objects with multiple class labels), KNN performs better than SVM.

    5) KNN can also deal with regression problems, that is, prediction.

    6) Compared with algorithms such as Naive Bayes, there are no assumptions about the data, the accuracy is high, and it is not sensitive to outliers.

    1) The amount of computation is too large, especially when the number of features is very large. Every sample to be classified has to have its distance to all known samples computed in order to find its k nearest neighbors.

    2) Poor interpretability; it cannot produce explicit rules the way a decision tree can.

    3) It is a lazy learning method, which basically does not learn, resulting in a slower speed than algorithms such as logistic regression.

    4) When the sample is unbalanced, the accuracy for rare categories is low. For example, when one class has a large sample size and the other classes have small sample sizes, the large class may make up the majority of the k neighbors when a new sample is entered.

    5) The dependence on the training data is particularly strong, and the fault tolerance to errors in the training data is poor. If one or two items in the training dataset are wrong and happen to lie right next to the value that needs to be classified, this will directly lead to inaccurate predictions.

    KNN is a good choice when you need a model that is particularly easy to explain.

    For example, a recommendation algorithm that needs to explain the reason to the user.

    Through this experiment, I learned about the k-nearest neighbor algorithm and its idea: if most of the k samples most similar to a given sample in the feature space belong to a certain category, then the sample also belongs to that category.

    The so-called k-nearest neighbor algorithm means that, given a training dataset, the k instances closest to a new input instance are found in the training dataset, and the majority category among those k instances determines the category of the new instance.
