# KNN Predictions on the Iris Dataset

Alex Egg,

This notebook is also hosted at:

• http://www.eggie5.com/62-knn-predictions-on-the-iris-dataset
• https://github.com/eggie5/ipython-notebooks/blob/master/iris/Iris.ipynb

These are my notes from the “Intro to Python and basic Sci-kit Learn” Seminar on 10/8/2015 @ UCSD

This Notebook will demonstrate the basics of using the k-Nearest Neighbors algorithm (KNN) to make predictions. The dataset is called Iris, and is a collection of flower measurements from which we can train our model to make predictions.

## The Dataset

The iris dataset is included int the Sci-kit library. It is a collection 4 dimensional vectors that map flower measurements to a flower species.

from sklearn.datasets import load_iris

iris.feature_names

['sepal length (cm)',
'sepal width (cm)',
'petal length (cm)',
'petal width (cm)']


As you can see from the above output, that the dataset is indeed a 4D vector with a length and width measurement for the flower sepal and petal. Lets preview the data for these measurements:

iris.data

array([[ 5.1,  3.5,  1.4,  0.2],
[ 4.9,  3. ,  1.4,  0.2],
[ 4.7,  3.2,  1.3,  0.2],
[ 4.6,  3.1,  1.5,  0.2],
[ 5. ,  3.6,  1.4,  0.2],
[ 5.4,  3.9,  1.7,  0.4],
[ 4.6,  3.4,  1.4,  0.3],
[ 5. ,  3.4,  1.5,  0.2],
[ 4.4,  2.9,  1.4,  0.2],
[ 4.9,  3.1,  1.5,  0.1],
[ 5.4,  3.7,  1.5,  0.2],
[ 4.8,  3.4,  1.6,  0.2],
[ 4.8,  3. ,  1.4,  0.1],
[ 4.3,  3. ,  1.1,  0.1],
[ 5.8,  4. ,  1.2,  0.2],
[ 5.7,  4.4,  1.5,  0.4],
[ 5.4,  3.9,  1.3,  0.4],
[ 5.1,  3.5,  1.4,  0.3],
[ 5.7,  3.8,  1.7,  0.3],
[ 5.1,  3.8,  1.5,  0.3],
[ 5.4,  3.4,  1.7,  0.2],
[ 5.1,  3.7,  1.5,  0.4],
[ 4.6,  3.6,  1. ,  0.2],
[ 5.1,  3.3,  1.7,  0.5],
[ 4.8,  3.4,  1.9,  0.2],
[ 5. ,  3. ,  1.6,  0.2],
[ 5. ,  3.4,  1.6,  0.4],
[ 5.2,  3.5,  1.5,  0.2],
[ 5.2,  3.4,  1.4,  0.2],
[ 4.7,  3.2,  1.6,  0.2],
[ 4.8,  3.1,  1.6,  0.2],
[ 5.4,  3.4,  1.5,  0.4],
[ 5.2,  4.1,  1.5,  0.1],
[ 5.5,  4.2,  1.4,  0.2],
[ 4.9,  3.1,  1.5,  0.1],
[ 5. ,  3.2,  1.2,  0.2],
[ 5.5,  3.5,  1.3,  0.2],
[ 4.9,  3.1,  1.5,  0.1],
[ 4.4,  3. ,  1.3,  0.2],
[ 5.1,  3.4,  1.5,  0.2],
[ 5. ,  3.5,  1.3,  0.3],
[ 4.5,  2.3,  1.3,  0.3],
[ 4.4,  3.2,  1.3,  0.2],
[ 5. ,  3.5,  1.6,  0.6],
[ 5.1,  3.8,  1.9,  0.4],
[ 4.8,  3. ,  1.4,  0.3],
[ 5.1,  3.8,  1.6,  0.2],
[ 4.6,  3.2,  1.4,  0.2],
[ 5.3,  3.7,  1.5,  0.2],
[ 5. ,  3.3,  1.4,  0.2],
[ 7. ,  3.2,  4.7,  1.4],
[ 6.4,  3.2,  4.5,  1.5],
[ 6.9,  3.1,  4.9,  1.5],
[ 5.5,  2.3,  4. ,  1.3],
[ 6.5,  2.8,  4.6,  1.5],
[ 5.7,  2.8,  4.5,  1.3],
[ 6.3,  3.3,  4.7,  1.6],
[ 4.9,  2.4,  3.3,  1. ],
[ 6.6,  2.9,  4.6,  1.3],
[ 5.2,  2.7,  3.9,  1.4],
[ 5. ,  2. ,  3.5,  1. ],
[ 5.9,  3. ,  4.2,  1.5],
[ 6. ,  2.2,  4. ,  1. ],
[ 6.1,  2.9,  4.7,  1.4],
[ 5.6,  2.9,  3.6,  1.3],
[ 6.7,  3.1,  4.4,  1.4],
[ 5.6,  3. ,  4.5,  1.5],
[ 5.8,  2.7,  4.1,  1. ],
[ 6.2,  2.2,  4.5,  1.5],
[ 5.6,  2.5,  3.9,  1.1],
[ 5.9,  3.2,  4.8,  1.8],
[ 6.1,  2.8,  4. ,  1.3],
[ 6.3,  2.5,  4.9,  1.5],
[ 6.1,  2.8,  4.7,  1.2],
[ 6.4,  2.9,  4.3,  1.3],
[ 6.6,  3. ,  4.4,  1.4],
[ 6.8,  2.8,  4.8,  1.4],
[ 6.7,  3. ,  5. ,  1.7],
[ 6. ,  2.9,  4.5,  1.5],
[ 5.7,  2.6,  3.5,  1. ],
[ 5.5,  2.4,  3.8,  1.1],
[ 5.5,  2.4,  3.7,  1. ],
[ 5.8,  2.7,  3.9,  1.2],
[ 6. ,  2.7,  5.1,  1.6],
[ 5.4,  3. ,  4.5,  1.5],
[ 6. ,  3.4,  4.5,  1.6],
[ 6.7,  3.1,  4.7,  1.5],
[ 6.3,  2.3,  4.4,  1.3],
[ 5.6,  3. ,  4.1,  1.3],
[ 5.5,  2.5,  4. ,  1.3],
[ 5.5,  2.6,  4.4,  1.2],
[ 6.1,  3. ,  4.6,  1.4],
[ 5.8,  2.6,  4. ,  1.2],
[ 5. ,  2.3,  3.3,  1. ],
[ 5.6,  2.7,  4.2,  1.3],
[ 5.7,  3. ,  4.2,  1.2],
[ 5.7,  2.9,  4.2,  1.3],
[ 6.2,  2.9,  4.3,  1.3],
[ 5.1,  2.5,  3. ,  1.1],
[ 5.7,  2.8,  4.1,  1.3],
[ 6.3,  3.3,  6. ,  2.5],
[ 5.8,  2.7,  5.1,  1.9],
[ 7.1,  3. ,  5.9,  2.1],
[ 6.3,  2.9,  5.6,  1.8],
[ 6.5,  3. ,  5.8,  2.2],
[ 7.6,  3. ,  6.6,  2.1],
[ 4.9,  2.5,  4.5,  1.7],
[ 7.3,  2.9,  6.3,  1.8],
[ 6.7,  2.5,  5.8,  1.8],
[ 7.2,  3.6,  6.1,  2.5],
[ 6.5,  3.2,  5.1,  2. ],
[ 6.4,  2.7,  5.3,  1.9],
[ 6.8,  3. ,  5.5,  2.1],
[ 5.7,  2.5,  5. ,  2. ],
[ 5.8,  2.8,  5.1,  2.4],
[ 6.4,  3.2,  5.3,  2.3],
[ 6.5,  3. ,  5.5,  1.8],
[ 7.7,  3.8,  6.7,  2.2],
[ 7.7,  2.6,  6.9,  2.3],
[ 6. ,  2.2,  5. ,  1.5],
[ 6.9,  3.2,  5.7,  2.3],
[ 5.6,  2.8,  4.9,  2. ],
[ 7.7,  2.8,  6.7,  2. ],
[ 6.3,  2.7,  4.9,  1.8],
[ 6.7,  3.3,  5.7,  2.1],
[ 7.2,  3.2,  6. ,  1.8],
[ 6.2,  2.8,  4.8,  1.8],
[ 6.1,  3. ,  4.9,  1.8],
[ 6.4,  2.8,  5.6,  2.1],
[ 7.2,  3. ,  5.8,  1.6],
[ 7.4,  2.8,  6.1,  1.9],
[ 7.9,  3.8,  6.4,  2. ],
[ 6.4,  2.8,  5.6,  2.2],
[ 6.3,  2.8,  5.1,  1.5],
[ 6.1,  2.6,  5.6,  1.4],
[ 7.7,  3. ,  6.1,  2.3],
[ 6.3,  3.4,  5.6,  2.4],
[ 6.4,  3.1,  5.5,  1.8],
[ 6. ,  3. ,  4.8,  1.8],
[ 6.9,  3.1,  5.4,  2.1],
[ 6.7,  3.1,  5.6,  2.4],
[ 6.9,  3.1,  5.1,  2.3],
[ 5.8,  2.7,  5.1,  1.9],
[ 6.8,  3.2,  5.9,  2.3],
[ 6.7,  3.3,  5.7,  2.5],
[ 6.7,  3. ,  5.2,  2.3],
[ 6.3,  2.5,  5. ,  1.9],
[ 6.5,  3. ,  5.2,  2. ],
[ 6.2,  3.4,  5.4,  2.3],
[ 5.9,  3. ,  5.1,  1.8]])


Now each of these vectors maps directly to a flower species, described in target:

iris.target

array([0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
1, 1, 1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2,
2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2,
2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2])


The numbers in the output above are ids. The dataset provides a lookup table to map the ids to a string flower name:

iris.target_names

array(['setosa', 'versicolor', 'virginica'],
dtype='|S10')


So, id 2 maps to:

iris.target_names[2]

'virginica'


## The Model

For this demonstration we will use the KNN algorithm to model a flower species prediction engine. I can’t get too deep into the details of the theory behind KNN, but I can try to describe it intuitively below.

### KNN Intuition

For simplicity, let’s image our dataset is only a 2 dimensional vector instead of 4, (remove width measurements) e.g:

['sepal length (cm)',
'petal length (cm)']


Now we have a collection of (sepal length petal length) pairs that map to iris.target

If we scatter plot this we’d have something like:

Where the x axis is sepal length (cm), the y axis is petal length (cm) and the color (red/green/blue) corresponds to the species id.

Now, from intuition, you could say if you picked a point on the scatter plot and it is surrounded by a majority of blue dots then it can be inferred from the data with a probability degree of certanity that that point is also classified as a blue species. This is in essence what KNN does. It will build a boundary around a clustering of common points. See the image below:

Also, keep in mind that the example above is presented w/ a 2D dataset while ours is 4D, the theory holds.

Now let’s test this out and see if we can make some predictions!

## Experiment

from sklearn.neighbors import KNeighborsClassifier

knn = KNeighborsClassifier(n_neighbors=1)

X = iris.data
Y = iris.target
knn.fit(X,Y)

KNeighborsClassifier(algorithm='auto', leaf_size=30, metric='minkowski',
metric_params=None, n_neighbors=1, p=2, weights='uniform')


So what we are saying above, is that we have a vector X with corresponding output in vector Y, now train the model w/ these inputs so that we will have a boundary map like the image above that will allow us to make arbitrary predictions.

So for example, say I have a flower w/ these measurements:

sepal length (cm): 3
sepal width (cm): 5
petal length (cm): 4
petal width (cm): 2


If I convert that into a vector and feed it though the model knn it should make a prediction about species.

test_input = [3,5,4,2]

species_id = knn.predict(test_input)
print iris.target_names[species_id]

['virginica']


So, we can say w/ a certain degree of certainty that this random sample test_input is of type virginica

### Measuring Error

Now I keep using the language “with a certain degree of certainty”, etc, in which I’m trying to convey that this an other machine learning/data mining models are only approximations and a degree of error exists. Let’s measure the error of our KNN.

#### Cross Validation w/ Test Train Split

One way we can measure the accuracy of our model is to test it against the data we trained it with. Sci-kit has a convenient function to help us with that.

from sklearn.cross_validation import train_test_split


We’re basically feeding in input for which we KNOW the correct output to measure the error in our model.

In order to do this we will use the train_test_split function which will divide our dataset into 2 parts. The first part will be used to train the model and the second part will be used to test the model.

X_train, X_test, Y_train, Y_test = train_test_split(X,Y, test_size=0.4, random_state=4)
actual = Y_test


Now train the model w/ the new data:

knn = KNeighborsClassifier(n_neighbors=1)
knn.fit(X_train, Y_train)
expected = knn.predict(X_test) #predictions


Now what we could do is to feed in all all the test data in X_test and compare the results to the known answers in Y_test and then measure the error, which is the difference between the expected answer and the actual answer. Then we could compute some type of error function, such as mean-squared-error, etc.

But even more convenient, Sci-kit has a metrics library that will automate a lot of this:

import sklearn.metrics
score_1 = metrics.accuracy_score(actual, expected)
score_1

0.94999999999999996

##### Lower Error Rate

Our model has an accuracy of .94 where 1 is the max. I don’t have anything relative to compare it to but it seems pretty accurate. There are lots of settings in the KNN model, such as the K value, that we can adjust to get a higher acceptance rate. If you noticed, the first time we instantiated KNeighborsClassifier we chose a setting of n_neighbors=1 with no discussion. Lets iterate 1-26 and see if we can’t lower the error rate:

import matplotlib.pyplot as plt
%matplotlib inline

scores = []
k_range = range(1,26)
for k in k_range:
knn=KNeighborsClassifier(n_neighbors=k)
knn.fit(X_train, Y_train)
actual = knn.predict(X_test)
scores.append(metrics.accuracy_score(Y_test, actual))

plt.xlabel('Value of K for KNN')
plt.ylabel('Testing Accuracy')
plt.plot(k_range, scores)

[<matplotlib.lines.Line2D at 0x10e785ad0>]


It looks like if we choose a K value of 20 our model should be the most accurate. Lets see if we adjust our K can we get a higher score than the last score score_1 of: 0.94999999999999996

knn = KNeighborsClassifier(n_neighbors=20)
knn.fit(X_train, Y_train)
expected = knn.predict(X_test)
score_2 = metrics.accuracy_score(actual, expected)
score_2

0.98333333333333328

(score_1-score_2)/score_1

-0.035087719298245605


It’s only a 3.5% decrease in error, but considering it’s so close to 1 we should be happy that our model is predicting flowers at a reasonable error rate.