## Pokémon, Machine Learning, Python: a winning trio

In my last article I introduced the algorithmic part of the workshop we at Datamasters prepared for PyCon 22. TL;DR: we proposed a Pokémon search engine built on the KNN (k-nearest neighbors) machine learning algorithm. You can read the article here. Let’s do a quick recap:

- Take 6 numbers from the user, and assign them to a “fake” Pokémon
- Compute the Euclidean distance between the fake Pokémon and each Pokémon stored in one of the many datasets you can easily find on Kaggle
- Store the results in a new data structure: a list of length N (the number of Pokémon in the dataset) in which each element stores 2 numbers:
  - the index of the Pokémon in the dataset
  - the Euclidean distance between the i-th Pokémon and our fake Pokémon
- We now have a data structure with a length equal to the number of Pokémon in the dataset; each element contains a unique index and a number indicating the distance. Sort this structure by the “distance” value of each element
- Show in output just the first K elements of this sorted data structure (e.g. if K = 3, take just the 3 Pokémon that are the nearest, and thus the most similar, to the user-inserted Pokémon)
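
For reference, the distance used in the second step can be written out explicitly. This is the standard Euclidean distance between the user’s fake Pokémon and a dataset Pokémon, each described by 6 numbers; the symbols $u$, $p$ and $d$ are notation chosen here for illustration, not taken from the workshop material.

```latex
d(u, p) = \sqrt{\sum_{i=1}^{6} (u_i - p_i)^2}
```

As a quick sanity check in two dimensions, the points $(0, 0)$ and $(3, 4)$ are at distance $\sqrt{3^2 + 4^2} = 5$.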

That’s it. The idea is not that hard. Reread the previous article if you’re missing something.

In this article we’ll implement our search engine using Python. Since it was part of a workshop called “Beginners’ Day”, we decided not to use any third-party library, not even the super-used and super-cool data science libraries Python is famous for (pandas, matplotlib, numpy, etc.). We’re building this search engine **from scratch**. The only package we’re using is `csv`, which is part of the Python standard library, so there is nothing to install: once you have Python, you’re good to go. Let’s start by importing this package and using it to read the dataset, which (unsurprisingly) is a CSV file. Pay attention: the very first row of the dataset contains the column names and not actual Pokémon data:

    import csv

    columns = []
    pokemon_dataset = []

    with open('pokemon.csv') as file:
        print(type(file))
        reader = csv.reader(file)
        columns = next(reader)  # the first row holds the column names
        for row in reader:
            pokemon_dataset.append(row)

All these lines do is read the content of the CSV file and store the values in 2 lists. The first holds the column names (#, Name, Type 1, Type 2, etc.), and the second holds the actual dataset. Note that `columns` is a one-dimensional list while `pokemon_dataset` is a **two-dimensional list**, also called a **matrix** or, if you prefer, a **table**: to access a Pokémon we’ll use a single index, while to access an attribute of a Pokémon we’ll use 2 indices:

    print(pokemon_dataset[0])     # Prints the row with Bulbasaur's data, the first Pokémon in the dataset
    print(pokemon_dataset[0][1])  # Prints the name (index 1) of the Pokémon with index 0 (still Bulbasaur)
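
To try the one-index versus two-index access without the CSV file at hand, here is a minimal sketch that reads from an in-memory string instead. The `sample_csv` variable and its shortened header are assumptions made for illustration; only the Bulbasaur values come from the dataset shown in this article.

```python
import csv
import io

# In-memory stand-in for pokemon.csv: a shortened header plus the Bulbasaur row.
sample_csv = io.StringIO(
    "#,Name,Type 1,Type 2,Total,HP\n"
    "1,Bulbasaur,Grass,Poison,318,45\n"
)

reader = csv.reader(sample_csv)
columns = next(reader)          # one-dimensional list of column names
pokemon_dataset = list(reader)  # two-dimensional list: one inner list per Pokémon

first_row = pokemon_dataset[0]      # single index: a whole row
first_name = pokemon_dataset[0][1]  # two indices: one attribute of that row

print(first_row)
print(first_name)
```

Here `list(reader)` collects the remaining rows in one step, which has the same effect as the `for` loop with `append`.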

Let’s explore the dataset by printing all the available fields, their types and their values for the first Pokémon, Bulbasaur:

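    for k, v in zip(columns, pokemon_dataset[0]):
        print(f"{k:10} -> {v:15} ({type(v)})")

This prints:

    #          -> 1               (<class 'str'>)
    Name       -> Bulbasaur       (<class 'str'>)
    Type 1     -> Grass           (<class 'str'>)
    Type 2     -> Poison          (<class 'str'>)
    Total      -> 318             (<class 'str'>)
    HP         -> 45              (<class 'str'>)
    Attack     -> 49              (<class 'str'>)
    Defense    -> 49              (<class 'str'>)
    Sp. Atk    -> 65              (<class 'str'>)
    Sp. Def    -> 65              (<class 'str'>)
    Speed      -> 45              (<class 'str'>)
    Generation -> 1               (<class 'str'>)
    Legendary  -> False           (<class 'str'>)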

Notice how all fields are **strings** (type `str`), **even the ones containing numeric values** (score columns, generation, total). Applying the Euclidean distance to strings is not possible, and would make no sense anyway; what we need to do is store the numeric values we’re interested in in another data structure, converted to `float`:

    num_indexes = [5, 6, 7, 8, 9, 10]
    numerical_data = []

    for i in range(len(pokemon_dataset)):
        row = pokemon_dataset[i]
        num_row = []
        for col in num_indexes:
            num_row.append(float(row[col]))
        numerical_data.append(num_row)

    for k, v in zip(columns[5:11], numerical_data[0]):
        print(f"{k:10} -> {v:5} ({type(v)})")

    # the last for loop would print:
    # HP         ->  45.0 (<class 'float'>)
    # Attack     ->  49.0 (<class 'float'>)
    # Defense    ->  49.0 (<class 'float'>)
    # Sp. Atk    ->  65.0 (<class 'float'>)
    # Sp. Def    ->  65.0 (<class 'float'>)
    # Speed      ->  45.0 (<class 'float'>)

In `numerical_data` we can now find **just** the columns containing the data we’ll use to compute the Euclidean distance. The last loop, just to make sure we’re doing well, prints the values and the types for the first Pokémon in the dataset, so we can double-check that they are `float`.
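
The same string-to-float conversion can also be written more compactly. The sketch below uses a nested list comprehension on a single sample row; `sample_rows` and `numerical_sample` are names made up for this example, and the row is the Bulbasaur data shown earlier.

```python
# One sample row in the same layout as pokemon_dataset (all values are strings).
sample_rows = [
    ["1", "Bulbasaur", "Grass", "Poison", "318",
     "45", "49", "49", "65", "65", "45", "1", "False"],
]
num_indexes = [5, 6, 7, 8, 9, 10]

# Outer comprehension walks the rows, inner one picks and converts the 6 stats.
numerical_sample = [[float(row[col]) for col in num_indexes] for row in sample_rows]

print(numerical_sample[0])  # [45.0, 49.0, 49.0, 65.0, 65.0, 45.0]
```

The explicit loops used in the workshop do exactly the same work and are arguably easier to follow for beginners; the comprehension is just a shorter spelling.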

Now let’s write the functions that form the real *core* of our KNN version. First of all, the Euclidean distance between two points with N dimensions:

    def euclidean_distance(p1, p2):
        dim = len(p1)
        distance = 0
        for d in range(dim):
            distance += abs(p1[d] - p2[d]) ** 2
        distance = distance ** (1 / 2)
        return distance

`p1` and `p2` are lists, both with the same number of elements (i.e. with the same *length*). It’s a mandatory requirement for the Euclidean distance to work: the points *must* belong to the same **Euclidean space**. That’s a fancy way to say that if we have points with **2 dimensions**, we’ll compute the Euclidean distance in **a plane**; if we have points with **3 dimensions**, we’ll compute the Euclidean distance in **space**. Computing the Euclidean distance between a 2D point and a 3D point just doesn’t make sense, right? Back to the code: in the `for` loop we iterate through the points’ coordinates, compute the absolute value of their difference, square it and add it to a variable called `distance`, initially set to 0. Outside the loop, we raise `distance` to the power of 0.5, which means (I’d like to remind you) computing the square root of `distance`.
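
A quick way to gain confidence in a hand-written distance function is to compare it with a case whose answer is known. The sketch below repeats the function so it runs on its own, then checks it against the classic 3-4-5 right triangle and against `math.dist` from the standard library (available from Python 3.8). The variable names `hand_written` and `built_in` are made up for this example.

```python
import math

def euclidean_distance(p1, p2):
    """Same logic as the function in the article."""
    distance = 0
    for d in range(len(p1)):
        distance += abs(p1[d] - p2[d]) ** 2
    return distance ** (1 / 2)

# 3-4-5 right triangle: the distance from the origin to (3, 4) is 5.
hand_written = euclidean_distance([0, 0], [3, 4])
built_in = math.dist([0, 0], [3, 4])

print(hand_written, built_in)
```

The workshop deliberately avoids ready-made helpers, so `math.dist` appears here only as a cross-check, not as a replacement.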

The next function handles the rest of the algorithm, creating and sorting the data structure:

    def get_k_neighbors(k, dataset, fake_pokemon):
        distances = []
        for i in range(len(dataset)):
            row = dataset[i]
            d = euclidean_distance(fake_pokemon, row)
            distances.append((i, d))
        distances.sort(key=lambda tup: tup[1])
        return distances[:k]

This function has three arguments:

- `k`, a number that represents the number of results returned to the user
- `dataset`, the table containing all the Pokémon numeric values
- `fake_pokemon`, a list filled with the user’s values

Next, the function creates an empty data structure, called `distances`, and for each Pokémon in the dataset, identified by the `i` index, computes the Euclidean distance between the `fake_pokemon` and the i-th dataset row using the function defined earlier. For each of the computed distances, a new tuple is appended to the `distances` structure: the first element of the tuple is the variable `i`, while the second (and last) element is the computed distance. After the loop, `distances` holds one `(index, distance)` tuple per Pokémon, in dataset order.
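
To make the shape of this structure concrete, here is a minimal sketch on toy data. The 2D points in `toy_dataset` and `toy_query` are made up so the distances come out as round numbers; the real rows have 6 values each. The distance function is a condensed version of the one above.

```python
def euclidean_distance(p1, p2):
    # Same computation as in the article, condensed into one expression.
    return sum((a - b) ** 2 for a, b in zip(p1, p2)) ** 0.5

# Made-up 2D points standing in for the 6-stat Pokémon rows.
toy_dataset = [[3, 4], [0, 1], [6, 8]]
toy_query = [0, 0]

toy_distances = []
for i, row in enumerate(toy_dataset):
    toy_distances.append((i, euclidean_distance(toy_query, row)))

print(toy_distances)  # [(0, 5.0), (1, 1.0), (2, 10.0)]
```

Each tuple keeps the row index next to its distance, so the index still identifies the right Pokémon after the list is reordered.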

All we have to do is **sort** this data structure. Sorting a list in Python is quite easy: we can call the `.sort` method on the list we want to sort, and all the data will be sorted in ascending order. This works well with **one-dimensional lists**, while in this case we have an `Nx2` structure, where N is the number of Pokémon; sorted with no arguments, the tuples would be compared by their first element, the index. But have no fear! We can pass a **callback function** to `.sort` through its `key` argument, in which we **specify the element we want to use to sort our original data structure**. The function is called on each element of our list and has just one argument: the i-th element of the data structure. For the sake of brevity, we used a **lambda function**, but we could achieve the same result using an old-fashioned function, like this:

    def which_element(tup):
        return tup[1]

    distances.sort(key=which_element)

Notice how the return value (in both the lambda function and traditional function) is simply the value with index 1 of the tuple, i.e. the second column of our data structure: the distance.
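
The effect of the key function is easy to see on a small list. In this sketch the `(index, distance)` pairs are made up, and `operator.itemgetter(1)` from the standard library is shown as one more way to build the same “give me element 1” callback; `sorted` is used instead of `.sort` only so the original list stays unchanged for comparison.

```python
from operator import itemgetter

# Made-up (index, distance) pairs in dataset order.
toy_distances = [(0, 5.0), (1, 1.0), (2, 10.0)]

# Without a key, tuples are compared by their first element (the index),
# so the order does not change here.
by_index = sorted(toy_distances)

# With a key that returns element 1, the list is ordered by distance.
by_distance = sorted(toy_distances, key=itemgetter(1))

print(by_index)     # [(0, 5.0), (1, 1.0), (2, 10.0)]
print(by_distance)  # [(1, 1.0), (0, 5.0), (2, 10.0)]
```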

After sorting, `distances` is ordered from the shortest distance to the longest. Now we can use **slicing** to return the first **k** elements: the k elements with **the shortest distance** or, if you prefer, **the k Pokémon most similar to the user’s Pokémon**.
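
Sorting everything and then slicing is the simplest approach and is what the workshop code does. As a possible alternative, the standard library’s `heapq.nsmallest` returns the k smallest items directly; the sketch below, on made-up `(index, distance)` pairs, shows that both give the same result.

```python
import heapq

# Made-up (index, distance) pairs.
toy_distances = [(0, 5.0), (1, 1.0), (2, 10.0), (3, 2.5)]
k = 2

# Sort-then-slice, as in the article.
sorted_then_sliced = sorted(toy_distances, key=lambda tup: tup[1])[:k]

# nsmallest returns the k smallest items by key, in ascending order.
k_smallest = heapq.nsmallest(k, toy_distances, key=lambda tup: tup[1])

print(k_smallest)  # [(1, 1.0), (3, 2.5)]
```

For a dataset with a few hundred rows the difference is negligible, so this is a matter of taste rather than a needed optimization.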

Finally, we create a function that builds a string with the info of a Pokémon.

    def print_pokemon_info(i):
        s = f"{pokemon_dataset[i][0]} - {pokemon_dataset[i][1]}, of type {pokemon_dataset[i][2]} (gen. {pokemon_dataset[i][-2]})"
        if pokemon_dataset[i][-1] == "True":
            s += "\n - LEGENDARY"
        return s

The function takes just one input parameter: the index of the row in the Pokémon dataset. We build the string with a Python f-string containing:

- Pokémon Pokédex number
- Pokémon name
- Pokémon type
- Pokémon generation (to retrieve the generation we use the index -2 to get the **second-to-last** element)
- A message if the Pokémon is legendary (to retrieve the “legendary” field we use index -1 to get the **last** element of the row)

Lastly, we return the generated string.
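
Two details of this function are worth trying in isolation: negative indexing, and the comparison with the string `"True"`. The sketch below uses the Bulbasaur row shown earlier; the variable names `row`, `generation` and `legendary` are made up for this example.

```python
# The Bulbasaur row as read by csv.reader: every field is a string.
row = ["1", "Bulbasaur", "Grass", "Poison", "318",
       "45", "49", "49", "65", "65", "45", "1", "False"]

generation = row[-2]  # second-to-last element
legendary = row[-1]   # last element

# Any non-empty string is truthy, so `if legendary:` would always pass.
print(bool(legendary))      # True, even though the text says "False"
print(legendary == "True")  # False: the comparison the function relies on
```

This is why the function compares against the text `"True"` rather than testing the field directly.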

To wrap up, we put these functions to work in the lines of code that ask for the user’s input:

    user_row = []
    k = 5

    for i in range(len(num_indexes)):
        col_index = num_indexes[i]
        v = input(f"Insert a {columns[col_index]} value\n")
        user_row.append(float(v))

    # name/value instead of k/v, so this loop does not overwrite k
    for name, value in zip(columns[5:11], user_row):
        print(f"{name:10} -> {value:15} ({type(value)})")

    print("Looking for the most similar Pokémon...")
    l = get_k_neighbors(k, numerical_data, user_row)
    print(l)
    for p in l:
        print(print_pokemon_info(p[0]))

We create an empty list corresponding to our fake Pokémon; we set k to 5 (we’ll get the 5 Pokémon with the shortest distances) and we ask the user for 6 numbers (we’ll also use the `columns` list we created in the first cell, to show the user the score name); we print the values for better usability of our program; we invoke the function to get the K most similar Pokémon; we use the last function to show what Pokémon were returned by our algorithm. Here’s an example of the program’s output:
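
One thing the workshop code leaves out is input checking: `float(v)` raises a `ValueError` if the user types something that is not a number. A minimal sketch of one way to handle that is below; `parse_stat` and `ask_stat` are hypothetical helpers, not part of the workshop code.

```python
def parse_stat(text):
    """Return the text as a float, or None if it is not a valid number.

    Hypothetical helper, not part of the workshop code.
    """
    try:
        return float(text.strip())
    except ValueError:
        return None

def ask_stat(name):
    # Keep asking until the user types something float() accepts.
    while True:
        value = parse_stat(input(f"Insert a {name} value\n"))
        if value is not None:
            return value
        print("Please type a number, for example 40")

print(parse_stat("40"))     # 40.0
print(parse_stat("forty"))  # None
```

With these helpers, the input loop could call `ask_stat(columns[col_index])` instead of converting the raw input directly.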

    Insert a HP value
    40
    Insert a Attack value
    40
    Insert a Defense value
    40
    Insert a Sp.Attack value
    40
    Insert a Sp.Defense value
    40
    Insert a Speed value
    40
    HP         ->            40.0 (<class 'float'>)
    Attack     ->            40.0 (<class 'float'>)
    Defense    ->            40.0 (<class 'float'>)
    Sp. Atk    ->            40.0 (<class 'float'>)
    Sp. Def    ->            40.0 (<class 'float'>)
    Speed      ->            40.0 (<class 'float'>)
    Looking for the most similar Pokémon...
    236 - Tyrogue, of type Fighting (gen. 2)
    300 - Skitty, of type Normal (gen. 3)
    29 - Nidoran♀, of type Poison (gen. 1)
    504 - Patrat, of type Normal (gen. 5)
    412 - Burmy, of type Bug (gen. 4)

And there it is! We can insert whatever value we want, and our software will use the KNN algorithm implemented earlier to return the Pokémon most similar to the numbers inserted by the user. Easy, isn’t it? As usual, coding is not that hard once you have a good design.

In our repository, available at this URL, you can find a Jupyter notebook that also shows Pokémon images, taken from one of the many websites with Pokémon data.

See you soon, with other ML articles and tutorials!
