Introduction to Machine Learning With Python on Windows

Introduction

In this article, I’m going to introduce you to machine learning in Python. Machine learning is a very large subject, even more so with Python. I won’t be able to cover every topic, but the goal here is to show you the basics so you can start building your first ML projects.

Although there aren’t any prerequisites for following along, I would suggest a little refresher on your Python skills (mainly lists and dictionaries). A little calculus and linear algebra wouldn’t hurt either.

Installation

The easiest way to get all the libraries you need for doing machine learning in Python is to install the Anaconda distribution. Although the main library is Scikit Learn, it has a number of dependencies which can be tricky to configure on their own. I recommend that you Google for the latest Anaconda distribution and start the installation:


[Screenshots: the steps of the Anaconda installation wizard]

Normally the installation goes smoothly, but I did have a case one time where I got an error like this when trying to import one of the libraries (matplotlib):

This application failed to start because it could not find or load the Qt platform plugin "windows" in "".

Reinstalling the application may fix this problem.

To resolve this, I added the QT_PLUGIN_PATH environment variable manually:

[Screenshot: adding the QT_PLUGIN_PATH environment variable]
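If you’d rather not touch the system settings, you can also set the variable for the current session only, before importing matplotlib. Note that the path below is just an example; yours depends on where Anaconda is installed on your machine:

```python
import os

# NOTE: this path is only an example -- adjust it to wherever
# Anaconda is installed on your machine
os.environ["QT_PLUGIN_PATH"] = r"C:\Anaconda3\Library\plugins"

# Libraries imported after this point (like matplotlib) will see the variable
print(os.environ["QT_PLUGIN_PATH"])
```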

You should check your installation by opening a command line, typing python and verifying that you have everything you need for your machine learning needs (some import statements may take a little while to load):

>>> import numpy
>>> import pandas
>>> import matplotlib
>>> import sklearn
>>> print(numpy.__version__)
1.11.1
>>> print(pandas.__version__)
0.18.1
>>> print(matplotlib.__version__)
1.5.3
>>> print(sklearn.__version__)
0.17.1
>>>

The versions in your installation might be a little fresher than mine, but you should still be able to follow along.

NumPy

The NumPy library is part of the SciPy family. SciPy is a group of libraries that makes it easier to build scientific projects in Python. NumPy’s main features are 1D and 2D arrays. NumPy arrays are similar to Python lists except that they support vectorized operations (as in linear algebra), which is important for machine learning (with normal Python lists, operators like + do concatenation instead).
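To see what “vectorized” means in practice, compare adding two plain Python lists with adding two NumPy arrays:

```python
import numpy as np

a = [1, 2, 3]
b = [4, 5, 6]

# With plain Python lists, + means concatenation
print(a + b)       # [1, 2, 3, 4, 5, 6]

# With NumPy arrays, + adds element by element (a vectorized operation)
x = np.array(a)
y = np.array(b)
print(x + y)       # [5 7 9]

# Most arithmetic operators are vectorized the same way
print(x * 2)       # [2 4 6]
```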

This data structure is one of the most used in machine learning. The best way to learn about it is to see it in action. These examples are taken from this GitHub repository. They’re based on lessons from Udacity’s Intro to Data Analysis course:

# 1D Arrays

import numpy as np

# First 20 countries with employment data
countries = np.array([
    'Afghanistan', 'Albania', 'Algeria', 'Angola', 'Argentina',
    'Armenia', 'Australia', 'Austria', 'Azerbaijan', 'Bahamas',
    'Bahrain', 'Bangladesh', 'Barbados', 'Belarus', 'Belgium',
    'Belize', 'Benin', 'Bhutan', 'Bolivia',
    'Bosnia and Herzegovina'
])

# Employment data in 2007 for those 20 countries
employment = np.array([
    55.70000076,  51.40000153,  50.5       ,  75.69999695,
    58.40000153,  40.09999847,  61.5       ,  57.09999847,
    60.90000153,  66.59999847,  60.40000153,  68.09999847,
    66.90000153,  53.40000153,  48.59999847,  56.79999924,
    71.59999847,  58.40000153,  70.40000153,  41.20000076
])

def max_employment(countries, employment):
    '''
    Return the name of the country with the highest employment
    in the given employment data, and the employment in that country.
    '''
    max_ind = employment.argmax()
    max_country = countries[max_ind]
    max_value = employment[max_ind]

    return (max_country, max_value)

# 2D Arrays

import numpy as np

ridership = np.array([
    [   0,    0,    2,    5,    0],
    [1478, 3877, 3674, 2328, 2539],
    [1613, 4088, 3991, 6461, 2691],
    [1560, 3392, 3826, 4787, 2613],
    [1608, 4802, 3932, 4477, 2705],
    [1576, 3933, 3909, 4979, 2685],
    [  95,  229,  255,  496,  201],
    [   2,    0,    1,   27,    0],
    [1438, 3785, 3589, 4174, 2215],
    [1342, 4043, 4009, 4665, 3033]
])

def mean_riders_for_max_station(ridership):
    '''
    Find the station with the maximum riders on the first day,
    then return the overall mean ridership along with the mean
    riders per day for that station, for comparison.

    Hint: NumPy's argmax() function might be useful:
    http://docs.scipy.org/doc/numpy/reference/generated/numpy.argmax.html
    '''
    max_station = ridership[0, :].argmax()
    overall_mean = ridership.mean()
    mean_for_max = ridership[:, max_station].mean()
    return (overall_mean, mean_for_max)

print(mean_riders_for_max_station(ridership))
# (2342.5999999999999, 3239.9000000000001)

Note: the instructor in the course uses Jupyter notebooks, which are based on another SciPy library called IPython (if you’ve installed Anaconda, you already have what you need).

Obviously, NumPy arrays support more operations than just the ones shown here. If you want to learn more about NumPy, you can do so by taking a look at the tutorial in the documentation.
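For instance, aggregate functions such as mean() accept an axis argument on 2D arrays, which is exactly what the ridership exercise above relies on. Here is a small sketch on made-up data:

```python
import numpy as np

# A small made-up 2D array: rows are days, columns are stations
ridership = np.array([
    [   0, 1478, 1613],
    [   5, 2328, 6461],
])

print(ridership.mean())            # mean over every value in the array
print(ridership.mean(axis=0))      # mean per column (per station)
print(ridership.mean(axis=1))      # mean per row (per day)
print(ridership[0, :].argmax())    # index of the busiest station on day one
```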

Pandas

Pandas is another library from the scientific Python family. Pandas takes the concept of NumPy arrays a little further and introduces two data structures called Series and DataFrames. They are similar to NumPy arrays (they are actually built on top of them) but support more operations, like grouping.

Here are a few examples of both data structures taken from the Pandas documentation:

# Series

>>> import numpy as np
>>> import pandas as pd
>>> s = pd.Series([1, 3, 5, np.nan, 6, 8])
>>> s
0     1
1     3
2     5
3   NaN
4     6
5     8
dtype: float64

# DataFrames

>>> import numpy as np
>>> import pandas as pd
>>> dates = pd.date_range('20130101', periods=6)
>>> dates
DatetimeIndex(['2013-01-01', '2013-01-02', '2013-01-03', '2013-01-04',
               '2013-01-05', '2013-01-06'],
              dtype='datetime64[ns]', freq='D')
>>> df = pd.DataFrame(np.random.randn(6, 4), index=dates, columns=list('ABCD'))
>>> df
                   A         B         C         D
2013-01-01 -1.277055 -0.463313 -0.454370 -0.175243
2013-01-02 -0.466551  1.770887  0.530980 -0.783895
2013-01-03  0.490643 -0.918222  0.979640  0.423781
2013-01-04  0.528427 -0.808807  0.605657  1.430232
2013-01-05 -0.792209 -1.563374  0.476319  1.140602
2013-01-06  0.179147 -0.156790  0.273832  1.225799
>>> df.head()
                   A         B         C         D
2013-01-01 -1.277055 -0.463313 -0.454370 -0.175243
2013-01-02 -0.466551  1.770887  0.530980 -0.783895
2013-01-03  0.490643 -0.918222  0.979640  0.423781
2013-01-04  0.528427 -0.808807  0.605657  1.430232
2013-01-05 -0.792209 -1.563374  0.476319  1.140602
>>> df.tail()
                   A         B         C         D
2013-01-02 -0.466551  1.770887  0.530980 -0.783895
2013-01-03  0.490643 -0.918222  0.979640  0.423781
2013-01-04  0.528427 -0.808807  0.605657  1.430232
2013-01-05 -0.792209 -1.563374  0.476319  1.140602
2013-01-06  0.179147 -0.156790  0.273832  1.225799
>>> df.index
DatetimeIndex(['2013-01-01', '2013-01-02', '2013-01-03', '2013-01-04',
               '2013-01-05', '2013-01-06'],
              dtype='datetime64[ns]', freq='D')
>>> df.columns
Index([u'A', u'B', u'C', u'D'], dtype='object')
>>> df.values
array([[-1.27705517, -0.46331279, -0.45436985, -0.17524309],
       [-0.46655075,  1.77088663,  0.53098007, -0.78389515],
       [ 0.49064344, -0.91822226,  0.97964037,  0.42378052],
       [ 0.52842695, -0.80880698,  0.60565654,  1.43023196],
       [-0.7922092 , -1.5633739 ,  0.47631935,  1.14060203],
       [ 0.17914748, -0.15678958,  0.27383177,  1.22579884]])
>>> df.describe()
              A         B         C         D
count  6.000000  6.000000  6.000000  6.000000
mean  -0.222933 -0.356603  0.402010  0.543546
std    0.738918  1.144878  0.478978  0.883319
min   -1.277055 -1.563374 -0.454370 -0.783895
25%   -0.710795 -0.890868  0.324454 -0.025487
50%   -0.143702 -0.636060  0.503650  0.782191
75%    0.412769 -0.233420  0.586987  1.204500
max    0.528427  1.770887  0.979640  1.430232
>>> df.T
   2013-01-01  2013-01-02  2013-01-03  2013-01-04  2013-01-05  2013-01-06
A   -1.277055   -0.466551    0.490643    0.528427   -0.792209    0.179147
B   -0.463313    1.770887   -0.918222   -0.808807   -1.563374   -0.156790
C   -0.454370    0.530980    0.979640    0.605657    0.476319    0.273832
D   -0.175243   -0.783895    0.423781    1.430232    1.140602    1.225799
>>> df.sort_index(axis=1, ascending=False)
                   D         C         B         A
2013-01-01 -0.175243 -0.454370 -0.463313 -1.277055
2013-01-02 -0.783895  0.530980  1.770887 -0.466551
2013-01-03  0.423781  0.979640 -0.918222  0.490643
2013-01-04  1.430232  0.605657 -0.808807  0.528427
2013-01-05  1.140602  0.476319 -1.563374 -0.792209
2013-01-06  1.225799  0.273832 -0.156790  0.179147
>>> df.sort_values(by='B')
                   A         B         C         D
2013-01-05 -0.792209 -1.563374  0.476319  1.140602
2013-01-03  0.490643 -0.918222  0.979640  0.423781
2013-01-04  0.528427 -0.808807  0.605657  1.430232
2013-01-01 -1.277055 -0.463313 -0.454370 -0.175243
2013-01-06  0.179147 -0.156790  0.273832  1.225799
2013-01-02 -0.466551  1.770887  0.530980 -0.783895
>>> df['A']
2013-01-01   -1.277055
2013-01-02   -0.466551
2013-01-03    0.490643
2013-01-04    0.528427
2013-01-05   -0.792209
2013-01-06    0.179147
Freq: D, Name: A, dtype: float64
>>> df[0:3]
                   A         B        C         D
2013-01-01 -1.277055 -0.463313 -0.45437 -0.175243
2013-01-02 -0.466551  1.770887  0.53098 -0.783895
2013-01-03  0.490643 -0.918222  0.97964  0.423781

>>> df['20130102':'20130104']
                   A         B         C         D
2013-01-02 -0.466551  1.770887  0.530980 -0.783895
2013-01-03  0.490643 -0.918222  0.979640  0.423781
2013-01-04  0.528427 -0.808807  0.605657  1.430232
>>> df.loc[dates[0]]
A   -1.277055
B   -0.463313
C   -0.454370
D   -0.175243
Name: 2013-01-01 00:00:00, dtype: float64
>>> df.loc[:, ['A', 'B']]
                   A         B
2013-01-01 -1.277055 -0.463313
2013-01-02 -0.466551  1.770887
2013-01-03  0.490643 -0.918222
2013-01-04  0.528427 -0.808807
2013-01-05 -0.792209 -1.563374
2013-01-06  0.179147 -0.156790

>>> df.loc[dates[0], 'A']
-1.2770551741354739
>>> df.at[dates[0], 'A']
-1.2770551741354739
>>> df.iloc[3]
A    0.528427
B   -0.808807
C    0.605657
D    1.430232
Name: 2013-01-04 00:00:00, dtype: float64

>>> df.iloc[3:5, 0:2]
                   A         B
2013-01-04  0.528427 -0.808807
2013-01-05 -0.792209 -1.563374
>>> df.iloc[[1, 2, 4], [0, 2]]
                   A         C
2013-01-02 -0.466551  0.530980
2013-01-03  0.490643  0.979640
2013-01-05 -0.792209  0.476319
>>> df.iloc[1:3, :]
                   A         B        C         D
2013-01-02 -0.466551  1.770887  0.53098 -0.783895
2013-01-03  0.490643 -0.918222  0.97964  0.423781
>>> df.iloc[:, 1:3]
                   B         C
2013-01-01 -0.463313 -0.454370
2013-01-02  1.770887  0.530980
2013-01-03 -0.918222  0.979640
2013-01-04 -0.808807  0.605657
2013-01-05 -1.563374  0.476319
2013-01-06 -0.156790  0.273832
>>> df.iloc[1, 1]
1.770886631960771
>>> df.iat[1, 1]
1.770886631960771

>>> df[df.A > 0]
                   A         B         C         D
2013-01-03  0.490643 -0.918222  0.979640  0.423781
2013-01-04  0.528427 -0.808807  0.605657  1.430232
2013-01-06  0.179147 -0.156790  0.273832  1.225799
>>> df[df > 0]
                   A         B         C         D
2013-01-01       NaN       NaN       NaN       NaN
2013-01-02       NaN  1.770887  0.530980       NaN
2013-01-03  0.490643       NaN  0.979640  0.423781
2013-01-04  0.528427       NaN  0.605657  1.430232
2013-01-05       NaN       NaN  0.476319  1.140602
2013-01-06  0.179147       NaN  0.273832  1.225799
>>> df2 = df.copy()
>>> df2['E'] = ['one', 'one', 'two', 'three', 'four', 'three']
>>> df2
                   A         B         C         D      E
2013-01-01 -1.277055 -0.463313 -0.454370 -0.175243    one
2013-01-02 -0.466551  1.770887  0.530980 -0.783895    one
2013-01-03  0.490643 -0.918222  0.979640  0.423781    two
2013-01-04  0.528427 -0.808807  0.605657  1.430232  three
2013-01-05 -0.792209 -1.563374  0.476319  1.140602   four
2013-01-06  0.179147 -0.156790  0.273832  1.225799  three
>>> df2[df2['E'].isin(['two', 'four'])]
                   A         B         C         D     E
2013-01-03  0.490643 -0.918222  0.979640  0.423781   two
2013-01-05 -0.792209 -1.563374  0.476319  1.140602  four

As you can see, DataFrames support a wide variety of operations, including sorting, slicing (to filter the data) and even transposing. To learn more about DataFrames and Pandas in general, you can read their excellent documentation.
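One operation mentioned earlier but not shown above is grouping. As a small sketch (the column names and values here are made up for the example), groupby() splits a DataFrame by the values of a column and then aggregates each group:

```python
import pandas as pd

# Hypothetical data just for the example
df = pd.DataFrame({
    'team':  ['red', 'blue', 'red', 'blue', 'red'],
    'score': [10, 20, 30, 40, 80],
})

# Split the rows by team, then compute the mean score of each group
means = df.groupby('team')['score'].mean()
print(means)
```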

Matplotlib

Just before we get to the actual subject of machine learning, it’s worth mentioning that we often want to visualize the data instead of reading raw data points (which in ML projects are often scaled between 0 and 1 and thus not very easy to interpret).

For that, SciPy has yet another library called Matplotlib. Here is a quick example from another site of mine, where I use it to plot a pie chart of the top five domains in a list of emails (the domains with the most emails):

>>> import matplotlib.pyplot as plt

>>> # Data to plot
>>> labels = 'yahoo.com', 'hotmail.com', 'aol.com', 'gmail.com', 'wanadoo.fr'
>>> sizes = [2352, 462, 248, 188, 46]
>>> colors = ['gold', 'yellowgreen', 'lightcoral', 'lightskyblue', 'red']
>>> explode = (0.1, 0, 0, 0, 0)  # explode 1st slice
>>> # Plot
>>> plt.pie(sizes, explode=explode, labels=labels, colors=colors,
...         autopct='%1.1f%%', shadow=True, startangle=140)
>>> plt.axis('equal')
>>> plt.show()

# The result

[Pie chart of the top five email domains]

Scikit Learn

Machine learning operations can be divided into two very general categories: supervised learning and unsupervised learning.

Basically, the process is the same: we feed an algorithm some data to learn from (we actually call this training a model) and then we ask the program to make an educated guess as to what class some new data points belong to, or to estimate a specific value (usually called a target). The output may be a continuous variable (a series of data points) or a discrete variable.

With Scikit Learn, the data almost always takes the shape of a two-dimensional array. For supervised learning (like classification and regression), it’s best practice to split the data into two parts: a training data set and a test data set. We train the model by calling the fit() function on the training data set and then ask the library to make an educated guess about new data, assigning labels to it. In Scikit Learn, this is called making a prediction and it’s done by calling predict() on a classifier. It’s also a best practice to have some test labels at hand, so that you can evaluate how confident the prediction of an algorithm is. You can do this in Scikit Learn by using the accuracy_score function.
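Here is a minimal sketch of that workflow on made-up data. The train_test_split helper does the splitting for us (note: in recent scikit-learn releases it lives in sklearn.model_selection; in older ones like the version shown earlier it was in sklearn.cross_validation):

```python
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB
from sklearn.metrics import accuracy_score

# Made-up 2D data: label 1 when the second coordinate is the larger one
features = [[1, 2], [2, 1], [3, 5], [5, 3], [2, 4], [4, 2], [1, 3], [3, 1]]
labels = [1, 0, 1, 0, 1, 0, 1, 0]

# Hold back a quarter of the data to evaluate the model later
features_train, features_test, labels_train, labels_test = train_test_split(
    features, labels, test_size=0.25, random_state=42)

clf = GaussianNB()
clf.fit(features_train, labels_train)    # training the model
pred = clf.predict(features_test)        # making a prediction

print(accuracy_score(labels_test, pred))
```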

The code that follows is an example of supervised learning taken from two GitHub repositories. They belong to students who apparently took the Udacity machine learning course. The first one is here and here is the link to the second one.

In this program, we are trying to predict which person wrote an email. The output is a discrete variable, 0 or 1, each corresponding to a label representing a person (named Sara or Chris). In machine learning, most of the time is spent preparing data.

In this particular example, the data is preprocessed and dumped into a file using a utility called pickle (pickle is part of the Python standard library; you can learn more about it in the Python documentation).

In order for this code to work, you must first download the email_preprocess.py file and put it into your current working directory. You must also download the files that contain the actual data, word_data.pkl and email_authors.pkl.

#!/usr/bin/python

"""
    This is the code to accompany the Lesson 1 (Naive Bayes) mini-project.

    Use a Naive Bayes Classifier to identify emails by their authors

    authors and labels:
    Sara has label 0
    Chris has label 1
"""

from time import time

from email_preprocess import preprocess


### features_train and features_test are the features for the training
### and testing datasets, respectively
### labels_train and labels_test are the corresponding item labels
features_train, features_test, labels_train, labels_test = preprocess()


#########################################################
from sklearn.naive_bayes import GaussianNB
from sklearn.metrics import accuracy_score

# the classifier
clf = GaussianNB()

# train
t0 = time()
clf.fit(features_train, labels_train)
print("\ntraining time:", round(time() - t0, 3), "s")

# predict
t0 = time()
pred = clf.predict(features_test)
print("predicting time:", round(time() - t0, 3), "s")

accuracy = accuracy_score(labels_test, pred)

print('\naccuracy = {0}'.format(accuracy))

# Output
# May differ in your environment

no. of Chris training emails: 7936
no. of Sara training emails: 7884

training time: 1.33 s
predicting time: 0.203 s

accuracy = 0.973265073948

Note: to see the actual prediction, just print the pred variable to the screen. You can also use Python functions like len() and sum() to make comparisons with the actual data.
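For example, since the labels are just 0s and 1s, summing over pred counts the emails attributed to Chris. The array below is a tiny made-up stand-in so the idea is easy to check:

```python
# A made-up stand-in for the array returned by clf.predict() above
pred = [0, 1, 1, 0, 1]

print(len(pred))              # number of test emails: 5
print(sum(pred))              # emails predicted as Chris (label 1): 3
print(len(pred) - sum(pred))  # emails predicted as Sara (label 0): 2
```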

As you can see, we’ve used Naive Bayes for the classifier above, but Scikit Learn supports many other algorithms, like SVMs (Support Vector Machines) or decision trees. To try them out, you can simply change the one line that defines the classifier, for example:

# SVM
# Check out the documentation for the many parameters,
# notably the kernel and gamma, to fine-tune the decision boundary

>>> from sklearn.svm import SVC
>>> clf = SVC()
>>> # Run the rest of the program unchanged

# Decision Trees
# Check out the documentation for the parameters,
# notably the minimum number of samples per split
# and how it influences fitting the model

>>> from sklearn import tree
>>> clf = tree.DecisionTreeClassifier()
>>> # Run the rest of the program unchanged

Other algorithms include (but are not limited to) Random Forests, AdaBoost and kNN (k-Nearest Neighbors).
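For instance, swapping in a k-Nearest Neighbors classifier follows the same one-line pattern as the examples above (n_neighbors is the main parameter to tune):

```python
from sklearn.neighbors import KNeighborsClassifier

# n_neighbors is the main parameter to tune
clf = KNeighborsClassifier(n_neighbors=3)
# Run the rest of the program unchanged
```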

References

  • Intro to Data Analysis: the Udacity course, which you can find here.
  • Introduction to Machine Learning: the Udacity course, which you can find here.
  • SciPy: you can learn more about this group of open-source libraries here.
  • Scikit Learn: you can learn more about this library here.
