In geometry, two sets of points in a two-dimensional plane are linearly separable if we can divide them with a line. In n dimensions, the separator is a hyperplane. In machine learning, we use this property to classify data. Let's first understand what we mean by learning here.
Let's take the following example. For now, let's ignore how we convert our problem into Cartesian coordinates.
The problem: when a new point arrives (consider the green cross in the picture below), which category do we put it in? Does it belong to the blue cluster or the red cluster?
Most of us will agree that it belongs to the red cluster, because it is close to it. What we do intuitively is draw a line between the two clusters and decide based on which side the point falls on.
In two dimensions the separator is a line; in higher dimensions it is a hyperplane.
If the data can be separated using a hyperplane (a straight line in two dimensions), we call the data linearly separable. What we have just defined is a simple classifier that divides the data into two classes. How can we do that if we have more than two classes?
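Before moving on to that, here is a minimal sketch of such a two-class linear classifier in Python. The clusters, the weight vector w, and the bias b are all hypothetical and picked by hand for illustration; a learning algorithm (e.g. the perceptron) would normally estimate w and b from the data.

```python
import numpy as np

# Hypothetical 2-D data: a blue cluster around (1, 1)
# and a red cluster around (4, 4).
rng = np.random.default_rng(0)
blue = rng.normal(loc=[1.0, 1.0], scale=0.5, size=(20, 2))
red = rng.normal(loc=[4.0, 4.0], scale=0.5, size=(20, 2))

# A separating line can be written as w . x + b = 0.
# These values are hand-picked for this sketch.
w = np.array([1.0, 1.0])
b = -5.0

def classify(point):
    """Red if the point falls on the positive side of the line,
    blue otherwise."""
    return "red" if np.dot(w, point) + b > 0 else "blue"

print(classify([4.5, 3.8]))  # red: close to the red cluster
print(all(classify(p) == "blue" for p in blue))  # True
print(all(classify(p) == "red" for p in red))    # True
```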
Multi-class classification
Once we have an algorithm to classify the data into two classes, we can generalise it to multiple classes. There are two approaches for doing that.
- One vs Rest (one vs all): Here we have n classifiers, i.e. n hyperplanes. First we check whether a point belongs to class1, by drawing a hyperplane between class1 and the rest of the classes (all other classes merged into one). We repeat this for every class, and the class whose classifier is most confident wins.
- One vs One: This is more involved but useful for certain kinds of problems. Here we have n * (n - 1) / 2 classifiers. For 4 classes that is 6 classifiers: class1 vs class2, class1 vs class3, class1 vs class4, class2 vs class3, class2 vs class4, class3 vs class4. Whichever class wins the most pairwise comparisons is the predicted class. Both approaches are sketched in code after this list.
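Both strategies are available in scikit-learn (see the multi-class reference below). A brief sketch, assuming synthetic blob data purely for illustration:

```python
from sklearn.datasets import make_blobs
from sklearn.multiclass import OneVsOneClassifier, OneVsRestClassifier
from sklearn.svm import LinearSVC

# Synthetic 2-D data with 4 classes (hypothetical, for illustration).
X, y = make_blobs(n_samples=200, centers=4, random_state=0)

# One vs Rest: one hyperplane per class -> 4 classifiers.
ovr = OneVsRestClassifier(LinearSVC()).fit(X, y)

# One vs One: one classifier per pair -> 4 * 3 / 2 = 6 classifiers.
ovo = OneVsOneClassifier(LinearSVC()).fit(X, y)

print(len(ovr.estimators_), len(ovo.estimators_))  # 4 6
print(ovr.predict(X[:3]), ovo.predict(X[:3]))
```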
Revisiting linear separability
Not all problems are linearly separable. Take the following example.
We can't draw any line that separates these two classes; the only way is to draw an ellipse. Such problems are solved in two ways. The first, as just said, is to draw an ellipse instead of a line, i.e. use a non-linear function to classify the data.
The other way (e.g. the kernel trick in SVMs) is to project the data into a higher dimension and check whether it becomes linearly separable there. If we project the above data into a third dimension, we see the following.
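A common sketch of this idea, assuming made-up concentric data and using the squared distance from the origin, x² + y², as the hypothetical third dimension:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical data: an inner cluster surrounded by an outer ring;
# no straight line in the plane separates the two classes.
inner = rng.normal(scale=0.3, size=(50, 2))
angles = rng.uniform(0.0, 2.0 * np.pi, 50)
outer = np.c_[3.0 * np.cos(angles), 3.0 * np.sin(angles)]
outer += rng.normal(scale=0.2, size=outer.shape)

def project(points):
    """Map (x, y) -> (x, y, x^2 + y^2): add the squared distance
    from the origin as a third coordinate."""
    return np.c_[points, (points ** 2).sum(axis=1)]

# In 3-D the inner points sit low on the new axis and the ring sits
# high, so any plane between the two bands separates the classes.
print(project(inner)[:, 2].max())  # small (near 0)
print(project(outer)[:, 2].min())  # large (near 9)
```

A linear separator fitted on the projected data corresponds to an ellipse back in the original plane.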
Clearly, the projected data is linearly separable using a hyperplane. In the next blog, we will see how we can use this for sentiment analysis.
References
- Linear separability: http://en.wikipedia.org/wiki/Linear_separability
- SVM example: http://stackoverflow.com/questions/9480605/what-is-the-relation-between-the-number-of-support-vectors-and-training-data-and
- Multi-class classification: http://scikit-learn.org/stable/modules/multiclass.html
- Naive Bayes classifier: http://www.statsoft.com/textbook/naive-bayes-classifier/