What is machine learning?

There are some problems where the solution is not known in advance, so conventional programming techniques do not work. Instead, we get the computer to gather statistics from the data and determine the important factors itself. The computer can process vast amounts of data and detect significant patterns that a human would be unlikely to find.

There are different types of machine learning, but most of them can be classified as either supervised learning or unsupervised learning. Supervised learning is where you have labelled data. For example, you could have past loan decisions (the label would be either loan granted or loan refused) along with the application forms and other data (such as credit rating). Unsupervised learning is where you don’t know the answers in advance, or even what kinds of answers there are. For example, you could look for films that are likely to be popular with a particular person (as in Netflix recommendations).

Supervised learning

Supervised learning is generally done with data that has already been evaluated by a person or possibly another algorithm. In the loan example we might take into account the type of loan (mortgage, credit card, overdraft), the employment history, wages, outstanding debt, and credit score. Some of these values are categorical (the type of loan), and some are quantitative (the amount of outstanding debt). Sometimes it may be worth changing a quantitative value into a categorical value (0-5000 debt, 5000-15000 debt, and 15000 or more debt). One problem with supervised learning is that it can reinforce biases in the original data. If loans were disproportionately refused to African Americans, the algorithm could pick that up even if the race of the applicant is not stored directly in the data, because other fields can be correlated with it. There is an article in Forbes on this problem.
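
As a rough illustration of turning a quantitative value into a categorical one, here is a minimal sketch in Python. The function name debt_band is made up for this example, and treating each upper boundary as exclusive (so that 5000 falls into the middle band) is an assumption, since the bands in the text share their boundary values.

```python
def debt_band(debt):
    """Map an outstanding-debt amount to one of three bands.

    The band boundaries follow the example in the text. Treating each
    upper bound as exclusive is an assumption made here so that the
    bands do not overlap.
    """
    if debt < 0:
        raise ValueError("debt cannot be negative")
    if debt < 5000:
        return "0-5000"
    if debt < 15000:
        return "5000-15000"
    return "15000+"


# Hypothetical debt amounts, purely for illustration.
amounts = [1200, 5000, 14999.99, 15000, 40000]
bands = [debt_band(a) for a in amounts]
print(bands)
```

Whether banding like this helps depends on the learning algorithm and the data. It throws away detail, but it can make a model easier to interpret.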

Unsupervised learning

One of the algorithms used for unsupervised learning is k-means, which tries to minimize the distance between each point and the center of its group in a multidimensional space. For the Netflix example, you would consider how films are rated by the 5000 most prolific people (the ones with lots of rating information). You assign each person to a random group, find the center point of each group, and then move each person to the group with the closest center. You keep doing this until there are no changes, or until a certain number of loops has been reached. You can then match any other person to the group they are closest to, and use the ratings of the people in that group to determine films they are likely to like.
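
The loop described above can be sketched in plain Python as follows. This is a simplified illustration and not how Netflix or any particular library does it. The function name k_means, its parameters, the tiny ratings list, and the way an empty group is handled (re-seeding it with a random point) are all assumptions made for the example.

```python
import random


def k_means(points, k, max_loops=100, seed=0):
    """Group points (equal-length tuples of numbers) into k groups.

    Returns a list holding the group index of each point.
    """
    rng = random.Random(seed)
    dims = len(points[0])

    # Step 1: assign every point to a random group.
    groups = [rng.randrange(k) for _ in points]

    for _ in range(max_loops):
        # Step 2: find the center (mean) of each group.
        centers = []
        for g in range(k):
            members = [p for p, grp in zip(points, groups) if grp == g]
            if not members:
                # Assumption: an empty group is re-seeded with a random point.
                members = [rng.choice(points)]
            centers.append(
                tuple(sum(p[d] for p in members) / len(members) for d in range(dims))
            )

        # Step 3: move each point to the group with the closest center
        # (squared Euclidean distance; ties go to the lower group index).
        new_groups = [
            min(
                range(k),
                key=lambda g: sum((a - b) ** 2 for a, b in zip(p, centers[g])),
            )
            for p in points
        ]

        # Step 4: stop as soon as nobody changes group.
        if new_groups == groups:
            break
        groups = new_groups

    return groups


# Hypothetical ratings: each tuple is one person's scores for three films.
ratings = [(5, 5, 1), (4, 4, 2), (2, 2, 4), (1, 1, 5)]
groups = k_means(ratings, k=2)
print(groups)
```

With these four made-up people, the first two end up in one group and the last two in the other. Which group is numbered 0 depends on the random starting assignment. On real data the result can also depend on that starting assignment, so it is common to run the algorithm several times and keep the best grouping.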