What is Machine Learning?

Machine Learning is a term that can mean different things to different people. Andrew Ng, cofounder of Coursera and Professor at Stanford, provides two definitions in his popular Machine Learning Course. The first definition comes from Arthur Samuel around 1959.

Field of study that gives computers the ability to learn without being explicitly programmed.

The second definition comes from Tom Mitchell’s 1997 Machine Learning textbook. This definition is a bit more formal and rigorous. This book defines a well-posed learning problem as:

A computer program is said to learn from experience E with respect to some task T and some performance measure P, if its performance on T, as measured by P, improves with experience E.
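
To make Mitchell's definition concrete, here is a minimal sketch in Python. Everything in it is an assumption made up for illustration: the hidden rule (TRUE_THRESHOLD), the order of the training points, and the helper names. The task T is classifying numbers as positive or negative, the experience E is a growing list of labeled examples, and the performance measure P is accuracy on a fixed test set.

```python
# Hypothetical hidden rule, invented for this sketch: x >= 0.62 is "positive".
TRUE_THRESHOLD = 0.62

def label(x):
    return x >= TRUE_THRESHOLD

def learn_threshold(examples):
    """Learn from experience E: a list of (x, is_positive) pairs."""
    negatives = [x for x, y in examples if not y]
    positives = [x for x, y in examples if y]
    low = max(negatives, default=0.0)
    high = min(positives, default=1.0)
    # Guess the boundary halfway between the closest negative and positive.
    return (low + high) / 2

def accuracy(threshold, test_points):
    """Performance measure P: fraction of test points classified correctly."""
    correct = sum((x >= threshold) == label(x) for x in test_points)
    return correct / len(test_points)

# Fixed evaluation set for task T.
test_points = [i / 100 for i in range(100)]

# Training points in the order they are "experienced".
stream = [0.1, 0.9, 0.3, 0.8, 0.5, 0.7, 0.6, 0.65]
experience = [(x, label(x)) for x in stream]

# P measured after 2, 4, and 8 examples of E.
scores = [accuracy(learn_threshold(experience[:n]), test_points)
          for n in (2, 4, 8)]
print(scores)
```

With this particular stream, accuracy rises as more examples are seen, which is exactly what the definition asks of a program that "learns".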

Machine Learning Categories

Machine learning can be broken down into a few categories. The two most popular are supervised and unsupervised learning. A couple of other categories are recommender systems and reinforcement learning.

Supervised Learning

Probably the most common category of machine learning, supervised learning is concerned with fitting a model to labeled data. Labeled data is data in which each example comes with the correct answer. Regression (predicting a number) and classification (predicting a category) are the most common types of problems in supervised learning.
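
As a rough illustration of regression, the sketch below fits a straight line to a handful of labeled points using ordinary least squares. The data (sizes and prices) and the function names are made up for this example and do not come from the course.

```python
def fit_line(xs, ys):
    """Ordinary least squares fit of y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept

# Made-up labeled data: each input (size) comes with the correct answer (price).
sizes = [1.0, 2.0, 3.0, 4.0]
prices = [3.1, 4.9, 7.2, 8.8]

slope, intercept = fit_line(sizes, prices)

def predict(size):
    """Use the fitted model on an input it has not seen."""
    return slope * size + intercept

print(slope, intercept, predict(5.0))
```

The labels are what make this supervised: the fit is driven entirely by the gap between the line and the known prices.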

Unsupervised Learning

Unsupervised learning deals with unlabeled data. Because no correct answers are supplied, the goal of unsupervised learning is to find structure in the data itself. Clustering is probably the most common technique.
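
One way to picture clustering is a bare-bones k-means on one-dimensional data, sketched below. The points and the starting centers are invented for this example, and a real implementation would handle initialization and convergence more carefully.

```python
def k_means_1d(points, centers, iterations=10):
    """Very small k-means for 1-D data; `centers` holds the initial guesses."""
    clusters = [[] for _ in centers]
    for _ in range(iterations):
        # Assignment step: put each point in the cluster of its nearest center.
        clusters = [[] for _ in centers]
        for p in points:
            nearest = min(range(len(centers)), key=lambda i: abs(p - centers[i]))
            clusters[nearest].append(p)
        # Update step: move each center to the mean of its cluster.
        centers = [sum(c) / len(c) if c else centers[i]
                   for i, c in enumerate(clusters)]
    return centers, clusters

# Unlabeled data: no correct answers are supplied.
points = [1.0, 1.2, 0.8, 5.0, 5.3, 4.7]
centers, clusters = k_means_1d(points, centers=[0.0, 6.0])
print(centers, clusters)
```

Nothing told the algorithm that there were two groups near 1 and 5; it recovered that structure from the points alone (given that we asked for two clusters).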

Others

Recommender systems deal with making recommendations based upon previously collected data. Reinforcement learning is concerned with maximizing the reward earned by a given agent (a person, business, etc.).
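
For a feel of what "maximizing the reward" can look like, here is a small sketch of an epsilon-greedy agent on a multi-armed bandit, one of the simplest reinforcement learning settings. The reward probabilities, step count, and epsilon value are assumptions chosen for the example, not anything taken from the course.

```python
import random

def epsilon_greedy(reward_probs, steps=2000, epsilon=0.1, seed=0):
    """Agent that mostly picks the best-looking action but sometimes explores."""
    rng = random.Random(seed)
    counts = [0] * len(reward_probs)
    values = [0.0] * len(reward_probs)  # running average reward per action
    total_reward = 0
    for _ in range(steps):
        if rng.random() < epsilon:
            action = rng.randrange(len(reward_probs))  # explore
        else:
            action = max(range(len(values)), key=values.__getitem__)  # exploit
        # The environment pays 1 with the (hidden) probability of that action.
        reward = 1 if rng.random() < reward_probs[action] else 0
        counts[action] += 1
        values[action] += (reward - values[action]) / counts[action]
        total_reward += reward
    return values, total_reward

# Hypothetical environment: three actions with hidden payout probabilities.
values, total_reward = epsilon_greedy([0.2, 0.5, 0.8])
best_action = max(range(len(values)), key=values.__getitem__)
print(values, total_reward, best_action)
```

The agent is never told which action is best. It only sees rewards, and its estimates gradually steer it toward the action that pays most often.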

Learn More

Most of the above information comes from the Coursera Machine Learning Course. There is still time to sign up since the first assignments are not due until the end of the week.

Win-Vector Blog » Data Science, Machine Learning, and Statistics: what is in a name?

This is an excellent write-up of the differences between:

  • Statistics
  • Machine Learning
  • Data Mining
  • Informatics
  • Big Data
  • Predictive Analytics
  • Data Science

Startups working on Machine Learning as a Service (MLaaS)

  1. BigML – A great interface. Just upload your data and it shows basic information for each column such as a histogram and mean values. See the Gallery for some examples of the final models.
  2. Wise.io – Just launched, but it looks to be a serious contender. It was started by a team from UC Berkeley.
  3. Precog – Taking a slightly different approach, Precog is both a platform and an online IDE for data science. The IDE supports Quirrel, hyped as R for big data.
  4. Ersatz – Ersatz is currently in private beta, but they are building a web platform for building deep neural networks.

Am I missing any startups on this list?

While not really startups, the following two links might also fit here.

  1. Google Prediction API – Cloud-based machine learning tools
  2. PSI Project – A research project at the Australian National University.

12 Useful Tips for Machine Learning

Pedro Domingos of the Department of Computer Science and Engineering at the University of Washington provides a very useful paper with tips for machine learning. The paper is titled A Few Useful Things to Know about Machine Learning [pdf].

Below are the 12 useful tips.

  1. LEARNING = REPRESENTATION + EVALUATION + OPTIMIZATION
  2. IT’S GENERALIZATION THAT COUNTS
  3. DATA ALONE IS NOT ENOUGH
  4. OVERFITTING HAS MANY FACES
  5. INTUITION FAILS IN HIGH DIMENSIONS
  6. THEORETICAL GUARANTEES ARE NOT WHAT THEY SEEM
  7. FEATURE ENGINEERING IS THE KEY
  8. MORE DATA BEATS A CLEVERER ALGORITHM
  9. LEARN MANY MODELS, NOT JUST ONE
  10. SIMPLICITY DOES NOT IMPLY ACCURACY
  11. REPRESENTABLE DOES NOT IMPLY LEARNABLE
  12. CORRELATION DOES NOT IMPLY CAUSATION

For details and a good explanation of each, see the paper A Few Useful Things to Know about Machine Learning [pdf].
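
Tip 2 can be sketched in a few lines. The example below is not from the paper; the data, the two toy "models", and the train/test split are all assumptions made up for illustration. A lookup table that memorizes the training set scores perfectly on data it has seen, but only a model that captures the underlying rule does well on held-out data.

```python
def memorizer(train):
    """Memorize every training example; unseen inputs get an arbitrary default."""
    table = dict(train)
    return lambda x: table.get(x, False)

def threshold_rule(train):
    """Learn a single cut point between the largest negative and smallest positive."""
    low = max(x for x, y in train if not y)
    high = min(x for x, y in train if y)
    cut = (low + high) / 2
    return lambda x: x >= cut

def accuracy(model, data):
    return sum(model(x) == y for x, y in data) / len(data)

# Made-up data: the hidden rule is "x >= 10".
data = [(x, x >= 10) for x in range(20)]
train = [d for d in data if d[0] % 2 == 1]  # odd inputs are used for training
test = [d for d in data if d[0] % 2 == 0]   # even inputs are held out

for name, learner in [("memorizer", memorizer), ("threshold", threshold_rule)]:
    model = learner(train)
    print(name, accuracy(model, train), accuracy(model, test))
```

Both models look perfect when judged on the training set, which is why performance should be measured on data the model has not seen.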

Also, later this year, Pedro Domingos will be teaching a machine learning course via Coursera. Sign up if you are interested.

Data Analysis by Data Type

Data analysis is performed in many different fields and on many different types of data. Most fields call it something different. The following list comes straight from Jeff Leek’s Data Analysis Coursera class.

[Figure: Name of Data Analysis by Data Type]

The type of analysis is very similar for all fields, but what separates data science and machine learning from the others is the 3 V’s of big data. Data science and machine learning deal with a greater Volume, Variety, and Velocity (the speed at which new data appears) of data. Because it is now cheaper and easier than ever to store massive amounts of data, I think the other fields are beginning to realize the potential in big data. Signal processing is definitely becoming an area with big data, because electrical sensors are everywhere.

What are your thoughts? Do you see any real differences in the data analysis performed for the data types above?

Large-Scale Machine Learning at NYU

New York University is offering a Large Scale Machine Learning course starting later this month. This is NOT a MOOC, so it is not open to everyone. However, the lecture videos will be posted and possibly the other class handouts. This is not an introductory course, so knowledge of machine learning is a prerequisite. The course is being taught by John Langford of Microsoft Research and Yann LeCun of NYU.

For more about the course, see the original blog announcement.

Elements of Statistical Learning Textbook (Free)

The Elements of Statistical Learning textbook is available for free. It is a classic, widely used textbook for statistics and machine learning. Here is a far-from-complete list of some of the topics:

  • Supervised Learning
  • Linear/Logistic Regression
  • Regularization
  • Model Selection
  • Trees
  • Neural Networks
  • Support Vector Machines
  • Random Forests
  • Unsupervised Learning
  • Clustering

As you can see, the book is quite extensive.


Note: This book has been available for quite a while, but I realized I had not added a link to it on my blog.

Top ten algorithms in data mining (2007) [pdf] | Hacker News

The discussion below the link is also very good.

If you are curious, here are the 10 algorithms from the paper.

  1. C4.5
  2. k-Means
  3. SVM
  4. Apriori
  5. EM
  6. PageRank
  7. AdaBoost
  8. kNN
  9. Naive Bayes
  10. CART
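
To give a flavor of one entry on the list, here is a minimal sketch of kNN (k-nearest neighbors) classification. The training points and labels are invented for this example, and the code is a plain illustration of the idea rather than the formulation used in the paper.

```python
from collections import Counter

def knn_predict(train, query, k=3):
    """Classify `query` by majority vote among its k nearest training points."""
    def squared_distance(item):
        features, _ = item
        return sum((a - b) ** 2 for a, b in zip(features, query))

    nearest = sorted(train, key=squared_distance)[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]

# Made-up labeled points: (features, label).
train = [
    ((1.0, 1.0), "a"), ((1.2, 0.8), "a"), ((0.9, 1.1), "a"),
    ((5.0, 5.0), "b"), ((5.2, 4.9), "b"), ((4.8, 5.1), "b"),
]

near_a = knn_predict(train, (1.1, 1.0))
near_b = knn_predict(train, (5.0, 5.2))
print(near_a, near_b)
```

kNN has no training step at all: the "model" is just the stored data, and all the work happens at prediction time.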