This is one of the better descriptions I have seen of what a data scientist does.
They must find interesting, novel, and useful insights about the real world in the data. And they must turn those insights into products and services, and deliver those products and services at a profit.
Notice that data scientists don’t just need to find insights in data. They also need to create profitable products from those insights. I often feel that data products are not seen as being as important as improving the machine learning algorithms, but the data products really are the end goal.
The quote came from the Harvard Business Review article, To Work with Data, You Need a Lab and a Factory.
Here is a very nice slide deck from Jeff Hammerbacher of Cloudera. It goes over k-means clustering and some enhancements.
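For readers who have not seen k-means before, here is a minimal sketch of the basic algorithm in plain Python. It is not taken from the slide deck, and it covers none of the enhancements. The function name `kmeans`, the toy points, and the choice to start from the first k points are all my own assumptions for illustration.

```python
def kmeans(points, k, iterations=20):
    """Basic k-means (Lloyd's algorithm) on a list of (x, y) tuples."""
    # Assumption for this sketch: use the first k points as starting centroids.
    # Real implementations usually pick smarter starting positions.
    centroids = list(points[:k])
    assignments = [0] * len(points)
    for _ in range(iterations):
        # Assignment step: attach each point to its nearest centroid.
        assignments = [
            min(
                range(k),
                key=lambda c: (p[0] - centroids[c][0]) ** 2
                + (p[1] - centroids[c][1]) ** 2,
            )
            for p in points
        ]
        # Update step: move each centroid to the mean of its points.
        for c in range(k):
            members = [p for p, a in zip(points, assignments) if a == c]
            if members:  # keep the old centroid if a cluster is empty
                centroids[c] = (
                    sum(p[0] for p in members) / len(members),
                    sum(p[1] for p in members) / len(members),
                )
    return centroids, assignments


# Toy data: two obvious groups, one near (1, 1) and one near (8, 8).
points = [(1.0, 1.0), (1.2, 0.8), (0.9, 1.1), (8.0, 8.0), (8.2, 7.9), (7.8, 8.1)]
centroids, assignments = kmeans(points, k=2)
print(centroids)
print(assignments)
```

On this toy data the two groups separate after a couple of iterations, even though both starting centroids sit in the same group. On real data the result can depend heavily on the starting centroids.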
Deep Learning is a new term that is starting to appear in the data science/machine learning news.
What is Deep Learning?
According to DeepLearning.net, the definition goes like this:
Deep Learning is a new area of Machine Learning research, which has been introduced with the objective of moving Machine Learning closer to one of its original goals: Artificial Intelligence.
Wikipedia provides the following definition:
Deep learning is a set of algorithms in machine learning that attempt to learn layered models of inputs, commonly neural networks. The layers in such models correspond to distinct levels of concepts, where higher-level concepts are defined from lower-level ones, and the same lower-level concepts can help to define many higher-level concepts.
Deep Learning is sometimes referred to as deep neural networks since much of deep learning focuses on artificial neural networks. Artificial neural networks are a technique in computer science modelled after the connections (synapses) of neurons in the brain. Artificial neural networks, sometimes just called neural nets, have been around for about 50 years, but advances in computer processing power and storage are finally allowing neural nets to improve solutions for complex problems such as speech recognition, computer vision, and Natural Language Processing (NLP).
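To make the idea of a layered model a little more concrete, here is a small sketch of a forward pass through a tiny neural net in plain Python. Everything in it is an assumption for illustration: the weights are made up, the sigmoid activation is just one common choice, and nothing is being learned. It only shows how each layer's output becomes the next layer's input.

```python
import math


def sigmoid(x):
    """Squash any real number into the range (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))


def layer(inputs, weights, biases):
    """One fully connected layer: weighted sum per neuron, then sigmoid."""
    return [
        sigmoid(sum(w * x for w, x in zip(row, inputs)) + b)
        for row, b in zip(weights, biases)
    ]


def forward(inputs, layers):
    """Feed the inputs through each layer in turn."""
    activation = inputs
    for weights, biases in layers:
        activation = layer(activation, weights, biases)
    return activation


# Hypothetical, hand-picked weights: 2 inputs -> 3 hidden neurons -> 1 output.
network = [
    ([[0.5, -0.5], [0.3, 0.8], [-0.2, 0.1]], [0.0, 0.1, -0.1]),
    ([[1.0, -1.0, 0.5]], [0.0]),
]
output = forward([1.0, 0.0], network)
print(output)
```

A deep network stacks many more layers than this, and training means adjusting the weights from data rather than picking them by hand.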
Hopefully, this blog post provides some inspiration and useful links to help you learn more about deep learning.
How is Deep Learning being applied?
The following talk, Tera-scale Deep Learning, by Quoc V. Le of Stanford gives some indication of the size of problems to be tackled. The talk discusses work being done on a cluster of 2000 machines and more than 1,000,000,000 parameters.
Startup50’s list of 42 Big Data Startups.
The voting is done, but the list contains plenty of startups working in the data science field.
The following video goes well with the previous post about Open Source Alternatives to AWS.
It says a lot for the quality of OpenStack, since one of the world’s most secretive organizations trusts it. OpenStack might be a good option for data teams needing to quickly build and deploy data products.
Note: This post has nothing to do with the recent NSA whistleblower news.
Working with big data can often mean doing some cloud computing. If a public cloud like Amazon AWS is not an option, there are some open source alternatives. They all offer some level of compatibility with the AWS API for both EC2 (compute) and S3 (storage).
- Rackspace OpenStack
- Apache CloudStack
I don’t think anybody does it better than Hans Rosling. In the following video he helps to explain population growth, child mortality, and fossil fuel usage based upon wealth. I love how he uses toy blocks and chips to help visualize his point.
See the original post from the Guardian, Hans Rosling: the man who’s making data cool.
The blog post, Central Limit Theorem Visualized in D3, was posted last week.
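If you would like to check the theorem numerically before looking at the visualization, here is a rough sketch in plain Python. It is not related to the D3 code in that post. The function name `sample_means`, the sample size of 30, the 2000 trials, and the seed are all arbitrary choices of mine.

```python
import math
import random
import statistics


def sample_means(sample_size, trials, seed=42):
    """Draw `trials` samples from Uniform(0, 1) and return the mean of each."""
    rng = random.Random(seed)
    return [
        sum(rng.random() for _ in range(sample_size)) / sample_size
        for _ in range(trials)
    ]


means = sample_means(sample_size=30, trials=2000)

# Uniform(0, 1) has mean 0.5 and variance 1/12, so the theorem suggests the
# sample means should cluster around 0.5 with a spread near sqrt(1/12 / 30).
observed_center = statistics.mean(means)
observed_spread = statistics.stdev(means)
expected_spread = math.sqrt((1 / 12) / 30)
print(observed_center, observed_spread, expected_spread)
```

The individual draws are flat between 0 and 1, but the means pile up in a bell shape around 0.5, which is the behavior the theorem describes.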
IEEE Spectrum’s Techwise Conversations just published an excellent podcast titled Is Data Science Your Next Career? In it, the host interviews Chris Wiggins of Columbia University.
Note: If you don’t enjoy podcasts, the link contains the entire text for reading as well.
Zipfian Academy, the same company that is creating the 12-week intensive data science training course, will be offering a series of 6 short courses on data science. The courses will be 1.5 hours each and will be taught live in San Francisco. For those of you who cannot be in San Francisco, the courses will be recorded and available online.
The short courses are not free ($35 each or $150 for all), and seating is limited so that all students have access to the instructors. The first short course starts tomorrow (May 28, 2013), so register now if you are interested. Here are links to all the short courses: