The team that brought you the Analytics Handbook has freely published the third and final book, titled THE DATA ANALYTICS HANDBOOK RESEARCHERS + ACADEMICS. This book focuses on data science in the research and academic communities. Like the previous 2 books in the series, it includes interviews with top experts in the field. Here are just a few of the people with interviews in this book.
The authors are now working on a new data science training project called Leada. Check it out for more details.
Onur Akpolat has put together a curated list of awesome big data frameworks and resources. The list is very extensive and includes NoSQL databases, machine learning libraries, frameworks, filesystems, and more.
On a similar note, Joseph Misiti has compiled a large list of machine-learning-specific resources. The list is titled Awesome Machine Learning, and it includes resources for various languages, NLP, visualization, and more.
Both lists are on GitHub, so if you notice something missing from either list, feel free to add it. Contributions are welcome.
Tristan Zajonc, cofounder of Sense Platform, recently gave a thought-provoking talk at Data Driven NYC. He spoke about the future of data science productivity. According to Tristan:
In the next 2 or 3 years, everybody doing data science should be using a data science productivity platform…a cloud-based data science platform.
In addition to the productivity platforms, the power methods will see some improvements. Here are 2 that Tristan mentions:
- Probabilistic Programming – matching of computer science and Bayesian statistics
- Deep Reinforcement Learning – making optimal decisions via deep learning
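To make the probabilistic programming idea concrete, here is a minimal sketch in plain Python (not from Tristan's talk, and not any particular library's API). The model is written as ordinary code that simulates coin flips, and inference is done by simple rejection sampling. The function names, the uniform prior, and the 8-heads-in-10-flips data are all illustrative assumptions.

```python
import random


def model(rng, n_flips):
    """Generative model: draw a coin bias from a uniform prior, then simulate flips."""
    bias = rng.random()  # assumed prior: bias ~ Uniform(0, 1)
    heads = sum(rng.random() < bias for _ in range(n_flips))
    return bias, heads


def posterior_samples(observed_heads, n_flips, n_samples=2000, seed=0):
    """Rejection sampling: keep prior draws whose simulated data match the observation."""
    rng = random.Random(seed)  # fixed seed so the sketch is repeatable
    kept = []
    while len(kept) < n_samples:
        bias, heads = model(rng, n_flips)
        if heads == observed_heads:
            kept.append(bias)
    return kept


# Hypothetical data: 8 heads observed in 10 flips.
samples = posterior_samples(observed_heads=8, n_flips=10)
posterior_mean = sum(samples) / len(samples)
print(f"posterior mean of bias: {posterior_mean:.2f}")
```

With a uniform prior and 8 heads in 10 flips, the exact posterior is Beta(9, 3) with mean 0.75, so the estimate should land close to that. Real probabilistic programming systems replace the rejection loop with far more efficient inference, but the division of labor is the same: you write the model as a program, and the system handles the Bayesian inference.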
It is an exciting time for data science. I think the next few years will see much better productivity tools, workflows, and platforms. More on that in an upcoming blog post.
The other videos from Data Driven NYC are also available on YouTube.
As the field of data science continues to grow and mature, it is nice to begin seeing some distinction in the roles of a data scientist. A new job title gaining popularity is the data engineer. In this post, I lay out some of the distinctions between the 2 roles.
Data Scientist vs Data Engineer Venn Diagram
A data scientist is responsible for pulling insights from data. It is the data scientist's job to pull data, create models, create data products, and tell a story. A data scientist should typically have interactions with customers and/or executives. A data scientist should love scrubbing a dataset for more and more understanding.
The main goal of a data scientist is to produce data products and tell the stories of the data. A data scientist would typically have stronger statistics and presentation skills than a data engineer.
Data engineering is more focused on the systems that store and retrieve data. A data engineer will be responsible for building and deploying storage systems that can adequately handle the needs of the data. Sometimes the needs are fast real-time incoming data streams. Other times the needs are massive amounts of large video files. Still other times the needs are many, many reads of the data.
In other words, a data engineer needs to build systems that can handle the 3 Vs of big data (volume, velocity, and variety).
The main goal of a data engineer is to make sure the data is properly stored and available to the data scientist and others who need access. A data engineer would typically have stronger software engineering and programming skills than a data scientist.
It is too early to tell if these 2 roles will ever have a clear distinction of responsibilities, but it is nice to see a little separation of responsibilities for the mythical all-in-one data scientist. Both of these roles are important to a properly functioning data science team.
Do you see other distinctions between the roles?
There is an abundance of statistical programming languages available, and the fine folks at DataCamp started to compile some of the data about the languages. They then produced the infographic at the bottom of the post. To start with, SAS, R, and SPSS are the 3 languages being compared.
Here are 3 bits of information based upon the infographic:
- If you want a job – use SAS
- If you want to use the language of Kaggle winners – use R
- If you want to read analysis in an academic journal – use SPSS
I would love to see Python added to the infographic, but it might be much harder to get accurate numbers for Python since it is a general-purpose programming language, not just a statistical programming language. I would also love to see some benchmarks around both speed and number of steps (lines) to complete certain tasks. Anyhow, enjoy the infographic for yourself. Is there anything else you would like to see compared?
The widely popular Caltech course, Learning from Data, will be offered on edX this fall. The course starts September 25, 2014, and it will run for 10 weeks. Here is an abbreviated list of the course topics.
- Linear Models
- Neural Networks
- Cross Validation
- and much more
edX offers a number of other data-science-related courses. See all of them on the Statistics and Data Analysis course list.
Last week, I got the opportunity to spend some time with the team from Insight Data Engineering. They offer a free program that trains people to be data engineers. Then they help those people connect with a job at an impressive company. The program runs a few times a year and consists of 6 intense weeks learning about and working on a data engineering project.
Although the program is free, it does have a highly selective application process. Once accepted, you can expect the following:
- A beautiful office space in sunny Palo Alto, CA
- Mentoring from experts in the field
- Meet-and-greets with some of the biggest names in data science
- Introductions to some of the leading data engineering companies
- Access to a growing network of program alumni
- A bright future as a data engineer
Insight Data Engineering is the same company that has run Insight Data Science, a similar type of program but for scientists instead of engineers, for the past 2 years. That program has 100% placement so far, and I don’t see that number ever changing. The program has an excellent advisory board that is actively involved in the program.
The Data Engineering program is actively accepting applications for the next session scheduled to start in September. Hurry, the deadline for applications is July 7, 2014.