Hilary Mason provides another great talk title: Machine Learning for Hackers. The video is worth watching. Enjoy!
I think a dataset related to human trafficking would be interesting. It would need to contain: when, where, and the age of the person kidnapped. It could also contain the eventual location of the victim. I don’t know that any organisation has this data. Many times the kidnappings occur unknowingly or the persons involved are not allowed to speak about it. I think this data could be used to predict when kidnappings for human trafficking would occur, and thus help prevent the crime.
Also, I would love a dataset all about my life. I would love to know what factors constitute a better day for me. I would like the dataset to contain the foods I eat, the things I accomplish, sleep (including how often I wake up), exercise, devotion time, a rating of how good I thought the day was, and possibly anything else. I know books and experts say that good food and exercise make people feel better. I would really like to know which factors are most important for me. The problem is: I don’t want to take the time and effort to track all this data. I bet there is an app for it.
Chinese Gender Predictor
This one would just be for fun. Currently, I would enjoy a large dataset with information about child births. The dataset would need to contain the conception date (or due date), mother’s birth date, and child’s gender. I know that hospitals have this type of data, but HIPAA prevents the sharing of medical records. Here is why I would like it. There are numerous Chinese Gender Predictors around. They claim to be able to accurately predict the gender of a baby. Given enough data, this would be a fairly simple thing to validate or invalidate. Just run the Chinese Gender Predictor on each birth record and see how often it is correct. If it is correct significantly more than 50% of the time, then the early Chinese may have known something we do not. Otherwise, the Chinese Gender Predictor is not a useful tool. This data would have little impact for bettering the world, but just sounds like a fun little project.
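Here is a minimal sketch of how the "significantly more than 50%" check could work, assuming a one-sided exact binomial test against a coin-flip baseline. The function name and the counts in the example are hypothetical, made up for illustration; no real birth data is used.

```python
from math import comb


def binomial_p_value(correct: int, total: int) -> float:
    """One-sided exact binomial test against a 50% baseline.

    Returns the probability of seeing `correct` or more right answers
    out of `total` predictions if the predictor were only guessing.
    """
    if not 0 <= correct <= total:
        raise ValueError("correct must be between 0 and total")
    # Count the outcomes with at least `correct` successes, then divide
    # by the 2**total equally likely outcomes of a fair coin.
    favourable = sum(comb(total, k) for k in range(correct, total + 1))
    return favourable / 2**total


if __name__ == "__main__":
    # Hypothetical counts: the predictor was right for 60 of 100 births.
    p = binomial_p_value(60, 100)
    print(f"p-value: {p:.4f}")
    # A small p-value (say, below 0.05) would suggest the predictor does
    # better than guessing; a large one means it is consistent with chance.
```

The 0.05 cutoff is only a common convention. With a real dataset, the baseline might also need adjusting, since births are not split exactly 50/50 between boys and girls.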
Whether it exists or not, what dataset would you love to access?
DJ Patil and Josh Elman, both of Greylock Partners, give an insightful talk at LeWeb London 2012. The most important part was the introduction of the Data Scientific Method.
Data Scientific Method
- Start with a Question
- Leverage your current data
- Create features and run tests
- Analyze the results and draw insights
- Let the data frame a conversation
We are excited to announce that the Second Edition of OpenIntro Statistics will be released in August! The First Edition will remain available for one more academic year (2012-2013), or longer if there is continued interest. The Second Edition is a further evolution of OpenIntro Statistics and includes the following important changes:
- New data. Many of the data sets, some just one or two years old, have been swapped out for newer data and studies.
Talk of big data is not new, but the term "big data" might be. Here is a nice article from 2010 about the benefits and risks of massive amounts of data. You will notice it lacks some of the hot buzzwords of today.
Data Science Education Opportunity
I think there is an opportunity for an online data science training program. Many people wanting to learn data science already have a degree and some of the necessary skills. The online curriculum would have to be flexible enough to allow a person to fill in the gaps of missing knowledge. I would also like the topics to be broken down further than a typical university semester course. Break the materials into one or two week segments. For example, don’t offer training in machine learning. Separate it into numerous training segments like: logistic regression, support vector machines, and random forests. Proper prerequisites would need to be stated, but this method would allow learners to quickly grasp small chunks of knowledge.
One problem here is the lack of credentials. The online material would need to present a student with some type of certificate/award for completing material. The certificate/award would have to mean something, and not just be a slip of paper. The other problem is the vast amount of time required for creating the training material. Anyhow, I think there is an opportunity for someone or some company to create this curriculum.
What would work for you?
How would you like to learn data science?
Yesterday, I posted about some traditional strategies to acquire data science skills. Today, I will post a nontraditional strategy.
There are hoards of data science information available on the internet for free. With enough personal motivation, a person could learn all the necessary skills online for free (or cheap). Coursera is probably a great place to start. There are also other good sites, such as Udacity and the Kaggle Wiki, along with various blogs and websites.
The problem with this approach is knowing exactly what to learn. A course in machine learning is great, but data science is more than just machine learning. How do you know what to learn? It would be really nice to have a collection of data science topics and the associated online training materials.
Would this strategy work for you?
Based upon the popularity of a previous post about a certificate program from the University of Washington, it appears that many people are interested in learning the skills necessary to become a data scientist. Thus, I decided to compile a list of some of the possible learning strategies.
Traditional College Education
The most obvious path would be to study at a traditional college or university. Colleges and universities are starting to notice the demand for data science skills, and many colleges are currently offering programs to prepare someone as a data scientist. This path is safe and predictable. Do the homework, complete the courses, and get the degree or certificate. Most people are familiar with the process, and it offers few surprises. The problems here are the costs, lack of flexibility, and time involved.
Companies are now starting to offer training programs for data science. EMC is leading the way in this category with their data science training program. Cloudera also offers lots of training related to Hadoop and big data. Wolfram offers data science training with Mathematica. One of the problems with this category is the cost. Another problem is that the companies tend to teach and promote their own products. This may leave the student with numerous gaps in the full data science spectrum.
What are your thoughts about the above approaches? What are the positives and negatives? Also, later this week I will be posting some less-traditional approaches to learning data science.
With Facebook engineers, it appears the high-performance database apple doesn't fall far from the tree. On Monday, former Facebookers Eric Frenkiel and Nikita Shamgunov (who also spent six years as a senior engineer on Microsoft SQL Server) launched a startup called MemSQL that seeks to speed relational databases by taking a page out of the Facebook playbook.