It is back-to-school time, and here are some papers to keep you busy this school year. All the papers are free. This list is far from exhaustive, but these are some important papers in data science and big data.

### Google Search

**PageRank** – This is the paper that explains the algorithm behind Google search.
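For a feel of the idea before opening the paper, here is a minimal sketch of PageRank as a power iteration in Python. It is an illustration under simplifying assumptions, not Google's implementation: the `pagerank` function, the `links` dictionary format, the default damping factor of 0.85, and the even redistribution of rank from pages with no outgoing links are all choices made for this example.

```python
def pagerank(links, damping=0.85, iterations=50):
    """Rank pages by power iteration.

    `links` is a hypothetical input format: a dict mapping each page
    to the list of pages it links to.
    """
    # Collect every page that appears as a source or a target.
    pages = set(links)
    for targets in links.values():
        pages.update(targets)
    n = len(pages)

    # Start with rank spread evenly over all pages.
    rank = {page: 1.0 / n for page in pages}

    for _ in range(iterations):
        # Every page receives a small baseline share of rank.
        new_rank = {page: (1.0 - damping) / n for page in pages}
        for page in pages:
            targets = links.get(page, [])
            if targets:
                # Split this page's rank evenly among the pages it links to.
                share = damping * rank[page] / len(targets)
                for target in targets:
                    new_rank[target] += share
            else:
                # Assumption: a page with no outgoing links shares its
                # rank evenly with every page.
                share = damping * rank[page] / n
                for target in pages:
                    new_rank[target] += share
        rank = new_rank
    return rank


if __name__ == "__main__":
    toy_web = {"a": ["b", "c"], "b": ["c"], "c": ["a"]}
    for page, score in sorted(pagerank(toy_web).items()):
        print(page, round(score, 3))
```

In this toy graph, page `c` is linked to by both other pages, so it ends up with the highest rank.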

### Hadoop

**MapReduce** – This paper explains a programming model for processing large datasets. In particular, it is the programming model used in Hadoop.

**Google File System** – Part of Hadoop is HDFS. HDFS is an open-source version of the distributed file system explained in this paper.
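To give a flavor of the MapReduce programming model, here is a small single-machine sketch of the classic word-count example in Python. It only imitates the map, shuffle, and reduce steps in memory; it is not Hadoop code, and the function names (`map_phase`, `shuffle`, `reduce_phase`, `word_count`) are made up for this illustration.

```python
from collections import defaultdict


def map_phase(document):
    """Map step: emit a (word, 1) pair for every word in one document."""
    for word in document.lower().split():
        yield word, 1


def shuffle(pairs):
    """Shuffle step: group all emitted values by their key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """Reduce step: combine all values for one key into a single result."""
    return key, sum(values)


def word_count(documents):
    """Run the three steps in sequence on a list of strings."""
    pairs = (pair for doc in documents for pair in map_phase(doc))
    grouped = shuffle(pairs)
    return dict(reduce_phase(key, values) for key, values in grouped.items())


if __name__ == "__main__":
    print(word_count(["big data", "big papers"]))
```

In a real cluster the map and reduce calls would run in parallel on many machines, with the framework handling the shuffle between them. The sketch keeps only the shape of the computation.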

### NoSQL

These are two of the papers that started and drove the NoSQL debate. Each paper describes a different type of storage system intended to be massively scalable.

### Machine Learning

**10 algorithms in data mining** – This paper covers 10 important machine learning algorithms.

**A Few Useful Things to Know about Machine Learning** – This paper is filled with tips, tricks, and insights to make machine learning more successful.

#### Bonus Paper

**Random Forests** – This paper introduces one of the most popular machine learning techniques. It is heavily used in Kaggle competitions, even by the winners.

Are there any other papers you feel should be on the list?

Maybe include a literature survey on neural networks? Not my field of specialty so I won’t recommend one, but I know it’s about to get red hot.

That is true. Also, a paper on random forests would be nice as well. I may have to look for the best papers on those topics.

Thanks,

Ryan

@Ryan: Link to Random Forest Paper: http://oz.berkeley.edu/~breiman/randomforest2001.pdf

Thank you very much for the link. That would be the correct random forest paper to add. I have added it to the list.

Ryan

Reblogged this on Datapolitan and commented:

Great list of resources, though I’d add E.F. Codd’s seminal “A Relational Model of Data for Large Shared Data Banks”: http://www.seas.upenn.edu/~zives/03f/cis550/codd.pdf

Another paper on MapReduce that helped me a lot in writing algorithms for the MapReduce framework is this one: http://machinelearning.wustl.edu/mlpapers/paper_files/NIPS2006_725.pdf

And for handling large data sets, there is Breiman’s “Pasting Small Votes” paper: http://sci2s.ugr.es/keel/pdf/algorithm/articulo/1999-ML-Breiman-Pasting%20Small%20Votes%20for%20Classification%20in%20Large%20Databases%20and%20On-Line.pdf

Reblogged this on Gary Short.

thanks for reblogging

Great resource list. Any papers on how statistics is changing to handle larger datasets?

That is a great idea. I have not looked for any papers on that topic. If I find any, I will post them to the blog. Thanks for the comment.

Ryan

Nice list!! Also, consider the Amazon.com recommendations paper: http://www.win.tue.nl/~laroyo/2L340/resources/Amazon-Recommendations.pdf

Oh, that is a good one. I have not seen it before. Thanks for the recommendation on the recommendation paper.

Ryan
