Nice GraphDB and NoSQL Talk

This is a wonderful talk by Max DeMarzi (he has a very informative blog as well). If you are new to NoSQL or Graph Databases, I highly recommend this video.

One comment stuck out for me:

You’re never gonna run out of nodes when you get to half a trillion…

That is a really big number, but I wonder how many years that statement will stand. If you have any thoughts, please leave a comment.

ChiSC: Max DeMarzi – Is Your Problem a Graph Problem? from 8th Light on Vimeo.

2 Recently Released Open Source Graph-Related Projects

  1. GraphBuilder

    Intel Labs built a tool for constructing mathematical graphs out of large datasets. It is Java-based and works with Hadoop and MapReduce. Intel has released a whitepaper explaining more about GraphBuilder. The code is available on Github. A big thanks to Mark Nickel for pointing out this project.

  2. ArangoDB

    ArangoDB is a flexible NoSQL database. It is a document database with the ability to add edges. Thus it can become a graph database. I had a fun time playing around with the online tutorial and demo. ArangoDB also claims to support being a key/value store. The code is available on Github.
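To make the "documents plus edges" idea concrete, here is a minimal sketch in plain Python. It does not use ArangoDB's API or driver; the collection names, ids, and helper functions are made up for illustration. The only detail borrowed from ArangoDB is that an edge is itself a document carrying `_from` and `_to` attributes.

```python
# Documents: plain records keyed by a hypothetical "collection/key" id.
people = {
    "people/alice": {"name": "Alice"},
    "people/bob": {"name": "Bob"},
    "people/carol": {"name": "Carol"},
}

# Edges: also documents, each pointing from one document to another.
knows = [
    {"_from": "people/alice", "_to": "people/bob", "since": 2010},
    {"_from": "people/bob", "_to": "people/carol", "since": 2012},
]

def neighbors(doc_id, edges):
    """Return the ids that doc_id points to directly."""
    return [edge["_to"] for edge in edges if edge["_from"] == doc_id]

def reachable(start, edges):
    """Follow edges outward from start and collect every id that can be reached."""
    seen, stack = set(), [start]
    while stack:
        current = stack.pop()
        for next_id in neighbors(current, edges):
            if next_id not in seen:
                seen.add(next_id)
                stack.append(next_id)
    return seen

friends_of_friends = reachable("people/alice", knows)
print(sorted(people[doc_id]["name"] for doc_id in friends_of_friends))
```

Without the edge list this is just a document store; adding the edges is what allows graph-style traversal.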

3 Secrets for Aspiring Data Scientists | Software Advice

Michael Koploy wrote 3 Secrets for Aspiring Data Scientists about what it takes to enter a career as a data scientist. He lays out 3 steps:

  1. Sharpen Your Scientific Saw – Hone your math and science skills
  2. Learn the Language of Business – Data Scientists need to explain the data in business terms
  3. Keep Adding to Your Technical Toolbelt – Learn all the tools you can (NoSQL, Excel, Hadoop,…)

The article is a nice read. http://blog.softwareadvice.com/articles/bi/3-career-secrets-for-data-scientists-1101712/

Java and MongoDB Webinars

10gen, the company behind MongoDB, will be offering some free webinars this fall. This webinar series is targeted at using MongoDB with Java. 10gen has been running successful webinars for a long time, so I would highly recommend any or all of the following sessions.

Title                                               Date
Building your first Java Application with MongoDB   Oct. 18, 2012 and Nov. 22, 2012
Building Web Applications with MongoDB and Spring   Nov. 1, 2012
MongoDB on the JVM                                  Nov. 29, 2012
Simplifying Persistence for Java and MongoDB        Dec. 13, 2012

Neo4j and Bioinformatics Webinar

Neo Technology, the company behind the graph database Neo4j, is hosting a webinar on Thursday. Pablo Pareja from the Bio4j project will provide an overview of bioinformatics and Neo4j, as well as some applications.

Bioinformatics can be viewed as data science for biology. Bioinformatics was cool before data science was even a term.

If you are interested in learning more about bioinformatics and graph databases, then register for this webinar and start learning.

Challenge To Future Developers: Start Storing More Data

Dear Future Developers

Please store as much data as possible. Do not worry about the cost of the extra storage disks. The value in the data will far outweigh the cost of the hardware. Here are some examples of data that could be stored but is typically not.

Start storing data about the order in which pages on your site get visited. Where do visitors most often land, and where do they go from there? Is there a path that leads to visitors becoming customers? Is there a path that leads to visitors leaving? Both would be good to know. Given enough of this data, it would be possible to predict what pages eventually lead to the most customers.
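As a rough sketch of what this could look like, the Python below counts landing pages and page-to-page transitions from stored visit sequences. The session data, page names, and function names are all made up for illustration; a real site would read the sequences from wherever the visits are stored.

```python
from collections import Counter

# Hypothetical sample data: each session is the ordered list of pages one visitor viewed.
sessions = [
    ["/home", "/pricing", "/signup"],
    ["/home", "/blog", "/pricing", "/signup"],
    ["/home", "/blog"],
    ["/blog", "/pricing"],
]

def count_landings(sessions):
    """Count how often each page is the first page of a session."""
    return Counter(pages[0] for pages in sessions if pages)

def count_transitions(sessions):
    """Count how often visitors move from one page directly to the next."""
    transitions = Counter()
    for pages in sessions:
        for current_page, next_page in zip(pages, pages[1:]):
            transitions[(current_page, next_page)] += 1
    return transitions

landings = count_landings(sessions)
transitions = count_transitions(sessions)

print(landings.most_common(1))
for (current_page, next_page), count in transitions.most_common():
    print(f"{current_page} -> {next_page}: {count}")
```

With enough real sessions, the same counts would show which paths tend to end at a signup page and which tend to end with the visitor leaving.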

Start storing log information to a database. Some places do this, but far too many do not. As developers, we should give this a higher priority. It is never fun to go debug a problem only to find the log file has been overwritten. Setting up a database for this would definitely save on debug time. Plus, the log data could be helpful for determining trends or parts of the system that frequently have issues. Not all bugs produce errors, so it is important to store all the log data, not just the error entries.
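Here is a minimal sketch of the idea using Python's standard logging module. SQLite is used only because it ships with Python; it is a stand-in for whatever database you choose. The handler class, table name, and logger name are assumptions made for this example.

```python
import logging
import sqlite3

class SQLiteLogHandler(logging.Handler):
    """Write every log record to a database table instead of a rotating file."""

    def __init__(self, connection):
        super().__init__()
        self.connection = connection
        self.connection.execute(
            "CREATE TABLE IF NOT EXISTS app_log ("
            "created REAL, level TEXT, logger TEXT, message TEXT)"
        )

    def emit(self, record):
        try:
            self.connection.execute(
                "INSERT INTO app_log VALUES (?, ?, ?, ?)",
                (record.created, record.levelname, record.name, record.getMessage()),
            )
            self.connection.commit()
        except Exception:
            self.handleError(record)

# An in-memory database keeps the example self-contained; use a file path to keep the data.
connection = sqlite3.connect(":memory:")
logger = logging.getLogger("shop.checkout")  # hypothetical logger name
logger.setLevel(logging.DEBUG)  # store everything, not only errors
logger.addHandler(SQLiteLogHandler(connection))

logger.info("cart loaded for user %s", 42)
logger.error("payment gateway timed out")

rows = connection.execute(
    "SELECT level, message FROM app_log ORDER BY rowid"
).fetchall()
print(rows)
```

Because the records land in a table, questions like "which logger produces the most errors per week" become a simple query rather than a search through old files.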

Start storing data about the errors that occur and what (screen or page) caused each error. This information is typically stored in a log file somewhere, and it is too frequently lost after a couple of days. It would be much better to store this information in a database for archival purposes. This is closely related to the previous paragraph.

Start storing information about which fields on a form get updated. Then you can notice if users are constantly returning to the same form to update a different field. Maybe the user was unaware that both fields can be updated in a single save. Rearranging the fields might create a better user experience, and it will decrease the number of updates hitting the database.
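One possible way to spot that pattern, sketched in Python under the assumption that each save is recorded as a (user, form, changed fields) entry. The sample records and the function name are made up for illustration.

```python
from collections import defaultdict

# Hypothetical audit records: (user, form, fields changed in that save).
form_saves = [
    ("alice", "profile", {"email"}),
    ("alice", "profile", {"phone"}),
    ("bob", "profile", {"email", "phone"}),
    ("carol", "profile", {"email"}),
    ("carol", "profile", {"phone"}),
    ("carol", "billing", {"card"}),
]

def repeated_partial_saves(form_saves):
    """Return {form: users} for users who saved a form several times,
    changing a different set of fields on at least two of those saves."""
    saves_by_user_form = defaultdict(list)
    for user, form, fields in form_saves:
        saves_by_user_form[(user, form)].append(frozenset(fields))

    flagged = defaultdict(set)
    for (user, form), saves in saves_by_user_form.items():
        if len(saves) > 1 and len(set(saves)) > 1:
            flagged[form].add(user)
    return dict(flagged)

flagged = repeated_partial_saves(form_saves)
print(flagged)
```

In this sample, two of the three users of the profile form came back to change a second field, which would be a hint that the form layout deserves a second look.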

Start storing data about which buttons and links users click. This is not just the pages visited but the actual user actions. A good web analytics program can cover some of this, but why not store all of it yourself? Then you can do with it as you please. It would be great to know which buttons on your site users click the most. Is it the color, location, neither, or both that determine a popular button? What buttons and links never get clicked? How frequently does the same user click each button? If a user continues to come back and click the same button, it may indicate a navigation issue. There are some nice usability enhancements that can be made with this data.

Start storing data that you cannot immediately see as useful. The big data movement is continually showing the advantage of having more data. You never know when or for what the data will be useful.

Many of the current NoSQL choices would be good candidates to store the above data. This data will obviously grow very quickly, and speedy inserts are a must. Therefore, a database like MongoDB, Cassandra or Redis might be a good choice.

What other data do you think could be collected? I am sure there are lots of other possibilities. Also, I am going to take myself up on this challenge. I would like to store more information about the software I build.

Sincerely,

Ryan Swanstrom