How to Interview a Data Scientist

Here is a slide show on hiring interviews for Data Scientists. This was presented at Strata 2013 by Daniel Tunkelang, Director of Data Science at LinkedIn.


This is a great checklist of topics for aspiring Data Scientists to become proficient in. This list of skills can reasonably be attained within a master’s degree program in Statistics, Computer Science or Operations Research.

Data Science 101

A while back, James Kobielus wrote the article Data Scientist: Consider the Curriculum. It contains one of the best descriptions of a data science curriculum I have seen. The article also includes a list of algorithms and modeling techniques that a data scientist should know. Below is the list from the article.

  • linear algebra
  • basic statistics
  • linear and logistic regression
  • data mining
  • predictive modeling
  • cluster analysis
  • association rules
  • market basket analysis
  • decision trees
  • time-series analysis
  • forecasting
  • machine learning
  • Bayesian and Monte Carlo Statistics
  • matrix operations
  • sampling
  • text analytics
  • summarization
  • classification
  • principal components analysis
  • experimental design
  • unsupervised learning
  • constrained optimization

The list looks almost overwhelming.
Do you think anything is missing from the list?

Big Data at Facebook

Facebook has long been a leading force in the development of big data. The recent release of Graph Search is seen as a move by the company to flex its big data muscles and put its engineering bona fides to the test.

A couple of good articles came out this week discussing the engineering challenges Facebook is taking on with this project. Zach Miners, at IT World, explains in this article the work the new tool does in searching large graphs. Harpreet, at Tools Journal, wrote a good piece explaining how Natural Language Processing is used by Graph Search.

One last article I would like to highlight is a glossary of big data at Facebook that was put out by Wade Roush at Xconomy.

NYU’s Data Science initiative pushes North East to forefront

NYU has announced a university-wide Data Science initiative and the creation of the Center for Data Science (CDS). Columbia University got a head start with The Institute for Data Science and Engineering (IDSE), but NYU is offering an MS degree with plans to offer a PhD, something Columbia’s IDSE is not doing at this time.

This news comes on the heels of Massachusetts’ efforts to make Boston a big data capital. Massachusetts has long been a leading center for academic research in data mining and big data through MIT’s Computer Science and Artificial Intelligence Laboratory (CSAIL).

There are many institutions across the US that have added Data Science degrees to existing programs. However, NYU is one of the first schools to create a center with the purpose of offering a degree, as opposed to only conducting research, in this field.

Python wraps itself around big data

Python and big data have recently been in the news. Continuum Analytics just received $3M from DARPA (I hope they cash their check before the possible sequester) to develop big data capabilities for Python with projects Blaze and Bokeh. This is promising news for those of us who are not proficient in multiple programming languages. Up to this point, Java has been the lingua franca for most big data applications. This project won’t address all of Python’s performance issues, which are the reason Java is so commonly used in most development, but hopefully it’ll allow us non-polylingual programmers to do some heavy lifting without all the curly braces.

Once again, GigaOm’s Derrick Harris gives us a great report in “DARPA puts $3M into startup pushing big data in Python.”

InformationWeek Government also has a nice article, “DARPA Funds Python Big Data Effort,” by J. Nicholas Hoover.

Also check out Continuum Analytics’ blog announcement about the project and another post detailing Blaze.