Production Ready Data Science

24 August, 2016

This article is featured in the free magazine "Data Science in Production".

Becoming a data-driven organization is not an easy task. Off the top of my head, and in no particular order, these are the most frequent challenges a company faces:

  1. Attracting, retaining, and training the right talent;
  2. Collecting data and making it available across silos;
  3. Modernizing the tech stack, or increasing the complexity of the IT landscape by adding new technologies;
  4. Fear of the unknown, i.e. many people are afraid of losing control, or their job, to data and data science;
  5. Lack of vision[1].

As I was thinking about this list, however, I felt there was something deeper behind the troubles some organizations are facing. In fact, I know of companies that have made significant progress on all five points but are still not reaping the fruits they expected. When I looked at these companies more closely, none of them were putting the models they developed into production. The reasons varied, from being content with report-driven decision making (either a one-off report or periodic reporting) to simply struggling with all the pieces of the puzzle.

Being the curious type, I set out to investigate what was making the puzzle so difficult for them. Productionizing a model involves a series of (moving) pieces:

The companies struggling to become data-driven are failing at one or more of the above points. What they are doing is a mix of the following[4]:

If you’ve paid attention to these points, you are probably starting to see a pattern: data scientists usually suck at software quality, that is[7]: reliability, usability, efficiency, portability, and maintainability. Because data-driven models are implemented through software, they suffer from bad software quality just as much as your typical application.
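To make the maintainability and reliability points concrete, here is a minimal sketch in Python. Everything in it is hypothetical and made up for illustration (the function names, the toy linear scoring rule, the numbers); it is not code from any of the companies mentioned. It contrasts a terse function that silently accepts bad input with a version that names its inputs, validates them, and ships with a small test.

```python
from typing import Sequence


def f(k, kk, kkk):
    # Hard to maintain: nothing says what k, kk, or kkk mean, and
    # zip() silently truncates if the two sequences differ in length.
    return sum(a * b for a, b in zip(k, kk)) + kkk


def linear_score(features: Sequence[float],
                 weights: Sequence[float],
                 intercept: float = 0.0) -> float:
    """Return intercept + dot(features, weights) for one observation.

    Raises ValueError when the inputs have different lengths, instead
    of silently truncating them.
    """
    if len(features) != len(weights):
        raise ValueError(
            f"expected {len(weights)} features, got {len(features)}"
        )
    return intercept + sum(x * w for x, w in zip(features, weights))


def test_linear_score() -> None:
    # A tiny regression test: the kind of check that lets a colleague
    # change the code later without fear of breaking it unnoticed.
    assert linear_score([1.0, 2.0], [0.5, 0.25], intercept=1.0) == 2.0
    try:
        linear_score([1.0], [0.5, 0.25])
    except ValueError:
        pass
    else:
        raise AssertionError("length mismatch should raise ValueError")


if __name__ == "__main__":
    test_linear_score()
```

Both functions compute the same number on valid input. The difference only shows up when something goes wrong or when someone else has to read the code, which is exactly when a model in production needs it.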

Let me be clear: this is not an easy task! To create a (great) model you need creativity, a scientific attitude, knowledge of various modeling techniques, etc. Finding data scientists who are able to create these models is one of the biggest challenges for an organization. But focusing on the modeling at the cost of software quality will produce something great and admirable that ends up not being used.

This is the reason we actively hire data scientists who can code, and can do it well.

I imagine you now have the next burning question, which is: what if the data scientists working at my organization are not good at it? What if someone implemented a great new method and then left the company, but nobody can actually make sense of what she wrote?

This is where I pitch you our services, training and consultancy, because it’s not like I write 1,200 words for nothing! We can train your data scientists to write code of higher quality and we can review the code they wrote. And we’re very good at it and have fun while doing it! Get in touch.

I wrote this post after a lot of brainstorming with the team. A big thank you goes in particular to Gabriele for reading everything and giving me precious feedback, together with his experience at some of the largest Dutch enterprises.

  1. A lack of vision is a much broader issue than 1-4, as it can bring even the largest and most flourishing corporations to the ground (a great read about this is Good to Great). I included it nonetheless, as it will cut budget or make it unavailable, or prevent management buy-in for data-driven products. And lack of management buy-in is even worse than lack of budget. One of our first clients installed its first Hadoop cluster on decommissioned machines, built a type-ahead and recommendation engine for their web shop, and saw profits surge right after they put it into production. No budget would have helped had management not agreed to "let" the model into production.

  2. Unless something breaks of course. 

  3. Whatever that means for you. 

  4. This is probably one of the posts with the highest density of bullet points I’ve ever written. Apologies.

  5. I still vividly remember when a professor suggested that using kkk as a variable name was not a very wise choice, to which I replied that I was using k and kk for something else. 

  6. It is not my intention to denigrate their work. I often use the matrix factorization methods implemented in Spark to train my recommendation engines. I am merely stating that they set out to solve a problem without thinking about productionizing their work. 

  7. This is a subset of the ISO 9126 standard on software quality. 
