Marginally Interesting - The Newsletter - Issue #4
Today there's a longer discussion of MLOps in general, and how to bridge the gap between notebooks and production, and then there are two articles on survivorship bias and how to build a business from Open Source Software.
As always, you can reply directly to this email to talk to me.
The White Whale of MLOps
Every cloud provider has some MLOps offering, for example, to automate data and training pipelines.
I recently looked at AWS Sagemaker Pipelines and ran the demo, only to be left with an inference instance that cost about $5 per day. The example runs on the Iris data set, a tiny standard example data set from statistics. Yet the whole pipeline takes 10 minutes to run, because every step requires spinning up an instance. I'm aware they made this to show off the possibilities. But still...
You specify the pipeline through calling functions that take many named parameters. This seems to be a type of interface that has been popularized by tools like Airflow.
It is nice to build the pipeline through Python (instead of say... YAML), but if you compare this with typical notebook code, it couldn't be more different.
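To make the contrast concrete, here is a minimal sketch in plain Python. It is not the Sagemaker or Airflow API: `Step`, `run_pipeline`, `load`, and `train` are made-up names that only illustrate the "functions with many named parameters" style next to the same work written notebook-style.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List


@dataclass
class Step:
    """Hypothetical pipeline step: a named function plus its configuration."""
    name: str
    func: Callable[..., dict]
    params: Dict[str, object] = field(default_factory=dict)
    depends_on: List[str] = field(default_factory=list)


def run_pipeline(steps: List[Step]) -> Dict[str, dict]:
    """Run steps in the order given, passing upstream outputs by step name."""
    outputs: Dict[str, dict] = {}
    for step in steps:
        upstream = {name: outputs[name] for name in step.depends_on}
        outputs[step.name] = step.func(upstream=upstream, **step.params)
    return outputs


def load(upstream, rows):
    # Stand-in for reading a data set.
    return {"data": list(range(rows))}


def train(upstream, learning_rate, epochs):
    # Stand-in for model training; just does some arithmetic on the data.
    data = upstream["load"]["data"]
    return {"model": sum(data) * learning_rate * epochs}


# Pipeline style: everything is declared up front through named parameters.
pipeline = [
    Step(name="load", func=load, params={"rows": 4}),
    Step(
        name="train",
        func=train,
        params={"learning_rate": 0.5, "epochs": 2},
        depends_on=["load"],
    ),
]
result = run_pipeline(pipeline)

# Notebook style: the same work as two lines you poke at interactively.
data = list(range(4))
model = sum(data) * 0.5 * 2
```

Both versions compute the same thing, but the pipeline version is all declaration and wiring, which is exactly what makes it awkward to explore with.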
You'd probably need to do a lot of reimplementing moving from notebook exploration to production code. The problem with this is that you'll eventually need to go back to exploration to work on the next iteration.
I wondered what others would have to say and took this to Twitter:
You can see the discussion if you expand the tweet (click on the bird!). Don summarized it quite nicely:
This mirrors discussions I've had again and again in the past. Notebooks are great for exploration, but production requires something else: production code is optimized for robustness, automation, and so on. The challenge is that you don't just clean up your notebook code once. You also need to be able to go back and experiment to work on the next iteration of your model.
Solutions I've heard:
Keep notebooks and production separate.
Copy and paste production code back into notebooks for more exploration.
Switch to doing exploration on the production components.
Gradually create more production level components, but in a way that you can re-use them for exploration (e.g. libraries).
All of these can work, but each approach has its shortcomings. For example, keeping notebooks and production separate means the two will get out of sync. I've seen teams use the third approach, but they saw a huge slowdown, which could be okay if you start dealing with huge amounts of data.
My personal bet is on the last point, building up infrastructure that lets you do exploration even faster, but how to do that exactly is something we'll still need to figure out.
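As a rough illustration of that last option, here is a minimal sketch of one small library function shared by both sides. All names (`scale_features`, `production_step`, the `rows` config key) are hypothetical, and the validation is just an example of the kind of robustness production code tends to add.

```python
from statistics import mean
from typing import Dict, List, Sequence


def scale_features(rows: Sequence[Sequence[float]]) -> List[List[float]]:
    """Center each column on its mean. Lives in a shared library module."""
    columns = list(zip(*rows))
    means = [mean(column) for column in columns]
    return [[value - m for value, m in zip(row, means)] for row in rows]


def production_step(config: Dict[str, object]) -> List[List[float]]:
    """Hypothetical production wrapper: validates input, then reuses the library."""
    rows = config.get("rows")
    if not rows:
        raise ValueError("config must contain a non-empty 'rows' entry")
    return scale_features(rows)


# Exploration: call the library function directly in a notebook cell.
sample = [[1.0, 10.0], [3.0, 30.0]]
explored = scale_features(sample)

# Production: the pipeline step goes through the same code path.
deployed = production_step({"rows": sample})
```

The point is that the notebook and the pipeline import the same function, so going back to exploration for the next iteration doesn't mean reimplementing anything.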
What are your experiences and what approaches have you found to close the gap?
Survivorship Bias
Every time someone tells the story of how some founder persevered against all odds, I have to think of this. Just because one person became wildly successful although everyone told them it wouldn't work doesn't mean that's a good strategy.
Looking only at those who made it is survivorship bias. The following post has a good, detailed explanation of it.
Survivorship Bias: The Tale of Forgotten Failures — fs.blog Survivorship bias is a common logical error that distorts our understanding of the world. It happens when we assume that success tells the whole story and when we don’t adequately consider past failures.
How to build a business with Open Source Software
Open source software has become an essential part of today's (IT) world, but sometimes I think it has also made it so much harder to make money for such a highly specialized craft.
Bruno Lowagie tells the story of how he built a business around an open source software project, going through everything he tried (custom coding, selling documentation, licensing), what worked, and what didn't.
Entreprenerd — Open Source Survival: Index — entreprenerd.lowagie.com Open Source Survival: 8 different ways to make money with open source software
That's it for issue #4! Have a great week!