Marginally Interesting - The Newsletter - Issue #6: What kind of ML tools do we really need?
Welcome to Issue #6, today about what kind of ML tools we really need.
I recently had a quite interesting discussion on Twitter (it is possible!) with Lars Albertsson. Previously at Spotify, he has been running his startup Scling, which provides data-value-as-a-service, since 2018.
One of the topics that has followed me (or I followed it?) over the past few years is that of platforms and tooling for data scientists and machine learning applications. A lot has happened in the past years. What started as mostly open source driven projects propelled Python to become the language of choice for data science.
Eventually, cloud companies like AWS and Google picked up the topic and added their own offerings: Google, in typical fashion, rebuilt everything on top of TensorFlow, while AWS produced its usual sprawl of offerings under the SageMaker brand.
Still, I feel that some things are missing. The ubiquitous notebook is great for cursory exploration but challenging for cleaner and more robust production code. Even a library like pandas, the de-facto standard for managing datasets that fit into RAM, has its limitations, as pointed out by one of its creators.
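To make the in-memory constraint concrete, here is a minimal sketch of the usual workaround: streaming a file in chunks and combining partial aggregates instead of materializing the whole dataset. The file name, column names, and chunk size are illustrative, not from any particular project.

```python
import pandas as pd

# Create a small example file standing in for a dataset too big for RAM.
df = pd.DataFrame({"user": ["a", "b", "a", "c"] * 250, "amount": range(1000)})
df.to_csv("events.csv", index=False)

# Instead of pd.read_csv("events.csv"), which loads everything into memory,
# process the file chunk by chunk and merge the partial per-user sums.
totals = None
for chunk in pd.read_csv("events.csv", chunksize=100):
    partial = chunk.groupby("user")["amount"].sum()
    totals = partial if totals is None else totals.add(partial, fill_value=0)

print(totals.sort_index())
```

This works for aggregations that decompose over chunks (sums, counts, max); operations like joins or sorts are exactly where the in-RAM model starts to hurt and people reach for distributed engines.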
At the same time, I have always felt that this is a challenging space in which to innovate and make money. It seems like you need to take an "open core" approach, open source software plus some paid upgrades, but then you will always live in fear that one of the cloud providers takes your open source core and integrates it into their own offerings.
So I wrote the following tweet (click on the Twitter icon to expand):
There were many interesting replies and pointers, but I want to follow up on this specific thread by Lars:
Besides economic feasibility, there is another dimension to this which sounded very familiar. There is definitely an interest in better tooling, but many of the problems with putting ML into practice are unrelated to tooling; they are about processes and organization.
For example, as I said previously, data science projects are different because they have to deal with a much higher amount of uncertainty. You have to manage this uncertainty very consciously and include it in your planning. Shinier deployment pipelines for ML models alone won't make your projects deliver better results.
Another example is the role of data and data quality within a company. Often, data simply is not organized in a way that allows data-driven products to be built on top of it. As Jesse Anderson often points out, even the role of the data engineer is not well understood.
Lars compared the situation to Toyota and GM. In order to get its quality problems under control, GM studied how Toyota factories worked and even rebuilt a factory, but without really understanding the differences in processes.
Interestingly, not all tools are the same: some tools manage to transform the way people work, sometimes precisely because of their shortcomings. Hadoop seems to have been one of these examples.
I believe that the best tools also transform the way we think and work. There is an optimistic reading of "if you have a hammer, every problem looks like a nail": you understand for the first time that many problems are in fact nails, and that this is a productive way to view them.
For some reason, Big Data and Data Engineering currently seem to be converging towards distributed execution engines for SQL. This is nice because SQL is well known, but perhaps it is also a bit too convenient and familiar.
And part of me also thinks that one should not ignore this space just because it seems economically challenging. But more on that (maybe) some other time.