Marginally Interesting - The Newsletter - Issue #7: What Particle Accelerators Teach Us About Data Culture
One of the biggest challenges when building data products is data. You have to get the data, clean the data, test whether the data is actually fit for what you want to do. You need to build pipelines to fetch the data automatically.
The worries don't end there, because you have probably introduced a data dependency between parts of the organization that usually don't need to talk to one another.
For example, if you're building a recommender system, you use data from the frontend and the order system to track what customers actually bought.
Google's paper "Machine Learning: The High-Interest Credit Card of Technical Debt", which I consider required reading for anyone building data products in a non-trivially sized company, discusses how these data dependencies differ from normal code dependencies.
What the paper does not cover is what to do about it, and that is indeed a very good question.
I don't have a full answer, but I always think back to something Thorsten Dietzsch, with whom I worked at Zalando, told me. Thorsten (an all-around great guy) was a physicist in a former life and did his Ph.D. at CERN.
CERN (founded in 1954) is an absolutely mind-bogglingly gigantic project (both in its physical dimensions and in the number of people-centuries that went into building it), essentially a particle accelerator used to study atomic and, I guess, subatomic structures. I'm not a physicist myself, but as far as I understand it, they accelerate particles to such high speeds that when they collide with one another, there is so much energy that subatomic bindings break, and from the "debris" you can see what's inside.
It's not a simple matter of seeing, of course; it requires serious data analysis on tons of sensor data. CERN produces and analyzes massive heaps of data. They probably did big data before it was cool. According to their website, they collect more than 100 petabytes per year, which is about 100,000 terabytes of data.
One question, of course, is how you manage data quality at such a scale. How do you do that when the people who work there come from hundreds of different research institutes, in teams that are more or less self-organizing?
Distributed responsibility for data
There are teams or workgroups that are responsible for the data at different levels. For example, there are the workgroups close to the thousands of sensors, which make sure that the data is available, of high quality, and so on.
Then there are teams that do all kinds of analysis to provide "golden data sets" based on the sensor data. For example, one team works on detecting a certain kind of particle, and they own the process that provides the agreed-upon standard way of detecting that particle.
Understanding the value of data
Now this might look like the typical kind of separation of responsibilities you would see in a company as well.
But CERN has also fully understood the value of its data. After all, this data is what becomes the basis of scientific discoveries and the much-coveted publications.
So being part of these workgroups carries a lot of prestige in the CERN community. It is not easy to get into those groups. You have to prove that you're good enough (of course) and make significant contributions before you are accepted. Often you start out doing low-level housekeeping work before they let you work on anything interesting.
And if there is a scientific breakthrough, you will make it onto the author list, which is the ultimate currency in science.
Leaving room for exploration
On the other hand, nobody is forced to work with these golden data sets. If you want, you can "roll your own" detector based on the raw sensor data. But then you need to be prepared to prove that your data is correct and to explain why you haven't used the golden data set that already exists. It's not that you are forbidden to do so, but you need a solid reason why.
You might wonder what CERN has to do with your company, but in a way, you also have sensors (frontend tracking), golden data sets (e.g. conversion funnels), and scientific breakthroughs (a personalized recommender that leads to a 15% uplift in conversions), and you have people who want to see some appreciation (bonuses and promotions) for their hard work.
I think many companies still haven't really understood the value of data in all the places that need to be aware of it. Teams that essentially provide "sensor data" don't have data quality as part of their KPIs. Contributions aren't properly attributed. Data management is expected to "just work". Data quality doesn't get enough attention, neither operationally nor when it comes to validating whether something actually works. Companies aren't strategic enough about building out products with a "sensor strategy" in mind. Sometimes data management is understood as a purely technical or infrastructure problem, when that is only part of the solution.
All of this can be changed, but it needs to start at the very top. Enjoy the weekend!