The Iceberg Secret in Machine Learning

Oftentimes the whole picture is not visible. The truth requires digging deeper than most are willing to go. But the payoff is usually worth it.

I first came across the Iceberg Secret a few years back while reading a post by Joel Spolsky. Joel does a great job of revealing a secret that many software projects face. Since then I have come to realize that the same principle translates to machine learning projects.

In a nutshell, the Iceberg Secret speaks to the gap between technical and non-technical stakeholders when it comes to evaluating the quality and progress of an AI-based solution. Often the solution is judged on the visualization of the data or the scalar output of the predictions (the visible 10%), and little regard is given to the bulk of the work that goes into data preparation (the hidden 90%).

Data and ML engineers understand the outsized value of carefully understanding and preparing the data, and how important it is to have a mature data pipeline in place before any time is spent on the top 10%.

Managing this process is usually the linchpin of a project’s success.

Often the technical team is instructed: “We don’t need to spend much time on the data. We can just throw it into the AI. It’s AI after all. Don’t bias the system. AI will figure it out.”

The reality, on the other hand, is that most of the work is spent on:

  • Designing a scalable and flexible data pipeline for effective data ingestion
  • Reviewing the data to ensure you understand its characteristics and how to approach the solution in the context of the overall goal
  • Designing and implementing the cleaning, transforming and loading of the data for ML training
  • Implementing a process that allows for continuous training, evaluation and redeployment of the ML models
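
To make the hidden 90% concrete, here is a minimal sketch of such a pipeline in Python using pandas and scikit-learn. The file name and column names (customers.csv, churned) are hypothetical placeholders, not from any real project:

```python
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Ingest: load the raw data (file name and columns are hypothetical).
raw = pd.read_csv("customers.csv")

# Review: basic profiling to understand the data's characteristics.
print(raw.describe())
print(raw.isna().mean())  # fraction of missing values per column

# Clean and transform: drop incomplete rows, encode categoricals.
clean = raw.dropna()
features = pd.get_dummies(clean.drop(columns=["churned"]))
labels = clean["churned"]

# Train and evaluate: the small visible step stakeholders actually see.
X_train, X_test, y_train, y_test = train_test_split(
    features, labels, test_size=0.2, random_state=42
)
model = RandomForestClassifier(random_state=42).fit(X_train, y_train)
print("accuracy:", accuracy_score(y_test, model.predict(X_test)))
```

In a real project each of those four comments grows into its own component, with the last step running on a schedule so models are continuously retrained, evaluated, and redeployed.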

The secret is not that this challenge exists; the secret is that your client or project manager has no idea that most of what they hear about AI’s value in the media is only as good as the underlying data.

Your most valuable skill (the skill you will get paid the most for) is your ability to effectively manage this misperception.

You are probably wondering: how do I manage the iceberg effect? Here are some strategies I find go a long way.

Provide visualizations during the data exploration phase

Typically you will need to provide some documentation on your strategy for how you will approach the solution. This is often presented to either the internal technical team or the client.

The audience is unlikely to understand what sparse data is or whether it’s a classification, clustering, regression, or ranking problem. However, they do understand charts and graphs. By visualizing the existing data set in its raw state you are doing yourself two big favors:

  1. It makes the data easier to reason about for both technical and non-technical stakeholders
  2. It allows you to showcase the improvements once you have cleaned and transformed the data for ML training
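
As a small illustration, even a single histogram of a raw column can anchor that conversation. The DataFrame and its age column below are made up:

```python
import matplotlib.pyplot as plt
import pandas as pd

# Hypothetical raw data; in practice this comes from your ingestion step.
raw = pd.DataFrame({"age": [23, 35, 35, 41, 19, None, 52, 35, 61, 29]})

# A histogram of raw values makes gaps and skew visible to everyone.
raw["age"].plot(kind="hist", bins=5, title="Age distribution (raw data)")
plt.xlabel("age")
plt.show()
```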

Put in extra effort whenever you visualize the data

Now that you know the visible 10% is what your work will be judged on, it’s important to make sure it looks great! Whether it’s a screenshot or a demo, put extra effort into the way you visualize the data.

For data scientists and ML engineers, matplotlib is great. It’s easy to quickly visualize a confusion matrix, error rates, a learning curve, and so on. Matplotlib is extremely powerful, and its very strong community produces plenty of examples. But I like to use libraries with more aesthetically pleasing graphs when presenting to non-technical team members. Here are some examples using Pygal, Seaborn, and Plotly.
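
For instance, a presentable confusion matrix takes only a few lines of matplotlib; the counts below are made up:

```python
import matplotlib.pyplot as plt
import numpy as np

# Made-up counts: rows are true labels, columns are predicted labels.
cm = np.array([[85, 15],
               [10, 90]])

fig, ax = plt.subplots()
im = ax.imshow(cm, cmap="Blues")
ax.set_xticks([0, 1])
ax.set_xticklabels(["pred: no", "pred: yes"])
ax.set_yticks([0, 1])
ax.set_yticklabels(["true: no", "true: yes"])
# Annotate each cell with its count so the chart reads at a glance.
for i in range(2):
    for j in range(2):
        ax.text(j, i, cm[i, j], ha="center", va="center")
fig.colorbar(im)
plt.show()
```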

Pygal

[Figure: example chart rendered with Pygal]
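
A minimal Pygal sketch with made-up numbers; Pygal renders charts as crisp SVG files:

```python
import pygal

# The accuracy numbers are illustrative only.
chart = pygal.Bar(title="Model accuracy by iteration")
chart.add("Baseline", [0.71])
chart.add("After data cleaning", [0.83])
chart.render_to_file("accuracy.svg")  # open the SVG in any browser
```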

Seaborn

[Figure: example chart rendered with Seaborn]
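
A comparable Seaborn sketch, again with made-up numbers; Seaborn’s default theme already looks polished:

```python
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns

sns.set_theme()  # apply Seaborn's polished default styling

# Illustrative numbers only.
df = pd.DataFrame({
    "iteration": ["baseline", "cleaned", "tuned"],
    "accuracy": [0.71, 0.83, 0.88],
})
sns.barplot(data=df, x="iteration", y="accuracy")
plt.title("Model accuracy by iteration")
plt.show()
```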

Plotly

I personally like to use Plotly to display straightforward graphs. I think it’s easy to make your graphs look great, but know that you are sacrificing some of the power of matplotlib. Ultimately you should use the tool you feel most comfortable with, as long as the output still looks great!
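
A minimal Plotly Express sketch with the same made-up numbers; fig.show() opens an interactive chart in the browser:

```python
import plotly.express as px

# Illustrative numbers only; the result is an interactive chart.
fig = px.bar(
    x=["baseline", "cleaned", "tuned"],
    y=[0.71, 0.83, 0.88],
    labels={"x": "iteration", "y": "accuracy"},
    title="Model accuracy by iteration",
)
fig.show()
```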

Plan for explainability

Explainability is hard. It’s even harder with deep learning based models, where the decision making is difficult to trace. Still, it’s important to visually represent what you can.
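
For models that expose them, even simple feature importances give stakeholders something concrete to look at. Here is a sketch using a scikit-learn random forest on synthetic stand-in data; the feature names are hypothetical:

```python
import matplotlib.pyplot as plt
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic stand-in data; swap in your real features and labels.
X, y = make_classification(n_samples=500, n_features=4, random_state=42)
feature_names = ["tenure", "usage", "support_calls", "plan_price"]  # hypothetical

model = RandomForestClassifier(random_state=42).fit(X, y)

# A horizontal bar chart of which features drove the model's decisions.
plt.barh(feature_names, model.feature_importances_)
plt.xlabel("relative importance")
plt.title("What the model pays attention to")
plt.show()
```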

Track and visualize progress

What is arguably even more important is visualizing your architecture and neural network design to display changes and progress as you find better architectures. Again, it makes things easier to reason about without having to explain the more complex details.

To visualize neural networks you can use TensorBoard from TensorFlow or the ANN Visualizer library; both render the network architecture as a graph.
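
Here is a minimal sketch of logging a toy Keras model to TensorBoard; the architecture and synthetic data are purely illustrative:

```python
import numpy as np
import tensorflow as tf

# A toy model; the layer sizes are arbitrary.
model = tf.keras.Sequential([
    tf.keras.Input(shape=(8,)),
    tf.keras.layers.Dense(16, activation="relu"),
    tf.keras.layers.Dense(1, activation="sigmoid"),
])
model.compile(optimizer="adam", loss="binary_crossentropy")

# Synthetic data so the example runs end to end.
X = np.random.rand(100, 8).astype("float32")
y = np.random.randint(0, 2, size=100)

# Write the graph and training metrics for TensorBoard to display.
tb = tf.keras.callbacks.TensorBoard(log_dir="logs/run1", write_graph=True)
model.fit(X, y, epochs=3, callbacks=[tb], verbose=0)
```

Run tensorboard --logdir logs afterwards and open the Graphs tab in your browser to explore the architecture.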