Scientific Excellence in Machine Learning Applications


The concept of engineering excellence is well known to software developers. While many definitions exist, a common theme is doing the right thing throughout the life cycle: optimizing the fit between project goals and business needs, designing for security, and following coding standards, to name just a few examples. These software engineering excellence ideas carry over directly to systems that incorporate Machine Learning (ML) techniques, but they do not cover the ML part of the work.

Various authors list ML best practices: some present deep insights into model development for web applications; others describe organizational-level challenges and solutions to implementing ML-based systems; yet others discuss software engineering and ML considerations together. However, the recommendations are domain specific and/or do not address all the phases of the science-to-practice process. We believe the ML application field has matured enough to establish general principles for going from theory to highly effective real-world systems.

What does it take to achieve the equivalent of engineering excellence when applying the science of ML? We offer several insights and propose a set of guiding principles.

Scientific Excellence Principles

1. Replicate a small and well understood aspect of human intelligence

It is convenient to view ML as a semantically magic black box – numbers go in and meaning comes out. The truth is that Artificial General Intelligence (AGI) belongs to the realm of scientific research, and ML can realistically only help in restricted problem domains, for which human functioning has fairly detailed explanations. Some of the problems are trivial for humans but potentially difficult for machines (e.g. finding objects in an image). In certain domains, a human would have low accuracy while a machine would be quite close to the mark (e.g. estimating the probability of clicking on an ad). Other problems are solved in significantly different ways by machines, with much better results than humans (e.g. playing chess). The two crucial aspects in each of these settings are that (a) the data exhibits patterns and that (b) ML can learn high level rules based on them. Once an expectation of feasibility exists for a class of problems, choosing the exact problem to tackle requires balancing between business value and model learning difficulty.

2. Ensure the data provides enough signal for ML

It seems common sense to discuss the solution immediately after the problem, but that would skip a key consideration: data. Models typically represent little more than compressed versions of the data they are fed and, as a subtle consequence, the effectiveness of a model within a system depends on the quality of that system’s data instrumentation. The basic requirement is that the data distribution seen in training match the one seen at run time. This can be difficult to fulfill in practice because the two stages are often implemented in separate paths in the infrastructure. ML developers have to understand the differences in code and data sources, take mitigation actions, and manage expectations. As an example of the latter, if time constraints prevent feature logging in 50% of the cases, developers should anticipate difficulties moving the needle in the tail of the data distribution.
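One way to check whether training and run-time data follow the same distribution is to compare histograms of a single feature. The sketch below uses the Population Stability Index, which is our choice for illustration and not something the principles prescribe. The function name, the bin count and the smoothing constant are assumptions, and both samples are assumed to be non-empty.

```python
import math
from typing import List, Sequence


def population_stability_index(
    train: Sequence[float],
    serve: Sequence[float],
    bins: int = 10,      # assumed bin count, tune per feature
    eps: float = 1e-6,   # assumed floor to avoid log(0) for empty bins
) -> float:
    """Compare two samples of one feature; 0.0 means identical histograms."""
    lo = min(min(train), min(serve))
    hi = max(max(train), max(serve))
    if hi == lo:
        # Every value is identical in both samples, so there is no shift.
        return 0.0
    width = (hi - lo) / bins

    def histogram(values: Sequence[float]) -> List[float]:
        counts = [0] * bins
        for v in values:
            # Clamp the maximum value into the last bin.
            idx = min(int((v - lo) / width), bins - 1)
            counts[idx] += 1
        return [max(c / len(values), eps) for c in counts]

    p = histogram(train)
    q = histogram(serve)
    # Each term is non-negative, so larger sums mean larger shifts.
    return sum((pi - qi) * math.log(pi / qi) for pi, qi in zip(p, q))
```

A value near zero suggests the two paths produce similar data for that feature. What counts as a worrying value depends on the application and has to be calibrated against known-good periods.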

Labeling issues form a category of their own when making sure a model can learn from data. At one extreme, setting labels via trivial heuristics works well and involves no human effort, e.g. for unsupervised visual representation learning, label two consecutive frames of a video as similar and two frames far apart in time as dissimilar. At the other extreme, it takes cross-team collaboration and multiple iterations to arrive at clear and actionable definitions, followed by laborious collection. Philosophical debate on concepts seems detached from reality but is in fact relevant to practical labeling contexts. For example, should the label be ‘car’ here?

3. Choose the model based on plausible mid-level correlations

Choosing the best model or model architecture for a problem requires an intuition for what it would capture. It is not always necessary to fully represent the target concept and problem-specific shortcuts often provide alternatives that are easier to learn. Consider a toy action recognition problem of classifying short video sequences into two categories, running and jumping jacks – the latter generates more localized motion patterns, so the extent of the motion is a good proxy. There is no step-by-step process to come up with a model, but past experience and some amount of ad-hoc experimentation typically point towards a candidate. Determining the most suitable model for a problem has commonalities with model explainability, an established research topic.

As for choosing between Deep Learning and “classical” ML approaches, it is not a given that neural networks work best for every problem. In scenarios in which only a small amount of labeled data is available (e.g. thousands of instances), a large network would likely overfit. On the other hand, techniques like transfer learning specifically address these scenarios, so the final answer depends on problem domain and circumstances.

4. Compare results against an easily understood baseline using multiple ML metrics, and require business metric improvements to update production

Traditional Machine Learning metrics like NDCG are often weakly correlated with the overall system’s business metrics, such as session success rate. In web search, for example, as the system matures and reaches a certain level of quality, gains in “offline”/traditional metrics tend not to be mirrored in “online”/business metrics. It is important to distinguish between what can be calculated easily and what actually defines impact, and to measure the latter in an A/B test, shipping to production only if the values improve significantly.

Unfortunately, even interpreting traditional metrics is not simple. What does a statement like “my new model reduces errors by 50%” mean? Perhaps precision for a binary classifier went from 98% to 99%, likely a remarkable technical feat, but with negligible benefit to the overall system. Or maybe precision for the same classifier increased from 50% to 75%, a potentially easier but more impactful achievement. One thing to verify in both cases is that the recall was the same or better; ideally, the change should also be characterized in qualitative terms, e.g. the model does better on large articulated objects.
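The arithmetic behind the two scenarios can be made explicit. The sketch below treats the error as one minus precision, which matches the numbers in the text; the function names are ours and purely illustrative.

```python
def relative_error_reduction(old_precision: float, new_precision: float) -> float:
    """Fraction of the old error (1 - precision) removed by the new model."""
    old_error = 1.0 - old_precision
    new_error = 1.0 - new_precision
    if old_error <= 0.0:
        raise ValueError("old model has no errors to reduce")
    return (old_error - new_error) / old_error


def false_positives_per_1000(precision: float) -> int:
    """Expected wrong results among 1000 items the classifier flags as positive."""
    return round(1000 * (1.0 - precision))


for old, new in [(0.98, 0.99), (0.50, 0.75)]:
    print(
        f"precision {old:.2f} -> {new:.2f}: "
        f"relative error reduction {relative_error_reduction(old, new):.0%}, "
        f"false positives per 1000 flagged items "
        f"{false_positives_per_1000(old)} -> {false_positives_per_1000(new)}"
    )
```

Both changes halve the error, yet the first removes 10 false positives per 1000 flagged items and the second removes 250, which is why the relative figure alone says little about impact.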

At least two more elements belong to the measurement picture. The first is establishing a baseline, which should be generic and easy to understand; in many cases, this can simply be an off-the-shelf network for a related problem. The second is “taking the derivative of metrics” by observing their variation for conceptually small changes in the ML setup. This adds context and helps better understand new methods: if the 10% bump from switching to a complex architecture is on the same order of magnitude as the effects of heuristic label cleanup or simple parameter tuning, then it is not notable.
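The "derivative" check can be reduced to a simple comparison, sketched below. The function name and the factor of 3 are assumptions made for illustration; the text only says that a gain of the same order of magnitude as the effect of small changes is not notable.

```python
from typing import Sequence


def is_notable(
    new_method_gain: float,
    small_change_gains: Sequence[float],
    factor: float = 3.0,  # assumed margin, not a value from the text
) -> bool:
    """Return True if the gain clearly exceeds what small tweaks achieve.

    small_change_gains holds metric deltas observed for conceptually small
    changes, e.g. heuristic label cleanup or simple parameter tuning.
    """
    if not small_change_gains:
        raise ValueError("need at least one small-change measurement")
    reference = max(abs(g) for g in small_change_gains)
    return abs(new_method_gain) >= factor * reference


# A 10% bump next to tweaks worth 8% and 5% is not notable under this rule.
print(is_notable(0.10, [0.08, 0.05]))
# The same bump next to tweaks worth 1% and 2% is.
print(is_notable(0.10, [0.01, 0.02]))
```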

5. Ensure the results of the overall system are reproducible

Consolidating the results of model development experiments is clearly desirable, but there never seems to be a good time to re-run an experiment involving a user facing system. Doing so can slow down overall progress or can reduce the bandwidth of a labeling workforce – and isn’t there a hint of madness in doing the same thing repeatedly and expecting different results? However, the key fact here is that whenever people are involved, asking the same question in different ways or to different individuals will yield more accurate results. This is why, for example, it is standard practice to aggregate responses from multiple human judges per item (e.g. 3) when evaluating textual question answering systems. More generally, multiple user tests must be conducted to assess the value of an ML model.
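Aggregating several judgments per item can be as simple as a majority vote, sketched below. Majority voting is one common choice and is our assumption here; the function name and the convention of returning None on a tie are also illustrative.

```python
from collections import Counter
from typing import Optional, Sequence, Tuple


def aggregate_judgments(labels: Sequence[str]) -> Tuple[Optional[str], float]:
    """Majority vote over the judges' labels for one item.

    Returns (winning label, agreement), where agreement is the share of
    judges who chose the most frequent label. The label is None on a tie,
    which signals that the item needs more judges or a clearer guideline.
    """
    if not labels:
        raise ValueError("need at least one judgment")
    ranked = Counter(labels).most_common()
    top_label, top_count = ranked[0]
    agreement = top_count / len(labels)
    if len(ranked) > 1 and ranked[1][1] == top_count:
        return None, agreement
    return top_label, agreement


print(aggregate_judgments(["good", "good", "bad"]))
print(aggregate_judgments(["good", "bad"]))
```

Tracking the agreement value alongside the label helps spot items or guidelines that judges interpret inconsistently.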

We extend the concept of reproducibility to the system in which the model is embedded. If the ML component is often skipped due to reasons unrelated to its functioning, this will introduce noise in the business metrics and obscure the ML modeling goals. If the component workflow is rearranged every two quarters, drastically altering the role of the ML model, then strong claims cannot be made about whether ML is beneficial or not. Both situations should trigger revisiting the decision to incorporate ML into the system.

6. Derive improvement ideas from a joint understanding of the data and the model

What is the best way to push a stubborn ROC curve towards the top-left? Get more data? Add more layers? Most of the time, and especially when it matters, the answer is not easy. This document does not aim to provide a recipe for model quality improvements (Andrej Karpathy’s blog post is a great resource for neural networks), but rather to plead in favor of general guidelines. The gist is to look at the data and to reason about how the model processes it.

In Computer Vision, examining data comes with surprising perils: practitioners may fixate on the visual pattern the target class generates in a single image and automatically assume universality, or they may become mesmerized by three search results and fail to notice that their method does not differ from prior art. Such temptations must be resisted and qualitative observations must be accompanied by rigorous characterizations of the data distribution(s), the intermediate quantities computed by the model, the evolution of metrics during training, etc.

Debugging naturally goes hand in hand with improvement but figuring out a problem does not automatically lead to a solution. A general strategy to arrive at solutions is to think of ways in which the model components are mathematically not well suited to the task they are employed for. To borrow an example from the scientific literature, the highly cited Eigenfaces vs Fisherfaces paper was based on the insight that LDA is more appropriate than PCA for building representations for face recognition.

7. Identify temporal challenges and opportunities

The majority of ML implementations will be in use for months or years as opposed to being one-off projects. While some organizations effectively view their data through a temporal sliding window, e.g. by discarding records of user actions older than 30 days, many organizations do not. For the latter, the training data gradually becomes outdated and at a certain point a new model will urgently be needed. The situation will put significant stress on ML developers, because tough questions like how to optimally combine old and new data or how to resolve certain types of regressions will have to be answered accurately in a short amount of time.

Separate from having to update the model in response to evolving data, there is a need to reduce the bias that the current model version adds to the future training data. Within an ad serving system, for example, the ads favored by the model will dominate the training data, potentially creating a feedback loop where good ads are preferred over optimal ones because the total loss focuses on accurate CTR predictions for the former. A method to counter this tendency is to pick ads at random for a fraction of the traffic (e.g. 10%) and assign higher weights to the resulting training data instances.
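The random-traffic idea can be sketched as follows. The function name, the dictionary of predicted CTRs and the weight of 5.0 are assumptions for illustration; the text specifies only the random fraction (e.g. 10%) and that the resulting instances receive higher weights.

```python
import random
from typing import Dict, Optional, Sequence, Tuple


def choose_ad(
    ads: Sequence[str],
    predicted_ctr: Dict[str, float],
    explore_fraction: float = 0.10,  # share of traffic served at random
    explore_weight: float = 5.0,     # assumed training weight, tune per system
    rng: Optional[random.Random] = None,
) -> Tuple[str, float]:
    """Pick an ad and the weight its logged outcome gets in future training."""
    rng = rng or random.Random()
    if rng.random() < explore_fraction:
        # Exploration: ignore the model so that every ad gets some exposure.
        return rng.choice(list(ads)), explore_weight
    # Exploitation: serve the ad the current model ranks highest.
    return max(ads, key=lambda ad: predicted_ctr[ad]), 1.0


ads = ["a", "b", "c"]
ctr = {"a": 0.020, "b": 0.035, "c": 0.010}
ad, weight = choose_ad(ads, ctr, rng=random.Random(0))
print(ad, weight)
```

Logging the returned weight with each impression lets the training pipeline up-weight the randomly served instances without having to reconstruct how each ad was chosen.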

As time goes by and the system is exposed to more data, opportunities also open up. One of them is collecting tail examples, which are highly valuable in application domains like self-driving cars. Additionally, unsupervised learning techniques can harness the power of a large data corpus by providing powerful representations to bootstrap supervised methods.

Scientific Excellence vs. Engineering Excellence

The two concepts of excellence coexist naturally and depend on each other implicitly. Practices pertaining to the Minimum Viable Product (MVP) can be preceded by determining a good problem for both ML and business needs (a Maximum Impact Solvable Problem?) and succeeded by building a data repository. As a more basic example of coexistence, the software infrastructure for monitoring can also implement detection of large changes in the data distribution. Regarding interdependence, the time and effort put into good engineering will be wasted if it is not accompanied by similar investments in good science, and vice versa, as software and ML are both integral to ML applications.

Like software engineering excellence, scientific excellence brings a number of benefits: project de-risking, keeping tech debt under control, ease-of-maintenance, scalability and potential for future gains. Conversely, lack of scientific excellence can lead to delays, downtime, increased complexity and lack of closure. Engineering excellence acts like a gating mechanism for scientific excellence, in the sense that engineering best practices eventually enable scientific excellence benefits, while lack of good engineering hinders good science, potentially causing the aforementioned problems.


For ML-based software applications, scientific excellence in ML work is the analog of engineering excellence in software development. The two concepts are complementary but strongly related. Based on a few typical application situations, we outlined principles of scientific excellence with the intention to crystallize the ideas on ML best practices and to help develop them further.

What do you think? We welcome your feedback!
