Machine Learning Engineering with Python - Published on 5th Nov!

Not long to go now until Machine Learning Engineering with Python is published on 5th November, which happens to be Bonfire Night in the UK. I look forward to seeing lots of fireworks and pretending it's to celebrate my book finally being out!

Ahead of publication I wanted to thank so many of you for your kind words of encouragement and for, of course, supporting the book. I really hope it does help people working in the machine learning space in some small way.

I also wanted to share a little teaser from Chapter 1: Introduction to ML Engineering, where I talk about what I believe is important to consider when doing machine learning 'in the real world'. I hope you enjoy the snippet and that you enjoy the book!

"The majority of us who work in machine learning, analytics, and related disciplines do so for for-profit companies. It is important therefore that we consider some of the important aspects of doing this type of work in the real world.

First of all, the ultimate goal of your work is to generate value. This can be calculated and defined in a variety of ways, but fundamentally your work has to improve something for the company or their customers in a way that justifies the investment put in. This is why most companies will not be happy for you to take a year to play with new tools and then generate nothing concrete to show for it (not that you would do this anyway, it is probably quite boring) or to spend your days reading the latest papers and only reading the latest papers. Yes, these things are part of any job in technology, and especially any job in the world of machine learning, but you have to be strategic about how you spend your time and always be aware of your value proposition.

Secondly, to be a successful ML engineer in the real world, you cannot just understand the technology; you must understand the business. You will have to understand how the company works day to day, you will have to understand how the different pieces of the company fit together, and you will have to understand the people of the company and their roles. Most importantly, you have to understand the customer, both of the business and of your work. If you do not know the motivations, pains, and needs of the people you are building for, then how can you be expected to build the right thing?

Finally, and this may be controversial, the most important skill for you being a successful ML engineer in the real world is one that this book will not teach you, and that is the ability to communicate effectively. You will have to work in a team, with a manager, with the wider community and business, and, of course, with your customers, as mentioned above. If you can do this and you know the technology and techniques (many of which are discussed in this book), then what can stop you?"



Machine Learning Engineering with Python - available for pre-order!

Machine Learning Engineering with Python: Manage the production life cycle of machine learning models using standard processes and designs by Andrew McMahon

After many months of writing, developing, designing and researching after work and at weekends, my book is now available for pre-order on Amazon!

I'm really excited as I've always wanted to write books and I had set myself a goal of writing at least one book by the time I was 30. So, as long as the book is fully available for order in October as planned, I would say challenge completed (my birthday is at the end of November).

Why did I write this book?

I have been working for a few years in a variety of roles in data science and machine learning (ML), and my career has now gone strongly in the direction of focussing on productionization, a totally made up word that we all now use to mean 'taking ML proof-of-concepts and making them into working software solutions'. I've found this to be the hardest problem in industrial data science and machine learning, or at least the hardest problem that comes up often enough to justify focussing on it.

So given this focus and how much I know teams can struggle to understand some of the things they need to go from cool idea or draft model through to working solution, I decided to write this book. It's by no means perfect, but I hope it's a good collection of some of the ideas, tools and techniques that I think are most important when it comes to ML engineering.

What's in it?

The book consists of 8 chapters:

  1. Introduction to ML Engineering
  2. The Machine Learning Development Process
  3. From Model to Model Factory
  4. Packaging Up
  5. Deployment Patterns and Tools
  6. Scaling Up
  7. Building an Example ML Microservice
  8. Building an Extract Transform Machine Learning Use Case

The book will kick off with more strategic and process directed thinking. This is where I talk about what I think ML engineering means and what is so different about building ML solutions vs traditional programming.

We then move onto learning about how to create models again and again by building training services and then how to monitor models for important changes like concept or data drift. I then discuss strategies for triggering retraining of your models and how this all ties together.

Moving on, there's more of an emphasis on some important foundational pieces, like how to create good Python packages that wrap your ML functionality and what architecture patterns you can build against.

The book then takes a deep dive into a specific topic: mechanisms for scaling up your solution, with a particular focus on Apache Spark and serverless infrastructure on the Cloud.

Finally, the book finishes with 2 chapters on worked examples that bring together a lot of what's been discussed earlier in the book, with a particular focus on how to make the relevant choices to be successful when executing a real-life ML engineering project.

What's next?

First of all, I'm just super excited this is real. As I mentioned at the top of the article this has been a dream of mine for a long time and I think the topics are important ones to discuss. Hopefully the book can help people in data science, software development, machine learning and analytics roles be successful. That would make me happy too.

In terms of what's next I'm thinking that it would also be beneficial to expand the topics of the book into an online course. That way, people who would like the material structured in a slightly different way can also get the benefit. It would also allow me to expand on some of the material a bit more, giving a more conversational flavour to the material. I like the sound of that, hopefully you do too!

All in all, I just hope people enjoy the book and get benefit from it. I benefited immensely from technical books in this space when I was starting out (and I still do) so I'm really glad I can make my own small contribution to that body of learning.




Insight Through Innovation Talk

In August I spoke at an MBN Data Technology meetup event on 'Insight Through Innovation: Delivering Scalable Data Products Through The Latest Cloud Tech'.

My section focussed on how to use Databricks and surrounding technologies on the cloud to deliver machine learning services that are robust and drive value day in and day out.

Data Science Conference Europe 2020

In November 2020 I spoke at Data Science Conference Europe. I spoke about working in extended and multi-faceted data teams and using modern data platforms and technologies to deliver value for your organisation.

The conference didn't share videos of the talks publicly but I was featured by the conference on LinkedIn.

Quantum Field Theory [1]

Hi, I've learned a lot of physics over the years ... Some of it was dry (introductory optics was booooring) but most of it was pretty cool. One of the single most interesting topics I studied, however, was a physical theory called 'quantum field theory', whose aim was to unite the worlds of quantum mechanics with the idea of fields and of special relativity to create a really powerful set of tools for describing phenomena at the smallest scales.

What's a field?

In physics, a field is simply an entity which can be described by a number (or series of numbers) distributed in space. For example, we can consider the temperature across a room as being what is known as a 'scalar field' (scalar = 1 number to describe the field at each point in the room) or we can consider the wind across a country as a 'vector field' (vector = (x,y,z) direction of the wind, for example drawn by an arrow, at each point across the country). Fields crop up a whole lot in physics, and in fact the study of them mathematically is known as 'field theory'.
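
To make the scalar versus vector distinction concrete, here is a minimal Python sketch. The room, the heater at the origin and the circulating wind pattern are all made-up illustrations, not real physical models; the only point is that a scalar field returns one number per point while a vector field returns a set of components per point.

```python
import math


def temperature(x, y):
    """Scalar field: one number (degrees C) at each point (x, y) in a room.

    Hypothetical room with a heater at the origin; warmth falls off with distance.
    """
    return 18.0 + 6.0 * math.exp(-(x ** 2 + y ** 2))


def wind(x, y):
    """Vector field: an (x, y, z) arrow at each point across a country.

    Hypothetical wind circulating around the origin, with no vertical component.
    """
    return (-y, x, 0.0)


# Evaluate both fields at a couple of sample points.
at_heater = temperature(0.0, 0.0)   # a single number
arrow = wind(0.0, 2.0)              # three numbers describing a direction
```

Sampling either function over a grid of points gives you the kind of picture you would see on a weather map: colours for the scalar field, arrows for the vector field.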

Further to this, there is a distinction in physics between what are known as 'classical' or 'quantum' field theories. Classical field theories include the theory of electromagnetism, or of general relativity (gravity). They are denoted 'classical' by the fact that they describe classical or pre-quantum physics, not that they are necessarily any simpler or don't still provide fertile research ground.

The whole aim of a classical field theory is to describe mathematically the properties of a given field (or set of fields) and how they interact with matter. For example, electromagnetic field theory provides a description of how electric and magnetic fields interact with objects which possess charge, such as electrons and protons. General relativistic field theories describe how objects with mass interact with one another and create changes in the gravitational field through their distortion of spacetime.

So, that's a field theory, but what makes 'quantum' field theories so interesting?

What's a quantum field?

The description of the behaviour of very small objects in the universe is often accomplished by using the tools of quantum mechanics. Quantum mechanics tells us that the world of the very small (atoms, nuclei, electrons, protons, quarks etc.) is a world in which the normal rules of motion do not apply. For example, electrons are both waves and particles at the same time and they can be in two places at once. What became clear through the 20th century, however, was that quantum mechanics was not providing a complete picture of behaviour at very small scales for a variety of phenomena.

For example, the electromagnetic (EM) field around a point charge, such as the electron, was a bit of a conundrum. Physicists like Jordan, Heisenberg and Born had developed a method for creating a quantum field theory for the free EM field in the mid-1920s (think photons). This was an excellent start but it didn't have the concept of interactions in it, so if you plopped an electron into your system you couldn't really say much about it.


Embedding Insight Through Prediction Driven Logistics - Data AI Summit Talk - Nov 2020

Bit of a golden oldie here from Nov 2020! Do you even remember what life was like pre-pandemic? This was the talk I gave with Helena on our work at Aggreko with prediction driven logistics. Essentially all about how we were building models to perform regression on fuel levels and maintenance issues.

Some super cool stuff and a big profile event which was great!

Speaker blurb from the conference.

Hope you enjoy!

Machine Learning Architecture - Designing Scalable Solutions

A large part of my job involves designing, building and deploying machine learning solutions for the company I work for. In this post I'm going to quickly highlight some of my key thoughts on this so that if you are new to ML architecture, you can have some good pointers to get from "blank sheet of paper" to "deployed and running" faster.

The Main ML Architectures

The way I see it, there are only a few ML solution architecture types (they may have fancy names somewhere but I won't know them). I reckon you could assign most, if not all, machine learning implementations into one of these categories (probably because they are so vague/wide-reaching):

  1. The ML Microservice
  2. The Streaming Process
  3. The Batch Process
  4. The Integrated Approach

I'll go through each of these and list some examples to get you acquainted with what I mean.

The ML Microservice

Imagine you have a map based product, and the user can interact with this product by clicking on one of the entities on the map, for example a truck, and then they can click a button that says "Predict ETA at destination". If I was building an ML based solution for providing that prediction then it would most likely be behind a REST API to some application that read in a request for a prediction and quickly returned that prediction. One way of doing this would be to deploy a lightweight web application (like a Flask app) to a cloud based hosting service. You could design the Flask app to read in your pre-trained machine learning model from a model repository (cloud based storage) and serve predictions via a simple HTTP request-response interaction.
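
As a rough sketch of that request-response pattern, here is a tiny prediction service. To keep it dependency-free I've used Python's built-in http.server rather than Flask, but the shape is the same: load the model once at start-up, then answer each request quickly. Everything specific here is an assumption for illustration: the /predict path, the distance_km and avg_speed_kmh fields, and the load_model stand-in, which in a real service would pull a pre-trained model from your model repository.

```python
import json
from http.server import BaseHTTPRequestHandler, HTTPServer


def load_model():
    """Hypothetical stand-in for reading a pre-trained model from cloud storage."""
    def model(distance_km, avg_speed_kmh):
        # Toy 'model': ETA in minutes from distance and average speed.
        return 60.0 * distance_km / avg_speed_kmh
    return model


MODEL = load_model()  # loaded once at start-up, not on every request


def predict_eta(payload):
    """Turn a parsed JSON request payload into a JSON-serialisable response."""
    eta = MODEL(float(payload["distance_km"]), float(payload["avg_speed_kmh"]))
    return {"eta_minutes": round(eta, 1)}


class PredictHandler(BaseHTTPRequestHandler):
    """Serves POST /predict with a JSON body and returns a JSON prediction."""

    def do_POST(self):
        if self.path != "/predict":
            self.send_error(404)
            return
        length = int(self.headers.get("Content-Length", 0))
        try:
            result = predict_eta(json.loads(self.rfile.read(length)))
        except (KeyError, TypeError, ValueError, ZeroDivisionError):
            self.send_error(400, "Bad request payload")
            return
        body = json.dumps(result).encode("utf-8")
        self.send_response(200)
        self.send_header("Content-Type", "application/json")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)


def serve(host="127.0.0.1", port=8000):
    """Start the service; blocks until interrupted."""
    HTTPServer((host, port), PredictHandler).serve_forever()
```

Calling serve() starts the service locally, after which the "Predict ETA at destination" button would simply POST the truck's details to /predict. Keeping predict_eta separate from the HTTP handler means you can test the prediction logic without starting a server.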

The Streaming Process

In a 'streaming' approach, you are going to work directly with the data as it comes in. This is a type of 'event driven' architecture, whereby an action in one part of the system initiates the next and so on. It is particularly useful for processes which can be a bit ad-hoc in terms of frequency requirements. For example, you may want a streaming process to analyse click data from your web site as it comes in since the production of results works on the same frequency as the ingestion of the data. When there are more clicks, there will be more analysis, when there are no clicks, no analysis. This also means that you do not have to wait for some threshold amount of data to be ingested before performing your action, like you might in a 'batch' approach (see below).

You can do this using tools like Apache Kafka or Spark Streaming.
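
To show the idea without any infrastructure, here is a minimal sketch using a plain Python generator in place of a real stream. The fake_click_source function is a hypothetical stand-in for something like a Kafka consumer, and the "analysis" is just a running click count per page. The thing to notice is that a result is produced for each event as it arrives, rather than after a batch has built up.

```python
from collections import Counter


def analyse_clicks(events):
    """Consume click events one at a time and yield a running result per event."""
    clicks_per_page = Counter()
    for event in events:  # waits for the next event; no clicks means no work
        clicks_per_page[event["page"]] += 1
        yield event["page"], clicks_per_page[event["page"]]


def fake_click_source():
    """Hypothetical stand-in for a real event source such as a message queue consumer."""
    for page in ["/home", "/pricing", "/home"]:
        yield {"page": page}


# Each incoming click immediately produces an updated result.
results = list(analyse_clicks(fake_click_source()))
```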

The Batch Process (or 'ETML')

If the solution has the aim of producing a relatively standardised dataset with some kind of machine learning result on a schedule or very consistent basis, then it may make sense to deploy this as a batch processing service. In this case, you build a pipeline that extracts data, performs the relevant transformations to prepare for modelling, you then perform the model based calculations and surface the results. These steps are why I often call this 'ETML' ('ETL' being 'Extract, Transform, Load', 'ETML' is 'Extract, Transform, Machine Learning').

In my experience, this sort of architecture is often the easiest to develop and deploy. You are building something that is very stable and you usually do not have to worry about successful triggering of the pipeline. In streaming or dynamic services for example, you have to ensure that the REST API request or the event-driven trigger you employ works appropriately in a variety of circumstances through rigorous testing. For a scheduled batch process this is usually much simpler. You just set up your schedule via a cron job or other process, ensure that it works the first few times and then you're off.
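
Here is a minimal sketch of what an ETML pipeline can look like in plain Python. The data, the field names and the simple z-score "model" are all made up for illustration; in practice extract would read from your real data source and the machine learning step would call your trained model. The whole thing is wrapped in one run_pipeline function so that a scheduler only has one thing to call.

```python
import statistics


def extract():
    """Extract: hypothetical raw rows, standing in for a database or file read."""
    return [
        {"sensor": "a", "reading": "10.0"},
        {"sensor": "a", "reading": "11.0"},
        {"sensor": "a", "reading": ""},      # missing value to be cleaned out
        {"sensor": "a", "reading": "9.0"},
        {"sensor": "a", "reading": "10.0"},
        {"sensor": "a", "reading": "50.0"},
    ]


def transform(rows):
    """Transform: drop missing readings and convert the rest to floats."""
    return [float(row["reading"]) for row in rows if row["reading"] != ""]


def machine_learning(values, threshold=1.5):
    """Machine Learning: toy anomaly 'model' flagging readings far from the mean."""
    mean = statistics.mean(values)
    spread = statistics.pstdev(values)
    return [
        {"reading": v, "is_anomaly": spread > 0 and abs(v - mean) / spread > threshold}
        for v in values
    ]


def run_pipeline():
    """One scheduled run: extract, transform, then apply the model."""
    return machine_learning(transform(extract()))


results = run_pipeline()
anomalies = [r["reading"] for r in results if r["is_anomaly"]]
```

A cron job (or any other scheduler) then just needs to run the script that calls run_pipeline and writes the results wherever they need to be surfaced.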

The Integrated Approach

This final option is really a kind of exceptional case. I can totally imagine a scenario where you want to embed the machine learning capability directly within some general solution, rather than externalised as a separate service or process. Some good examples where this may apply are:

  • 'Edge' deployments: When you have machine learning running on an IoT device, potentially one with limited connectivity, it might make sense to directly put the machine learning functionality in your main piece of software. However, I would recommend actually following the 'ML Microservice' approach here using something like Docker containers.
  • Integration with another microservice: In this case, you may already be developing a microservice that performs some simple logic and the machine learning is really just an extension or variant of that. For example, if you have already built a Flask web app that returns some information or result, the machine learning could just be another end-point.
  • Out of the box product: If your team is in charge of adding functionality to a piece of software that you are selling onto a client, you may have to package up the machine learning inside the solution somehow. I'm less sure how this would work and I do think that you'd be better using a microservice approach in general.

In Summary

There are a few different ways to package up your machine learning solutions, and it's important to understand the pros and cons of each.


Whoa There! The Data Community and Covid-19

The coronavirus pandemic and its effect on our daily lives do not need an introduction - we are all living through this rather surreal situation. So, I'll just get right to it:

 What can the data science/analytics community do to help in this crisis? 

I think this is actually a really tough question. Given that, in this article I want to try and chew over some of my thoughts on this. I'm keen to get the conversation going in the wider analytics community, so please comment with thoughts and suggestions on Twitter, LinkedIn, or wherever you find this article and/or me. Let's get chatting.


The Dangers of Being a Convincing Amateur

The global (and my local) data community consists of an amazing, diverse bunch of supremely talented and passionate people, who often want to go out and "make the world a better place" (as corny as that sounds). They want to do this through building analyses, products, services and technologies that entertain, guide and enable us in ways we never thought possible before. Just think where we were 10 or even 5 years ago in terms of big data, IoT, machine learning and the cloud to get a sense of how our capabilities and reach have grown as data professionals.

Now, when something like a global pandemic hits in this glittering age of data and technology, what is this community going to do other than roll up their sleeves and get analysing and building? Exactly.

I absolutely commend this, and I am desperate to get stuck into this problem myself, but I think we must tread carefully as a community. It is very easy to convince yourself that because you know your way around industrial or academic data problems you can just start creating analyses on Covid-19 datasets and that this will automatically be helpful.

I think there are a few things that can quite quickly go wrong if we, as a community, get ahead of ourselves and do not do the necessary and important groundwork.

Basically, on an important topic like this one, there is nothing more dangerous than convincing amateurs. And let's be honest, if you do not already work in epidemiology or have some training in that area, you are an amateur. It will be a new topic, but because of your years churning out data science and analytics solutions you will be able to build something sophisticated that looks great and uses cool sounding techniques. In other words, you'll be a very convincing amateur.

This can be really dangerous because if we in the analytics community start churning out analyses left, right and centre without key knowledge of this field, we could be coming to conclusions or making suggestions that are very convincing but potentially dangerous. And, as we all know, this is definitely a dangerous situation.

This week I have already seen analyses published on LinkedIn looking at things like case rates and calculated fatality rates for Covid-19 data from across the world where the authors drew some stark conclusions about the efficacy of state interventions or extrapolated outbreaks with some interesting (but ultimately misleading) exponential functions. I am not saying here that these analyses shouldn't have been done, but what I am saying is that

a) there are better things we could/should be doing as a community and

b) we will all need to be careful about the language we use and how we communicate.
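
To make the point about exponential extrapolation concrete, here is a small sketch on purely synthetic numbers (this is not real Covid-19 data and not a real epidemic model). I generate cases from a made-up logistic curve that eventually levels off, fit an exponential to the first two weeks only, and then extrapolate to day 60. The capacity, rate and midpoint parameters are arbitrary assumptions chosen for illustration.

```python
import math


def logistic(t, capacity=10_000.0, rate=0.3, midpoint=30.0):
    """Synthetic cumulative cases: early growth looks exponential, then levels off."""
    return capacity / (1.0 + math.exp(-rate * (t - midpoint)))


def fit_exponential(days, cases):
    """Least-squares fit of log(cases) = a + b * day; returns (a, b)."""
    logs = [math.log(c) for c in cases]
    n = len(days)
    mean_x = sum(days) / n
    mean_y = sum(logs) / n
    b = sum((x - mean_x) * (y - mean_y) for x, y in zip(days, logs)) / sum(
        (x - mean_x) ** 2 for x in days
    )
    a = mean_y - b * mean_x
    return a, b


# Fit only on the first two weeks, where the curve is almost perfectly exponential.
early_days = list(range(15))
early_cases = [logistic(t) for t in early_days]
a, b = fit_exponential(early_days, early_cases)

# Extrapolate far beyond the fitted window and compare with the synthetic 'truth'.
day = 60
extrapolated = math.exp(a + b * day)
actual = logistic(day)
overshoot = extrapolated / actual
```

The exponential fits the early data almost perfectly, which is exactly what makes it so convincing, yet by day 60 it overshoots the curve that generated the data by several orders of magnitude. Real outbreaks are shaped by interventions, behaviour and immunity, which is why this sort of modelling is best left to, or done alongside, people who do it for a living.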

Fine, Andy, you aren't happy - so what do we do?


Levelling Up & Leveraging What You Know

First, let's consider the case where you really have your heart set on analysing the outbreak data (who doesn't love an armchair epidemiologist?).

In this case, the least you can do is level up your knowledge before you go drawing strong conclusions from your data. At a base level read this thread from Adam Kucharski on Twitter. Kucharski is an associate professor at the London School of Hygiene and Tropical Medicine where he works on, you guessed it, epidemiology. I'd really also recommend his book, The Rules of Contagion (not sponsored btw). You can also do the usual and audit a MOOC, for example I am currently doing the 'Epidemics' course from UPenn on Coursera. Doing this sort of reading will at least get you comfortable with the basics of the field.

I would then suggest that when exploring the Covid-19 datasets, don't try to do too much interpretation or extrapolation. This way you won't fall into the trap of over-reaching or making any drastic suggestions. I still think doing some exploratory and statistical analysis is super useful so by all means get stuck in. Just check yourself as you go though and be mindful of how you communicate your results - if you think you've discovered some weird quirk of the pandemic or some previously unknown intervention side effect, you probably haven't*.

If you are keen to look at the relationships between interventions and case rates or fatalities then please try and work in collaboration with someone who knows what they are talking about (but obviously be mindful that if they are indeed a disease modeller, they may be working flat out already and will not have much time to give you an education).

The other very important way we can all help is by leveraging what skills we have and what fields we know inside out. It might require some thought, but I am almost certain that all of you out there have skill-sets you can apply in a way that will genuinely help people in this pandemic.

Some ideas to get the ball rolling:

1. Dashboarding & Visualisation: I think one of the ways we can absolutely help other scientists working on this is to bring together the datasets they need and provide them with the tools they need to interact with them dynamically and easily. This is something that so many of you out there will be world class at. If you have developed Power BI or Tableau dashboards for organisations, if you're a whizz at building R Shiny or Flask Apps for customers then you can probably build something that will allow researchers to interact with relevant Covid-19 datasets and to glean insights from them.

2. Getting the Message Out: Whatever country you are in just now you have no doubt been watching a lot of press conferences, briefings and announcements by world leaders, health experts and of course the World Health Organisation (WHO). The greatest weapons we have in the fight against Covid-19 are the advice, suggestions and knowledge being expounded by WHO and other scientific bodies on an almost daily basis. Can you help get their message across? Perhaps you can build some web scraper that brings together all of the relevant health body advice in your country, or can you build automated alerts that link to different government sources to help people keep track? Could you even use those skills you developed in marketing analytics to work out the best ways to re-share, plug and package all this information so that the message gets out to the most people and the advice is followed?

3. Support the System: By far I think the best thing we can do is support the organisations and people who will both fight the disease and also keep your country running throughout this crisis. In the UK we've already seen some of the systems and processes we rely on in our daily lives being stretched and strained. For example we've had people panic buying and leaving supermarket shelves empty, even when there is no shortage of food or essential items. We've had to call up thousands of retired and lapsed medical staff to make sure we can cope with the coming surge in Covid-19 cases and the government have had to step in with a promise to pay people's wages. This is all unprecedented stuff and will not be solved by creating another Jupyter notebook that tracks the cases of Coronavirus. These are challenges to the system that supports our way of life. So what we need to do is find the people that are working to solve problems in logistics, economics, communications, transport, energy, social care and so on and ask if they need data produced or analysed, and help them if we can. If we can do this guided by their expertise (similar to my point on epidemiology above) then we will all automatically be so much more effective and our voluntary efforts will have more of an impact.

I saw two great examples of this the other day (and I'd be keen to hear if you have found more!):

1. Crowd Fight Covid19 : This initiative is acting as a market place for scientists to help one another and collaborate. Why not see if anyone needs a data scientist or data analyst? From the site:

"Our proposal: This is a service for COVID-19 researchers. They only need to state a wish or a task, which can go from a simple time-intensive task to be performed (e.g. transcribe data, manually annotate images), to answering a technical question which is beyond their expertise, or to setting up a collaboration. They only need to explain their request in a few lines. Then, another scientist makes the effort of understanding that request and making it reality."

2. UCL made a call for mathematical modellers and data scientists to help them with their research. The call was filled up pretty quickly, but I bet you there are others like it out there. Seek these sorts of partnerships out!


If I was to summarise what I have been saying in a few bullet points it would be this:

  • Data skills can definitely help people during this crisis
  • Do not be an armchair epidemiologist, but by all means support epidemiologists (and other health care professionals) with your experience and skill sets, preferably under their guidance
  • Do work on tools to enable relevant professionals
  • Work on getting the public health messaging out there
  • Think laterally about the problems you can help solve. You may not think that helping the NHS work out who to send their most urgent advice to is as sexy as predicting the number of cases in a country, but you will definitely save more lives.

*I mean you might have .... but you probably haven't.