Whoa There! The Data Community and Covid-19

The coronavirus pandemic and its effect on our daily lives need no introduction - we are all living through this rather surreal situation. So, I'll just get right to it:

 What can the data science/analytics community do to help in this crisis? 

I think this is actually a really tough question. Given that, in this article I want to try and chew over some of my thoughts on this. I'm keen to get the conversation going in the wider analytics community, so please comment with thoughts and suggestions on electricweegie.com, Twitter or LinkedIn. Wherever you find this article and/or me, let's get chatting.


The Dangers of Being a Convincing Amateur

The global (and my local) data community consists of an amazing, diverse bunch of supremely talented and passionate people, who often want to go out and "make the world a better place" (as corny as that sounds). They want to do this through building analyses, products, services and technologies that entertain, guide and enable us in ways we never thought possible before. Just think where we were 10 or even 5 years ago in terms of big data, IoT, machine learning and the cloud to get a sense of how our capabilities and reach have grown as data professionals.

Now, when something like a global pandemic hits in this glittering age of data and technology, what is this community going to do other than roll up its sleeves and get analysing and building? Exactly.

I absolutely commend this, and I am desperate to get stuck into this problem myself, but I think we must tread carefully as a community. It is very easy to convince yourself that because you know your way around industrial or academic data problems you can just start creating analyses on Covid-19 datasets and that this will automatically be helpful.

I think there are a few things that can quite quickly go wrong if we, as a community, get ahead of ourselves and do not do the necessary and important groundwork.

Basically, on an important topic like this one, there is nothing more dangerous than convincing amateurs. And let's be honest, if you do not already work in epidemiology or have some training in that area, you are an amateur. It will be a new topic, but because of your years churning out data science and analytics solutions you will be able to build something sophisticated that looks great and uses cool sounding techniques. In other words, you'll be a very convincing amateur.

This can be really dangerous because if we in the analytics community start churning out analyses left, right and centre without key knowledge of this field, we could be coming to conclusions or making suggestions that are very convincing but potentially dangerous. And, as we all know, this is definitely a dangerous situation.

This week I have already seen analyses published on LinkedIn looking at things like case rates and calculated fatality rates for Covid-19 data from across the world where the authors drew some stark conclusions about the efficacy of state interventions or extrapolated outbreaks with some interesting (but ultimately misleading) exponential functions. I am not saying here that these analyses shouldn't have been done, but what I am saying is that

a) there are better things we could/should be doing as a community and

b) we will all need to be careful about the language we use and how we communicate.

Fine, Andy, you aren't happy - so what do we do?


Levelling Up & Leveraging What You Know

First, let's consider the case where you really have your heart set on analysing the outbreak data (who doesn't love an armchair epidemiologist?).

In this case, the least you can do is level up your knowledge before you go drawing strong conclusions from your data. At a base level, read this thread from Adam Kucharski on Twitter. Kucharski is an associate professor at the London School of Hygiene and Tropical Medicine, where he works on, you guessed it, epidemiology. I'd also really recommend his book, The Rules of Contagion (not sponsored btw). You can also do the usual and audit a MOOC; for example, I am currently doing the 'Epidemics' course from UPenn on Coursera. Doing this sort of reading will at least get you comfortable with the basics of the field.

I would then suggest that when exploring the Covid-19 datasets, don't try to do too much interpretation or extrapolation. This way you won't fall into the trap of over-reaching or making any drastic suggestions. I still think doing some exploratory and statistical analysis is super useful, so by all means get stuck in. Just check yourself as you go, though, and be mindful of how you communicate your results - if you think you've discovered some weird quirk of the pandemic or some previously unknown intervention side effect, you probably haven't*.

If you are keen to look at the relationships between interventions and case rates or fatalities, then please try and work in collaboration with someone who knows what they are talking about (but obviously be mindful that if they are indeed a disease modeller, they may be working flat out already and will not have much time to give you an education).

The other very important way we can all help is by leveraging what skills we have and what fields we know inside out. It might require some thought, but I am almost certain that all of you out there have skill-sets you can apply in a way that will genuinely help people in this pandemic.

Some ideas to get the ball rolling:

1. Dashboarding & Visualisation: I think one of the ways we can absolutely help other scientists working on this is to bring together the datasets they need and provide them with the tools they need to interact with them dynamically and easily. This is something that so many of you out there will be world class at. If you have developed Power BI or Tableau dashboards for organisations, or if you're a whizz at building R Shiny or Flask apps for customers, then you can probably build something that will allow researchers to interact with relevant Covid-19 datasets and to glean insights from them.

2. Getting the Message Out: Whatever country you are in just now, you have no doubt been watching a lot of press conferences, briefings and announcements by world leaders, health experts and of course the World Health Organisation (WHO). The greatest weapons we have in the fight against Covid-19 are the advice, suggestions and knowledge being expounded by the WHO and other scientific bodies on an almost daily basis. Can you help get their message across? Perhaps you can build a web scraper that brings together all of the relevant health-body advice in your country, or build automated alerts that link to different government sources to help people keep track? Could you even use those skills you developed in marketing analytics to work out the best ways to re-share, plug and package all this information so that the message gets out to the most people and the advice is followed?
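As a very rough sketch of the scraper idea (the page structure, sample HTML and headline tags here are entirely hypothetical - a real health-body site would need its own parsing logic, and you should check its terms of use and robots.txt first), something like this standard-library parser could pull advice headlines out of a fetched page:

```python
from html.parser import HTMLParser


class AdviceHeadlineParser(HTMLParser):
    """Collects the text of <h2> headlines, e.g. from a public health advice page."""

    def __init__(self):
        super().__init__()
        self._in_h2 = False
        self.headlines = []

    def handle_starttag(self, tag, attrs):
        if tag == "h2":
            self._in_h2 = True

    def handle_endtag(self, tag):
        if tag == "h2":
            self._in_h2 = False

    def handle_data(self, data):
        # Only keep text that sits inside an <h2> element
        if self._in_h2 and data.strip():
            self.headlines.append(data.strip())


def extract_advice_headlines(html: str) -> list:
    """Return the <h2> headline texts found in an HTML document."""
    parser = AdviceHeadlineParser()
    parser.feed(html)
    return parser.headlines


# Hypothetical page content - in practice you would fetch this with urllib or requests
sample = (
    "<html><body>"
    "<h2>Wash your hands</h2><p>Details...</p>"
    "<h2>Stay at home</h2><p>Details...</p>"
    "</body></html>"
)
print(extract_advice_headlines(sample))  # ['Wash your hands', 'Stay at home']
```

From there, the extracted headlines could feed an alerts feed or a daily digest, but the fetching, scheduling and per-site parsing are left out here deliberately.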

3. Support the System: By far, I think the best thing we can do is support the organisations and people who will both fight the disease and also keep your country running throughout this crisis. In the UK we've already seen some of the systems and processes we rely on in our daily lives being stretched and strained. For example, we've had people panic buying and leaving supermarket shelves empty, even when there is no shortage of food or essential items. We've had to call up thousands of retired and lapsed medical staff to make sure we can cope with the coming surge in Covid-19 cases, and the government has had to step in with a promise to pay people's wages. This is all unprecedented stuff and will not be solved by creating another Jupyter notebook that tracks the cases of coronavirus. These are challenges to the system that supports our way of life. So what we need to do is find the people who are working to solve problems in logistics, economics, communications, transport, energy, social care and so on, ask if they need data produced or analysed, and help them if we can. If we can do this guided by their expertise (similar to my point on epidemiology above), then we will all automatically be so much more effective and our voluntary efforts will have more of an impact.

I saw two great examples of this just the other day (and I'd be keen to hear if you have found more!):

1. Crowd Fight Covid19: This initiative is acting as a marketplace for scientists to help one another and collaborate. Why not see if anyone needs a data scientist or data analyst? From the site:

"Our proposal: This is a service for COVID-19 researchers. They only need to state a wish or a task, which can go from a simple time-intensive task to be performed (e.g. transcribe data, manually annotate images), to answering a technical question which is beyond their expertise, or to setting up a collaboration. They only need to explain their request in a few lines. Then, another scientist makes the effort of understanding that request and making it reality."

2. UCL made a call for mathematical modellers and data scientists to help them with their research. The call was filled up pretty quickly, but I bet you there are others like it out there. Seek these sorts of partnerships out!


If I were to summarise what I have been saying in a few bullet points, it would be this:

  • Data skills can definitely help people during this crisis
  • Do not be an armchair epidemiologist, but by all means support epidemiologists (and other healthcare professionals) with your experience and skill sets, preferably under their guidance
  • Do work on tools to enable relevant professionals
  • Work on getting the public health messaging out there
  • Think laterally about the problems you can help solve. You may not think that helping the NHS work out who to send their most urgent advice to is as sexy as predicting the number of cases in a country, but you will definitely save more lives.

*I mean you might have .... but you probably haven't.


Educating Clients about Machine Learning and AI

The responsibilities of a data scientist or machine learning engineer can vary tremendously depending on your industry, the company you work for, the type of projects you typically work on and what stage of your career you are at. However, an important and, I believe, commonly overlooked skill that is key to master if you want to progress in your data science career is ‘teaching the client’.

I know that there will be many ways of interpreting what I mean by this so I am going to focus on a very specific problem often faced by data scientists who have to map out the problem with clients (either internal or external):

How do you communicate the ideas, concepts and potential utility of machine learning and AI to non-experts in a way that:

  1. Empowers them to make decisions based on fact and not hype,
  2. Helps them understand what is required to successfully implement machine learning and AI,
  3. Manages their expectations.

This is no easy task, but in this article I am going to share some of the things I have learned from discussing, designing and executing machine learning projects with a variety of clients and managers. Hopefully there will be something in my experience that you can apply to your own work!

WTF is AI, WTF is ML?

It is so important not to go in all guns blazing when explaining what may be a complex solution to a client. Focus on the high level concepts, what data you are using and, most importantly, if and how it solves their problem.

The best place to start with someone who has heard only tangentially about machine learning or AI is to try and define it for them. The key is not to go heavy on jargon like these definitions from Wikipedia:

In computer science AI research is defined as the study of “intelligent agents”: any device that perceives its environment and takes actions that maximize its chance of successfully achieving its goals.[1] Colloquially, the term “artificial intelligence” is applied when a machine mimics “cognitive” functions that humans associate with other human minds, such as “learning” and “problem solving”

Machine learning is a field of computer science that uses statistical techniques to give computer systems the ability to “learn” (e.g., progressively improve performance on a specific task) with data, without being explicitly programmed.

That’s not to say there is anything wrong with these descriptions, they are just a bit … academic. Some of the ways I’ve tried to explain machine learning and AI to clients is by saying something like the following:

You can think of the field of AI as simply the study of how we make computers behave, act and solve problems like humans or animals. Machine learning is a subset of AI where algorithms don’t have to be programmed with hard-coded rules (‘if this is the case, then do that’) in order to solve a problem, but instead work out how to solve specific types of problem through exposure to data.

Now admittedly this is quite similar to the above, but I feel that it’s a bit less intimidating for a non-scientist or someone new to the field (as most business managers and clients are likely to be!).

Another great tactic is to give examples/concrete use cases. For example, imagine you want to predict the number of umbrellas that you are going to sell in a shop because you want to make sure you have enough stock. You have a look at the data and it’s clear that when it rains, more umbrellas are sold. So you hard-code a rule in your system that says ‘if the weather forecast predicts rain, order 100 umbrellas’. That’s a very basic way to solve the problem, but a client will clearly understand this as your “baseline”. You can then tell them that if you wanted to do this with machine learning, you take the data and tell a machine learning algorithm that the number of umbrellas is the “target” and the other data are what can be used for predicting this target. The algorithm then takes in the data and produces a model, so that when you feed it similar data to what you’ve fed it before (for example the day of the week, the weather forecast and how many umbrellas you sold over the past week), it makes a prediction with a given accuracy. The first is you hard-coding a solution; the second is a machine learning or data science solution.

This is a bit contrived, but I feel that a concrete example like this covers a lot of technical topics, like “targets”, “covariates/features”, “training/testing”, “lift” and even “autoregression” without splattering jargon everywhere. An example like this can be used in many contexts, and most people will get the transferability to any other prediction problem.
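To make the contrast concrete in code (the numbers here are made up for illustration, and a single rain-only feature is a deliberate simplification - a real forecasting model would use more features and a proper library), here is the hard-coded rule next to a minimal “learned” model:

```python
# Hard-coded baseline: a hand-written rule, no learning involved.
def rule_based_order(rain_forecast_mm: float) -> float:
    return 100.0 if rain_forecast_mm > 0 else 10.0


# "Machine learning" version: fit a simple linear model from historical data
# using the closed-form least-squares solution for one feature.
def fit_linear(xs, ys):
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    slope = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / sum(
        (x - mean_x) ** 2 for x in xs
    )
    intercept = mean_y - slope * mean_x
    return slope, intercept


# Hypothetical historical data: forecast rain (mm) vs umbrellas sold.
rain = [0, 2, 5, 10, 1, 8]
sold = [12, 35, 70, 130, 22, 105]

slope, intercept = fit_linear(rain, sold)


def predict_umbrellas(rain_forecast_mm: float) -> float:
    """The 'learned' rule: prediction derived from data, not hand-written."""
    return slope * rain_forecast_mm + intercept
```

The hand-written rule encodes the analyst's guess directly; the fitted model derives its rule from the data, which is exactly the distinction the umbrella story is trying to get across.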

Can we get one of them AlphaGo’s? Managing expectations

The next thing you have to do is educate your client or partner on the reality of all of these cool capabilities. A lot of people will have seen AlphaGo beat Lee Sedol or computer vision software successfully count the number of faces in a crowd and think that your problem must be easy. This is very rarely the case.

To manage expectations, just tell the client what you think is feasible given what you know about their data and setup. If you have successfully completed projects with similar starting points before, then you are onto a winner, since you can draw quite heavily on this experience and use it as an excellent example. Don’t panic if you haven’t, though; managing expectations is still very doable.

First, always make sure people are aware that machine learning is really centred around doing one of the following:

  1. Classifying something — telling you what something is,
  2. Predicting something — telling you what is likely to happen,
  3. Grouping something — pointing out things which are similar.

The optional 4th point to add to this list is ‘Solving something — acting intelligently to achieve a goal’ (read ‘reinforcement learning’), but if the client is new to ML this could create more confusion than is necessary at this point.
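To make those three task shapes concrete (these toy functions are deliberately naive stand-ins, not real ML algorithms - actual classifiers, forecasters and clustering methods learn from much richer data), each can be sketched in a few lines:

```python
# 1. Classifying something: which known label is this point most like?
#    (nearest-centroid on a single number; centroids maps label -> typical value)
def classify(x, centroids):
    return min(centroids, key=lambda label: abs(x - centroids[label]))


# 2. Predicting something: what is likely to happen next?
#    (naive extrapolation: repeat the last observed step)
def predict_next(series):
    return series[-1] + (series[-1] - series[-2])


# 3. Grouping something: which points sit close together?
#    (1-D clustering: a new cluster starts wherever the gap exceeds a threshold)
def group(points, gap=1.0):
    points = sorted(points)
    clusters, current = [], [points[0]]
    for p in points[1:]:
        if p - current[-1] <= gap:
            current.append(p)
        else:
            clusters.append(current)
            current = [p]
    clusters.append(current)
    return clusters


print(classify(4.9, {"cat": 5.0, "dog": 1.0}))  # cat
print(predict_next([10, 12, 14]))               # 16
print(group([1, 1.5, 5, 5.2]))                  # [[1, 1.5], [5, 5.2]]
```

None of these would survive contact with real data, but they capture the shape of each task, which is usually all a client needs at this stage of the conversation.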

Using these stripped-back and simplified definitions of classification, regression and clustering, the client can hopefully see a bit more clearly what ML is actually useful for, and they can rein in their expectations accordingly. They can also then hopefully see (with your guidance) that if an algorithm is good at classifying something (e.g. computer vision counting faces), it doesn’t mean the same algorithm can ‘predict something’ (forecast umbrella sales). This helps to highlight the fact that it’s ‘horses for courses’ when it comes to ML and there is ‘no free lunch’.

Secondly, always bring it back to the ultimate goal, which is to solve a given (business) problem for the minimum amount of investment (time, energy, money). If it is not going to be necessary to predict the number of umbrellas to 99% accuracy every day, then that level of accuracy isn’t even worth thinking about. Reiterate the Pareto principle that in many cases ‘80% of the effects arise from 20% of the causes’, so past a certain point you’ll only get small gains for a lot more effort.

Finally, to help rein in expectations, it is sometimes important to highlight that when some company makes a big announcement about an amazing new machine learning system, they are only showing you the sparkling whites and not their dirty laundry. Any of these highly publicised systems (computer-vision-as-a-service offerings from cloud providers are one particular case) will have areas where they don’t apply, can produce erroneous results, and will often have been the result of an army of people with a lot of resources focussed on producing that one particular tool. This doesn’t mean you can’t solve your client’s problem; it just means that they have to be aware that machine learning projects are like any other project: more ambitious goals will require more resources. It’s that simple.