Scaling Machine Learning Workloads for the LLM Era — Python Glasgow 24th Oct 2023
ElectricWeegie, October 27, 2023
Cross post from Medium.

On Tuesday 24th October, I was lucky enough to speak at an AI and ML themed Python Glasgow meet-up. It was a brilliant event, hosted at the MBN offices in the city centre. We had nearly 70 people there, all eager to hear something about machine learning, large language models and anything else we could throw at them.

The evening kicked off with a great pair of lightning talks by Mark Smith, who showed us how to build a Retrieval Augmented Generation (RAG) system using LLMs and MongoDB as a backend database. This was then used to analyse some log data using text based queries. The second talk was all about some fun stuff with mouse jigglers (and I'll leave it there!). It was excellent; kudos to Mark for stepping in when we had an unfortunate cancellation!

Then it was my turn. My talk was titled "Scaling Python Machine Learning Workloads for the LLM Era: Distributed and accelerated Python for MLOps and LLMOps". As you can maybe gather, it was about making Python processes scale for different sizes of machine learning workloads. Some of the topics I covered in the first half were:

- ML engineering, MLOps and LLMOps
- the global data landscape
- machine learning algorithm scaling behaviours
- the strong scaling hypothesis
- opportunities for scaling and distributing workloads in machine learning

For the main part of the talk I then spoke about how you can use the Ray framework to accelerate your Python data processing and machine learning workloads. I focussed on the ease-of-use of the Ray API and how it allows you to take existing Python machine learning code and supercharge it. I also spoke about how to leverage its packages to fine-tune LLMs in just a few lines of code.
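To give a flavour of what I mean by supercharging existing code, here is a minimal sketch (not lifted from the talk itself) of taking an ordinary Python function and fanning it out across a local Ray cluster. The train_model function and the hyperparameter values are made-up placeholders for whatever existing logic you already have.

```python
# A minimal sketch of "supercharging" existing Python with Ray.
# train_model and its arguments are illustrative placeholders.
import ray

ray.init()  # starts a local Ray cluster (or connects to an existing one if configured)

@ray.remote
def train_model(params):
    # imagine your existing training or scoring logic here
    return {"params": params, "score": sum(params.values())}

# Launch several runs in parallel; each call returns a future immediately.
futures = [train_model.remote({"lr": lr, "depth": d}) for lr in (0.01, 0.1) for d in (3, 5)]
results = ray.get(futures)  # block until all runs have finished
print(results)
```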
Through the talk, we were taking questions on a digital board to go through at the end, and there were so many good ones! Way too many to answer after the talk in fact, so I wanted to answer them all here, at least at some level. I hope this is useful. I am not going to share the full slides right now as I want to use some of the material for future courses. But I will definitely be sharing some key pieces soon!

Q & A

1. Is there a relationship between corpus size and number of parameters that leads to optimal training?

In general, for any machine learning model, there is always a balance to be struck between dataset size and the number of parameters in your model. Too many parameters leads to overfitting and poor generalization. Too few may lead to a poor representation of the data. There isn't a written-down mathematical rule for dealing with this (as far as I'm aware), so it's something that has to be worked out empirically and tracked as you perform training, testing and validation.

2. If you didn't come from an academic background, how did you get into ML?

Personally, I had an academic physics background so I had that sort of preparation. In general though, I think we need more varied routes into this field. Data, ML and AI are going to be such a critical part of the future that they cannot just be something owned by the PhDs of the world. So programmes like graduate apprenticeships and other more vocational routes should become more mainstream in my opinion. We also have a responsibility to talk to the general public, be it in schools or other places, and just share what's going on and educate around topics in technology.

3. How important is data quality? Can better data help reduce training costs?

We've all heard the adage "garbage in, garbage out". This captures the fact that data quality is one of the most important things for systems based on data, but especially for machine learning systems. A machine learning algorithm only knows about the data you give it, so if this is of poor quality you can't expect your algorithm to really "learn" anything important about your problem. Better data won't necessarily reduce training costs, but it will make your model more performant. A great example of this is the "Textbooks Are All You Need" paper, which showed some performance improvements for LLMs when you only use extremely high quality data.

4. How can I prevent my blog becoming part of the training corpus of OpenAI and others?

OpenAI published a mechanism for doing this recently, and I believe some platforms are doing this by default. In general though, this is quite an open problem that might not be settled until there are some landmark legal cases on intellectual property rights and AI training data.

5. How do you feel about security and privacy of LLMs? Challenges/risks/management of risk/behind the firewall etc.

Personally, I think that the data privacy challenges that we are now aware of due to the popularity of LLMs were always there. It's just that LLMs need as much data as possible, so more people are being affected, but data was always there for the taking. The new risks I see for LLMs are in applying guardrails to them so that they don't exhibit bias or discrimination, so that they don't generate false answers that are misleading and so that they comply with legal, regulatory, ethical and operational guidelines. The point about false answers, hallucinations, is such a fundamental feature of LLMs that this one requires tons of research and potentially entirely new algorithmic approaches to overcome. The others, I think, will be addressed by a combination of technical approaches and a lot of work on operational practices.

6. On the billions of tokens and the petabytes; do you/we have any understanding of what the 'missing tokens' represent?

I'm not sure what missing tokens we're talking about! Comment on this article please if you see this!

7. Do you have an 'everything in Python' approach, or do you use a variety of languages/stacks? If the latter, which tools serve which needs?

I love Python, and it can do a lot of things across the entire stack. But many data stacks will utilise SQL in a variety of dialects, languages like Scala, and of course many backend stacks are still based around Java (Scala also runs on the JVM, by the way). I don't work on frontend stuff, but working in the JavaScript/TypeScript space makes sense there too. For the workloads I've worked with before we've also used IaC (infrastructure-as-code) tools like Terraform for the infrastructure management side. I basically think it's horses for courses and the key is not to be dogmatic!

8. As GenAI generates more and more content on the internet, how can we continue training LLMs on the internet without model collapse?

Love this question. The only way we can stop this is by either dramatically increasing the quality of LLM output or reducing the volume of LLM output. I don't see either happening in the near term, so I feel we're going to see a real slowdown in the ability to gather quality training data in the near to medium term.
Then we'll potentially have a really high premium on the people who can generate high quality content (hello budding authors!), so we'll have this kind of cycle of going back to human generated content being key. This paper actually suggested that we will run out of even low quality training data in the next few decades on current trends.

9. (Disclaimer: new to the ML/LLM world) How do you strike the balance between scaling up computational resources and the performance of the model?

Great question. This relates to question 1. I always think that you need to look at this in terms of return on investment (ROI). This means considering carefully what value you think you can generate with a specific level of model performance and what further investment that justifies. For example, if you've spent £100k training a model to get to a 0.9 F1-score but the solution is likely to only generate £10k in value, you have definitely over-invested, right? If, on the other hand, you've spent £1k training a model that will generate £10k in value, and you know that gaining another few percentage points of performance will increase the returns even more, it might be worth investing in further compute and training time.

10. How do you balance operational ethics given the non-deterministic nature of generative LLMOps?

I think that AI ethics is going to be one of the most important disciplines as these capabilities develop. When it comes to running things operationally, the challenges are similar to the ones I mentioned in question 5, but the onus here is going to be on managing things like reputational risk and even aligning the behaviour of models with the specific ethos and values of companies. This will be hard to do, and as I said before many techniques still need to be developed.

11. When tuning the model in parallel, would you discard a "search area" if you're consistently getting bad results in it in order to speed up tuning?

Yes! Many hyperparameter optimization toolkits provide capabilities for 'early stopping' of bad trials.
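As an illustration, here is a minimal sketch of what that looks like with Ray Tune, using the ASHA scheduler to terminate underperforming trials early. The objective function is a toy stand-in for a real training loop, and the reporting API differs slightly between Ray versions, so treat it as indicative rather than definitive.

```python
# A minimal sketch of early stopping of bad hyperparameter trials with Ray Tune.
# The objective is a toy stand-in for a real training loop; exact reporting calls
# vary a little between Ray versions.
from ray import train, tune
from ray.tune.schedulers import ASHAScheduler

def objective(config):
    score = 0.0
    for step in range(50):
        score += config["lr"]            # pretend this is validation accuracy improving
        train.report({"score": score})   # report progress so the scheduler can act on it

tuner = tune.Tuner(
    objective,
    param_space={"lr": tune.grid_search([0.001, 0.01, 0.1])},
    tune_config=tune.TuneConfig(
        metric="score",
        mode="max",
        scheduler=ASHAScheduler(),       # terminates underperforming trials early
    ),
)
results = tuner.fit()
print(results.get_best_result().config)
```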
12. Python isn't the fastest language. I know much of ML is a Python wrapper around lower level libraries, but how much of a (negative) factor is Python in scaling?

Really good question. As you mention, Python is never really the core compute engine; even native Python runs C and C++ at the bottom. I personally have the view that the gains you get from Python's ecosystem and its ease of use should be considered a really important factor, even against any speedup you'd get from a lower level language. I think as well that we're seeing Python go more and more back to its roots as a 'glue' scripting language for helping to automate and orchestrate processes across multiple different tools and languages. For example, I don't see anything wrong with using Python to define your main pipeline in something like Airflow, while the processing is done with Scala or Rust or something else. To your specific scaling question, these scaling frameworks are very heavily optimised, so I wouldn't worry too much about using their Python APIs.

13. Have you considered how you might reduce the resource consumption to generate the same output? (Optimisation)

Always! This is always an important part of doing a real project and can take many forms, from hyperparameter optimisation to model quantisation to memoisation or any number of other approaches.

14. What's a good place to start learning more about Ray? Any GitHubs you'd recommend, or projects worth reading up on?

The Ray docs are excellent! The only drawback is they don't have many full end-to-end examples, so definitely dig around in GitHub etc. for the specific tasks you are trying to do to complement these. The AWS Ray docs are also really good.

15. What is your book called? Where is it available?

Machine Learning Engineering with Python, available on Amazon, Packt and most other places you get your books!

16. Other than Ray, what other things excite you in the ML space?

Great question! I'm very excited by the challenges of LLMOps, specifically building robust operational practices around testing and monitoring systems with LLMs in them. I'm also interested in the new architectures and design patterns that will be developed over the next couple of years. Finally, I'm just always excited by finding the areas where ML will drive value, whether that's medicine, financial services, science, wherever!

17. How much resource/money is required to start working with Ray?

Nothing! It's open source so you just need a laptop and an internet connection.

18. Is there more complexity or are there more challenges for deploying ML models on the edge or IoT devices?

I think there is. Specifically, the toolkits and resources you have available are very different. The architectures are also more federated and can vary depending on whether you expect the device to have a good internet connection or not. A huge challenge with LLMs now is the memory footprint of the models themselves. You can't download a model that is 100 GB onto your mobile!

19. What is the roadmap to get into XOps?

So "XOps" is my name for all of the operational disciplines taken together, such as DevOps, MLOps, DataOps, LLMOps etc. The way to get into them is to start with one of these specific disciplines. If you are interested in MLOps, follow me and buy my book! I'm joking, but definitely focus on one specific discipline first and you'll find they all overlap quite strongly.

20. What infrastructure considerations are essential when dealing with LLM workloads?

So, if your architecture doesn't just leverage the vendor-offered APIs, you'll need to think a lot about the workloads for fine-tuning and for serving. This means you need to consider GPUs, managing these in a cluster, and also some smart routing and latency management. These are all complex, but there is tons of good information on them out there!

21. Is there anything apart from standard serialisation you need to implement to use Ray with your own data types?

I don't believe so, but I'd be interested to know if there is!

22. Actors — are they similar to Akka Actors in Scala, which provide abstractions for writing distributed systems?

Yes, I believe they are similar and that is why they used the same name! They probably have some differences though. The Actor abstraction in Ray may even be at a higher level; I need to investigate.

23. Can Ray be used on a single machine (e.g. avoiding GIL restrictions) already at the experimentation stage to minimise the scaling effort?

It can be used on a single machine (you'll want multiple CPU cores at least), but if you are running this on something small like your laptop you'll often find that the overheads of instantiating and managing the cluster may outweigh any scaling gain you get. Not necessarily though, it's just something to be aware of. I think it's useful if you have large datasets or lots of experiments you want to run and you can justify those overheads.
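For a concrete feel for the abstraction, here is a minimal sketch of a Ray actor running locally: a stateful worker living in its own process, in this case nothing more exciting than a counter.

```python
# A minimal sketch of a Ray actor: a stateful worker process, here just a counter.
# This runs fine on a single laptop, though as noted above the cluster overheads
# only really pay off for larger workloads.
import ray

ray.init()

@ray.remote
class Counter:
    def __init__(self):
        self.value = 0

    def increment(self):
        self.value += 1
        return self.value

counter = Counter.remote()                          # the actor lives in its own worker process
futures = [counter.increment.remote() for _ in range(5)]
print(ray.get(futures))                             # method calls run serially on the actor: [1, 2, 3, 4, 5]
```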
24. Models like Llama 2 have been trained with a large number of parameters. Doesn't that lead to the problem of overfitting?

Just to make it clear, we train on data to optimize the model's parameters. So the parameters are essentially the 'degrees of freedom' of the model that are changed as we train until we converge to a model we are 'happy with', by which we mean it has hit the convergence criteria set when we started training. If you have too many parameters (degrees of freedom) relative to your data volume you can definitely overfit. In a previous answer I mentioned that the solution to this is some empirical investigation and the intuition you build up. For example, if your number of parameters is the same order of magnitude as your dataset size, you'll probably overfit. If there's a 1,000x difference, you are more likely to be fine. Techniques like regularisation are also important for helping to avoid overfitting.

25. Have you, or do you think it's possible to, overcome the fear of using an LLM within your company? Maybe using it with company wiki pages etc.?

This is a really good question, and it's not just applicable to LLMs but really to any new technology. The only thing that will overcome fear of adoption is proving out the value. So you need to take some smaller, lower risk bets and go with them first, just like the example you've given. People will be using LLMs in all aspects of their business, but we definitely need to start small. Where I think there will be a challenge is that there will be a lot of low value solutions that get created just because people want to use LLMs. This is the same problem we had with ML a few years ago. Maturing out of that quickly will be what puts you ahead.

26. Do you ever run any of these models against home lab/s, or is this financially insane... aka would this make your partner leave you?

Everything I showed in the demos was initially run locally! You can also get away with a lot on the free tier of AWS and other cloud providers. Now, LLMs are inherently costly, so playing around with something like nanoGPT if you really want to get into the nuts and bolts might make sense.

27. What is the level of community support and documentation available for developers using Ray for ML projects?

The Ray docs are excellent for getting to grips with the main points of the toolkit and have some simpler examples. When it gets to more complex use cases I've found that they don't have too many end-to-end examples explicitly available. They are often buried in the main Ray GitHub repo but not exactly worked all the way through in the docs. So I'd say you need to dig. I think the community is growing; I've yet to see it take off in the same way that Spark did, but I think its time is coming. I need to find some good Discord servers (if anyone knows any please comment!) and I know there are really good conferences in the Ray community now, so I need to attend some of these!

28. So is Ray just an orchestrator? And is there an intelligent load balancing capability?

My understanding is that it's more a toolkit for distributed workloads. So, in terms of orchestration, its main job is to schedule and organise the tasks, actors and objects you are using at the lowest levels across the cluster and manage computational resources. It doesn't really orchestrate in the same way that something like Apache Airflow does, with time or event based triggering of jobs that you define (as far as I can see). The Ray Serve package, which is part of the toolkit, allows you to deploy your Python functions as distributed services, and I believe as part of that it provides some load balancing. A good one to investigate!
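As a rough illustration, here is a minimal sketch of a Ray Serve deployment with two replicas; Serve routes incoming HTTP requests across the replicas for you. The Predictor class and its fake scoring logic are placeholders for a real model.

```python
# A minimal sketch of a Ray Serve deployment; Serve replicates the deployment and
# routes requests across the replicas. The "model" here is a stub.
from ray import serve
from starlette.requests import Request

@serve.deployment(num_replicas=2)        # two replicas; Serve balances requests between them
class Predictor:
    async def __call__(self, request: Request) -> dict:
        payload = await request.json()
        # a real deployment would call a loaded model here
        return {"prediction": len(payload.get("text", ""))}

serve.run(Predictor.bind())              # starts Ray if needed and exposes the deployment over HTTP
```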
29. Regarding using AWS EC2 GPU VMs in your Ray cluster, what do you think about Kubernetes for ML?

Kubernetes (k8s) is a great interoperable container orchestration/management platform and one that I think is really powerful for so many different software applications. In my book, I talk extensively about using k8s for ML in two different ways. One, you can use the Kubeflow toolkit to build ML pipelines that use k8s under the hood for managing the compute nodes. The other option is to wrap your ML model in a more classic software application, for example exposing a model via a REST API hosted inside a web service, containerise it and then deploy at scale using more vanilla k8s deployments. There is also a toolkit called KubeRay that allows you to run Ray applications on k8s. I think this will be useful if you have an existing k8s infra setup and you want to experiment with Ray. It may provide the other benefits of cluster management that k8s brings while still giving you the easy-to-use, ML-focussed API of Ray.

30. Why is global data bad in Python?

Global variables that aren't pure configuration can be a very bad idea in Python because they can cause unstable couplings between software components. I might have a global variable called "MODEL_NAME" that gets modified by one module, and that affects other uses of this variable downstream. If it's defined in an external file like a YAML though, I can make sure it's read in where it's needed and the fundamental item is not mutable. People in this audience probably have more to add to that, so please do in the comments!
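To make that concrete, here is a minimal sketch of the pattern I mean: configuration read from a YAML file and passed explicitly to the code that needs it, rather than mutated as a module-level global. The config.yaml file and its keys are purely illustrative.

```python
# A minimal sketch of reading configuration from YAML rather than mutating a global.
# config.yaml and its keys are illustrative placeholders.
import yaml

def load_config(path: str = "config.yaml") -> dict:
    with open(path) as f:
        return yaml.safe_load(f)

def get_model_name(config: dict) -> str:
    # callers receive the value explicitly instead of reaching for a mutable global
    return config["model_name"]

config = load_config()
print(get_model_name(config))
```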
31. What is your advice on getting started with AI?

I think a couple of things are really important for getting into any field. One is to talk to people in the field and get their advice and insights. Another is to read. Like, read everything. Read allllll the things. The final one that is particularly important in technology is to BUILD! Get building something as soon as possible, get over any imposter syndrome, build something crappy and keep building. Use what's out there to help you and have fun!

32. How do you validate and version control models?

Model validation is a huge area with tons of resources out there, so I won't try and cover that here. When it comes to version controlling models, tools like MLflow, Comet ML, Weights & Biases etc. are all built to give you version control for your models. In my book I cover model version control extensively. The key thing to take away is that it has the same ethos as version control of code (reproducibility and maintainability) but is different in that new versions come after training runs or other types of experiments rather than after updates to your base code. So they will happen less frequently, but the metadata will be more complex, making it important to know what metrics and experiment details you want to make sure you track!
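As a small example of what that looks like in practice, here is a sketch using MLflow; the toy model, metric value and run name are placeholders, and registering a model additionally assumes a tracking backend that supports the model registry.

```python
# A minimal sketch of tracking and versioning a model with MLflow.
# The toy model and logged values are placeholders.
import mlflow
import mlflow.sklearn
from sklearn.linear_model import LogisticRegression

with mlflow.start_run(run_name="demo-run"):
    model = LogisticRegression().fit([[0.0], [1.0], [2.0], [3.0]], [0, 0, 1, 1])
    mlflow.log_param("C", model.C)             # record hyperparameters
    mlflow.log_metric("f1_score", 0.9)         # record evaluation metrics
    mlflow.sklearn.log_model(model, "model")   # store the model artifact against this run
    # with a registry-backed tracking server you can also pass
    # registered_model_name="..." to create a named model version
```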
33. How do you orchestrate Ray for full cycle ops? There was a mention of Kubeflow?

See question 29 for some of my thoughts on Kubeflow. To the other part of your question, the good news is that the Ray API has easy job submission commands and you can also just treat it as a standard compute engine like you would Spark or another tool. So if you orchestrate in something like Airflow, or in a pub/sub model using Kafka or similar, you can just trigger a Ray job like you would any other Python script or job!
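To illustrate the job submission point, here is a minimal sketch using the Ray Jobs API, which is one way an orchestrator like Airflow could trigger Ray work; the cluster address and the train.py script are placeholders.

```python
# A minimal sketch of submitting a script to a Ray cluster with the Ray Jobs API.
# The cluster address and script name are placeholders.
from ray.job_submission import JobSubmissionClient

client = JobSubmissionClient("http://127.0.0.1:8265")   # the Ray dashboard/jobs endpoint
job_id = client.submit_job(
    entrypoint="python train.py",                       # your existing training script
    runtime_env={"working_dir": "./"},                  # ship the local code to the cluster
)
print(client.get_job_status(job_id))
```

From the orchestrator's point of view this is just another API call to make at the right point in the pipeline, which is why Ray slots so easily into existing workflows.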