Beyond the Cloud: Reinventing AI with Local GPUs

Published on
March 15, 2024


The need for powerful computing resources has become increasingly critical, especially in the field of deep learning. Traditionally, companies have leaned on cloud providers like AWS and Azure to access GPUs for training and fine-tuning deep-learning models. This approach offers flexibility and scalability, allowing businesses to pay for what they use without the burden of managing physical hardware. However, this model presents challenges, particularly for consulting companies with variable project lengths and costs that must be allocated directly to clients. The limitations of cloud computing, including cost variability and operational challenges, have led some companies to reconsider their reliance on cloud services in favour of building their own local computing resources.

Access to GPUs for Deep-Learning Model Training and Fine-Tuning

Access to GPUs for deep-learning model training and fine-tuning is paramount nowadays. To meet that need, cloud providers such as AWS and Azure have, for several years now, offered access to GPUs and powerful machines on pay-as-you-go terms.

The main advantages of these solutions are that companies are relieved from acquiring dedicated hardware (whose utilisation may be far from 100%) and from maintaining a dedicated operations department.

The Challenges of Cloud Computing for Consulting Companies

However, despite the existence of several pricing models for GPU instances, most of them do not fit the dynamic environment of consulting companies. Taking AWS EC2 as an example, reserved instances are one such model: because consultancy work is typically characterised by variable-length projects and costs that must be billed directly to clients, reserving capacity for a year or more is far from adequate. Another popular cost-saving option is spot instances; these are not ideal either, as their availability is reduced and jobs may be interrupted mid-run.

All of the above considered, it is easy to understand why most consultancy companies opt for on-demand pricing, despite it being the least cost-effective option.
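To make the trade-off concrete, here is a back-of-the-envelope sketch in Python. The hourly prices are purely hypothetical placeholders (not real AWS rates), but they illustrate why a one-year reservation fits poorly with a short, client-billed engagement:

```python
# Hypothetical prices only, chosen for illustration - not actual AWS EC2 rates.
ON_DEMAND_PER_HOUR = 3.00   # assumed on-demand price of a GPU instance
RESERVED_PER_HOUR = 1.80    # assumed effective hourly price of a 1-year reservation

project_months = 3                        # a typical variable-length consultancy project
hours_used = project_months * 30 * 24     # hours the instance is actually needed

on_demand_cost = hours_used * ON_DEMAND_PER_HOUR
reserved_cost = 12 * 30 * 24 * RESERVED_PER_HOUR  # the full year is paid for regardless

print(f"On-demand, billable to the client:       ${on_demand_cost:,.0f}")
print(f"Reserved, mostly idle after the project: ${reserved_cost:,.0f}")
```

With these assumed numbers, the reservation costs more than twice as much as on-demand usage for a three-month project, even though its hourly rate is lower.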

The Operational Challenges of Cloud Computing for Consulting Companies

In addition to cost considerations, running experiments in the cloud can be quite cumbersome, or at least never as easy as running them locally. One clear example of this difficulty is that debugging code or GPU-specific errors (e.g. device allocation errors) can be very expensive in the cloud, both in time and money: every experiment run in the cloud, even one intended only to test a piece of code, is charged. On top of that, pushing code and virtual environments to the cloud is also cumbersome and time-consuming.
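To illustrate the kind of bug meant by a device allocation error, here is a minimal PyTorch sketch (our own illustrative example, not code from any client project): a tensor left on the CPU while the model lives on the GPU. On a local machine this is reproduced and fixed in seconds; in the cloud, every debugging attempt runs on a metered instance.

```python
import torch
import torch.nn as nn

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

model = nn.Linear(16, 4).to(device)   # model parameters live on the GPU (if available)
batch = torch.randn(8, 16)            # bug: the input batch is still on the CPU

try:
    model(batch)                      # raises a device-mismatch RuntimeError on a GPU machine
except RuntimeError as err:
    print(f"Device allocation error: {err}")

batch = batch.to(device)              # fix: move the input to the same device as the model
print(model(batch).shape)             # torch.Size([8, 4])
```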

Considering a Shift Towards Local GPU Resources

Long story short, we hypothesised that active development of deep-learning approaches, and specifically training, with a cloud environment as our only resource was slowing us down, limiting our creativity and leaving little room for failure and the improvements that follow from it.

At that point, we decided to explore internally a parallel reality: what deep-learning development would look like if we brought the GPUs “home”, into our own infrastructure.

Technical Benefits from Moving Away from the Cloud

Internal experimentation has allowed us to confirm that moving away from the cloud would unlock several benefits, namely:

  • Perform more experiments - without the shadow of pay-as-you-go hanging over us, we could aim for full utilisation of a bare-metal cluster. This would enable us to run extensive experiments (hyperparameter searches, longer training runs, cross-validation) for days and weeks without being concerned about costs (see the sketch after this list).
  • Increase the team’s creativity - having our own infrastructure would allow us to “fail” and test new approaches. The risk of trying something new is much lower, and improved performance often comes from exactly these experiments.
  • Debug directly on bare metal - following our operations team’s wise suggestion to build the bare-metal cluster on Kubernetes, we could attach directly to pods, via the Kubernetes extension for VS Code, and debug code as if it were local.
  • No more cost ties - while the technical team may be shielded from cost considerations, project management is not. In the past, technical development had to be constrained when cloud budgets ran out; that would no longer be a concern.
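As a flavour of the first point, the sketch below shows the kind of exhaustive hyperparameter search that becomes affordable once the hardware is owned and fully utilised. It is a minimal, self-contained PyTorch example with dummy data and an assumed search space, not the actual training code of any of our projects:

```python
import itertools
import torch
import torch.nn as nn

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

# Assumed search space; on pay-as-you-go GPUs every extra combination adds to the bill.
learning_rates = [1e-2, 1e-3, 1e-4]
hidden_sizes = [64, 128, 256]

# Dummy regression data standing in for a real dataset.
x = torch.randn(512, 32, device=device)
y = torch.randn(512, 1, device=device)

best_config, best_loss = None, float("inf")
for lr, hidden in itertools.product(learning_rates, hidden_sizes):
    model = nn.Sequential(nn.Linear(32, hidden), nn.ReLU(), nn.Linear(hidden, 1)).to(device)
    optimiser = torch.optim.Adam(model.parameters(), lr=lr)
    for _ in range(200):              # longer runs cost nothing extra on owned hardware
        optimiser.zero_grad()
        loss = nn.functional.mse_loss(model(x), y)
        loss.backward()
        optimiser.step()
    if loss.item() < best_loss:
        best_config, best_loss = (lr, hidden), loss.item()

print(f"Best configuration {best_config} with final training loss {best_loss:.4f}")
```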

Business Benefits from Moving Away from the Cloud

While the use of a bare-metal cluster would bring the largest benefits (and happiness) to the technical team, this change would also have a positive impact on the business side, namely:

  • Predictable GPU usage costs - unlike pay-as-you-go, which can be beneficial in many situations (such as deployment and production environments), training deep-learning models under that pricing model can produce highly variable and, if one is not careful, astronomical costs. With the fixed costs of a bare-metal cluster, budgets are much better controlled and free of surprises.
  • More value to the customer - while we always try to do our best, time and cost limitations can sometimes be challenging to overcome. As discussed above, by maximising utilisation of the bare-metal cluster at a fixed cost, the technical team could experiment more and let their creativity flow, which would return more value to the customer, either in terms of performance or in terms of exploration of the problem space.

Limitations of the Bare-Metal Option

Despite the very compelling picture presented above, there are some limitations to the bare-metal option.

First and foremost, a proficient team of operations and machine-learning engineers is required. Their work is paramount for keeping the GPU drivers in good shape and the cluster easy to access.

Secondly, bare-metal clusters are mostly suited to training deep-learning models, rather than to the deployment and productionisation of the trained models. Once a model has been trained, it is sensible to use services such as AWS for hosting, and pay-per-usage pricing is a very good option in that case.
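Here is a minimal sketch of that split, assuming PyTorch and a stand-in for an already trained model: the model is exported on the bare-metal cluster to a self-contained artifact, which can then be uploaded to a pay-per-usage hosting service of choice.

```python
import torch
import torch.nn as nn

# Stand-in for a model trained on the bare-metal cluster.
model = nn.Sequential(nn.Linear(32, 64), nn.ReLU(), nn.Linear(64, 1))
model.eval()

# TorchScript makes the artifact independent of the training code and environment.
scripted = torch.jit.script(model)
scripted.save("model.pt")   # upload this file to the chosen cloud hosting service
```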

Conclusion

The decision between cloud computing and local GPU resources is a complex one, influenced by a company's specific needs and project dynamics. While cloud services offer unparalleled scalability and flexibility, they may not always be the most cost-effective or operationally efficient option for every business, especially those in the consulting sector. By taking control of their computing resources (a strategy Deeper Insights is exploring internally), AI consultancy companies can leverage significant technical and business benefits. This strategic shift emphasises the importance of adaptability and resource management in the fast-paced world of technology and deep learning.
