Tuesday, October 30, 2018

Buy, Rent, or Borrow Computing Power for Machine Learning

I am going through an interesting exercise at work. Several of us are working through a Reinforcement Learning book, and we all want to play with the code described in the various chapters. While I have the luxury of a Linux machine sitting under my desk, others are not so fortunate. One of my coworkers has a rather large Linux box with a lot of computing power, so several people are running the exercises there. Unfortunately, it can only handle about 3 really large jobs concurrently, so we are looking at other options for getting compute power for our Machine Learning exercises.

We tried the first option, buying a large computer, but there are a number of problems with it. The first, as I explained, is that only about 3 people can use it at the same time: the resources are fixed and don't scale well. The second problem is that we have to maintain the computer ourselves. During one particularly large job, the computer stopped running at 4am and we have no idea why. We believe it might have been a hardware failure because the logs just suddenly stopped recording anything. It would be nice to have someone monitoring the computer 24 hours a day, but that is not practical, especially for a simple learning exercise.

The next option is to rent servers from a cloud service such as Amazon (AWS), Google (GCE), or Microsoft (Azure). We do not require graphics processing units (GPUs), so we can get enough computing power for all our experiments for around $650/month. We expect to take 2 to 3 months to read the book, so at the upper end we would need about $2,000. That is significantly less than the price we paid for the computer mentioned previously. Furthermore, the hardware scales nicely. If we want to run more experiments, we increase the number of servers we rent. When we don't need them any more, we shut them down and stop paying for them.
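For anyone who wants to plug in their own numbers, the rental arithmetic above can be sketched in a few lines of Python. This is only a rough illustration: the $650/month rate and the 2 to 3 month range come from this post, while the function name `rental_cost` is made up for the example, and a flat monthly rate is an assumption (real cloud bills vary with usage).

```python
# Rough rental estimate. The monthly rate is the approximate figure quoted
# in the post; a flat rate per month is a simplifying assumption.
MONTHLY_RENTAL_USD = 650


def rental_cost(months, monthly_rate=MONTHLY_RENTAL_USD):
    """Total rental cost for a number of months at a flat monthly rate.

    Hypothetical helper, named here only for illustration.
    """
    return months * monthly_rate


# The book should take 2 to 3 months, so print both ends of the range.
for months in (2, 3):
    print(f"{months} months: ${rental_cost(months):,}")
# prints:
# 2 months: $1,300
# 3 months: $1,950
```

The 3-month figure of $1,950 is where the "about $2,000" estimate comes from.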

Finally, there is the option of borrowing computing power. A number of other groups within the company have spare compute cycles we could use for our learning exercises. This is the ideal solution if we only factor in cost. However, the reality is that someone could be kicked off the hardware when higher-priority tasks need to run, and there is also the problem of spreading our experiments evenly across the company.

We will probably end up renting servers from one of the public cloud companies, as that seems to strike the best balance between cost and hassle. Your situation may be different, and it is always worth considering all 3 options.
