AWS EC2: P2 vs P3 instances

Amazon announced its latest generation of general-purpose GPU instances (P3) the other day, almost exactly a year after the launch of its first general-purpose GPU offering (P2). While the CPUs on both suites of instance types are similar (both Intel Broadwell Xeons), the GPUs have definitely improved. Note that the P2/P3 instance types are well suited for tasks that have heavy computation needs (Machine Learning, Computational Finance, etc.) and that AWS does provide G3 and EG1 instances specifically for graphics-intensive applications.

The P2s sport NVIDIA GK210 GPUs whereas the P3s run NVIDIA Tesla V100s. Without digging too deep into the GPU internals, the Tesla V100s are a huge leap forward in design and specifically target the needs of those running computationally intensive machine learning operations. Tesla V100s tout “Tensor Cores”, which increase the performance of floating-point computations, and the larger P3 instance types support NVIDIA’s “NVLink”, which allows multiple GPUs to share intermediate results at high speeds.

While the P3s are more expensive than the P2s, they fill in the large gaps in on-demand pricing that existed when only the P2s were available. That said, if you’re running a ton of heavy GPU computation through EC2, you might find the P3s that offer NVLink a better fit, and since they’re quite expensive, picking them up off the spot market might make a lot of sense. Here’s what the pricing landscape looks like now, with the older generation in yellow and latest in green:

When the P2s first came out, Iraklis Mathiopoulos had a great blog post where he ran Hashcat (a popular password “recovery” tool) with GPU support against the largest instance size available: the p2.16xlarge. Just a few days ago he repeated the test against the largest of the P3 instances, the p3.16xlarge. If you’ve ever played around with Hashcat on your local machine, you’ll quickly realize how insanely fast one p3.16xlarge can compute. Iraklis’ test on the p2.16xlarge cranked out 12,275.6 MH/s (million hashes per second) against SHA-256, while the p3.16xlarge hit 59,971.8 MH/s. The author’s late 2013 MBP clocks in at a whopping 121.7 MH/s. The p3.16xlarge instance type is about to get some heavy usage by AWS customers who are concerned with results rather than price.
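To put those numbers side by side, here is a small sketch that computes the relative speedups from the three throughput figures quoted above. The dictionary and the `speedup` helper are just illustrative names, and the figures are copied straight from the paragraph rather than re-measured.

```python
# Hashcat SHA-256 throughput figures quoted in the post, in MH/s
# (millions of hashes per second). Nothing here is re-measured.
BENCHMARKS_MHS = {
    "p2.16xlarge": 12_275.6,
    "p3.16xlarge": 59_971.8,
    "late-2013 MBP": 121.7,
}


def speedup(fast: str, slow: str) -> float:
    """Return how many times more hashes per second `fast` computes than `slow`."""
    return BENCHMARKS_MHS[fast] / BENCHMARKS_MHS[slow]


if __name__ == "__main__":
    print(f"p3.16xlarge vs p2.16xlarge:   {speedup('p3.16xlarge', 'p2.16xlarge'):.1f}x")
    print(f"p3.16xlarge vs late-2013 MBP: {speedup('p3.16xlarge', 'late-2013 MBP'):.0f}x")
    print(f"p2.16xlarge vs late-2013 MBP: {speedup('p2.16xlarge', 'late-2013 MBP'):.0f}x")
```

On this one benchmark the p3.16xlarge comes out at roughly 4.9x the p2.16xlarge and a little under 500x the laptop.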

Of course, the test above is elementary and doesn’t exactly show the benefits of the NVIDIA Tesla V100 vs the NVIDIA GK210 in regard to ML/AI and neural network operations. We’re currently testing different GPUs in our Worker product and hope to soon have some benchmarks we can share based on real customer workloads in the ML/AI space. The performance metrics and graphs that Worker produces will give a great visual on model building and training, and we’re excited to share our recent work with our current ML/AI customers.

While most of our ML/AI customers are on-premise, we’ll soon be looking to demonstrate Iron’s integration with P2 and P3 instances for GPU compute in public forums. In the meantime, if you are considering on-premise or hybrid solutions for ML/AI tasks, or looking to integrate the power of GPU compute, reach out and we’d be happy to help find an optimal strategy based on your needs.

Iron.io at The Machine Learning Conference

Attending MLconf in San Francisco on November 10th? If so, come say hello!

We’ve been seeing more and more customers hiring machine learning talent in order to tackle operational efficiencies and home in on their forecasting. Iron’s platform is helping in almost all phases of the process, from ETL operations helping with the build phase to building models through distributed, hybrid and on-prem IronWorker deployments. We’ve never thought of ourselves as a machine-learning-as-a-service (MLaaS) company, but we’re apparently getting a lot of traction in the industry, which is music to our ears!

The speakers this year are incredible and we’re looking forward to the entire event. From Xavier Amatriain’s background with ML-driven medicine to Franziska Bell’s work on uncertainty estimations at Uber, we’re pretty awestruck at the lineup.

The event is being held at the Nikko Hotel in San Francisco on the 10th of November, and you can find more details here:  https://mlconf.com/events/san-francisco-ca-2/

We’ll be following up with a great recap, so stay tuned.  We hope to see you there!

GPU support in IronWorker

In the past few months, we’ve spoken to quite a few customers who have added Machine Learning (ML) tasks into IronWorker. The problem is, these tasks can take a significant amount of time on a CPU vs a GPU. GPUs were built to handle the parallelization of complex matrix/vector operations that gaming required, and it so happens that deep learning exercises have similar requirements.

That said, we thought it was about time to add GPU support to IronWorker. We started with a simple test of doing image recognition via TensorFlow. After hacking the example Python script to add the ability to download an image via a URL, we zipped up the script and uploaded it to IronWorker. We went ahead and used the latest TensorFlow Docker image.

> zip classify_image.zip classify_image.py
> iron worker upload --zip classify_image.zip --name classify_image tensorflow/tensorflow:latest-gpu python classify_image.py --image_url "https://www.petfinder.com/wp-content/uploads/2012/11/91615172-find-a-lump-on-cats-skin-632x475.jpg"
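The modified script itself isn’t shown here, so the following is only a rough sketch of what the URL-handling addition might look like. The `--image_url` flag name comes from the command above; `download_image` and `parse_args` are hypothetical helper names, and the real script would hand the downloaded file to the TensorFlow classifier instead of just printing its path.

```python
import argparse
import os
import shutil
import tempfile
from urllib.parse import urlparse
from urllib.request import urlopen


def download_image(image_url, dest_dir):
    """Download image_url into dest_dir and return the local file path."""
    # Derive a file name from the URL path; fall back to a generic name.
    filename = os.path.basename(urlparse(image_url).path) or "image.jpg"
    local_path = os.path.join(dest_dir, filename)
    with urlopen(image_url, timeout=30) as response, open(local_path, "wb") as out:
        shutil.copyfileobj(response, out)
    return local_path


def parse_args(argv=None):
    """Parse --image_url, ignoring any flags this sketch doesn't know about."""
    parser = argparse.ArgumentParser()
    parser.add_argument("--image_url", type=str, default="",
                        help="URL of the image to classify.")
    flags, _unparsed = parser.parse_known_args(argv)
    return flags


if __name__ == "__main__":
    flags = parse_args()
    if flags.image_url:
        path = download_image(flags.image_url, tempfile.mkdtemp())
        # Assumption: the classifier would read the image from this path.
        print(path)
```

Downloading to a temporary directory keeps the rest of the script unchanged, since it can keep reading the image from a local file.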

In this initial push, we released support for the g2, g3 and p2 GPU instances on AWS. Once we fired off that task, here’s what things look like on our end:

It took about 10 seconds of run time for the image classification via IronWorker using a p2.xlarge instance on AWS. We didn’t have a chance to run this against a non-GPU instance, but we’ll leave that as an exercise for the reader. We’re pretty sure it will take a little longer than 10 seconds! The actual output from the script is as follows:


Found device 0 with properties:
name: Tesla K80
major: 3 minor: 7 memoryClockRate (GHz) 0.8235
pciBusID 0000:00:1e.0
Total memory: 11.17GiB
Free memory: 11.10GiB

Egyptian cat (score = 0.60871)
tabby, tabby cat (score = 0.12714)
lynx, catamount (score = 0.07766)
tiger cat (score = 0.07641)
cougar, puma, catamount, mountain lion, painter, panther, Felis concolor (score = 0.00148)

We’ll be working closely with a few customers in the coming months on some large ML/AI projects, and we’ll post as much as we can on their use cases and resulting benchmarks. As ML becomes more and more prominent in Business Intelligence operations, we’re expecting to see a big increase in GPU usage. If you have any questions about our GPU support, drop us a line and we’d be happy to chat.

Robotics with Iron

We recently sponsored a robotics event in Tokyo held by OnLab, DigitalGarage and Psygig… and it was awesome. Also, yes… the course below was an actual course from the event.

The participants were to break into teams, build a drone, implement machine learning techniques, gather and analyze data via Iron, and maneuver their drone across multiple courses. The teams that finished the courses and displayed the most innovative technical solutions were crowned champions.

The ability to utilize GPU instances and fire up containers that run libraries like Keras and TensorFlow allows for the offloading of heavy computational workloads even in highly dynamic environments. In the last few months, we’ve been speaking to more and more customers who are using Iron for large ML and AI workloads, often breaking them into distinct types of work units that have different GPU, CPU and memory requirements.

Congratulations to the winners of the contest and all those that participated! It looked incredibly challenging. If you have any questions about utilizing GPUs, machine learning, artificial intelligence, or any other computational heavy lifting jobs using Iron, feel free to contact us at support@iron.io and we’ll be happy to chat.

Full Circle and Ramping up at Iron.io

Iron.io was recently acquired by Xenon Ventures, a private equity and venture capital firm. Xenon Ventures is headed by Jonathan Siegel, a serial entrepreneur who has founded many popular software services and has made just as many successful acquisitions.

Here comes the full circle. What you may not know is that Jonathan was Iron.io’s first customer and investor back in 2010, prior to Iron.io’s creation. Jonathan was a client and friend of the founders’ consulting business prior to Iron.io and encouraged the founders to transform their consulting service into a product.

In 2011 the first version of IronWorker was launched and the serverless revolution began. After pioneering this space, Iron.io has grown significantly, adding products like IronMQ, IronCache, and our latest development, the Open Source product IronFunctions. This success is due to all of our amazing customers and partners! Thank You!!

New and old faces

You may be seeing a new name fly around a bit as well. I (Dylan Stamat) will be joining Iron.io as General Manager and moving Iron forward. A little about me: I’ve been a personal friend to the founders, I was a co-founder at RightSignature (https://rightsignature.com), I founded one of the first HIPAA compliant companies to run on AWS (https://verticalchange.com), and I was previously the CTO at a large technology consultancy company (ELC Technologies). I’m also a Ruby on Rails contributor, a committer to Ehcache, and a big fan of Erlang and Golang.

You will also see and hear from many other familiar faces at Iron.io: Roman Kononov, Director of Engineering, who has been leading our engineering office in Bishkek, Kyrgyzstan since 2011; Nikesh Shah, Director of Business Development and Marketing; and various other new and old members from our globally distributed teams.

Roman Kononov and Rob Pike (with an awesome jacket) at Gopherfest.

As of now, we’ve added two new offices to Iron: one in Las Vegas, Nevada, the other in Tokyo, Japan. We’ve hired new Engineers and Customer Success team members and are continuing to hire. Let us know if you have any interest in joining our team!

What to expect moving forward

New graphs providing quick ways to visualize historical worker concurrency

The short answer is, things are going to get a lot better.  We’ve been very busy since the acquisition.  There have been a lot of bug fixes, improvements to internal tooling, and we’ve added concurrency graphs to help provide more insight into the system.

Near term, we are committed to continuing and ramping up development across our entire product line. This includes better performance and reliability, a new user interface, granular metrics reporting such as concurrency graphs, streamlined customer support, new systems to better track feature requests and bug reports, and bug fixes throughout our web applications.

Long term, as it relates to products, we are being guided by two core principles: Open Source and Hybrid Cloud Deployments.

I’m excited about the future, and getting to know all of you! We will have more exciting news to announce in the coming months. Please feel free to reach out to us and stay tuned!

Dylan Stamat
GM, Iron.io