Eric Sorenson and Melissa Sussmann discuss Puppet’s event-driven automation, Relay, and how it helps clean up the “DevOps Dumping Ground”.
Cloud teams are drowning in an increasing, unsustainable volume of external events: cloud events, git events, monitoring alerts, tickets, incidents, and others. In response, engineers manually perform a disparate set of actions across various cloud providers, container platforms, CI/CD tools, configuration management tools, and hundreds of other APIs. To make this better, some developers try to create their own one-off automation tools or integration hubs, usually per team or project.
Demetrius [00:00:19] Hey, everyone, thanks for joining this episode of the Pulling the Strings podcast, powered by Puppet. I'm delighted to be here as your new host, Demetrius Malbrough, and I'm a principal technical product marketing manager here at Puppet. I'm fairly new here at Puppet as well. And also new to the DevOps space. However, I spent 20 years of my IT career in the data protection, backup, and recovery industry, so I'm definitely no stranger to IT automation. And I am really excited today to talk with Melissa Sussmann and Eric Sorenson. So let's jump right in. And Melissa, if you don't mind, please introduce yourself first.
Melissa [00:00:56] Hello. I'm the marketing lead for Relay, previously Nebula. My background is primarily in the hardware and IoT space. I think I spent, since we're talking about a number of years, I think I spent about 12 years in the hardware space, which now I guess utilizes a lot of Cloud Native technology like Kubernetes to connect via bare metal. I started out in the DevOps space about, at this point, about two years ago.
Demetrius [00:01:18] Fantastic. Thanks for that intro. Eric, you're up.
Eric [00:01:22] Hey everyone, I'm Eric Sorenson. I'm a technical product manager at Puppet and I've been at Puppet for about eight years. As of this month, actually, which is kind of crazy when I think about it. Most of the time I've been working on Puppet itself, and Puppet Enterprise, but the last year or so I've been really focused on cloud technologies, and the projects that we've been working on that have evolved into Relay, which we're here to talk about today.
Demetrius [00:01:45] So fantastic. Thanks for that intro, Melissa and Eric. And also congratulations on your eight year anniversary, Eric. You are a true Puppet veteran. So let's go ahead and jump right into the conversation. So today we are going to talk about event-driven automation. And how about we start with the definition of what does event-driven automation actually mean?
Eric [00:02:09] So there's three words there, and each of them has a distinct meaning. Events, the first one, are things that happen across your infrastructure that are represented as messages that get passed around between different services. For example, an alert from monitoring is a classic, but more and more we're seeing our tools generating and consuming events. So a DevOps engineer opening a pull request to, say, the Puppet code repository could generate an event; a new instance being spun up also generates an event. The next word, driven, means that the events come in and they cause other things to happen. Historically, at Puppet, we've talked about driving automation through classic config management operation, like CFEngine and Puppet agents that run continuously and enforce state. And that's really like an autonomous loop that's driving that automation. We started taking on use cases that were more on demand, that made changes when a user wanted a particular task done, like orchestration or phased rollouts of Puppet configuration. And now we're seeing a change into these event-driven workflows where something happens upstream, like what I was just talking about, and that needs to drive some change downstream. And the last part, automation, is what those events are actually driving. We'll talk about this in a bit more depth in a minute. But with Relay, we're really interested in the work people have done sticking together these different tools with what we call digital duct tape: DevOps teams or systems administrators have had to build in-house scripting to tie together different parts of the toolchain that don't normally talk to each other over prescribed, well-defined APIs.
Demetrius [00:03:46] And you know what? I really like that digital duct tape word. So I think I'm going to steal that and maybe carve off a brand new Pulling the Strings episode just around digital duct tape. Well, maybe we can even title this one as digital duct tape, but...
Eric [00:03:59] There you go. To be fair, I stole that phrase myself. So it's not a... I don't claim any kind of copyright over it. I think Deepak Giridharagopal, our CTO, is the first person I heard use that phrase.
Demetrius [00:04:09] Great. I do have a daughter that is an up-and-coming graphic designer, so I can just visualize her creating a nice logo of digital duct tape somewhere.
Eric [00:04:19] That'll be awesome.
Demetrius [00:04:20] Well, let's see if we can hear from maybe Melissa. So I want to hear about, you know, the benefits gained, you know, from Relay, which is what we're calling our event-driven automation service. So let's go ahead and hear that.
Melissa [00:04:36] So the main benefit is, in general, DevOps engineers need a higher level of abstraction above OS management. So traditionally, Puppet primarily focused on OS management, and what Puppet brought to the DevOps world for that, we're basically hoping Relay can bring to the API management world via workflows as code. The idea is that traditionally in the data center you had, basically, task-driven automation, model-driven automation. And you're basically managing OSs, hardware, apps, VMs. There's some things, I guess, on the hypervisor layer that you're probably managing. But nowadays, even things that are on the hypervisor layer are starting to be managed using things like Kubernetes. Even, like, bare metal hardware is basically moving in the direction of that space. And I learned that during my time at Xilinx. In the cloud, you have more event-driven orchestration. You have a bunch of different APIs, I mean, hundreds of APIs and services that you're working with. You're also in the serverless space, using more container platforms. So instead of looking at it as a completely different thing, it's part of the infrastructure stack. And it's basically another level higher, above the OS management layer.
Demetrius [00:05:44] So the benefit gained is basically that it's a level of abstraction where you can actually go above that OS level, and then you can plug in your APIs and kind of manage those APIs as workflows as code. So that sounds pretty cool. But who do you think will benefit from event-driven automation, and in what way does it help them?
Melissa [00:06:08] So anyone who has to deal with what we call the DevOps dumping ground. I think also Deepak might have coined that term.
Demetrius [00:06:15] I love that.
Eric [00:06:16] He does have a way with the turn of phrase, that's for sure.
Melissa [00:06:19] Yeah. So anyone who has to deal with what we lovingly call the DevOps dumping ground as a result of cloud adoption will benefit from a consistent and reliable platform like Relay to organize their deployment stack.
Eric [00:06:30] Yeah, when I talked about this at the Puppet Camps talk that I gave a couple of weeks ago for Australia, my graphical image for the slide was a billowing tire fire, because a lot of times that's how these tools end up looking. You have people writing scripts that sort of run out of their home directories and aren't really repeatable. It leads to a series of hacks that last forever and don't really have reusability or repeatability. Those things end up being more expensive, because if somebody is spending a ton of time both building these one-off tools and then having to troubleshoot and maintain something that they built themselves, they're not really working on delivering customer value. So that ends up costing more than if they had worked with an existing toolchain, or outside of their own little world, to build a solution.
Demetrius [00:07:21] So this DevOps dumping ground, like, I really love these terms, and it seems like this is a real thing. A DevOps dumping ground, I guess, is where you go to, maybe you create things and you just leave them there without worrying about cleaning them up in the past or in the future. Is that what I'm sensing around this?
Eric [00:07:41] Sort of, I mean, people are really inventive and will come up with a solution to solve their problem in some way or another. We've talked to customers who have repurposed Jenkins, for example, which is normally a CI tool, but they use Jenkins as their troubleshooting center and have jobs that run out of Jenkins and do things like clean up stale instances, which is great that it works. It's kind of amazing to see it in action, but that's not really what that tool was meant for, and somebody else that's coming in new to the organization, or if you need to expand that particular job beyond what it was intended to do, it pretty quickly runs into trouble.
Demetrius [00:08:18] So what if you are an IT director? Why should you care about Relay and event-driven automation?
Melissa [00:08:25] So I'm thinking that they don't want to use glue logic, which increases the risk to security. And, you know, it's generally better to stick to a repeatable process. Our customers tell us that they solve this problem today by leveraging their own code and their own functions, using AWS Lambda and others. And that custom glue code is not great for connecting various events. Companies like Netflix and Google are able to build their own internal products similar to Relay, but most can't afford to.
Eric [00:08:56] I would say, too, that as an IT director, some of the main concerns that you have about your team and the overall operational efficiency come down to questions of access control, governance, and auditability. Like, who can do what, when? If somebody did make a change, can we trace that back to an individual making a deliberate decision to do it? Was it an accident? Those kinds of concerns, I think, weigh heavily on the IT director's mind. And one nice thing that we've got in Relay, because it runs as a software as a service platform, is that we have not super fine grained access control, but I think pretty roughly useful access control, where somebody can have a role that's just a viewer. They can see logs, they can go back over historical output. They're an operator at the next higher level, where they can run workflows themselves. We have the idea of approval steps. So if there's an operation which you want a human to verify before we actually go through and make some changes, someone can have an approval role and they can approve or deny those kinds of requests. And then at the top level, obviously you have an administrator role that has control over the whole system. And so you can aggregate those accounts together, as well as provide, you know, the security principle of least privilege. Make sure that people only have the access that they actually need in order to get their job done, and not hand out the keys to the kingdom to anybody that happens to roll by.
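The role tiers Eric describes (viewer, operator, approver, administrator) can be illustrated with a minimal Python sketch. This is not Relay's actual access-control model or API: the strict ordering of the roles, the `Role` enum, the action names, and the `is_allowed` helper are all assumptions made for illustration.

```python
from enum import IntEnum


class Role(IntEnum):
    """Roles from least to most privileged (the strict ordering is an assumption)."""
    VIEWER = 1
    OPERATOR = 2
    APPROVER = 3
    ADMIN = 4


# Minimum role needed for each action. The action names are hypothetical.
REQUIRED_ROLE = {
    "view_logs": Role.VIEWER,
    "run_workflow": Role.OPERATOR,
    "approve_step": Role.APPROVER,
    "manage_account": Role.ADMIN,
}


def is_allowed(role: Role, action: str) -> bool:
    """Return True if `role` meets the minimum role for `action`.

    Unknown actions are denied, in keeping with least privilege.
    """
    required = REQUIRED_ROLE.get(action)
    return required is not None and role >= required
```

Denying anything that is not explicitly listed is the design choice that maps most directly onto the least-privilege principle mentioned above.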
Melissa [00:10:18] This goes back to the digital duct tape thing from earlier. You know, it's great for in the moment, but maybe it's not the longest term solution. So as an IT director, they have to think further ahead than just maybe the next cycle that they want to run. So they have to think about other risks. And as a general ROI factor, you don't really want a security hole, and you want a repeatable process as much as possible.
Eric [00:10:42] Yeah, exactly.
Demetrius [00:10:43] So we've talked about the benefits and we also talked about, you know, what would an IT director, you know, kind of gain from event-driven automation. What about, so I guess, can you break down, like, exactly how this event-driven automation will work?
Eric [00:10:57] I mean, I could talk about this subject for the entire length of the podcast, so I'll try to keep it brief. But in general, as I mentioned a moment ago, it is a software as a service offering. Relay runs on Google Cloud Platform. And in it, a user sets up incoming triggers that are events that come from the outside world. And the events come in as webhooks. Say, we have a GitHub integration, so you can get a webhook sent into Relay when a new pull request is opened, for example, or a new repository is created. We can also take in generic events that get posted directly to the service. And obviously, if you want to, you can also generate a trigger manually by just running a workflow. We also support time-based scheduled triggers, like cron jobs. Those triggers come in and we extract the information out of the payload that came in with the trigger event and make a decision about whether to run the workflow and what parameters to fill out for the workflow. The workflows themselves are written in a YAML dialect that is pretty simple to get started with. It's pretty powerful, I would say. And the workflows are really just a definition of steps that you want to take in response to that trigger happening. The steps are run on GCP, as I mentioned, as individual containers in a Kubernetes cluster. And so each step is kind of like a single unit of work to get done in that workflow. And then it exits and can pass data on to the next steps and manipulate state, or change the outside world if it needs to. And then when the workflow exits, the logs are stored and persisted on the service. We go back to waiting for the next one. So it's a pretty cool system. It uses some open source components, like a project, originally from Google, called Tekton Pipelines, that does the heavy lifting under the hood of running the steps in order and passing data around amongst them.
And we also use HashiCorp Vault internally in the service to keep secrets and sensitive information, like connection data where we need to talk to outside services like AWS.
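The flow Eric walks through (a trigger inspects an incoming event, extracts parameters from its payload, and the workflow's steps then run in order, each passing data to the next) can be sketched in plain Python. This is only a toy in-memory model under stated assumptions, not Relay's YAML dialect, Tekton, or any real API: the `Workflow` class, the event field names, and the step functions are all hypothetical.

```python
from dataclasses import dataclass, field
from typing import Any, Callable, Dict, List, Optional

Event = Dict[str, Any]
State = Dict[str, Any]
Step = Callable[[State], State]


@dataclass
class Workflow:
    name: str
    # Decides whether an event should run this workflow. Returns the
    # parameters extracted from the payload, or None to skip the event.
    trigger: Callable[[Event], Optional[State]]
    steps: List[Step] = field(default_factory=list)

    def handle(self, event: Event) -> Optional[State]:
        params = self.trigger(event)
        if params is None:
            return None  # the event did not match; nothing runs
        state = dict(params)
        for step in self.steps:
            state = step(state)  # each step hands its output to the next
        return state


def on_pull_request_opened(event: Event) -> Optional[State]:
    # Hypothetical event shape, not GitHub's real webhook schema.
    if event.get("type") == "pull_request" and event.get("action") == "opened":
        return {"repo": event["repo"], "number": event["number"]}
    return None


def build_message(state: State) -> State:
    return {**state, "message": f"New PR #{state['number']} in {state['repo']}"}


def record_notification(state: State) -> State:
    # Stand-in for a step that would change the outside world.
    return {**state, "notified": True}


workflow = Workflow(
    "notify-on-pr", on_pull_request_opened, [build_message, record_notification]
)
result = workflow.handle(
    {"type": "pull_request", "action": "opened", "repo": "control-repo", "number": 7}
)
```

In the service as described, each step runs as its own container and secrets live in Vault; the sketch collapses all of that into function calls to show only the trigger, parameters, and ordered-steps shape.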
Demetrius [00:12:56] I think it's brilliant, you know, for anyone to be able to create a SaaS platform to do some of the things that it seems like Relay is capable of doing. It's amazing to me, because my skills are a zero when it comes to that. But why does the industry need event-driven automation right now, especially during this COVID-19 pandemic thing that's happening right now?
Melissa [00:13:18] So the collective digital transformation that companies are experiencing right now utilizes a lot of cloud native technology, and it's utilizing a lot of APIs. I mean, there are hundreds of APIs out there. Also, economic circumstances make it more important than ever to manage cost saving resources. Automation is a way to save money. I think that that's something that a lot of companies will basically need at this time. Unused cloud resources are a pretty good example. They make everyone's pockets hurt. We have a blog post that actually covers reaping EC2 instances, for any listeners who are interested. But I think in general, that's the primary motivation behind what we're trying to do with Relay.
Demetrius [00:13:58] Unused cloud resources. It sounds like the digital dumping ground that everyone mentioned earlier. But can you explain, I guess, some of the use cases where, you know, Relay will add more cost savings and maybe add to the efficiency of IT organizations?
Eric [00:14:14] I mean, there's a concrete example from a user that we talked to just earlier this week. They have a sandbox environment in EC2 where developers are allowed to spin up any infrastructure they need to play with, but the rules in the sandbox are that you have to tear it down within a week. And so nothing should really live there for too long. The central DevOps team wants to know, in a nice way, of course, when people do leave stuff laying around. And their idea was to get a message sent into their Microsoft Teams chat service saying, hey, this resource is sitting around. It looks like it's going stale. You should go check it out. My suggestion was they should also, if they have a karma point system on their chat, like, subtract karma from the person that left that stuff running around. But they didn't think that was politically feasible. Anyway, so the event there in that case is really a time-based scan that looks across those accounts to see all the resources. And then the automation, the workflow, would extract those user names and then generate the chat message and send that over to the team.
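The stale-sandbox scan in this example can be sketched as two small Python functions: one that filters an inventory for anything older than the one-week limit, and one that builds the chat message for its owner. It is a sketch under stated assumptions: the `Resource` record, the sample inventory, and the message wording are hypothetical, and it makes no real AWS or Microsoft Teams calls.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone
from typing import List

MAX_AGE = timedelta(days=7)  # the sandbox rule: tear things down within a week


@dataclass
class Resource:
    resource_id: str
    owner: str
    launched_at: datetime


def find_stale(resources: List[Resource], now: datetime) -> List[Resource]:
    """Return the resources that have been running longer than MAX_AGE."""
    return [r for r in resources if now - r.launched_at > MAX_AGE]


def chat_message(resource: Resource, now: datetime) -> str:
    """Build the reminder text that would be posted to the team's chat."""
    age_days = (now - resource.launched_at).days
    return (
        f"Hey {resource.owner}, {resource.resource_id} has been running for "
        f"{age_days} days and looks stale. Please check it out."
    )


# Hypothetical inventory; a real scan would list resources from the cloud account.
now = datetime(2020, 6, 15, tzinfo=timezone.utc)
inventory = [
    Resource("i-aaa111", "alice", now - timedelta(days=10)),
    Resource("i-bbb222", "bob", now - timedelta(days=2)),
]
messages = [chat_message(r, now) for r in find_stale(inventory, now)]
```

Passing `now` in explicitly, rather than reading the clock inside the functions, keeps the scan deterministic and easy to run on a schedule or in a test.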
Demetrius [00:15:13] Yeah, that one does sound really cool. It seems like that will be able to save companies a lot of money, you know, just to be able to kind of initiate that and run that. But let's jump to the future here. So I guess what cool thing do you wish to see event-driven automation do?
Eric [00:15:30] So this sounds a little buzz-wordy, and I'm a little leery of talking about the super far, far, you know, future where we all get jetpacks and that sort of stuff, but we are really interested in the area of machine learning right now. And there's this emerging field that's called MLOps, or machine learning for operations, and it's applying the concepts of machine learning to Ops tasks. So for Relay, and for event-driven automation more generally, that means building a rich, large dataset of event information and then being able to do things with that. Right now, we just have this one-to-one mapping of events coming in and workflows running. But we're exploring ideas around, what if we had a really large event stream and could do things like make recommendations for which workflows you might want to run in order to wrangle those event streams? Or, sort of like a large-scale monitoring system could do, keep a, you know, a toe in the river as that stream of events is flowing by and say, hey, this looks like you're getting some anomalous security events, or a large amount of monitoring alerts, some kind of spike at that large scale, and be able to recommend actions that people could take or even, ultimately, I think, take actions on behalf of those users. Again, that's pretty far out. But I think that's a really promising area for us. It's something that we're really, really interested in diving into.
Demetrius [00:16:49] Yeah, it sounds really cool to me. And I can't wait to partake in some of that futuristic stuff that you're talking about. But, you know, all this talk about event-driven automation and, you know, Relay. I guess how and when is Relay launching to the public, from a beta perspective? Or share any information you want to provide so that anyone can get their hands on the product and start playing around with it.
Melissa [00:17:13] If you want to, you can just log in and sign up at Relay.sh. Basically all Nebula sign ups have early access to the product because Relay actually is built on the technology that we already created, which is Project Nebula. This is now the full fledged product version of Project Nebula. So the public beta officially launches in mid-June, but if you want to, you can just log in and sign up.
Demetrius [00:17:37] Fantastic. So I want to thank our guests for sharing with us about event-driven automation and Relay on Pulling the Strings powered by Puppet. So please tell everyone where they can find you on social media and whatever else you want to share, Melissa and Eric.
Melissa [00:17:52] Yes. Thank you so much for your time. I just want to quickly do a shout out for our blog, Relay.sh/blog. That's where you're gonna find a lot of really cool resources. You're going to learn, you know, how to use Knative. You're going to learn how to reap EC2 instances. There's a lot of really cool content on there. You should definitely check it out.
Eric [00:18:11] And if you want to talk to us more about using the product once you get into it, you can tweet at me. I'm @ahpook on the Twitters. Or jump onto the Puppet community Slack at slack.puppet.com. We have a dedicated Slack channel. It's just #relay, so hop on there. I'm there pretty much all the time, as well as the Dev team and other users of the product. So we'd love to talk to you, see how you are using it, and see how we can make it better for you. Thanks so much for having us.
Demetrius [00:18:41] Thank you both.