It’s Halloween 2020 again with Ben Ford, Rob Nelson, and Mike Smith sharing their DevOps horror stories.
We know Halloween is over; however, these #DevOps stories are still mentally wreaking havoc on everyone who was involved. Ben Ford, Rob Nelson, and Mike Smith bravely share experiences that came close to becoming resume-generating events in this episode. Tune in for how they recovered and the timeless lessons learned.
Demetrius Malbrough [00:00:20] Hey, everyone, thanks for joining this episode of Pulling the Strings podcast powered by Puppet, and I'm delighted to be your host.
Demetrius Malbrough [00:00:27] My name is Demetrius Malbrough. I'm on the Product Marketing team here at Puppet, and I am really excited today to talk with Rob Nelson, Ben Ford and Mike Smith. We have some horror stories for you today. And this is our Halloween edition, which will be DevOps horror stories. So sit back, relax, grab your cup of coffee or whatever type of drink you would like to quench your thirst on this episode of Pulling the Strings. So, first off, Rob, how are you today, sir?
Rob Nelson [00:01:03] I am doing OK. Yourself?
Demetrius Malbrough [00:01:05] I am fantastic. I think we had a short conversation about what you are going to share today. If you don't mind, go ahead and tee up your story and hopefully everyone will be able to figure out exactly what happened.
Rob Nelson [00:01:22] Sure. So my name is Rob and I've been working in IT for 20 years. So if we go way back to 2001, I was working at a small government office doing some support for the county government itself. And we managed email, and email was like brand new to most people. It was just this awesome thing that let you communicate. You didn't have to send, like, interoffice memos or track people down. This was hugely revolutionary at the time. We've all come to hate it. But at the time email was the thing. And of course, we were managing it. We'd been managing it for three or four years as a company. Other people had been managing it longer. But users were still figuring out how to use it. And one of the biggest problems, which I think we still have, is people just collecting email. It just grows and grows and grows. And some people have like 5K email inboxes because they read it and they delete it. But most people have just gargantuan inboxes. But back in 2001, we had some pretty serious issues with that, with disk space being at a premium. And with early email systems you often had to run some sort of program that would, like, clean up the database. If you deleted an email that was shared with 20 people, it would stick around until the nineteen other people deleted it, because they were trying to save space. And then when they did, it would still be out there until you ran a program that would go and clean it up. So we would run that every so often. It would take all day and it would chop 50 meg out of somebody's email inbox, which, you know, again, now seems like, 50 meg, who cares? But at the time, one terabyte was not something that even existed. We had a couple gig of data for some of the servers. So as time went on, we started hitting one of the file limits that I'm aware of.
Rob Nelson [00:03:15] You know, one that people who were around at the time might be aware of.
Demetrius Malbrough [00:03:17] And what exactly was it? Was it a two gig file size limit?
Rob Nelson [00:03:21] Any single file could be no more than two gig. And the mailboxes for people were just a single file.
Rob Nelson [00:03:28] You know, if you had three thousand emails, it's still one file, and your system had to find the right point in the file to find an individual email and display it in your client. So we found there were some issues because our program was older. The two gig file size limit on servers had actually been fixed. You could have files larger than 2G, but it didn't mean all your programs could. They had to be updated to know that. Some of them would just freak out when they found something that was 2G or larger.
Rob Nelson [00:03:59] In this case, that was the situation. We ran the mailbox cleanup one time and it got to, I don't know, we had five hundred users. It got to like the hundredth user and that person had like a 2.1 G mailbox. They'd just gotten a bunch of stuff since the last time we ran it, and the system just deleted their inbox.
Demetrius Malbrough [00:04:17] Because it was over that size?
Rob Nelson [00:04:19] It just freaked out and it deleted it. And then it went to the next user and deleted their inbox, and the next user, and it just hit every user after that. So five minutes later we had like one hundred people with email and four hundred people that had no email at all.
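The failure Rob describes, a server that allows files over 2 gig paired with an older tool that cannot handle them, is the kind of thing a pre-flight check can catch before anything destructive runs. The sketch below is only an illustration and not how the mail system in the story worked: it assumes the old tool was limited by signed 32-bit file offsets (2^31 - 1 bytes), which the transcript does not confirm, and the names `Mailbox`, `split_by_legacy_limit`, and `plan_cleanup` are hypothetical.

```python
from dataclasses import dataclass

# Largest offset a signed 32-bit integer can address: 2 GiB minus one byte.
# Assumption: this is the limit the legacy cleanup tool could not exceed.
MAX_SIGNED_32BIT_OFFSET = 2**31 - 1


@dataclass
class Mailbox:
    user: str
    size_bytes: int


def split_by_legacy_limit(mailboxes, limit=MAX_SIGNED_32BIT_OFFSET):
    """Partition mailboxes into (safe, oversized) for a tool limited to `limit` bytes."""
    safe, oversized = [], []
    for mailbox in mailboxes:
        if mailbox.size_bytes <= limit:
            safe.append(mailbox)
        else:
            oversized.append(mailbox)
    return safe, oversized


def plan_cleanup(mailboxes):
    """Return the mailboxes to clean, or refuse outright if any exceeds the limit."""
    safe, oversized = split_by_legacy_limit(mailboxes)
    if oversized:
        names = ", ".join(m.user for m in oversized)
        raise RuntimeError(f"Aborting cleanup; mailboxes over the legacy limit: {names}")
    return safe
```

Refusing the whole run, rather than skipping the oversized mailbox and carrying on, is a deliberate choice here: in the story the damage came from the tool continuing after it hit the first file it could not handle.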
Demetrius Malbrough [00:04:34] Oh, man. So you've got to stop right there and we need to put this into context. So can you set the scene of how it went down?
Demetrius Malbrough [00:04:45] Like did people start calling the Help Desk?
Rob Nelson [00:04:49] Oh, so you would go into the back room and you would start up the email program and you'd come back hours later to check on it. Right. So this was the case where you go in, you start it up, you go out, you're doing your work. This was desktop support, so I'm like at somebody's desk and all of a sudden, somebody's head pops up over a cubicle.
Rob Nelson [00:05:06] Hey, what happened to my e-mail? I don't know what happened to it. You come over and you take a look. You're like, huh? You don't have any email. Well, that's weird. Yeah. And then all of a sudden somebody else does it and then your boss shows up behind you. Hey, what's up? What do you mean? We're getting a lot of complaints about people that don't have email. So, yeah, it was because the system was broken. There was like, no, I mean, we didn't really have alerting like we do now, but the system didn't alert us. It just failed. So it didn't tell us that there was a problem. It didn't hang on that first person. It just started deleting stuff immediately. And then, yeah, the fun thing was, you know, some people had like an email open on their system and that was fine until they, like, clicked on the next email and then all of a sudden it went away, because what was on the screen was fine. Yeah. And then they would like minimize it and reopen it and all of a sudden it would go away or whatever. And then, yeah, we started getting serious calls. So we spent the next, I don't know, probably till like seven or eight at night restoring. Thankfully we had backups. We were a good shop, we had backups. So we got this all restored. Now, people had lost, like, emails from like 8 p.m. the night before to whatever time this happened. Right. And then the other fun part was if they had gotten any emails, they lost those too.
Rob Nelson [00:06:27] So we got them to a point where, you know, that 2.1 G file was still there. So the lesson was, know what your system does. And at the time, we didn't have like a test system, but it was this: when a limit is removed from one part of the system, make sure it's actually removed elsewhere in the system. And then the second part of the story, because that's not horrible enough. Right. So another year went by and I was actually leaving to go someplace else, very amicable departure. I just happened to be physically moving. So I'm on my last day before we were going to move. And me being the smart guy that I was, I decided to run the mail cleanup before I left. I totally forgot what we had done, like, six, 12 months before. I don't know why I did it. I have no idea why. And everybody's email started going. Now this time I was fast enough to notice it and I ran over and I hit Control-C. So we only had to restore a few email inboxes, still had some data loss. But the fun part of this story. So that's all horrible, right. But there's a little, a little fun part in it. So you know, that was my last day. Right.
Demetrius Malbrough [00:07:34] Hold on. That wasn't your last day. Meaning you were fired, was it?
Rob Nelson [00:07:38] No, no. That was my last scheduled day of work. I was leaving the company.
Ben Ford [00:07:44] I have to say that as you're telling the story, that meme of the little girl kind of like walking away from the fire behind her. That's totally what I'm seeing in my head right now.
Rob Nelson [00:07:53] Yeah. So this is my scheduled last day. Right. So I get this all fixed and my immediate boss was like, all right, you aren't touching anything the rest of the day. Like, I think I took a few phone calls and opened some tickets for people, but I didn't do anything else the rest of the day. But then an hour later, my big boss calls me in and I'm like, oh. But he's like, no, no, no. It's your last day. No big deal. You know, these things happen. You just shouldn't do stuff twice. Right. You know, make the mistake once or twice. And then he says, but, you know, it is your last day. And I wrote a recommendation letter for you and I thought you should check it out beforehand. And he gives it to me. And I'm nervous. I just kind of skim it. I'm like, yeah, that looks great. He's like, you sure? What do you mean, am I sure? You wrote a great recommendation. Why wouldn't I like this? So he's starting. Oh yeah. He put in there something along the lines of, I highly recommend Rob Nelson for any job. Just don't let him touch your email system on the last day.
Demetrius Malbrough [00:08:49] Oh, no, I don't mean to laugh but...
Rob Nelson [00:08:54] He wrote me a real recommendation letter, but I still have that one. He gave me a hard copy of that one. I keep that one. I keep that one to help me remember. If you make a mistake once, that's OK. Try not to do it twice.
Demetrius Malbrough [00:09:09] That is indeed a horror story. And I appreciate you being open and honest and sharing that story, and giving the Pulling the Strings listeners kind of a peek into your life as an administrator or an engineer per se, because we all have a horror story and I even have my own. But of course, I'm not going to share right here. I'll leave that up to Ben Ford, who will share his story. Ben, how about it?
Ben Ford [00:09:41] I don't know, Demetrius, I think we're going to have to put you on the spot at the end of this. So be prepared for that. Hey, I'm Ben, I've been in the Puppet community for a long time, and my job right now is Product Manager for the Forge. But at the time of the story,
Ben Ford [00:09:59] I guess you could call it Education Ops. I was working in our Education Department, building a lot of our training and a lot of the tooling, a lot of the content, and kind of the infrastructure around our classrooms.
Ben Ford [00:10:13] And that's sort of the context for the story, because we had this really cool tool that would kind of interrogate a couple of different sources of truth to figure out what classes were coming up and what students were in those classes and who was teaching them and everything, and then would reach out to a cloud service that we were using. This was Amazon. And it would stand up a virtual classroom. And this was like a Puppet master and then some virtualized Puppet agent machines that people could log into, and it would just, like, build out its own little mini environment. So you just log in and you run Puppet and it would hit the server and everything was all great and it would email a link to the instructor. It did all the things for you, just like automation should do. Unrelated, we also had a separate tool that was sort of company wide, and most of us have something like this. It was the Reaper, and it would go find stale machines and just clean them up, because, like, nobody wants that giant AWS bill from stuff you've forgotten about. And that's how we got rid of classes once we were done with them. We would set a lifespan tag and say, this class is going to run for three days. So let's give them a little bit of buffer time so that people can log in and finish their exercises afterwards and whatnot. So, give them like four or five days' worth of runtime and then the Reaper would just, like, kill off the classroom. Yeah, all of that makes total sense. Right. So we have these classes running. They're going worldwide. There are some of them that are, like, live right now at this minute. And some of them that are, like, getting queued up for classes that are going to be starting in the morning. And it's about nine o'clock in the morning or so Pacific Time. So that's a couple hours ahead on the East Coast, approaching lunchtime. And then all of a sudden everything goes red.
And it was like almost instantaneously, like you could see, like, this wave of everything just sort of going offline.
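The lifespan-tag idea Ben describes can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the Reaper that Puppet actually ran: the `Instance` record, the `lifespan_days` field standing in for the tag, and the `max_fraction` safety threshold are all hypothetical, and the transcript never says what made the real Reaper misfire. The threshold is one common safeguard: if a single pass wants to remove an unusually large share of the fleet, stop and ask a human.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass
class Instance:
    name: str
    launched_at: datetime
    lifespan_days: int  # hypothetical stand-in for the "lifespan" tag


def expired(instances, now):
    """Return the instances whose launch time plus lifespan is not in the future."""
    return [
        i for i in instances
        if i.launched_at + timedelta(days=i.lifespan_days) <= now
    ]


def select_for_reaping(instances, now, max_fraction=0.25):
    """Pick expired instances, refusing if too much of the fleet would go at once."""
    doomed = expired(instances, now)
    if instances and len(doomed) / len(instances) > max_fraction:
        raise RuntimeError(
            f"Refusing to reap {len(doomed)} of {len(instances)} instances; "
            "that exceeds the safety threshold and needs a human to confirm."
        )
    return doomed
```

With a guard like this, a pass that suddenly decided every classroom had expired would fail loudly instead of taking live classes offline.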
Ben Ford [00:12:17] And then, like, all the chat messages started happening. Hey, where'd my classroom go? Hey, what's going on? Hey, I can't log in anymore. Could somebody help me out here? Like, some people were polite about it and some people were pretty panicked about it, because these are instructors that are standing up in front of the classroom, in front of like 15 to 20 people who all of a sudden are dead in the water and they can't do anything. So I panicked, right? Holy crap, what do I do now? And then we have a couple of things that just sort of all compound, and it turned out to be like we failed disaster planning 101 because we didn't have a plan for this. We didn't plan for everything dying all at once. What do we do? So we kind of scramble around like chickens with our heads cut off for a few minutes while we're trying to, like, in the heat of the moment, figure out what to do. Get a couple of people. Michael's off, like, standing up classrooms again. He's doing it by hand, bam, bam, bam at the keyboard. And that's a problem I'll come back to. And I'm working on figuring out what's going on because we don't know what's going on at this point.
Ben Ford [00:13:26] Yeah. So Michael stands up a classroom and 90 seconds later it dies. It goes away. So I'm like, OK, OK. Either AWS is having like this major meltdown or, wait, wait, hang on. There was that thing that Cody built. So I'm like, Cody, what's going on? No response. So now who do I talk to? And that's another mistake that we made: we didn't have a clear communication plan, because when the Reaper went crazy, I didn't know how to track down who was in charge. I didn't know how to track down who could shut the thing down, who could stop it from going.
Ben Ford [00:14:05] So, like, literally all of our classes across the entire world are just imploding all at once.
Demetrius Malbrough [00:14:10] So this was a global thing?
Ben Ford [00:14:11] It was totally global. Yeah. And I think that we had something like eight classes running, live classes running at the time. And then there were another like 15 that were about to start. So I just sent this mass message to like every single person in the company. This is a big deal. I need somebody to tell me who to talk to. Finally found somebody who has access, like the keys, so that they could turn this thing off, get the thing shut off, and then we all get back to our task of standing up these instances. And once the Reaper isn't, like, coming behind us, killing everything off, it's actually relatively simple. But it was a manual task because we didn't automate "rebuild the world". We just had it automated so that every week it would be like, do next week's classes.
Demetrius Malbrough [00:15:02] So is the Reaper like Netflix's Chaos Monkey?
Ben Ford [00:15:06] It does some similar things. The Reaper is... this was an unintentional Chaos Monkey. The Reaper is intended to be very specific.
Ben Ford [00:15:16] It's like, find all instances that have expired and kill them, whereas the Chaos Monkey is like, hey, hey, hey, guess what, you're going to die.
Ben Ford [00:15:27] And it just picks out some random services. And the idea is that if your application can survive a few random nodes in the whole infrastructure going down and be able to, like, fail over to the next machine running the endpoint, then it's a resilient service. So that would be what the Chaos Monkey did. There were a bunch of things that we learned from this and that we sort of, like, improved going forward. One is having a disaster plan, even if it's like, we have no idea how this is going to happen, but, you know, what if? And having that disaster plan be flexible enough to say, like, hey, if something comes up that we've never thought of, this is how we're going to communicate. This is how we're going to divide up the work that we're going to do to recover from it. This is how we're going to interface with the rest of the company. Those sort of things are absolutely critical. And it's just like, depending on how big your team is, that's how much time you have to put into figuring that out. And then the other thing is we absolutely failed at having the communication plan for figuring out how to get this service tracked down and shut off. It took me twenty minutes to get the Reaper shut off so that we could get our classes stood back up. And that was twenty minutes where people could not actually access the classroom. So the instructor literally just had to stand up there and talk, which is part of their job. But still, people were in the middle of homework. So having that plan of how you're going to talk to people, how you're going to track down who is in charge of a service, for example. Oh, and then the fact that we had to manually run the tool each time to stand up a class. If we had built it in a truly resilient form, once we had shut the Reaper down, it would have brought itself back into the state that it should have been in. And those are all things that will happen now, like, in one form or another.
But at the time, it was like a perfect storm of all these things that just went wrong, just to make everything like an immediate disaster for several hours while we were recovering from it.
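Ben's point about a system that "would have brought itself back into the state that it should have been" is usually implemented as a reconcile step: compare what should exist with what does exist, and create whatever is missing. The sketch below is a minimal illustration under stated assumptions and not the classroom tool from the story: `desired`, `running`, and the `create` callback are hypothetical names.

```python
def reconcile(desired, running, create):
    """Create every classroom that should exist but does not.

    desired: names of classrooms that should be up right now
    running: names of classrooms that are actually up
    create:  callable that stands up one classroom by name
    Returns the sorted list of names that were created.
    """
    missing = sorted(set(desired) - set(running))
    for name in missing:
        create(name)
    return missing
```

Because the function only acts on the difference, running it again after a successful pass does nothing, so it is safe to call on a schedule rather than by hand in the middle of an outage.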
Demetrius Malbrough [00:17:45] And any time someone says "all of a sudden", and also "I hit Control-C" from Rob's story, then you know something's not right. And just the name of Reaper alone is a little scary. And you have to be careful when the Reaper is out there looking for something to destroy. So then that's it.
Ben Ford [00:18:08] It destroyed everything.
Demetrius Malbrough [00:18:10] That's a crazy story. So, yeah, I appreciate you sharing that. So the lesson learned there, I think, is having that disaster plan. One thing as well, probably in both stories, is testing. Testing, right?
Ben Ford [00:18:23] So testing, planning for disaster and communicating.
Demetrius Malbrough [00:18:27] I think Rob got caught up on his last day, though, where he was like, I was about to go do something else and I just decided to run this. And it's like famous last words.
Ben Ford [00:18:36] That's like the ultimate Friday, right.
Rob Nelson [00:18:38] I thought I was doing him a favor helping him out on my last day. Yeah.
Mike Smith [00:18:43] A parting gift!
Demetrius Malbrough [00:18:48] All right, so, Ben, thanks for sharing your horror story. Next up is Mike Smith and let's see how horrible Mike's story is. Mike, how are you?
Mike Smith [00:19:00] I'm doing fine. How about yourself?
Demetrius Malbrough [00:19:01] I am well.
Mike Smith [00:19:03] So first, I already have nightmares coming back from Rob's story from years ago.
Mike Smith [00:19:10] And if you ever feel like nobody knows what you do, just try rebooting an Exchange server in the middle of the day. I've done that accidentally many times. And back in the day when you didn't have all these automation and orchestration tools, you had to RDP to a machine and click the update button manually. And you think you're RDP'd to one machine to get to another machine and hit reboot. Which one did I reboot? And then suddenly your phone starts ringing because nobody can email. Oh, shivers right there. But my story is a little bit smaller, a little bit more personal. So I work at Puppet. I'm a Sales Engineer. Some of your listeners may have interacted with me in the past, but part of our job is to do workshops at customers. So help people learn, on a use case basis, let's do some stuff on some servers together. And I was onsite at a customer and I provisioned all the machines ahead of time and I tested them. I thought, this is working. This is good. And I got my machines provisioned. So for this we have a pair of machines for each person, Linux and Windows machines, and a whole bunch of machines up there. I validated that a few of them worked. I got my slide deck together. I got my agenda.
Mike Smith [00:20:24] People are coming in the room. So there's live people here and I'm in front. Now, I've done a lot of workshops and I like to think I'm fairly comfortable in front of a crowd. But we start getting into the workshop. Nobody can access their machines. Like, what are you talking about? It's here, I can do it. And it turns out I provisioned the machines, specifically the Linux machines, with the wrong SSH key, so nobody could access their Linux machines, so they couldn't do half the labs. And it worked on my machine because I had the right key. But it turned out that was my key, not the public key. And I said, well, crap. So, starting to get nervous here because I'm standing in front of actual people, pre-COVID, when that was a thing. And I'm like, oh, OK, what do I do? And essentially what I did was I paused and I said, well, this particular one was a Bolt workshop. And I said, well, Bolt is something that I can ssh and do some things with. So I turned the failure into a learning opportunity, at least for the folks there. That took me a while to get to. There's a way I can fix the key with mine, which has access to everything. So I gambled and turned on the screen share to my machine.
Mike Smith [00:21:40] I said, all right, everyone, I'm going to show you how to use the tool you're supposed to learn about to solve the problem. But there was a lot of nervousness there in the beginning. And I was like, crap, I don't know. Another SE was helping me. I was like, hey, what can we do? Like, I don't know how to do this. And again, being in front of people just exacerbates that. Right. Makes you a little bit more nervous, makes it worse than it was. Again, time and people looking at you, waiting for you in any of these classrooms. We got emails, we got a workshop, same situation. But I think the biggest lesson that I learned there was even with all the tools that you have and with all the speed and velocity with which everything gets provisioned, and then we got the Reaper service out here destroying them or anything like that, it's really important just to say you're still going to make mistakes. Pause and think about how you can solve it. And really having an understanding of what the tools are capable of can help remediate that situation, hopefully pretty quickly and correctly. But it's really hard to take that moment and stop because you're always in the situation, in the heat of the moment. There's pressure. There's people waiting for you. There's people staring at you. There's people that took time out of their day to come sit in this classroom that doesn't work. You know, like, oh, my God, I've got to get these machines back up for these folks. So taking time to think about those tools has been probably my biggest lesson learned from that.
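Mike fixed the machines with Bolt, and the transcript does not record the commands he ran. As a hedged illustration of the underlying change only, not of Bolt itself, the sketch below shows the idempotent edit such a fix needs to make to each machine's `authorized_keys` content: add the attendees' public key if it is missing, and change nothing if it is already there. The function name `ensure_authorized_key` and the sample key strings are hypothetical.

```python
def ensure_authorized_key(authorized_keys_text, public_key):
    """Return authorized_keys content with public_key appended if it is not already present."""
    wanted = public_key.strip()
    lines = authorized_keys_text.splitlines()
    # Only append when no existing line matches, so repeated runs are harmless.
    if wanted not in (line.strip() for line in lines):
        lines.append(wanted)
    return "\n".join(lines) + "\n"
```

Appending rather than replacing keeps the key that still works (here, the presenter's own) in place, which matters when that key is the only way back into the machine.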
Ben Ford [00:23:04] Wow. Yeah. I think I would totally corroborate that because it's like when you're in the heat of the moment and you're like, I got to do a thing, I got to do a thing. It's like your critical thinking isn't quite as clued in as your fight-or-flight response. And it's like you almost always do the wrong thing and you just make it worse. Whereas if you slow down a little bit and just, like, put some thought into it and be like, what is the right thing to do? What is going to make this better?
Mike Smith [00:23:34] And I think the DevOps tools of today allow us to fail more spectacularly and in more glorious ways than we used to in the past and a lot faster.
Demetrius Malbrough [00:23:46] Yeah, if you're going to fail, you need to fail well, right?
Ben Ford [00:23:50] Hey, that's a really good point. When I learned to ski, the very first thing that they taught us was how to fall down and not kill yourself, right?
Demetrius Malbrough [00:23:56] Yeah, or break something. Well, definitely. So, Mike, give me a recap of the lesson learned there and just maybe one walkaway or takeaway from that story, if you don't mind.
Mike Smith [00:24:14] Yeah. So there's lots of new tools out there in the world of DevOps, if you will, and modernization, and understanding those capabilities is key before you start something. And then when you're in one of those situations, whether you're in front of people, whether you got people waiting for you, taking a minute to pause and critically think about what should I do? What is the actual root cause of the problem? How can I solve it? What's a contingency plan if that doesn't work? It's going to be really helpful. And in some cases, the tool that you're actually using, or that is broken, may be the thing that can actually help you get out of it as well.
Demetrius Malbrough [00:24:48] Yeah, I like that as well. And also something else, you know, since we are, you know, still in the midst of the pandemic and people are working from home. So we're all isolated in our own right. But having a team member to be that second pair of eyes, and having a peer review, having someone kind of look over what you're planning on doing, having that plan, and making sure that everything is tested out. And we have development environments that you can actually run things on now. Right. So there's a lot of ways that I think people can actually get around making some of these mistakes. But at the end of the day, you're going to make a mistake. And if you own up to it, you fess up to it and you say, hey, it was me, and here's what I learned and here's what we can do differently, it's definitely a positive after the dust has settled.
Ben Ford [00:25:39] Well, like Rob said, you try not to make the same mistake twice, right? You learn from it and you get better. But it sounded like you were about to close out. Demetrius, we haven't heard your story yet.
Demetrius Malbrough [00:25:50] Oh, yeah. I don't really have a story. You know, I'm going to kind of dance around this a little bit, you know, while I'm recording. But just kind of in a nutshell, my story was from when I was a TSM admin. So this was back in the day. And I was responsible for managing a large backup environment. And TSM has this thing called scratch tapes. So scratching a tape means that you wipe out all of the existing data on that tape so you can reuse it as a fresh and brand new tape. And so TSM also has these things, or had these things, where you could actually run a command within a script. And so the script also had these prepopulated parameters in it. And there was a bug. So depending on what version of that product you were running, there was a bug with the command and the flag to actually delete data from a tape. And let's just say I kind of ran that thing and it ran and went out and did some things that it wasn't supposed to do. And I forgot the number, but it was probably well over fifty tapes. You know, it was enough tapes to actually make you realize, you know, that feeling in your stomach when it feels like the bottom drops out and it's like, you know what, life is going to end for me right now.
Ben Ford [00:27:16] Did you just say you killed fifty tapes of backup material? I did. Oh, my gosh, I did.
Demetrius Malbrough [00:27:22] However, there was a copy, so there was a copy of the tape. But just to wipe out their primary tape like that, it is scary as hell. OK, so that's my story. And the lesson learned there was to confirm what you have in your scripts when you're running scripts, especially if you run lower level versions and you're not running the latest and greatest version of that particular software. Then you definitely want to make sure that you know exactly what's fixed, what's not fixed, etc. So that's my story. It doesn't sound bad, but trust me, I felt that day was going to end my career. So I'm still here.
Ben Ford [00:28:09] And maybe that's the lesson from your story, because I've felt that feeling too, the oh crap, this is like a resume-generating moment. But like you said, I'm still here too. So we learn from it.
Demetrius Malbrough [00:28:19] I might get a phone call after this thing goes live. And, you know, it might be someone saying, hey, so that was you. OK, we're definitely going to send you some papers over so you can show up and stand before the court. I hope not. I'm not speaking it into existence, but this was definitely fun. And Rob, Ben, and Mike, any final passing words, I guess, from today's DevOps horror stories?
Rob Nelson [00:28:51] Yeah. I'll just add, I think one of the biggest lessons I've learned over the years, and it's one of those that I think you have to re-learn all the time, because it feels like anathema, that it shouldn't be that way, but most of the time, when you make a mistake, when you mess up, it isn't as big a deal as you thought. So take that time, take a breath, figure out what's going on. It's probably causing one person to scream, but it doesn't mean everybody's affected by it. It doesn't mean it's as terrible as it seems. So take that moment, take a breath, figure out exactly what it is before you dive back in, because like everybody's been talking about, you start sweating, you start making bad decisions, start rushing, you run the next tape through without checking the first one. And all of a sudden 50 are done. Take that moment, figure it out and get back to it, because it's probably not that bad. Now, there's probably some places where it really is, you know, if you're working on medical stuff. Maybe this doesn't apply the same way. But for most of us, I think we can take that time and we should.
Demetrius Malbrough [00:29:57] Yeah, great advice indeed.
Mike Smith [00:29:58] And I agree.
Demetrius Malbrough [00:30:00] And is there any way that people can reach out to you, Rob, on social media? And after you, Rob, Mike and Ben, we'll let you guys kind of divvy out your social media handles, if you'd like the listeners to reach out to you.
Rob Nelson [00:30:16] Yeah. The easiest way to reach out to me is @RNelson0 on Twitter.
Ben Ford [00:30:21] And likewise I am binford2k on Twitter, GitHub, Slack, all the places.
Demetrius Malbrough [00:30:34] All right. So I appreciate everyone sharing their DevOps horror stories. I'm not quite sure how I got looped into this. I guess it's peer pressure. So thank you, everyone, for sharing with us on Pulling the Strings podcast powered by Puppet.