Splunk’ing Jira for deep insights into application, database, and server health trends
Transcription
Hello everyone, and thank you so much for joining us today. My name is Ben Lack, and I’m on the marketing team at cPrime. Today’s webinar session is Splunk’ing Jira for deep insights into application, database, and server health trends. Joining me today is one of my colleagues, Justin Evans, who’s a director on the DevOps team. Justin’s got a ton of experience with Jira and Splunk, and I want to give Justin a quick second to say hello before we go through a couple of quick housekeeping items. Justin, you want to say hi?
Justin: Hi everybody. As has been said, I’m a director in our DevOps department, and I have a lot of experience with both Jira and Splunk, which we’ll be talking about shortly.
Ben: Terrific. Before we have Justin present, I just want to go over a couple of things real quick. Number one, today’s session is being recorded so we can make the recording and the slides available to you afterwards. We want to make today’s session as engaging as possible, so there’ll be a couple of poll questions that we would love for you to answer so that we can comment on them in just a moment. There are also going to be sections where you can feel free to jump in with questions or use cases or case studies that you have in terms of some of the trouble that you’re having with Splunk or Jira, because I think part of today’s session is really going to be around providing a lot of use cases and thoughts around how to make your lives a little bit easier.
So there’s also a three question survey when the session is over. I would love for you to fill that out, let us know how we did, give us any feedback so that we can make these webinars as useful for you as possible. I wanna kind of get our conversation started with a quick poll. And the question is, “Are you having problems or issues with Jira stability?”
So if you could just select one of the following: yes you’re having problems, no, or you’re not sure, I’d appreciate it. And if you’re not sure, if you could chat me why you’re not sure. If you say yes, you are having problems, you can chat me, if you’re comfortable with it, what kind of problems you’re having. And if it’s no, then you’re in a really good place, so you can just listen in. But we’ll give everybody another five to 10 seconds and then I’ll close the poll and share the results. And then I’ll have Justin comment on it.
Okay. So our poll shows that 45% are not having any problems with Jira stability, but we do have about a third of the folks who are having problems, and 20% who aren’t sure yet. So some of the comments from folks: Lavina says, “I’m new to Jira,” so she doesn’t know. Bob says that we’ve had a Jira board disappear and then reappear days later. And then there are some other comments around instabilities, whether they’re working with us or with other folks. Another comment is lots of downtime or bulk update issues, or the Jira board disappears and the tasks aren’t appearing. So I’m just curious to hear your thoughts, Justin, on the responses that are coming out of this poll.
Justin: Yeah, those are all good responses. I haven’t listed those ones in particular in the case studies we’re going to be covering a little bit later, but it sounds like we’ve got some common threads going on there. And I believe Splunk would be able to help in this capacity. So thank you for your responses.
Ben: That’s a perfect transition to the second poll that I want to launch real quick, which is, “Are you currently using Splunk today?” So, if you could quickly let us know: go ahead and select yes, no, or “what is Splunk?” That would be super helpful for us. And if you are using Splunk, if you could let us know how you’re using it. And if you’re not sure what Splunk is, we’ll be sure to educate you on what Splunk is and go from there. So just three or four more seconds and then I’ll close the poll and share the results.
So more than 50% are not using Splunk today, although almost a third are. And Aaron is saying that he is using Splunk, and he’s analyzing log files, but not very well. And Andrew is saying that they have it and they’re just about to turn it on, and so they’re really hoping to get some insights today on how to leverage it. But Sharon’s also commenting that they’re using Splunk for some monitoring services. Ben is saying that he answered no, but he doesn’t really know what Splunk is. So Justin, curious to hear your response ’cause we’ve got a broad range of responses.
Justin: Yeah, that actually seems a little surprising to me. Among the people we’ve talked to recently, it’s always surprising to us how many organizations have Splunk installed but are not getting maximum use out of it. So I think the ratios sound about right, and it’s really common for people to have it installed and to throw a bunch of information at Splunk, but not use it to its full capacity. So very interesting responses, thank you.
Ben: Terrific. So without further ado, Justin, if you’ll go ahead and put your slides into presentation mode, I will pass off the baton to you and just say thanks to everybody so far for jumping in. This kind of engagement is what we’re looking for and hopefully you’ll find Justin’s session useful. Justin, feel free to take it away.
Justin: All right thank you everybody for joining us. We’ve got about 50 minutes left and a lot of ground to cover, so hopefully I go quickly, but not too quickly. Ben, feel free to jump in and let me know if I’m going too fast and I’m not making any sense. I think you have the power to do so, so please feel free.
Again, welcome. This is Splunking Jira for Deep Insights into Application, Database, and Server Health Trends. Essentially, what we’re talking about is common case studies where Jira’s stability becomes a problem under certain use cases, and how you can use Splunk to identify those problems and take action on them.
So my name is Justin Evans. I’m the director of products in the DevOps department here at cPrime, and we’ll get into it. The agenda today: we’ll talk about ourselves just a little bit. cPrime fancies itself a thought leader in the Jira and Splunk space. So we’ll go over some common Jira performance degradation and failure mechanisms. We’ll then move into a live demo of Splunk and how Splunk hooks in with Atlassian application dashboards, so we’ll cover Jira at least. If we have time, we’ll also cover Confluence and Bitbucket, which we’ve also created some dashboards for. We’ll talk about how cPrime could help if you’re having some of these issues, and we’ll open it up to Q&A.
Introductions. Again, I’m Justin Evans, director in the DevOps department at cPrime. I have six years of Atlassian administration experience. I’ve been the Atlassian user group leader for Los Angeles, and I am currently the Splunk user group leader for Portland. I recently moved my family from L.A. to Portland. Always trying to stay involved; good networking opportunities and sharing best practices and thought leadership with the community are something I’m passionate about. So, that’s a little bit about me. I will talk a little bit about cPrime.
Our mission is to help teams produce the extraordinary. Our expertise is finding the business value and maximizing the business value through the confluence of team tools and process. That’s really what we’re all about.
So why are we here today? Talking about integrating Splunk with Jira. Splunk’s taglines are Turning Machine Data Into Answers, and Any Questions, Any Data, One Splunk. Where Splunk really shines is machine data aggregation. Every software application you have, every IoT device, every thing that is connected to the internet or is running on a server somewhere is generating information, whether it be log files, database record entries, what have you. The idea behind Splunk is to get all of that information into one central location and then to be able to filter and sort and aggregate all of this information together into certain visualizations that give you real business intelligence.
You’ll see on the left-hand side, we’re talking primarily about Jira today, but Jira is not just Jira. Jira is a Java application that often runs behind a web proxy like Apache or Nginx. Apache Tomcat is the web server within Jira. Jira also has a database; you’re probably using PostgreSQL or MySQL. It’s probably running on a server with an OS like Linux or Windows, and it’s running on some kind of infrastructure, whether that’s hardware in your office or a cloud provider like AWS, GCP, or Azure. So Jira, again, is not just Jira. It’s a combination of a whole bunch of different software suites, frameworks, utilities, servers, what have you. And the idea is to take all the information that’s being generated from all these software systems and services and get that into Splunk. And then you put that together and start getting some actionable intelligence out of it.
So again, cPrime fancies itself a thought leader in the Atlassian space as well as the Splunk space. So we will talk about some common Jira performance degradation and failure mechanisms. These are things that I’ve seen firsthand over the previous six years of being an Atlassian application administrator and infrastructure administrator. We’ll get into it. I haven’t seen it all, but I’ve seen a whole lot, and every time I give this presentation, I have people coming up to me saying here’s another case study that I’ve seen a couple of times. And as we get that information, we try to aggregate it and compile it into a larger presentation. Again, hopefully this doesn’t take too much time. We’ll just run through a few very common scenarios, and how Splunk could be used to either help identify that a scenario is happening or to give you more intelligence about how to take the next steps once you see these case studies happening.
First and foremost, too many concurrent users is one of the main problems with Jira stability. Depending on how many resources you have allocated to the JVM or virtual machine running your Jira application, too many concurrent users can really cause Jira to slow down. So this is something we’ve seen often: teams that are fully agile in an organization, that have multiple agile teams, and they all do their sprint ceremonies, their sprint planning and their backlog grooming, at the exact same time, the exact same day, every single week. Everybody’s using Jira to its full capacity and users are reporting that Jira’s slow.
So how do you identify that this is actually happening to your organization, and what could you do about it? You can monitor your HTTP access logs and run stats for a given time frame, then correlate your actual response times with the number of users and what they are accessing. So you can see that everybody is logging into Jira agile boards and, on a 250-user license, maybe 200 users are using the application at the same time. And you might be able to take action on that in terms of allocating more resources, or changing your process so that your teams are not all using Jira at its fullest capacity at the exact same time of the same day.
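As a rough sketch of the kind of correlation being described here (not the actual report from the webinar), the SPL below buckets the access log into 15-minute windows and compares distinct client IPs against average response time. The index name jira, the sourcetype, and the response_ms field are assumptions; the response-time field in particular has to be extracted from whatever access-log format your proxy or Tomcat is writing.

```
index=jira sourcetype=access_combined
| bin _time span=15m
| stats dc(clientip) AS concurrent_users, avg(response_ms) AS avg_response_ms BY _time
| sort _time
```

Charted together, concurrent_users and avg_response_ms make it obvious whether the Monday-morning sprint ceremonies line up with the slow periods.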
Another case study, things that can happen to degrade Jira’s performance: if you have long-running or suboptimal JQL in filters or dashboards, that takes resources on Jira. And as more people are using those resources, those longer-running threads within Jira cause slowdowns for everybody. So one thing I’ve seen is you’ve got Jira Service Desk installed and you’ve got multiple queues in your service desks, and they all leverage scripted JQL to filter queues by common authors and timestamps. That’s very processing intensive in terms of accessing the database and pulling up the issues that are relevant based on authors and timestamps. And you’ve got 50 people who are all using that exact same filter and refreshing constantly, looking for new issues coming into their service desk. And everybody’s saying that Jira’s slow. Well, that’s why.
So again, how do you identify that this is the problem, and what can you do about it? You can monitor your web proxy logs or your database’s slow query log; MySQL, for example, has a slow query log out of the box. You can determine which filters are causing problems and see what the true impact is to the end user in terms of response times. And then you can dig into how those filters are written in terms of JQL and optimize them and, hopefully, lessen the page load time for a filter, thus improving performance for everybody else.
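A hedged sketch of that kind of search against the reverse-proxy log: it groups requests by URL and ranks them by 95th-percentile response time, so the worst filters float to the top. The index, the sourcetype, and the uri and request_time fields are assumptions about your log format and field extractions.

```
index=jira sourcetype=nginx:access uri="*filter=*"
| stats count AS hits, avg(request_time) AS avg_secs, perc95(request_time) AS p95_secs BY uri
| sort - p95_secs
| head 20
```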
Another thing that we’ve seen that causes Jira to slow down over time is an excessive number of total issues and an excessive number of custom fields. Every issue and every custom field that you add to the system increases the number of records in the database, which causes read times to slow down substantially depending on how many fields you have. We’ve seen clients that have as many as 4,500 custom fields, and they report that Jira’s slow. So typically, an organization installs Jira, everything’s blazing fast, everything’s fantastic, and then over time it gets a little bit slower and a little bit slower, until you’re to the point where you have something like a couple hundred thousand issues, a couple hundred to a couple thousand custom fields, and users are reporting that Jira’s slow.
So how do you know this case study is applicable to your organization? You can monitor your Jira database for custom field creations over time and the number of issues created over time, and you can correlate that with response times over time. So if you hook up your Jira database and your web proxy logs on day one and you run it for three, six, nine, 12 months, a couple of years, you can actually see, as you increase the number of issues and the number of custom fields, what that means in terms of response times to the end user.
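One hedged way to get those counts into Splunk is a scheduled Splunk DB Connect input against a read-only replica. The connection name jira_replica is made up, and the table names follow Jira’s standard server schema, which can vary by version:

```
| dbxquery connection="jira_replica" query="
    SELECT (SELECT count(*) FROM customfield) AS custom_field_count,
           (SELECT count(*) FROM jiraissue)   AS issue_count"
```

Run on a schedule, each execution is indexed with a timestamp, so a simple timechart of custom_field_count and issue_count can be laid over the access-log response times.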
Ben: Justin I got two quick questions for you on your past use case that I wanted to ask you first.
Justin: Sure.
Ben: Well the first, Alberto was asking, “Does Jira have a cloud version or an on-premise version, or both?”
Justin: That’s a very good question. So Atlassian offers Jira in the cloud. I think it’s limited to 2,000 users. What I’m talking about here is Jira on-premise, because you have access to the logs, you have access to the database, and you can send all that information into Splunk if you have it on-premise. I probably should have prefaced that on-prem Jira is required to make this all work; Atlassian restricts your access to certain parts of the Atlassian stack when it’s hosted in the cloud. So yes.
Ben: Right. And then for the use case that you just talked about for custom fields, Ben asks, “How do you efficiently get rid of custom fields to improve the performance?”
Justin: I will touch on that a little bit later once we get into the live demo of the Splunk dashboards. Just to allude to it, you’ll see you can actually track the usage of each custom field, and if you find that you have 400 custom fields and 150 of them are only populated on maybe two or three issues, you can talk to whoever created a custom field and say, “Maybe can we get rid of this?” And that’s been shown to improve performance.
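As a sketch of the kind of query behind that usage tracking, again via Splunk DB Connect against a replica: it counts how many issues actually have a value for each custom field, so sparsely used fields sort to the top. The connection name is made up, and the customfield/customfieldvalue table and column names follow Jira’s standard server schema, so verify them against your version before relying on this.

```
| dbxquery connection="jira_replica" query="
    SELECT cf.cfname, count(DISTINCT cfv.issue) AS issues_populated
    FROM customfield cf
    LEFT JOIN customfieldvalue cfv ON cfv.customfield = cf.id
    GROUP BY cf.cfname
    ORDER BY issues_populated ASC"
| head 25
```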
Ben: Thanks.
Justin: Okay. So there’s a couple of case studies, very common ones. Now we’re gonna get into some really involved ones. This is the section I call “times I wish I had Splunk.” I call them outrageous case studies because these are real edge cases, but Splunk would’ve really helped immensely to be able to identify that these outrageous case studies were happening to us. So we’ll get right into it.
I was at a company for a while that shall remain nameless, because someone might actually be on this webinar and looking at this right now, and I don’t want to throw anybody under the bus. But I had reports that Jira had been slow for weeks and, of course, I knew that. I was using Jira every day and it was getting slower and slower. I had no idea why. So we said okay, we need to do something about this. And the first thing I asked of my IT team was, “Okay, give me a copy of the application log files.” Again, at this time, I did not have Splunk installed. And they turned around and gave me a log file that was 20 gigabytes, and I couldn’t open it, because my machine only had eight gigabytes of memory. Every time I tried to double-click on the file to open it in my favorite text editor, my computer would crash.
So for whatever reason, we didn’t have logrotate enabled on the server, which is what keeps those log files digestible and openable in desktop software. Once we enabled it, we got some improvement. It didn’t actually fix the issue, but I was finally able to see what was being put in the log files. And it turns out Structure, one of the premier add-ons for Jira, has this concept called synchronizers, where you synchronize a Structure board to an agile board or to a project. And they were failing every minute for weeks. So the synchronizers were running every single minute, 30 of them were running every single minute, and failing, for weeks. Number one, that was polluting the log file, which made the log files grow so large. And number two, it was taking resources that could have been used for aspects of Jira that were actually working.
Okay. So what could have helped? Real-time log aggregation along with alerting, which Splunk is also fantastic at. I would have seen these new errors within a few minutes of the first error occurring. The reason why the synchronizers were failing was because you had to assign an owner to those synchronizers, and that owner had left the company without signing them over to somebody else. His account was disabled, so the synchronizers just couldn’t run. And all of this took weeks of investigative work, and I could have found it immediately if I had had these log files going to Splunk and had set up an alert that said, whenever you see an error, tell me about it.
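A minimal sketch of that kind of alert, assuming the Jira application log (atlassian-jira.log) is being forwarded under an index and sourcetype like the ones below: save the search on a 15-minute schedule and have it trigger whenever it returns results.

```
index=jira sourcetype=jira:application ERROR earliest=-15m
| stats count AS error_count BY source
| where error_count > 0
```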
And here’s where you begin your slow descent into alcoholism. Here at cPrime, we had a client once who said, Jira won’t start, we don’t know why, so Justin, could you jump on a call. So I say, disable all your add-ons, because that’s the standard best practice when Jira won’t start. And they say they have. I want to verify for myself, so I say, let me see your log files. And, you know, they have high security, you can’t access the server, they’ve got security concerns, so they open a VNC session, and this is the kind of command you run in a Linux terminal to tail the logs that are being generated in real time as output from Jira, to try to identify what the problem is. The only problem with that is … I’ll get to that in just one second. But the thing is, if you’re tailing log files, the information is flowing so rapidly that you’re going to miss the important information, especially during crunch time when minutes count.
So I’ll just give you a quick example. This is not the actual log file; this is what it looks like when Jira starts up and you’re trying to find that needle in the haystack, trying to find a smoking gun as to why Jira won’t start. What is the real problem? Is it a problem with the database? What kind of problem is it? This is what you’re looking at when Jira starts up, and you’re trying to identify what the problem is. You want all of these log lines sent to a log file aggregator in real time so that you can search them after the fact. Because at some point, this will just shut down, freeze, and Jira won’t start, and I may have to do it again and try to find that needle in a haystack.
If you’re sending all of this to an aggregator like Splunk, it’s all easily searchable by timestamp, and you don’t need to keep restarting it, looking at this, restarting it, looking at this, trying to find that smoking gun. So very, very frustrating.
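Once catalina.out and the application logs are flowing into an aggregator, finding the smoking gun after a failed start becomes a time-bounded search rather than a race against scrolling text. A rough sketch (the index and source path are assumptions about how the logs were onboarded):

```
index=jira source="*catalina.out" earliest=-30m (ERROR OR SEVERE OR FATAL)
| sort _time
| table _time, source, _raw
```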
So the point of all this is, with these case studies, with these extreme examples, when you send all this information to Splunk, Splunk is really good for actionable intelligence. So what can you do knowing that these case studies have happened, or these extreme examples have happened to you? You can propose process changes. For example, if every team is doing sprint planning or backlog grooming at the exact same time, the exact same day, every single week, you can propose some process changes: maybe we should stagger our meetings so that Jira is performing for everybody. Application configuration modifications: if you see that as you have so many issues, so many custom fields, performance degrades, you could propose, “Hey, maybe we want to get rid of some of these custom fields to improve performance for everybody.”
Requests for IT budget and resources: if you have the hard data that says, look, Jira is slow because we don’t have enough Java resources allocated, and we don’t have enough resources allocated because our server is not big enough, and we need more budget to bump up the server or go to Data Center, you have that information at your disposal to make that argument. A lot of times I’ve seen, before using something like Splunk, people would just say Jira’s slow, we need more resources. And it’s like, okay, well, everybody kind of shrugs because, yeah, we know Jira’s slow, but they don’t know why. You can specifically say we need more CPU, we need more memory, we need X and Y, and this is why. It just makes a better argument to get that budget to beef up your systems.
And alerting and forecasting: getting ahead of complaints. Again, if the CPU is burning at 95% for an hour, or if you’re reaching your maximum capacity for RAM, you want to monitor those things so that you can take action before Jira actually crashes, which does happen from time to time.
And at this point we’ll go to a live demo, which will [inaudible 00:24:03] application dashboards that we created with Splunk and hopefully this transition to my web browser is seamless. I’ll probably ask Ben to verify that …
Ben: Chris is asking if you’re going to provide actual query examples that they can run against their own environment. So, if you could comment on that real quick.
Justin: I won’t … So Splunk has this framework called SPL, the Search Processing Language. It’s a lot like JQL, but it’s Splunk specific. I won’t go into the actual queries that have been used. I’ll talk a little bit about how we’ve integrated it, but I don’t think I’ll be showing any actual queries. I won’t be focusing on the queries that generate this data, but if there are any specific questions about them, I’d be happy to field them.
Justin: Okay. So we are looking at a dashboard in Splunk. Again, I’m sending application performance monitoring logs, I’m sending Linux logs, I’m sending Jira application logs, I’m sending Java garbage collection logs, I’m sending Nginx reverse proxy logs, I’m sending HTTP access logs. I’m hooked into Jira’s database, and we recommend you hook into a read-only replica database, so that the queries you’re running against Jira’s database don’t hurt actual production Jira performance. The point is that on that earlier slide I showed you, with all those different software systems, frameworks, applications, and servers, all those logs are being sent to Splunk. And when you do something like that, you can create these great near-real-time dashboards. I say near real time, because I don’t want to actually run them in real time; that actually hurts Splunk performance, which is maybe a topic for another day. But each one of these widgets on the dashboard is tied to a report, and that report runs a query against log files over a recent time frame, or monitors records created in Jira’s database over a recent time frame, and puts that all together.
So the few boxes you see across the top here are classic application performance monitoring. You want to see how much CPU is being used over time. You see we had a spike recently, but we’re doing pretty well currently. Input and output bandwidth: at the time these reports were run, clearly nobody was actually using the system at that exact moment. It’s got sparklines and it’s got trends, so you see that the reading before the zero kilobytes a second was actually 72 kilobytes a second. So it is being used. JVM [inaudible 00:27:11] utilization: we find that people sometimes have performance problems based on how they tune their JVM parameters for the Jira application. And you see that currently we’re at 33%, which is pretty low. So if you wanted to optimize the JVM for performance, we might actually allocate less memory to Jira, which would also save some money. So you can get some pretty good insights just from your JVM garbage collection logs.
And, of course, primary disk remaining and secondary disk remaining. We have some alerts set up where if it gets below a gigabyte, or if it is below 500 megabytes, an alert goes over to the IT team, and then they go into AWS and allocate more hard disk space for the Jira application.
Request times by method. So this is, I think, the [inaudible 00:28:01] reverse proxy log. It’s a nice, telling image; you see that our performance is not too bad. Everything seems to be getting returned in under two seconds, which is probably a pretty good threshold to work with. And you see, before 9:00 a.m., classic business hours, not a whole lot of activity, and then after 9:00 a.m. a lot more activity.
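The widget being described can be approximated with a single timechart over the reverse-proxy log, assuming request_time and method fields have been extracted from the Nginx log format:

```
index=jira sourcetype=nginx:access
| timechart span=15m avg(request_time) BY method
```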
Ben: Just at a high level, as you’re going through these stats, can you quickly speak to how you managed to get these metrics into Splunk, like CPU, JVM heap, and others? Is this out of the box?
Justin: Yeah so … Some of this is out of the box. The Atlassian Jira log is something that is there immediately and it’s always there. Some of these use Jira’s HTTP access log, which is not enabled by default. You have to do some configuration tweaks on Jira’s configuration files to enable those. But I’ll try and speak to what the source was for each one of these different panels that you see.
Ben: Thanks.
Justin: So for CPU and network utilization and disk remaining, I use a homegrown APM web application called Oriole I. It’s available for free [inaudible 00:29:21]. For JVM heap utilization, that’s garbage collection logs. Those aren’t very verbose out of the box with Jira, so yet again, you have to go into Jira and tweak some settings to make your JVM garbage collection logs verbose enough to give you the information that you’re looking at here.
Request times, I think I mentioned, is the Nginx reverse proxy logs. So that’s where we are right now. And good governance things, some metrics about just today: configuration changes today. Did somebody add a component, did somebody remove a version, did somebody add somebody to a project? And I will say cPrime as a consultancy is very large, but our day-to-day usage of Jira, I think we have maybe 100 users using Jira on a daily basis, and of those, only a handful are using it at the same time. So we don’t typically run into the performance problems that I’ve been speaking to, just because we do have a smaller instance than many larger organizations out there. So you can see two configuration changes; that’s not very many. If you are a big organization, you might see several hundred.
And the beauty of this is, this shows you where you are today, but any time a record is created or a log event is generated, that gets ingested and it’s always searchable back in time. So these widgets are configured to show you just today, but if you want to see how many configuration changes happen per day, or per hour, or per month over time since Jira was installed, or since Jira was hooked up with Splunk, you can see all those metrics. I won’t get too deep into the niceties, the quantities of what’s happening today. But you can see the next one, which is custom fields created over time. You can see that in January we started with 40 and, just like most organizations, somebody sometimes wants to add another custom field, and as you have more custom fields, again, your performance tends to degrade slightly until you have so many that it becomes a significant impact.
So this shows you how many custom fields have been created over time, and if you were to click into this report, which I don’t think I will because I want to keep moving for the sake of time, you can see who created it and when they created it. And it’s also good for governance, because if you have multiple Jira administrators, you can follow along and say, “Okay, well, you’ve created a custom field called Approved, but we already have two custom fields for approval, why did you create a new one instead of using the old one?” There’s lots of different governance you can do by having the data at your disposal.
And as I alluded to earlier, there are a couple of widgets here: most popular custom fields and least popular custom fields. So you’ll see we have at least 10 custom fields here that have only been populated on one issue. Now, there’s a very strong reason we have for that: we’re trialing an add-on that requires custom fields to be in place, and it’s not totally rolled out yet. But you can see that if you have, say, 300 custom fields and, of those, 100 of them are sparsely used, you can consider removing them from the system and increasing performance.
Other niceties: issues by [inaudible 00:32:46] category. Of course, you’d like to see more resolved over time than are [inaudible 00:32:53] in progress. And issues by status category over time. It looks like back in March we still have 110 issues that were created in March that are in To-Do. So hey, we should probably look into those and try and close them out, and try and find out why that is and make it so that maybe we don’t have to create those issues in the future, because again, more issues degrades performance. So if people are creating lots of issues that never get touched, maybe change your process so you don’t have to create so many issues: create one epic, create one task, concentrate on resolving it fast. Change your workflows. Intelligence of that nature can come out of really simple graphs like this.
How are people using it? Who’s logging in every single day? You see Wednesday, Thursday, Friday, we have an increased number of users; Saturday and Sunday, not so many. It’s another nice graph to have. Nothing really actionable for our purposes, but for your purposes, that could be an important thing to track. Leaderboards, active users today: as you can see, some of our European developers have been very busy today, which makes sense, because they’re 10, 12 hours ahead of us, so they have had more time to use the system. Nice insights like that.
Application errors by source. So these are all Jira’s log files. Jira has security logs, access logs, service desk logs, incoming mail logs. You might want to break it down by how many log lines or events are happening in each one of those log files. So if the security log becomes a bigger piece of the pie, you should probably look into what’s happening in the security log files.
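Because source is a default Splunk field (the path of the file an event came from), a breakdown like that pie chart needs very little SPL. A sketch, assuming the Jira log directory is being monitored into an index called jira:

```
index=jira source="*atlassian-jira*" (ERROR OR WARN)
| stats count BY source
| sort - count
```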
Ben: We got a quick question, sorry to interrupt. A few folks, Ameed and Halal and others, are asking if the widgets you’re showing are being created from Jira access logs. And if so, can you speak to that, and are there other metrics from issues that are being pulled from other places?
Justin: Yes. I’m sorry, I said I was going to say what the source is for these little widgets and I totally abandoned that. Custom fields created over time: that’s looking into Jira’s database using a free Splunk add-on called DB Connect. Same story with the usage of custom fields. Issues by status category, I believe that’s also Jira’s database. Log-ins, I think that’s using Jira’s HTTP access log, which, again, you have to enable in configuration; it’s not enabled out of the box with Jira. Same thing with most active users today. Errors by source, that is out of the box; it’s literally every log file that Jira creates, and it goes into this pie chart.
Notable application thread events today. This takes all of those log files and breaks them down by thread within Jira. So this comes out of the box as well, and I do want to talk about this just a little bit. You see that for us, HTTP threads are throwing 300,000 errors today. That’s a lot of errors. It’s polluting the log and it’s creating havoc with some of our reports. We actually were able to do some investigative work here, and we found out, using Splunk actually, that these errors started at the very moment we installed a new add-on. Again, I’ll leave them nameless, but we are working with them to address these errors that they’re throwing. It is putting out a lot, and it doesn’t seem to hurt performance, but we can’t say for certain that it isn’t hurting performance. So we’re working with the developer to fix whatever errors this is throwing. I think it’s throwing something like a 500 server error every time someone tries to edit an issue, which is not a good thing. But again, it hasn’t degraded performance so badly that it’s all hands on deck. Again, that’s just standard Jira logs.
Request times by country. Another case study, something that happened to me a couple of years ago at another company: we onboarded [inaudible 00:37:15] from Eastern Europe. They were bold, in fact. They explicitly said, “We’re not going to use your Jira. We’re not going to be working in Jira, because it’s too slow for us.”
And we were like, “Okay, well, is it really too slow for them, or do they just have a process allergy and they don’t want to be pulled into our systems?” [inaudible 00:37:35] By reverse-looking-up IP addresses against geographic locations with the Nginx or Apache access logs, you can see what the response times are per country, per state, per city, whatever you want to look at. And we see that in Eastern Europe, we actually do have a problem today with response times: over a second and a half, which is a lot. It’s a lot worse than it usually is. So you can verify whether problems are really problems, for one user or for a set of users, globally, what have you.
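Splunk’s built-in iplocation command does the IP-to-geography lookup described here. A sketch against the proxy log, again assuming clientip and request_time fields are extracted from your access-log format:

```
index=jira sourcetype=nginx:access
| iplocation clientip
| stats avg(request_time) AS avg_secs, count AS requests BY Country
| sort - avg_secs
```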
If you’re a Jira admin, then you know about reindexing: Jira indexes all the information in the database to make it searchable. Every once in a while you have to perform a reindex, and as you’re performing a reindex, it takes a certain amount of Jira’s resources. Some CPU is dedicated to reindexing, some memory is dedicated to reindexing. So while a reindex is running, Jira doesn’t perform too great for everybody else. So if you have to reindex, you want to know how long you can expect the performance degradation to last. And again, cPrime’s instance is very small, so our reindex time is only about two and a half minutes. But if you’re a larger organization and you have several hundred thousand issues, I’ve seen reindexes take up to several hours. You want to have that knowledge to say, “Okay, well, we can expect the reindex to take two and a half hours. We’ll put out an IT notice saying, for the next two and a half hours, please bear with us, Jira might be slow, and you don’t need to tell us that it’s slow.” Just to cut down on the noise.
Audit log: Jira has an audit log. This is a database connector, I believe; we look into Jira’s database and pull all of the audit log events and bring them into Splunk. More of a governance-type thing. And just more governance in general: again, looking at Jira’s database with Splunk DB Connect, we have a policy here at cPrime that for anybody who is not a cPrime employee, their username in Jira should be their email address. It shouldn’t be some kind of shortened username. But we see here that some people who don’t have a cPrime address don’t have their email address as their username. So we go in and clean up this data and tell people who onboard, new consultants, new clients: we have a policy, please try and stick to it.
And going back to the reindex time frame, you want to know when is the best time to trigger something like that. What is the best time to upgrade? When’s the best time to plan for downtime? Using all of your log files and checking how many events happened in them in terms of accesses by real human beings, you can check out the general activity by day of the week, with Saturday and Sunday being low, or activity by hour of day. Of course, I don’t think you can see the numbers here, but late at night there’s very low usage, which stands to reason. So when you’re planning downtime for upgrades, or for reindexes, what have you, you can say, well, it looks like Saturday at 1:00 a.m. is the best time. That’s fantastic. But as a Jira administrator, I’d rather not do these things on Saturday at 1:00 a.m. So you compromise with your boss: “Okay, well, how about Friday?”
“Friday is no good, because you don’t want to be working over the weekend in case anything goes wrong.”
“So next up, Thursday.”
“Right, Thursday looks good. Looks like people kind of tail off towards 4:00PM, so how does Thursday at 4:00PM sound?”
And you can use this data to say, “4:00 p.m. is optimal for the business and for my sanity, so let’s try to schedule for that.” So, that is our Jira dashboard in a nutshell. I should say that each one of these widgets is based on a report. I’ll show our reports currently, and you see we have a bunch of them, primarily for the Atlassian applications within our Splunk, because we are an Atlassian consultancy. So we do a lot of [inaudible 00:41:47] reports, not just for ourselves, but for our clients as well.
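The activity-by-day-of-week and activity-by-hour panels used for that downtime planning come down to bucketing access-log events by time of day. A minimal sketch (index and sourcetype are assumptions), which in practice you would render as a column chart or heat map rather than a table:

```
index=jira sourcetype=access_combined
| eval day=strftime(_time, "%A"), hour=strftime(_time, "%H")
| stats count AS requests BY day, hour
```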
So, each one of these widgets is based on one of these reports. A couple of reports that I will show you that I found personally interesting: one is the page load histogram. It’s not showing the visual, but when you’re using something like the Nginx access logs, it tells you what the request time is from the time the request was made to the first response from the server going back to the end user. But that’s not how it usually works in the real world. If you make a request to, say, an agile board in Jira, you’re getting a response from your reverse proxy, but Jira is also doing a lot of things in the background. It’s hitting its own API to return the issues; it’s hitting APIs for the various add-ons you have installed that are integrated with your issues. So your real page load time is longer than what your reverse proxy would suggest.
So I made this neat little report, and you can see some of what SPL is all about. It’s a lot like JQL, if you’re familiar with Jira, but it’s native to Splunk. By using this SPL, we are able to parse those log files and map certain parameters to certain fields. And what I’m showing you here is actual page load times. So this user hit this URL, and when he hit this URL, in the background Jira is making all these requests essentially to itself to return information to the screen. How long did this whole process take? In the real world, how long did it take the page to load? In this case, it took 15 seconds.
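One rough way to approximate that “real” page load time from the proxy log alone, without the custom field extractions used in the actual report, is Splunk’s transaction command: it groups a user’s burst of requests (the page plus all the background API calls) into a single event and computes duration and eventcount fields. Treating clientip as the user identifier is an assumption:

```
index=jira sourcetype=nginx:access
| transaction clientip maxpause=5s
| eval page_load_secs=duration
| table _time, clientip, eventcount, page_load_secs
| sort - page_load_secs
```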
So this is something I just created yesterday for my own benefit. I can’t really say much more about it; I’m not sure how I’m going to use it, but it’s one of those things. Going back to the case study about filters: if you see that a certain filter is being accessed over and over again and the page load time in the real world is taking a very long time, you might want to go ahead and tweak that filter to make the page load faster, so that Jira performance doesn’t degrade for everybody else who’s not even accessing that filter.
Another interesting report that I’ve created is Jira’s JVM memory over time. This comes from the verbose garbage collection logs. You can see how much memory we have allocated to our Jira instance; right now, it’s 3.8 gigabytes. It also shows you what the JVM heap is before garbage collection and after garbage collection. Eden is like a short-term caching mechanism within the JVM; you can see what Eden is before and after, and you can see how Eden, the short-term allocation, plays with the heap, which is the long-term allocation. As you have more long-term allocation in your JVM, you have less available for your short-term allocation. So I think it’s just a really neat graph to show you how garbage collection works in Jira and how garbage collection works in general.
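A hedged sketch of how a report like that can be built from verbose GC logs. The regexes assume Java 8 style ParallelGC lines (PSYoungGen / ParOldGen), and the index, source path, and field names are all illustrative; the exact patterns depend entirely on your JVM version and GC flags.

```
index=jira source="*gc*.log*"
| rex "PSYoungGen: (?<young_before_kb>\d+)K->(?<young_after_kb>\d+)K"
| rex "ParOldGen: (?<old_before_kb>\d+)K->(?<old_after_kb>\d+)K"
| timechart span=5m avg(young_after_kb) AS young_gen_kb, avg(old_after_kb) AS old_gen_kb
```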
So we talked about Jira stats. We talked about a couple of really neat reports, and how these dashboards are based on reports, and those reports are generated from data sources, whether they’re Atlassian out-of-the-box log files, things you have to enable within the Atlassian application configuration files, Jira’s database, or reverse proxy log files. All of this can be thrown in, and the reports that you create don’t need to be limited to just one data source, but I think for all intents and purposes this dashboard shows you one data source per report.
Ben: Justin, I got a couple of quick questions that I wanted to ask you from some folks that are watching this. Rick had noticed in the previous graph that he saw Route 53, is Splunk able to pull in AWS logs as well?
Justin: Yes. You know what, I will show you something very, very quickly. There’s a very powerful add-on for Splunk called the Splunk Add-on for AWS. And it gives you every bit of data under the sun if you integrate all of your CloudWatch logs, and it goes step by step when you’re … I’m not sure why it’s not showing.
I’m sorry, I’ll jump back real quick. But it goes step by step and says, “AWS is generating this, do you want to ingest this data?”
You say “yes,” and it tells you to go back to AWS and here’s how you hook it up with Splunk. And it gives you a God’s-eye view of everything happening with AWS. And again, I apologize, I think it’s been recently updated; I’m not sure why it’s not showing what I’m expecting to see. Oh, I’m sorry, Splunk App for AWS, my apologies. This tells you every configuration change that was made in AWS, by region, by instance. How many instances do you have running? EBS volumes. You name it, if something happens in AWS, it can be captured in Splunk. And this Splunk App for AWS is really powerful. Again, it’s free. It gives you insights into things like: you have so many EC2 instances that have really low utilization, so you might want to think about reducing the instance size or running fewer instances. Or one is always running at 99% capacity, so you may want to think about bumping it up, say from a T2 to a C4, or from a large to an extra-large.
So again, that’s an aside. But yeah, this app in particular ingests all the information from AWS into Splunk, and it gives you really nice dashboards that are AWS-specific. But again, these are all based on reports, which are based on actual machine data, which you can also integrate with Jira or any other applications. So, a good question.
Ben: Thanks, and I have one more for you. Chris is saying that some of this seems like it comes out of the box, but some of this is also custom work. So if companies are interested in trying to get some more robust reporting and graphs, how could they work with cPrime to make that happen? Is this the kind of work we do with clients when they’re trying to up-level their [inaudible 00:48:44] instance?
Justin: Yeah, that’s exactly right. This is a whole suite here. It’s not something that you can necessarily install and just get straight away. Like I said, this requires things like installing an APM web application on the server that’s running Jira to get these top widgets, and integrating with the free Splunk add-on, Splunk DB Connect. It’s something that you can do, but it’s on a case-by-case basis; we can’t really productize it. Changing Jira’s configuration files to enable certain logs, we can’t just sell this as a tool without also digging into Jira’s configuration files, installing the APM tools on the server, setting up Splunk DB Connect, and orchestrating the ingestion of new records, because you don’t want that running in real time.
So it really is high touch, but we’ve built a lot of expertise around this, and we know a lot of best practices about how to make this work so that you’re not only not affecting Jira’s performance, but you’re also not affecting Splunk’s performance for other users. So, what I’m showing you is not something that can just be downloaded and installed. It’s a lot of high touch with the whole application layer and the infrastructure layer. So yeah, in working with cPrime, it’s something that we bring to the table if you’re interested in something like this. We’re fairly confident in the scope and level of effort to deliver something like this.
But yeah, I apologize [inaudible 00:50:26] downloading it, I’m just trying to show something that can be done, and if you wanted to do something like this, I can certainly talk to you. And, again, these dashboards are not limited to just the widgets that I’m showing you. If you have very specific use cases, very specific needs, very specific concerns, we can of course create new reports, new data sources, new widgets, new dashboards that are relevant to your organization. So feel free to reach out to cPrime.
Ben: Cool, thanks, Justin, you can keep going.
Justin: Alright, thanks. And just really quick, because we’re running out of time: we’ve done this also for Confluence. You might notice that these parameters are all very similar to what we saw in Jira, and that’s because we have our Confluence application running on the exact same server as the Jira application. Again, request times, more niceties from the database: it shows you how many spaces, pages, comments, and attachments you have in your instance. It’s another one of those things, just like Jira: you add issues and custom fields and the performance starts to degrade. With Confluence, you add more spaces, you add more comments, you add more attachments, and things start to degrade. You can tie that back to your performance over time, to say you’ve had these resources, you’ve had this performance, you’ve had these metrics in your system.
So looking forward, what can you expect to need, hardware-wise or virtual-machine-wise, to satisfy the needs a month down the road, three months down the road, six months down the road? Leaderboards, popular pages this week, if those things are interesting to you. Again, errors by source; it looks like we have some scheduled worker errors. I believe these are pretty common, so I’m not too concerned about those. One thing I do want to share is, with cPrime, we went through our own internal transformation recently. We’ve had Jira and Confluence installed at cPrime for several years, but it wasn’t until January 2017 that we said we’re going to rethink our entire configuration for Jira and Confluence. We’re gonna just burn it all to the ground, start fresh, and really do things in a best-practice way. And you see, for two years we had more and more usage out of it, but then once we changed the way Confluence was configured, the usage really just exploded. It was easier to use, more intuitive for everybody. So just by making some configuration tweaks and changing the way that Confluence is used in the organization, it really exploded in terms of usage. Same, I’m sure, with comments and attachments as well.
So, nice things to know. Bitbucket: again, this is on a different server, so you can see the APM information is a little bit different. Very similar widgets for Bitbucket. Things that are unique to Bitbucket: number of projects and repos. You can also see how many pull requests were made today, how many were merged or denied today, who merged them, who denied them, who created the pull request. Metrics over time: which one of your developers has the most pull requests approved, and the highest percentage of pull requests approved. So, very deep insights into your DevOps processes and your teams, if that’s something that interests you as well. And again, leaderboards, popular repos, that kind of stuff, if that’s what interests you.
So at this point, we’ll jump back into the presentation. I will go back into presentation mode and we’ll talk about how we can help, which I alluded to earlier. cPrime is in a unique position where we are an Atlassian Platinum Solution Partner and we’re also a partner for Splunk. We’re experts in both tools, but tools are only one part of what we provide. Again, between teams, tools, and processes, we try to maximize business value. Again, we’re experts in these tools in particular, and you can see some of our other expertise in the team and process areas.
As I mentioned earlier, this is not something that’s easily downloadable; you can’t just pay us a certain amount of money to download this and install it. It is high touch, high configuration, but if you do decide to partner with us, we have the expertise to build these very efficiently. And at this point, we’ll go into Q&A.
Ben: Awesome.
Justin: So thank you, thank you everybody for your time.
Ben: Yeah, thanks Justin. We’ve got quite a few questions, so we may go one or two minutes over. The first question is, “For comments over time, how is the sequence stored? Is it the Splunk database add-on that does it, or is it something else?”
Justin: For comments over time? [inaudible 00:55:17] over time, that kind of thing, I believe that is something that’s in Confluence’s database, using Splunk DB Connect.
Ben: Cool. And then Haddon asks, what is the biggest lesson learned when integrating Jira and Confluence with Splunk?
Justin: The biggest lesson learned is that, you know, I remember talking to Atlassian probably six months ago, and they [inaudible 00:55:42] they use Splunk internally and they use it for a lot of very unique use cases, and that everything they needed they got through their log files. And I found that not to be true. I thought it was going to be really straightforward to use Atlassian’s out-of-the-box log files and get a whole lot of insight out of that, everything that I wanted to know about Jira, and that wasn’t true. It required more [inaudible 00:56:07], it required more server configuration in terms of adding a web application APM tool on the server that Jira was also running on, including Splunk DB Connect, and changing configuration files to make the JVM garbage collection logs verbose. That was the biggest thing that I learned: in terms of integrating Atlassian tools with Splunk, it couldn’t be done out of the box. There were a lot of behind-the-scenes integrations and configuration changes that needed to be made to enable all this.