Root cause analysis (RCA) is probably the most critical function in security outside of maintaining a secure and compliant network. Anytime a security event has been detected or even perceived, you are tasked with discovering the origin or cause. Root cause analysis can be a daunting task, so I would like to pass along some best practices that got me through those stressful root cause analysis sessions.
Sometimes less is more. When looking into your log management or SIEM technology, it is sometimes better to start with a single keyword, IP address, or username for a specific time frame rather than a highly detailed search. The results of a simple search may provide insight into activity from other devices, which can help you backtrack and find out where the event originated. Pay attention to large numbers of certain events like errors, access failures, file activity, and change events that contain the same username, IP address, or both. Also, look for changes by the same user or source IP across multiple systems.
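As a rough illustration of the "start simple" approach, here is a minimal Python sketch. It assumes your log events have already been exported and parsed into dictionaries; the field names (time, host, user, src_ip, event) and the sample records are hypothetical, not the format of any particular SIEM.

```python
from collections import Counter
from datetime import datetime

# Hypothetical, already-parsed log records; the field names are assumptions.
SAMPLE_EVENTS = [
    {"time": "2024-05-01T09:58:00", "host": "web01", "user": "jdoe", "src_ip": "10.0.0.5", "event": "login_failure"},
    {"time": "2024-05-01T10:01:00", "host": "web01", "user": "jdoe", "src_ip": "10.0.0.5", "event": "login_failure"},
    {"time": "2024-05-01T10:02:00", "host": "db01", "user": "jdoe", "src_ip": "10.0.0.5", "event": "config_change"},
    {"time": "2024-05-01T10:03:00", "host": "web01", "user": "asmith", "src_ip": "10.0.0.9", "event": "login_success"},
    {"time": "2024-05-01T10:30:00", "host": "db01", "user": "jdoe", "src_ip": "10.0.0.5", "event": "file_delete"},
]


def simple_search(events, term, start, end):
    """Return events inside [start, end] where any field contains `term`.

    The match is a deliberately broad, case-insensitive substring match,
    so one keyword, IP address, or username is enough to get started.
    """
    term = term.lower()
    hits = []
    for e in events:
        ts = datetime.fromisoformat(e["time"])
        if start <= ts <= end and any(term in str(v).lower() for v in e.values()):
            hits.append(e)
    return hits


def summarize(hits):
    """Count hits per (user, src_ip, event) and list the hosts each user touched."""
    counts = Counter((e["user"], e["src_ip"], e["event"]) for e in hits)
    hosts = {}
    for e in hits:
        hosts.setdefault(e["user"], set()).add(e["host"])
    return counts, hosts


if __name__ == "__main__":
    window_start = datetime(2024, 5, 1, 10, 0)
    window_end = datetime(2024, 5, 1, 10, 15)
    hits = simple_search(SAMPLE_EVENTS, "jdoe", window_start, window_end)
    counts, hosts = summarize(hits)
    for key, n in counts.most_common():
        print(n, key)
    for user, seen in hosts.items():
        print(user, "was active on", sorted(seen))
```

A user who shows up on more than one host in the summary is a natural place to start backtracking.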
Visualize your data. Visualizing data is a way to identify trends in the information flow. A large spike of a single event type or a sustained volume of events over a period of time usually signifies an anomaly. Visualization can also help you quickly identify a time frame to start your investigation. Finally, if you are using a SIEM or log management product, you can typically build dashboards based on various criteria like network traffic, authentication, file, and change events.
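If you do not have a dashboard handy, even a crude text chart can surface a spike. The sketch below is one possible approach, not a feature of any specific product: it groups ISO-formatted timestamps into fixed-width buckets and flags any bucket whose count exceeds a multiple of the median. The bucket width and the factor of 3 are arbitrary assumptions you would tune for your environment.

```python
from collections import Counter
from datetime import datetime
from statistics import median


def bucket_counts(timestamps, minutes=5):
    """Count events per fixed-width time bucket.

    `minutes` should divide 60 evenly so buckets line up with the hour.
    Buckets with zero events are not represented, which is a simplification.
    """
    counts = Counter()
    for ts in timestamps:
        t = datetime.fromisoformat(ts)
        bucket = t.replace(minute=t.minute - t.minute % minutes, second=0, microsecond=0)
        counts[bucket] += 1
    return counts


def find_spikes(counts, factor=3.0):
    """Return the buckets whose count is more than `factor` times the median count."""
    if not counts:
        return []
    baseline = median(counts.values())
    return sorted(b for b, n in counts.items() if n > factor * baseline)


def text_chart(counts):
    """Render one line per bucket, with one '#' per event."""
    return "\n".join(f"{b:%H:%M} {'#' * counts[b]}" for b in sorted(counts))


if __name__ == "__main__":
    # Hypothetical timestamps: a quiet baseline with a burst at 10:10.
    stamps = ["2024-05-01T10:01:00", "2024-05-01T10:03:00",
              "2024-05-01T10:06:00", "2024-05-01T10:08:00"]
    stamps += [f"2024-05-01T10:1{i % 5}:00" for i in range(12)]
    stamps += ["2024-05-01T10:16:00", "2024-05-01T10:18:00"]
    counts = bucket_counts(stamps)
    print(text_chart(counts))
    print("Start investigating at:", find_spikes(counts))
```

The first flagged bucket gives you a time frame to feed back into a simple search.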
Correlate data from different devices to identify security events. Correlation of log data across different devices, systems, and applications adds another layer of security monitoring and may reveal security issues that slipped under the radar. For example, a spike in outbound email traffic that does not originate from your internal email server is a good indication of malware. Advanced Persistent Threats (APTs) can be hard to detect. However, when investigating logs, you can look for unexpected software installs correlated with outbound FTP traffic in your firewall logs within the same time frame as an attack.
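Both correlations mentioned above can be sketched in a few lines of Python. This assumes firewall and software-install events have already been parsed into dictionaries; the field names, the mail server address, the threshold of 3 connections, and the 30-minute window are all hypothetical values chosen for illustration.

```python
from datetime import datetime, timedelta

MAIL_SERVER_IP = "10.0.0.25"  # hypothetical address of the internal mail server


def rogue_smtp_sources(firewall_events, mail_server_ip=MAIL_SERVER_IP, threshold=3):
    """Return hosts other than the mail server with at least `threshold`
    outbound SMTP (port 25) connections, mapped to their connection counts."""
    counts = {}
    for e in firewall_events:
        if (e["direction"] == "outbound" and e["dst_port"] == 25
                and e["src_ip"] != mail_server_ip):
            counts[e["src_ip"]] = counts.get(e["src_ip"], 0) + 1
    return {ip: n for ip, n in counts.items() if n >= threshold}


def installs_near_ftp(install_events, firewall_events, window=timedelta(minutes=30)):
    """Pair each software install with outbound FTP (port 21) traffic
    from the same host that occurred within `window` of the install."""
    pairs = []
    for inst in install_events:
        t_inst = datetime.fromisoformat(inst["time"])
        for fw in firewall_events:
            if (fw["direction"] != "outbound" or fw["dst_port"] != 21
                    or fw["src_ip"] != inst["host_ip"]):
                continue
            t_fw = datetime.fromisoformat(fw["time"])
            if abs(t_fw - t_inst) <= window:
                pairs.append((inst["host_ip"], inst["package"], fw["dst_ip"]))
    return pairs
```

Neither result proves a compromise on its own; each simply narrows down which hosts and time frames deserve a closer look.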
Establish templates to support incident response. RCA and incident response are different functions, but they rely on each other. Determine what a response team, legal, or your leadership will require from the data (e.g., IP address, port, username), then create Standard Operating Procedures (SOPs) and templates for general and specific situations. Regardless of who is present when an event occurs, everyone needs to be clear about what information is critical and whom to contact.
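A template does not have to be elaborate. As one possible sketch, the Python dataclass below lists the fields a response team might ask for and reports which ones are still blank. The field list is an assumption based on the examples above (IP address, port, username, and a contact); your own SOPs would define the real set.

```python
from dataclasses import dataclass, fields
from typing import List, Optional


@dataclass
class IncidentTemplate:
    """Hypothetical minimum set of details to hand to the response team."""
    source_ip: Optional[str] = None
    port: Optional[int] = None
    username: Optional[str] = None
    contact: Optional[str] = None  # who to notify for this type of event

    def missing(self) -> List[str]:
        """Return the names of fields that have not been filled in yet."""
        return [f.name for f in fields(self) if getattr(self, f.name) is None]


if __name__ == "__main__":
    report = IncidentTemplate(source_ip="10.0.0.5", username="jdoe")
    print("Still needed:", report.missing())
```

Checking for missing fields before escalating helps keep the handoff consistent no matter who is on shift.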
Beyond the suggestions above, the most valuable lesson I have learned over the years is that a central location for log data can save a huge amount of time. Even if it is just a syslog server, you can use native search tools to dig through logs, and it’s much quicker than addressing each system individually.
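To show how little tooling a central log location requires, here is a minimal Python sketch that searches every log file under one directory for a term, much like a recursive grep. The directory layout and the "*.log" file pattern are assumptions; adjust them to however your syslog server stores its files.

```python
from pathlib import Path


def search_lines(source, lines, term):
    """Return (source, line_number, text) for each line containing `term`,
    ignoring case."""
    term = term.lower()
    return [(source, n, line.rstrip("\n"))
            for n, line in enumerate(lines, start=1)
            if term in line.lower()]


def search_dir(log_dir, term, pattern="*.log"):
    """Search every file matching `pattern` under `log_dir`, recursively."""
    hits = []
    for path in sorted(Path(log_dir).rglob(pattern)):
        # errors="replace" keeps the search going if a file has odd bytes.
        with path.open(errors="replace") as fh:
            hits.extend(search_lines(str(path), fh, term))
    return hits


if __name__ == "__main__":
    # "/var/log/remote" is a hypothetical central log directory.
    for source, number, text in search_dir("/var/log/remote", "10.0.0.5"):
        print(f"{source}:{number}: {text}")
```

One pass over a single directory replaces logging in to each system in turn.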
Hopefully, you find some of these suggestions helpful. Sometimes in the heat of the moment it is best to stop and go back to the basics. This isn’t a comprehensive list so please pass along any tools, tips, tricks, and experiences that have helped you become more efficient in responding to events.
This post is part of our Best Practices for Root Cause Analysis Series. For more best practices, check out the index post here: Best Practices for Root Cause Analysis