Log Ingestion
By Dave Hoff
Designing and setting up a new SIEM can be intimidating. While software vendors often provide support for cluster setup and sizing, designing your log ingestion is usually left up to you. It’s easy to feel lost in the wilderness without a map, especially if you’re trying to bring in data from less common sources. Let’s walk through a high-level plan of attack.
The first and most important aspect of log ingestion is choosing what you want to log. The most common mistake I see from companies designing their own SIEM is trying to log everything from everywhere. While there are benefits to having every log from every service you use in a single location, the cost to do so is high in both setup time and hardware/cloud expenses. Is keeping firewall logs in your SIEM valuable enough to justify reducing overall retention time to stay within budget? Your chances of success are much higher when you start by determining which log sources are most likely to contain useful, actionable information and focusing your efforts on those sources.

Additionally, many individual log sources can be configured to reduce the amount of data they send. Whether that’s done by raising the severity threshold (choosing not to ship “debug”- or “info”-level logs) or by excluding specific event IDs (common with Windows event logs), trimming down the amount of data sent by each service can have a large impact on your SIEM’s performance and usability.
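As a rough sketch of what that trimming looks like (in practice you’d usually configure it in the log shipper itself rather than in code), here’s the logic in Python. The field names (“severity”, “event_id”), the severity ranking, and the excluded event IDs below are illustrative assumptions, not values from any particular product:

```python
# Sketch: decide client-side whether a record is worth shipping to the SIEM.
# Field names and the noisy-event-ID list are hypothetical examples.
SEVERITY_RANK = {"debug": 0, "info": 1, "warning": 2, "error": 3, "critical": 4}
MIN_SEVERITY = "warning"                 # drop "debug"- and "info"-level logs
NOISY_EVENT_IDS = {4658, 5156}           # e.g., handle-closed / connection-permitted

def should_ship(record: dict) -> bool:
    """Return True only for records that clear the severity and event-ID filters."""
    severity = record.get("severity", "info")
    if SEVERITY_RANK.get(severity, 1) < SEVERITY_RANK[MIN_SEVERITY]:
        return False
    if record.get("event_id") in NOISY_EVENT_IDS:
        return False
    return True

print(should_ship({"severity": "info"}))                      # False
print(should_ship({"severity": "error", "event_id": 4658}))   # False
print(should_ship({"severity": "error", "event_id": 4625}))   # True
```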
Once you’ve determined which logs are essential, it’s important to plan out how to organize your data. Pick a naming scheme for your tables/indices that allows for future expansion while maintaining your sanity. For smaller organizations, you might be able to name an index with only the source service name, while larger organizations might split up their logs by department first, then by company, and lastly by service (e.g., infra-Amazon-CloudTrail). This step often doesn’t get much thought at the beginning but can be difficult to change down the road.
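As a quick sketch of the larger-organization scheme from the example above (the lowercasing and space handling are my own assumptions, though Elasticsearch does require lowercase index names):

```python
def index_name(department: str, company: str, service: str) -> str:
    """Build an index name like "infra-amazon-cloudtrail" from its parts.

    Elasticsearch requires lowercase index names, so normalize early; the
    department-company-service ordering follows the example in the text.
    """
    parts = (department, company, service)
    return "-".join(p.strip().lower().replace(" ", "_") for p in parts)

print(index_name("Infra", "Amazon", "CloudTrail"))  # infra-amazon-cloudtrail
```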
So, you’ve chosen your sources, have a great naming scheme, and your first log sources are sending data. What’s next? Data normalization. Depending on how well your log source integrates with your SIEM software, your data will either be split into nice, individual fields, or it will all be lumped into one big “message” field that needs to be parsed out. Either way, it’s important to work towards common field names for as much of your data as possible. Take this situation as an example: your Azure AD logs have a field named “azure.user.name” in the format “Domain\Username”, but your Windows logs use the field “winlog.user_name”, which contains only the username, with no domain. Correlating data between the two sources requires your analysts to remember the differences and manually adjust their queries when switching data sources. If you’re using Elasticsearch, you can map your data to Elastic Common Schema fields, which would place the username from both fields above in “user.name” and the domain in “user.domain”. Now you can punch in one query to search all your datasets.
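Here’s a minimal sketch of that mapping in Python. In a real Elastic deployment this logic would more likely live in an ingest pipeline or Logstash filter; the raw field names come straight from the example above:

```python
def normalize_user(record: dict) -> dict:
    """Map source-specific user fields onto common user.name / user.domain fields."""
    if "azure.user.name" in record:
        # Azure AD format: "Domain\Username"
        domain, sep, username = record["azure.user.name"].partition("\\")
        if sep:
            record["user.domain"] = domain
            record["user.name"] = username
        else:  # no backslash present; treat the whole value as a username
            record["user.name"] = domain
    elif "winlog.user_name" in record:
        # Windows logs carry only the username, no domain
        record["user.name"] = record["winlog.user_name"]
    return record

print(normalize_user({"azure.user.name": "CORP\\dhoff"}))
print(normalize_user({"winlog.user_name": "dhoff"}))
# Both records can now be matched with a single query on user.name.
```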
Lastly, think about how you will be using the data you’ve ingested. Look for common, repetitive actions your analysts take when searching the data. Do your Windows logs have the full path of an executable that’s been run, while your analysts often use a leading wildcard to search for an executable name (*lsass.exe)? Do they frequently search for logs with an RFC 1918 source IP to find outgoing connections? These situations can be improved with a concept known as schema-on-write. Instead of running these complicated queries at search time (schema-on-read), it’s often more efficient to split the executable path during ingest, creating an “executable.name” field that can be searched without wildcards. You could also automatically assign a direction (outbound vs. inbound) to your network logs, making inefficient IP-range queries unnecessary. Changes like these not only make your analysts’ lives easier, they can also have a dramatic impact on search performance, especially when they reduce wildcard usage.
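A sketch of both enrichments is below. The input field names (process.executable, source.ip) are borrowed from Elastic Common Schema for illustration, and note that Python’s is_private check covers RFC 1918 plus a few other reserved ranges, which is close enough for a sketch:

```python
import ipaddress
from pathlib import PureWindowsPath

def enrich(record: dict) -> dict:
    """Schema-on-write: precompute searchable fields once, at ingest time."""
    # "C:\Windows\System32\lsass.exe" -> executable.name = "lsass.exe",
    # so analysts can search the name without a leading wildcard.
    path = record.get("process.executable")
    if path:
        record["executable.name"] = PureWindowsPath(path).name

    # Tag direction once at ingest instead of querying IP ranges at search time.
    # A private (RFC 1918-style) source address means the traffic left from inside.
    src = record.get("source.ip")
    if src:
        outbound = ipaddress.ip_address(src).is_private
        record["network.direction"] = "outbound" if outbound else "inbound"
    return record

print(enrich({"process.executable": r"C:\Windows\System32\lsass.exe",
              "source.ip": "10.1.2.3"}))
```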
Log ingestion is a technical, complicated process, and it’s easy to get lost in the weeds while writing parsers and configuring log shippers. Sticking closely to a detailed plan will greatly improve usability and organization, and will ultimately save time and money over the course of the project.
Behind-the-Zines
Jason asked our designer why there was a picture of logs on this one.