Monday, December 29, 2014

Log solution scaling

After interesting overview of firewall scaling, let's have a look at how (security) event logging can be built.
Primary value for evaluating log server performance is events/sec, that is how many events can a solution receive and process. This value of course depends on the hardware of the log server and applications running on top of the log server.
Secondary value is the amount of queries that can be done over the log data. Of course it very much depends on how the log data is stored, complexity of the query and how much data is to be processed by the query.

Single log-server

This is a common solution in many places, where to satisfy the security policy requirements a security appliance or a server is installed to perform log collection and analysis.

This solution has a few limitations, when it comes to scalability, as the amount of logs it can collect is limited by the hardware resources and also in order to collect the logs, it has to have connectivity to each of the elements of the whole environment.
Another disadvantage is that any analysis query takes away the resources from collection, so if resources are not sized properly events might be missed.

Log-server chaining

After realizing that standard log-server solution has problems with performance, many companies buy another server and split the event logging to lower the load. Of course then it gets quite difficult to perform log analysis, so another server is purchased to process only interesting events that the previous nodes pass on.

Actually structure of the tree can depend on business needs and there can be several servers with "analyze" components if there are many queries but not that many events.

This solution provides separation of the collection and analysis functions, so resources are not shared and therefore loss of events is less likely.
There are however other challenges here, as elements have to be assigned to specific collection node, so it is necessary to know how many events elements generate and how many events one collection node can process and forward to the log server.

Big(er)-Data logging solutions 

While aggregating and pre-filtering solutions do the job (at least for alerting when something happens), to be able to do more detailed digging into the logs, something more flexible with access to all the log data is needed. In order to do this, it is necessary to consider distributed storage and parallel processing. With not all data being stored on 1 node, queries have to be run on several nodes in parallel and then the results need to be aggregated (correlation might be a bit problematic though).

Possibly the picture is a bit misleading, as there are 3 functions here:

  • Data input (converting syslog or other events into standard format for storage)
  • Data storage (distributed event storage system )
  • Data output  (executing queries on the data and providing results)
Data storage is no longer just a simple write into a file, it is a more complex distribution of the event data to several machines not just for redundancy or speed of access, but also for the ability to execute analysis requests on each of them.

Of course big-data solutions have to be tailored to provide meaningful results, so the solutions require also aggregation or correlation functions as well as knowledge to build queries for information needed.
And that calls for a software programmer and operations engineer roles to work together in much faster and more effective manner than now in order to provide the right information at the time when it is needed.
Besides that, the challenges of the log-server chaining model still remain as collection of appliances and proprietary elements can only produce logs in client-server fashion (e.g. syslog protocol) and won't be able to distribute the load to many collection nodes .

Future of logging

Predicting the development of the entire industry is difficult even for industry analysts, but let's put my 2 cents on the table and describe what I would like to see.
With the increased popularity of the cloud, all hardware resources are more available and more flexible when it comes to re-allocation. With the separation of the functions in log collection and analysis, it is now possible to distribute the load and collect/process more events at the same time.
In order to scale it even better, more granular separation might bring better results. For this, container systems like LXC or Docker come in handy, as you can spawn many processes and distribute them on various platforms as needed. There can be even specific software for each query or report written, so that it runs only when it is needed or when a specific type of events occurs.
This all can be compared to a neural network, where specific neurons get triggered when there are signals with certain strength present on its dendrites.

With collectors (red dots) being specific types of devices, conversion to a generic event structure is much easier to implement and maintain in operations.
Storage (blue dots) are a system of its own, where they synchronize data between themselves as needed and pre-filtering or processing requests (green dots) can happen on each storage components on the data that is available there.
In the output layer (orange dots) all the relevant data is then collected and produces a specific report that is needed and when it is needed.

Major challenge here would be to build a signalling or passing of data between each container without overloading the network or storage I/O. Also to train the network to forward only relevant data to each of the output nodes.
But with the flexibility of small containers it is possible to spawn and run as many nodes and have more layers and various output nodes that this could potentially grow with the cloud and have small enough footprint to make it quite effective.