
Thursday, April 30, 2015

Planning network element patches

Every now and then, security engineers need to get the network team to patch vulnerabilities in network gear, the same way sysadmins patch their servers as soon as fixes come out.
With servers this is fairly easy, as there are almost no restrictions (besides compatibility), but network elements like routers or switches have limited resources, and updates are not broken down into smaller components: they are usually one big image containing everything from the kernel to all the "applications", processes and supporting utilities.
In this post, I will try to describe the process of planning such an upgrade.

Inventory phase

The first step is to find out what needs to be upgraded, as this information drives both the software selection and the upgrade process in the later phases.
The following information needs to be collected about each router/switch/firewall/etc.:
  • Hardware type and version (not just the type printed on the chassis, but also slot/port count information, for example "Cisco Nexus 56128P Switch" or "Catalyst 2950T 24 Switch")
  • Memory information (RAM as well as storage or flash memory size is important)
  • Management IP (or the IP address through which the software will be uploaded, as PCMCIA or console modem options are usually not very fast)
Memory information is needed to find out whether the software can run in the RAM the hardware provides and whether the image fits into the available storage. Some devices have a single memory pool split between the two purposes (like Cisco 7200 routers), while others have dedicated storage memory and RAM.
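As a minimal sketch of how the collected data could be kept (Python, with illustrative field names of my own rather than any vendor format), one record per device is enough to drive the later phases:

```python
# Minimal inventory record sketch; field names are illustrative, not a vendor format.
from dataclasses import dataclass, asdict
import csv

@dataclass
class NetworkElement:
    hostname: str
    hardware: str        # full model, e.g. "Cisco Nexus 56128P Switch"
    ram_mb: int          # available RAM
    flash_mb: int        # storage/flash size that has to fit the new image
    mgmt_ip: str         # address used to upload the new image

def write_inventory(elements, path="inventory.csv"):
    """Dump the collected inventory to CSV so it can be reviewed per device."""
    fields = ["hostname", "hardware", "ram_mb", "flash_mb", "mgmt_ip"]
    with open(path, "w", newline="") as f:
        writer = csv.DictWriter(f, fieldnames=fields)
        writer.writeheader()
        for element in elements:
            writer.writerow(asdict(element))

# Example record; the values would normally come from "show version" output or SNMP.
devices = [NetworkElement("edge-sw-01", "Catalyst 2950T 24 Switch", 64, 16, "10.0.0.11")]
write_inventory(devices)
```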

Software selection phase

With all the information from the previous phase, we can move on to finding the appropriate software version that each device supports and that contains the needed fixes.
Each software vendor has at least one web tool that provides the needed information (or even the software download itself).
Sometimes vendors link to the fixed release directly from security advisories or notifications, but that is not always the case, so the safest way to get the software and the information about it is via the download pages.

Some vendors make it easy to select the latest version, while others have a set of sub-versions indicating feature upgrades or just patches; standard or extended support; early deployment or limited deployment; etc. Each vendor has a document describing what each part of the version string means, and it can differ between product series.

Besides the choice of software to download, there are also release notes or a readme document for each version, where the vendor describes:
  • how to perform the upgrade
  • what the prerequisites are (which platforms and current software versions are compatible)
  • what new features are introduced and which old ones are removed
  • what issues/bugs/problems were resolved in that software version
  • what caveats were identified in this version
If the current version is very old (one or several major releases behind), several intermediate upgrades might be needed to ensure that the configuration is properly translated to new syntax or new features. This should be described in the prerequisites; to keep the upgrade trouble-free, this phase has to be repeated for each intermediate version that needs to be installed before the latest one can be applied.

With constant change and improvement in the network field, features come and go, so it's necessary to watch out for removal or modification of features in use (a default deny could change to allow any, or statically configured IPsec local networks might be auto-negotiated in the newer version).

The list of resolved bugs is a good source for identifying whether the new version fixes the vulnerabilities currently circulating in the wild. This can also help close overdue vulnerability management tickets or anomaly reports.

And the caveats are known problems identified during the vendor's testing of the new version. When the local conditions are similar to those described in a caveat, this might put a stop to installing that version (or to the upgrade altogether).

Software validation

Even with all the information collected in the previous phases, only very brave people would install the software straight into production.
A lot of companies have labs where new versions can be tested before they go to production; in larger data centers there may also be canary elements where this testing can be done.
The goal of validation should be to ensure that:

  • the current configuration syntax is accepted by the new version
  • all required features will work as expected (with the same licences)
  • redundancy mechanisms will still work (no timer or protocol defaults changed)
  • monitoring functions get the same data format as before (no SNMP OID, syslog message format or API changes)
  • the migration/upgrade plan is not going to cause an impact (some systems require the same version on all clustered elements to work)

Whether all this is automated or done manually by a verification team with defined validation test cases is up to each company to decide, but what most IT managers would not like is a total outage of the core network after a software upgrade of a central router or switch.
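As a rough idea of what the automated path could look like (the command output file names below are assumptions, not any specific vendor tool), simply comparing outputs captured before and after the upgrade already catches many of the points above:

```python
# Sketch of an automated post-upgrade check: compare baseline command output
# captured before the upgrade with the output captured after it.
# The file names in CHECKS are placeholders for captured "show" command output.
import difflib
from pathlib import Path

CHECKS = ["show_running-config.txt", "show_ip_route_summary.txt", "show_snmp_oids.txt"]

def compare_outputs(before_dir: str, after_dir: str) -> bool:
    ok = True
    for name in CHECKS:
        before = Path(before_dir, name).read_text().splitlines()
        after = Path(after_dir, name).read_text().splitlines()
        diff = list(difflib.unified_diff(before, after, lineterm=""))
        if diff:
            ok = False
            print(f"[DIFF] {name}:")
            print("\n".join(diff[:20]))   # show only the first lines of each diff
    return ok

if __name__ == "__main__":
    if compare_outputs("baseline", "post_upgrade"):
        print("All captured outputs match the baseline.")
```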

And let's not forget to verify the hash of the downloaded software (if the vendor offers it on the download website), as network elements are a prime target for man-in-the-middle attacks.
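A minimal sketch of such a verification, assuming the vendor publishes a SHA-512 checksum next to the image (the file name and hash are passed on the command line):

```python
# Verify a downloaded image against the checksum published by the vendor.
import hashlib
import sys

def sha512_of(path: str, chunk_size: int = 1 << 20) -> str:
    """Stream the file in chunks so even multi-hundred-MB images fit in memory."""
    h = hashlib.sha512()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()

if __name__ == "__main__":
    image, expected = sys.argv[1], sys.argv[2].lower()
    actual = sha512_of(image)
    print("OK" if actual == expected else f"MISMATCH: {actual}")
```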

If you know of anything else I missed, let me know and I'll update the post.

Tuesday, April 28, 2015

Event management solution scaling - Practical example

As described in the previous blog post, every piece of software, every server and every appliance has its limits.
Scaling beyond these limits means an engineer has to build something that can cope with the load.
In theory one could adjust an open-source solution and live happily ever after, but in the real world one has to deal with proprietary software or appliances, and it's not easy to just migrate or replace them.

For such a scenario, I've developed a small program called NFF that forwards incoming traffic to several configured destinations. Currently it is built to listen on one port and forward to several destinations, but with a different configuration file it can run for several services (e.g. syslog, SNMP traps, NetFlow).


Note: the current version only forwards the flows, but later on, once protocol decoding is implemented, it will also be able to forward flows to specific destinations based on rules.
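The NFF source itself is not shown here, but as a rough sketch of the fan-out idea (not the actual NFF code), a UDP listener that duplicates every received datagram to a list of destinations could look like this:

```python
# Rough fan-out sketch, not the actual NFF implementation: listen on one UDP
# port and duplicate every datagram to all configured destinations.
import socket

LISTEN = ("0.0.0.0", 514)                                 # e.g. syslog
DESTINATIONS = [("10.0.0.21", 514), ("10.0.0.22", 514)]   # placeholder collectors

def forward_loop():
    recv = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    recv.bind(LISTEN)
    send = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    while True:
        data, _src = recv.recvfrom(65535)
        for dst in DESTINATIONS:
            send.sendto(data, dst)      # best-effort copy to each receiver

if __name__ == "__main__":
    forward_loop()
```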

Integration would be done by installing this program on the IP address that all systems already send their logs/NetFlow/data to, while the appliance or software analyzing the data would move to a new IP address.

In case management decides to buy a bigger box or choose a different supplier, the new box can simply be added to the distribution list during the trial period to see if it fulfils the needs and expectations.



As I don't have a job where I could test this idea at scale, I hope some of you will provide feedback on how well it performs. I already have several ideas on how to make it faster.

Monday, December 29, 2014

Log solution scaling

After the interesting overview of firewall scaling, let's have a look at how (security) event logging can be built.
The primary value for evaluating log server performance is events per second (EPS), that is, how many events a solution can receive and process. This value of course depends on the hardware of the log server and the applications running on top of it. As a rough illustration, 500 devices each emitting around 20 events per second already add up to about 10,000 EPS that a single collector has to sustain.
The secondary value is the number of queries that can be run over the log data. This of course depends heavily on how the log data is stored, the complexity of the query and how much data the query has to process.

Single log-server

This is a common solution in many places: to satisfy security policy requirements, a security appliance or a server is installed to perform log collection and analysis.


This solution has a few limitations when it comes to scalability: the amount of logs it can collect is limited by its hardware resources, and in order to collect the logs it needs connectivity to each element of the whole environment.
Another disadvantage is that any analysis query takes resources away from collection, so if resources are not sized properly, events might be missed.

Log-server chaining

After realizing that the standard log-server solution has performance problems, many companies buy another server and split the event logging between them to lower the load. Of course log analysis then gets quite difficult, so yet another server is purchased to process only the interesting events that the collection nodes pass on, as sketched below.
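As a rough sketch of that "pass on only the interesting events" step (the match patterns and addresses are invented for illustration), a collection node could filter what it relays upstream like this:

```python
# Sketch of a filtering relay in the chain: receive syslog over UDP and forward
# only matching lines to the analysis server higher in the tree.
import re
import socket

UPSTREAM = ("10.0.1.10", 514)                     # placeholder analysis server
INTERESTING = re.compile(rb"%(SEC|AUTH|LINK-3)-", re.IGNORECASE)

def relay(listen_port=514):
    sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    sock.bind(("0.0.0.0", listen_port))
    out = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    while True:
        msg, _src = sock.recvfrom(65535)
        if INTERESTING.search(msg):
            out.sendto(msg, UPSTREAM)   # only "interesting" events travel upward

if __name__ == "__main__":
    relay()
```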




The actual structure of the tree depends on business needs, and there can be several servers with "analyze" components if there are many queries but not that many events.


This solution provides separation of the collection and analysis functions, so resources are not shared and therefore loss of events is less likely.
There are, however, other challenges here: elements have to be assigned to specific collection nodes, so it is necessary to know how many events each element generates and how many events one collection node can process and forward to the log server.
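A toy sketch of that assignment problem, with an assumed per-node events/sec budget (the numbers are purely illustrative):

```python
# Toy assignment sketch: place devices onto collection nodes without exceeding
# an assumed per-node events/sec budget.
NODE_CAPACITY_EPS = 5000   # events/sec one collection node is assumed to handle

def assign(devices):
    """devices: list of (name, estimated_eps); returns {node_index: [names]}."""
    nodes, loads = {}, []
    for name, eps in sorted(devices, key=lambda d: d[1], reverse=True):
        # first-fit: put the device on the first node that still has headroom
        for i, load in enumerate(loads):
            if load + eps <= NODE_CAPACITY_EPS:
                loads[i] += eps
                nodes[i].append(name)
                break
        else:
            loads.append(eps)
            nodes[len(loads) - 1] = [name]
    return nodes

print(assign([("fw-01", 3000), ("core-sw", 2500), ("edge-rtr", 1200)]))
```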

Big(er)-Data logging solutions 

While aggregating and pre-filtering solutions do the job (at least for alerting when something happens), more detailed digging into the logs requires something more flexible, with access to all the log data. For this, it is necessary to consider distributed storage and parallel processing: since not all data is stored on one node, queries have to run on several nodes in parallel and the results then need to be aggregated (correlation might be a bit problematic, though).
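A conceptual sketch of that query fan-out and aggregation, with in-memory lists standing in for the storage nodes (no real distributed storage engine is used here):

```python
# Map/reduce style sketch: each "storage node" answers the same query over its
# own slice of the data, then the partial results are merged.
from collections import Counter
from concurrent.futures import ThreadPoolExecutor

def node_query(events, source):
    """'Map' step run on one storage node: count events from a given source."""
    return Counter(e["host"] for e in events if e["host"] == source)

def distributed_count(nodes, source):
    with ThreadPoolExecutor() as pool:
        partials = pool.map(lambda ev: node_query(ev, source), nodes)
    total = Counter()
    for p in partials:                  # 'reduce' step: merge partial counters
        total.update(p)
    return total[source]

node_a = [{"host": "fw-01"}, {"host": "sw-02"}]
node_b = [{"host": "fw-01"}]
print(distributed_count([node_a, node_b], "fw-01"))   # -> 2
```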





Possibly the picture is a bit misleading, as there are 3 functions here:

  • Data input (converting syslog or other events into a standard format for storage)
  • Data storage (the distributed event storage system)
  • Data output (executing queries on the data and providing results)
Data storage is no longer a simple write into a file; it is a more complex distribution of the event data across several machines, not just for redundancy or speed of access, but also for the ability to execute analysis requests on each of them.
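As an illustration of the data input function, a minimal normalizer that turns a classic BSD-style syslog line into a generic event structure might look like this (the regex and field names are simplifications, not a complete parser):

```python
# Minimal "data input" sketch: turn a raw syslog line into a generic event
# structure before it is handed to the storage layer.
import json
import re

SYSLOG_RE = re.compile(r"^<(?P<pri>\d+)>(?P<ts>\w{3}\s+\d+ \d\d:\d\d:\d\d) "
                       r"(?P<host>\S+) (?P<msg>.*)$")

def normalize(line: str) -> dict:
    m = SYSLOG_RE.match(line)
    if not m:
        return {"host": "unknown", "msg": line}   # keep unparsable lines as-is
    event = m.groupdict()
    event["severity"] = int(event.pop("pri")) % 8  # severity is the low 3 bits of PRI
    return event

print(json.dumps(normalize("<190>Apr 30 12:00:01 fw-01 %ASA-6-302013: Built connection")))
```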

Of course, big-data solutions have to be tailored to provide meaningful results, so they also require aggregation or correlation functions, as well as the knowledge to build queries for the information needed.
And that calls for software developer and operations engineer roles to work together in a much faster and more effective manner than today, in order to provide the right information at the time it is needed.
Besides that, the challenges of the log-server chaining model still remain, as appliances and proprietary elements can only produce logs in a client-server fashion (e.g. the syslog protocol) and cannot distribute the load across many collection nodes.

Future of logging

Predicting the development of the entire industry is difficult even for industry analysts, but let's put my 2 cents on the table and describe what I would like to see.
With the increased popularity of the cloud, hardware resources are more available and more flexible when it comes to re-allocation. With the separation of functions in log collection and analysis, it is now possible to distribute the load and collect/process more events at the same time.
In order to scale even better, more granular separation might bring better results. For this, container systems like LXC or Docker come in handy, as you can spawn many processes and distribute them across various platforms as needed. There can even be specific software written for each query or report, so that it runs only when it is needed or when a specific type of event occurs.
All this can be compared to a neural network, where specific neurons fire when signals of sufficient strength are present on their dendrites.
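To make the "runs only when a specific type of event occurs" idea concrete, here is a toy trigger sketch (the event types and thresholds are invented for the example; in a container setup the triggered function would launch a short-lived worker container):

```python
# Toy illustration of the "neuron" idea: a report worker is started only when
# enough events of a specific type have been seen.
from collections import defaultdict

THRESHOLD = {"auth_failure": 5}     # fire the report after 5 matching events

counts = defaultdict(int)

def spawn_report_worker(event_type):
    # In a container setup this is where a short-lived container would be
    # launched; here it is just a placeholder print.
    print(f"report worker triggered for {event_type}")

def on_event(event_type):
    counts[event_type] += 1
    if counts[event_type] >= THRESHOLD.get(event_type, float("inf")):
        counts[event_type] = 0
        spawn_report_worker(event_type)

for _ in range(5):
    on_event("auth_failure")
```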



With collectors (red dots) dedicated to specific types of devices, the conversion to a generic event structure is much easier to implement and maintain in operations.
Storage nodes (blue dots) form a system of their own, synchronizing data between themselves as needed, and pre-filtering or processing requests (green dots) can happen on each storage component over the data that is available there.
In the output layer (orange dots), all the relevant data is then collected to produce the specific report that is needed, when it is needed.

The major challenge here would be to build the signalling or data passing between the containers without overloading the network or storage I/O, and also to train the network to forward only relevant data to each of the output nodes.
But with the flexibility of small containers it is possible to spawn and run as many nodes, layers and output nodes as needed, so this could potentially grow with the cloud while keeping a small enough footprint to be quite effective.