Wednesday, September 28, 2016

Network Segmentation - Access Filters

Creating access filters in live networks

Every vendor or consultant presents fantastic solutions that would work in the ideal environments they imagine every customer has.
The truth is, only newly designed networks look like that; the rest are more heterogeneous, with many exceptions and special fringe cases.
And since it is common for network engineers not to talk to the application people, the lack of knowledge about the traffic flows in their networks has serious implications.
With inexperienced architects or engineers, this usually ends in an even less standard environment, or in the worst case a dysfunctional network.

Another issue is that re-routing traffic via filtering devices often requires a network re-design, and implementing such a change can itself cause downtime or outages.

So in this post, I'll share a technique for creating access lists for network separation without causing major interruptions or needing downtime for these changes.

There are many aspects (like performance, location, support) that are outside the scope of this post but have to be considered before using filtering in production networks.

The examples below are based on the idea of a L3 Cisco network with several internal VLANs (where the ACLs would be applied), but the method works for L2 networks (applying ACLs on ports or uplinks) as well.

Phase 1: Observation

As the primary problem of most network engineers is not knowing what packets flow through their network, the first step has to be to find that out.
This is accomplished by creating an access list that permits all traffic and logs the hits:

ip access-list extended ThisVLAN100
1000 permit ip any any log

To prevent overloading the log server, it's better to put known traffic flows into the ACL right away. A good example is DNS traffic, which is very common in both server and user networks.

Phase 2: Adaptation

In highly populated networks there will be loads of traffic, and the logs will grow faster than a human can read them. So the goal of this phase is to minimize log growth by inserting ACL entries for the most common traffic patterns:

900 permit ip host <some host> any

This phase can take many iterations, in which ACL entries are adjusted to be more generic or more specific (depending on the security level that is acceptable).
The time spent in this phase also depends on how long it takes for all devices to exhibit all their traffic patterns (e.g. a monthly backup or data upload).
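The adaptation loop can be assisted with a small script. As a sketch, assuming the logs arrive as Cisco's %SEC-6-IPACCESSLOGP syslog messages (the exact message format varies by platform and software version), the following Python aggregates the log hits and proposes candidate ACL entries for the most common flows:

```python
import re
from collections import Counter

# Matches the body of Cisco's %SEC-6-IPACCESSLOGP message
# (assumed format; adjust the pattern for your platform/version)
LOG_RE = re.compile(
    r"list (\S+) (permitted|denied) (\w+) "
    r"(\d+\.\d+\.\d+\.\d+)\((\d+)\) -> (\d+\.\d+\.\d+\.\d+)\((\d+)\)"
)

def candidate_entries(log_lines, top=5):
    """Aggregate ACL log hits into the most common (proto, src, dst, dport) flows."""
    flows = Counter()
    for line in log_lines:
        m = LOG_RE.search(line)
        if not m:
            continue
        _acl, _action, proto, src, _sport, dst, dport = m.groups()
        flows[(proto, src, dst, dport)] += 1
    # One candidate ACL entry per frequent flow, to be numbered below the catch-all
    return [
        f"permit {proto} host {src} host {dst} eq {dport}"
        for (proto, src, dst, dport), _count in flows.most_common(top)
    ]
```

A human still has to review the proposals (and generalize them to subnets or port ranges where appropriate) before they go into the ACL.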

Phase 3: Conclusion

With the ACL populated with the identified traffic flows that are acceptable/expected, the final step is to change the last entry from permit to deny:

no 1000
1000 deny ip any any log

It is always advisable to log the last entry: when some application in the network changes, this provides a good source of information for troubleshooting.
If the Adaptation phase was successful, there should be almost no hits on this rule.

I should mention that in internet-facing networks this last rule would look a bit different, and there would be additional rules for logging anomalous traffic:

999 deny ip <other internal networks> <protected network> log  
1000 permit ip <protected network> any

But I hope you got the picture of how the process works to establish ACL filters in production networks without major impact.

Next steps

Depending on the network architecture, this process can be automated either by monitoring the logs and generating changes to be applied or by using "canary deployment" and replicating the result to the rest of the network.

The knowledge gathered in phase 1 and 2 should also help the network team to understand the traffic patterns better for future design and improvement projects.

Thursday, July 9, 2015

DoS protection solutions

Most companies ignore the fact that their services can go down, and therefore don't even consider protection against DoS attacks.
So in this post I'd like to analyze the potential attack areas and the ideas that exist to resolve them.
Let's exclude attacks that use specific vulnerabilities of a product to deny the service, and focus on brute-force attacks that flood the target with excessive valid service requests.

Statement of the problem

As the attack uses valid service requests, it's hard to distinguish them from normal user requests. This means that whatever action is taken, normal user requests still have to be served in reasonable time.

Secondly, the solution providing the service has to be capable of dealing with normal service requests. This means it has to be elastic enough to scale up if necessary to serve all incoming requests.

The most common bottlenecks (on the side of a customer or enterprise) are:
  1. Server resources - the system(s) providing the service might not have the resources to deal with all user requests (no network bandwidth, not enough memory, over-utilized CPU, or other).
  2. Firewall resources - filtering and inspection, as well as tracking all active flows, need resources too (though not as many as a server).
  3. Internet router - although not very probable, the router might not be able to deal with that many packets per second.
  4. Internet link - the most common problem: a link to the ISP that is not sized to deal with growing service demand leads to degraded response times.
There surely are other potential bottlenecks, but the list above should be considered in every case (independent of the service, location, or team).
Mitigating these bottlenecks is the key to dealing with DoS attacks, and while scaling is always an option, it might not be the most efficient one. The costs also grow exponentially, so at a certain point it becomes important to choose an alternative to scaling.

Potential solutions

Excluding the possibilities of re-engineering the application (providing the service), buying bigger hardware, or upgrading the internet link to deal with distributed DoS attacks (which can reach 10Gb/s or more nowadays), let's see what ideas there are.

There are 3 functions that are part of each solution:

  • Detection (distinguishing what is a valid service request and what not)
  • Protection (blocking invalid service requests without impacting valid ones)
  • Service (providing the actual service)
Each type of solution differs based on who is responsible for these functions.

In-house DDoS prevention

This solution uses in-house detection capability to spot the DDoS attack, and then requests support from the internet service provider to block the source IP address(es) for a limited amount of time. Having a good security event management system with all the possible event sources helps to identify attacks early and allows a much more reliable response (false positives mean lost clients).
The response can be automated using the provider's service API or standard routing mechanisms (like BGP Flowspec) for blackholing the source, or it can be manual, by reporting abuse to the appropriate ISP contact point.

There are many vendors providing such solutions, including DPI detection as well as signalling, but there also needs to be a contract in place with the service provider to support such a service.

This table should summarize the location of the DDoS protection functions:

 Function          | Customer                            | DoS protection provider
 Attack detection  | Most of the detection               | -
 Attack prevention | Signaling only                      | Most of the protection
 Actual service    | The service is provided by customer | -

Service gateway

The principle of this solution is pre-filtering all requests via a gateway service, where only valid requests fulfilling a specific set of rules (like max 10 requests per client, or a valid session should last longer than 1 minute) are forwarded to the real server.
Depending on the rule-set, the customer's server receives only valid requests and does not have to deal with the excessive traffic.
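As a sketch of one such gateway rule ("max 10 requests per client"), the counting logic might look like the following in Python. The class name and thresholds are illustrative, not any particular vendor's product:

```python
import time
from collections import defaultdict

class GatewayFilter:
    """Per-client request limiter: a gateway rule like 'max N requests per client per window'."""

    def __init__(self, max_requests=10, window=1.0):
        self.max_requests = max_requests
        self.window = window                 # sliding window, in seconds
        self.hits = defaultdict(list)        # client IP -> timestamps of recent requests

    def allow(self, client_ip, now=None):
        """Return True if the request should be forwarded to the real server."""
        now = time.monotonic() if now is None else now
        # Keep only the requests that are still inside the window
        recent = [t for t in self.hits[client_ip] if now - t < self.window]
        self.hits[client_ip] = recent
        if len(recent) >= self.max_requests:
            return False                     # drop: the real server never sees it
        recent.append(now)
        return True
```

A production gateway would of course evict idle clients and apply many such rules, but the principle stays the same: the decision is made per request, before the real server is involved.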

In this case, the detection and prevention functions are located at the service provider, so all the traffic terminates there first. But the responsibility for operating the service is still with the customer.

 Function          | Customer                            | DoS protection provider
 Attack detection  | -                                   | Most of the detection
 Attack prevention | -                                   | Most of the protection
 Actual service    | The service is provided by customer | -
It's important to note that these gateways are either generic (filtering IP or TCP packets) or highly specific to a certain type of application (mail gateway, DNS gateway, web application firewall, ...).

CDN service

Although this is a highly specific solution and requires some cooperation with the service provider, for common services like data or content distribution it can be very effective.
Content distribution network (CDN) providers have large, geo-distributed infrastructure built for huge traffic flows. For some of them, even a large distributed DoS might look like a minor increase in the normal traffic level.
The principle here is quite simple: the customer uploads all the data to the CDN provider, where it becomes accessible to the clients.

This solution moves all the functions to the external service provider, who has to deal with the attacks and guarantee service to the customer.

 Function          | Customer | DoS protection provider
 Attack detection  | -        | Most of the detection
 Attack prevention | -        | Most of the protection
 Actual service    | -        | Service is provided by the CDN


Despite the abundance of great tools that promise miracles in preventing DDoS attacks, it always comes down to a specific solution for a specific customer. Whether rules need to be tailored and constantly adapted (for the first type of solution), or the gateway needs to be adjusted to deal with non-standard protocol behavior, it all comes down to the skills of the engineers operating the solution.
In the hands of a capable engineer, it can make the promised miracles come true; in less capable hands, it can leave the service very unreliable or still vulnerable to distributed DoS attacks.

Thursday, April 30, 2015

Planning network element patches

As it happens, security engineers sometimes need to grab the networkers and make them patch all those vulnerabilities that sysadmins keep fixing as soon as they come out.
With systems this is all quite easy, as there are almost no restrictions (besides compatibility), but network elements like routers or switches have limited resources, and their updates are not broken down into smaller components - they are usually one big file containing everything from the kernel to all the "applications", processes, and supporting utilities.
In this post, I'll try to describe the process of planning such an upgrade.

Inventory phase

The first step is to find out what needs to be upgraded, as this information is important in later phases for selecting the software version as well as the upgrade process itself.
The following information needs to be collected about each router/switch/firewall/etc.:
  • Hardware type and version (not just the type printed on the device chassis, but also slot/port count information - for example "Cisco Nexus 56128P Switch" or "Catalyst 2950T 24 Switch")
  • Memory information (RAM as well as storage or flash memory size)
  • Management IP (or the IP address via which the software is going to be uploaded, as PCMCIA or console modem options are usually not very fast)
Memory information is needed to determine whether the software can run in the RAM the hardware provides, and whether the image fits in the available storage. Some devices have only one memory pool and split it between the two uses (like Cisco 7200 routers), while others have dedicated storage memory and RAM.

Software selection phase

With all the information from the previous phase, we can move on to finding the appropriate software version that each device supports and that contains the needed fixes.
Each software vendor has at least one web tool that provides the necessary information (or even the software download itself).
Sometimes vendors link the fixed software directly from security advisories or notifications, but that's not always the case, so the safest way to get the software and the information about it is via the download pages.

Some vendors make it easy to select the latest version, while others have a set of sub-versions indicating feature upgrades or just patches; standard or extended support; early or limited deployment; etc. Each vendor has a document describing what each part of the version string means, and it can differ per product series.

Besides the choice of software to download, there are also release notes or a readme document for each version, where the vendor describes:
  • how to perform the upgrade
  • the pre-requisites (which platforms and current software versions are compatible)
  • which new features are introduced and which old ones are removed
  • which issues/bugs/problems were resolved in that version
  • which caveats were identified with this version
If the current version is too old (by one or several major releases), several consecutive upgrades might be needed to ensure the configuration is properly translated to the new syntax and features. This should be described in the pre-requisites; the selection phase then has to be repeated for each version that needs to be installed before the latest one can be applied.

With constant change and improvement in the networking field, features come and go, so it's necessary to watch out for removal or modification of features in use (a default deny could change to permit any, or statically configured IPsec local networks might become auto-negotiated in a newer version).

The list of resolved bugs is a good source for identifying whether the new version fixes the recent vulnerabilities floating around in the wild. This can help with overdue vulnerability management tickets or anomaly reports.

And the caveats are known problems identified during the vendor's testing of the new version. When the local conditions are similar to those described in a caveat, this might put a stop to the installation of that version (or the whole upgrade).

Software validation

With all the information collected in the previous phases, only very brave people would install the software right away in production.
A lot of companies have labs where new versions can be tested before production rollout; in larger data centers there may be canary elements where this can be done.
Goal of validation should be to ensure:

  • current configuration syntax is accepted by the new version
  • all required features are going to work as expected (with the same licences)
  • redundancy mechanisms still work (no timer or protocol defaults changed)
  • monitoring functions get the same format of data as before (no SNMP OID, syslog message format, or API changes)
  • the migration/upgrade plan is not going to cause an impact (some systems require the same version on all clustered elements to work)

Whether all this is automated or done manually by a verification team with defined test cases is up to each company, but what no IT manager wants is a total outage of the core network after a software upgrade of a central router or switch.

And let's not forget to verify the hash of the downloaded software (if the vendor publishes it on the download site), as network elements are a prime target for MitM attacks.
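As a minimal Python sketch of that check, assuming the vendor publishes a SHA-256 checksum next to the image (some vendors use MD5 or SHA-512 instead; swap the algorithm accordingly):

```python
import hashlib

def verify_image(path, expected_sha256):
    """Compare a downloaded image's SHA-256 against the checksum from the vendor's download page."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        # Read in 1 MiB chunks so large images don't have to fit in RAM
        for chunk in iter(lambda: f.read(1 << 20), b""):
            h.update(chunk)
    return h.hexdigest() == expected_sha256.lower()
```

Only upload the image to the device when this returns True; a mismatch means a corrupted or tampered download.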

If you know of anything else I missed, let me know and I'll update the post.

Tuesday, April 28, 2015

Event management solution scaling - Practical example

As described in the previous blog post, every piece of software, every server, and every appliance has its limits.
Scaling beyond these limits is a task for an engineer: to build something that can cope with the load.
In theory one could adjust an open-source solution and live happily ever after, but in the real world one has to deal with proprietary software or appliances, and it's not easy to just migrate or replace them.

For such a scenario, I've developed a small program called NFF that forwards incoming traffic to several configured destinations. Currently it listens on one port and forwards to several destinations, but with a different configuration file it can run for several services (e.g. syslog, SNMP traps, NetFlow).

Note: the current version only forwards the flows; later, when protocol decoding is implemented, it will also be able to forward flows to specific destinations based on rules.
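The core of such a forwarder is small. This is not NFF itself, just a minimal Python sketch of the same fan-out idea for UDP-based services like syslog:

```python
import socket

def make_forwarder(listen_addr, destinations):
    """Bind the listening socket and return (socket, pump).

    pump(count) receives `count` UDP datagrams (e.g. syslog on port 514)
    and replicates each one to every configured (host, port) destination.
    """
    sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    sock.bind(listen_addr)
    out = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)

    def pump(count=None):
        n = 0
        while count is None or n < count:
            data, _src = sock.recvfrom(65535)
            for dest in destinations:        # replicate to every destination
                out.sendto(data, dest)
            n += 1

    return sock, pump
```

A real deployment would run pump() forever (count=None), keep the original source address where the receivers need it, and add buffering so one slow destination can't stall the rest.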

Integration is done by installing this program on the IP address that all systems already send their logs/netflow/data to, while the appliance or software analyzing them moves to a new IP address.

If management decides to buy a bigger box or choose a different supplier, the new system can simply be added to the distribution list during the trial period, to see whether it fulfills the needs and expectations.

As I don't currently have a job where I could test this idea at scale, I hope some of you will give me feedback on how well it performs. I already have several ideas for making it faster.

Friday, March 20, 2015

Cryptography education

After a long break, I finally got back to writing yet another post (coincidentally, about what I've been doing the last few months).
As cryptography is an essential part of security engineering, I decided to challenge myself with the online course Cryptography I by Dan Boneh from Stanford.
Originally I didn't expect university courses to go into much detail on how crypto algorithms are implemented in practice, but this course surely changed my opinion on the practicality of university lectures. Of course, I should have known that lectures from famous universities like Stanford are expected to be worth the tuition fees.

Course content

Despite the practical nature of this course, it requires a very good understanding of several areas of math, as well as comfort with the scientific style of the lectures.
This was actually an important part of its practicality, as some areas of security engineering require or expect a mathematical proof of an algorithm's or process's security. I'm not sure how such a proof would hold up in front of an average auditor, but this course does a good job of proving the weaknesses of the explained cryptographic functions.
The first course gives a brief introduction with a short refresher on discrete math, then goes right into stream and block ciphers (like DES or AES). Later it covers message integrity and hash functions, and finishes with public-key encryption (RSA and ElGamal).
The second course (expected to take place in June 2015) will surely bring more advanced and recent topics like elliptic curve cryptography or maybe digital currency concepts, so there's more to learn.
In most of the topics, practical examples were given of how some services or protocols used cryptography in the wrong way.

Programming part

Besides the theoretical part, the course also included coding assignments, with tasks ranging from encrypting content in an efficient and secure way to deriving keys from insecure implementations. This experience will surely be very useful in code review or penetration testing work.
All the programming assignments were done in Python, with the help of various libraries offering the necessary crypto algorithms already implemented.


Everybody in the security field who wants to call himself a security engineer should take this course (and pass), as the knowledge gained is important in many areas. Whether one writes security policy (defining acceptable crypto algorithms, key sizes, etc.), ensures application security (performing code reviews or testing), provides authentication, file or storage encryption, or builds VPNs, this course helps in understanding the implications of choosing one security algorithm over another, and the performance and security impact of these decisions.
With the recent OpenSSL vulnerabilities, I can't stress enough how much this knowledge improves the decision-making on whether and how to deal with such vulnerabilities.

Monday, December 29, 2014

Log solution scaling

After an interesting overview of firewall scaling, let's have a look at how (security) event logging can be built.
The primary value for evaluating log server performance is events/sec: how many events a solution can receive and process. This value depends on the hardware of the log server and the applications running on top of it.
The secondary value is the number of queries that can be run over the log data, which very much depends on how the data is stored, the complexity of the query, and how much data the query has to process.

Single log-server

This is a common solution in many places: to satisfy security policy requirements, a security appliance or server is installed to perform log collection and analysis.

This solution has a few scalability limitations: the amount of logs it can collect is limited by its hardware resources, and in order to collect the logs it has to have connectivity to every element of the whole environment.
Another disadvantage is that any analysis query takes resources away from collection, so if resources are not sized properly, events might be missed.

Log-server chaining

After realizing that the standard log-server solution has performance problems, many companies buy another server and split the event logging to lower the load. Then it becomes quite difficult to perform log analysis, so yet another server is purchased to process only the interesting events that the previous nodes pass on.

The actual structure of the tree depends on business needs; there can be several servers with "analyze" components if there are many queries but not that many events.

This solution separates the collection and analysis functions, so resources are not shared, and loss of events is less likely.
There are other challenges here, however: elements have to be assigned to a specific collection node, so it is necessary to know how many events the elements generate and how many events one collection node can process and forward to the log server.

Big(ger)-Data logging solutions

While aggregating and pre-filtering solutions do the job (at least for alerting when something happens), more detailed digging in the logs requires something more flexible, with access to all the log data. For this it is necessary to consider distributed storage and parallel processing: with the data no longer stored on one node, queries have to run on several nodes in parallel and the results then need to be aggregated (correlation might be a bit problematic, though).

The picture may be a bit misleading, as there are 3 functions here:

  • Data input (converting syslog or other events into a standard format for storage)
  • Data storage (distributed event storage system)
  • Data output (executing queries on the data and providing results)
Data storage is no longer just a simple write to a file; it is a more complex distribution of the event data across several machines, not just for redundancy or speed of access, but also for the ability to execute analysis requests on each of them.
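The scatter-gather part of such a query layer can be sketched in a few lines of Python. Here the shards are in-memory lists and events are dicts with a hypothetical `ts` field; a real system would ship the query to the storage nodes instead of pulling the shards locally:

```python
from concurrent.futures import ThreadPoolExecutor

def scatter_gather(shards, predicate):
    """Run the same filter query on every storage shard in parallel, then merge the partial results."""
    def query(shard):
        # Each node filters only the data it holds
        return [event for event in shard if predicate(event)]

    with ThreadPoolExecutor(max_workers=len(shards)) as pool:
        partials = list(pool.map(query, shards))

    # Gather: merge the partial results and re-order globally by timestamp
    merged = [event for part in partials for event in part]
    merged.sort(key=lambda event: event["ts"])
    return merged
```

Correlation is the hard part left out here: it needs either a shared view across shards or a second aggregation pass over the gathered results.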

Of course, big-data solutions have to be tailored to provide meaningful results, so they also require aggregation and correlation functions, as well as the knowledge to build queries for the information needed.
That calls for software developer and operations engineer roles to work together much faster and more effectively than they do now, to provide the right information at the time it is needed.
Besides that, the challenges of the log-server chaining model remain: the collection of appliances and proprietary elements can only produce logs in client-server fashion (e.g. the syslog protocol) and won't be able to distribute the load across many collection nodes.

Future of logging

Predicting the development of the entire industry is difficult even for industry analysts, but let me put my 2 cents on the table and describe what I would like to see.
With the increased popularity of the cloud, hardware resources are more available and more flexible to re-allocate. With the separation of functions into log collection and analysis, it is now possible to distribute the load and collect/process more events at the same time.
To scale even better, more granular separation might bring better results. For this, container systems like LXC or Docker come in handy, as you can spawn many processes and distribute them across platforms as needed. There can even be specific software written for each query or report, so that it runs only when needed or when a specific type of event occurs.
This can all be compared to a neural network, where specific neurons fire when signals of a certain strength are present on their dendrites.

With the collectors (red dots) being specific to types of devices, conversion to a generic event structure is much easier to implement and maintain in operations.
The storage nodes (blue dots) are a system of their own, synchronizing data between themselves as needed, and pre-filtering or processing requests (green dots) can happen on each storage component on the data available there.
In the output layer (orange dots), all the relevant data is then collected to produce a specific report, when it is needed.

The major challenge here would be to build signalling or data passing between the containers without overloading the network or storage I/O, and to train the network to forward only relevant data to each of the output nodes.
But with the flexibility of small containers it is possible to spawn as many nodes, layers, and output nodes as needed, so this could grow with the cloud while keeping a small enough footprint to remain quite effective.

Tuesday, November 18, 2014

Firewall scalability

I'm sure every environment has moments where traffic grows and grows until it reaches some limitation. It could be the ISP link, the proxy throughput, or maybe the firewall.
In that case, the IT or OPS team has several options to resolve the problem. Most of the short-term options involve reducing the traffic generated, either by telling users to stop using the internet for non-business purposes, or by blocking websites that are not needed for work (white-listing).
But let's have a look at the more technical options, which also tend to be long-term solutions.

Firewall upgrade

This is quite a common choice for companies who just throw money at the problem. Buying a newer and faster firewall is surely easier than trying to re-engineer the network, and the migration is also quite simple: configure the new firewall and just replace the existing one (with a roll-back possibility).

But there are limitations here too. For all the Cisco fans: the firewalls currently on the market from your favorite vendor do at most 15Gbps as an appliance and 20Gbps as a service module. As these values are theoretical, I wouldn't expect them to be reached in real-world situations.

Now for those who can consider other vendors: Fortinet announced a carrier-grade firewall, the FortiGate 5000, which can deliver more than 1Tbps of firewall throughput. Of course, that's just a marketing statement, as it's the sum of all the blades, each of which can deliver 40Gbps.

There are also tricks with using firewalls in parallel, but synchronizing state between all the units is a challenge. Some vendors tried it with a dedicated link between 2 units, others with multicasting the state changes, but the effectiveness of such solutions decreases with each added unit and with the number of flows passing through them.

Firewall bypass

Although firewalls are limited by their inspection ASICs, which not only have to analyze the packet headers but also keep state information for each flow, switches with forwarding ASICs are much faster when doing just forwarding.

So in some companies, engineers thought about this fact and came up with the idea of inspecting only the packets relevant to the state information, while the rest of the packets are just passed on.

They send all TCP packets with the SYN, RST or FIN flags set (plus any non-TCP packets) to the inspection unit (which can be a firewall), while the remaining packets are forwarded to their destination directly.
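The classification decision itself is trivial, which is why it can run at forwarding-ASIC speed. A Python sketch of the rule (flag bit values as defined in the TCP header):

```python
# TCP flag bits as they appear in the TCP header's flags field
FIN = 0x01
SYN = 0x02
RST = 0x04
ACK = 0x10

def needs_inspection(proto, tcp_flags=0):
    """Slow path vs fast path decision, per packet.

    State-changing packets go to the firewall ("slow path");
    everything else is switched directly to its destination ("fast path").
    """
    if proto != "tcp":
        return True                          # non-TCP: no flag-based shortcut, inspect it
    return bool(tcp_flags & (SYN | RST | FIN))
```

Everything hitting the fast path still has to belong to a flow the firewall has already admitted, which is the router-to-firewall feedback mentioned below.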

This idea, called "fast path", was also adopted in SDN and virtual networks, as with OpenFlow 1.3+ packets can easily be redirected to an inspection device, which can instruct the controller to drop the flow if it doesn't match the security policy.
Despite the fact that many vendors currently support only OpenFlow 1.1, many of them are already considering support for 1.3 or have announced switches supporting it (like Brocade).

With such a solution, and data flows that contain many packets, traffic speeds can be much higher than any hardware firewall appliance can offer in the near future.
But limitations still exist: the processing speed for the packets carrying the SYN/FIN/RST flags, and the forwarding speed of the network itself. This idea also relies on most of the traffic being TCP-based, as for other protocols the conditions for detecting when a flow starts or finishes differ. And what is not shown in the picture is the feedback needed from the firewall to the router, to allow only existing flows onto the fast path.

Firewall re-location

With all the ideas described above, inspection happens at the edge of the network (following best practices for firewall deployment). Doing computation-intensive tasks like packet inspection on one centralized system restricts throughput to the hardware performance of that system.
As the general answer to this limitation is parallel computing, you can see that vendors tried it by building blade chassis designs. The next step was to virtualize the firewalls and move them closer to the data-flow sources, but the most scalable solution is to have them exactly at the source: either end-point firewalls or server/VM distributed firewalls.
As flows originate or terminate at each VM, a firewall inspecting traffic for that VM only needs to track those flows and doesn't have to synchronize with other firewalls. Of course, if the VM moves, the firewall has to move too. As for performance, a VM is limited in how much data it can send out, and the more it sends, the more the firewall has to inspect. But since the firewall and the VM share the same CPU and memory, the system self-regulates if the firewall can't keep up with the data being sent.

This all sounds like the ultimate scalable solution, but there is a dark side to it: management. With a large number of firewalls, configuring each one would be quite time-consuming. Automation or VM profiles are the usual answer, but for that, the way network security engineers and administrators operate has to change. Just consider troubleshooting a connectivity problem when you have thousands of devices generating logs, those devices move around, and you might not be able to reproduce the problem.


So despite the fact that there are solutions to scale firewall throughput to the sky, there are many considerations to be made on the way there.
From the type of data-flow patterns up to the administrators' skill-set, all these obstacles have to be dealt with before the sky is reached.
Just as there is a nice blue sky, there is also a deep dark rabbit-hole into which Alice can fall.

Tuesday, October 21, 2014

Importance of free training appliances/software

I take every opportunity I can to keep my technical skills up to date, from reading articles to getting virtual versions of software/appliances and playing with them. While major vendors provide technical documentation and various papers for free, some still hide their documentation behind an authentication wall.

But that's not the prime concern, as engineers prefer something tangible, something to play with..
And here comes the idea of virtual labs, which some vendors provide for money, or rental labs from various training and certification companies. But all of that still costs, which is not very viable for home use when learning a technology or getting acquainted with a management interface.

With the recent advancements in virtualization came network and security virtual appliances, which can be re-purposed for training or testing. Some vendors offer a 30-day evaluation, but honestly.. in the real world people have to deal with many tasks, and there is no continuous 30-day window in which one can focus on evaluating the product and performing all the tests and analysis. Such evaluations are also not always well planned, with all test cases identified beforehand, so some cases get missed and the PoC has to be rebuilt to cover them later.

So when do engineers actually need or want testing/training VMs:

  • when selecting new network/security elements for purchase (e.g. shortlisting for evaluation)
  • when evaluating compatibility of various elements/vendor systems
  • when preparing for new job position
  • when testing new management/automation systems/monitoring software
  • when developing code for the above 
  • when preparing for certifications
  • for collection purposes

All of these reasons would benefit a vendor (if it provides free evaluation VMs) in the following ways:

  • the customer's engineers already know the product(s), so there is no barrier to selling additional products (or products from a different vendor)
  • compatibility can be evaluated without shipping try-and-buy equipment, and problems get identified (and resolved) quite early, especially if some bounty program exists
  • with more trained engineers available to customers on the market, they might grow/expand faster (=> more equipment can be purchased and deployed)
  • the same applies to NMS/SIEM/automation software configuration
  • developing code in one's free time is also not very easy if one needs a Cisco Nexus 9000 to actually test it
  • although certifications are also a source of income, lowering the cost of lab preparation might motivate more people to pursue them
  • some people collect stamps, while others might collect VMs.. but really it's more about being prepared for one of the reasons above

There are more benefits to mention, but this should be sufficient for any product manager or marketing director to stop for a while and think about it.
Some vendors have already done it, choosing this strategy to win the hearts and minds of engineers who otherwise would not have the opportunity to find out how good their products are.


Dear Vendor,
whether you want to introduce a new product or gain larger market share, providing free VMs of your products (with limited performance, of course) might bring you more advantages than risks.
Even when you have problems recruiting good pre-sales or professional-services engineers, the availability of free VMs for preparation and training on your products will expand your hiring choices in the long run.
It is also important to mention that engineers who know your products (and like them) can indirectly become supporters or enablers of potential sales opportunities wherever they work.
So please consider this when adjusting your product portfolios and creating your marketing strategies.
Respectfully yours,
                         Security Engineer

Tuesday, August 26, 2014

Private VLANs on NX-OS

In the field of network security there are not just firewalls and IDSs; there are also technologies and features that can be used as security controls (like network segmentation or access control) as well.

Private vlans (RFC 5517) are one such technology, very helpful where one server needs to see all the clients, but the clients should not see each other. Typical scenarios where this can be used are a backup network (one NAS or backup server and many clients) or an OOB monitoring & control network (one NMS or AAA server/station and many network or server elements). There are some fringe scenarios of filtered networks that need to use a common resource (a gateway/licence server/..), but these are not as common as the previous cases.

To state some basics about private vlans, there are 3 types of vlans:
  • Primary vlan, containing ports that can talk to any other ports (promiscuous, isolated or community ports)
  • Isolated vlan, containing ports that can only talk to promiscuous ports
  • Community vlan, containing ports that can speak to promiscuous ports, but also to the ports in the same community vlan.

For a better explanation of how private vlans work, it's best to visit the RFC document linked above or one of the referenced sites at the end of this blog entry.
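As a sketch, the forwarding rules of the three port roles can be expressed as a small reachability check (Python is used purely for illustration here; the port names are hypothetical):

```python
# Illustrative model of private-vlan forwarding rules (not vendor code).
# Each port is a tuple (role, community_id): role is 'promiscuous',
# 'isolated' or 'community'; community_id is None except for community ports.

def can_talk(a, b):
    """Return True if frames are forwarded between ports a and b."""
    if a[0] == 'promiscuous' or b[0] == 'promiscuous':
        return True                  # promiscuous ports talk to everyone
    if a[0] == 'community' and b[0] == 'community':
        return a[1] == b[1]          # same community vlan only
    return False                     # isolated ports see nobody else

nas      = ('promiscuous', None)     # e.g. the backup server
client_a = ('isolated', None)
client_b = ('isolated', None)
web1     = ('community', 101)
web2     = ('community', 101)

print(can_talk(nas, client_a))       # True  - promiscuous reaches isolated
print(can_talk(client_a, client_b))  # False - isolated ports are separated
print(can_talk(web1, web2))          # True  - same community vlan
```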


The configuration steps below are listed in the appropriate order; when they are configured in a different order, it is in several cases necessary to shut down existing interfaces before the private-vlan configuration can be applied.

Enabling the feature

Luckily this feature doesn't require a licence, so it can simply be enabled:

feature private-vlan

To allow propagation of private vlans to other switches and fabric extenders, further features are required (although they should already be enabled if that functionality is in use):

feature fex trunk

VLANs definition

Let's create a primary vlan with ID number 100 (the association with the secondary vlans comes later):

vlan 100
private-vlan primary

Next let's create a community vlan 101:

vlan 101
private-vlan community

And vlan 102 as isolated vlan:

vlan 102
private-vlan isolated

To verify that the vlans exist, the following output should be observed:

# sh vlan private-vlan
Primary  Secondary  Type             Ports
-------  ---------  ---------------  -------------------------------------------
100                 primary
         101        community
         102        isolated

Now that the vlans exist, we can associate the secondaries with the primary vlan:

vlan 100
private-vlan associate 101,102

So for verification, this is what the show command should display:

# sh vlan private-vlan
Primary  Secondary  Type             Ports
-------  ---------  ---------------  -------------------------------------------
100      101        community
100      102        isolated

Note: the vlan configuration is applied and shown correctly only after exiting the vlan configuration context.

Promiscuous port

With all vlans defined, we can proceed with configuration of appropriate ports.

interface ethernet 1/1
switchport mode private-vlan promiscuous
switchport private-vlan mapping 100 101-102

The mapping specifies the primary vlan first and then the list of secondary vlans that correspond to it (note that a promiscuous port uses "mapping", while host ports use "host-association").
It is also recommended to use BPDU guard, as in today's world of virtualized switches on hosts, one never knows what might show up on ingress..

In order to verify the result, the following would show up:

# sh vlan private-vlan
Primary  Secondary  Type             Ports
-------  ---------  ---------------  -------------------------------------------
100      101        community        Eth1/1
100      102        isolated         Eth1/1

NOTE: Promiscuous ports can only be configured on physical ports of the Nexus 5k; they don't work on fabric extender (Nexus 2k) ports.

Isolated port

Configuration of an isolated port is very similar to that of a promiscuous port:

interface ethernet 1/2
switchport mode private-vlan host
switchport private-vlan host-association 100 102

The association specifies only one secondary vlan, which corresponds to the isolated vlan that the port should be in.
In order to verify the result the following would show up:

# sh vlan private-vlan
Primary  Secondary  Type             Ports
-------  ---------  ---------------  -------------------------------------------
100      101        community        Eth1/1
100      102        isolated         Eth1/1,Eth1/2

Community port

And this is the configuration of a community port (it looks the same as for an isolated port):

interface ethernet 1/3
switchport mode private-vlan host
switchport private-vlan host-association 100 101

In order to verify the result the following would show up:
# sh vlan private-vlan
Primary  Secondary  Type             Ports
-------  ---------  ---------------  -------------------------------------------
100      101        community        Eth1/1,Eth1/3
100      102        isolated         Eth1/1,Eth1/2
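When many switches have to be checked, this kind of verification can be automated. Below is a rough Python sketch (my own illustration, not a vendor tool) that turns output like the above into structured rows that can be asserted on:

```python
# Parse `show vlan private-vlan` output into (primary, secondary, type, ports)
# rows. The sample text mirrors the verification output shown above.

sample = """\
Primary  Secondary  Type             Ports
-------  ---------  ---------------  -------------------------------------------
100      101        community        Eth1/1,Eth1/3
100      102        isolated         Eth1/1,Eth1/2
"""

def parse_pvlan(output):
    rows = []
    for line in output.splitlines()[2:]:        # skip the two header lines
        fields = line.split()
        if len(fields) >= 3:
            primary, secondary, vtype = fields[:3]
            ports = fields[3].split(',') if len(fields) > 3 else []
            rows.append((int(primary), int(secondary), vtype, ports))
    return rows

rows = parse_pvlan(sample)
# Check the association and types configured earlier
assert (100, 101, 'community', ['Eth1/1', 'Eth1/3']) in rows
assert (100, 102, 'isolated', ['Eth1/1', 'Eth1/2']) in rows
```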

Trunk port configurations

For standard transit trunks, the private vlans look just like two separate vlans, as the magic happens only at the end-points.
There are other trunk port types, which are used when trunking to non-"PVLAN aware" devices. The main point is that frames forwarded on a secondary vlan also have to be sent to the primary vlan and vice versa; this happens by re-writing the vlan tags depending on the pairing configured on the interface.
There is an article on the Cisco support forum describing the special cases where this can be used.

Promiscuous trunk
Beginning with Cisco NX-OS Release 5.0(2), on the Cisco Nexus Series devices, you can configure a promiscuous trunk port to carry traffic for multiple primary VLANs. You map the private VLAN primary VLAN and either all or selected associated VLANs to the promiscuous trunk port. Each primary VLAN and one associated secondary VLAN is a private VLAN pair, and you can configure a maximum of 16 private VLAN pairs on each promiscuous trunk port.

Isolated or secondary trunk
Beginning with Cisco NX-OS Release 5.0(2) on the Cisco Nexus Series devices, you can configure an isolated trunk port to carry traffic for multiple isolated VLANs. Each secondary VLAN on an isolated trunk port must be associated with a different primary VLAN. You cannot put two secondary VLANs that are associated with the same primary VLAN on an isolated trunk port. Each primary VLAN and one associated secondary VLAN is a private VLAN pair, and you can configure a maximum of 16 private VLAN pairs on each isolated trunk port.

NOTE2: Port-channel interfaces can't be used for private VLANs.


Wednesday, July 16, 2014

Test Driven Design in firewall engineering

After taking a Berkeley course on BDD and TDD in software engineering, I got a very interesting idea for a new kind of security device. This article explains a bit about how this idea could work and what its use-case is.


Firewall rule design is a skill that is quite hard to get right and quite easy to make mistakes in. If one just keeps on adding rules, the firewall might end up with 10000 or more rules and become difficult to maintain, or in the worst case run out of memory for the rules.
On the other hand, if one keeps on aggregating, the firewall ends up permitting unwanted traffic, and this won't be spotted until it actually causes some damage.
So the sweet spot is somewhere in between these two cases: keeping the holes minimal while aggregating the rules that can safely be merged.
This can be done either with careful planning and analysis, or with very good service monitoring that spots when something stops working, or when something suddenly becomes allowed that shouldn't be.
In most cases the first approach is used (if, of course, there is a soul brave enough to touch a working system), but with the rise of continuous deployment and automation, the second choice might become more practical.
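The aggregation half of this trade-off can be sketched with Python's standard ipaddress module (the prefixes below are hypothetical rule destinations): collapse_addresses merges prefixes only when they form an exact larger block, so merging never opens a wider hole than the original rules did.

```python
import ipaddress

# Destination prefixes from four individual permit rules (made-up example).
rules = [
    ipaddress.ip_network('10.1.0.0/24'),
    ipaddress.ip_network('10.1.1.0/24'),   # adjacent to the previous prefix
    ipaddress.ip_network('10.1.2.0/24'),
    ipaddress.ip_network('10.1.3.0/24'),
]

# Four aligned /24s collapse into one /22; nothing outside the
# original prefixes is covered by the merged rule.
merged = list(ipaddress.collapse_addresses(rules))
print(merged)   # [IPv4Network('10.1.0.0/22')]
```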

In software development, BDD/TDD means writing tests before the actual coding takes place, so that the "user story" fails before the code is written and turns green once it is implemented.
Also there are 2 types of tests that need to be done:

  • positive (how this feature should work)
  • negative (how this feature should not work)
And what is so great about testing in the software development world is that a language was developed to describe the tests. In the course I practiced this with Cucumber, which describes the expected behavior of an application interface in a very readable way and makes testing more fun.

Idea description

Now, for the idea to work, I would require an IDS-like system with the following capabilities:
  • to receive any packets that come to the firewall as well as leave the firewall
  • to send packets with any kind of source/destination on both sides of the firewall
As IDS systems are already able to do that, I don't think building such a system is a problem.

The next part is to create a language for writing the tests, which would describe the expected behavior of the firewall and validate it in a timely fashion.
I've written Cucumber tests to show what can be done; it would take quite some coding to implement these test conditions, but they illustrate how such tests could look.
Feature: Test HTTP protocol 
  Webserver should be reachable on HTTP port from outside

  Scenario: External user allowed connection
    Given there is "HTTP" server ""
    When I send "HTTP" packet from "Outside" to ""
    Then I should see "HTTP" packet on "Inside" to ""

  Scenario: External user dropped connection
    When I send "HTTP" packet from "Outside" to ""
    Then I should not see "HTTP" packet on "Inside" to ""
    When I send "SMTP" packet from "Outside" to ""
    Then I should not see "SMTP" packet on "Inside" to ""
Theoretically, all the tests that are currently done manually, by scanning for ports or analyzing firewall logs and sniffer data, could then be automated and repeated any time it is necessary (without disturbing the security engineer :).
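The scenarios above can also be sketched against a toy rule table standing in for the real firewall (the class and zone names below are my own illustration; real tests would inject and capture packets), to show the red-to-green cycle the tests would drive:

```python
# Toy firewall model: a list of permitted (protocol, src_zone, dst_zone)
# triples. Used only to illustrate writing tests before the rule change.

class Firewall:
    def __init__(self):
        self.rules = []                      # permitted flows

    def permit(self, proto, src, dst):
        self.rules.append((proto, src, dst))

    def passes(self, proto, src, dst):
        return (proto, src, dst) in self.rules

fw = Firewall()

# Negative tests hold from the start: nothing is permitted yet.
assert not fw.passes('SMTP', 'Outside', 'Inside')

# The positive test is "red" before the rule exists...
assert not fw.passes('HTTP', 'Outside', 'Inside')

# ...then the change is implemented...
fw.permit('HTTP', 'Outside', 'Inside')

# ...and the positive test goes "green" while the negatives still hold.
assert fw.passes('HTTP', 'Outside', 'Inside')
assert not fw.passes('SMTP', 'Outside', 'Inside')
```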

So with these capabilities I am able to generate packets and see what can pass through the firewall and what cannot, but it can still get problematic if the firewall applies any kind of intelligence or behavior analysis and starts blocking the IP addresses used for testing.
Another problem is that servers might process the test requests and fill up their job queues or TCP stacks waiting for response packets.
So either the solution must be able to automatically stop the test packets from reaching anything beyond the firewall, or the tests have to be written so that they don't block or over-utilize any resources.

With many vendors increasing support for various API interfaces, this could theoretically also be implemented directly on the firewall, but with firewall clusters or farms this might not be very practical.
And of course the saying "trust is good, verification is better" still applies here.

As for SDN, or more specifically NFV, this service would be an ideal candidate for verifying changes or validating control software.


As many vendors (that might grab this idea and build something) will wonder why customers would buy something like this, let's think about some cases that create demand for it:

Continuous deployment scenario

With applications changing several times a day, there might be a need to adjust network elements to match, for example when an application changes the type of database it uses, so that a MySQL flow is needed to a newly spawned (or existing) database VM.
As the change has to happen automatically to keep the development cycle short, it would be done by a script, and the person running it would want to see whether the firewall change had any impact on already-permitted flows and whether the new flow allows too much.

Firewall migration scenario

Upgrades or migrations of firewalls require all rules to be re-considered and probably rewritten. A set of tests can show that the new firewall provides the same functionality and will not cause major issues for the support teams. This way migrations don't need extra effort to investigate outages from broken connectivity (otherwise every problem gets blamed on the firewall for months to come), and the service status is visible right after the new or upgraded firewall becomes operational.

Continuous monitoring scenario

Although continuous monitoring is mostly focused on services (customers want the service to be up; they don't care about the infrastructure), for troubleshooting and support it is quite useful to spot what causes a problem. Very often the monitoring doesn't analyze all the firewall logs, and even when it does, a simple rule change can generate masses of dropped connections, making it tedious to see which flows are no longer working.
Continuous monitoring of data flows on the firewall would exclude this area from investigation: one just looks at any failing tests, instead of spending quite some time searching through logs to identify whether something was dropped and whether it should have been.