One of the aspects I really enjoy about my job is that I get to learn from some of the world’s top network and data center design engineers, and to hear about technology adoption challenges across the world. If there is a complex network or data center design being worked on by our customers, if they are under time pressure, or if they are facing key business or technical challenges, Cisco Services’ consultants are often called in to help. Globally, then, they experience first-hand the challenges of deploying advanced technologies. In this blog, in the same spirit as my OpenStack Deployment Challenges blog, I’d like to share their experiences of some of the most common challenges and misconceptions our customers face when building Storage Area Networks (SANs). I’ll publish this in two parts, so look out for the concluding part next week.
Before continuing, I’d like to thank two of our SAN expert consultants, Barbara Ledda and Wolfgang Lang, for sharing their experiences and challenges.
Before I start, a caveat: please don’t take this as the definitive list of SAN design challenges! That said, these are some of our “favourite” issues that we see with customers. If you have a favourite to add to the list, please add a comment in the box below or ping me on Twitter (details below). Now on to the list…
#1 Don’t assume that your server multi-pathing software is installed, licensed, and working, or that your server team has ever used or tested it!
Designing a highly resilient SAN with built-in redundant paths is a key aim of our consulting team. On the server estate, such a design requires that multi-pathing software [e.g. EMC PowerPath, Microsoft Multipath I/O (MPIO)] is installed and operational on each server. Such software creates redundant paths to the storage environment (among other capabilities).
However, our experience shows that in many cases the customer doesn’t have this installed, or in some cases even licensed. In other cases they’ve never tested this capability on their servers. Testing is important: I doubt anyone is immune to making configuration mistakes when setting up a networking or compute device. We’ve come across this in real life: the first time failover was required, the customer experienced an outage in the SAN network as a result of mis-configurations in their multipath software.
Therefore Lesson #1 is to make sure your server team is engaged in your SAN design process and has this important software installed and tested, so that they are in a position to exploit your advanced SAN design.
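As a quick illustration of what “tested” can mean in practice, here is a minimal sketch of a health check a Linux server team might script, assuming the standard device-mapper multipath tooling (“multipath -ll”) and making rough assumptions about its output layout; the equivalent for PowerPath or Windows MPIO would use their own commands (e.g. powermt display, mpclaim). Treat it as a sketch of the idea, not a supported tool.

```python
# Hypothetical sanity check (a sketch, not a product tool): confirm each
# device-mapper multipath device reports at least two healthy paths before
# relying on SAN-side redundancy. Assumes Linux "multipath -ll" output;
# the parsing heuristics below are rough assumptions about its layout.
import subprocess
import sys

def check_multipath(min_paths: int = 2) -> int:
    try:
        out = subprocess.run(["multipath", "-ll"], capture_output=True,
                             text=True, check=True).stdout
    except (OSError, subprocess.CalledProcessError) as err:
        print(f"multipath tooling not available or not configured: {err}")
        return 1

    device, paths, problems = None, 0, []

    def flush():
        # Record the previous device if it had too few healthy paths.
        if device is not None and paths < min_paths:
            problems.append(f"{device}: only {paths} healthy path(s)")

    for line in out.splitlines():
        if " dm-" in line:                            # header line of a multipath device
            flush()
            device, paths = line.split()[0], 0
        elif "active" in line and "ready" in line:    # an individual healthy path
            paths += 1
    flush()

    for problem in problems:
        print("WARNING:", problem)
    return 1 if problems else 0

if __name__ == "__main__":
    sys.exit(check_multipath())
```

Running something like this regularly, and during failover tests, is a cheap way to catch the “installed but never exercised” situation before an outage does.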
#2 Tendency to significantly over-estimate utilization on the SAN network.
You may think that, as a hardware vendor, we like customers to over-specify their network. This is not the case; we are always looking for the most cost-effective solution for our customers. We have, in practice, noticed that customers have a tendency to significantly over-estimate the utilization on their SAN network. For example, it’s not uncommon for us to review customer designs where they have specified 16 Gbps links when they will use a maximum of 1 Gbps!
As you will be aware, Ethernet LAN networks are able to deal with packet loss, relying on the upper-layer protocols to do so. This is not the case for Fibre Channel networks, which require traffic flows to be lossless in order to avoid I/O disruption and loss of connectivity to storage. Consequently, to avoid any risk of congestion, SAN designers tend to overprovision their SAN.
However, over-provisioning the SAN links is not always the solution, because in many cases the root cause of the congestion is that some receivers of the traffic flows are “slow drain devices”; that is, they are “slow” in processing the packets they receive. You can read more on this in the whitepaper here.
Hence, while over-provisioning is a necessary design technique for SAN, we find that customers typically think they need 5-10x over-provisioning; our consultants’ experience indicates that double is usually enough.
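To make that concrete, here is a back-of-the-envelope sketch (my own illustration with hypothetical numbers, not a Cisco sizing tool) showing how much the headroom factor alone changes the number of 16G FC links you think you need:

```python
# Back-of-the-envelope ISL sizing sketch: all numbers are hypothetical and
# for illustration only; a real design needs measured traffic profiles.
import math

def links_needed(peak_gbps: float, headroom: float, link_speed_gbps: float = 16.0) -> int:
    """Links required to carry the measured peak with a given headroom factor."""
    return math.ceil((peak_gbps * headroom) / link_speed_gbps)

peak = 12.0  # assumed measured peak fabric throughput, in Gbps
for factor in (2, 5, 10):
    print(f"headroom x{factor:>2}: {links_needed(peak, factor)} x 16G FC links")
# Output:
#   headroom x 2: 2 x 16G FC links   <- roughly double is usually enough
#   headroom x 5: 4 x 16G FC links
#   headroom x10: 8 x 16G FC links   <- the 5-10x "gut feel" quickly adds cost
```

The point is not the exact numbers but the method: start from measured peak utilization, not from a worst-case guess.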
Lesson #2: Get expert help to accurately assess your SAN capacity needs.
That, then, concludes part 1. I’ll extend this list next week with additional tips. In the meantime, you may be interested in the upcoming Cisco seminar on “Designing and Deploying Highly Scalable SANs with the New Cisco MDS 9396S” on August 25th, so please register here!
I’d also like to hear about your top SAN design challenges: let me know via the comments box below or contact me on Twitter, and we can discuss them with our Cisco Services SAN design experts! And finally, look out for part 2 next week.
If I’m booting from SAN (iSCSI), how much latency is acceptable? I’ve been told 40 ms is acceptable before the OS blue screens. Thanks
Good question, Thomas, thanks!
First, I’d like to thank my colleagues and Cisco Services SAN experts, Venkat Kirishnamurthyi and Jing Luo, for their insights here.
It varies depending on the OS you are booting, the hardware it is running on, whether iSCSI is implemented in software or hardware, and the physical connections.
The disk timeout on Windows can be configured in the OS; for iSCSI it is usually set to 60 seconds.
There are some implications of this, however.
Firstly, a consistent 40 ms latency is not really optimal in any situation. Even though the SCSI timeout is typically set to 60 seconds for Windows servers, you are going to have a very poorly performing system at 40 ms latency; the fact that you are not getting a blue screen is the only saving grace. In addition, because of the 40 ms latency you will at some point start running out of TCP buffers, leading to dropped packets, I/O retries, and even poorer I/O performance.
Secondly, most applications expect sub-10 ms latency for disk access. Oracle, for example, expects I/O to complete within 3-6 ms for optimal performance. Given these application requirements, you will run into application issues, making it an unworkable solution.
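If you want a rough feel for the latency your LUN is actually delivering, a simple probe like the sketch below can help, but treat it as illustrative only: the device path is a placeholder, the OS page cache can mask real storage latency, and a proper assessment should use the OS’s own I/O statistics (perfmon, iostat) or the array’s tools.

```python
# Rough latency probe (illustrative sketch only): time small reads at random
# block-aligned offsets on a device or file. The device path is a placeholder,
# and the OS page cache can mask real storage latency, so prefer perfmon /
# iostat / the array's own statistics for a proper assessment.
import os
import random
import statistics
import time

def sample_read_latency(path: str, samples: int = 100, block: int = 4096,
                        span: int = 256 * 1024 * 1024):
    """Return (median, worst) read latency in milliseconds over `samples` reads."""
    latencies_ms = []
    fd = os.open(path, os.O_RDONLY)
    try:
        for _ in range(samples):
            offset = random.randrange(span // block) * block
            os.lseek(fd, offset, os.SEEK_SET)
            start = time.perf_counter()
            os.read(fd, block)
            latencies_ms.append((time.perf_counter() - start) * 1000)
    finally:
        os.close(fd)
    return statistics.median(latencies_ms), max(latencies_ms)

if __name__ == "__main__":
    median_ms, worst_ms = sample_read_latency("/dev/sda")   # placeholder device
    print(f"median {median_ms:.2f} ms, worst {worst_ms:.2f} ms")
    # Sub-10 ms is what most applications expect; a consistent 40 ms will not
    # blue-screen (the timeout is ~60 s) but will perform very poorly.
```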
Hope this helps!
Stephen
I have HP EVA P6500 storage. I found frame errors in the SAN switch port statistics (in the web management interface) on the ports connected to the storage array and to the servers. How do I fix these errors?
What kind of SAN switches do you have?
HP b-series 8/80
Hi Alaahegga
I’ll be publishing part 2 of my blog tomorrow, and one of the issues I talk about is the architecture of SAN switches and how this can be a challenge, especially with regard to limitations in the switch architecture. Unfortunately some of our competitors have architectural issues which make them more susceptible to packet dropping. The Cisco Crossbar architecture is more robust in this regard than some of our competitors’. I appreciate this is not what you want to hear at this point; however, I will ask around some of my colleagues and see what experience they have (although, to be fair, our expertise is on Cisco SAN switches).
Stephen
Hi Alaahegga, what frame errors are you seeing? It will be tough to diagnose without more information, as it could be many issues. Typically we have seen CRC errors causing issues on Brocade switches. It could also be Head-of-Line (HOL) blocking caused by congestion in your fabric, a consequence of their switch architecture. We have seen CRC errors and HOL issues cause much grief in customers’ environments before. They may or may not be the errors you are seeing, but without more data on the type of errors it is hard to say more.
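One practical first step on any vendor’s switch, while you gather more data, is to trend the per-port error counters over time rather than looking at a single snapshot, since a CRC count only matters if it is still incrementing. Below is a rough sketch of that idea; it assumes you can export the port statistics to CSV from the management interface, and the file names and column names used here are placeholders you would need to adjust.

```python
# Hypothetical helper: compare two CSV exports of per-port counters taken a few
# minutes apart and flag ports whose error counts are still rising. The file
# names and column names ("port", "crc_errors", "frames_discarded") are
# placeholders; adjust them to whatever your management interface exports.
import csv

def load_counters(path: str) -> dict:
    with open(path, newline="") as f:
        return {row["port"]: row for row in csv.DictReader(f)}

def rising_errors(before_csv: str, after_csv: str,
                  fields=("crc_errors", "frames_discarded")):
    before, after = load_counters(before_csv), load_counters(after_csv)
    findings = []
    for port, row in after.items():
        base = before.get(port)
        if base is None:
            continue
        for field in fields:
            delta = int(row[field]) - int(base[field])
            if delta > 0:
                findings.append((port, field, delta))
    return findings

if __name__ == "__main__":
    for port, field, delta in rising_errors("ports_1000.csv", "ports_1015.csv"):
        print(f"port {port}: {field} increased by {delta} - check the SFPs and cabling on that link")
```

Ports whose CRC counts keep climbing usually point to a physical-layer problem (SFP, patch lead, patch panel) on that specific link, whereas fabric-wide discards point more towards congestion or HOL blocking.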
You can find part 2 of this blog series here: http://blogs.cisco.com/datacenter/5-top-challenges-of-san-design-view-from-our-san-design-experts-part-2