Thursday, October 11, 2012

VMware ESXi 4/5 APD Lockup Problem

Problem: You click Rescan All... in the vSphere Client and the ESXi host becomes unmanageable due to a dead LUN, a downed path, or an offlined volume (this is for iSCSI; I don't know whether the problem happens with other storage types).  The only fix is to hard-reboot the server.

Despite the lengthy KB article from VMware on how to do this cleanly (http://kb.vmware.com/selfservice/microsites/search.do?language=en_US&cmd=displayKC&externalId=2004605), the procedure is still error-prone and most likely will not work for your environment.  I'm not about to connect to 6 different hosts and run all this nonsense just to make sure VMware cleanly unmounts a volume.

The quick fix: Go to your Storage Adapters and click on the properties of iSCSI Software Adapter.  Click the Static Discovery tab.  Remove the dead connections.  Then you can rescan without the host locking up.  No other method has proven reliable for me other than this.

Update: this still didn't fix the issue.  The only real way to overcome this problem is to upgrade to 5.1 where they finally fixed the issue.

Wednesday, September 19, 2012

Sophos (Shh/Updater-B)

Before I go into my rant on how fucked up Sophos' Endpoint Protection management system is, let me run through how I fixed this problem with the false detection.

My policies were set to move malware if cleanup failed.  Fortunately, only a handful of computers actually were able to move some of these files before I was able to update the policy.  Also fortunate that my antivirus server was NOT running the Sophos Client so nothing on the server broke.

Immediately I added these Windows exclusions to all on-access policies:

C:\Program Files (x86)\Sophos
C:\Program Files\Sophos
C:\ProgramData\Sophos

Also changed the policies to block instead of move at this time.

Forced all clients to update the policy.  Next I forced the update manager to grab the fix.  Let it push out to our file server which eventually synced it up using DFS to all locations.

Now the fun..   for the clients that didn't break themselves, I let them self-update and this fixed the issue.

For the clients that did quarantine/delete Sophos' own update files (grabbed a list of them by checking which computers were not fully up to date in the console), I copied the entire C:\Program Files (x86)\Sophos\AutoUpdate directory as well as the update definition that fixes this false detection to our server.

Ran a quick script to copy back the programs and dll files Sophos removed:
xcopy "\\Server\SophosAutoUpdate\*.*" "C:\Program Files\Sophos\AutoUpdate\*.*" /y
xcopy "\\Server\SophosIDE\javab-bd.ide*" "C:\Program Files\Sophos\Sophos Anti-Virus\*.*" /y
(change to "C:\Program Files (x86)" for 64-bit OS)

Restarted Sophos Anti-Virus service (SAVservice) and Sophos Updater with Dameware and I was back in business.

For clients who are not connected, I think the best bet is to send them a script to stop Sophos Anti-Virus, then have them run an executable ZIP file to restore the AutoUpdate and java*.ide file.  I cannot imagine how larger corporations are dealing with this disaster.  For the company I am at, I was able to catch this issue when the first round of alerts started flooding in.  And luckily since we use DFS to distribute updates, I used "Previous versions" to restore back the files that were modified before the last update to stop the spread.

---

Now the rant...   Sophos' support on this problem was beyond horrible.  How this update slipped through QA is unforgivable.  In fact the only way this could have slipped through their quality control is if they didn't have quality control or testing.  Otherwise, they would have realized this update breaks their own program!

I understand some other antiviruses released bad updates, but NEVER have I ever seen one that actually detected itself as a virus.

How did they respond to this fuckup?  They issued 1 advisory which was so vague and would not fix anyone's issue unless their policy was changed to do nothing when Malware was detected (which I don't even think is their default setting).  Their support lines were unreachable from the massive number of customers calling in, their email support was non-existent, and it appears the only help available was 1 employee responding periodically on their forums.

Regardless of this recent incident, there were numerous other annoyances that aggravated me:
1. You cannot unquarantine files remotely.  You have to manually go to the client computer and run Sophos from there.  On top of this, quarantined files are not moved back.  You have to sort through the log files and figure out where they came from.  FAIL.

2. Server-Client communication is sub-optimal.  The clients stay connected to the server over 2 ports at all times.  It's not a simple push/pull method, but a constant connection.  It drains server resources and is an overall poor design that was probably meant for a network of 20 computers, not hundreds or thousands.

3. Version 10 and the bloat.  Their "web-intelligence" services (2 more services it has to run) break a lot of network programs.  Disabling it in the policy has no effect; the only fix is to actually set the service to disabled.  It's a broken LSP (Layered Service Provider) that destroyed our SharePoint server (email notifications stopped working, SQL connections broke), and some clients were not able to browse the web or use Oracle applications.


Sophos did have its advantages back in the day, lightest and strongest Antivirus out there.  But I'm afraid its time has gone, they are not improving the product but just bloating it with useless addons -- making it an absolute disaster to manage and maintain.  I'm going to have to take a peek at Vipre, had some issues when testing it years ago but at least it was manageable, where I had the ability to release quarantined files.

Tuesday, April 10, 2012

Red Hat 5, iSCSI and multipath

So you set up multipath with iSCSI on Red Hat 5 but notice traffic is only going out 1 interface?

The problem is iscsiadm seems to ignore the physical interface you are trying to bind to.  I think you can manually force the iface when setting up each node, but when you have a storage array with 5 adapters and your server has 3 adapters you are using, do you really want to enter 15 commands per volume?  If you are as lazy as I am, the quick fix is to edit each iface entry in /var/lib/iscsi/ifaces and add the line:
"iface.net_ifacename = eth0" where eth0 is the physical interface you are binding.

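If you'd rather not hand-edit every file, the edit above can be scripted.  This is only a rough sketch: set_iface_nic is a made-up helper name, and iface0/eth0 in the usage line are placeholders for whatever iface records and NICs you actually have.  Back up /var/lib/iscsi/ifaces first.

```shell
#!/bin/sh
# Hypothetical helper (not part of open-iscsi): point one iface record
# at a physical NIC by setting its iface.net_ifacename line.
#   $1 = path to the iface file, $2 = NIC name such as eth0
set_iface_nic() {
    file="$1"
    nic="$2"
    [ -f "$file" ] || { echo "no such iface file: $file" >&2; return 1; }
    tmp="${file}.tmp.$$"
    # Keep every line except an existing net_ifacename setting.
    # grep exits non-zero when nothing is left, which is fine here.
    grep -v '^iface\.net_ifacename' "$file" > "$tmp" || true
    echo "iface.net_ifacename = $nic" >> "$tmp"
    # Write back in place so the file keeps its owner and permissions.
    cat "$tmp" > "$file"
    rm -f "$tmp"
}
```

Usage would be something like `set_iface_nic /var/lib/iscsi/ifaces/iface0 eth0`, once per iface file.  iscsiadm can also make the same change itself with `iscsiadm -m iface -I iface0 --op=update -n iface.net_ifacename -v eth0`, which is probably the safer route if you only have a few ifaces.
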
Then it's simply a matter of discovering the volumes:
 iscsiadm -m discovery -t st -p <iscsi discovery IP>

Log in to the node on all available ifaces in 1 shot (note the -T before the target name):
 iscsiadm -m node -T iqn.veryveryverylongname000000666.feedge --login

Check multipath:
multipath -ll

mpath10 (2bfc5148f7267432c5d7ce900ed0e9ff4) dm-2 Nimble,Server
[size=800G][features=0][hwhandler=0][rw]
\_ round-robin 0 [prio=15][active]
 \_ 147:0:0:0 sdef       128:112 [active][ready]
 \_ 149:0:0:0 sdeg       128:128 [active][ready]
 \_ 148:0:0:0 sdeh       128:144 [active][ready]
 \_ 151:0:0:0 sdei       128:160 [active][ready]
 \_ 153:0:0:0 sdel       128:208 [active][ready]
 \_ 150:0:0:0 sdej       128:176 [active][ready]
 \_ 152:0:0:0 sdek       128:192 [active][ready]
 \_ 154:0:0:0 sdem       128:224 [active][ready]
 \_ 156:0:0:0 sden       128:240 [active][ready]
 \_ 158:0:0:0 sdeq       129:32  [active][ready]
 \_ 155:0:0:0 sdeo       129:0   [active][ready]
 \_ 157:0:0:0 sdep       129:16  [active][ready]
 \_ 159:0:0:0 sder       129:48  [active][ready]
 \_ 160:0:0:0 sdes       129:64  [active][ready]
 \_ 161:0:0:0 sdet       129:80  [active][ready]

and mount (or format) your volume:
mount /dev/mpath/mpath10 /myvolume

You can check ifconfig to confirm all the ethernet adapters bound to iSCSI have equal amounts of traffic or use iptraf to check the packets.
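
For a quick look without iptraf, the per-interface byte counters in /proc/net/dev can be pulled out with awk.  A small sketch, where iface_bytes is just a name I picked and the optional argument exists only so it can be pointed at a saved copy of the file:

```shell
#!/bin/sh
# Hypothetical helper: print "<interface> <rx_bytes> <tx_bytes>" for
# every interface listed in /proc/net/dev (or in the file given as $1).
iface_bytes() {
    stats_file="${1:-/proc/net/dev}"
    # Skip the two header lines. The colon after the interface name is
    # turned into a space because older kernels print "eth0:12345"
    # with no gap between the name and the first counter.
    awk 'NR > 2 { sub(/:/, " "); print $1, $2, $10 }' "$stats_file"
}
```

Run iface_bytes twice a few seconds apart while there is I/O on the volume; the tx_bytes column should grow at a similar rate on each NIC bound to iSCSI.
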