
galacticroot

join:2004-05-17

Problem with Open Solaris + disks

I've got a server with an onboard SB780 SATA controller running OpenSolaris. I use it as a file server and to run several Xen VMs. My problems started when I recently added a VM to use as a mail server. I allocated a 40GB ZFS volume on rpool for its disk image and installed Debian on it. Everything went as expected and I got the server set up, but then the trouble began.

Usually, what will happen is that the VM will crash, accompanied by messages like this on the host machine:

Zpool will typically show problems:

If I online any offline devices, clear the errors, restart the VM, and scrub the pools, everything works fine for a while, until it happens again. It seems to happen when a moderate, but not necessarily heavy IO load occurs.
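
For reference, the recovery steps described above map onto three `zpool` subcommands: `online`, `clear`, and `scrub`. Here is a minimal Python sketch that only builds the command lines and does not run anything. The pool name `rpool` comes from the post; the device name `c3d0s0` is just a placeholder, and restarting the VM is left out.

```python
def recovery_commands(pool, offline_devices):
    """Build the zpool command lines for: online devices, clear errors, scrub."""
    # One "zpool online" per device that dropped offline.
    cmds = [["zpool", "online", pool, dev] for dev in offline_devices]
    # Reset the pool's error counters, then verify the data with a scrub.
    cmds.append(["zpool", "clear", pool])
    cmds.append(["zpool", "scrub", pool])
    return cmds


# Print the commands instead of executing them (placeholder device name).
for cmd in recovery_commands("rpool", ["c3d0s0"]):
    print(" ".join(cmd))
```

Building argument lists rather than one shell string keeps it easy to review the sequence first, or to hand each list to `subprocess.run` later.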

I am thinking this is a problem with the disk controller, not the disks. Although the first disk gets far more errors than the others, sometimes the other disks are the ones to have failures and the first disk continues working.

This is the first time a disk actually went offline. I couldn't successfully online it without rebooting.

Has anyone had this problem before? Is there likely a software fix for this, or should I just get a new SATA controller (provided it isn't actually the disks)?


beerbum
Premium
join:2000-05-06
Reading, PA

if you were running SCSI I'd suggest replacing the cable

can you move one of the disks (Disk 2) to a different controller? or even better, attach an additional drive to a different controller.

it's possible you could be overloading the controller - something easy to do with the poor sata drivers that are packaged.

galacticroot

join:2004-05-17

Now the other disk in rpool went offline and doesn't come back up even when I reboot for some reason. I probably have to power cycle the box, not just reboot.

I am willing to replace the controller with a PCIe one, but I'm not sure what Open Solaris supports well. Are there any PCIe SATA cards with >=4 ports that it supports well?

I am even willing to replace the motherboard if that will help. I just don't want to spend a lot of money on hardware that may not even work well with it.


koitsu
Premium
join:2002-07-16
Mountain View, CA


reply to galacticroot
The errors indicate LBA read failures on Disk 0 and Disk 2. The fact that the errors occurred (according to your logs) within 5 seconds of one another is a little suspicious.

I'd recommend looking at SMART stats on both of these disks to see if you can discern what's going on. However, Solaris 10 doesn't offer an API for obtaining SMART data from ATA/SATA disks -- only SCSI. So smartmontools won't help you here.

SMART statistics will help determine if the disks themselves are actually witnessing bad blocks (and remapping them), or if the controller is responsible.

But you'll have to boot into another OS (FreeBSD, Linux, Windows) to get SMART statistics with smartmontools.

If you bought both ATA/SATA disks at the same time, it's possible both have problems. Otherwise, if both Disk 0 and Disk 2 are on the same physical controller (which they appear to be), I'd recommend the following:

1) Are these internal SATA drives? If so, replace the SATA cables. You should only have to do this once. If the problem recurs, it's not the cables.

2) If the SATA drives are external via eSATA, are you using a SATA-to-eSATA adapter bracket (e.g. a cable running from the onboard SATA controller to the backplane, with an eSATA connector)? If so, get rid of it -- chances are you're exceeding the maximum SATA cable length. Buy yourself a real PCI/PCIe-based eSATA controller.

3) If your motherboard BIOS has support for AHCI, enable it in the BIOS. I have no idea if the SB780 supports AHCI or not. You should always go with AHCI if given the chance, especially on server systems.

4) Open a SunSolve case, as the problem could be Solaris having buggy support for the SB780 SATA controller (do not even for a moment think all SATA controllers are alike). My money is on this being the root cause.

zpool status indicates that the disks are literally falling off the SATA bus ("cannot open"). There's got to be other messages on your console to indicate that, not just sector read errors.
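
If it helps to watch for this over time, here is a rough Python sketch that scans saved `zpool status` text for rows in a non-ONLINE state. It assumes the usual NAME / STATE / READ / WRITE / CKSUM config table; the function name is made up for illustration, and the state list may not be exhaustive.

```python
def unhealthy_devices(status_text):
    """Return (name, state) for config rows whose state is not ONLINE."""
    # States assumed here; other ZFS versions may report additional ones.
    states = {"ONLINE", "DEGRADED", "FAULTED", "OFFLINE", "UNAVAIL", "REMOVED"}
    found = []
    in_config = False
    for line in status_text.splitlines():
        fields = line.split()
        # The device table begins right after the "NAME STATE ..." header row.
        if fields[:2] == ["NAME", "STATE"]:
            in_config = True
            continue
        if in_config and len(fields) >= 2 and fields[1] in states:
            if fields[1] != "ONLINE":
                found.append((fields[0], fields[1]))
    return found
```

Note that the pool and mirror rows show up too when they are degraded, not only the individual disks, so the disk that actually fell off the bus is typically the last entry.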

Regarding hardware compatibility:

I've never seen Solaris 10 behave flawlessly with SATA disks. At work we have Intel ICHx-based controllers with both SATA SSDs and standard drives. Our machines work fine -- no problems -- except during boot-up we see some CMD errors to the disks (iostat -e shows the same). There's nothing wrong with the SATA controllers here -- it's that Solaris is trying to issue SCSI commands to SATA disks, and the SCSI-to-SATA layer doesn't properly remap the commands (or remove ones not supported). Thus, the errors seen are false. We use ZFS and we don't have problems with the disks falling off the bus, though -- the only errors occur during boot.

If I had to make a recommendation, I'd say go with a motherboard that provides an on-board Intel ICHx controller. ICH7 would probably be best, possibly ICH9 (newer). I don't think Solaris 10 has decent support for ICH10 yet (too new).
--
Making life hard for others since 1977.
I speak for myself and not my employer/affiliates of my employer.

galacticroot

join:2004-05-17

Okay, I tried changing to AHCI mode, but the Open Solaris drivers seem to have an issue with that and can't mount the root filesystem. Returning to the original mode allows it to boot.

The hardware I find that definitely works well with Open Solaris seems to all be higher end than I can afford right now. I'm going to switch the file storage over to a separate NAS setup running OpenFiler, and replace Open Solaris with Linux on this system and run the VMs on that.


beerbum
Premium
join:2000-05-06
Reading, PA

may I ask a stupid question.. why are you using OpenSolaris to begin with? if this is a mission critical server, heck even if it's a mission optional machine, I do not recommend running OpenSolaris.

Sun's production Solaris costs just the same and is (IMO) the more reliable route to go in a business setting.


koitsu
Premium
join:2002-07-16
Mountain View, CA

reply to galacticroot
said by galacticroot:

Okay, I tried changing to AHCI mode, but the Open Solaris drivers seem to have an issue with that and can't mount the root filesystem. Returning to the original mode allows it to boot.
This isn't a "driver issue" -- it's probably that the device label (and all underlying filesystem slices) has changed. The disks probably won't be named "c3d0s0" any more, but could be "cXd0s0" where X is some new number, or could be labelled "sdX".

We have the same problem on FreeBSD, and Windows requires an entire reinstall.

ZFS should be able to cope with the device names changing.
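
To make the naming point concrete, here is a small Python sketch that splits a Solaris-style device name into its parts and checks whether only the controller number differs. It assumes the cN[tN]dN[sN] pattern (controller, optional target, disk, optional slice); the function names are invented for the example, and other forms such as x86 pN partitions are not handled.

```python
import re

# cN = controller, optional tN = target, dN = disk, optional sN = slice
_DEVICE = re.compile(r"^c(\d+)(?:t(\d+))?d(\d+)(?:s(\d+))?$")


def parse_device(name):
    """Split a name like 'c3d0s0' into its numeric components."""
    match = _DEVICE.match(name)
    if match is None:
        raise ValueError("not a cN[tN]dN[sN] device name: %r" % name)
    c, t, d, s = (int(g) if g is not None else None for g in match.groups())
    return {"controller": c, "target": t, "disk": d, "slice": s}


def controller_changed(old_name, new_name):
    """True if the two names differ in their controller number."""
    return parse_device(old_name)["controller"] != parse_device(new_name)["controller"]
```

For example, `controller_changed("c3d0s0", "c5d0s0")` is true, which is the kind of rename a boot configuration pinned to the old name would trip over.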

said by beerbum:

may I ask a stupid question.. why are you using OpenSolaris to begin with? if this is a mission critical server, heck even if it's a mission optional machine, I do not recommend running OpenSolaris.

Sun's production Solaris costs just the same and is (IMO) the more reliable route to go in a business setting.
Given that my place of employment has relied on Solaris 8 through 10 (~90% of our machines are using 10 at this point) for 5+ years now, my experience is quite the opposite. I'm talking multiple thousands of machines, all x86 (at this point), and all are mission + time-critical (all production, and are involved with VoIP + IVR; 2-3 second "stalls" or other oddities a server might encounter result in horrible caller experience, and we can't have that).

We open SunSolve cases for strange things we encounter and Sun is responsive. I'm not saying "you're wrong", I'm saying my experience is entirely different. Of course, low-level administration of devices and hardware on Solaris is significantly better on Sparc (and I do mean significantly), but x86 is standard these days.

Btrfs is the only thing on Linux that even remotely behaves like the OP's ZFS configuration. I would HIGHLY recommend the OP read the following thread (and news article!):

»Chris Mason Interview - BTRFS Founder & Lead Developer

Use whatever OS gets the job done. If that's Linux, great. If that's Solaris 10, great. If that's OS/2, I'll punch you.
--
Making life hard for others since 1977.
I speak for myself and not my employer/affiliates of my employer.


beerbum
Premium
join:2000-05-06
Reading, PA

said by koitsu:

Given that my place of employment has relied on Solaris 8 through 10 (~90% of our machines are using 10 at this point) for 5+ years now, my experience is quite the opposite.
huh??.. maybe you missed it.. the OP is running OpenSolaris (SunOS 5.11), not Solaris (SunOS 5.10).. I would not recommend anyone run OpenSolaris in a production system. Heck, no admin worth anything would recommend that.. The SATA support in Solaris is much more robust than in OpenSolaris.. While the new features and whatnot in OpenSolaris do make it into production Solaris, one should consider OpenSolaris a beta product.

Hell I'm pretty sure even Sun does not recommend using OpenSolaris in a production environment..

Use whatever OS gets the job done. If that's Linux, great. If that's Solaris 10, great. If that's OS/2, I'll punch you.
I started out (*nix) adminning on IBM RS2K's.. guess what I used on my peecee - yup OS/2 Warp.. In fact, my Rexx-fu is what helped me land my first gig as an admin..


koitsu
Premium
join:2002-07-16
Mountain View, CA

said by beerbum:

huh??.. maybe you missed it.. the OP is running OpenSolaris (SunOS 5.11), not Solaris (SunOS 5.10).. ...
Well colour me stupid. For the longest while now I've been under the impression that Solaris 10 (5.10) was in fact OpenSolaris. Good lord, there's something seriously wrong when an administrator of machines doesn't even know what the official title of his OS is.

I think I might save this thread to remind me of my stupid moments.

Thanks for clearing that up for me -- I appreciate it. (Damn you Sun...)
--
Making life hard for others since 1977.
I speak for myself and not my employer/affiliates of my employer.

galacticroot

join:2004-05-17

This was just a personal system which I built primarily as a NAS box, but also to run a couple VMs for various things. I was more or less running OpenSolaris strictly for the features of ZFS. I thought I would end up adding a lot more space and using some other features of ZFS which I never ended up using.

I will look at btrfs. It sounds like it could be very nice in the future.

For now, OpenFiler looks like it will work well enough for NAS, and Linux or BSD will be good for running the VMs.

galacticroot

join:2004-05-17

I finally managed to move the files over to an Openfiler box and I just installed Debian on the system that has OpenSolaris.

I used smartmontools to check both hard drives. Disk 1 is fine, but disk 2 has quite a few reallocated sectors (currently at 1996). I will definitely replace it if I get any more reallocated sectors. I suspect that the read failures caused by disk 2 were not being handled well by OpenSolaris (or rather the SATA driver).
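
For anyone who wants to track that counter over time, here is a minimal Python sketch that pulls raw values out of the attribute table printed by `smartctl -A`. It assumes the standard ten-column layout (ID#, ATTRIBUTE_NAME, FLAG, VALUE, WORST, THRESH, TYPE, UPDATED, WHEN_FAILED, RAW_VALUE); the function name is made up for the example.

```python
def smart_raw_values(attr_text):
    """Map SMART attribute name -> integer raw value from `smartctl -A` text."""
    values = {}
    for line in attr_text.splitlines():
        fields = line.split()
        # Attribute rows start with a numeric ID and have at least 10 columns;
        # the header and banner lines do not, so they are skipped.
        if len(fields) >= 10 and fields[0].isdigit():
            try:
                values[fields[1]] = int(fields[9])
            except ValueError:
                # Raw value is not a plain integer; ignore that row.
                pass
    return values
```

Saving the output of `smartctl -A` for the disk to a file each day and comparing `smart_raw_values(text).get("Reallocated_Sector_Ct")` between runs would show whether the count is still climbing.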

Linux seems to handle the errors correctly, although I haven't tried it out with the Xen VMs yet (I'm currently transferring the images).