Closer inspection of the logs has uncovered an increasing number of I/O errors pointing to bad sectors on the hard disk. The log below shows the problem getting worse and worse, to the point that the filesystem has now been remounted read-only. Linux does this when it hits serious errors writing to a disk (here the ext3 journal aborted), which usually means the disk is about to die, and it gives you a chance to back everything up.
Code:
[430271.152234] sd 1:0:0:0: [sda] Result: hostbyte=DID_OK driverbyte=DRIVER_TIMEOUT,SUGGEST_OK
[430271.152234] end_request: I/O error, dev sda, sector 255332439
[430271.152234] Buffer I/O error on device sda1, logical block 31916547
[430271.152234] lost page write due to I/O error on sda1
[430271.152234] sd 1:0:0:0: [sda] Result: hostbyte=DID_OK driverbyte=DRIVER_TIMEOUT,SUGGEST_OK
[430271.152234] end_request: I/O error, dev sda, sector 256955511
[430271.152234] Buffer I/O error on device sda1, logical block 32119431
[430271.152234] lost page write due to I/O error on sda1
[430271.152234] sd 1:0:0:0: [sda] Result: hostbyte=DID_OK driverbyte=DRIVER_TIMEOUT,SUGGEST_OK
[430271.152234] end_request: I/O error, dev sda, sector 256957567
[430271.152234] Buffer I/O error on device sda1, logical block 32119688
[430271.152234] lost page write due to I/O error on sda1
[430271.152234] sd 1:0:0:0: [sda] Result: hostbyte=DID_OK driverbyte=DRIVER_TIMEOUT,SUGGEST_OK
[430271.152234] end_request: I/O error, dev sda, sector 256961903
[430271.152234] Buffer I/O error on device sda1, logical block 32120230
[430271.152234] lost page write due to I/O error on sda1
[430271.152234] sd 1:0:0:0: [sda] Result: hostbyte=DID_OK driverbyte=DRIVER_TIMEOUT,SUGGEST_OK
[430271.152234] end_request: I/O error, dev sda, sector 256965887
[430271.152234] Buffer I/O error on device sda1, logical block 32120728
[430271.152234] lost page write due to I/O error on sda1
[434695.204042] ata2.00: exception Emask 0x0 SAct 0x60 SErr 0x0 action 0x6 frozen
[434695.204106] ata2.00: cmd 61/08:28:27:1a:06/00:00:0d:00:00/40 tag 5 ncq 4096 out
[434695.204107] res 40/00:00:00:00:00/00:00:00:00:00/00 Emask 0x4 (timeout)
[434695.204162] ata2.00: status: { DRDY }
[434695.204193] ata2.00: cmd 60/60:30:a7:00:e9/01:00:02:00:00/40 tag 6 ncq 180224 in
[434695.204194] res 40/00:00:00:00:00/00:00:00:00:00/00 Emask 0x4 (timeout)
[434695.204254] ata2.00: status: { DRDY }
[434695.204285] ata2: hard resetting link
[434700.564034] ata2: link is slow to respond, please be patient (ready=0)
[434705.212021] ata2: COMRESET failed (errno=-16)
[434705.212059] ata2: hard resetting link
[434710.572034] ata2: link is slow to respond, please be patient (ready=0)
[434715.220034] ata2: COMRESET failed (errno=-16)
[434715.220073] ata2: hard resetting link
[434717.948042] ata2: SATA link up 3.0 Gbps (SStatus 123 SControl 300)
[434717.964271] ata2.00: configured for UDMA/133
[434717.964271] ata2: EH complete
[435852.000319] sd 1:0:0:0: [sda] 312579695 512-byte hardware sectors (160041 MB)
[435852.000352] sd 1:0:0:0: [sda] Write Protect is off
[435852.000355] sd 1:0:0:0: [sda] Mode Sense: 00 3a 00 00
[435852.000392] sd 1:0:0:0: [sda] Write cache: enabled, read cache: enabled, doesn't support DPO or FUA
[436077.501144] sd 1:0:0:0: [sda] Result: hostbyte=DID_OK driverbyte=DRIVER_TIMEOUT,SUGGEST_OK
[436077.501144] end_request: I/O error, dev sda, sector 218842543
[436229.712047] ata2.00: exception Emask 0x0 SAct 0xf0003 SErr 0x0 action 0x6 frozen
[436229.712117] ata2.00: cmd 61/00:00:27:1e:4d/04:00:0d:00:00/40 tag 0 ncq 524288 out
[436229.712119] res 40/00:00:00:00:00/00:00:00:00:00/00 Emask 0x4 (timeout)
[436229.712174] ata2.00: status: { DRDY }
[436229.712205] ata2.00: cmd 60/00:08:bf:dd:0b/01:00:0d:00:00/40 tag 1 ncq 131072 in
[436229.712206] res 40/00:00:00:00:00/00:00:00:00:00/00 Emask 0x4 (timeout)
[436229.712260] ata2.00: status: { DRDY }
[436229.712293] ata2.00: cmd 61/08:80:27:22:4d/03:00:0d:00:00/40 tag 16 ncq 397312 out
[436229.712294] res 40/00:00:00:00:00/00:00:00:00:00/00 Emask 0x4 (timeout)
...
[436534.792120] end_request: I/O error, dev sda, sector 223251087
[436534.792120] Buffer I/O error on device sda1, logical block 27906378
[436534.792120] lost page write due to I/O error on sda1
[436534.792120] Buffer I/O error on device sda1, logical block 27906379
[436534.792120] lost page write due to I/O error on sda1
[436534.792120] Buffer I/O error on device sda1, logical block 27906380
[436534.792120] lost page write due to I/O error on sda1
[436534.792120] Buffer I/O error on device sda1, logical block 27906381
[436534.792120] lost page write due to I/O error on sda1
[436534.792120] Buffer I/O error on device sda1, logical block 27906382
[436534.792120] lost page write due to I/O error on sda1
[436534.792120] Buffer I/O error on device sda1, logical block 27906383
[436534.792120] lost page write due to I/O error on sda1
[436534.792120] Buffer I/O error on device sda1, logical block 27906384
[436534.792120] lost page write due to I/O error on sda1
[436534.792120] Buffer I/O error on device sda1, logical block 27906385
[436534.792120] lost page write due to I/O error on sda1
[436534.792120] Buffer I/O error on device sda1, logical block 27906386
[436534.792120] lost page write due to I/O error on sda1
[436534.792120] Buffer I/O error on device sda1, logical block 27906387
[436534.792120] lost page write due to I/O error on sda1
[436534.792120] sd 1:0:0:0: [sda] Result: hostbyte=DID_OK driverbyte=DRIVER_TIMEOUT,SUGGEST_OK
[436534.792120] end_request: I/O error, dev sda, sector 223252111
[436534.792120] sd 1:0:0:0: [sda] Result: hostbyte=DID_OK driverbyte=DRIVER_TIMEOUT,SUGGEST_OK
[436534.792120] end_request: I/O error, dev sda, sector 223253135
[436534.792120] sd 1:0:0:0: [sda] Result: hostbyte=DID_OK driverbyte=DRIVER_TIMEOUT,S
...
[436534.792120] sd 1:0:0:0: [sda] Result: hostbyte=DID_OK driverbyte=DRIVER_TIMEOUT,SUGGEST_OK
[436534.792120] end_request: I/O error, dev sda, sector 223252111
[436534.792120] sd 1:0:0:0: [sda] Result: hostbyte=DID_OK driverbyte=DRIVER_TIMEOUT,SUGGEST_OK
[436534.792120] end_request: I/O error, dev sda, sector 223253135
[436534.792120] sd 1:0:0:0: [sda] Result: hostbyte=DID_OK driverbyte=DRIVER_TIMEOUT,SUGGEST_OK
[436534.792120] end_request: I/O error, dev sda, sector 223254159
[436534.792120] sd 1:0:0:0: [sda] Result: hostbyte=DID_OK driverbyte=DRIVER_TIMEOUT,SUGGEST_OK
[436534.792120] end_request: I/O error, dev sda, sector 253476007
[436534.792120] sd 1:0:0:0: [sda] Result: hostbyte=DID_OK driverbyte=DRIVER_TIMEOUT,SUGGEST_OK
[436534.792120] end_request: I/O error, dev sda, sector 253493375
[436534.792120] sd 1:0:0:0: [sda] Result: hostbyte=DID_OK driverbyte=DRIVER_TIMEOUT,SUGGEST_OK
[436534.792120] end_request: I/O error, dev sda, sector 253493391
[436534.792120] sd 1:0:0:0: [sda] Result: hostbyte=DID_OK driverbyte=DRIVER_TIMEOUT,SUGGEST_OK
[436534.792120] end_request: I/O error, dev sda, sector 253493911
[436534.792120] sd 1:0:0:0: [sda] Result: hostbyte=DID_OK driverbyte=DRIVER_TIMEOUT,SUGGEST_OK
[436534.792120] end_request: I/O error, dev sda, sector 253494015
[436558.316730] Aborting journal on device sda1.
[436558.316730] ext3_abort called.
[436558.316730] EXT3-fs error (device sda1): ext3_journal_start_sb: Detected aborted journal
[436558.316730] Remounting filesystem read-only
[436558.575000] journal commit I/O error
[436558.576658] EXT3-fs error (device sda1) in ext3_reserve_inode_write: Journal has aborted
[436604.610295] sd 1:0:0:0: [sda] Result: hostbyte=DID_OK driverbyte=DRIVER_TIMEOUT,SUGGEST_OK
[436604.610303] end_request: I/O error, dev sda, sector 257197647
[436604.610340] __ratelimit: 395 messages suppressed
[436604.610343] Buffer I/O error on device sda1, logical block 32149698
[436604.610377] lost page write due to I/O error on sda1
[436604.610657] __journal_remove_journal_head: freeing b_committed_data
[436604.610662] __journal_remove_journal_head: freeing b_committed_data
[436604.610669] __journal_remove_journal_head: freeing b_frozen_data
[436604.610671] __journal_remove_journal_head: freeing b_frozen_data
[436604.610673] __journal_remove_journal_head: freeing b_committed_data
[436604.610712] journal commit I/O error
(END)
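The sector and logical block numbers in the log above can be cross-checked against each other. The following is only a rough sketch: it assumes sda1 starts at sector 63 and that the ext3 filesystem uses 4096-byte blocks (8 sectors per block). Neither value is stated anywhere in this post, but both are consistent with every sector/block pair logged above. The function names are made up for illustration.

```python
import re

# Assumptions (not stated in the post): sda1 starts at sector 63 and the
# ext3 filesystem uses 4096-byte blocks, i.e. 8 sectors of 512 bytes each.
PARTITION_START = 63
SECTORS_PER_BLOCK = 4096 // 512

SECTOR_RE = re.compile(r"end_request: I/O error, dev (\w+), sector (\d+)")
BLOCK_RE = re.compile(r"Buffer I/O error on device (\w+), logical block (\d+)")


def sector_to_block(sector):
    """Map an absolute disk sector to a filesystem block inside the partition."""
    return (sector - PARTITION_START) // SECTORS_PER_BLOCK


def summarize(log_text):
    """Return (sorted failed sectors, sorted failed logical blocks) from a dmesg dump."""
    sectors = sorted({int(m.group(2)) for m in SECTOR_RE.finditer(log_text)})
    blocks = sorted({int(m.group(2)) for m in BLOCK_RE.finditer(log_text)})
    return sectors, blocks


# Four lines copied from the log above.
sample = """\
[430271.152234] end_request: I/O error, dev sda, sector 255332439
[430271.152234] Buffer I/O error on device sda1, logical block 31916547
[430271.152234] end_request: I/O error, dev sda, sector 256955511
[430271.152234] Buffer I/O error on device sda1, logical block 32119431
"""

sectors, blocks = summarize(sample)
for s in sectors:
    # Each computed block should appear in the list of logged blocks.
    print(s, "->", sector_to_block(s), sector_to_block(s) in blocks)
```

Fed the full dmesg output instead of the sample, the same function gives a quick list of how many distinct sectors have failed so far, which makes it easier to see whether the damage is spreading.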
The databases, all user folders and game servers have been backed up to a system on my network. This includes imagehost, everything. It took ages.
When I reboot the server shortly, it will most likely go offline for a good hour or so as it attempts to repair the disk. Given it has gone read-only, there is a high chance it may not come back at all, which would leave me with a lot of work restoring everything to how it was. But we have a full backup, so it can be done.
Assuming everything goes well, the server should come back A-OK and the forums and everything else should go back online. But that just puts us back where we were last weekend: we still have a failing hard disk, and all that is bringing us back is fsck (Linux's scandisk) repairing the damage the failing mechanism is doing. Longer term, as in Tuesday, we'll have them clone the server to a new system, relaxing in the knowledge that we've made our own manual backups of stats, forums, databases etc., so no matter what, we aren't gonna lose stuff.
Cross your fingers with me, folks. This is gonna take a lot of work and a little bit of luck.
More info as it emerges.
UPDATE
Support chat log today at 9:30am
Code:
Alvin Sim: Hi. Welcome to Web24 Live Support. How may I assist you today?
Mouldy (Mr. Mouldy): Ticket ID: ZXZ-716200 i need to check on the status of my ticket, I’m hoping you can tell me an ETA on when we are back up
Alvin Sim: one moment bringing it up
Mouldy (Mr. Mouldy): our dedicated has been down since Friday nite
Alvin Sim: it's with the dedicated server guys at the moment, but let me check
Alvin Sim: I can call through for a power reset if you wish
Mouldy (Mr. Mouldy): it’s a problem with the hdd hardware
Alvin Sim: No problem, we can get the replacement in, but are you happy with us just rebooting it for now and see if it comes up? I can organize a disk replacement in the mean time
Mouldy (Mr. Mouldy): sectors of the hdd are not reading and the IO wait time is up to 90% yes ok
Alvin Sim: oh, for server downs, you can call emergency support (option 5) btw
Alvin Sim: just a fyi
Mouldy (Mr. Mouldy): phone is dead
Alvin Sim: one moment, I'll call through now
Alvin Sim: I meant for the reboot
Mouldy (Mr. Mouldy): ok thanks
Alvin Sim: ..in progress
Mouldy (Mr. Mouldy): ya il try and ping it till it comes up
Alvin Sim: I'm waiting a callback from the DC now
Alvin Sim: (still waiting callback confirming reboot)
Mouldy (Mr. Mouldy): ya im pinging and nothing so far
Alvin Sim: ok, callback's given, I'll give it a few moments to boot up
Alvin Sim: if not I'll have to organize someone to go out there
Mouldy (Mr. Mouldy): ok
Alvin Sim: is it firewalled?
Alvin Sim: just did a ping, no response
Mouldy (Mr. Mouldy): no
Mouldy (Mr. Mouldy): ya i know
Alvin Sim: I'll have to organize one of the dedicated server guys to go out there for this
Mouldy (Mr. Mouldy): thanks
Alvin Sim: I'll reply to the ticket when I hear back from them
Mouldy (Mr. Mouldy): i called Saturday morning they said they will get onto it, but nothing was done
Mouldy (Mr. Mouldy): ok thanks
Alvin Sim: did you know who you spoke to?
Alvin Sim: I would like to follow up with the person where possible. We do take server downs seriously
Mouldy (Mr. Mouldy): i cant remember he sounded sleeper then I was lol
Mouldy (Mr. Mouldy): ok, I’ll go, and wait for the ticket reply
Another update
An email from our host:
Code:
Hello Daniel,
The actual server only has one hard drive, without any RAID, which means it's not a matter of going down to replace a faulty hard drive. A rebuild is required to get it up and running. At this time, the earliest we can arrange is for tomorrow. As a moving forward plan, I would highly suggest looking at at least a Software-RAID option to prevent downtimes like this.
I'll put this ticket on hold until tomorrow when the dedicated server people action it.
Yours sincerely,
Alvin S
We will be upgrading to dual HDDs in a RAID.
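Once the box is on software RAID, a dead disk no longer takes the server down, but someone still has to notice that a mirror has lost a member. Below is a minimal sketch of such a check, assuming Linux md software RAID, which reports array status in /proc/mdstat with markers like "[2/2] [UU]" (healthy) or "[2/1] [U_]" (one member missing). The function name and the sample text are hypothetical, not taken from our server.

```python
import re

# Matches the status part of an md array entry, e.g. "[2/2] [UU]".
STATUS_RE = re.compile(r"\[(\d+)/(\d+)\]\s+\[([U_]+)\]")


def degraded_arrays(mdstat_text):
    """Return names of md arrays with fewer active members than configured."""
    degraded = []
    current = None
    for line in mdstat_text.splitlines():
        name = re.match(r"(md\w+)\s*:", line)
        if name:
            current = name.group(1)
            continue
        status = STATUS_RE.search(line)
        if status and current:
            total, active = int(status.group(1)), int(status.group(2))
            if active < total or "_" in status.group(3):
                degraded.append(current)
    return degraded


# Hypothetical /proc/mdstat contents for a two-disk mirror that lost a disk.
sample_degraded = """\
Personalities : [raid1]
md0 : active raid1 sda1[0]
      156288256 blocks [2/1] [U_]

unused devices: <none>
"""

print(degraded_arrays(sample_degraded))
```

On a real system the text would come from reading /proc/mdstat, and the check could run from cron and send an email whenever the returned list is not empty.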
yay!!!
Code:
Hi Daniel,
I have built a clean install of Debian 5 x32 with the same root password and network configuration as your server which failed.
If we can read any data off your failed drive, I will attach it as a usb disk for you. However, I can't promise that this will be possible.
I will be in touch when I have more information for you.
Please let me know if I can offer any further information or assistance.
Yours sincerely,
Levi S