AGS RESTORED (Hard Disk had entered read only then died)

Technical discussion relating to drugcrew.com & the forums

Post by [DRuG]NikT on Sat Apr 03, 2010 2:48 am

It seems the problems experienced last weekend at http://aussiegameserver.com, which were ultimately resolved by an update and a reboot, have returned.

Closer inspection of the logs has uncovered an increasing number of I/O errors pointing to bad sectors on the hard disk - see below for the log of this problem getting worse & worse, to the point that the filesystem has now been remounted read-only. That's something Linux does when writes to a disk start failing (usually a sign the disk is about to die), giving you a chance to back everything up.

Code:
[430271.152234] sd 1:0:0:0: [sda] Result: hostbyte=DID_OK driverbyte=DRIVER_TIMEOUT,SUGGEST_OK
[430271.152234] end_request: I/O error, dev sda, sector 255332439
[430271.152234] Buffer I/O error on device sda1, logical block 31916547
[430271.152234] lost page write due to I/O error on sda1
[430271.152234] sd 1:0:0:0: [sda] Result: hostbyte=DID_OK driverbyte=DRIVER_TIMEOUT,SUGGEST_OK
[430271.152234] end_request: I/O error, dev sda, sector 256955511
[430271.152234] Buffer I/O error on device sda1, logical block 32119431
[430271.152234] lost page write due to I/O error on sda1
[430271.152234] sd 1:0:0:0: [sda] Result: hostbyte=DID_OK driverbyte=DRIVER_TIMEOUT,SUGGEST_OK
[430271.152234] end_request: I/O error, dev sda, sector 256957567
[430271.152234] Buffer I/O error on device sda1, logical block 32119688
[430271.152234] lost page write due to I/O error on sda1
[430271.152234] sd 1:0:0:0: [sda] Result: hostbyte=DID_OK driverbyte=DRIVER_TIMEOUT,SUGGEST_OK
[430271.152234] end_request: I/O error, dev sda, sector 256961903
[430271.152234] Buffer I/O error on device sda1, logical block 32120230
[430271.152234] lost page write due to I/O error on sda1
[430271.152234] sd 1:0:0:0: [sda] Result: hostbyte=DID_OK driverbyte=DRIVER_TIMEOUT,SUGGEST_OK
[430271.152234] end_request: I/O error, dev sda, sector 256965887
[430271.152234] Buffer I/O error on device sda1, logical block 32120728
[430271.152234] lost page write due to I/O error on sda1
[434695.204042] ata2.00: exception Emask 0x0 SAct 0x60 SErr 0x0 action 0x6 frozen
[434695.204106] ata2.00: cmd 61/08:28:27:1a:06/00:00:0d:00:00/40 tag 5 ncq 4096 out
[434695.204107]      res 40/00:00:00:00:00/00:00:00:00:00/00 Emask 0x4 (timeout)
[434695.204162] ata2.00: status: { DRDY }
[434695.204193] ata2.00: cmd 60/60:30:a7:00:e9/01:00:02:00:00/40 tag 6 ncq 180224 in
[434695.204194]      res 40/00:00:00:00:00/00:00:00:00:00/00 Emask 0x4 (timeout)
[434695.204254] ata2.00: status: { DRDY }
[434695.204285] ata2: hard resetting link
[434700.564034] ata2: link is slow to respond, please be patient (ready=0)
[434705.212021] ata2: COMRESET failed (errno=-16)
[434705.212059] ata2: hard resetting link
[434710.572034] ata2: link is slow to respond, please be patient (ready=0)
[434715.220034] ata2: COMRESET failed (errno=-16)
[434715.220073] ata2: hard resetting link
[434717.948042] ata2: SATA link up 3.0 Gbps (SStatus 123 SControl 300)
[434717.964271] ata2.00: configured for UDMA/133
[434717.964271] ata2: EH complete
[435852.000319] sd 1:0:0:0: [sda] 312579695 512-byte hardware sectors (160041 MB)
[435852.000352] sd 1:0:0:0: [sda] Write Protect is off
[435852.000355] sd 1:0:0:0: [sda] Mode Sense: 00 3a 00 00
[435852.000392] sd 1:0:0:0: [sda] Write cache: enabled, read cache: enabled, doesn't support DPO or FUA
[436077.501144] sd 1:0:0:0: [sda] Result: hostbyte=DID_OK driverbyte=DRIVER_TIMEOUT,SUGGEST_OK
[436077.501144] end_request: I/O error, dev sda, sector 218842543
[436229.712047] ata2.00: exception Emask 0x0 SAct 0xf0003 SErr 0x0 action 0x6 frozen
[436229.712117] ata2.00: cmd 61/00:00:27:1e:4d/04:00:0d:00:00/40 tag 0 ncq 524288 out
[436229.712119]      res 40/00:00:00:00:00/00:00:00:00:00/00 Emask 0x4 (timeout)
[436229.712174] ata2.00: status: { DRDY }
[436229.712205] ata2.00: cmd 60/00:08:bf:dd:0b/01:00:0d:00:00/40 tag 1 ncq 131072 in
[436229.712206]      res 40/00:00:00:00:00/00:00:00:00:00/00 Emask 0x4 (timeout)
[436229.712260] ata2.00: status: { DRDY }
[436229.712293] ata2.00: cmd 61/08:80:27:22:4d/03:00:0d:00:00/40 tag 16 ncq 397312 out
[436229.712294]      res 40/00:00:00:00:00/00:00:00:00:00/00 Emask 0x4 (timeout)
...
[436534.792120] end_request: I/O error, dev sda, sector 223251087
[436534.792120] Buffer I/O error on device sda1, logical block 27906378
[436534.792120] lost page write due to I/O error on sda1
[436534.792120] Buffer I/O error on device sda1, logical block 27906379
[436534.792120] lost page write due to I/O error on sda1
[436534.792120] Buffer I/O error on device sda1, logical block 27906380
[436534.792120] lost page write due to I/O error on sda1
[436534.792120] Buffer I/O error on device sda1, logical block 27906381
[436534.792120] lost page write due to I/O error on sda1
[436534.792120] Buffer I/O error on device sda1, logical block 27906382
[436534.792120] lost page write due to I/O error on sda1
[436534.792120] Buffer I/O error on device sda1, logical block 27906383
[436534.792120] lost page write due to I/O error on sda1
[436534.792120] Buffer I/O error on device sda1, logical block 27906384
[436534.792120] lost page write due to I/O error on sda1
[436534.792120] Buffer I/O error on device sda1, logical block 27906385
[436534.792120] lost page write due to I/O error on sda1
[436534.792120] Buffer I/O error on device sda1, logical block 27906386
[436534.792120] lost page write due to I/O error on sda1
[436534.792120] Buffer I/O error on device sda1, logical block 27906387
[436534.792120] lost page write due to I/O error on sda1
[436534.792120] sd 1:0:0:0: [sda] Result: hostbyte=DID_OK driverbyte=DRIVER_TIMEOUT,SUGGEST_OK
[436534.792120] end_request: I/O error, dev sda, sector 223252111
[436534.792120] sd 1:0:0:0: [sda] Result: hostbyte=DID_OK driverbyte=DRIVER_TIMEOUT,SUGGEST_OK
[436534.792120] end_request: I/O error, dev sda, sector 223253135
[436534.792120] sd 1:0:0:0: [sda] Result: hostbyte=DID_OK driverbyte=DRIVER_TIMEOUT,S
...
[436534.792120] sd 1:0:0:0: [sda] Result: hostbyte=DID_OK driverbyte=DRIVER_TIMEOUT,SUGGEST_OK
[436534.792120] end_request: I/O error, dev sda, sector 223252111
[436534.792120] sd 1:0:0:0: [sda] Result: hostbyte=DID_OK driverbyte=DRIVER_TIMEOUT,SUGGEST_OK
[436534.792120] end_request: I/O error, dev sda, sector 223253135
[436534.792120] sd 1:0:0:0: [sda] Result: hostbyte=DID_OK driverbyte=DRIVER_TIMEOUT,SUGGEST_OK
[436534.792120] end_request: I/O error, dev sda, sector 223254159
[436534.792120] sd 1:0:0:0: [sda] Result: hostbyte=DID_OK driverbyte=DRIVER_TIMEOUT,SUGGEST_OK
[436534.792120] end_request: I/O error, dev sda, sector 253476007
[436534.792120] sd 1:0:0:0: [sda] Result: hostbyte=DID_OK driverbyte=DRIVER_TIMEOUT,SUGGEST_OK
[436534.792120] end_request: I/O error, dev sda, sector 253493375
[436534.792120] sd 1:0:0:0: [sda] Result: hostbyte=DID_OK driverbyte=DRIVER_TIMEOUT,SUGGEST_OK
[436534.792120] end_request: I/O error, dev sda, sector 253493391
[436534.792120] sd 1:0:0:0: [sda] Result: hostbyte=DID_OK driverbyte=DRIVER_TIMEOUT,SUGGEST_OK
[436534.792120] end_request: I/O error, dev sda, sector 253493911
[436534.792120] sd 1:0:0:0: [sda] Result: hostbyte=DID_OK driverbyte=DRIVER_TIMEOUT,SUGGEST_OK
[436534.792120] end_request: I/O error, dev sda, sector 253494015
[436558.316730] Aborting journal on device sda1.
[436558.316730] ext3_abort called.
[436558.316730] EXT3-fs error (device sda1): ext3_journal_start_sb: Detected aborted journal
[436558.316730] Remounting filesystem read-only
[436558.575000] journal commit I/O error
[436558.576658] EXT3-fs error (device sda1) in ext3_reserve_inode_write: Journal has aborted
[436604.610295] sd 1:0:0:0: [sda] Result: hostbyte=DID_OK driverbyte=DRIVER_TIMEOUT,SUGGEST_OK
[436604.610303] end_request: I/O error, dev sda, sector 257197647
[436604.610340] __ratelimit: 395 messages suppressed
[436604.610343] Buffer I/O error on device sda1, logical block 32149698
[436604.610377] lost page write due to I/O error on sda1
[436604.610657] __journal_remove_journal_head: freeing b_committed_data
[436604.610662] __journal_remove_journal_head: freeing b_committed_data
[436604.610669] __journal_remove_journal_head: freeing b_frozen_data
[436604.610671] __journal_remove_journal_head: freeing b_frozen_data
[436604.610673] __journal_remove_journal_head: freeing b_committed_data
[436604.610712] journal commit I/O error
(END)
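
For anyone who wants to dig through a log like this, here is a rough Python sketch (not something that was run on the server) that pulls the failing sector numbers out of the kernel messages and maps each one to the ext3 logical block reported next to it. The partition start of sector 63 and the 4096-byte block size are assumptions, not something stated in the post, although they do reproduce the sector/block pairs in the paste. The function names are made up for illustration.

```python
import re

# Patterns for the two kernel messages that appear in the log above.
SECTOR_RE = re.compile(r"end_request: I/O error, dev (\w+), sector (\d+)")
BLOCK_RE = re.compile(r"Buffer I/O error on device (\w+), logical block (\d+)")

# ASSUMPTIONS (not stated in the post): sda1 starts at sector 63
# and the filesystem uses 4096-byte blocks on 512-byte sectors.
PARTITION_START = 63
SECTORS_PER_BLOCK = 4096 // 512


def failing_sectors(log_text):
    """Return the sorted, de-duplicated sectors reported as I/O errors."""
    return sorted({int(m.group(2)) for m in SECTOR_RE.finditer(log_text)})


def sector_to_block(sector, start=PARTITION_START, per_block=SECTORS_PER_BLOCK):
    """Map an absolute disk sector to a filesystem block inside the partition."""
    return (sector - start) // per_block


# Four lines copied from the log above.
sample = """\
[430271.152234] end_request: I/O error, dev sda, sector 255332439
[430271.152234] Buffer I/O error on device sda1, logical block 31916547
[430271.152234] end_request: I/O error, dev sda, sector 256955511
[430271.152234] Buffer I/O error on device sda1, logical block 32119431
"""

sectors = failing_sectors(sample)
blocks = [int(m.group(2)) for m in BLOCK_RE.finditer(sample)]
for s in sectors:
    print(s, "->", sector_to_block(s))
```

Feeding the whole dmesg output to `failing_sectors` gives a quick view of how widely the bad sectors are spread across the disk.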


The databases, all user folders & game servers have been backed up to a system on my network. This includes imagehost - everything - and it took ages.

When I reboot the server shortly, it will most likely go offline for a good hour or so as it attempts to repair the disk. Given it has gone read-only, there is a high chance it may not come back at all, which would leave me with a lot of work restoring everything to how it was - but we have a full backup, so it can be done.

Assuming everything goes well, we should see the server come back A-OK - the forums should go back online, everything... but that just puts us back where we were last weekend. We'd still have a failing hard disk - all that is bringing us back is fsck (Linux's scandisk) repairing the damage the failing mechanism is doing. Long term, as in Tuesday, we'll have them clone the server off to a new system, relaxing in the knowledge that we've made our own manual backups of stats, forums, databases etc., so no matter what, we aren't gonna lose stuff.

Cross your fingers with me folks.. this is gonna take a lot of work and a little bit of luck.

More info as it emerges.

UPDATE

Support chat log today at 9:30am

Code:
Alvin Sim: Hi. Welcome to Web24 Live Support. How may I assist you today?
Mouldy (Mr. Mouldy): Ticket ID: ZXZ-716200 i need to check on the status of my ticket, I’m hoping you can tell me an ETA on when we are back up
Alvin Sim: one moment bringing it up
Mouldy (Mr. Mouldy): our dedicated has been down since Friday nite
Alvin Sim: it's with the dedicated server guys at the moment, but let me check
Alvin Sim: I can call through for a power reset if you wish
Mouldy (Mr. Mouldy): it’s a problem with the hdd hardware
Alvin Sim: No problem, we can get the replacement in, but are you happy with us just rebooting it for now and see if it comes up? I can organize a disk replacement in the mean time
Mouldy (Mr. Mouldy): sectors of the hdd are not reading and the IO wait time is up to 90% yes ok
Alvin Sim: oh, for server downs, you can call emergency support (option 5) btw
Alvin Sim: just a fyi
Mouldy (Mr. Mouldy): phone is dead
Alvin Sim: one moment, I'll call through now
Alvin Sim: I meant for the reboot
Mouldy (Mr. Mouldy): ok thanks
Alvin Sim: ..in progress
Mouldy (Mr. Mouldy): ya il try and ping it till it comes up
Alvin Sim: I'm waiting a callback from the DC now
Alvin Sim: (still waiting callback confirming reboot)
Mouldy (Mr. Mouldy): ya im pinging and nothing so far
Alvin Sim: ok, callback's given, I'll give it a few moments to boot up
Alvin Sim: if not I'll have to organize someone to go out there
Mouldy (Mr. Mouldy): ok
Alvin Sim: is it firewalled?
Alvin Sim: just did a ping, no response
Mouldy (Mr. Mouldy): no
Mouldy (Mr. Mouldy): ya i know
Alvin Sim: I'll have to organize one of the dedicated server guys to go out there for this
Mouldy (Mr. Mouldy): thanks
Alvin Sim: I'll reply to the ticket when I hear back from them
Mouldy (Mr. Mouldy): i called Saturday morning they said they will get onto it, but nothing was done
Mouldy (Mr. Mouldy): ok thanks
Alvin Sim: did you know who you spoke to?
Alvin Sim: I would like to follow up with the person where possible. We do take server downs seriously
Mouldy (Mr. Mouldy): i cant remember he sounded sleeper then I was lol
Mouldy (Mr. Mouldy): ok, I’ll go, and wait for the ticket reply




Another update

An email from our host

Code:
Hello Daniel,

The actual server only has one hard drive, without any RAID, which means it's not a matter of going down to replace a faulty hard drive. A rebuild is required to get it up and running. At this time, the earliest we can arrange is for tomorrow. As a moving forward plan, I would highly suggest looking at at least a Software-RAID option to prevent downtimes like this.

I'll put this ticket on hold until tomorrow when the dedicated server people action it.


Yours sincerely,
Alvin S


We will be upgrading to dual HDDs in a RAID.



yay!!!

Code:
Hi Daniel,

I have built a clean install of Debian 5 x32 with the same root password and network configuration as your server which failed.

If we can read any data off your failed drive, I will attach it as a USB disk for you. However, I can't promise that this will be possible.

I will be in touch when I have more information for you.

Please let me know if I can offer any further information or assistance.


Yours sincerely,

Levi S


"But my head's all messed up, so you better driive brother"

[DRuG]NikT
[DRuG] cofounder & your host

Status:
Check out the downloads and members areas on drugcrew.com

[DRuG] cofounder & your host
[DRuG] coleader
[DRuG] member
DRuG server admin
[AGS] member
]DR[ member
 
Posts: 2532
Joined: Sat Jul 28, 2007 10:39 am
Location: Melbourne, Victoria, Australia


Re: AGS OUTAGE: Hard disk in server has entered read only

Post by [DRuG]Mortal on Sat Apr 03, 2010 5:28 am

oh noes! Is this why TS is down too? :o

[DRuG]Mortal
[DRuG] member

Status:
I can't believe we will have to wait until 2014 to play GTA V on PC.

[DRuG] member
 
Posts: 298
Joined: Sun Aug 05, 2007 9:22 am
Location: United States


Re: AGS OUTAGE: Hard disk in server has entered read only

Post by [DRuG]NikT on Sat Apr 03, 2010 5:36 am

Yep




Re: AGS OUTAGE: Hard disk in server has entered read only

Post by [DRuG]Mortal on Sat Apr 03, 2010 7:11 am

Good luck man, that sounds like a huge mess :/ will watch for news.


Re: AGS OUTAGE: Hard disk in server has entered read only

Post by Phoenix on Sat Apr 03, 2010 10:46 am

Ah dang, does that also affect the LE, DR and M4D servers? 'Cause they're all down too.
Thanks for the heads up Nik.

Oh and I blame Comatose :P He said the server would be shut down as an April Fools joke, and this happens xD
Last edited by Phoenix on Sat Apr 03, 2010 11:25 am, edited 1 time in total.

Phoenix
[LE] member
[LE] member
DRuG server admin
 
Posts: 9
Joined: Sun Feb 07, 2010 7:32 pm


Re: AGS OUTAGE: Hard disk in server has entered read only

Post by AFX on Sat Apr 03, 2010 10:48 am

Thanks for the info Nik, I hope you don't have to go through too much work :P

AFX
[LE] member
[LE] member
 
Posts: 17
Joined: Sun Mar 29, 2009 10:21 pm
Location: Batemans Bay


Re: AGS OUTAGE: Hard disk in server has entered read only

Post by [DRuG]NikT on Sat Apr 03, 2010 4:16 pm

All servers, data - stats, files etc etc (including LE) were backed up to my link immediately prior to me taking down the server... so we should be OK.

Looking at my console now after a long sleep, I'm afraid to say we still have timeouts - it appears the server indeed hasn't come back up - so we need to raise a support ticket to get them to take a look.

Big thanks also go out to Sk0t - the only grey area for me was "how do you back up an SQL database server when the disk is in read-only".. he was able to tell us exactly where the raw files for the database all sat, and also offered to assist when we're restoring. For anyone wondering about my reasoning for signing up specific members, this is another example of people's potential living up to my expectations/predictions.. Sk0t, you're one skilled mofo.. thanks!

Now for a little history lesson...

In AGS and many other clans, hard disk/server death is something they've come to live with - an unavoidable fault that occurs once every 6 or so months. Usually all data is lost & the crew start again from scratch. I would like to bring your attention to the fact that DRuG's been around now for ~9 years... and we still have all our data. Hopefully this outage will be the first for AGS that is followed by a full restoration. Still - we have to wait for Web24 to acknowledge the fault, which may not be until Tuesday. In the short term, perhaps go play on FDM samp in the UK, I'm sure most of the crew will be on there. My teamspeak is also back up *albeit branded with Mouldy's mum's WOW clan shizzle* - see you there... nik.ath.cx:9987.




Re: AGS OUTAGE: Hard disk in server has entered read only

Post by sk0t on Sat Apr 03, 2010 5:23 pm

cd /var/lib/mysql :)

Thanks for the mention, I'm glad the backups turned out alright. Like I'm sure I've said, if you need any help let me know :P
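
As a rough illustration of what "grab the raw files" can look like, here is a minimal Python sketch that packs a directory such as /var/lib/mysql into a tarball somewhere else. It only reads from the source, so it works even when the source filesystem is mounted read-only. The function names and the destination path in the comment are hypothetical, and this is not necessarily how the backup in this thread was actually taken. A raw copy like this is generally safest when mysqld is stopped or at least not writing to the tables.

```python
import tarfile
from pathlib import Path


def archive_directory(src_dir, dest_tar):
    """Pack src_dir into a gzip tarball at dest_tar; src_dir is only read."""
    src = Path(src_dir)
    with tarfile.open(dest_tar, "w:gz") as tar:
        # arcname keeps just the final directory name inside the archive.
        tar.add(str(src), arcname=src.name)
    return dest_tar


def list_archive(dest_tar):
    """Return the sorted member names, as a quick sanity check of the backup."""
    with tarfile.open(dest_tar, "r:gz") as tar:
        return sorted(tar.getnames())


# Example call (hypothetical destination on a healthy disk or network mount):
# archive_directory("/var/lib/mysql", "/mnt/backup/mysql-raw.tar.gz")
```

Listing the archive afterwards with `list_archive` is a cheap way to confirm the table files made it across before the server gets rebooted.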

sk0t
]DR[ member

Status:
Wait, sk0t here... No WAY!

]DR[ member
 
Posts: 26
Joined: Thu May 07, 2009 3:01 pm


Re: AGS OUTAGE: Hard disk in server has entered read only

Post by Tyoson on Sat Apr 03, 2010 5:54 pm

Bugger, that sucks. Good on you NikT and sk0t for fixing it on a long weekend :) The M4D server has been down 3 days this month, I should ask for 30 cents back :P Naaaah, shit happens.

Keep up the good work and Happy Easter. I'm away till Monday.

Tyoson
[M4D] member
[M4D] member
 
Posts: 6
Joined: Sat Apr 03, 2010 4:54 pm


Re: AGS OUTAGE: Hard disk in server has entered read only

Post by [DRuG]Mortal on Sun Apr 04, 2010 6:39 am

wOOt TS works again indeed, im on there alone, fapping and that.

Go Sk0t gogogo

