Details on Asset Server Issue

Friday, December 29th, 2006 at 12:08 PM by: Pathfinder Linden

Thursday morning, Linden Lab operations did short-notice maintenance on the cluster of machines that we refer to as the “asset server.” During this downtime, we replaced a bad battery in one of the nodes in the cluster. To do this, we removed the node from the cluster, replaced the battery, and set the node to rejoin the cluster. When the node rejoins the cluster, it has to go through a restriping process that takes a few days to complete.

Thursday night, around 11:00 PM PST, we started to see slowness on the asset server and hear reports of problems rezzing objects. Investigation showed some errors in the logs on the asset server, and we immediately notified the hardware manufacturer. Once everyone was off the grid, we decided to see if a clean restart of the asset server would return the system to a clean state, but this failed. Further diagnosis of the problem showed that while the cluster was restriping from the node maintenance of Thursday morning, a different node in the cluster experienced multiple drive failures. This made the restriping process fail and hung the cluster in a state where it could not handle the load of serving assets for Second Life. The hardware manufacturer logged into the asset server and worked with Linden Lab’s operations group throughout the night to bring the system back online at 10:45 AM Friday morning.

Whenever the grid has trauma, there is a possibility of content loss, as information in transit may not completely save to the asset servers. This recent event was no exception. However, of the files that did make it to the asset server, only 4 were damaged due to the problems we had with the hardware.

Linden Lab is continuing to work with the hardware manufacturer to see how we can improve the robustness and redundancy of the cluster, so that this kind of failure will not happen again.

[12:28pm PST] A clarification on the “bad battery.” This was not a UPS-type battery, but rather an internal battery that supported a battery-backed NVRAM card.

[2:37pm PST] A clarification on the term “node.”  By node, we are referring to a component system (computer with drives) within the cluster.

(BTW, Thanks to the Residents who asked for these clarifications in the comments.  Helps me be as clear as possible in this update!)

182 Responses to “Details on Asset Server Issue”

  1. 1 Peekay Semyorka Says:

    Thanks for the report.

  2. 2 Chase Cournoyer Says:

    Yes, definatly thank you for clearing that up.

    Hate to say this now…its taking a crap again. Things poping up all in-world, scripts can’t save, cannot rez, it might just be the load of people re-rezzing things, but I would check into it if there is a Linden reading this blog.

  3. 3 Josh Says:

    come on…wth man
    too many users??

  4. 4 Kitty Tully Says:

    I appreciate the problems you are having. But. . . . my sim still won’t show on the map; my clothing will not download; Search doesn’t work; my friends list is empty; my group list is empty . . . could you maybe try that restart again? Or do something different?

  5. 5 Thorsen Lightcloud Says:

    Does this cause an inability to attach or rez some objects? I’m having that problem now.

  6. 6 Jay Zon Says:

    Ok thx ,
    I just been ingame..
    Quick Report:
    * No money Loading
    * No gruop Chat Loading
    * No Profiles Loading
    * Talk in main chat Lags
    * When i relogged , i lsot my Pre-Login Stuff.

    No Complaint , jsut iinform (you know anywayz , but ok..)

    Grtz ~JayT~

  7. 7 Jason Hashimoto Says:

    Thanks for the update ;)

  8. 8 lucifer ludd Says:

    Yes, there are still problems…
    1. Search is down.. my search just keeps searching forever with no results.
    2. Can not teleport anywhere
    3. map does not load

  9. 9 Lydia Cremorne Says:

    Thank you so much for letting us know what’s been going on. Really appreciated :)

  10. 10 Mike Leidesdorff Says:

    Tks for the info and working in the computer busness i know one law it is the law of murphy (when someting suck it suck good)

  11. 11 Weston Antwerp Says:

    Since the system became available again all of my $L are gone!
    Is this a known issue?

  12. 12 Catherine Cotton Says:

    Thanks for the report but I experienced teleport issues right off the bat.

  13. 13 Jesse Malthus Says:

    I’m sorry that this happened to you guys, random hardware failure is Not Fun At All.
    However, why do your asset servers have batteries (re Scheduled Emergency Maintenance)? Are you referring to the UPS on the node or something else?

  14. 14 Sticky Says:

    ummmmm where to start :(
    Crashed twice opening my inventory ( and i’ve only been on 5 mins)
    My AV is invisable
    No money
    Lots of RED lag
    Do I have to report all of these in a bug report? Hmmmmmmmmm
    I’ll come back when i find any more (thats if i can stay logged on long enough to find any)

  15. 15 Dovic Battery Says:

    i do hope that battery is Duracel ;) hee.. na seriously i just tried to rez a new item i purchased and i got the [12:04] Attempt to rez an object failed. message and so far it has failed to show in my inventory.. is it gone for good ?? ..or maybe it’s just frozen in time waiting for a backlog of stuff to be returned to others before i get it back ?

  16. 16 Vivianne Draper Says:

    Wow awesome report. Thank you.

  17. 17 Ephemeral Flan Says:

    I wonder if the damaged files were some ones inventory?

  18. 18 Kathrine Redgrave Says:

    I found the items that went missing last night, but it is still not letting me rez…

  19. 19 Jaime Says:

    Thanks LL for the info. But being a bit tech challenged what is the “asset” server? and what are “nodes?”

  20. 20 raymond figtree Says:

    Ok, but the Grid is reopened and items are still not rezzing. Also I lose any non copy items I rz. Be careful folks!

  21. 21 Sticky Says:

    Nodes is what you blow when you get a cold ;)

  22. 22 Strife Onizuka Says:

    maybe an automated system needs to be in place to notify you when there is a critical change to the cluster logfile.
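As a rough illustration of the log-watching idea in the comment above, here is a minimal Python sketch. The patterns and sample log lines are hypothetical, since the post does not show the asset server's actual log format, and `notify` is only a placeholder for whatever paging mechanism an operations team would really use.

```python
import re

# Hypothetical patterns: the real asset-server log format is not given in the post.
CRITICAL_PATTERNS = [
    re.compile(r"drive .* failed", re.IGNORECASE),
    re.compile(r"restripe .* (error|aborted)", re.IGNORECASE),
]


def find_critical_lines(lines):
    """Return the log lines that match any critical pattern."""
    return [line for line in lines if any(p.search(line) for p in CRITICAL_PATTERNS)]


def notify(line):
    # Placeholder: a real system would page the on-call operator here.
    print(f"ALERT: {line.strip()}")


if __name__ == "__main__":
    # Invented sample lines, for illustration only.
    sample = [
        "node3 INFO restripe 41% complete",
        "node5 ERROR drive 7 failed",
        "node5 ERROR drive 9 failed",
    ]
    for hit in find_critical_lines(sample):
        notify(hit)
```

A production version would follow the log file as it grows rather than scan a fixed list, but the matching step would look much the same.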

  23. 23 Pathfinder Linden Says:

    The Grid is still stabilizing after being turned back on, while thousands of Residents are logging in at once. We had such a hiccup a short while ago, not related to the asset server at all. The Grid is now smoothing out.

  24. 24 Pathfinder Linden Says:

    @Jesse Malthus

    “I’m sorry that this happened to you guys, random hardware failure is Not Fun At All.
    However, why do your asset servers have batteries (re Scheduled Emergency Maintenance)? Are you referring to the UPS on the node or something else?”

    It was an internal battery that supported the battery-backed NVRAM card. Sorry for the confusion (I’m updating the blog entry to clarify).

  25. 25 Bigus Hock Says:

    Veritas Backup Exec or similar, use in day disk backups to offsite server & only syn changes. There’s stuff that does that (good for protecting against multiple drive failures)………….

  26. 26 JohnFerdinant Richmond Says:

    Checked today the Linden Blog, figured out that it was not only my subjective findings, but SecondLife went offline in December of almost every three days.

    Besite the fact of losing money and time, it brings a lot of frustration to those who try to set things up and start a running business. Clearing things doesn’t help here further, there will be always some reason why they have to take the servers down.

    To clear it a bit: the average electricity bill of the most of us might be some 50 - 60 USD / Month. Its a small part of that what a lot of us pay to keep their the business system and the programs runnning at SecondLife.com. Imagine your reaction when your electricity deliverer company took electricity of the grid every three days for maintainance or other problems, with or without notification. I think the reaction does not need any further clearification.

    For a lot of us, our stakes are already to high to leave the game. But I hope that SecondLife starts to understand that the situation as of now is unaffordable!

    JohnFerdinant Richmond.

  27. 27 Broccoli Curry Says:

    A critical part of Second Life’s underlying technology… all that trouble because of a FLAT BATTERY?

    Good grief.

    Broccoli

  28. 28 Detonator Scofield Says:

    Good job..

    Had the same issue twice with a system with 3Ware/AMCC controllers, multiple clusterdrives failing at once destroying entire arraycontents.. Switched to newer 3Ware-controllers, replaced Maxtor DM drives with Seagate (this saved most headaches AFAIK) and now have 2 hotspares per controller.. Not a single drive has failed during the past 10 months.

    CTO top-15,000 website @ 36 machines (all 3Ware/AMCC).

  29. 29 Marcus Reisman Says:

    Ahh good times. I wrote a bunch of the network and journaling kernel code for the Isilon IQ so I feel your pain. Hope it’s treating you well otherwise.

  30. 30 Jesse Malthus Says:

    Thanks Pathfinder. A few more questions (if you don’t mind): What 4 files were damaged? How important were they? Did you get them fixed?
    Sorry, I’m just full of questions today :D

  31. 31 Psion Says:

    Thanks for the update Linden Labs. Somehow even the failures are fascinating.

  32. 32 Ian Banach Says:

    in anyway, what about indemnizations by sell not selled, traffic lost, and so on during grid down?

  33. 33 Ledeanna Evans Says:

    Seems the grid is still unstable right now inventory transfers and such are not working right (this was like 15 min ago) so I had to cancel my 2 classes and reschedule for tomorrow hopefully the grid will be stable before my classes tonight.

  34. 34 Erbo Says:

    Ah, that’s the kind of battery I thought you were talking about…one of those little lithium coin-batteries that many people think of, somewhat inaccurately, as the “CMOS battery.” Definitely important; without it, the node would lose its BIOS configuration next time it lost power…

    Is there any possibility of implementing a fail-over system for these nodes, having some additional nodes on hot standby that can take over in the event of a fault on the primary nodes?

    Jaime: The “asset server” is the server, or set of servers, that holds all the objects, textures, scripts, and other things that make up the “world” and all our inventories (which are collectively referred to as “assets,” a term from game development) and supplies them when they’re needed by the sim servers or the client. When you’re dealing with a “cluster” of servers that works together on a common task, the individual servers that make up the cluster are referred to as “nodes” of the cluster.

  35. 35 Geo Claxton Says:

    So in other words, there was an opportunity to warn us that the grid would be taken down last night?

    It isn’t the failure that upset me…it was the lack of warning before everyone was booted from the grid. Would 1 or 2 minutes have hurt?

  36. 36 Todd Baxter Says:

    Thnx for the info. I’m still having problems rezzing object frm my objects list. Made a table, put it in my inventory, now I can’t get it back to finish working on it, and my search isn’t working.

  37. 37 Todd Baxter Says:

    Also, wasted money trying to upload textures that won’t show up.

  38. 38 Harley Mosuke Says:

    Alright, lez rock n roll again. Thanks for the staz update, is much appeciated.

  39. 39 Sliv Says:

    Hey Dudes,

    thank you very much for you report and doin the keep_up_to_date entries on the blog. And thank you very very much for you hard work. - So it was a night job for you and daytime here in germany. I don’t know how often we click “reload” on your blog to see, if we can log in again … your hard night, our hard day ;-)

    Thank you & Cheers,

    Sliv

    btw. are there any plans for a UniversalBinary Version for my Intel-Mac? :)

  40. 40 Andromeda Quonset Says:

    I can fully appreciate the problems involved in shutting down a machine to replace a battery for NVRAM, and the delays in getting the drive re-striped. While I don’t know the details of the particular NVRAM-card involved, it seems to me that it behooves you and the hardware manufacturer to come up with a way to replace the said battery while the machine is hot, then you don’t have the delays in waiting for the re-striping. When a machine is running, the NVRAM shouldn’t be drawing power from the battery, anyway.

  41. 41 Rex Cronon Says:

    Hello
    I have some bad new. I am logged in now, and each time i try to rezz something from my inventory to the ground I get “[12:41] Attempt to rez an object failed.”
    I tried this on the “Mature Sandbox” and now on SWT(sandbox weapon testing), and i get the same message. I am NOT the only one, others have the same problem. In both sims the number of objects was under 2000.

  42. 42 Chrystal Fontaine Says:

    Ok … i understand u have had problems - but i mean, this is startin 2 tick me off a lil.
    I log on and what do i get? Lag lag lag … THEN theres no L$ in my account, im walkin around butt naked (thats when i can actually walk that is!) theres nothing in my invent, cant TP, and to top it off chat lags and i crash around every 4/5 mins or so.

    please please PLEASEEEEEE can u sort this!?!?!?

    Fankoo :D

    (very stressed out 2day so i apologise >.

  43. 43 Soj Says:

    dont know if it is important, but as I sit here reading the blog, before I log in… there are repeated and quick messages and screen flickers saying “Reloading.”

    This has happened every few seconds. Is it a bad sign?

  44. 44 Marcus Reisman Says:

    (Hmmm, first comment didn’t show up. Maybe I have to be registered.)

    Multiple hardware failures, good times…

    I wrote a bunch of the network and journaling kernel code for the Isilon IQ.
    Hope it’s treating you well otherwise. Cool to know my code is behind the scenes at LL

    Erbo: It’s not the CMOS battery, it’s the battery for the NVRAM card for the journaling system on the storage cluster. Doing journaling on a flash drive is much faster (and theoretically less failure-prone) than writing transactions to disk. When you write a file to the storage cluster, it gets written to the NVRAM on all the boxes in the cluster the file will live on (plus parity) and then later is written to the actual drives asynchronously.

    Not sure why it failed; but QC on these things isn’t entirely foolproof.
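The journaling scheme described in this comment can be pictured with a toy Python sketch. It is not the vendor's implementation: the class and function names are made up for illustration, a Python list stands in for the NVRAM card, a dict stands in for the drives, and parity is left out entirely.

```python
class JournaledNode:
    """Toy storage node: a fast journal in front of a slow disk (hypothetical model)."""

    def __init__(self):
        self.journal = []   # stands in for the battery-backed NVRAM card
        self.disk = {}      # stands in for the node's drives

    def write(self, name, data):
        # The write counts as accepted once it is recorded in the journal.
        self.journal.append((name, data))

    def flush(self):
        # Later (asynchronously in a real system) journal entries reach the disk.
        for name, data in self.journal:
            self.disk[name] = data
        self.journal.clear()

    def read(self, name):
        # The newest journal entry wins over whatever is already on disk.
        for entry_name, data in reversed(self.journal):
            if entry_name == name:
                return data
        return self.disk.get(name)


def cluster_write(nodes, name, data):
    """Journal the write on every node that will hold the file."""
    for node in nodes:
        node.write(name, data)
```

In this model, anything still sitting in `journal` exists only there until `flush` runs, which is why the real card needs a battery: the journal has to survive a power loss.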

  45. 45 Pathfinder Linden Says:

    @Jesse Malthus

    “Thanks Pathfinder. A few more questions (if you don’t mind): What 4 files were damaged? How important were they? Did you get them fixed?
    Sorry, I’m just full of questions today :D”

    Those 4 files were assets. 3 inworld objects and 1 old simstate. Statistically incredibly small given the total number of objects/simstates (about 300 million), but obviously important to the Resident(s) who were affected by their loss. They are corrupted, and probably not recoverable.
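To put "statistically incredibly small" into numbers, here is a short calculation using only the two figures quoted in the reply above (4 damaged files, about 300 million objects/simstates). The 300 million total is approximate, so the result is an order-of-magnitude figure.

```python
damaged = 4
total = 300_000_000  # "about 300 million" objects/simstates, per the reply above

fraction = damaged / total
print(f"{fraction:.2e}")             # prints 1.33e-08
print(f"1 in {total // damaged:,}")  # prints 1 in 75,000,000
```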

  46. 46 Spank Lovell Says:

    Guys, I appreciate the report and the work both you and your hardware suppliers have put into this failure.
    I’ve been in a similar situation so I know how stressful this can be.

    Happy new year guys :)

  47. 47 Vincento Vanguard Says:

    Still having problems? *click* for Kirra! She may need it.

  48. 48 Shirley Marquez Says:

    They said that the storage array is doing a re-striping process now, and it will take several days. (In other words, it’s copying data from the existing drives to the new ones to reestablish data redundancy.) That likely means that asset server performance will be below normal until the process completes. Unfortunately, there isn’t anything LL can do to speed it up, other than kick us all out of the grid. Letting the asset server do nothing but data replication would likely speed up the process, but I will take a few days of somewhat slower than normal rezzing rather than a couple of days of no SL at all.
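Why a second failure during a rebuild is so dangerous can be shown with a small Python sketch. It assumes plain single-parity (XOR) striping, which the post does not confirm is what the asset cluster uses. The function names and sample data are made up for illustration.

```python
from functools import reduce


def xor_blocks(blocks):
    """XOR equal-length byte blocks together, position by position."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))


def rebuild_missing(surviving_blocks):
    """Recover the single missing block of a stripe from all the survivors."""
    return xor_blocks(surviving_blocks)


# One stripe: three data blocks plus one parity block.
data = [b"rezz", b"prim", b"text"]
parity = xor_blocks(data)

# Lose exactly one block: the survivors plus parity give it back.
recovered = rebuild_missing([data[0], data[2], parity])
assert recovered == data[1]
```

With two blocks of the same stripe missing, XOR-ing the survivors yields only the combination of the two lost blocks, not either one on its own. Under this simple scheme, further drive failures while a stripe is still being rebuilt can therefore leave data unrecoverable.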

  49. 49 Fred Yosuke Says:

    Are you considering ways for residents to back up important information to their own machines and import it back to SL, so that in case of catastrophic failure of your storage, all would not be lost?

  50. 50 Bucky Barkley Says:

    Unbelievable. A predictable failure that countless startups have had to grapple with. Where is the emergency plan? SL needs to get professional.. it is not a groovy experiment with 100,000 users any longer.

    This is not supposed to happen with a professionally run service with 2M+ residents. It reminds me of the Tanya Harding incident at the Olympics (the broken shoelace).

    Does LL want to be taken seriously by business? What are they going to do to get to that point?
    It’s not going to happen with posts that say “give your First Life a chance!” or “Holiday Support Hours” when there are none to be had.

  51. 51 Pathfinder Linden Says:

    @Geo Claxton

    “So in other words, there was an opportunity to warn us that the grid would be taken down last night? It isn’t the failure that upset me…it was the lack of warning before everyone was booted from the grid. Would 1 or 2 minutes have hurt?”

    We will always warn everyone as far ahead of time as possible when we’re going to log all Residents out of Second Life. However, in this case (11:13pm PST on Thursday), it was an immediate emergency. Taking 1 or 2 minutes more could have had catastrophic consequences, so the switch was flipped right away. I know it was jarring and a shock, and I’m sorry we had to do it. But we simply couldn’t wait given the circumstances.

  52. 52 Dillon Morenz Says:

    Thanks for the clarification Pathfinder. Must admit, most information prior to this post was a bit vague and some were reading all kinds of ominous things into it, but it’s great to get a clear understanding of circumstances in the end. It also helps us appreciate the effort you all put in at such ungodly hours..and at this time of year too. Great to have the grid back. =)

  53. 53 Boss Melnitz Says:

    The outage description Haiku:

    So many issues,
    Brought on by Energizers.
    I hate that bunny!

  54. 54 Sakura Hilra Says:

    [Mike Leidesdorff Says:
    December 29th, 2006 at 12:17 PM PST
    Tks for the info and working in the computer busness i know one law it is the law of murphy (when someting suck it suck good)]

    LOL, never heard of such law. Anyway, I am still invisible. It is very tough for me since I am also a dancer. In the mean time, I tell my customers to use their imagination that I am a very beautiful dancer…get horny such and such. Oh well, there goes my career down the toilet.

  55. 55 Marcus Reisman Says:

    Testing 123.. none of my comments show up :(

  56. 56 Peekay Semyorka Says:

    @Broccoli: Pathfinder can correct me if I’m wrong, but the problem’s root cause isn’t the nvram battery, but multiple drive failures which unfortunately happened while (another) node was rebuilding.

    Technicalities aside, all the recent problems mean LL has much work to do in repairing the damage to the SL community’s faith and patience. Needless to say there is a strong negative sentiment among SL users right now, and justifyably so.

  57. 57 Magna Peart Says:

    This wouldn’t have to do with the teleporting problems too? I can’t even cross sims now. I shouldn’t have to relog to get anywhere. When are you going to actually get people to WANT to use your product, eh?

  58. 58 jesrad Says:

    “Further diagnosis of the problem showed that while the cluster was restriping from the node maintenance of Thursday morning, a different node in the cluster experienced multiple drive failures.”

    Talk about some bad luck. That’s why we used to sacrifice a virgin or two from time to time, back in the IT dept of [undisclosed], to be sure the Gods were appeased and wouldn’t send us this kind of catastrophe. Fortunately virgins (male ones) come a dime a dozen in IT.

  59. 59 stormthunders Says:

    Thank you for the post and question-answering.

  60. 60 Pathfinder Linden Says:

    @Peekay Semyorka

    “@Broccoli: Pathfinder can correct me if I’m wrong, but the problem’s root cause isn’t the nvram battery, but multiple drive failures which unfortunately happened while (another) node was rebuilding.

    Technicalities aside, all the recent problems mean LL has much work to do in repairing the damage to the SL community’s faith and patience. Needless to say there is a strong negative sentiment among SL users right now, and justifyably so.”

    Peekay, you are correct. And I totally agree we have a lot of work ahead of us to make SL as stable and reliable as possible. That is a big focus for everyone at LL right now, and we’re working on it as fast as we can. We also need to grow as a company, so if anyone wants to join the LL team to help with these challenges, please check out http://lindenlab.com/employment :)

  61. 61 Catherine Cotton Says:

    Thanks for answering some concerns Pathfinder. Anyway to tell who was affected by the four files?

  62. 62 Guitarhero Dougall Says:

    So in essence then Pathfinder, will there be two asset servers that need to re-stripe now?
    If so then it may take a day or two to fully re-stripe them hence the issues we are seeing with TP’s, lag and so on?
    Thanks Pathfinder

  63. 63 Betty Eros Says:

    Patience people, patience.

    Pathfinder, I thank you for all your work.

  64. 64 Ziffle Babblebrobs Says:

    Keep up the good work - sorry for all the people that dont understand what it takes to run a data center complain. Hang in there!

  65. 65 Xess Dix Says:

    I appreciate the update, but there’s a few sugesstions I have, as I am not fully vested in 2L, and want it to succeed.

    First, and foremost, when you have created an economy, and that economy crashes (as it did last night), you have to have a back up plan. Governments (which the Lindens are) do this all the time, and handle it in a number of ways. As we don’t have military coups on the grid, then there’s the issue of devaluation, or major corrections in the market. I would very strongly suggesst a goverment incentive for residents and businesses who lost income during the downtime (yours truly is out a few thousand hard earned L$ that I doubt I’ll ever get back as it was paid out just moments before the log out)

    Secondly, as in rl, you have created an economy, with a market exchange, banking, and in some areas finance, but do not have any method or mechanism for insurance,w hich is a cornerstone of the economy in rl. in an offline forum i’d like to offer some options on that one.

    thirdly, while restoration of the grid was paramount, and there were quick notices throughout the night, the customer service needs to be addressed. Flip comments like “get some sleep” don’t fly in the world of commerce. engage us as partners with you in the 2L endeavour and we will be your champions.

    finally, while i appreciate your efforts (I’m certain it was “all hands on deck” since the crash, most businesses that don’t return to full strength within 48 hours and have a mechanism for goodwill in retaining customers, fail within a year. I have tons of stats on that. I really want you to succeed and strongly suggesst a goodwill plan is implemented immediately. luckily as the new year approaches, you have a perfect time and place to implement.

  66. 66 Daring Petrichor Says:

    Great to get real communications about whats going on at SL. Keep up the great work.

  67. 67 John Horner Says:

    Not quite sure if I understand all that has happened but from the viewpoint of an end user when I log in, I do not “download” my Av or clothes and there is an absense of user generated sounds, textures, and general land.

    Regards

    John

    PS first log in for around 18 odd days as I have been on holiday

  68. 68 Jon Schack Says:

    I still think you guys are doing a great job, thanks for the technical update!!

    *Refrains from mentioning the Backup Solution provider he works for other than it used to start with “A” and now starts with “Q”*

    Contact me if you want more details

  69. 69 Aufalond Holder Says:

    It does sound like u need a sort of SWOT analysis and a sort of six sigma approach to your IT infrastructure and support even though the initial investment might be a stinger, with numbers growing it must be done.

    Hire Erbos.

  70. 70 Pathfinder Linden Says:

    @Catherine Cotton

    “Thanks for answering some concerns Pathfinder. Anyway to tell who was affected by the four files?”

    Unfortunately, no. That data is part of what was corrupted.

    @Guitarhero Dougall

    “So in essence then Pathfinder, will there be two asset servers that need to re-stripe now?
    If so then it may take a day or two to fully re-stripe them hence the issues we are seeing with TP’s, lag and so on?”

    I’m not sure about the re-striping question, since ops is working on that right now. But I do know that issues with TP and lag and other Grid weirdness are temporary issues caused by the fact that we just turned the Grid back on a couple hours ago. Like I said before, this typically happens when the Grid is opened up after being down for whatever reason. It typically takes a while for this bird to get off the runway and up to a stable cruising altitude before passengers can walk around the cabin freely without getting a bit jostled by turbulence. I apologize, and things should be back to normal shortly.

  71. 71 lulu gallacher Says:

    So these hardware problems should be of no consequence to such an important computer system like this!
    With all the contingency you Linden guys have built into the system and the religious backup/ archiving that is going on daily…if not 6 hourly, then these normal hardware faults, and hard drive failures should make no difference to the running of SL.

    Now if there is no hardware contingency or archiving going on, then where is your basic funding/ finances, going pray?
    I hope for all our sakes all the other battery backed up rams in the SL SYSTEM are in thier operating dates?
    It would be basic negligence if none of these simple low cost IT systems/ processes are not in place!

    Glad u got “it” fixed though….hoping for no re-occurrences and SL keeps working at least until the end of the holidays….….Huggs Lulu

  72. 72 Moose Maine Says:

    Having dealt with clustered nodes and HD’s that wont stripe correctly, I feel for the lindens that were sitting on pins and needles! Those kinds of server problems always creep in on fridays and it’s not unusual for a rapidly expanding company to run into that all the time. You’d be suprised how many Fortune 500 companies engineers deal with that same issue behind the curtains everyday. Three things to note. 1. It’s obvious that the lindens have a good maintenance contract in place where that have support 24/7 - that costs big money. 2. No matter how many backup’s a company has - and how often there tested there will always be some data loss. 3. Give these folks some credit on keeping their users updated, that very seldom happens in corporate america, you’d be suprised how many time it’s said ‘ Sorry the servers must have burped!’ instead of the truth. Kudos to the Lindens for keeping things out in front and above board!

  73. 73 Pathfinder Linden Says:

    @Marcus Reisman

    “Testing 123.. none of my comments show up :(”

    Sorry about that. Our spam filter held your comments for some reason. I’ve tweaked the spam filter, so all your comments are now posted.

  74. 74 Simo Voss Says:

    I got one thing to say on this matter. If a servers internal battery fails (the NV ram/bios battery) the server is too old for an application like this with the traffic that is being pounded at it. I suspected a RAID array or similar had gone down which also points to age. It seems like you scraped through by the skin of your teeth this time.

    Maybe its time to start throwing money at the servers and upgrade.

    It would be interesting to hear the actual technical report of the drives/cluster that went down. SATA? SCSI? makes models etc. In the past year i have had soo many Fujitsu drives pack up on me and also Maxtor SATA drives.

  75. 75 Boss Melnitz Says:

    Okay…I’ve always suspected (as have many others) that I have my head up my arse most of the time. But now SL is PROVING it…

    http://davidk.vox.com/library/photo/6a00c2251d355f549d00cd9707b6184cd5.html

    What do I do now? File a “Butt Report”?

  76. 76 Dina Says:

    I’d like to suggest a return to the old world computing model. RAS. Reliability, accessibility and security.

    If we missed any of those targets by more than 5 minutes a year all hell broke loose and we were reporting to the CEO. We ran computers called mainframes which still run most of the worlds large corporations. One large carefully guarded computer instead of 300 small ones.

    Todays computing model is one of chaos that barely works. And its very expensive.

    Almost all the people in the computer business these days were were weaned by Microsoft. Microsoft, that has us all convinced that up times greater than 10 minutes a day is acceptable.

    OK so maybe I’m being a bit sarcastic but I think that generally computing systems are not being built to meet he RAS model that was so very successful for a very long time.

    In the short time I’ve been on SL I’ve seen nothing but a string of software and hardware glitches. Maybe LL can examine their overall computer strategy from a different perspective and come up with a more reliable model for both hardware and software.

  77. 77 Marcus Reisman Says:

    Thanks for the fix, Pathfinder.

    Simo: the battery in question is for an add-on NVRAM based flash drive for the journaling system in each node in the NAS cluster - NOT the BIOS/CMOS battery. When a file is written to the cluster, a transaction is written to the flash drive on each box and then later asynchronously written to the actual disks. This is crucial for performance and reliability reasons. Doing QC on flash drives is not foolproof however. I used to work for the company that builds these asset servers so I know them very well.

    LL is not rolling their own NAS solution, it is a product from http://www.isilon.com

  78. 78 Teravus Ousley Says:

    Thanks Pathfinder, you are doing an amazing job as the comm-monkey :). 4 files being corrupted out of 300 million is a highly successful recovery operation. Congratulations on that.

    One company that I worked for had an event like this and it took them one whole week to recover from it.

  79. 79 Harald Nomad Says:

    In response to statements about the failure of small, cheap parts causing major headaches: Arie Luyendijk lost one Indy 500 due to 50 cent spark plugs.

    Regarding business in SL and in general: Rule #1 is to know the environment in which you do business. If a company choses to do business in a warzone, it should count on its warehouse being bombed. Uhm, no, I’m not comparing SL to a warzone…

    Any company making money on sales of audio/video cannot complain about “loss of profit” due to illegal copying. It’s a fictive, statistical, non-existing loss. Businesswise it makes more sense to count on 50% “unaccounted for” distribution. That’s the environment such a company works in. Can try to improve, but cannot complain.

    Business in SL? Try looking at it as a business in a settlers town - can’t count on the cavelary to show up every time the bandits hit. Can’t cry loss of profit every time a house burns down. It’s the nature of the environment you do business in.

    So it never gets better? Of course it will. It may be wise to not hold your breath though. The power company, as someone referred to, has lots of other companies to look at, take their good ideas, improve them, do better. A company pioneering in a totally new environment has to learn by experience - unfortunately mostly bad experiences.

    Come to think of it: if a competing company normally makes more profit than I do, and due to circumstances both our businesses are down, I lost less than them, and therefore my company did better! So there.

    As to employment, Pathfinder: applying is one thing, getting a response is another ;)

    And to all of you who wonder why a battery can cause problems: when was the last time you replaced the batteries in your smoke/fire/co detectors?

    Happy New Year! :)

  80. 80 Xavier Tosung Says:

    Thanks for the report LL

    As a senior engineer for a UK blue chip company myself, I have encountered exactly the same problems with clusters and their innate ability to stab you in the back at 2am :)

    No amount of hardware upgrades or money can fix it when the cluster service itself fails, and those who are bitching off may do better to read a manual on how difficult life can be when the s**t hits the fan on a cluster.

    Well done and great effort by all involved to restore service. Ignore those who are dumb and just complain.

  81. 81 Seola Sassoon Says:

    Kudos for letting people know the details!!!

    It’s about time!

    This lil bit of info, while I have no freaking clue what the issue is, puts out publicly that you really DO know the problem and I can see it in print. I’ve been praying LL gets more open like this. I hope this line and type of communication continues when issues arise! :)

    That aside, thanks for busting your booties during the holiday season when I’m sure many are on vacation!

  82. 82 AllieKat Stovall Says:

    @ Simo:

    The problem is with the load on the cluster. Ever since SL broke 1 million residents, we have been having small (and sometimes big) hiccups. The upgrade kind of needed to be done before then; however, no one knew what this load would be back before all the good press started. 19,000 are logged in at the moment, and apparently it's fixed until the next node decides to give up the ghost. I do agree an upgrade of the asset server cluster would be a foreseeable and worthwhile expense, but that could mean a full day of no SL activity at all, and I'm not really sure that is cost effective, unless they upgraded one node at a time, but that is a lot of downs and ups throughout the day and night.

    Erbo, I couldn't have said that better myself.

    As a veteran of a central failure myself, I know what they were going through in that room last night. I don't wish that on anyone. It's definitely a PITA.

    Kudos to the Linden Team for getting it squared away so quickly. I would assume that system is 10 times more complex than the system I work with on a daily basis, so this incident has given me a newfound respect for the LL team.

  83. 83 Catherine Cotton Says:

    “Pathfinder Linden Says:
    December 29th, 2006 at 1:39 PM PST

    @Catherine Cotton

    “Thanks for answering some concerns Pathfinder. Anyway to tell who was affected by the four files?”

    Unfortunately, no. That data is part of what was corrupted.”

    Thanks for the swift answers Pathfinder :) have a happy new year. (If the Lab releases you to do so :D )

  84. 84 rebecca proudhon Says:

    Thanks for this clear report. I like knowing what is really going on and I like to see straightforward reports like this, which is more information than usual. I am fully aware how messy things can get with machines and am usually amazed, and marvel that SL works as well as it does, even though it gets frustrating because I wish I could fix it myself. It would be great to get more technical reports like this. I tend to be hardware oriented and like getting a better picture of the hardware setups with SL. That was interesting.

  85. 85 Catherine Cotton Says:

    “Harald Nomad —And to all of you who wonder why a battery can cause problems: when was the last time you replaced the batteries in your smoke/fire/co detectors?”

    I don’t have a staff of people whose job it is to make sure that the batteries have been replaced in a timely manner. If I did, however, and the house burnt to the ground as a result of their not doing their job, I can pretty much promise I would have one less employee on staff. ;)

  86. 86 Argent Stonecutter Says:

    What do you mean by “nodes” here? Individual disk shelves?

  87. 87 Morwen Bunin Says:

    Thanks Pathfinder (and other Lindens) for all the work you have done today. Thank you for answering the serious questions.

    *claps happily in her hands for all the work done by Lindens*
    A very good 2007 to all Lindens!!!!!!!

    Morwen.

  88. 88 pusha vodopan Says:

    I am new to responding to the blog here; so forgive if I ask a silly question and yes I did read a good chunk of the above.

    So, I am uploading some clothing templates and it seems that they are not becoming usable textures after they upload.

    Is this a temp problem that will sort itself out, or do I have to re-upload these files I am working on after the grid has smoothed itself out?

    Thanks in advance - and LL, sometimes a little or a lot of pain is the touchstone of good positive growth. Hang in there, we know you're doing your best.

    Pusha

  89. 89 Aaron Edelweiss Says:

    Thanks guys :). I was one of those posting for more info. Thanks for listening, and thanks for taking the time to answer. Not only does it make me feel better knowing what the problem was, it makes me appreciate that you’re working hard over there.

  90. 90 Pathfinder Linden Says:

    @Argent Stonecutter

    “What do you mean by “nodes” here? Individual disk shelves?”

    By node we mean a component system (computer with drives) within the cluster.

    I’m going to clarify that on the original blog post. Thanks.

  91. 91 Gatz Morang Says:

    This is hands-down the best downtime explanation I’ve seen from any virtual world. Thank you for not glossing over the details and for not talking down to your audience!

  92. 92 Scalar Tardis Says:

    I offer this up as ideas and food for thought for those unfamiliar with the complexities of high-end server storage. The backend storage mechanisms can be extremely complicated and totally non-obvious to someone who has never dealt with this sort of thing before. :)

    ,

    Last I recall, SL has something on the order of 25 terabytes of asset data, and this number just keeps on growing with each passing day.

    Assuming LL is using multiple drive arrays with RAID-5, the parity adds 33% more redundant data to allow the array to survive a single-drive failure without downtime. This tacks 8.25 TB of redundancy onto the overall storage needed for the assets, for a total of 33.25 TB.

    Maximum sustained Fast/Wide SCSI U320 drive write speed is around 100 megabytes/sec, so it will take (33,250,000 megabytes / 100 megabytes/sec) = 332,500 seconds to duplicate one array onto another, which is 92.36 hours, or 3.85 days.
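
    The arithmetic above can be checked with a short Python sketch. The inputs are this comment's own estimates (25 TB of assets, 33% parity overhead, 100 megabytes/sec sustained writes), not figures confirmed by Linden Lab, and the function name `array_copy_time` is made up for illustration. Decimal units are assumed (1 TB = 1,000,000 MB).

```python
def array_copy_time(asset_tb, parity_overhead, write_mb_per_s):
    """Return (total_tb, seconds, hours, days) to copy one array at a fixed write speed."""
    total_tb = asset_tb * (1 + parity_overhead)  # data plus assumed parity overhead
    total_mb = total_tb * 1_000_000              # decimal units: 1 TB = 1,000,000 MB
    seconds = total_mb / write_mb_per_s
    hours = seconds / 3600
    return total_tb, seconds, hours, hours / 24


total_tb, seconds, hours, days = array_copy_time(25, 0.33, 100)
print(f"{total_tb:.2f} TB -> {seconds:,.0f} s = {hours:.2f} h = {days:.2f} days")
# 33.25 TB -> 332,500 s = 92.36 h = 3.85 days
```

    That lines up with the "few days" of restriping mentioned in the original post, though the real process works differently from a plain array-to-array copy.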

    ,

    The biggest hard drive on the market is the 750 gig SATA-300 Seagate, which would require 45 drives in the RAID array to handle everything, though at 7200 RPM this drive isn’t intended for heavy server-grade use and is probably not what LL is using.

    If server-grade SCSI U320 or Fibre Channel drives are used, the biggest is 300 gig at 15,000 RPM, increasing the array size to 111 drives… not including any free space needed for new incoming assets.

    With such high-performance drives costing $1000+ apiece, there’s easily $150,000 tied up in one array in just the drives alone, and not including mounting, power, backup power, communications, and the 24x7x365 hardware support.
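
    The drive counts above follow from rounding up the 33.25 TB estimate divided by the drive size. A small Python check, using only the numbers in this comment (the helper name `drives_needed` is made up for illustration):

```python
import math


def drives_needed(total_gb, drive_gb):
    """Smallest whole number of drives whose combined size covers total_gb."""
    return math.ceil(total_gb / drive_gb)


TOTAL_GB = 33_250  # the 33.25 TB estimate from earlier in this comment

sata_count = drives_needed(TOTAL_GB, 750)  # 750 gig SATA drives
scsi_count = drives_needed(TOTAL_GB, 300)  # 300 gig 15,000 RPM drives
print(sata_count, scsi_count)              # 45 111
print(scsi_count * 1000)                   # 111000
```

    At exactly $1000 per drive, 111 drives come to $111,000; the $150,000 figure corresponds to an average price of roughly $1,350 per drive, which fits the "$1000+" wording.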

    ,

    RAID-5 can survive only one drive failure. A hotspare drive can be available online to replace the failed drive, but it takes time for the array controller to recreate the lost data on the failed drive.

    The hotspare will not go fully online to provide redundancy, until all the data across all remaining array drives has been read and computed to rebuild the failed drive onto the hotspare.

    If a second drive fails before the failed drive can be rebuilt, then the array has no more redundancy available, and data is permanently destroyed.

    RAID-5 only protects against a single drive failure. Usually that is good enough because the statistical likelihood of two drives failing in rapid succession is very small. But it can happen, as in this case.
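
    For anyone curious why one failure is recoverable and two are not, here is a minimal Python sketch of the XOR parity idea behind RAID-5. It is a simplification under stated assumptions: real controllers rotate parity across drives and work on fixed-size stripes, and the helper `xor_blocks` and the sample data are made up for illustration.

```python
from functools import reduce


def xor_blocks(blocks):
    """XOR equal-length byte blocks together, byte by byte."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))


# Three data "drives" plus one parity block computed from them.
data = [b"ASSET-01", b"ASSET-02", b"ASSET-03"]
parity = xor_blocks(data)

# One drive lost: XOR of the survivors and the parity rebuilds it exactly.
rebuilt = xor_blocks([data[0], data[2], parity])
print(rebuilt == data[1])  # True

# Two drives lost: the survivors only yield the XOR of the two missing blocks,
# which is not enough to recover either block on its own.
leftover = xor_blocks([data[0], parity])
print(leftover == xor_blocks([data[1], data[2]]))  # True
```

    The rebuild step has to read every surviving drive in full, which is why the hotspare takes so long to come online.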

    ,

    There is a new standard available to provide further protection known as RAID-6. It can tolerate two drives failing and still not lose any data. However, the number of controllers that offer this level of data protection is very small at this time.

    Another more complex route is the mirroring of two RAID-5 arrays (RAID 1 set of two RAID-5). This allows the mirror set to survive any two drives failing in one RAID-5 array since the data is also duplicated across the two mirrors. On the downside it doubles the total number of drives used and there is a performance hit from duplicating the data across two duplicate RAID-5 arrays.
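
    The failure tolerance of the three layouts mentioned so far can be compared with a rough Python sketch. It only counts failed drives and ignores rebuild time, controllers, and everything else, so treat it as an illustration rather than a model of any real product. The function names are made up.

```python
def raid5_survives(failed_drives):
    # Single parity: data survives at most one failed drive.
    return failed_drives <= 1


def raid6_survives(failed_drives):
    # Double parity: data survives at most two failed drives.
    return failed_drives <= 2


def mirrored_raid5_survives(failed_in_a, failed_in_b):
    # Two RAID-5 arrays mirrored: data survives while at least one half is intact.
    return raid5_survives(failed_in_a) or raid5_survives(failed_in_b)


print(raid5_survives(2))                 # False
print(raid6_survives(2))                 # True
print(mirrored_raid5_survives(2, 0))     # True
print(mirrored_raid5_survives(2, 2))     # False
```

    The mirrored layout can ride out more failures in some combinations, but it pays for that with twice the drive count.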

    ,

    Other situations can cause multiple drive failures. I don’t know about you, but I have never seen a rack chassis capable of holding 111 hard drives all within a single frame with a single power supply and single connector cable.

    Instead the drives may be held across multiple chassis/cases, each holding a number of drives. These chassis have interconnect cables, and they each have power supplies. The inter-chassis communications cables themselves may have a central controller.

    If for some reason a cable is disconnected or damaged, or the power supplies fail or are disconnected somehow, or the central drive controller fails, then all the drives beyond that cable, or in that chassis, or on that controller become inaccessible and the array fails.
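
    A small Python sketch of that shared-chassis problem, under made-up assumptions: three hypothetical shelves of three drives each, and a RAID-5 array that tolerates one lost drive. The shelf and drive names are invented for illustration. It also shows a common mitigation that goes slightly beyond the point above: spreading an array's drives across shelves.

```python
# Hypothetical layout: which drives sit in which chassis (shelf).
SHELVES = {
    "shelf_a": {"a0", "a1", "a2"},
    "shelf_b": {"b0", "b1", "b2"},
    "shelf_c": {"c0", "c1", "c2"},
}


def raid5_survives_shelf_loss(array_drives, failed_shelf):
    """True if losing every drive in failed_shelf costs the array at most one drive."""
    lost = set(array_drives) & SHELVES[failed_shelf]
    return len(lost) <= 1


within_one_shelf = {"a0", "a1", "a2"}
spread_across_shelves = {"a0", "b0", "c0"}

print(raid5_survives_shelf_loss(within_one_shelf, "shelf_a"))       # False
print(raid5_survives_shelf_loss(spread_across_shelves, "shelf_a"))  # True
```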

    ,

    Knowing all this, it’s hard to guess LL’s setup, but with the clustering it sounds like each member has its own array, which is mirrored across each other cluster member. So each has its own array and each should have a duplicate copy of everything.

    It’s also possible for there to be a single huge drive array which multiple servers share and access simultaneously, though from LL’s own description of the need for remirroring here, it doesn’t sound like the asset servers share just one array.

    Using multiple arrays provides parallel access and multiple-read support for many people at once. (The one array route would mean that some cluster members may occasionally be stuck waiting for access while another is busy with the central array.)

    It may take time for new data coming onto one cluster array from SL residents to be duplicated onto another cluster member array, and perhaps the cluster with the corrupted array could not get out the new data to be duplicated on the others before it failed.

    .

  93. 93 AishaDracogryph Says:

    Hmm, I’m glad you guys were on top of this and had everything fixed as soon as possible. It also shows great professionalism that you made certain to give a full explanation of the problem along with technical details.

    It seems relying on a “battery” that must be replaced, for something that can cause such problems, is not ideal. Perhaps switching over to some form of static memory would be a good solution? Though I am not sure which type of static memory would be best.

    Another consideration (depending on just how the battery needs to work) would be to switch to some kind of battery that can be charged while in use, thus solving the problem.

  94. 94 L.W. Says:

    As far as this issue is concerned, I have worked in the industry and understand that things happen, it seems like you did your best to resolve this issue.

    BUT…

    while the grid was down, I took the opportunity to check out the new release on the beta grid.
    I am not impressed, in fact it is scary.

    Old problems not fixed, for example: it is no longer possible to rotate around your avatar and see it from all angles when you are editing appearance - just to name one.

    New problems introduced, for example: serious movement and texture issues.

    Plus, even though there were few people on the beta grid and only a handful in my immediate area, lag was still a big problem as always, and of course slow-rezzing.

    It is obvious to me that the employees of LL work very hard and probably put in a lot of hours trying to get it right.

    It is just as obvious to me that the problem lies with the management of LL, and how they prioritize projects.

    I have seen it before in other companies, shove new things out as fast as you can, even if they don’t work, and make your employees scramble to try to put out fires constantly.

    I’m sure they feel they have some justification for doing this (which escapes me), but what they end up with is burnt-out employees and pissed-off customers.

    THERE NEEDS TO BE A MUCH BIGGER FOCUS ON TESTING!!!!!

    It seems as if LL is understaffed; they need to find money in the budget to hire testers instead of relying on users.
    I have experience, let me work from home and I will spend hours and hours testing new releases, BEFORE they are rolled out to the live grid. But you need numerous people dedicated just to testing and a responsive development/maintenance team to implement bug fixes that the testers report.

    Without making changes of this sort, your business will suffer; people are only willing to put up with so much.

    I don’t make my living from SL, I just play it like a game, and find it quite addictive, but it is starting to get old with all the problems.

    I generally don’t whine, this is meant as constructive criticism.

    To the LL employees: your hard work is appreciated by some of us anyway.

    To the LL management: please do better - DELAY THE NEW RELEASE UNTIL IT HAS BEEN PROPERLY TESTED! (and get more testers!)

    To everyone: Happy New Year!

  95. 95 pusha vodopan Says:

    @ Scalar Tardis - ty for that. Certainly helps put some perspective on things. Seeing as we all “own” a piece of this thing in a way, it is nice to have a general idea of the scope of what this failure and recovery means.

  96. 96 Max Says:

    Change the hardware vendor.
    Go for a reliable and available system.
    Rather than doing business on a gaming platform you should enable fun on a business platform.
    11h outage - some companies will be out of business!
    Max

  97. 97 kalisten Says:

    Linden Crew -

    Thanks for this kind of detailed report. As a hardware maintenance / network admin / sysop myself, I really appreciate the visibility. :) I know it’s hard for folks that don’t know the complexities and SOPs for RAID, etc. to appreciate the way things go sometimes, but this kind of report is awesome for us folks that do. Good work. Thanks!

  98. 98