Second Life 1.18.5 Server Deploy Post-Mortem
Tuesday, November 13th, 2007 at 12:27 PM by: Joshua Linden
The Second Life 1.18.5 Server release included updates for several systems, including new python libraries, backbones (a piece of infrastructure which handles a variety of services, such as agent presence and capabilities, and proxies data between systems), and simulators. The deploy as planned for November 6th did not require any downtime – all components could be updated live. We planned to perform the rollout per our patch deploy sequences: updating central systems one by one, then simulators.
Read on for the day-by-day, blow-by-blow sequence of events which followed…
Tuesday, November 6th
Prior to the 1.18.5 Server deploy, at around midnight (all times are Pacific Standard Time) we suffered a VPN outage to our Dallas co-location facility, which caused many regions to drop offline. The system recovered on its own after about an hour, and our ISP’s initial investigation pointed to hardware issues with the network infrastructure.
Starting at 10:00am we began the actual update of the servers to the Second Life 1.18.5 code. We started by updating the “backbone” processes on central machines one by one, such as login servers, tackling the “non risky” machines first. At 11:00am we got to the “risky” machines, which handle agent presence (i.e. the answer to “is so-and-so online?”) as well as several other key services. Closely monitoring the load on the central database (which usually shows increased load when something goes wrong) as well as internal graphs which closely track the number of residents online, we started making updates. Everything seemed to be going well.
At about 11:15am the various internal communication channels lit up with reports of login errors. We stopped updates of these central systems (7/8ths of the way through) and started to gather data. We have seen this problem in the past when hardware issues or bugs caused the “presence” servers to spin out of control, but this time there were no obvious failures; for unknown reasons they weren’t responding to requests from the login servers. Hoping for a quick fix (i.e. a simple configuration change that could be applied live) we spent about 30 minutes trying to determine the cause, then gave up and rolled back to the previous code.
(Fortunately, in this case, a rollback was straightforward, and simply resulted in “unknown” agent presence for about 10 minutes. Rollbacks are not always so easy - see below!)
Simultaneously, logins to jira.secondlife.com and wiki.secondlife.com failed. These were due to the update as well (but, as it turned out, for different reasons). Once the dust had settled on the rollback it was easy to roll back one more machine to restore these logins.
Completely unrelated to the update, the database load on the central systems required us to pause the Tuesday stipend payouts, delaying the payouts for several hours. (As more and more residents have joined Second Life, and the central systems have grown busier, the time taken for stipend payouts had crept up to 24 hours. The code responsible for the process has been rewritten and the November 13th run completed in just 3 hours.)
Wednesday, November 7th
Several Lindens continued the investigation, and determined a source of the issues seen on Tuesday: the “agent presence” system was updated to use object pools to increase performance, but the number of objects in the pool was set too low. After some work, we were able to replicate this failure in test environments to verify the fix. The updated code was re-distributed to the machines making up the service, and we prepared to try again on Thursday.
(Little did we know that the insufficient object pools were merely a symptom, not the root cause.)
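To illustrate how an undersized object pool shows up as failed requests, here is a minimal sketch. It is not the actual agent presence code; the `ObjectPool` class, the `serve` helper, and the pool size of 4 are all hypothetical and chosen only for demonstration.

```python
import queue


class PoolExhausted(Exception):
    """Raised when no pooled object becomes free within the timeout."""


class ObjectPool:
    """A fixed-size pool of reusable objects (hypothetical illustration)."""

    def __init__(self, factory, size):
        self._free = queue.Queue(maxsize=size)
        for _ in range(size):
            self._free.put(factory())

    def acquire(self, timeout=0.01):
        # Wait briefly for a free object; give up if none is returned in time.
        try:
            return self._free.get(timeout=timeout)
        except queue.Empty:
            raise PoolExhausted("no free object in pool") from None

    def release(self, obj):
        self._free.put(obj)


def serve(pool, concurrent_requests):
    """Hold one pooled object per in-flight request.

    Returns (served, rejected) counts.
    """
    held, rejected = [], 0
    for _ in range(concurrent_requests):
        try:
            held.append(pool.acquire())
        except PoolExhausted:
            rejected += 1
    served = len(held)
    for obj in held:
        pool.release(obj)
    return served, rejected


if __name__ == "__main__":
    small = ObjectPool(factory=dict, size=4)
    print(serve(small, concurrent_requests=3))   # fits within the pool
    print(serve(small, concurrent_requests=10))  # six requests find the pool empty
```

In a sketch like this, raising the pool size makes the rejections disappear, which is consistent with a too-small pool looking like the cause even when something else is holding the objects for too long.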
Thursday, November 8th
Once again unrelated to the software update, the hardware work originally scheduled for Oct 31st was finally done. Unfortunately, the addition of new hardware to the asset cluster didn’t go as smoothly as planned - as old hardware was removed the “fail over” appeared to, er, fail. From approximately 10:15am through 10:50am, assets could not be saved. This also caused login failures: when a resident logs off, the simulator needs to upload the attachments as assets before that resident can log in again, and the simulators were stuck waiting.
After the asset cluster was happy again, we proceeded with the 1.18.5 Server update. The first half of the central systems were updated by 12:00pm. We paused to ensure that the system was behaving as expected, then continued at about 12:30pm completing the updates. Shortly thereafter, as the number of online residents passed 46,000, the servers began failing in a new way. Although most of Second Life was functioning properly, many logins were slow or failed and some group chat failed as well. We diagnosed the problem as an unrecognized dependency – the central backbones were assuming that the simulator backbones would close a connection, but the simulator backbones (which had not yet been updated) were assuming the central backbones would close it instead. This wasn’t a problem in test environments or before concurrency passed some threshold, because the connections would close automatically; they would just not close fast enough to keep up once more residents were online. Once this root cause was identified (by about 2:15pm) we were able to change the code in the central backbones to resume closing the connections, since that was a faster fix. Restarting the central backbones did cause residents to appear offline for a short period of time, which was unexpected (and is being investigated).
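The threshold effect described above can be sketched with a toy model: if neither side closes a connection explicitly and it only goes away after some automatic timeout, the number of connections open at once grows roughly as the rate of new connections times the timeout. All numbers below (rates, the 60-second timeout, the 10,000 connection limit) are invented for illustration and are not figures from the actual service.

```python
from collections import deque


def peak_open_connections(opens_per_second, lifetime_seconds, duration_seconds):
    """Simulate second by second and return the most connections open at once.

    Every connection is assumed to stay open for exactly `lifetime_seconds`.
    """
    cohorts = deque()  # connections opened in each recent second
    open_now = 0
    peak = 0
    for _ in range(duration_seconds):
        cohorts.append(opens_per_second)
        open_now += opens_per_second
        if len(cohorts) > lifetime_seconds:
            # The oldest cohort has reached its lifetime and is closed.
            open_now -= cohorts.popleft()
        peak = max(peak, open_now)
    return peak


if __name__ == "__main__":
    LIMIT = 10_000  # hypothetical per-process connection limit
    for label, rate, lifetime in [
        ("quiet grid, automatic close", 50, 60),
        ("busy grid, automatic close", 500, 60),
        ("busy grid, explicit close", 500, 1),
    ]:
        peak = peak_open_connections(rate, lifetime, duration_seconds=300)
        status = "over limit" if peak > LIMIT else "ok"
        print(f"{label}: peak {peak} ({status})")
```

Under these assumed numbers the same code is fine at low load and fails at high load, which is why a test environment with few residents would not show the problem.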
Starting after 3pm we initiated a rolling restart to update the simulators as well to complete the update, a process which took about 5 hours. During a rolling restart, in order to reduce network traffic and load on central systems, the service is in an unusual state – regions are not allowed to move to new simulators in case of a crash. Additionally, during the “geographic” restart (where regions restart in a wave traveling North to South), crash reports sent by simulators contain bogus data. (The code has been updated but old processes are still running.) This unfortunately makes detection and diagnosis of issues problematic. There was anecdotal evidence that some regions were crashing a lot, but we were unable to verify that this was not simply due to bad hardware until after the process was complete.
After the post-roll cleanup, it became clear that this was not an anomaly. A few contingency plans were discussed, including rollbacks for specific regions, but we were primarily in a data-gathering phase.
Friday, November 9th
As sleepy Lindens stumbled back into work, one incorrect (but ostensibly harmless) idea was tried; unfortunately, due to a typo, this accidentally knocked many residents offline at around 9:40am. Shortly thereafter, more testing including complete rollbacks on simulator hosts showed that the new code was indeed the culprit, but it took a while longer to identify the cause. By 12:00pm the investigation had turned up a likely candidate – and an indication that a simple widespread rollback of the code would not, in fact, be safe or easy!
The crashing was caused by the simulator “message queue” getting backed up. A server-to-viewer message (related to the mini-map) was updated and changed to move over TCP (reliable, but costly) instead of UDP (unreliable, but cheap and fast). On regions with many avatars, this would cause the simulator to become backed up (storing the “reliability” data) and eventually crash. We have a configuration file switch that allows us to toggle individual messages between TCP and UDP on the fly, but while testing we discovered a second issue: another file necessary for the UDP channel also needed to be updated, and that file could not be changed on the fly. If we flipped the switch back from TCP to UDP on a running simulator, it would crash. (The UDP to TCP update on-the-fly worked, which is how we were able to do the rolling restart in the first place.)
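A toy model can show why a reliable channel backs up on busy regions while an unreliable one does not. This is only a sketch under stated assumptions, not the simulator's real messaging code: it assumes one update per avatar per tick, that a reliable sender keeps a copy of each message until it is acknowledged, and that acknowledgements are processed at a fixed rate. The function name and all numbers are hypothetical.

```python
def backlog_after(ticks, avatars, acks_per_tick, reliable):
    """Return how many messages the sender is still holding after `ticks` updates."""
    backlog = 0
    for _ in range(ticks):
        sent = avatars  # assumption: one mini-map style update per avatar per tick
        if reliable:
            backlog += sent                          # keep a copy until acknowledged
            backlog -= min(backlog, acks_per_tick)   # acknowledged copies are dropped
        # Unreliable delivery is fire and forget: nothing is retained.
    return backlog


if __name__ == "__main__":
    print(backlog_after(100, avatars=10, acks_per_tick=40, reliable=True))   # quiet region
    print(backlog_after(100, avatars=60, acks_per_tick=40, reliable=True))   # crowded region
    print(backlog_after(100, avatars=60, acks_per_tick=40, reliable=False))  # unreliable channel
```

In this model the backlog stays at zero as long as acknowledgements keep up, and grows without bound once the number of avatars pushes the send rate past the acknowledgement rate, which matches the observation that only regions with many avatars crashed.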
By testing on individual simulators, we were able to confirm that by switching back to UDP the problem was eliminated, although this required stopping the simulators before throwing the switch. We co-opted an existing tool used for “host-based” rolling restarts (which had been used once in the past), and had it shut down simulators on each host (doing several in parallel), update the two configuration files, and restart the simulators. After significant testing, we used this tool to perform another rolling restart of the service, which was completed by 11pm on Friday, including subsequent cleanup.
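The shape of such a host-based rolling restart can be sketched as follows. This is a simplified illustration, not the actual tool: hosts are modeled as in-memory dictionaries, and the configuration keys `minimap_transport` and `udp_template` are hypothetical stand-ins for the two files that had to change.

```python
from concurrent.futures import ThreadPoolExecutor


def restart_host(host, new_config, log):
    """Stop, reconfigure, and restart the simulators on one host (hypothetical steps)."""
    host["running"] = False
    log.append((host["name"], "stop"))
    # Both settings are changed while the simulators are stopped,
    # since switching transports on a running simulator was unsafe.
    host["config"].update(new_config)
    log.append((host["name"], "update"))
    host["running"] = True
    log.append((host["name"], "start"))
    return host["name"]


def rolling_restart(hosts, new_config, parallel=3):
    """Restart hosts with at most `parallel` in progress at once; return the step log."""
    log = []
    with ThreadPoolExecutor(max_workers=parallel) as pool:
        # list() waits for every host and re-raises any worker exception.
        list(pool.map(lambda h: restart_host(h, new_config, log), hosts))
    return log


if __name__ == "__main__":
    hosts = [
        {"name": f"host{i}", "running": True,
         "config": {"minimap_transport": "tcp", "udp_template": "old"}}
        for i in range(7)
    ]
    steps = rolling_restart(
        hosts, {"minimap_transport": "udp", "udp_template": "new"}, parallel=3
    )
    print(len(steps), all(h["running"] for h in hosts))
```

Capping the number of hosts in progress keeps most of the grid online during the roll, at the cost of a longer total run time.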
Saturday, November 10th
Unrelated to the deploy (but included here to clear up any confusion), on Saturday at 5:20pm we suffered another VPN outage, which resulted in hundreds of regions being offline for just under two hours. The cause was the expiration of a certificate used for the VPN. We replaced the certificate, and our DNOC team brought the affected regions back up.
What Have We Learned
Readers with technical backgrounds have probably said “Well, duh…” while reading the above transcription. There are obviously many improvements that can be made to our tools and processes to prevent at least some of these issues from occurring in the future. (And we’re hiring operations and release engineers and developers worldwide, so if you want to be a part of that future, head on over to the Linden Lab Employment page.)
Here are a few of the take-aways:
- Our load testing of systems is insufficient to catch many issues before they are deployed. Although we have talked about Het Grid as a way to roll out changes to a small number of regions to find issues before they are widely deployed, this will not allow us to catch problems on central systems. We need better monitoring and reporting; our reliability track record is such that even problems such as login failures for 1/16th of residents aren’t noted for a significant period of time.
- When problems are detected, we don’t do a good enough job internally in communicating what changes went into each release at the level of detail necessary for first responders to be most effective.
- Our end-to-end deployment process takes long enough that responding to issues caused during the rollout is problematic.
- Our tools for managing deploys have not kept pace with the scale of the service, and manual processes are error prone.
- Track date-driven work (e.g. certificate expiry) more closely; build pre-emptive alerts into the system if possible.
- Be more skeptical about doing updates while the service is live, especially when involving third-party providers.
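As one example of the pre-emptive alerts mentioned in the take-aways, a certificate expiry check can be very small. The sketch below is an assumption about what such a check might look like, not a description of any existing tool; the certificate names and the 30-day warning window are hypothetical, and the expiry dates would in practice come from the certificates themselves rather than a hand-written table.

```python
from datetime import datetime, timedelta


def expiring_soon(certs, now, warn_days=30):
    """Return (name, days_left) for certificates expiring within `warn_days`.

    `certs` maps a certificate name to its expiry datetime. Already expired
    certificates are included with a negative day count. Soonest first.
    """
    window = timedelta(days=warn_days)
    due = [
        (name, (expires - now).days)
        for name, expires in certs.items()
        if expires - now <= window
    ]
    return sorted(due, key=lambda item: item[1])


if __name__ == "__main__":
    certs = {
        "vpn": datetime(2007, 11, 10, 17, 20),  # hypothetical entry
        "wiki": datetime(2008, 6, 1),           # hypothetical entry
    }
    for name, days_left in expiring_soon(certs, now=datetime(2007, 11, 1)):
        print(f"WARNING: certificate {name!r} expires in {days_left} days")
```

Run daily from a scheduler, a check like this would turn a surprise outage into a routine renewal task.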


November 13th, 2007 at 12:38 PM
Thank you. We just like to know whats going on. This is welcome.
November 13th, 2007 at 12:46 PM
Yes, thank you for keeping us informed. I am glad to see that you are learning from this rather shoddy performance. Blame assessment accomplishes little but fixing the problems does.
In my operation we try to always keep in mind the old saying: Don’t let your reach exceed your grasp. Linden Lab needs to engrave that on everyone’s forehead.
November 13th, 2007 at 12:51 PM
Awesome play by play, thanks so much for the communication! Makes me feel more on the team (don’t take that as willingness to run cable though…)

November 13th, 2007 at 12:52 PM
Gutsy story, nice to hear that you are at least honest enough to confirm that you have not kept pace with the scale of the service as it is now.
I’m awaiting anxiously to hear when you have reports about possible solutions to prevent the lessons learned in the future ;o)
November 13th, 2007 at 12:52 PM
Thanks for the timeline. This sort of communication helps us residents better understand the frustrations faced by the teams working to improve the infrastructure. We know you are putting your best efforts into it.
Keep up the good work.
November 13th, 2007 at 12:53 PM
Out of interest, do you publish a list of which sims are on what server locations? I’m curious since I live on a corner joining two of them and have a tough time crossing both and usually have to relog to put all my attachments back in order. But in any case, I’d be interested to know where I live physically.
November 13th, 2007 at 12:54 PM
This is the sort of transparency we need. Thank you for letting us all know.
We’re all stakeholders in this, now onto the hard bit - learning from it…
November 13th, 2007 at 12:55 PM
Well, if that means I will finally be able to come back to SL for more than a few minutes, I truly hope you take lessons from this.
That said, /tiphat for coming out as you do right now about the mistakes made.
November 13th, 2007 at 12:55 PM
thx for the info, thats what we all need!
A lot of people in-game can’t understand how it works at all, but with a few posts like this they will in time understand how it works and know what Linden does to run it well.
regards
November 13th, 2007 at 12:55 PM
Hehe, what a confession. Yes, it was a rough week, but messing with hardware and updates (don’t hit Murphy’s Law there!) And to be honest, Lindens, I think lots of residents will like this reporting. And to all those who say, hmm, other worlds are coming up: sure, but they are still years behind what SL is today. And it is a super complicated system to keep running. And please don’t forget, if one router burns out on your path to the servers, that is just bad luck and not the fault of LL. Guys, SL still rocks, and kudos to the concierge team!
November 13th, 2007 at 12:55 PM
It’s a really good breakdown of events, and it’s reassuring to see that you are attempting to evolve procedures and protocols to grow in line with the scope and complexity of the Grid/SL
November 13th, 2007 at 12:57 PM
“Well, duh…”
Got knocked offline by a typo!
Sounds like you have an ‘interesting’ week to look back on…
Wonderful post, grid still alive: job well done. Go have a beer on me.
November 13th, 2007 at 12:57 PM
Given that the connection between the two facilities has repeatedly failed and seriously impacted SL, you should also be considering redundant connections that use different paths/certifications/technologies/etc. This way if one fails you still have the other. This is obviously a major choke point for SL.
November 13th, 2007 at 1:00 PM
Sounds like every rolling restart is a different breathtaking story…
Thanks for showing a little bit of what happens behind the curtains.
November 13th, 2007 at 1:01 PM
“Don’t let your reach exceed your grasp” is a phrase guaranteed to deliver nothing but diminishing returns, one of the worst maxims ever uttered IMO…
Reach for the stars, see where u get to…lag be damned…
November 13th, 2007 at 1:01 PM
1. Thank you for the above post mortem. it actually is somewhat of a relief to see that you are looking to your mistakes and trying to rectify them. Having had to update distributed systems, I understand it can be, well, interesting, the way they can find failure modes that are seemingly non-deterministic and related more to the price of chocolate beans in some remote village in the Andes Mountains.
2. (and this is not an accusation it really is a serious inquiry) do you practice roll outs on some system - i.e. create a deployment checklist and then build up a (mini) system that is a duplicate of the main grid, give the checklist to someone who was not involved in its creation, and have that person do the checklist to catch any deployment checklist errors? I realize this would NOT have caught some of the above problems, but it is a question for future deployments that may be just as complicated.
November 13th, 2007 at 1:02 PM
Thanks Joshua!
This is good communication!
We really appreciate!
Sincerely,
Lukas Mensing
November 13th, 2007 at 1:05 PM
“believe only half of what you see and none of what you read…”
November 13th, 2007 at 1:06 PM
I’m not sure whether to laugh or cry after reading all that……
November 13th, 2007 at 1:06 PM
Thanks a lot for this extensive post mortem. It’s appreciated that you go to such lengths to inform us of just what went wrong. Also it is good to read that you draw conclusions from last week’s issues and devise a set of useful pointers to keep in mind for now.
November 13th, 2007 at 1:06 PM
Great job of tracking what you’re doing. A year ago, there used to be massive in-world notices when problems were encountered. Many of us fought the system all weekend, but didn’t notice any Linden “hey, we’ve got a problem, and we know about it” notices. I think many of us would have just stayed offline and out of your way had we known.
With that being said, and having worked in those type of server rooms, kudos to those that went with no sleep this weekend. We feel your pain and really do love you! You ARE doing a great job!
November 13th, 2007 at 1:08 PM
Oh, Linden dear Linden, poor dear system…
Slow down, as every single one of your users has been telling you for over a year now. Sort out the old effing errors before developing super new services that create new ones.
Looks like you’re beginning to notice the message your community is sending you.
Thank you!
November 13th, 2007 at 1:11 PM
You know what? I know you folks have a lot on your plate. But I have NEVER seen any company have so many connectivity problems, so many server outages, so much downtime or so much whine whine whine problems. I’m sorry, but you folks are charging more for your service than ANY hosting company in my experience.
I know a guy runs a little back-street server shop. On his wall is a great big sign: “100% Uptime guaranteed or your money back”. How does he do it? Redundant servers and mirror backups, just like any reliable service. When was the last time you saw Google offline? How about Yahoo? Microsoft? Quake? Unreal? (Yeah right, like Linden Lab has more cutting edge activity than any of those companies).
When you charge more for an island than it takes to buy a new car… I have very little sympathy for downtime or excuses for such. You’re playing with the big boys now. Time to act more like pros than game-playing teens. Sorry, no kudos for this one folks. When you charge $5000+ a year for virtual space, you have an obligation to your customers to get it right.
I’m not trying to bust your chops here. Oh wait, yes, I guess I am in this case. We’ve had enough excuses. If I may suggest: stop playing games with new “features” and get to work stabilizing your platform.
November 13th, 2007 at 1:20 PM
well first off, TY JOSHUA LINDEN!!!! havent seen this kind of real transparency for…. well ever! WTG
and now for the bitching that you should know youre gonna get:
” * Our load testing of systems is insufficient to catch many issues before they are deployed. Although we have talked about Het Grid as a way to roll out changes to a small number of regions to find issues before they are widely deployed, this will not allow us to catch problems on central systems. We need better monitoring and reporting; our reliability track record is such that even problem such as login failures for 1/16th of residents aren’t noted for a significant period of time.
* When problems are detected, we don’t do a good enough job internally in communicating what changes went into each release at the level of detail necessary for first responders to be most effective.
* Our end-to-end deployment process takes long enough that responding to issues caused during the rollout is problematic.
* Our tools for managing deploys have not kept pace with the scale of the service, and manual processes are error prone.
* Track date-driven work (e.g. certificate expiry) more closely; build pre-emptive alerts into the system if possible.
* Be more skeptical about doing updates while the service is live, especially when involving third-party providers”
well you said it yourself…. FIX IT
youve done good so far… also read the blogs? maybe just a little?
**on review of some older blogs i found posts by Lindens…. the helpful ones were from Joshua and Jack Linden. WTG
November 13th, 2007 at 1:21 PM
Thanks for the transparency… this helps us a great deal to know that you are willing to own errors and learn (we hope) from mistakes. The more you are this open the more we can respect you. It’s been said elsewhere, but I still feel that solving old problems is more important than adding new features. Good luck and thanks for keeping us informed.
November 13th, 2007 at 1:23 PM
“and the simulators were suck waiting.” (our simulators suck?)
I guess that was a Freudian slip then, rather than just a normal typo.
Thanks for the info, even though I’m not a techy I believe I can understand what happened, good post.
November 13th, 2007 at 1:26 PM
At any point, was any Linden heard to declare they’d chosen the wrong week to stop sniffing glue?
(For the unfamiliar: http://www.imdb.com/title/tt0080339/quotes)
Seriously, thanks for the report. It’s a welcome insight into a fur-raisingly complex process.
November 13th, 2007 at 1:26 PM
You Lindens are simply amazing! I have never seen a company so candid!! Any change in so complex a system is difficult, and you are definitely improving with each major change. I SO wish there were a position available, especially work-from-home, that I were qualified to fill!!!
November 13th, 2007 at 1:28 PM
Very honest post. I really appreciate it a lot to hear the whole truth and nothing but the truth coming from Linden Lab for a change.
It’s also a very good assessment of your current weaknesses. Acknowledging them is the first step in solving them. There is no point in being in denial about them or covering them up. The residents will see right through it if you cover it up. Honesty is appreciated as can be seen from most of the replies to this post.
Again, great job guys.
November 13th, 2007 at 1:29 PM
@23
Indeed, Linden Lab *does* have more cutting edge activity going on than at Yahoo, Quake, Unreal, etc. Of course, ‘cutting edge activity’ is a subjective term, so we can both be right & wrong at the same time
I’ll say this though: When I’m not playing Second Life, I’m playing Everquest. You know… that old MMORPG that’s been around since 1999?
I play on the Nameless server. At least once a week, the whole server crashes… sometimes for hours… and not a word from Sony. At least every other day, one of the regions I’m in crashes. For months, I’ve been plagued by the infamous “West Bug” - which…. you guessed it…. makes it so my entire screen blanks out when I look West. (Its really great trying to grind out xp in a dungeon when you can’t look West)
Everquest doesn’t even begin to have the dynamic content that Second Life does…. and inherently with dynamic content, comes system challenges.
My point is: Second Life, and Linden Lab by no means have a corner on the market of system instability. Everquest, which is run by Sony - also has its share of system challenges. They can’t even fix a West Bug that affects anyone running nVidia 7xxx cards after months of complaints.
I think Linden deserves some Kudos for their post-mortem here. That aint being a cheerleader - its giving credit where credit’s due
November 13th, 2007 at 1:30 PM
I’m sure stabilizing is the main thing LL already does, if you read clearly what these updates, rolling restarts and new hardware were supposed to do. Yes, new add-ons/functions for SL maybe made SL less stable, I noticed this too, but they also made SL much broader to the mass public, and the more people are interested in SL and LL, the more it will benefit our VR world called SL.
I’m sure.
Ty Linden Lab for keeping us residents updated about what’s going on, and good luck to us all (seems you could use it most).
November 13th, 2007 at 1:31 PM
It seems to me that you could have used the het-grid capability to test-update the simulator changes, since the simulator-central-server protocol was flexible enough that you were running the old simulator code with the new central-server code.
Perhaps it would be advisable to treat central server updates as separate operations from simulator updates?
November 13th, 2007 at 1:33 PM
I really appreciate you being candid and telling us directly what happened. I can only speak for myself and some of the residents who have echoed my opinion before: as long as you tell us what’s going on, we’ll be much happier. I fully understand how complex a system you are working on, and with this many people using it and it being updated all the time, stuff is bound to happen. Again, thanks for sharing it with us straight up.
November 13th, 2007 at 1:37 PM
In my company, we follow the simple rule “Do No Harm”. But usually what we do ends up necessarily destructive, and we end up in the “Rob Peter to Pay Paul” mode. This usually upsets the people on the short end of the stick, and since the customer is always right, we put our nose to the grindstone and present them with a “You Scratch My Back We’ll Scratch Yours” proposition. Time is money and the early bird gets the worm, so it’s our “Early To Bed Early To Rise, Makes A Man Healthy, Wealthy, and Wise” approach that allows us to stay ahead of the pack. Sure, we’re busier than a long-tailed cat in a rocking chair factory, but if we can build a better moustrap, people will beat a path to our door. Although we can lead a horse to water, we can’t make him drink, we can still make bacon while the griddle’s hot. Since the pen is mightier than the sword, you must keep in mind that those who live by the sword also die by the sword. Always remember; “Only You Can Prevent Forest Fires”.
November 13th, 2007 at 1:42 PM
Thank you and burn in hell quite while im opening my shop. Damned.
November 13th, 2007 at 1:49 PM
Joshua, thanks so much for this.
You come across as an intelligent group of people, making it up as you go along (in the best possible sense). And explanations like this are welcome — partly because as a resident it’s good to know what went wrong, and partly also because it’s good to know that this level of careful thought goes in to the post-mortem analysis.
November 13th, 2007 at 1:54 PM
Thank you very much for the explanation. It’s appreciated. Thank you.
Keep on doing the good work. You get a lot more understanding when you explain things.
November 13th, 2007 at 1:54 PM
So how long did it take to figure out….
Its always interesting how you ALWAYS blame the hardware!!!(Getting rather old and lame, find a new scapegoat).
November 13th, 2007 at 1:56 PM
Thank you Joshua!!! We need more posts like these on the blog!
Having had to go back to 1.18.3 (1.18.4 crashes in less than 60 seconds for me), I hope that 1.18.5 will work better :).
November 13th, 2007 at 2:03 PM
Good work guys , thanks for the information, its great to be kept informed , thanks
November 13th, 2007 at 2:05 PM
Please do the same for the platform and client - DUH
November 13th, 2007 at 2:10 PM
If only LL communicated this professionally concerning the six-month group notice failure ‘critical’ issue. *sighs*
November 13th, 2007 at 2:17 PM
it would be really nice if we didnt have to download the game.
it would cool,like runescape,you dont have to download anything.
because i cant play the game because it wont download the new version
November 13th, 2007 at 2:18 PM
this is great…now can you tell us why myself and many many others can not teleport without crashing since downloading the new viewer? it’s not our caches, it’s not our firewalls, it’s not our attachments. it never happened before the new viewer. help us please.
November 13th, 2007 at 2:19 PM
I’ve always thought that LL needed someone whose responsibility was to review changes and write a change log.
November 13th, 2007 at 2:22 PM
Thanx for sharing the recent updates, interesting read… and bravo to the “what we have learned” chapter, i hope you can sort these things out, it would certainly improve all our experiences!
November 13th, 2007 at 2:22 PM
Good work LL! And thank you for the explanation of the work you have provided us. Things have come a long way since 12/18/2002 ^^
November 13th, 2007 at 2:23 PM
Josh - really appreciate some contents of your posting. But do not appreciate the header… it is not “post mortem” really - for your users (that is “we”) - that is just the very normal daily experience - no matter what happens at yours in the background. Can only say - as far as I am concerned - the days when everything runs smoothly and without any problems belong to the days which I experience to be exceptional. When I experience a day that runs smoothly, it makes me suspicious even. The issues that happen do not really change - nor the stories around them (except some technical details). If you should ever get tired of writing such postings, for that very part (to get tired of writing such postings) you shall get my full understanding. Maybe I think too simply (Europe here) - but a thing people pay money for is supposed to work - and the moment when it does not, that is supposed to be the exception, not the rule. Good luck further - Lara
November 13th, 2007 at 2:33 PM
Cut the handwringing and get Windlight and New Search out!
November 13th, 2007 at 2:36 PM
A few quick answers:
Dreft Jurack asks, “do you publish a list of which sims are on what server locations?”
No, as this is something that may change without notice. However, as a convenience to ourselves, we do maintain regional clustering - sim1000 and sim2000 are more likely to be in different physical locations and/or communicate over different VPNs than sim3456 and sim3457. (You can look at Help > About in the viewer to see which sim is currently running the region you’re in. But remember, that can change!) We also try to ensure that adjacent regions are in the same colo facility when possible to alleviate region crossing time issues.
Takat Su asks, “do you practice roll outs on some system - i.e. create a deployment checklist … to catch any deployment checklist errors?”
Yes, albeit without the level of detail you suggest due to time and resource limits. (Did I mention we’re hiring?) I’m normally intimately involved in the server deploys and create said checklists and get them reviewed, with rollback clauses in many cases. In this case, I was actually the person pushing the buttons for most (but not all) of the steps involved. (Did I mention we’re hiring?)
Dekka Raymaker points out a typo - fixed!
Argent Stonecutter suggests, “you could have used the het-grid capability to test-update the simulator changes”
That’s true. The extant het-grid capability, however, is just that - a capability. We are not yet at the level where we have good tools for using it; those are being developed. In practice, our use of het-grid has been prone to operator error and so we have not used it as much as we would like. Getting over that hump to catch issues earlier is becoming high priority.
Argent continues, “Perhaps it would be advisable to treat central server updates as separate operations from simulator updates?”
Yes - in fact, for future “live” deploys I’m planning to do centrals one day and simulators the next, since it’s no longer practical to do everything on one day. This provides opportunities for additional testing in this state.
November 13th, 2007 at 2:51 PM
How long until the servers (my guess would be asset servers, but what do I know) are fixed to the point where I don’t have attachments locking up on me when I teleport? If it’s not my AO that freezes on me, it’s one of my HUDs; if not that, the hair doesn’t rez, or the shoes won’t rez, or I can’t remove items that DO give me issues.
It’s getting REALLY told to teleport, relog, teleport, relog……..
November 13th, 2007 at 2:52 PM
err, getting really “OLD” to teleport, relog, ………
November 13th, 2007 at 2:52 PM
We need more blogs like this one. Then release them as a pocket book named “Linden Lab: What Really Happened Behind The Curtains.” It would make a kick-ass thriller to read during my next vacation. LOL
November 13th, 2007 at 2:56 PM
How about retasking your VPN to the backup/load balancing role it is meant for and getting a reliable dedicated connection to the co-lo?
November 13th, 2007 at 2:59 PM
Guys
This is what we want to see:
“What Have We Learned”
Thank you.
Yes - there are a lot of “well duh”s in there. Hell - there always are when you are doing this sort of thing. My feeling about the past is that while you may have learned things, they may not have been made public, or were possibly unlearned very quickly. This was - shall we say - “sub-optimal”.
What I appreciate in the above is that you have given facts, related those facts to what was going on, reflected on them and come up with ways forward. Good.
I’ve often run folks off site not because of mistakes - but because they didn’t learn from them - or even give any indication that they could learn from them.
Mistakes happen and systems need to grow and develop. This I think most of us can understand and, to some extent, appreciate. But to not listen to feedback and to not reflect on those mistakes - well - that is plain bad. But…
This shows that things are moving in the right way.
Thanks for your report and I hope (only sort of) to hear more of these detailed post-mortems later.
Also - don’t forget to post the things that went well. That is equally important. The residents need to know that every update also has its upsides - so do you.
Please keep this up. It’s appreciated.
November 13th, 2007 at 3:01 PM
[...] Second Life Blog Second Life 1.18.5 Server Deploy Post-Mortem Quote from the site - The Second Life 1.18.5 Server release included updates for several systems, [...]
November 13th, 2007 at 3:03 PM
The explanation is nice but still lacks many, Many things, such as:
Warnings… It amazes me and many other sim owners and members how a company that can pop up a Blue window to every single person online doesn’t bother to warn its users of things that can be affecting their income, their profiles, their visitors, land titles and descriptions.
How an update has:
Deleted or seems to have deleted scripts from vendors.
Has Groups tools working when they feel like working.
Has stopped some objects from accepting payments.
How the last rolling restart was a Rollback and changed many things a user may have done beforehand.
So SL users go and check all those things NOW !!!
Too many are tired of hearing:
We’re sorry for the inconvenience
and
Let’s Restart your Region
November 13th, 2007 at 3:03 PM
Brilliant blog going, about all your hardware problems. Is it for the sim servers or other servers? If it were me, I would find a decent stable hardware setup and then use that for everything; that way hardware problems would occur a lot less.
November 13th, 2007 at 3:05 PM
@Joshua Linden…
“Yes - in fact, for future “live” deploys I’m planning to do centrals one day and simulators the next, since it’s no longer practical to do everything on one day. This provides opportunities for additional testing in this state.”
Nooooo….
Leave a day inbetween at least. Let things settle. Let the team take a step back and appreciate what they’ve done and also get some rest.
Updates are stress. Too much stress degrades performance. Not enough testing leads to wrong assumptions (witness above).
Leave at least a day for the dust to settle…
please…
you know it makes sense…
And did we say - thanks for listening to us?
November 13th, 2007 at 3:05 PM
Thanks Joshua for the candid and honest appraisal of what went wrong. It’s good to see you recognise your weaknesses.
November 13th, 2007 at 3:06 PM
Great Job of explaining in detail what happened, what you did about it and the resolution. Excluding 36, (you need a new video card) I think this is the most positive blog response I have seen.
This is bleeding edge stuff, and I respect that you guys are being honest and saying it.
And I think that the other person simply wanted to know whether his sim was on an active blade, and at which colocation facility.
November 13th, 2007 at 3:09 PM
TY for the info, I understood about half of it but I’m learning fast! Much appreciate these kind of posts though.
November 13th, 2007 at 3:18 PM
so how about a refund for the sim owners
November 13th, 2007 at 3:20 PM
Very thoughtful for you to take the time to explain and keep us up to date.
November 13th, 2007 at 3:20 PM
How do you mean, good news? Compliments on that change of the unwritten rules
November 13th, 2007 at 3:23 PM
hmm i cant post my bug
November 13th, 2007 at 3:24 PM
OMG OMG BUG!!!!!!!!!!!
I can’t place objects from my inventory on the ground. The object doesn’t place, and the selection particles apparently go from the center of my HUD to wherever I try to put the object. I’ve sent multiple bug reports on this and on a camera bug that focuses on the HUD instead of inworld. I’m posting here because Joshua replies to posts and I don’t trust the bug reporting process.
November 13th, 2007 at 3:26 PM
@23 [..]I know a guy runs a little back-street server shop. On his wall is a great big sign: “100% Uptime guaranteed or your money back”. How does he do it? Redundant servers and mirror backups, just like any reliable service. When was the last time you saw Google offline? How about Yahoo? Microsoft? Quake? Unreal? (Yeah right, like Linden Lab has more cutting edge activity than any of those companies). [..]
You’re having a laugh aren’t you?
Google - Gmail often goes down, Google chat about once a week.
Yahoo - Mail services extremely unreliable and they’ve spent two years trying to get their new email service out of beta and failed.
Microsoft - Where do you want me to start? Windows update server very intermittent, downloads have failed twice this year telling all customers that they were unlicensed pirates.
Ebay - Take most of the UK site down every Friday morning for maintenance (or they used to - I’ve given up using it on Fridays).
And if you think Linden Labs aren’t cutting edge then please let me know what is. MMRPG systems like this are where it’s at. The others are just dumb websites.
On a better note - Thanks Lindens for posting a very detailed and mostly helpful blog entry. It’s what everyone has been crying out for for ages. MORE PLEASE.
One thing that would make it better: please post inworld notices when these issues are occurring. It used to happen and is very helpful.
November 13th, 2007 at 3:30 PM
¬¬ respond to my bug #69
November 13th, 2007 at 3:30 PM
Thank you… finally I see some light, after being so long in the dark!
November 13th, 2007 at 3:33 PM
Thanks.
More like this.
November 13th, 2007 at 3:36 PM
I’m normally a little critical of what gets posted here - but I have to say that I’m impressed with this level of confession that things simply didn’t go right, and a list of lessons learned from this entire fiasco. Now, as long as someone actually DOES something with this list and it doesn’t just become something that someone - a year from now - looks back and says “DUH! We should have seen it then!”.
I am very glad that we are being given a blow by blow of what happened - we, the everyday users and merchants, are some of your largest investors when you consider what we do to stimulate the economy. I do think we deserve some insight into what is going on with our investment. As one person put it, the cost of owning a sim “cost more than owning a new car” - with this kind of investment, we do deserve some feedback as to what is going on - and some accountability. I finally feel like we have seen some accountability in this post. Now, if someone actually takes the next step and implements change to do something with that accountability, things might actually get better.
I also agree with another poster that you guys REALLY need to listen to us more - We tell you things are wrong and we get brush offs and “the company line” answers. A lot of us feel like our ability to interact at all with LL has been taken away. Live help gone and any support at all is hidden behind the web site somewhere. We realize that this entire thing is a work in progress - but if you don’t listen to the people who are USING the product on a day to day basis, how can you possibly judge how it’s working? For a LONG time now, people who use the product have been saying “fix stability” and “fix what’s wrong BEFORE you implement more features”. Maybe stop announcing these “features” LONG before they are ready so people will stop hounding you to get them out BEFORE YOU FIX THE UNDERLYING PROCESSES THAT WILL SUPPORT THEM! If those are not stable, you cannot possibly expect that anything will be MORE stable with the new features on top of them. Pushing to get them out is only going to make underlying, broken or crippled processes worse.
… and to Slartibartfast Magicthise… wtf???
November 13th, 2007 at 3:41 PM
I would love to work for LL. I’m a single 35 y/o guy with a programming degree who lives and breathes all things virtual and have no ties that bind. So when you guys gonna send me a plane ticket? I can be there yesterday…
I’m actually checking out the job postings now to see what I might be interested in doing.
November 13th, 2007 at 3:46 PM
The explanation was nice, Joshua; not really honesty, more a confession. Everyone inworld knew that release sucked, but only after 2 weeks... a belated confession?
Hiring staff? Try to hire some competent IT project managers. IT project management protocols have been around for decades; sadly, it seems no one in LL actually knows how to apply them. There seem to be way too many techos running around and no real IT managers. The company may have expanded, but its management practices haven’t. No. 23 was right: when you’re charging big dollars to “play” in your world, you’d better start giving value for money. When the first viable competition comes along, a lot of Lil Lindens will be unemployed as people leave in droves.
Another old saying is “too little, too late”.
November 13th, 2007 at 3:52 PM
chrism mollor comments, “don’t forget to post the things that went well.”
I refrained from doing so, but since you’re asking:
* Despite all the difficulties, at least 20,000 residents were online at all times throughout the debacle.
* During the worst point of the update (Thursday 1pm), we still had nearly 50,000 residents online.
* Most residents and most regions were unaffected by any of the updates (apart from presence glitches and region restarts)
* Where practical, changes were rolled back as soon as the problems became clear and the rollback was deemed safe.
Compare this to monolithic updates in the days of yore, where following a 6-hour downtime the login storm would crush the servers resulting in 5 hours of follow-up work to restore the service, and days of lingering bugs.
See, we do learn and improve things!
chrism mollor continues, “Leave a day inbetween at least. Let things settle. Let the team take a step back and appreciate what they’ve done and also get some rest.”
That’s a great sentiment. At least for now it’s not practical as we do want to keep the platform moving forward. We ship software to get fixes out and improve infrastructure, and we do want to get it out as fast as is practical. We also have a very small team doing the deploys as well as many other tasks, and there is pressure to get things done and out of the way. We’re working to evolve the platform so that development and deployment of different components is even further decoupled, but it’s gonna take a while to get there.
Yami Katayama writes, “We tell you things are wrong and we get brush offs and “the company line” answers. A lot of us feel like our ability to interact at all with LL has been taken away.”
Linden Lab has grown a lot in the past few years, and that can lead to challenges. When Linden was smaller, you’d probably have your issues heard directly by a developer who was involved in a change and could make a fix. Now there are more developers and a lot more support personnel trying to help and speed things along. We get more done, but this comes at the cost of not being as personally involved in every issue. (When I take the time to write things like this blog post or this comment, I’m not spending the time to fix some script issues or help two teams coordinate their projects.) It also means that feedback tends to be diluted. Trust me - there’s very little “company line” here. If you feel you’re not getting a straight answer, it’s probably because the person posting doesn’t know the problem as well as you do!
(I’m not sure if that helped or even made any sense.)
November 13th, 2007 at 3:53 PM
@30 You make good points. All of them valid. Sony isn’t exactly known for friendly customer service (I found that out when I purchased a Palm PDA. Haven’t seen those on sale at Best Buy in a long time. LOL.) But… Sony doesn’t charge the price of a car to use their system either, do they? I can play Unreal and Quake free, for the cost of the software. When people are paying $295 a month for a piece of virtual land, I expect it to run like the new car it’s costing to operate. If I bought a car and the engine died 5 times a day and the car wouldn’t start for 45 minutes several times a month and I constantly had to shut it off and restart it just to get it to drive above 20mph… I think I’d be on the dealer’s doorstep the same way I’m on Linden Lab’s doorstep here.
The basic problem is that LL is trying to shove too much into one box. There are thousands of people on SL who would be perfectly happy with fewer toys and a stable platform. If people a year ago had been given the choice between flexis and sculpties… or a lag-free crash-free platform… which do you think they would choose? If given a choice now between a stable platform and windlight… which would you choose?
Shoot, if given a choice between GROUP TEXT CHAT working and the next gizmo LL has in mind to lag the system even more… know what I’d choose? Know what I’d rather have than the next toy? GROUP NOTICES actually getting to every member in the group. That would be amazing.
@70. Don’t even get me started on Micro$oft. I’ll back you all the way there. I sometimes think Philip Linden is an alt for Bill Gates. LOL. BUT, as far as Google and Yahoo and Quake and Unreal and the others go… wha? Sorry dude, I’ve been a heavy user of the internet for years, and while there are occasional glitches in ANY system, I have never, ever seen Google totally offline, or their mail servers borked for any longer than a few minutes over a spread of several months. And I assure you that Google is just as bleeding edge as Linden Lab, if not more so (when was the last time we saw Linden Lab draw in satellite information and display it on a world-wide map system… with users being able to construct buildings on that map?). As for Unreal and Quake etc, I never, ever lost an Unreal feed in the middle of a game. Not once. And that’s with about 20 players battling 30 or 40 npc monsters in real time and so many fireworks it was hard to see which was player and which was monster.
So, we have three avatars standing around on a sim chatting, and wham, suddenly they can’t move, chat dies, and the sim crashes. Oh wow, real server taxing there…
If Linden Lab charged $49.95 a month to host a sim (or even $95 a month) I don’t think I’d be saying much. But at $295.00 a whack plus a $1,650 setup fee to stack 4 sims to a server… at those prices I expect that engine to run smooth.
* llTargetOmega and llSetRot haven’t worked for months. LL knows this and has failed to fix it.
* Group IMs have been borked for months. LL is aware of that and has failed to correct the problem. That’s simple CHAT, people. That’s not bleeding edge. Chat rooms have been doing it for decades.
* Group Notices aren’t getting out to all members. What does that take that is bleeding edge? Looking up a list of names and sending a notecard to each one? Wow, that’s a tough one. It’s understandable why it would be taking tech support almost a year, and they still haven’t got that fixed. Accessing a datafile and transfer of a notecard must be a real nightmare.
Would make Google Quake in their boots, it would be so Unreal. XD
November 13th, 2007 at 3:56 PM
Thanks for the info Joshua, many of us are starting to see the complexity of the world we “live” in. With so many intertwined systems it is obvious even small changes and human error can have big repercussions.
One thought could be a presence of in world people, either Linden staffers or responsible volunteers (say drawn from Jira regulars) who can monitor in world and report weirdness.
So many times we see the common issues before a major outage, even a few minutes advance notice that things are not right might help.
This could well be naive, but sometimes I wonder if the Lindens are ever in world any more.
November 13th, 2007 at 3:56 PM
Sadly, I’m left to this arena for this. Apologies in advance…
On the SL website, any attempt at logging in brings me right back to “Resident log-in” … no errors, just that. I enter the correct first and last name and correct password and blammo, back at the beginning. I can log into the game with no problems (same information), just not into the site to check transaction history. Anyone else?
PS - Great job on the info <–see? I can be on topic
November 13th, 2007 at 4:05 PM
[...] Second Life was freezing my computer but today it’s running fine. What happened? I guess only God and Linden Labs know the answer [...]
November 13th, 2007 at 4:05 PM
Update: My nephew changed the date on my computer to the 30th of November. Apparently that was the problem — If anyone ever has that issue, check your computer’s date setting!
November 13th, 2007 at 4:06 PM
This is the kind of post I want to see more of.
November 13th, 2007 at 4:09 PM
I do have one overwhelming question…
Why then are the Lindens not yet open sourcing the server as they (if I’m not mistaken) state they plan to do?
I’ve been very active in the open-source environs for years and when a project is made available online there appears to be no end to what can be accomplished.
Take for example the OpenSim project that aims, openly, to replicate everything that is done here in Second Life. The thing is, at the rate they are travelling with their development, they will undoubtedly surpass LL in a few years’ time and you can bet there will be a mass exodus when that time comes.
I ask why LL doesn’t open source right away? Rather than hiring a bunch of IT guys they can get countless tons of free help from all over the globe to update, modify, tweak and polish the Second Life server software and it will also remove all manner of complaint coming from the general populace.
It makes me think too because I own 9,704 Sq. M. of land “so far” and that’s $40 a month on top of my $9.95 premium. When you consider $50 a month that’s $600 I’m paying in a year right now.
I know of other 3D worlds where user-created content is incorporated, like City of Heroes, and they only charge a flat $15 a month no matter what you build or how large your group is.
I see the OpenSim project becoming the personal web servers of tomorrow where 3D “sites” will exist all over the place that one can fly through.
So my stance at present involves deciding if I’d rather consider working for LL on a closed-source project that is bug-ridden, or devote my free time to growing the OpenSim project and doing my best to make it a bug-free reliable and steadfast virtual world server that will knock the pants off LL in a few years…
Face it - competition is bound to be ferocious very soon. I mean, if LL boasts 10 Million registered users then it’s obvious someone somewhere is going to financially back a competitive environment unless LL opens the source to their server…
Just my 2 cents…
November 13th, 2007 at 4:15 PM
While I can appreciate your retrospective, saying nothing during the operation was inexcusable. The inability to teleport and the resulting hard crashes, coupled with the absence of updated reporting, caused me to revisit my own system, needlessly reset my router, and make a tech support call to my ISP.
If folks at LL were as forthright as they were at sweeping things under the carpet, you might actually gain some support. Unlike this way, which is far too “after the fact” PR.
November 13th, 2007 at 4:15 PM
Interesting, and appreciated. On your first two bullets under Take Aways:
I have been saying exactly this same thing about how the client and server updates are performed since late 2007. While it is sad to know that we as residents, some quite skilled, are not paid any heed, it is gratifying to note that LL is now truly aware of these problems, and may actually do something about them to make the updates and upgrades a smoother path for us all, residents and Lindens alike.
And just an FYI, as a Director Level Project Manager, well versed in software deployments, far more mission critical than SL, I have applied, several times. LL didn’t even have the grace to say “Thanks but no thanks”. Just ignored.
DRD
November 13th, 2007 at 4:19 PM
yah Information
Thanks for sharing
Just a thought - better ‘live news’ at the time of the problem, so people with a problem know the limitations and reduce their own expectations?
It would resolve a lot of cursing and teeth gnashing, mine are getting worn down, see?
November 13th, 2007 at 4:19 PM
Speaking as an experienced SW QA engineer, I’d be happier if the take-aways from the post-mortem were more action oriented. Stating them simply as lesson