Archive for the 'Operations' Category

We’ve just updated the Quality Metrics page, and the numbers show what you already know: April was not a good month for Second Life Grid availability. Our internal outage tracking tool estimates that about 630,000 usage hours were lost to global system failures over the course of the month, which is about 1.9% of the total (up from 0.06% in February and 0.22% in March), and resident surveys clearly indicate great unhappiness coinciding with these failures. (We define lost usage as how much time Residents would have spent logged in but did not, due to Grid failures; it is meant as a global availability metric and does not cover local failures like sim crashes, inventory problems, and the like. See actual[black] vs predicted[blue] concurrency graph excerpt, right.) I’d like to address the causes for this, and what we are doing about it in general terms.
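As a rough illustration of the lost-usage definition above, the sketch below sums the shortfall between predicted and actual concurrency over fixed sampling intervals. This is only a sketch under stated assumptions, not the internal outage tracking tool: the function names, the sampling interval, and the sample concurrency figures are hypothetical. Only the 630,000 hours and 1.9% figures come from the post.

```python
from typing import Sequence


def lost_usage_hours(predicted: Sequence[float],
                     actual: Sequence[float],
                     interval_hours: float) -> float:
    """Estimate lost usage (in user-hours) from two concurrency series.

    Hypothetical helper: each sample is the number of Residents logged in
    (actual) or expected to be logged in (predicted) during one interval.
    Only shortfalls count; intervals where actual exceeds predicted add 0.
    """
    if len(predicted) != len(actual):
        raise ValueError("predicted and actual must have the same length")
    shortfall = sum(max(p - a, 0.0) for p, a in zip(predicted, actual))
    return shortfall * interval_hours


def implied_total_hours(lost_hours: float, lost_share: float) -> float:
    """Back out the total usage hours implied by a lost-hours share."""
    return lost_hours / lost_share


# Made-up samples at 15-minute intervals (0.25 h): one interval with an outage.
sample_lost = lost_usage_hours([50000, 52000, 51000],
                               [50000, 20000, 51500],
                               interval_hours=0.25)

# Figures from the post: about 630,000 hours lost, about 1.9% of the total.
april_total = implied_total_hours(630_000, 0.019)
print(f"sample lost hours: {sample_lost:.0f}")
print(f"implied April total: {april_total / 1e6:.1f} million hours")
```

Taken at face value, the published figures imply a monthly total of roughly 33 million usage hours. What exactly "the total" includes is not spelled out in the post, so that number is an estimate.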
Linden Lab Production Operations has open positions for Production Operations developers and systems engineers in Australia, Singapore, the United States, and the United Kingdom. The Production Operations team is responsible for ensuring that the Second Life grid, the world’s largest collaborative real-time development environment, is up and running.
Linden Lab Operations is a Debian Linux shop. We rely extensively on OSS, and our in-house systems are usually written in Python or PHP. Our team is made up of folks who have been involved in large-scale grid management and site operations for years.
We’re looking for people who can rapidly pinpoint and diagnose network failures, deployment issues, and performance bottlenecks, and who can also create tools that improve grid stability. Production Operations works extensively with the Concierge, System Engineering, Governance, I-world, and Development teams to triage and respond to grid problems; therefore, the ability to communicate effectively with techies and non-techies is critical. The successful candidate will have substantial *nix experience and script-fu, familiarity with managing large system installations, and no fear of complex, dynamic systems.
If this sounds like you, please click here and submit your resume for one of the “Production Operation” postings (Developer or Systems Engineer).
[07:54 AM - Resolved] The database upgrade has now been completed. Thank you for your patience whilst this was going on. - Matthew
[07:05 AM - Update] The database upgrade is now under way. - Matthew
As part of a plan to increase the performance and stability of Second Life, we will be upgrading one of our central database systems on Wednesday, April 9th between 7:00 a.m. and 8:00 a.m. As a side effect, the following services will be impacted or disabled:
Second Life Functions:
- Logins
- Teleporting
- L$ Transactions
- Profiles
This upgrade is one more step towards improved performance and reliability of Second Life. We appreciate your patience during this hour.
[Completed 4:30 p.m. PST - Kate] Operations has concluded changes and restoration of group and profile services. Please clear group cache via the debug/Advanced menu or relog to have group and profile services restored.
As part of a plan to increase overall stability of the grid today during peak usage hours, our operations team will make some changes at approximately 1:00pm Pacific which will reduce overall database load and create a more reliable experience for everyone. As a side effect of these temporary changes, some group and avatar profile services will not be available.
Specifically:
- Avatar profile information will not be transmitted to the viewer. This affects both floating and embedded profile windows.
- General group information (name, charter, etc.) will not display in floating or group embedded group info windows.
- Groups will not show their member lists.
- Group owners and officers will not be able to eject group members.
- Group proposals will open the UI, but will fail to create.
- About Land will show 0 for traffic. (Please note: this is temporary, and impacts only the display of traffic, not the recording of it.)
Please note that all of these effects are temporary and will return to normal behavior when we re-enable these services later today.
We will update this blog post to indicate when these services are re-enabled.
Over the next week or two, we will be making some changes to the database cluster that we believe will significantly reduce the effects of peak loading that many of you have experienced over the past several weeks. The mitigating measure we’re taking above is something we will only use until those permanent changes are in place.
Thanks for your patience as we work to improve the Second Life experience.
[Completed 4:30 PM PST] Operations has concluded changes and restoration of group and profile services. Please clear group cache via the debug/Advanced menu or relog to have group and profile services restored.
Update 12:13 PM PST: We will begin these group changes earlier than originally expected and will commence momentarily.
As part of a plan to increase overall stability of the grid today during peak usage hours, our operations team will make some changes at 1:00pm SLT that will reduce overall database load and create a more reliable experience for everyone. As a side effect of these temporary changes, some group and avatar profile services will not be available.
Specifically:
* Avatar profile information will not be transmitted to the viewer. This affects both floating and embedded profile windows.
* General group information (name, charter, etc.) will not display in floating or group embedded group info windows.
* Groups will not show their member lists.
* Group owners and officers will not be able to eject group members.
* Group proposals will open the UI, but will fail to create.
* About Land will show 0 for traffic.
Please note that all of these effects are temporary and will return to normal behavior when we re-enable these services at approximately 4:30pm SLT. At that time you should either relog or run Client -> Clear Group Cache (a Debug option) in order to refresh group behavior.
We will update this blog post to indicate when these services have actually been disabled, and again when they are re-enabled.
Over the next week or two, we will be making some changes to the database cluster that we believe will significantly reduce the effects of peak loading that many of you have experienced over the past several weeks. The mitigating measure we’re taking above is something we will only use until those permanent changes are in place.
Thanks for your patience.
[RESOLVED 04:21 AM PST] The faulty server is back in line, and you should see no more problems.
*****
We have seen reports that a number of our residents are experiencing inventory-related problems, such as slow or no loading, problems picking things up, etc. Those same residents might find themselves unable to log back in if they have left Second Life.
The underlying problem here is one of our asset servers. Our Ops Team is aware of the situation and working to resolve it as quickly as possible.
[UPDATED 2:11 p.m. Pacific --teeple]
The tests are over. Please remember to relog or clear your group cache to restore normal group functionality. Thanks!
In order to test load mitigation strategies, the Operations Team will be disabling multiple in-world functions for 30 minutes, starting at 1:30 p.m. Pacific.
Specifically:
- Profile information will not load. This affects both floating and embedded profile windows.
- General group information (name, charter, etc.) will not display in floating or group embedded group info windows.
- Groups will not show their member lists.
- Group owners and officers will not be able to eject group members.
- Group proposals will open the UI, but will fail to create.
- About Land will show 0 for traffic. This is temporary.
At the conclusion of the test, you’ll need to either relog or run Client -> Clear Group Cache (a Debug option) in order to refresh group behavior.
We will conclude these tests as quickly as possible, and apologize for the inconvenience they cause.
[ALL CLEAR 18:08 PST] We thank you for cooperating while we worked on our database. We have put the tools away and the database is humming again.
[REOPENED 17:32 PST] We have reports of transactions not completing. One of our databases is being checked and we will give an all clear as soon as possible. Until then it is best to refrain from transactions including land and L$ transfers.
[ALL-CLEAR 15:05] The database is happy again and the all-clear has been given. Inworld transactions and purchases can once again commence.
[UPDATE 14:48] Ops is working on the problem, and the database is again improving. Please continue to refrain from making inworld transactions however, until the all-clear is given again.
[REOPENED 14:30] The database is once again experiencing difficulties. Please refrain from inworld purchases and transactions until further notice. We will post progress reports as they become available.
[ALL-CLEAR 13:55] The issue has been addressed and the database is happy again. Inworld transactions and purchases can commence again.
[UPDATE 13:20] The issue has been identified, ops is working on it, and the database is improving. Please continue to refrain from making inworld transactions however, until the all-clear is given.
We are currently experiencing an issue affecting inworld transactions. Please refrain from making any purchases or transfers. We are investigating and will post additional information as it becomes available.
Further to the work carried out in late February to move some of our simulator nodes to new IP addresses, we now need to change the IP addresses for parts of our network infrastructure. This will mean 5 to 10 minute network outages for around a thousand regions at a time, spread over the next few days, as they become “disconnected” and then “reconnected” to the Second Life grid network. Note that regions will not be shut down unless necessary, but we will alert all affected regions when their times come.
(Updated 2008-03-19 03:32am PDT) We will start doing this from 5am-8am PDT (12noon-3pm GMT) Wednesday 19th March. Due to earlier, unrelated problems with our network infrastructure, we are pushing this work back 2 hours on Wednesday; work will commence at 3am PDT on subsequent days until completion.
When we modify the configuration of the affected network systems, it will result in 5 to 10 minutes downtime to the simulator nodes. While the work will be performed over the next two or three days, it will only affect a certain number of simulator hosts at a time, not the entire Second Life grid. If you are connected to an affected region while it is being worked on, an alert will be issued within that region, and once “disconnected”, you may still appear connected to the region, but you will be unable to teleport, rez or chat. Once the regions become “reconnected”, if you are still in that region, you will see an alert saying the outage for that region is over, and you should be able to continue activities as normal.
We’ll keep this post up-to-date with the details as we progress. (Complete 07:40am PDT) The parts of this process which affect your experience in-world have now ended and we’re marking this resolved here in the blog.
(Updated 2008-03-19 08:25am PDT) This has been delayed due to the earlier network issues, but we are pressing ahead with one round of region outages. This should take no longer than 5 to 10 minutes. We’ll update here when we’re done. (Complete 09:05am PDT) Work complete for this round; further rounds will not take that long!
(Updated 2008-03-20 05:15am PDT) Here we go again; another round of region outages as we reconfigure more equipment. (Complete 05:45am PDT) There are some issues with our process which we’re ironing out; the remaining outages will be quicker. We will continue with this process at around 06:15am PDT.
(Updated 2008-03-20 06:25am PDT) Second batch of outages for this morning. (Complete 06:35am PDT) We will continue with this process at around 07:00am PDT.
(Updated 2008-03-20 07:30am PDT) Third batch of outages for this morning. (Complete 07:41am PDT) We will continue with this process tomorrow from 03:00am PDT onwards.
(Updated 2008-03-21 04:33am PDT) Good morning, Second Life! It’s time for another round of network infrastructure updates. (Updated 2008-03-21 04:37am PDT) Ook! Something’s not right; we’re holding off on this right now. No regions have been affected at this time, and we’ll update here when we’re good to go again. (Updated 2008-03-21 07:47am PDT) We’re calling off any updates this morning; we will continue with this process tomorrow (Saturday) from 03:00am PDT onwards.
(Updated 2008-03-22 06:20am PDT) We are continuing our network reconfiguration work. The outage earlier this morning was due to some unexpected issues related to the reconfiguration. Again, we have reviewed the process and further outages will be planned and stable. (Complete 07:40am PDT) The parts of this process which affect your experience in-world have now ended and we’re marking this resolved here in the blog.
[RESOLVED Mar 21 2008 5:09 PDT] - The issues should all be resolved now. Please contact support if you experience any related problems.
[UPDATED Mar 21 2008 3:03 PDT] - Our teams are continuing to address these issues; however, we have no new information to provide at this time. Please continue to watch this blog for further updates.
We are aware of multiple issues affecting the whole of Second Life and are working to resolve them as quickly as possible.
Please refrain from uploading textures or making any transactions for the time being. You may notice problems logging into Second Life, grey textures/avatars, inability to teleport, and failing transactions, among other things.
We will update this blog as soon as we have information to update with.