Teeple Linden's Blog

[CLOSED] Re-enabling Services

Friday, April 11th, 2008 by: Teeple Linden

[3:04 PM Pacific] All services have been restored. If you find yourself without group lookups or any of the other functionality in the second list below, please either clear your group cache (instructions are toward the bottom of this post) or relog to obtain full services.  Thanks!

[1:20 PM Pacific] Operations is re-enabling the first list of essential In World and Web services below.

The second list, part of our short-term mitigation strategy, will be returning to full functionality over the next 60 to 90 minutes. If that changes, we’ll let you know promptly.

Please be patient when logging in over the next few minutes. As is the case after any full restart, the login queues will be congested. Frustrating though this has been, please try to avoid the temptation to cancel and restart your login, and just ride the process out to keep your place in the queue.

[12:50 PM Pacific] In addition to the reduction of services outlined below, Operations will be immediately disabling the following services for at least 30 minutes:

  • Logins
  • Lindex
  • In-world transactions
  • Portions of Website functionality, including access to account pages and the support portal.

We’ll post an update on this at least every 30 minutes, and announce a return to service ASAP.

In order to increase overall stability of the grid today during peak usage hours, our operations team has disabled a set of in world functions to reduce overall database load and create a more reliable experience for everyone. As a side effect of these temporary changes, some group and avatar profile services will not be available.

Specifically:

  • Avatar profile information will not be transmitted to the viewer. This affects both floating and embedded profile windows.
  • General group information (name, charter, etc.) will not display in floating or embedded group info windows.
  • Groups will not show their member lists.
  • Group owners and officers will not be able to eject group members.
  • Group proposals will open the UI, but will fail to create.
  • About Land will show 0 for traffic. (Note: This is temporary. Traffic is still being recorded while the total is unavailable for display, and none of it will be lost.)

Please note that all of these effects are temporary and will return to normal behavior when we re-enable these services later today. At that time you should either relog or run Client -> Clear Group Cache (a Debug option) in order to refresh group behavior.
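For readers curious how switching off a lookup can reduce database load, and why a relog or cache clear is needed afterward, here is a minimal sketch. Every name in it (`DISABLED`, `lookup`, `ViewerCache`) is hypothetical, and the caching behavior is an assumption made for illustration; it is not a description of the actual grid or viewer code.

```python
# Hypothetical names throughout; an illustration, not the real implementation.
DISABLED = {"group_members", "avatar_profile"}  # lookups switched off to shed load
DB_QUERIES = 0  # counts simulated database hits


def fetch_from_db(service, key):
    """Stand-in for an expensive database query."""
    global DB_QUERIES
    DB_QUERIES += 1
    return f"{service} data for {key}"


def lookup(service, key):
    """Return None without touching the database while a service is disabled."""
    if service in DISABLED:
        return None
    return fetch_from_db(service, key)


class ViewerCache:
    """Assumed client-side cache that remembers the last answer it received."""

    def __init__(self):
        self._entries = {}

    def get(self, service, key):
        if (service, key) not in self._entries:
            self._entries[(service, key)] = lookup(service, key)
        return self._entries[(service, key)]

    def clear(self):
        self._entries.clear()


cache = ViewerCache()
during = cache.get("group_members", "my-group")  # None: service is disabled
DISABLED.clear()                                 # services are re-enabled
stale = cache.get("group_members", "my-group")   # still None: empty answer was cached
cache.clear()                                    # the "Clear Group Cache" or relog step
fresh = cache.get("group_members", "my-group")   # now answered by the database
```

In this sketch the disabled lookup never reaches the database, which is the load saving, and the cached empty answer lingers until the cache is cleared.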

We will update this blog post to indicate when these services have actually been disabled, and again when they are re-enabled.

As previously blogged, these steps are part of a temporary mitigation strategy until additional hardware can be installed, along with other scaling refinements planned for the very near future.

Thanks for your patience.

[RESOLVED] Two Gremlins Bite the Dust

Thursday, April 10th, 2008 by: Teeple Linden

[10:52 PM Pacific --teeple]

Operations has just addressed an unexpected performance hit on the central database cluster.  For a very few minutes, logins were slow or stalled, and various in world functions such as land and L$ transfers and map/search lookups were inhibited for many residents.

We were just finishing a blog post to report the service disruption when Ops found and corrected the root cause.

In addition, Ops has corrected a misconfiguration on one of our two new firewalls, which was causing timeouts on approximately 50% of attempts to hit our homepage for the past few hours, as well as inhibiting some percentage of inworld LindeX orders and land transfers.

Inconvenient and frustrating as these two gremlins were, we believe that all transactions stalled by them this evening have reverted harmlessly; that is, no residents have suffered incomplete transactions.  If you have experienced a loss this evening, please contact us via the support portal.

Thank you for your patience, and we apologize for the inconvenience.

[Completed 4:30 p.m. PST - Kate] Operations has concluded changes and restoration of group and profile services. Please clear group cache via the debug/Advanced menu or relog to have group and profile services restored.

As part of a plan to increase overall stability of the grid today during peak usage hours, our operations team will make some changes at approximately 1:00pm Pacific which will reduce overall database load and create a more reliable experience for everyone. As a side effect of these temporary changes, some group and avatar profile services will not be available.

Specifically:

  • Avatar profile information will not be transmitted to the viewer. This affects both floating and embedded profile windows.
  • General group information (name, charter, etc.) will not display in floating or embedded group info windows.
  • Groups will not show their member lists.
  • Group owners and officers will not be able to eject group members.
  • Group proposals will open the UI, but will fail to create.
  • About Land will show 0 for traffic. (Please note: this is temporary, and impacts only the display of traffic, not the recording of it.)

Please note that all of these effects are temporary and will return to normal behavior when we re-enable these services later today.

We will update this blog post to indicate when these services are re-enabled.

Over the next week or two, we will be making some changes to the database cluster that we believe will significantly reduce the effects of peak loading that many of you have experienced over the past several weeks. The mitigating measure we’re taking above is something we will only use until those permanent changes are in place.

Thanks for your patience as we work to improve the Second Life experience.

[2:30 PM --Chiyo] It is taking a while for the system to recover from the stress of today’s earlier problems. That means for the next few hours you may experience intermittent problems with teleports, rezzing objects, scripts, transactions, appearance, etc.

This also affects many regions that are still offline and have not yet restarted. The current load on the database from everyone’s return is delaying this process. Please be patient as things work themselves out and return to normal. This process may seem slow at the moment, but it is progressing as fast as we can make it happen. Thanks : )

[12:55 PM --teeple] We’re open. The outage has been successfully addressed by our service provider. Traffic has been diverted around a faulty router within a major internet carrier’s facility. Please be patient logging in for the next few minutes until the initial surge of logins has been processed. If possible, ride the login process out, rather than quitting and restarting, in order to keep your place in the login queue.

Some regions are still being returned to service, and Operations and Concierge will be working with those as quickly as possible.

[12:06 PM --teeple] Our upstream provider is continuing work to resolve the outage.

[11:01 AM --teeple] As Jack remarked earlier, this is an extraordinarily prolonged and unusual disruption. Many high-profile services and corporations are feeling the pain this morning, and full scale efforts are underway to isolate and fix the root cause.

[10:11 AM] We are still trying to put our finger on the cause of the problem. Please bear with us a while longer! We apologize again for this interruption of your weekend plans!

[09:07 AM] In order to resolve the current problems with Second Life, we need to force a logout of all users who are still logged in. We apologize for the inconvenience and hope to have the issues resolved as soon as possible. Updates will be posted here. -Lotte

[08:30 AM] Our Ops team is still working with the technicians from our ISP to fix the networking problems. Stay tuned for updates! -Lotte

[07:30AM 04/05/08 REOPENED] It seems the problem is even harder to nail down than we earlier supposed. We need to close logins again to go back to diagnostics. -Lotte

[3.30 AM] [RESOLVED] - The doors are open and we are back. It turned out there were multiple issues with our ISP’s network, which have been worked around for now. The network provider is working to resolve the issue permanently. - Matthew

[2.30 AM] Folks, it’s been a rough night so far and we apologise for the lack of service right now.

Like most network providers today, the ISPs who provide bandwidth for Second Life actually bundle many network links into large virtual links between their data centers. Traffic for many customers then flows across these bundles. Starting at approximately 19:30 PST, some of the special routing that handles aggregation of these bundles at the ISP level malfunctioned, causing severe packet loss of over 50% on the portion of traffic going to Linden Lab’s data centers. We had no option but to disable logins.
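To picture how a fault inside a bundle can translate into partial packet loss, here is a rough sketch. It assumes, purely for illustration, a bundle of four member links with traffic spread across them by hashing a flow identifier, and two of those links silently dropping everything. The link count, the hashing scheme, and the function names are all assumptions; the ISP's actual routing setup is not known to us in this detail.

```python
import hashlib


def pick_link(flow_id: str, link_count: int) -> int:
    """Map a flow to one member link of the bundle by hashing its identifier."""
    digest = hashlib.sha256(flow_id.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % link_count


def loss_fraction(flow_ids, link_count, faulty_links):
    """Fraction of flows whose traffic lands on a faulty member link."""
    flow_ids = list(flow_ids)
    lost = sum(1 for f in flow_ids if pick_link(f, link_count) in faulty_links)
    return lost / len(flow_ids)


# Hypothetical scenario: 10,000 flows, 4 member links, links 0 and 1 are faulty.
flows = [f"flow-{i}" for i in range(10_000)]
loss = loss_fraction(flows, link_count=4, faulty_links={0, 1})
print(f"approximate share of flows affected: {loss:.1%}")
```

Under these assumptions roughly half of the flows are affected, while the rest pass through untouched, which is one reason a failure like this can be hard to diagnose from the outside.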

Since this is a highly unusual failure, it’s been complex and time consuming to diagnose. We are working with our network provider’s engineers towards a solution and will be back up just as soon as we can.

– Jack Linden

Resident logins have been disabled until our top-level routing issues have been resolved.

[UPDATED 2:11 p.m. Pacific --teeple]

The tests are over.  Please remember to relog or clear your group cache to restore normal group functionality.  Thanks!

In order to test load mitigation strategies, the Operations Team will be disabling multiple in world functions for 30 minutes, starting at 1:30 p.m. Pacific.

Specifically:

  • Profile information will not load. This affects both floating and embedded profile windows.
  • General group information (name, charter, etc.) will not display in floating or embedded group info windows.
  • Groups will not show their member lists.
  • Group owners and officers will not be able to eject group members.
  • Group proposals will open the UI, but will fail to create.
  • About Land will show 0 for traffic. This is temporary.

At the conclusion of the test, you’ll need to either relog or run Client -> Clear Group Cache (a Debug option) in order to refresh group behavior.

We will conclude these tests as quickly as possible, and apologize for the inconvenience they cause.

As originally reported here, our auction pages will be off line for scheduled maintenance this evening from 8 p.m. until Midnight Pacific.

We apologize for any inconvenience this essential event may cause.

As blogged on Friday, our call center is undergoing six hours of maintenance beginning at 2 p.m. Pacific today. During the maintenance time, callers may be asked to leave messages or to call again later, depending on which inquiry type is out of direct service at the time of their call.

We apologize for the potential inconvenience to residents seeking support today, and have made every attempt to schedule this work as tightly as possible to minimize disruptions.


As reported Thursday, our support phones will be down for two hours beginning at 4 a.m. Pacific today. Any calls received by the system during the maintenance window are subject to sudden disruption. We apologize for the inconvenience. 

Our call center will be operating at reduced capacity this coming Monday 31 March between 2pm and 8pm Pacific time.

During that maintenance window, phone inquiries may be directed to leave messages for callback, or to call again later, depending on the inquiry type.

This downtime is unavoidable, but has been scheduled as tightly as possible to minimize inconvenience to our residents.

We appreciate your patience during this essential upgrade to our support services.  We’ll post a reminder Monday morning Pacific Time, and inform you promptly of any changes to the maintenance window.

Our auction facility will be off line for four hours of scheduled maintenance next Wednesday, 2 April, from 8pm until midnight Pacific time.

We apologize for any inconvenience this event may cause.

– teeple