Frank Ambrose (FJ Linden)'s Blog

FJ Linden here, with my monthly grid update.

It’s been a good stretch of grid stability over the last month, with one very poor day in the mix.  Some central database issues and then a Level 3 outage in the middle of the month cascaded into a series of problems, although we were able to isolate and fix them in just over 3 hours.  However, that event only served to reinforce just how important it is to bring LLnet online, and quickly.  On that topic, I’m pleased to start this month’s updates with the status of LLnet.

LLnet 30 Days Ahead of Schedule
LLnet, our private fiber optic ring, is a good 30 days ahead of schedule. This network, which will privately interconnect our data centers, will allow us to move away from VPN reliance. “LLnet” fiber facilities have been delivered into our 3 data centers, and are currently in the configuration and testing phase with the routing infrastructure.  This work should be concluded by the end of this week, and we will then start full testing in a production environment.  We want to move as quickly as possible, but we do not want to destabilize the grid for the sake of speed, so we will take most of December to finish production testing and begin cutover of live traffic in late December or early January.  We have thousands of machines across the data centers, so the cutover process is expected to take about 60 days, but we have been very good (so far) at beating our projected dates.

HTTP Dataserver
On the infrastructure project front, we’ve completed most of the HTTP Dataserver project to migrate all C++ MySQL traffic from the MySQL protocol to HTTP(S). This project will allow us to move further away from VPN dependency, as well as off of the MySQL wire protocol over the WAN, and to better track and monitor queries. We expect to be through testing in the next week.
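
To illustrate why moving queries onto HTTP(S) makes them easier to track, here is a minimal sketch of the general pattern, not the actual Dataserver code. Everything in it is an assumption made for illustration: the JSON request shape, the `QueryLog` class, and the `build_request` and `handle_request` helpers are all hypothetical, and the HTTP transport itself is left out so the example stays self-contained.

```python
import json
import time


class QueryLog:
    """Collects one record per request so queries can be tracked and monitored."""

    def __init__(self):
        self.entries = []

    def record(self, name, seconds, ok):
        self.entries.append({"query": name, "seconds": seconds, "ok": ok})


def build_request(name, params):
    """Client side: encode a named query and its parameters as a JSON body."""
    return json.dumps({"query": name, "params": params})


def handle_request(body, handlers, log):
    """Server side: decode the body, run the matching handler, log the outcome."""
    request = json.loads(body)
    name = request["query"]
    start = time.perf_counter()
    handler = handlers.get(name)
    if handler is None:
        # Hypothetical error shape for a query name the service does not know.
        response = {"ok": False, "error": "unknown query"}
    else:
        response = {"ok": True, "result": handler(**request["params"])}
    log.record(name, time.perf_counter() - start, response["ok"])
    return json.dumps(response)
```

Because every query passes through one request handler, each call leaves a log entry with its name, duration and outcome, which is much harder to get from raw database connections opened directly across the WAN.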

Agent Inventory Services
Agent Inventory Services is scheduled to be deployed with the server code update in January.  This is one of the ongoing projects to address inventory issues for Residents.

Both of these projects are designed to improve reliability, especially for inventory delivery and database queries, by better handling messaging between the databases and simulators, as well as back to the viewer. I intend to use my December/January post to talk about our strategy for inventory services, our storage strategy and our thoughts on our data architecture.

My primary goal has always been to improve grid stability and reliability, and we are making great strides on that front. We’re not out of the woods yet, but I want to re-emphasize how important I believe it is to address “foundational” issues that have the potential to cause huge impairment (like network problems), and then decide how we scale other components of the infrastructure.

Finally, I have made some internal organizational changes over the past month that I hope will begin to drive more specialization in some key areas.  These included adding a new network director and more focused team leads managing databases, asset management, and data services.  My belief is that, in addition to sound technical strategy, we need the right organizational alignment and specialized technical skills to achieve long-term stability and scalability on the grid.

Links:

Second Life Grid Status Reports

RSS Feed for SL Grid Status Reports Page

Second Life Grid Status via Twitter

Service Disruptions Wiki page

FJ Linden here, to report on the latest Ongoing Updates from the Grid.

As I promised in my first post, this will be a regular monthly communication to keep all of you up to date on our efforts to improve grid stability and reliability. I’m finishing up my 3rd month at the Lab and have some significant progress to report.

I’m happy to report that we have an approved plan to move away from VPN reliance. We’ve finalized a design and chosen facility and equipment partners to build and deploy a private fiber optic ring to interconnect our data centers. “LLnet” will be the designation of our private network, and we have established an aggressive timeframe to activate it. I’m pushing hard to bring LLnet online by the end of this year (’08), and to begin a phased migration off of the VPNs immediately after. Given the amount of traffic to move, I would estimate completion of this project by February or March of ’09 at the latest. So we have a light at the end of the tunnel on one of our biggest stability issues.

Hello, I’m Frank Ambrose, the Senior VP of Global Technology, and I’d like to take this opportunity to let you know about some of the work we’re doing on the Second Life Grid.

By way of introduction, I’m a recent hire here at the Lab, having joined to lead our global technology team. Specifically I’ll be focused on grid infrastructure and our stability initiatives. As noted in the press release, I come to the Lab from many years at AOL (and prior to that MCI), where I experienced the kind of explosive growth, global scale and inherent stability challenges we face here at Linden Lab.

More than anything else, my tenures at those companies taught me the direct relationship between platform stability and user experience. I’m looking forward to applying that lesson, and a host of others, as we work to maintain, build and improve this complex virtual world. I am keenly aware of the pain that any service outage can cause and am both excited and confident that Linden Lab has focused the right resources to achieve this critical objective.

Given the complexities in our architecture, our stability efforts span many individual areas, most of which were detailed in Ian Linden’s May posting. Some areas will be addressed through short-term initiatives, while others will require significant re-architecture, software changes and new physical hardware. Throughout it all, we’re committed to making the transition to a more stable world as seamless and transparent to you as possible. To that end, members of my team will be using the blog regularly to provide updates on plans and progress towards meeting our stability goals.

As part of our wider stability plan, we’re targeting 4 major infrastructure points with both long- and short-term goals: Intra-Grid Network, Asset Storage Cluster, Central Databases, and Host/Transit Data Services. The strategy is to develop and deploy near-term solutions to improve stability, while looking more broadly at our architecture (hardware, software, networks, etc.). In the near term we’ve got a number of projects in flight to address some of these problem points. A couple of examples are:

- Asset collection. We’re collecting many assets that are on our storage clusters but are rarely (if ever) accessed. These assets take up critical space on the clusters and potentially degrade performance and stability as we hit volume thresholds. We’ll be moving these files to different storage mechanisms. They will still be easily accessible, and all existing assets will be preserved in a reliable storage environment, but the move will help us avoid pushing the limits of our existing storage clusters.

- Reducing the need for VPN connections. Since we don’t encrypt communication between simulators and our databases, we need a safe means of communicating across data centers, so we use VPN connections. These connections don’t scale well and can be unreliable, so establishing a new communications mechanism that is safe, scalable and reliable is another short-term project.
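
As a rough illustration of the asset collection idea in the first item above, the sketch below picks out assets that have not been accessed within a cutoff and totals the space that moving them would free. It is only a sketch under stated assumptions: the `Asset` record, the helper names, and the one-year cutoff are all hypothetical and are not taken from our actual storage systems.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass
class Asset:
    # Hypothetical record; the real asset metadata is not described here.
    asset_id: str
    size_bytes: int
    last_accessed: datetime


def select_cold_assets(assets, now, max_idle=timedelta(days=365)):
    """Return assets idle for longer than max_idle (an assumed cutoff)."""
    return [a for a in assets if now - a.last_accessed > max_idle]


def bytes_reclaimed(cold_assets):
    """Total space freed on the primary cluster if these assets are moved."""
    return sum(a.size_bytes for a in cold_assets)
```

Assets selected this way would be moved rather than deleted, so they stay retrievable from the secondary storage while the primary clusters keep headroom below their volume thresholds.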

These projects are just a sampling of the work that is currently being done to improve stability, and I’ll be reporting on their progress, as well as other short-term projects, in the coming months.

We have a lot of work to do, but be assured that we have the right resources and internal focus to achieve our stability goals. I’ve encountered many equally complex challenges, especially in my time at AOL, and these problems are all solvable with the right level of attention and technical talent. We certainly have both; now we will start delivering.