So! We had another database issue yesterday. Basically, Informix on Paddym crashed again in much the same fashion as last week, so there was some rebuilding to be done. Fortunately, no data was lost - we just had to drop and rebuild a couple of indexes and run a bunch of checks, which took a while. The server is back up and running now.
While we were dealing with that, Oscar also crashed. Again, no data was lost, but recovery was slow, and we'll have to resync the replica database on Carolyn next week during the standard outage.
Naturally, this has raised some concerns about the recent spate of server/database issues we've been experiencing. However, let me assure you that this is not a sign of an impending project collapse. These are simply normal issues that occur from time to time. There may have been some bad timing or perhaps a little bit of poor planning involved, but other than that, it's nothing more serious.
Recently, Paddym has suffered several failures, and it appears they were all caused by a single faulty disk. That disk is no longer part of the RAID configuration, but I should have pulled it from the machine last week, which might have prevented the later crashes.
The MySQL database also had some issues, which are slightly concerning. However, I believe these problems are largely related to the database's ever-growing size: there are many user/host rows that never get deleted, and we're bumping into practical MySQL limits. We can certainly tune MySQL for better performance, but keep in mind that the recent Paddym issues cause the assimilator queue to fill with waiting results, which can swell the database up to 15% beyond its normal size. To help alleviate this, my colleague Dave and I plan to start removing old/unused user/host rows from the database.
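To give a rough idea of what that cleanup might look like, here's a minimal sketch in Python. It assumes a BOINC-style schema where the host table has rpc_time (last contact) and total_credit columns; the table/column names, thresholds, and credentials are placeholders, not our actual production script.

  import time
  import mysql.connector  # pip install mysql-connector-python

  # Hypothetical policy: drop hosts that have been silent for two years
  # and never returned any credit. Adjust to taste.
  CUTOFF = int(time.time()) - 2 * 365 * 86400
  BATCH = 10000  # delete in small batches to keep each transaction short

  conn = mysql.connector.connect(host="db-server", user="boincadm",
                                 password="secret", database="boinc")
  cur = conn.cursor()

  while True:
      # Column names here (rpc_time, total_credit) are assumptions, not
      # necessarily the real schema.
      cur.execute("DELETE FROM host WHERE rpc_time < %s AND total_credit = 0 LIMIT %s",
                  (CUTOFF, BATCH))
      conn.commit()
      if cur.rowcount < BATCH:
          break

  cur.close()
  conn.close()

Deleting in small batches keeps each transaction short, so the replica on Carolyn wouldn't fall far behind while a cleanup like this runs.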
In the grand scheme of things, these are all pretty simple, manageable problems. I'm fairly sure that once we're past this cluster of headaches, things will be fine for a while. But it can't be ignored that all these random outages are causing a lot of frustration and confusion for our crunchers, and there is always room for improvement, especially since we still aren't getting as much science done as we'd like. So! How could we improve things?
More Servers: The Obvious Solution?
Adding more servers may seem like the most straightforward solution, but it comes with its own challenges. First, we're short on IP addresses at our colocation facility: we were given a /27 subnet, which works out to only 32 addresses (about 30 of them usable), and obtaining more would take significant bureaucratic effort and time. So we can't simply throw new systems in and expect them to work immediately.
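If you're curious where those numbers come from, the arithmetic is easy to check with Python's standard ipaddress module. The 192.0.2.0/27 network below is just a documentation placeholder, not our actual subnet.

  import ipaddress

  # 192.0.2.0/27 is a stand-in (TEST-NET-1); the real subnet isn't listed here.
  net = ipaddress.ip_network("192.0.2.0/27")
  print(net.num_addresses)        # 32 addresses in total
  print(len(list(net.hosts())))   # 30 usable once network/broadcast are excluded

Thirty usable addresses don't go very far once every machine on the racks needs one.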
There are workarounds we can use while we wait for additional IP addresses, but more servers also mean more to manage. In the past, "solutions" intended to improve uptime and redundancy have actually ended up reducing performance, so we need a clear plan before investing in more servers.
Moreover, an update to our server "wish list" is long overdue. A comprehensive review of our current hardware requirements and potential future needs can help guide our decisions on whether to acquire additional servers or explore other options such as cloud services or virtualization.
In short, while acquiring more servers may sound simple, it is not without its challenges. We should be mindful of the consequences and have a well-thought-out plan in place before making any decisions.
My general anxiety level is high, and I think it's mostly because we currently have very little storage capacity. We don't even have a single spare terabyte of usable archival (i.e. not necessarily fast) storage, let alone several hundred terabytes of it, or the usable SSD storage we'd want for our production databases. And even if we did have that kind of capacity, it might not be enough to solve all our problems. So, what could we do to relieve some of my anxiety? A few things come to mind.

First of all, we could try to get more (and faster) storage. If we could get a few hundred usable terabytes of archival storage and, say, 50-100TB of usable SSD storage - all of it simple, stupid, and easy to manage - my general anxiety level would drop a bit. We actually have a start on the archival part: old Sun disk arrays that are being thrown away elsewhere and that we're inheriting. One such system has 48 1TB drives in it, and we're already starting to fold it into our general framework - it's part of the effort to migrate the Astropulse db from one system to another, for example. Having a pile of super-fast disk space for our production databases wouldn't necessarily solve all our problems either, but it would still be awesome.

The problem is that SSDs are still incredibly expensive for the capacity they provide compared to cheaper alternatives like hard drives or tape, so it's not clear the speed would be worth the price just to add capacity. In short, we have some options for improving our storage situation, but it's not yet clear they'll be enough to put all our storage-related worries to rest. Still, we'll keep exploring whatever solutions we can, so we have the resources we need to keep the science running.
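For a rough sense of what that inherited 48-drive array actually buys us, here's a quick back-of-the-envelope calculation. The RAID6 layout and hot-spare count below are assumptions for illustration only, not how the array is (or will be) actually configured.

  # Rough usable capacity for a 48 x 1TB array under an assumed layout:
  # four 12-drive RAID6 groups (2 parity drives each) plus 2 hot spares.
  DRIVES = 48
  DRIVE_TB = 1.0
  GROUPS = 4
  PARITY_PER_GROUP = 2
  HOT_SPARES = 2

  data_drives = DRIVES - GROUPS * PARITY_PER_GROUP - HOT_SPARES
  print(f"~{data_drives * DRIVE_TB:.0f}TB usable out of {DRIVES * DRIVE_TB:.0f}TB raw")
  # -> ~38TB usable out of 48TB raw

So even a free rack of drives covers only a modest slice of a few-hundred-terabyte wish, which is part of why the storage question keeps my anxiety level up.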
Here are some key points to consider:
1. Databases
The database systems we currently have in place - MySQL and Informix - serve our needs well. That said, Dave is researching the possibility of migrating key parts of our science database into a cluster/cloud framework or some similar technology. This is an exciting development that could bring us the kind of fast, distributed lookups that the likes of Google and Facebook depend on.
2. Manpower
We currently have the people we need to keep SETI@home running, but there are plenty of other demands on our time and attention, including the ongoing database work and the potential migrations mentioned above. If those efforts ramp up, it may be worth hiring additional help to keep them moving forward smoothly.
Overall, while there are challenges associated with managing multiple projects simultaneously, it is important that we continue to invest in new technologies and explore new ways of doing things. With the right approach, we can achieve our goals and make significant progress in our efforts to understand the universe and search for extraterrestrial intelligence.
As I said way back when, every day here is like a game of whack-a-mole, and progress is happening on all fronts at disparate rates. I'm not sure if any of this sets troubled minds at ease, but that's the current situation. I personally think things have been pretty good lately - the goodness is just unfortunately obscured by some simultaneous server crashes and database headaches.
- Matt