Minor update: Yesterday we had to tweak some resources server-side. Despite fixing the 503 errors with more resources, things were still a bit sluggish, so we boosted a few more resources (expanded the cluster itself) and now the system seems quite responsive.

Please let @freemo know if you experience sluggish behavior or errors.

The only thing left to fix is some UI issues with links. Fixing some of those will actually expose features that are already technically working, like editing in place and translation. As long as nothing more pressing breaks, we should have those fixed up in short order, and then we can get out of cleanup mode and start working on the next version.

So it seems the upgrade went OK, with a few small issues we should have fixed up pretty quickly. Here are the broad points.

* This was our first big move to an entirely new architecture meant for quick upgrades with no downtime. Ironically, the move itself meant two days of downtime, but the advantage is that future upgrades should have very minimal downtime.

* We did 5 upgrades in one go, with lots of new features as a result, mostly all the new stuff from Mastodon combined with the features QOTO already had.

* The new setup was throwing some 503 errors at first; it turned out to be an out-of-memory issue. We have since significantly boosted our resources, and the system is now much faster and more responsive.

Some issues that remain, and when we will have them fixed:

* There are a few small UI issues: the link to your own profile points to the wrong place, and the "local timeline" link to bring it up is missing. We should have those back in place very shortly (probably a day or two).

* There were periodic 503 errors, but they should now be fixed; please let me know if anyone sees a 503 again.

* Editing posts in place (instead of delete & edit) now works, but the option is missing from the menu, so we need to add that.

* Translation features should also be in place, but the link is missing.

If anyone notices any other issues, message @freemo. In the meantime, give us a day or two and we should have these small issues all resolved. Thanks.

Just an FYI: on Thursday (in 2 days) we will be re-attempting our 5-in-1 update. It will be a very major update, so expect as much as 3 days of downtime, as we are finally moving to the new architecture. We **will** be up at the other end, hopefully with everything working and a few new features.

It seems our load balancer is a little underpowered and it's stalling momentarily from time to time. Most of you probably didn't even notice it, but some of you may have.

I am going to upgrade it to the next tier today. This may result in up to 5 minutes of downtime (the new architecture will have multiple load balancers).

Either some of the fixes I did yesterday wound up addressing the latency problem we had, or it was intermittent. Either way, everything has been operating fine for the last 12 hours or so. I will continue to monitor the situation, but for now everything is in the green on QOTO.

Assuming no more problems crop up, the team and I are continuing to build our beta environment, which will enable smooth update rollouts. We have about 5 updates we developed in-house that we want to apply in the near future.

One problem fixed, another crops up...

So while that last fix did make things better, there seems to be a bit of lingering slowness in a few places, though things are mostly working.

I worked all day (and it's midnight now) to bring up some very extensive monitoring tools to help the team diagnose the issue.

I will work through the weekend to improve things further, though I may need to get some sleep before I can fully resolve this. At least things are working, other than a slight bit of lag. I will keep everyone on QOTO updated.

Today we had some problems on the server and it was slower than usual. It had to do with a version mismatch in the Docker engine. I managed to keep the server running while I diagnosed the problem, but it was noticeably slow, with the occasional 404.

Now that I have found the problem I am applying the fix, but you should already notice things are mostly back to normal. In a few minutes everything should be fully responsive again.

Sorry to everyone on QOTO for the recent difficulties. This was a very big migration for us and we are doing a LOT of work, so it breaks a few things up front, but in the long term it will mean more updates, faster updates, and a more stable, scalable system.

We still have about 5 updates we are planning to apply soon; we are just perfecting the environment first and setting up a beta environment, so stay tuned.

QOTO is back up after a short downtime. As far as I can tell the fix went smoothly. Hopefully that will address the last of the problems from the migration.

In about 10 minutes QOTO will be going down briefly in an attempt to fix a 16 GB table that may be at the root of one small lingering problem post-migration. Luckily we have good backups, and the table can always be recreated from scratch.

So we should be back up shortly, hopefully with the last needed fix in place, and we can start the upgrades soon.

Health checks and internal networks are now in place. It went smoothly!

This should ensure that if our system crashes **for any reason** it will automatically restore itself, which should give us better uptime in the future.
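
For the curious, here is roughly what one of those container health checks can look like; this is just a minimal sketch, assuming the web container exposes Mastodon's local `/health` endpoint on port 3000 (the exact script, URL, and thresholds in our setup may differ):

```python
#!/usr/bin/env python3
"""Minimal health-check sketch: exit 0 if the app answers, 1 otherwise.

Assumes the web container serves Mastodon's /health endpoint on port 3000;
adjust the URL for your own deployment (this is an illustration, not our
exact script).
"""
import sys
import urllib.request

HEALTH_URL = "http://127.0.0.1:3000/health"  # hypothetical local endpoint


def main() -> int:
    try:
        with urllib.request.urlopen(HEALTH_URL, timeout=5) as resp:
            # Any 2xx response means the app is serving requests.
            return 0 if 200 <= resp.status < 300 else 1
    except Exception:
        # Timeouts, connection refused, or HTTP errors all count as unhealthy.
        return 1


if __name__ == "__main__":
    sys.exit(main())
```

The orchestration layer runs something like this on an interval and restarts the container after a few consecutive failures.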


We are going to try to add some health checks to our containers to help protect them in case they become unresponsive. No one on #QOTO should notice any real downtime unless I break something, and if I do, the downtime should be mere seconds as I bring things back up.

Just letting people know in case anything goes wrong; it shouldn't be noticeable if things go right.

After another short downtime, everything is fixed! We had to recreate an index that got lost in the migration, which was slowing down the DB and everything else. No amount of extra resources was going to help with that.

But it is fixed now, and all queues are empty or very close to it. We will now downgrade the DB to a more sane level (we upgraded it recently just to keep the system running). It will still be a pretty hefty system for us, and we can always scale back up when needed.

TL;DR: everything is working fine now.

PS: We are now going to work on a staging environment to test upgrades so we can safely start moving the main server through the upgrade cycle. Stay tuned.

So we found the real problem haunting us. It turns out we didn't even need the bigger database; there was just an index that got dropped during the migration. We are now working to put it back in place, at which point things should be back up to their normal speed.
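
For anyone curious what "putting an index back" looks like, here is a minimal sketch; the table, column, and index names below are hypothetical placeholders (not necessarily the index that was dropped), and building it `CONCURRENTLY` lets the site keep serving traffic while the index is rebuilt:

```python
# Sketch: recreate a dropped PostgreSQL index without locking the table.
# Names here are illustrative placeholders, not the actual lost index.
import psycopg2

conn = psycopg2.connect("dbname=mastodon_production user=mastodon")
conn.autocommit = True  # CREATE INDEX CONCURRENTLY cannot run inside a transaction

with conn.cursor() as cur:
    cur.execute(
        "CREATE INDEX CONCURRENTLY IF NOT EXISTS index_statuses_on_account_id "
        "ON statuses (account_id)"
    )

conn.close()
```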

I just upgraded the DB server to 2x the CPU. This seems to have fixed the underlying issue. The pull queue (not related to most things) is now recovering as well.

Reviewing everything the next day, it seems almost everything is back and working, with one exception that won't affect things too visibly.

One of our low-priority Sidekiq queues, the pull queue, is backlogging. All other queues are staying ahead of the curve.

That queue largely deals with pulling in remote media, so you may see the occasional dead image; it is partly working, though. We think the problem is with ulimits and are working on it. It shouldn't affect main operations and hopefully will be fixed soon. I will keep everyone updated.
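
If you are wondering how we check that kind of thing, here is a rough sketch of inspecting the open-file limit from inside a worker process; this is only an illustration, assuming the bottleneck really is the file-descriptor limit, and the exact limits we end up setting may differ:

```python
# Rough sketch: inspect the file-descriptor limit a worker process runs under,
# since a low "nofile" ulimit can starve a busy Sidekiq queue of sockets.
import resource

soft, hard = resource.getrlimit(resource.RLIMIT_NOFILE)
print(f"open files: soft={soft}, hard={hard}")

# A process may raise its own soft limit up to the hard limit; raising the hard
# limit itself requires root or a change in the service/container config.
resource.setrlimit(resource.RLIMIT_NOFILE, (min(65536, hard), hard))
```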

It appears that, now that the backlog is resolved, pages load quickly and images can be uploaded with little delay.

There seems to be one minor issue behind the scenes I need to tweak, but overall everything does appear to be working.

If you find any lingering issues please report them to one of the admins.

Image uploads are now immediate.

One last update before I go to bed and disappear for 12 hours.

The backlog has gone from 1.2 million at its peak earlier today to 0.4 million now, after we reconfigured things. It is steadily going down and should have everything back in working order before I get up.

One or two people were able to get images to load after a VERY long wait. So while images still aren't really working, the problem seems very likely related to the backlog. In a few hours, when the backlog clears, I expect image uploads should work again. If not, I will check what the problem is in the morning.

Other than that most things appear to be working and everything should be functional soon.

The backlog is about two-thirds cleared. This afternoon it peaked at 1.2 million and now it is ~0.4 million. I just moved to PgBouncer to speed that up a bit; it looks like I more than doubled the processing rate. Almost there.

@QOTO Recognizing all your hard work. I'm sure it isn't easy running a large Mastodon instance; it's probably a mostly thankless job. So, on that note... thank you. It is certainly appreciated.

Woot! QOTO is now 50% through catching up on the Sidekiq backlog!
