This is just a quick post to the world about some little software things that have recently caused me a lot of headache.
On Sunday, I had to handle the first FreeBSD kernel panic I've seen in a while. Somehow, this panic was caused by booting up the Syncthing binary inside a jail. Haven't resolved it, but that happened and I'm honestly afraid to try again.
The day before, I was doing some package upgrades. Standard stuff across some jails, such as upgrading Nginx, Redis, Vim (where needed ;]), etc. In all jails, upgrades succeeded with no weird signs. That is, until the system panicked and rebooted a few times. Awesome.
Sentry's Celery workers were absolutely running out of control, Nginx wasn't servicing some pages, SSO absolutely refused to log me in, the list kept growing by the minute. Shit. Get some htop running in there to see what was happening... and of course, nothing looks too far out of the ordinary.
There's a slightly frustrating cyclical dependency going on in my infrastructure. LSSO protects all the things that are a) important, and b) served by Nginx. I was getting notifications from the alerting system I set up to ping Consul, but I couldn't get to the Consul WebUI because it's protected by LSSO, which requires Redis. So if Redis disappears... no more access to metrics monitoring, Consul health checks, etc.
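To make that chain a bit more concrete, here's a rough sketch of how you could work out what goes dark when one service dies. The dependency map is my guess at the edges from the description above, not a dump of any real config, and the names `DEPENDS_ON` and `broken_by` are made up for illustration.

```python
# Hypothetical dependency map; the edges are an assumption based on the
# description in this post, not generated from real service definitions.
DEPENDS_ON = {
    "consul-webui": ["nginx", "lsso"],
    "metrics": ["nginx", "lsso"],
    "sentry": ["nginx", "lsso", "redis"],
    "lsso": ["osiris", "redis"],
    "osiris": ["redis"],
    "nginx": [],
    "redis": [],
}


def broken_by(failed, graph=DEPENDS_ON):
    """Return every service that stops working when `failed` goes down."""
    broken = {failed}
    changed = True
    while changed:  # keep sweeping until no new casualties show up
        changed = False
        for service, deps in graph.items():
            if service not in broken and any(d in broken for d in deps):
                broken.add(service)
                changed = True
    return sorted(broken - {failed})


if __name__ == "__main__":
    print(broken_by("redis"))
```

With this map, losing Redis takes out everything that sits behind LSSO, which is exactly the "can't see the monitoring that would tell me what's wrong" situation.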
The main thing that tipped me off was the fact that the SSO auth page was returning 500 Internal Server Error. This should never happen with the SSO unless the OAuth server is down, and I can definitely see osiris in a ps aux output... A simple test from the OAuth jail showed me that--(gasp!)--Osiris can't hit Redis, which it's using to cache tokens. This just got 10x more awesome.
I went back to the host and hit the Redis jail at 500mph. Redis is definitely down. Phew, that's a relief. At least it's not something worse, right!?
service redis start and we're back in business!
Then, the machine just started laughing right in my face. Parts of Sentry, LSSO, etc., started failing again. Can't access Consul, can't access metrics. Well, what the hell! With another ps, I see that Redis is not running. "What...! I just started that!"
Weirdly enough, it looked like somehow Redis was getting killed off and then it was trying to persist the whole database to RDB and AOF. Well...wait, what? I disabled AOF and RDB when I was tuning the instance for use with Sentry. After hopping back in the Redis instance, I noticed a few fun things with the config:
- AOF settings were commented out.
- RDB settings were commented out.
- Memory settings were commented out.
- This was a distributor config file.
Somehow, during the package upgrade, my Redis config got overwritten with the default dist config. Disable AOF and RDB, set a memory limit, and get the max-memory policy set to allkeys-lru. After starting Redis, it actually stayed living, which is strange, but I'll live with it.
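Since the upgrade swapped the config silently, a small check that the tuning is still in place would have saved me some time. Here's a minimal sketch, assuming the standard redis.conf directives `appendonly`, `save`, `maxmemory` and `maxmemory-policy`. The helper names `parse_conf` and `config_drift` are hypothetical, and because I'm not listing my actual memory limit here, the check only verifies that `maxmemory` is set at all.

```python
# Settings this post says were re-applied after the upgrade. The maxmemory
# value is not given in the post, so only its presence is checked.
EXPECTED = {
    "appendonly": "no",                 # AOF off
    "save": '""',                       # RDB snapshots off
    "maxmemory-policy": "allkeys-lru",  # evict least recently used keys
}
REQUIRED = ("maxmemory",)


def parse_conf(text):
    """Map each active directive to its argument string; comments are skipped."""
    active = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        parts = line.split(None, 1)
        # A repeated directive (e.g. several `save` lines) keeps the last one.
        active[parts[0].lower()] = parts[1].strip() if len(parts) > 1 else ""
    return active


def config_drift(text):
    """List human-readable problems; an empty list means the tuning survived."""
    active = parse_conf(text)
    problems = []
    for name, want in EXPECTED.items():
        got = active.get(name)
        if got != want:
            problems.append(f"{name}: expected {want!r}, found {got!r}")
    for name in REQUIRED:
        if name not in active:
            problems.append(f"{name}: not set")
    return problems
```

Feeding it the contents of the config file after a package upgrade and alerting on a non-empty result would catch the "everything is commented out again" case described above.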
After a bit, I noticed that the RAM usage on the machine was absolutely ridiculous. Trusty old htop pins Redis as the problem. Well...time for another adventure.
There are only three pieces of my infrastructure that use Redis:
- Sentry (and Celery)
- Osiris, which only drops data in when it receives a token request, so it can't be that.
- LSSO, which stores session keys, cross-domain keys, and session check-ins, but all of those have expire times set.
That means it has to be Sentry or Celery. I routed all of the Celery brokering data into a separate database and watched the datasets. db1 (LSSO) was empty. db2 (Celery) remained relatively okay in size. db0 (Sentry) was skyrocketing. Redis' memory usage was jumping up by about 100 MB every ten minutes. After letting it run for a bit, I decided it was time to stop and flushed the 1.5 GB worth of data out of the database. What the hell is generating this much data!? I toyed with Sentry, I toyed with Celery, I looked at the data and it just wasn't making sense.
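Watching the datasets by hand gets old fast. A rough sketch of how the per-database comparison could be automated: take two snapshots of `INFO keyspace` output (lines like `db0:keys=...,expires=...,avg_ttl=...`) and diff the key counts. Key count is only a proxy for memory, and the function names `parse_keyspace` and `growth` are made up for this example. Note that Redis leaves empty databases out of that section entirely, so an empty db1 simply won't appear.

```python
def parse_keyspace(info_text):
    """Turn `INFO keyspace` output into {db_name: {field: int}}."""
    stats = {}
    for line in info_text.splitlines():
        line = line.strip()
        if not line.startswith("db") or ":" not in line:
            continue  # skips the "# Keyspace" header and blank lines
        db, _, fields = line.partition(":")
        stats[db] = {
            key: int(value)
            for key, value in (pair.split("=", 1) for pair in fields.split(","))
        }
    return stats


def growth(before, after):
    """Key-count change per database between two snapshots."""
    return {
        db: after[db]["keys"] - before.get(db, {}).get("keys", 0)
        for db in after
    }
```

Run it against two snapshots taken a few minutes apart and the database that is ballooning stands out immediately.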
I don't recall why these settings values were set the way they were, but my Sentry config had some suspicious settings:
DEBUG = True
DEBUG_PROPAGATE_EXCEPTIONS = True
These being set was probably a product of me trying to debug some plugins, but forgetting about the settings. Chalk another one up to forgetfulness. Set both to False and boom. Redis is back down to ~4 MB after a
One more story. This one is quick! I/O on the same server has been slammed on and off for a bit, but it's proven pretty difficult to debug, given the number of things actually running on that machine. Finally, I remembered that I had enabled debug logging in Nginx, because otherwise, who knows what is wrong with your rewrites and upstreams? A quick ls -lah shows this:
[root@68c39403-047d-11e5-aed9-931e8adbea11 ~]# ls -lah /var/log/nginx/debug.log
-rwxr----- 1 www www 35G Sep 29 19:45 /var/log/nginx/debug.log
35 GB of plaintext log file. Beautiful. Just beautiful.
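To avoid finding the next one by accident, here's a small sketch that walks a directory and reports anything over a size threshold. The `big_files` name, the `/var/log` root and the 1 GiB cutoff are arbitrary choices for illustration, not something I actually have deployed.

```python
import os


def big_files(root, min_bytes):
    """Return (size, path) pairs under `root` at least `min_bytes` large, biggest first."""
    found = []
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            try:
                size = os.path.getsize(path)
            except OSError:
                continue  # file vanished or is unreadable; skip it
            if size >= min_bytes:
                found.append((size, path))
    return sorted(found, reverse=True)


if __name__ == "__main__":
    # The 1 GiB threshold is an arbitrary choice for illustration.
    for size, path in big_files("/var/log", 1024 ** 3):
        print(f"{size / 1024 ** 3:.1f}G  {path}")
```

Dropping something like this into a periodic job would flag a runaway debug log long before it hits tens of gigabytes.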
Server is much happier now.