Tuesday, June 02, 2020

Don't Fail Closed Unless It's for Security

Apparently Plex... the leading home media server platform in the English-speaking world... is down (or at least partially and intermittently down) worldwide at the moment.

It has been for about 60-90 minutes so far. They're working on it, and you can check the status here: https://status.plex.tv/



To be clear... this isn't just the Plex web service and remote UI; local media servers are failing to display libraries and videos... Not every one, not all the time... but a lot of them, and by default (you have to manually access the direct URL for the media library you want, and sometimes that still fails).

It seems that they've got an API hook that calls home when you access your media server, and it's not supposed to be required for operations when there is internet access... but in practice, it IS required, because it's failing closed. That API hook is not completely down, but it's responding so slowly that it is effectively down, as requests time out most of the time from most servers.

Theoretically, if there's no internet access from your media server, and you access it locally via direct URI (local IP address, port, and path), your media server SHOULD just load the default page view. Though in my experience, this also fails sometimes on some clients.
UPDATE 2145 UTC: unless you access some specific URLs, some of their entire web domains or subdomains are timing out or giving server errors.
I think they may have an infrastructure issue, as well as an API issue.
For example, as of right now, the main site URL and the app URL, https://plex.tv and https://plex.tv/app, are both giving server errors.
But if you access it via https://www.plex.tv, the main page loads... until you try to sign in, at which point it starts timing out again. That's generally a session management, authentication management, load balancing, or content distribution and delivery network issue.
Then, if you attempt to sign in, sometimes it times out without presenting the login dialog, and sometimes the dialog loads. However, every time the dialog loaded, my sign-in timed out silently, either freezing or just going back to the login prompt... But the really fun part is that I got a "new login" notification email from Plex, even though the site wasn't actually granting me access.
Doing some basic systematic investigation... it's definitely a session and authentication management issue somewhere... or likely a combination of issues stacking to cause the failure. Especially as it's a timeout issue and it's intermittent, and given the URL/URI issue and the login and presentation issue, it's most likely an interaction between their load balancing/content distribution and their auth and session management API or backend service.
This is a good lesson on why you don't implement optional, non-security things with "fail closed" dependencies. The default should be: if that API hook can't hit its call home, then the default page view appears. Not "plex is unreachable".
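
To make that concrete, here is a minimal Python sketch of an optional call-home that fails open. Everything in it is hypothetical (the names `load_home_view`, `DEFAULT_VIEW`, the half-second deadline); it is not Plex's actual code, just one way the pattern could look. The optional call gets a short deadline, and any timeout or error falls back to the local default view.

```python
import concurrent.futures
import time

# Hypothetical names throughout; this is an illustration, not Plex's code.
_POOL = concurrent.futures.ThreadPoolExecutor(max_workers=4)

# What the server can render entirely from local data.
DEFAULT_VIEW = {"source": "local", "libraries": ["Movies", "TV"]}


def load_home_view(call_home, timeout_s=0.5):
    """Return the enriched view if the optional call finishes in time,
    otherwise return the local default view."""
    future = _POOL.submit(call_home)
    try:
        extras = future.result(timeout=timeout_s)
    except Exception:
        # Timeout, network error, bad response: the hook is optional,
        # so fail open and show what we already have locally.
        return dict(DEFAULT_VIEW)
    return {**DEFAULT_VIEW, "source": "remote", "extras": extras}


def slow_call_home():
    """Stand-in for a remote API that is up, but too slow to be useful."""
    time.sleep(1)
    return {"news": []}


# The remote side is "effectively down", yet the user still gets a page.
print(load_home_view(slow_call_home, timeout_s=0.1)["source"])
```

The point of the deadline is that a slow dependency is treated the same as a dead one: the user waits a fraction of a second at most, then gets the default page.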

Now... There are lots of times when you want things to fail closed. When something is not actually optional, then yes, if that thing isn't available, you should fail closed, and provide a helpful error message as to why. If something is important for security reasons and it's not available, you should definitely fail closed... Often in those circumstances you should fail closed silently, without error output, or with just generic and non-helpful output, so that the failure in security is non-obvious.
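
For contrast, here is a rough Python sketch of the security case, where failing closed is the right call. The names (`authorize`, `AccessDenied`, `check_permission`) are made up for illustration. If the permission check errors out for any reason, access is denied, the detail goes to the server-side log, and the caller only ever sees the same generic message.

```python
import logging

log = logging.getLogger("auth")


class AccessDenied(Exception):
    """Raised for every denial; the message is deliberately generic."""


def authorize(user, resource, check_permission):
    """Fail closed: any error in the security check counts as a denial."""
    try:
        allowed = check_permission(user, resource)
    except Exception:
        # Keep the real reason in server-side logs, not in the response.
        log.exception("permission check failed for %s on %s", user, resource)
        allowed = False
    if allowed is not True:
        # Same message whether the user lacks access or the check broke,
        # so the failure in security is not obvious from outside.
        raise AccessDenied("Access denied.")
    return True
```

Note that only an explicit `True` grants access here; a `None` or any other odd return value from the check is also treated as a denial.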

... But you should never fail closed on something just because it's an option you want to have, but isn't necessary for functionality and security.

Your personal gratification, and "nice to have"... or your business's desire to have some piece of data... are MUCH less important than making sure your users have the best possible functionality and user experience, as much of the time as possible.

Sadly, it's a very common flaw in both implementation and basic thought process.