Monday, October 07, 2013

Hmm... yeah... there's your problem right there...

So, my MacBook Pro has been acting a bit wonky lately... and I'm pretty sure I know why.

A few weeks ago, I started getting disk corruption that couldn't be handled with the normal disk utilities, and required me to get a clean backup on an external drive, wipe and reformat the internal drive, and restore...

Well, I kept getting the corruption problems after a few hours or a few days... and they kept getting worse.

Finally, I ended up rebuilding the thing 5 times in two days, and 3 times in one night (this was the night before my big compliance webinar. I didn't sleep at all the day before or that night, and ended up working all night and all morning before the webinar to try to get it sorted).

This was basically two weeks of escalating pain, but until the last night the issues were intermittent with variable recurrence, so I couldn't get enough diagnostic info to nail it down.

With the 3-in-one-night episode, I was finally able to see the problem occurring...

And it's something I have NEVER seen... never even heard of...

What was happening on the drive was lots of tiny single bit/single block/single write I/O errors. OK, that happens... but why? It was a less than 90 day old, relatively high end SSD (my last SSD went bad this past summer).

So I looked deeper at the errors, and noted that not all of them were from the hard drive...

Some of them were from the DVD drive...

Which had a scratched up DVD-R in it...

I pulled out the bad DVD-R and... holy crap, no more I/O errors.

What was happening was that the particular damage on the DVD was causing the I/O controller to constantly attempt to re-read the disc, and fail... hundreds of times a second. Instead of just limiting out though, it was causing enough latency on the SSD that it was getting I/O errors as well...

Thing is... I didn't notice, because the DVD drive wasn't constantly spinning up... just a couple times an hour maybe? Which could have been explained by Finder doing crap.

I've never seen that before... never even heard of that before, in a desktop or laptop (it's something that can happen with large, high volume, high transaction count servers, if they don't have sufficient spindles or cache, and their I/O controllers don't handle the exceptions properly).

Anyway... I got that resolved, and got my MBP functional...

But ever since the last rebuild (after I figured out the problems), it's been a bit wonky. Finder doing some weird things, etc...

I've run all the normal diagnostics, and at this point I'm pretty sure that to get sorted, I'm going to need to do another clean backup, but instead of just restoring, I need to do a clean install, then migrate my apps and data.

It's a PITA, so I'm putting it off until I can't put it off anymore...

Meantime, I'm living with assorted wonkiness.

One of the items of said assortment: I hadn't really noticed it until a couple days ago, but I couldn't empty my trash.

This happens on OS X sometimes; it's not really a big deal. Usually it's a file that is locked somewhere, and it can't be forced to let go because of a zombie process, or a bad pointer somewhere, etc...

It's generally easy to fix. You just go into the trash directory from the command line, and force delete everything.

So, I went in, as root, and did a listing of my .Trash.

And it took a while... a LOOONG while... many many many screens of data flashing by my screen...

24 MILLION ITEMS... for a total of 243.8 gigabytes.

Well... there's yer problem right there...

It seems that the detritus of the multiple rebuilds... including several complete copies of my hard drive... ended up getting stuck in the trash for some reason; and couldn't empty out.
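
For what it's worth, you don't need to watch millions of lines scroll by to get totals like those. Here's a minimal Python sketch that walks a directory and tallies the item count and the bytes in files. It is not what I actually ran (that was a plain listing at the prompt), and the function name `summarize_tree` is made up for illustration.

```python
import os


def summarize_tree(root):
    """Return (item_count, total_file_bytes) for everything under root.

    Hypothetical helper for illustration. Directories count as items,
    but only files (and symlinks themselves) contribute to the byte total.
    """
    count = 0
    total = 0
    # os.walk does not descend into symlinked directories by default,
    # so a link back up the tree can't send this into a loop.
    for dirpath, dirnames, filenames in os.walk(root):
        count += len(dirnames) + len(filenames)
        for name in filenames:
            try:
                # lstat reports on the entry itself, not a symlink's target
                total += os.lstat(os.path.join(dirpath, name)).st_size
            except OSError:
                # the entry vanished or is unreadable; skip it
                pass
    return count, total


# Example use (assumes the per-user trash lives at ~/.Trash):
#   items, size = summarize_tree(os.path.expanduser("~/.Trash"))
#   print(f"{items:,} items, {size / 1e9:.1f} GB")
```

The byte total here is the sum of file lengths, so it may not match what Finder or a disk usage tool reports, since those can count allocated blocks instead.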

So, I started the force delete and went on to other things in other windows... after about 20 minutes I came back... and my command prompt hadn't come back...

I figured it had frozen up, or otherwise wasn't working; so I cancelled the job. Ran the listing again...

Nope... it had been working... It had deleted 9 million of the items, there were still 15 million left.

So I started the job back up again and went away for 20 more minutes... went back... still working...

As I was about to switch windows away it finally finished.

It took 40 minutes to delete the crap from the command line... no wonder I couldn't empty or open my trash in Finder...
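
If I had to do it again, I'd want something that tells me it's still alive, so I don't kill the job halfway through thinking it has hung. A rough Python sketch of that idea follows. The name `purge_tree` and the `report_every` parameter are invented for this example, and it's a sketch under those assumptions rather than a replacement for the force delete I actually used.

```python
import os
import sys


def purge_tree(root, report_every=100_000, out=sys.stderr):
    """Delete everything under root (but not root itself), reporting progress.

    Hypothetical helper for illustration. Returns the number of items removed.
    """
    removed = 0
    # topdown=False yields the deepest directories first, so each directory
    # has already been emptied by the time we try to remove it.
    for dirpath, dirnames, filenames in os.walk(root, topdown=False):
        entries = [(n, False) for n in filenames] + [(n, True) for n in dirnames]
        for name, is_dir in entries:
            path = os.path.join(dirpath, name)
            try:
                if is_dir and not os.path.islink(path):
                    os.rmdir(path)
                else:
                    # regular files, and symlinks (even ones pointing at dirs)
                    os.unlink(path)
            except OSError as exc:
                # report anything that can't be removed and keep going
                print(f"skipped {path}: {exc}", file=out)
                continue
            removed += 1
            if removed % report_every == 0:
                print(f"{removed:,} items removed so far", file=out)
    return removed


# Example use (destructive! assumes the per-user trash lives at ~/.Trash):
#   purge_tree(os.path.expanduser("~/.Trash"))
```

Anything it can't remove gets reported and skipped rather than stopping the whole run, so one stuck file doesn't hold up the other few million.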