On Monday I’m using my computer and I open up VLC to watch a data engineering course I’ve downloaded. I get 45 seconds into the video and the screen goes blank, audio still playing. 10 seconds later the audio stops too. I log into the server under my desk and reboot “Hyperion”, the VM that serves as my desktop (the server passes a Bluetooth dongle and a graphics card through to the VM, for keyboard and mouse and for monitors, respectively).
Everything seems normal. I try VLC again. This time it’s three whole videos before the screen goes black. Okay. These videos are dodgy.
I open up my phone, open the Proxmox server management app and browse to my desktop VM. I hit the reset button. Nothing. I try again a couple of times, going out of and back into the app, and then the app says that the entire server is offline. This isn’t good. Any issue inside a virtual machine that can take the hypervisor host offline is a serious issue. My mind races, thinking about viruses that can break out of VMs by exploiting the graphics chip - my graphics chip is very old indeed. I pull the cord out of the server and try to let my anxiety settle as the disks spin down and the ever-present hum of the fans disappears.
Friday.
It’s 9:30 AM on my day off.
I’ve been up since dawn cleaning the kitchen, turning my partner’s DSLR into a webcam and brainstorming ideas to make my habits more atomic.
Anything to distract myself from the crushing realisation that my server is deeply unwell. This server has single-handedly handled any and all of my compute needs since I.. ahem.. consolidated my personal infrastructure estate, 3 years ago. It’s my fileserver, my daily driver, my test rig and my sandbox. It’s a good server.
At 9AM I finally got to work. I plugged the power cord back in, ensured the server’s graphics card was plugged into a monitor and I turned the power on. 30 seconds and an obvious boot loop later, I’m unplugging the power cord and taking the side off.
With the case off and the power turned on, the motherboard has a fancy little LCD which flicks through an alphanumeric sequence and then settles on a number - 07. This is a “Dr Debug” code, a standard for reporting computer startup problems. 07 means that the issue has something to do with “ap initialization after microcode loading”. That doesn’t sound simple to fix.
Shifting gears
There are two paths I could take right now, as there often are. I could go low-tech, or I could go high-tech. What I mean is that I could try to figure out what the hell “ap initialization after microcode loading” is, or I could just start unplugging things and swapping things in and out until it either boots or I get a different error.
Since I’m writing a blog about it, let’s be smart about this and go high-tech to start with.
It’s 10 minutes later and I’m starting to regret my decision. Depending on who you ask, the issue might be the BIOS, faulty RAM or “some relationship to graphics”. A few other articles back up the idea of a memory fault.
Okay then. I guess it’s hardware. Time to go low-tech.
Stripsearch
The plan is to systematically remove pieces of equipment until we have the bare minimum pieces of hardware attached. If it doesn’t work with just a CPU and one stick of RAM, then it’s definitely not going to work with GPUs and hard disks on top!
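For the programmers in the audience, here is a rough sketch of that plan in Python. It is only an illustration of the idea, not something I ran against the server: the component names and the `still_fails` callback are made up, and in real life "calling" `still_fails` means hitting the power button and squinting at the debug display.

```python
from typing import Callable, List, Optional


def strip_down(
    components: List[str],
    still_fails: Callable[[List[str]], bool],
) -> Optional[str]:
    """Remove parts one at a time, re-testing after every removal.

    Returns the part whose removal made the fault go away, or None if
    the fault is still there with everything removed (which points at
    whatever is left: board, CPU or power supply).
    """
    installed = list(components)
    for part in components:
        installed.remove(part)          # pull one more part
        if not still_fails(installed):  # power on and check the code
            return part
    return None


if __name__ == "__main__":
    # Hypothetical parts list, in the order they would be pulled.
    parts = ["hard disks", "gaming GPU", "network card", "server GPU"]

    # Simulated fault: the machine fails while the network card is fitted.
    culprit = strip_down(parts, lambda fitted: "network card" in fitted)
    print(culprit)
```

Pulling one part per power cycle is slow, but it means the first clean boot names exactly one suspect.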
First step then is to unplug everything we don’t need.
I get some paper sheets (all my antistatic bags are in the garage) and lay them on my desktop to put hardware on. Last thing I want is to give a component a static shock and never be able to use it again!
Now I pop out the hard disks from the hot-swap bays - blowing the dust off before I take a picture to hide my shame - and remove all the peripherals from the I/O shield, leaving only the power supply, the wifi aerial and the monitor cable (for the server GPU) plugged in. Power cord is always plugged in. Not just to test it and turn it on, but also to make sure that the server is always grounded.
No dice. 07 still showing.
I’ve done all the easy stuff, so, after grounding myself - both spiritually and electrically - I remove my now 10-year-old gaming GPU and hit the power.
07
Network card? 07.
Both sticks of RAM? 07.
I take out the server GPU, just in case. 07.
I’m starting to get a little desperate now. It’s possible both RAM sticks failed at the same time, but it’s very unlikely. This is enterprise-grade server RAM. Kingston 16GB 2666MHz 2Rx8 unregistered ECC DIMM modules. These are the real deal. That said, I have been seeing (and ignoring) a lot of logs about “ecc corrected error” recently..
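In hindsight, those corrected errors were worth keeping an eye on. On Linux, the kernel's EDAC subsystem exposes running counters under `/sys/devices/system/edac/mc/`, one directory per memory controller, with `ce_count` (corrected) and `ue_count` (uncorrected) files. Below is a minimal sketch of reading them, assuming the EDAC driver for the platform is loaded on the host; the function name and the output format are my own invention.

```python
from pathlib import Path
from typing import Dict


def read_ecc_counts(base: str = "/sys/devices/system/edac/mc") -> Dict[str, Dict[str, int]]:
    """Return corrected/uncorrected ECC error counts per memory controller.

    Returns an empty dict if the EDAC directory does not exist, e.g.
    because no EDAC driver is loaded or the RAM is not ECC.
    """
    counts: Dict[str, Dict[str, int]] = {}
    for mc in sorted(Path(base).glob("mc[0-9]*")):
        try:
            corrected = int((mc / "ce_count").read_text().strip())
            uncorrected = int((mc / "ue_count").read_text().strip())
        except (OSError, ValueError):
            continue  # skip controllers with unreadable or odd counters
        counts[mc.name] = {"corrected": corrected, "uncorrected": uncorrected}
    return counts


if __name__ == "__main__":
    for name, c in read_ecc_counts().items():
        print(f"{name}: {c['corrected']} corrected, {c['uncorrected']} uncorrected")
```

A corrected count that keeps climbing between checks is a hint that a stick is on its way out, even while everything still appears to work.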
I go out to the garage and fetch the “test stick” - an 8GB stick of DDR4 that I got from CEX years ago to test the motherboard while I was waiting for the server RAM.
Oh seven. Right, so it’s not the RAM.
Oh god. I know what I forgot. The NVMe drives! I already popped the hot-swaps, but there are two more disks still attached. One 256GB NVMe disk and one 1TB NVMe. The reason I forgot them is that they’re hidden under that big flashy aluminium heatsink. Right, so I need to remove those too, just in case it’s one of them that’s causing the issues.
Spoiler alert. It wasn’t the NVMe drives.
Hail Mary
So this is it. With a cleared BIOS, no peripherals whatsoever and with three different sticks of RAM I get the same behaviour. A flicker of hex codes followed by 07.
It’s either the motherboard, the CPU or the PSU at this point. The only possible glimmer of hope I have is to update the BIOS to the latest supported version and hope to hell that for some reason that fixes it. At the very least, if I do that then the board will be compatible with the new generation of processors I’m already eyeing up.
I pull up the “BIOS flashback” instructions from the motherboard manual - it’s an ASRock X570 “Taichi” gaming motherboard - and I download the new BIOS. I unpack it onto a USB stick, plug that into the special “flashback” port and hold down the button.
It blinks for about 2 minutes and then stops. Success. The BIOS is now up to date. Turn on the power.
..
..
Oh Seven.
End of the line
Okay. What am I going to do now? The problem has got to be either the motherboard, the CPU or the power supply. I know power supplies can be tricky and mine is old enough to vote in Scotland. That’s something I can test.
Second, it looks like you can return a CPU to Amazon within 30 days for any reason, so I can get a new CPU, and if it doesn’t work, I can just return it.
So that’s that I guess. I’ll buy a PSU and a new CPU, and return them if they don’t fix it.
If the issue is the motherboard then I’m looking at a total rebuild. Yikes.
Now I gotta tidy up.