Hardware IBM Server x3850 X5, need help with amber light issues.
TS kualaloco79
|
Oct 26 2021, 04:30 PM, updated 3y ago
|
New Member
|
Hi guys, I don't know whether this is the right place to ask this.
I'm currently running an IBM Server x3850 X5 as a database server on RHEL 7.3 (Maipo). The problem is that the PCI amber light keeps coming on. The server has 2 PCI adapters, 1 of which was connected to SAN storage over fibre. A couple of years back the SAN storage got corrupted and some data could not be retrieved. The previous staff redirected the data from the SAN to backup tape and disabled the connection to the SAN storage, so the backup tape is now the only backup location. The amber light on the PCI still keeps coming on even though there is no connection.
1. I suspected the PCI card might be faulty, so I removed it from the slot and plugged it back in (with power removed). The amber light went off for roughly 24 hours, then turned back on.
2. About 3 months later, I removed the PCI adapter from the slot completely and didn't plug it back in. The amber light went off, then turned on again after 1 night.
From that, I concluded that there is nothing wrong with the PCI card and the problem could be the mainboard itself.
I've checked the logs. There is no kernel panic, only this error message that keeps popping up:
db kernel: mce: [Hardware Error]: Machine check events logged
db mcelog: Hardware event. This is not a software error.
db mcelog: MCE 0
db mcelog: CPU 6 BANK 8
db mcelog: TIME 1631766780 Thu Sep 16 12:33:00 2021
db mcelog: MCG status:
db mcelog: MCi status:
db mcelog: Error overflow
db mcelog: Corrected error
db mcelog: Error enabled
db mcelog: MCA: MEMORY CONTROLLER RD_CHANNELunspecified_ERR
db mcelog: Transaction: Memory read error
db mcelog: STATUS d00000800800009f MCGSTATUS 0
db mcelog: MCGCAP 1000c18 APICID c0 SOCKETID 3
db mcelog: CPUID Vendor Intel Family 6 Model 47
I'm still searching for what could be causing the amber light to keep coming on. If any of you have dealt with a similar issue before, I'd really appreciate hearing what you did to solve it.
Much appreciated
Thank you.
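For anyone reading the mcelog output above, the STATUS value can be decoded by hand. The snippet below is a minimal sketch, assuming the bit layout Intel documents for the IA32_MCi_STATUS register (valid, overflow, uncorrected, enabled, context-corrupt flags in the top bits, and the memory controller error code format 0000 0000 1MMM CCCC in the low 16 bits). The function name `decode_mci_status` and the dictionary keys are made up for illustration; they are not part of mcelog.

```python
# Request types for the MMM field of a memory controller error code.
MEMORY_REQUEST = {
    0: "generic undefined request",
    1: "memory read",
    2: "memory write",
    3: "address/command",
    4: "memory scrubbing",
}


def decode_mci_status(status: int) -> dict:
    """Decode the architectural flag bits of an MCi_STATUS value (hypothetical helper)."""
    code = status & 0xFFFF  # low 16 bits: MCA error code
    result = {
        "valid": bool((status >> 63) & 1),            # VAL
        "overflow": bool((status >> 62) & 1),         # OVER: earlier errors were overwritten
        "uncorrected": bool((status >> 61) & 1),      # UC: 0 means the error was corrected
        "enabled": bool((status >> 60) & 1),          # EN
        "context_corrupt": bool((status >> 57) & 1),  # PCC
        "mca_code": code,
    }
    # Memory controller errors have the form 0000 0000 1MMM CCCC.
    if (code & 0xFF80) == 0x0080:
        channel = code & 0xF
        result["request"] = MEMORY_REQUEST.get((code >> 4) & 0x7, "reserved")
        # CCCC == 1111 means the channel is not specified.
        result["channel"] = None if channel == 0xF else channel
    return result


info = decode_mci_status(int("d00000800800009f", 16))
for key, value in info.items():
    print(key, value)
```

For the value in the post this agrees with what mcelog already printed: a corrected (not uncorrected) memory read error with overflow set and no channel specified. It does not by itself identify a DIMM slot.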
|
|
|
|
hokl1010
|
Oct 26 2021, 05:07 PM
|
New Member
|
|
|
|
|
Eventless
|
Oct 26 2021, 06:33 PM
|
|
QUOTE(kualaloco79 @ Oct 26 2021, 04:30 PM)
db mcelog: MCA: MEMORY CONTROLLER RD_CHANNELunspecified_ERR
db mcelog: Transaction: Memory read error
Based on the error message above, it looks like a memory issue.
QUOTE(kualaloco79 @ Oct 26 2021, 04:30 PM)
I am guessing that this is the specific memory module that is causing the problem. You may need to check the user manual to see where this memory module is located on the motherboard. Maybe there is a diagram inside the server casing that shows the location of that CPU/RAM slot?
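One way to check whether the errors keep pointing at the same place is to tally the logged events by socket, CPU and bank. This is only a rough sketch: it assumes the log lines look like the ones posted above (a "CPU n BANK n" line followed later by a "SOCKETID x" line for the same event), and `tally_mce` is a made-up helper name, not an mcelog feature. The socket ID is kept as the text that appears in the log rather than converted to a number.

```python
import re
from collections import Counter


def tally_mce(lines):
    """Count machine check events per (socket, cpu, bank) from mcelog-style lines."""
    counts = Counter()
    cpu = bank = None
    for line in lines:
        match = re.search(r"mcelog: CPU (\d+) BANK (\d+)", line)
        if match:
            # Start of a new event: remember its CPU and bank.
            cpu, bank = int(match.group(1)), int(match.group(2))
            continue
        match = re.search(r"SOCKETID (\w+)", line)
        if match and cpu is not None:
            # End of the event: record it under the socket text from the log.
            counts[(match.group(1), cpu, bank)] += 1
            cpu = bank = None
    return counts


sample = [
    "db mcelog: CPU 6 BANK 8",
    "db mcelog: MCA: MEMORY CONTROLLER RD_CHANNELunspecified_ERR",
    "db mcelog: MCGCAP 1000c18 APICID c0 SOCKETID 3",
]
for key, count in tally_mce(sample * 2).items():
    print(key, count)
```

If nearly all events land on one socket and bank, that narrows the search to the memory attached to that processor; the manual or the diagram in the casing is still needed to map it to physical slots.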
|
|
|
|
TS kualaloco79
|
Oct 27 2021, 10:06 AM
|
New Member
|
QUOTE(hokl1010 @ Oct 26 2021, 05:07 PM)
It's worth taking a look at the following:
I did take a look at some of the resolutions on the Red Hat website. What confuses me is that the server keeps rebooting at different times. It doesn't really affect productivity, but it keeps me wondering whether it could be an OS issue as well, rather than the hardware error that the logs state. I'll do some further digging to clarify. Thank you for the help.
|
|
|
|
TS kualaloco79
|
Oct 27 2021, 10:14 AM
|
New Member
|
QUOTE(Eventless @ Oct 26 2021, 06:33 PM)
Based on the error message above, it looks like a memory issue. I am guessing that this is the specific memory module that is causing the problem. You may need to check the user manual to see where this memory module is located on the motherboard. Maybe there is a diagram inside the server casing that shows the location of that CPU/RAM slot?
Could the memory module shown in the log be the main cause of the server rebooting unexpectedly? There's a diagram on the top cover of the server showing the location of the memory modules. I'll look into it to confirm. Is there any configuration on the memory module that I need to look into before removing it, or can I just remove it? I'm afraid that removing it without knowing what will happen could jeopardize the whole system. Thank you for your reply.
|
|
|
|
hokl1010
|
Oct 27 2021, 02:34 PM
|
New Member
|
If it keeps rebooting at random intervals, the underlying issue is usually hardware.
1. You can consider virtualizing the server as a test server to rule out an OS issue.
2. Try changing the CPU power profile using the command: tuned-adm
3. Consider swapping in the motherboard and memory from an identical machine.
|
|
|
|
Eventless
|
Oct 27 2021, 08:59 PM
|
|
QUOTE(kualaloco79 @ Oct 27 2021, 10:14 AM)
Could the memory module shown in the log be the main cause of the server rebooting unexpectedly? There's a diagram on the top cover of the server showing the location of the memory modules. I'll look into it to confirm. Is there any configuration on the memory module that I need to look into before removing it, or can I just remove it? I'm afraid that removing it without knowing what will happen could jeopardize the whole system. Thank you for your reply.
If the server is rebooting unexpectedly without leaving any kind of log, it is probably caused by something else that is hardware related. How old is the server? It is possible that the power supply is starting to fail. The configuration for the memory module layout should be covered in the user manual. I suggest you look there first before continuing. Any warnings should also be there.
|
|
|
|
TS kualaloco79
|
Oct 28 2021, 09:06 AM
|
New Member
|
QUOTE(hokl1010 @ Oct 27 2021, 02:34 PM)
If it keeps rebooting at random intervals, the underlying issue is usually hardware.
1. You can consider virtualizing the server as a test server to rule out an OS issue.
2. Try changing the CPU power profile using the command: tuned-adm
3. Consider swapping in the motherboard and memory from an identical machine.
I'll look into these as well. Swapping the server motherboard might not be possible, since the other server with the same spec is hosting production. I'm new to servers, so most of the time I have to spend a lot of time searching for solutions. Thank you for your reply.
|
|
|
|
TS kualaloco79
|
Oct 28 2021, 09:09 AM
|
New Member
|
QUOTE(Eventless @ Oct 27 2021, 08:59 PM)
If the server is rebooting unexpectedly without leaving any kind of log, it is probably caused by something else that is hardware related. How old is the server? It is possible that the power supply is starting to fail. The configuration for the memory module layout should be covered in the user manual. I suggest you look there first before continuing. Any warnings should also be there.
It does leave some kind of log, it's just that I don't really know how to read the logs. The server is about 8 years old now. I checked the power supply previously and it looks like it is still in good condition. I'll dig into the manual for info. Thank you for the reply, appreciate it.
|
|
|
|
hokl1010
|
Oct 28 2021, 10:18 AM
|
New Member
|
QUOTE(kualaloco79 @ Oct 28 2021, 09:09 AM)
It does leave some kind of log, it's just that I don't really know how to read the logs. The server is about 8 years old now. I checked the power supply previously and it looks like it is still in good condition. I'll dig into the manual for info. Thank you for the reply, appreciate it.
If the server is more than 6 years old, I would suggest your organization look at either virtualizing it (on premise or cloud) or migrating it to a new server, since it's a DB server (which I believe is a mission-critical component for your organization). Which DB are you running on the box, btw?
This post has been edited by hokl1010: Oct 28 2021, 10:20 AM
|
|
|
|
Eventless
|
Oct 28 2021, 08:48 PM
|
|
QUOTE(kualaloco79 @ Oct 28 2021, 09:09 AM)
It does leave some kind of log, it's just that I don't really know how to read the logs. The server is about 8 years old now. I checked the power supply previously and it looks like it is still in good condition. I'll dig into the manual for info. Thank you for the reply, appreciate it.
If you know when the reboots are happening, can you narrow the search of the logs to that particular time frame? Also, is it possible that the BIOS menu on the server has some kind of log, message or status indicator that could point to the cause of the amber light? Are the drives still in good condition? Try checking the SMART status of the drives to see whether there are read errors or bad sectors.
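Narrowing the logs to the time of a reboot could look roughly like the sketch below. It assumes the log lines start with the traditional syslog timestamp ("Sep 16 12:33:00 ..."), which has no year, so the year has to be supplied, and it assumes English month names. The function name `lines_near` and the sample lines are made up for illustration; the timestamps on the sample are not from the actual server log.

```python
from datetime import datetime, timedelta


def lines_near(lines, when, minutes=10, year=2021):
    """Return the log lines stamped within `minutes` of the datetime `when`."""
    window = timedelta(minutes=minutes)
    hits = []
    for line in lines:
        try:
            # The first 15 characters hold the "Mon DD HH:MM:SS" timestamp.
            stamp = datetime.strptime(f"{year} {line[:15]}", "%Y %b %d %H:%M:%S")
        except ValueError:
            continue  # skip lines without a parsable timestamp
        if abs(stamp - when) <= window:
            hits.append(line)
    return hits


sample_log = [
    "Sep 16 12:20:00 db systemd: Started Session 1 of user root.",
    "Sep 16 12:33:00 db kernel: mce: [Hardware Error]: Machine check events logged",
    "Sep 16 14:00:00 db systemd: Started Session 2 of user root.",
    "a line without a timestamp",
]
for line in lines_near(sample_log, datetime(2021, 9, 16, 12, 35)):
    print(line)
```

On a real system the lines would come from wherever the server keeps its system log, and the time passed in would be the time of the unexpected reboot.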
|
|
|
|