Hardware IBM Server x3850 X5, need help with amber light issues.

TSkualaloco79
post Oct 26 2021, 04:30 PM, updated 3y ago

New Member
*
Junior Member
12 posts

Joined: Aug 2019
Hi guys, I'm not sure whether this is the right place to ask this.

I'm currently running an IBM Server x3850 X5 as a database server on RHEL 7.3 (Maipo).
The problem is that the PCI amber light keeps coming on. The server currently has 2 PCI adapters, one of which was connected to SAN storage over fiber.
A couple of years back, the SAN storage became corrupted and some data could not be retrieved. The previous staff here redirected the data from the SAN to backup tape and disabled the connection to the SAN storage.
Now the only available backup destination is the backup tape.
The issue is that the amber PCI error light keeps coming on even though there is no connection.

1. I suspected the PCI card might be faulty, so I removed it from the slot and plugged it back in (with power removed). The amber light went off for roughly 24 hours and then came back on.

2. Fast forward 3 months: I fully removed the PCI adapter from the slot and didn't plug it back in. The amber light went off and came on again after 1 night.

From that, I concluded that there is nothing wrong with the PCI card and that the problem could be with the mainboard itself.

I've checked the logs: there are no kernel panics, and only this error message keeps popping up.

db kernel: mce: [Hardware Error]: Machine check events logged
db mcelog: Hardware event. This is not a software error.
db mcelog: MCE 0
db mcelog: CPU 6 BANK 8
db mcelog: TIME 1631766780 Thu Sep 16 12:33:00 2021
db mcelog: MCG status:
db mcelog: MCi status:
db mcelog: Error overflow
db mcelog: Corrected error
db mcelog: Error enabled
db mcelog: MCA: MEMORY CONTROLLER RD_CHANNELunspecified_ERR
db mcelog: Transaction: Memory read error
db mcelog: STATUS d00000800800009f MCGSTATUS 0
db mcelog: MCGCAP 1000c18 APICID c0 SOCKETID 3
db mcelog: CPUID Vendor Intel Family 6 Model 47
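
For anyone trying to read the STATUS line: the flags mcelog prints can be recovered from the raw value. Below is a rough Python sketch, assuming the architectural IA32_MCi_STATUS bit layout from Intel's software developer manual (VAL=63, OVER=62, UC=61, EN=60, MISCV=59, ADDRV=58, PCC=57) and the memory controller error code pattern 0000 0000 1MMM CCCC in the low 16 bits. The helper name decode_status is made up for illustration, and model-specific fields are ignored.

```python
# Architectural flag bits in IA32_MCi_STATUS (assumed from the Intel SDM layout).
FLAG_BITS = {"VAL": 63, "OVER": 62, "UC": 61, "EN": 60,
             "MISCV": 59, "ADDRV": 58, "PCC": 57}

# MMM field of a memory controller error code (0000 0000 1MMM CCCC).
MEMORY_OPS = {
    0: "generic undefined request",
    1: "memory read error",
    2: "memory write error",
    3: "address/command error",
    4: "memory scrubbing error",
}


def decode_status(status):
    """Decode the architectural parts of an MCi_STATUS value (hypothetical helper)."""
    flags = {name: bool((status >> bit) & 1) for name, bit in FLAG_BITS.items()}
    code = status & 0xFFFF  # the MCA error code lives in the low 16 bits
    result = {"flags": flags, "mca_code": code, "memory_op": None, "channel": None}
    if (code & 0xFF80) == 0x0080:  # matches the memory controller pattern
        result["memory_op"] = MEMORY_OPS.get((code >> 4) & 0x7, "reserved")
        channel = code & 0xF
        # CCCC = 1111 means the channel is not specified
        result["channel"] = None if channel == 0xF else channel
    return result


if __name__ == "__main__":
    # STATUS value taken from the log above
    print(decode_status(0xD00000800800009F))
```

Applied to the value in the log, this gives VAL, OVER and EN set with UC clear (a corrected error), and a memory read error with no channel specified, which matches what mcelog printed.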


I'm still searching for what could be causing the amber light to keep coming on, and would really appreciate hearing from anyone who has dealt with a similar issue about what you did to solve it.

Much appreciated

Thank you.
hokl1010
post Oct 26 2021, 05:07 PM

New Member
*
Junior Member
36 posts

Joined: Jan 2006


It's worth taking a look at the following:

https://www.ibm.com/support/pages/interpret...850-x5-x3950-x5

https://access.redhat.com/solutions/367773

https://access.redhat.com/solutions/67599

https://forums.centos.org/viewtopic.php?t=7218

Eventless
post Oct 26 2021, 06:33 PM

Look at all my stars!!
*******
Senior Member
2,641 posts

Joined: Jan 2003
QUOTE(kualaloco79 @ Oct 26 2021, 04:30 PM)
db mcelog: MCA: MEMORY CONTROLLER RD_CHANNELunspecified_ERR
db mcelog: Transaction: Memory read error
*
Based on the error message above, it looks like a memory issue.

QUOTE(kualaloco79 @ Oct 26 2021, 04:30 PM)
db mcelog: CPU 6 BANK 8
*
I am guessing that this is the specific memory module that is causing the problem. You may need to look into the user manual to see where this memory module is located on the motherboard.

Maybe there is a diagram inside the server casing that shows the location of that cpu/ram slot ?
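
One way to check whether the same source keeps recurring is to tally the logged events by socket and bank. The sketch below is a minimal Python example, assuming the mcelog lines keep the "CPU n BANK m" and "SOCKETID n" layout shown in the log above; the function name tally_sources is made up for illustration. Note that, as far as I know, BANK in mcelog output is a machine-check register bank on the CPU rather than a DIMM slot number, so the SOCKETID is probably the more useful pointer to which processor's memory to look at.

```python
import re
from collections import Counter

CPU_BANK = re.compile(r"CPU (\d+) BANK (\d+)")
SOCKET = re.compile(r"SOCKETID (\d+)")


def tally_sources(log_text):
    """Count machine check records per (socket, cpu, bank). Hypothetical helper."""
    counts = Counter()
    cpu_bank = None
    for line in log_text.splitlines():
        match = CPU_BANK.search(line)
        if match:
            # Start of a record: remember the CPU and bank until the socket shows up
            cpu_bank = (int(match.group(1)), int(match.group(2)))
            continue
        match = SOCKET.search(line)
        if match and cpu_bank is not None:
            counts[(int(match.group(1)),) + cpu_bank] += 1
            cpu_bank = None
    return counts


if __name__ == "__main__":
    sample = """db mcelog: CPU 6 BANK 8
db mcelog: MCGCAP 1000c18 APICID c0 SOCKETID 3
"""
    for (socket, cpu, bank), n in tally_sources(sample).most_common():
        print(f"socket {socket} cpu {cpu} bank {bank}: {n} event(s)")
```

If one socket accounts for nearly all of the events, the DIMMs attached to that processor would be the first ones to check against the diagram in the manual.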
TSkualaloco79
post Oct 27 2021, 10:06 AM

New Member
*
Junior Member
12 posts

Joined: Aug 2019
QUOTE(hokl1010 @ Oct 26 2021, 05:07 PM)
It's worth taking a look at the following:



*
I did take a look at some of the solutions on the Red Hat website.

What confuses me is that the server keeps rebooting at different times. Although it doesn't really affect productivity, it keeps me wondering whether it could be an OS issue as well, rather than the hardware error the logs report.

I will do further digging on this to clarify. Thank you for the help.
TSkualaloco79
post Oct 27 2021, 10:14 AM

New Member
*
Junior Member
12 posts

Joined: Aug 2019
QUOTE(Eventless @ Oct 26 2021, 06:33 PM)
Based on the error message above, it looks like a memory issue.
I am guessing that this is the specific memory module that is causing the problem. You may need to look into the user manual to see where this memory module is located on the motherboard.

Maybe there is a diagram inside the server casing that shows the location of that cpu/ram slot ?
*
Could the memory module shown in the log be the main cause of the server rebooting unexpectedly?

There's a diagram on the top cover of the server showing the location of the memory modules. I will look into it to confirm.

Is there any configuration for the memory module that needs to be looked into before removing it, or can I just remove it? I'm afraid that removing it without knowing what will happen could jeopardize the whole system.

Thank you for your reply.
hokl1010
post Oct 27 2021, 02:34 PM

New Member
*
Junior Member
36 posts

Joined: Jan 2006


If it keeps rebooting at random intervals, the underlying issue is normally caused by hardware.

1. Consider virtualizing the server as a test server to rule out an OS issue.
2. Try changing the CPU power profile using the tuned-adm command.
3. Consider swapping in the motherboard and memory from an identical machine.
Eventless
post Oct 27 2021, 08:59 PM

Look at all my stars!!
*******
Senior Member
2,641 posts

Joined: Jan 2003
QUOTE(kualaloco79 @ Oct 27 2021, 10:14 AM)
Could the memory module shown in the log be the main cause of the server rebooting unexpectedly?

There's a diagram on the top cover of the server showing the location of the memory modules. I will look into it to confirm.

Is there any configuration for the memory module that needs to be looked into before removing it, or can I just remove it? I'm afraid that removing it without knowing what will happen could jeopardize the whole system.

Thank you for your reply.
*
If the server is rebooting unexpectedly without leaving any kind of log, it is probably caused by something else that is hardware related. How old is the server? It is possible that the power supply is starting to fail.

The configuration for the memory module layout should be mentioned in the user manual. I suggest you look there first before continuing. Any warnings should also be there.
TSkualaloco79
post Oct 28 2021, 09:06 AM

New Member
*
Junior Member
12 posts

Joined: Aug 2019
QUOTE(hokl1010 @ Oct 27 2021, 02:34 PM)
If it keeps rebooting at random intervals, the underlying issue is normally caused by hardware.

1. Consider virtualizing the server as a test server to rule out an OS issue.
2. Try changing the CPU power profile using the tuned-adm command.
3. Consider swapping in the motherboard and memory from an identical machine.
*
I will look into these as well. Swapping the server motherboard might not be possible, since the other server with the same spec hosts production.

I'm new to this server stuff, so most of the time I really need to invest time searching for a solution somewhere.

Thank you for your reply.

TSkualaloco79
post Oct 28 2021, 09:09 AM

New Member
*
Junior Member
12 posts

Joined: Aug 2019
QUOTE(Eventless @ Oct 27 2021, 08:59 PM)
If the server is rebooting unexpectedly without leaving any kind of log, it is probably caused by something else that is hardware related. How old is the server? It is possible that the power supply is starting to fail.

The configuration for the memory module layout should be mentioned in the user manual. I suggest you look there first before continuing. Any warnings should also be there.
*
It does leave some kind of log; it's just that I don't really know how to read the logs. The server is about 8 years old now. I previously checked the power supply and it looks like it is still in good condition.

I will dig into the manual for info.

Thank you for the reply. Appreciate it.
hokl1010
post Oct 28 2021, 10:18 AM

New Member
*
Junior Member
36 posts

Joined: Jan 2006


QUOTE(kualaloco79 @ Oct 28 2021, 09:09 AM)
It does leave some kind of log; it's just that I don't really know how to read the logs. The server is about 8 years old now. I previously checked the power supply and it looks like it is still in good condition.

I will dig into the manual for info.

Thank you for the reply. Appreciate it.
*
If the server is more than 6 years old, I would suggest that your organization look at either virtualizing it (on premises or in the cloud) or migrating it to a new server, since it's a DB server (which I believe is a mission-critical component for your organization). Which DB are you running on the box, by the way?

Eventless
post Oct 28 2021, 08:48 PM

Look at all my stars!!
*******
Senior Member
2,641 posts

Joined: Jan 2003
QUOTE(kualaloco79 @ Oct 28 2021, 09:09 AM)
It does leave some kind of log; it's just that I don't really know how to read the logs. The server is about 8 years old now. I previously checked the power supply and it looks like it is still in good condition.

I will dig into the manual for info.

Thank you for the reply. Appreciate it.
*
If you know when the reboots are happening, can you narrow the search of the logs to that particular time frame?
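
As a rough illustration of narrowing the search, here is a small Python sketch that pulls the mcelog "TIME" lines out of a log and keeps only the events within a window around a known reboot time. It assumes the number after TIME is a Unix epoch timestamp, as in the log posted earlier; the function name events_near and the 30 minute default window are my own choices, not anything from mcelog itself.

```python
import re
from datetime import datetime, timedelta, timezone

TIME_LINE = re.compile(r"mcelog: TIME (\d+)")


def events_near(log_text, reboot_time, window=timedelta(minutes=30)):
    """Return mcelog event times within `window` of `reboot_time`.

    `reboot_time` must be a timezone-aware datetime. Hypothetical helper.
    """
    hits = []
    for line in log_text.splitlines():
        match = TIME_LINE.search(line)
        if not match:
            continue
        # The value after TIME is treated as seconds since the Unix epoch
        when = datetime.fromtimestamp(int(match.group(1)), tz=timezone.utc)
        if abs(when - reboot_time) <= window:
            hits.append(when)
    return hits


if __name__ == "__main__":
    local = timezone(timedelta(hours=8))  # assumed server timezone (UTC+8)
    sample = "db mcelog: TIME 1631766780 Thu Sep 16 12:33:00 2021\n"
    reboot = datetime(2021, 9, 16, 12, 40, tzinfo=local)  # made-up reboot time
    for when in events_near(sample, reboot):
        print(when.astimezone(local))
```

The reboot times themselves would have to come from somewhere else, for example the output of `last reboot` on the server.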

Also, is it possible that the BIOS menu on the server has some kind of log, message or status indicator that could indicate the cause of the amber light?

Are the drives still in good condition? Try to check the SMART status of the drives to see whether there are read errors or bad sectors happening.



 
