Member
Hi all, just a really weird one for you to think about. Sorry - it goes on a bit. We had a power cut at our site (great fun, as I was the only server hardware engineer on site and about 3000 servers went down simultaneously). One of the ES40s from a cluster was failing to see its CI cards and so was unable to rejoin the cluster (or boot). I was there till the early hours and never really found the problem.
I ordered up a set of CI cards (two cards that work together as one). I swapped them when the new ones arrived, but they still weren't seen. I tried them in a different pair of slots and the system did see them - so I ordered up a PCI backplane. That arrived and I swapped it, put the original CI cards in and they still weren't seen. I tried the new ones and they weren't either. I put them in different slots and they were picked up straight away.
So two sets of CI cards, two PCI backplanes, same fault with any combination. It had by this time gone 1:00 am and we (I and the software guy) decided to accept defeat and get the system working. So we put the original cards back in with the original PCI backplane, but put the cards in different slots. The poor software guy then had to spend until 5:00 reconfiguring the software to use the CI cards in the new location.
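To lay out the combinations described above, here is a rough Python sketch of the swap tests. It is only an illustration: the labels, the `OBSERVATIONS` table and the function name `factors_that_explain` are made up for this example, and it assumes the last different-slot test on the new backplane used the new cards.

```python
# Each row: (card set, backplane, slot pair, cards seen by the system?)
# Rows are taken from the swap tests described in this post; labels are illustrative.
FACTORS = ("card set", "backplane", "slot pair")
OBSERVATIONS = [
    ("original", "original", "original slots", False),   # after the power cut
    ("new",      "original", "original slots", False),   # new CI cards swapped in
    ("new",      "original", "different slots", True),   # new cards moved
    ("original", "new",      "original slots", False),   # new PCI backplane
    ("new",      "new",      "original slots", False),
    ("new",      "new",      "different slots", True),   # assumed to be the new cards
    ("original", "original", "different slots", True),   # final workaround
]


def factors_that_explain(observations, factor_names):
    """Return the factors whose value alone predicts every observed outcome."""
    explaining = []
    for i, name in enumerate(factor_names):
        outcomes = {}
        for *factors, seen in observations:
            outcomes.setdefault(factors[i], set()).add(seen)
        # Consistent: each value of this factor always gave the same result.
        consistent = all(len(results) == 1 for results in outcomes.values())
        # Varies: different values of this factor gave different results.
        varies = len({next(iter(results)) for results in outcomes.values()}) > 1
        if consistent and varies:
            explaining.append(name)
    return explaining


print(factors_that_explain(OBSERVATIONS, FACTORS))  # ['slot pair']
```

In other words, the fault follows the slot position, not the cards or the backplane, which is what makes it so puzzling.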
So a workaround rather than a fix, but the problem is that this is a critical system, so I've not been able to go back to it to fix the fault. So now, months later, I still don't know what the problem is. Any ideas?


Regards,

Adam
 
Posts: 17 | Location: Lancashire
Anorak
Hello,
In order to discover the cause of this problem, can you tell me which bus and slots the CI cards were in originally, and where they are now? Thanks.


So long and thanks for all the fish
 
Posts: 75 | Location: Host Computers
Member
Hi, I've been thinking about this since I did the course (last week) and thought I knew what it was. The PChips are on the system board in the ES40, so if one of the PChips was faulty and I moved them to slots belonging to the other one, that would explain it. However, I went to check which slots I'd moved them from and to and they came from the 7th and 8th down and I moved them to 5th and 6th down. I've checked the book and these slots are all on the same hose. So I'm stumped again (it doesn't take much)!


Regards,

Adam
 
Posts: 17 | Location: Lancashire
Anorak
Hi Adam,
It looks like you have managed to stump us with this one too. We agree the working slots share the same Pchip. There are two possibilities. 1: The Pchip cannot, for some reason, identify any device in the slot that the CIPCI adapter uses to communicate with the PCI bus. Each slot is probed by SRM at initialization time: SRM waits for an interrupt from a device after bus reset, then walks down the bus slot by slot, configuring adapters as they are found. 2: SRM may have become corrupt during the power failure; SRM lives in flash memory but is not write-protected. In any case, the fact that the adapter works after moving it to another slot negates reason 2. I give in.
Regards
Paul


So long and thanks for all the fish
 
Posts: 75 | Location: Host Computers
Member
OK. Thanks anyway. By pure coincidence, I was working on this same machine again today - another power cut believe it or not! It had a dead PSU, a dead CPU and the SRM was corrupt. I used the FSL to start it and updated the firmware, so that may have corrected the old fault too!??! Still, I wasn't brave enough to move the CI cards. :-s


Regards,

Adam
 
Posts: 17 | Location: Lancashire
Member
I've had a fair share of ES40 problems concerning CPU and memory trouble. After powerdowns (accidental or intentional) the ES40 (the ES45 as well, but less so) often has a problem with its contacts. On one system equipped with 8 GB, only 2 GB came back online after a powerdown....plus I was missing one CPU. After cleaning all contacts (CPU board, memory board, DIMMs) with denatured alcohol (sorry P&S, not drinkable) everything was back online. I've had several cases with similar problems, all of which were solved by cleaning the contacts.
Logical thinking points to oxidation and cooling down (contracting) of the CI card contacts?

Regards, Bert.

"The reason for living is life itself. So live your life to the fullest and at the end you'll be able to say: My life meant something!!!"
 
Posts: 31 | Location: Netherlands
Member
Thanks - yes, I've very often fixed ES40s and other Alphas with just a clean of the contacts and a reseat of the MMBs, DIMMs and CPUs. In this case though, swapping the boards for known good ones didn't do any good, but they did work in a different slot, so I don't think cleaning would help here. Thanks for the reply though.


Regards,

Adam
 
Posts: 17 | Location: Lancashire