Host Computer Services    hostcomputers.infopop.cc    Forums    HP Hardware Archive    DS10 - 630 -- Correctable Processor Event

Hello,
Can someone help me troubleshoot an ECC error on a DS10? I am somewhat stuck.

Problem: the DS10 (4 x 512 MB DIMMs) hangs intermittently.
There are a lot of "Correctable Error Throttling Notification Event Detected" entries in the error log,
but only two "630 -- Correctable Processor Event" entries, both almost one year old. So I have focused on those.

1. DC_STAT and C_STAT register
DC_STAT: x0000 0000 0000 0008 - Dcache ECC during load instruction
C_STAT: x0000 0000 0000 0003 - Single-bit Memory ECC Fill to Dcache Error

2. C_ADDR register
value: x0000 0000 260F 10C0
The error falls into array 0
AAR_0 x0000 0000 0000 7009
AAR_1 x0000 0000 4000 7009

3. XORing
XORing is enabled
Cchip_CSC_Reg x0142 4441 1515 2864 (bit 39 is clear)

4. Bits [8:7] of C_ADDR are 01 and the original array is 0, so the real array should be 1.

5. And here is the point where I am stuck: I cannot find a way to identify which DIMM in the array is causing these problems. The documentation refers to some DPR registers, which I cannot find.
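The arithmetic in steps 2 to 4 can be checked with a short Python sketch. The register values are copied from the error log in this thread, and the field positions (Asiz[15:12], Addr[34:24]) are the ones printed in the wsea output. The size rule (doubling from 16 MB per Asiz step) is an assumption chosen so that Asiz = 7 gives the 1 GB the log reports, and the function names are illustrative only. Whether the XOR step applies to a DS10 at all is a separate question.

```python
# Values copied from the error log in this thread.
C_ADDR = 0x0000_0000_260F_10C0
AAR = [0x7009, 0x4000_7009, 0x0, 0x0]  # AAR_0 .. AAR_3

MB = 1 << 20


def decode_aar(value):
    """Return (base_address, size_in_bytes) for one AAR register value."""
    asiz = (value >> 12) & 0xF        # Asiz[15:12]
    base = value & (0x7FF << 24)      # Addr[34:24], left in place as a byte address
    # ASSUMPTION: size doubles per step starting at 16 MB, so Asiz = 7 is 1 GB
    # (matching the log); Asiz = 0 means the array is not used.
    size = 0 if asiz == 0 else (16 * MB) << (asiz - 1)
    return base, size


def array_for(addr):
    """Return the number of the array whose address range contains addr."""
    for n, value in enumerate(AAR):
        base, size = decode_aar(value)
        if size and base <= addr < base + size:
            return n
    return None


base_array = array_for(C_ADDR)           # step 2
xor_bits = (C_ADDR >> 7) & 0b11          # C_ADDR[8:7], step 4
swapped_array = base_array ^ xor_bits    # only meaningful if XORing applies

print(f"array by address range: {base_array}")
print(f"C_ADDR[8:7]:            {xor_bits:02b}")
print(f"array after XOR:        {swapped_array}")
```

With the values above this reproduces the figures in the post: the address falls in array 0, bits [8:7] are 01, and the XOR result is array 1.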

Thank you for your help
Zdenek


Here is the whole wsea output:

COMMON EVENT HEADER (CEH) V2.0
Event_Leader xFFFF FFFE
Header_Length 284
Event_Length 488
Header_Rev_Major 2
Header_Rev_Minor 1
OS_Type 2 -- OpenVMS
Hardware_Arch 4 -- Alpha
CEH_Vendor_ID 3,564 -- Hewlett-Packard Company
Hdwr_Sys_Type 34 -- Tsunami/Typhoon Corelogic
Logging_CPU 0 -- CPU Logging this Event
CPUs_In_Active_Set 1
Major_Class 26
Minor_Class 0
Entry_Type 630 -- Correctable Processor Event
DSR_Msg_Num 1,839 -- AlphaServer DS10 466Mhz
Chip_Type 8 -- EV6 - 21264
CEH_Device 0
CEH_Device_ID_0 x0000 0000
CEH_Device_ID_1 x0000 0000
CEH_Device_ID_2 x0000 0000
Unique_ID_Count 3,313
Unique_ID_Prefix 0
Exact_Length 192
Num_Strings 6



TLV Section of CEH
TLV_DSR_String AlphaServer DS10 466 MHz
TLV_Sys_Serial_Num JA22804934
TLV_Time_as_Local Sun 26 Feb 2006 11:23:20 GMT+01:00
TLV_OS_Version V7.3-2
TLV_Computer_Name PLCZ2



Entry_Type 630


Logout_Frame_CPU_Section
Frame_Size x0000 0080
Frame_Flags x8000 0000
CPU_Area_Offset x0000 0018
System_Area_Offset x0000 0058
Mchk_Error_Code x0000 0086 Machine Check Logout Frame Error Code
Value[31:0] x86 CPU Non-Fatal
Frame_Rev x0000 0001
I_STAT x0000 0000 0000 0000
DC_STAT x0000 0000 0000 0008 Dcache Status Register
ECC_Err_Ld[3] x1 Dcache ECC during load instruction
C_ADDR x0000 0000 260F 10C0 Cbox Read Erred Address Register
Error_Ref[42:6] x98 3C43 Access Reference Location
Io_M[43] x0 System Memory Access
C_SYNDROME_1 x0000 0000 0000 0000 Odd QW Data Syndrome
QW_Upper[7:0] x0 No Syndrome
C_SYNDROME_0 x0000 0000 0000 005B Even QW Data Syndrome
QW_Lower[7:0] x5B Data Bit 38
C_STAT x0000 0000 0000 0003 Cbox Read Status Register
Cbox_Error[4:0] x3 Single-bit Memory ECC Fill to Dcache Error
C_STS x0000 0000 0000 000A Cache Block Access Status Register
Cblock_Status[3:0] xA Dirty, Parity
MM_STAT x0000 0000 0000 0000 Memory Management Status Register
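The decoded fields in the CPU section follow directly from the raw register values and the bit ranges printed beside each field name. A minimal Python sketch that re-derives them is shown below; the helper name `bits` is illustrative, and only registers and bit ranges quoted in this log are used.

```python
def bits(value, hi, lo):
    """Return value[hi:lo] as an integer (inclusive bit range)."""
    return (value >> lo) & ((1 << (hi - lo + 1)) - 1)


# Raw register values copied from the logout frame above.
DC_STAT = 0x0000_0000_0000_0008
C_ADDR = 0x0000_0000_260F_10C0
C_SYNDROME_0 = 0x0000_0000_0000_005B
C_STAT = 0x0000_0000_0000_0003

fields = {
    "ECC_Err_Ld[3]": bits(DC_STAT, 3, 3),
    "Error_Ref[42:6]": bits(C_ADDR, 42, 6),
    "Io_M[43]": bits(C_ADDR, 43, 43),
    "QW_Lower[7:0]": bits(C_SYNDROME_0, 7, 0),
    "Cbox_Error[4:0]": bits(C_STAT, 4, 0),
}

for name, value in fields.items():
    print(f"{name:<18} x{value:X}")
```

Error_Ref comes out as x983C43, which is the "x98 3C43" shown in the log, so the reported address is C_ADDR with the low six bits (the offset within a 64-byte block) dropped.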



Logout_Frame_System_Section
SW_Sum_Flags x0000 0000 0000 0004 Software Summary Flags Register
Pchip_Mem_Error[2] x1 Pchip 0/1 or CPU Detected Memory ECC Error
Cchip_DIR x0000 0000 0000 0000 Cchip Device Interrupt Request Register
Cchip_MISC x0000 0000 0000 0000
Nxm[28] x0 Nxs[31:29] NOT Valid
Nxs[31:29] x0 If Nxm[28] = 1 - CPU 0 Source Device
P0_Perror x0000 0000 0000 0000 No Errors Detected
P0_PCI_Addr[47:18] x0 PCI Address Bits[31:2]
P0_Sys_Addr[50:19] x0 System Memory Address Bits [34:3]
P0_Cmd[55:52] x0 IACK - CPU PIO Read or DMA Read
P0_Synd[63:56] x0 No Syndrome
P1_Perror x0000 0000 0000 0000 Pchip1 Error Detection Register
P1_PCI_Addr[47:18] x0 PCI Address Bits[31:2]
P1_Sys_Addr[50:19] x0 System Memory Address Bits[34:3]
P1_Cmd[55:52] x0 IACK - CPU IO Read or DMA Read
P1_Synd[63:56] x0 No Syndrome



START OF SUBPACKETS IN THIS EVENT


Cchip Config Subpacket, Version 1
AAR_0 x0000 0000 0000 7009 Memory Array 0 Configuration Register
Sa0[8] x0 Non - Split Array
Asiz0[15:12] x7 1 Gb
Addr0[34:24] x0 Array0 Base Address [34:24] Bits
AAR_1 x0000 0000 4000 7009 Memory Array 1 Configuration Register
Sa1[8] x0 Non - Split Array
Asiz1[15:12] x7 1 Gb
Addr1[34:24] x40 Array1 Base Address [34:24] Bits
AAR_2 x0000 0000 0000 0000 Memory Array 2 Configuration Register
Sa2[8] x0 Non - Split Array
Asiz2[15:12] x0 Array 2 Not Used
Addr2[34:24] x0 Array2 Base Address [34:24] Bits
AAR_3 x0000 0000 0000 0000 Memory Array 3 Configuration Register
Sa3[8] x0 Non - Split Array
Asiz3[15:12] x0 Array 3 Not Used
Addr3[34:24] x0 Array3 Base Address [34:24] Bits
P0_CTL x0000 1244 014C 0091 Pchip 0 Control Register
ECCen0[18] x1 Pchip 0 ECC Detection Enabled
P1_CTL x2020 325A 434C 5008 Pchip 1 Control Register
Eccen1[18] x1 Pchip 1 ECC Detection Enabled
 
Posts: 9 | Location: Prague, CZ
How do I get out of here?
Hi Zdenek,
I will quickly try to answer a few bits for you, but I don't have time to look at everything right now.
There is no point looking for DPR registers on a DS10; it doesn't have any.
I believe your XORing step is also of little value: I suspect you have been looking at an ES40 or ES45 service guide, and the memory configuration etc. is different for a DS10. I don't think DS10s do this, but I'm not 100% sure.
I'm not sure why you are looking at events that are a year old if the system is having problems today or this week.
Machine check 630 events are CPU internal [CPU register/CPU problems and Bcache errors] and non-fatal, so it is unlikely to be a memory DIMM fault.
Going by the info in your extract of the error log, it is the transfer of data from the Bcache [L2 cache] to the Dcache [data cache] that is the problem.
This indicates that a data bit is being dropped when it is transferred from the Bcache to the Dcache, which is internal to the CPU. However, it has been corrected, so unless this continues [multiple errors over a short period of time/days] there should be no reason to change hardware for it, as the platform is doing what it was designed to do. If you wish to change something for this error, I would suggest the main logic board [the CPU/Bcache is part of this board], as it is a DS10.
The bit about throttling means that there has been a large burst of errors [typically in a very short period of time (seconds or less)] that has the potential to swamp the error log, so it has triggered a threshold which "turns the error log off" for these events for a short period of time. When logging turns back on and the same errors are still being reported, it turns off again, and so on until things settle down. Note that only the event[s] that trigger the throttling will be disabled.
This is where I suggest you spend a bit of time checking, as this is probably the problem.
You are using SEA, but if your version is old I would first suggest updating to the current version, which can be found HERE; just pick the version that you need.
Then use the following commands from the command line to filter down the error log:
wsea n sum index
! this gives you one line for every entry in the current default error log; otherwise specify your input log
wsea x filterlog XXX.errlog new.errlog "dtb=20-oct-2006,16:00 & et=mchk-all"
! specify your input log file; in this case the filter keeps only machine checks since the date and time given. Tailor the date and time to suit what the first command showed
wsea n ana input new.errlog out new.txt
! this writes to a file so you can search it, or just leave the "out new.txt" bit off and watch the screen. If this doesn't give anything, then try:
wsea n tra input new.errlog all
! the "all" bit is important, as this will show correctables, which may well be what is missing from your output in the first place. The default is NOT to show correctables. This may well help.
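Once a text report such as new.txt exists, a quick tally of entry types shows which events dominate. The Python sketch below is an illustration only: it assumes the report contains header lines shaped like the "Entry_Type 630 -- Correctable Processor Event" line quoted earlier in this thread, and the function name `tally_entry_types` is hypothetical. Other SEA versions or report types may format this line differently.

```python
import re
from collections import Counter

# Matches header lines such as "Entry_Type 630 -- Correctable Processor Event".
# A bare "Entry_Type 630" line (no description) is deliberately not counted,
# because the quoted report repeats the type that way inside each entry.
ENTRY_RE = re.compile(r"Entry_Type\s+(\d+)\s+--\s+(.+?)\s*$")


def tally_entry_types(report_text):
    """Count entries per (type number, description) in a text report."""
    counts = Counter()
    for line in report_text.splitlines():
        match = ENTRY_RE.search(line)
        if match:
            counts[(int(match.group(1)), match.group(2))] += 1
    return counts


# Small inline sample in the assumed format.
sample = """\
Entry_Type 630 -- Correctable Processor Event
Entry_Type 630
Entry_Type 630 -- Correctable Processor Event
"""

for (number, description), count in tally_entry_types(sample).most_common():
    print(f"{count:5d}  {number}  {description}")
```

To run it on a real report, pass the file contents instead of the sample, for example `tally_entry_types(open("new.txt").read())`.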
Also, if you can shut the system down, you could try "memexer" and "sys_exer" from the console prompt. You may also be able to get some information from the "info" command if the system has just crashed.

Hope this helps...

MEA[T]



Jolly Jack Tar 3E Mess HMS "GUZZ"
 
Posts: 187 | Location: GUZZ
How do I get out of here?
Carrying on from above...

If you suspect memory errors and memexer, sys_exer and SEA don't pull anything out, then work from the fact in your original post that the system has four 512 MB DIMMs.
Remove bank 1 [see inside the cover or the DS10 service guide for its location] for a few days or a week and see what happens, assuming the customer is OK with this.
If nothing happens, remove the original bank 0 DIMMs and put the other ones in their place {bank 0} for a few days or weeks.
If one of the memory sets fails, that is probably where your problem is; if it fails on both sets, it is probably a main logic board problem, so replace as required.
To track down an individual DIMM, put one from the "good" pair in with one of the others to establish the faulty one.
The main problem with this memory testing is time, and how the customer reacts to you doing it: downtime, the system crashing, etc.
If this option is not possible, get four new DIMMs, replace the customer's original ones and test the originals in your own DS10. This assumes that you have a DS10 available, as well as enough DIMMs to swap them out in the first place!
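The swap test above amounts to a small decision table, sketched here in Python. The function name and the wording of the results are illustrative; "set A" means the DIMMs originally in bank 0 and "set B" the DIMMs originally in bank 1, each run on its own in bank 0. The "neither fails" case is not covered in the post, so that branch is an assumption.

```python
def diagnose(fails_with_set_a, fails_with_set_b):
    """Interpret the two test periods of the DIMM swap procedure.

    fails_with_set_a: system misbehaved while only set A was installed.
    fails_with_set_b: system misbehaved while only set B was installed.
    """
    if fails_with_set_a and fails_with_set_b:
        return "both sets fail: suspect the main logic board"
    if fails_with_set_a:
        return "only set A fails: suspect a DIMM from the original bank 0 pair"
    if fails_with_set_b:
        return "only set B fails: suspect a DIMM from the original bank 1 pair"
    # ASSUMPTION: the post does not say what to conclude here.
    return "neither set fails: inconclusive, fault not reproduced"


for a in (False, True):
    for b in (False, True):
        print(f"set A fails={a!s:<5} set B fails={b!s:<5} -> {diagnose(a, b)}")
```

When only one set fails, the follow-up step from the post still applies: pair one DIMM from the good set with each DIMM of the failing set in turn to find the faulty one.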

Hope this helps....

MEA[T]


Jolly Jack Tar 3E Mess HMS "GUZZ"
 
Posts: 187 | Location: GUZZ