hostcomputers.infopop.cc
Forums
HP Hardware Archive
DS10 - 630 -- Correctable Processor Event
Hello,
Can someone help me troubleshoot an ECC error on a DS10? I am somewhat stuck.

Problem: the DS10 (4 x 512MB DIMMs) is hanging intermittently. There are a lot of "Correctable Error Throttling Notification Event Detected" entries in the errorlog, but only two "630 -- Correctable Processor Event" entries, and those are almost 1 year old. So I have focused on them.

1. DC_STAT and C_STAT registers
DC_STAT: x0000 0000 0000 0008 - Dcache ECC during load instruction
C_STAT: x0000 0000 0000 0003 - Single-bit Memory ECC Fill to Dcache Error

2. C_ADDR register value: x0000 0000 260F 10C0
The error falls into array 0.
AAR_0 x0000 0000 0000 7009
AAR_1 x0000 0000 4000 7009

3. XORing
XORing is enabled.
Cchip_CSC_Reg x0142 4441 1515 2864 (bit 39 is clear)

4. Bits [8:7] of C_ADDR are 01 and the original array is 0, so the real array should be 1.

5. And here is the point where I am stuck. I cannot find a way to identify which DIMM in the array is causing these problems. The documentation refers to some DPR registers, which I cannot find.

Thank you for your help
Zdenek

Here is the whole wsea output:

COMMON EVENT HEADER (CEH) V2.0
Event_Leader xFFFF FFFE
Header_Length 284
Event_Length 488
Header_Rev_Major 2
Header_Rev_Minor 1
OS_Type 2 -- OpenVMS
Hardware_Arch 4 -- Alpha
CEH_Vendor_ID 3,564 -- Hewlett-Packard Company
Hdwr_Sys_Type 34 -- Tsunami/Typhoon Corelogic
Logging_CPU 0 -- CPU Logging this Event
CPUs_In_Active_Set 1
Major_Class 26
Minor_Class 0
Entry_Type 630 -- Correctable Processor Event
DSR_Msg_Num 1,839 -- AlphaServer DS10 466Mhz
Chip_Type 8 -- EV6 - 21264
CEH_Device 0
CEH_Device_ID_0 x0000 0000
CEH_Device_ID_1 x0000 0000
CEH_Device_ID_2 x0000 0000
Unique_ID_Count 3,313
Unique_ID_Prefix 0
Exact_Length 192
Num_Strings 6

TLV Section of CEH
TLV_DSR_String AlphaServer DS10 466 MHz
TLV_Sys_Serial_Num JA22804934
TLV_Time_as_Local Sun 26 Feb 2006 11:23:20 GMT+01:00
TLV_OS_Version V7.3-2
TLV_Computer_Name PLCZ2

Entry_Type 630
Logout_Frame_CPU_Section
Frame_Size x0000 0080
Frame_Flags x8000 0000
CPU_Area_Offset x0000 0018
System_Area_Offset x0000 0058
Mchk_Error_Code x0000 0086 Machine Check Logout Frame Error Code
  Value[31:0] x86 CPU Non-Fatal
Frame_Rev x0000 0001
I_STAT x0000 0000 0000 0000
DC_STAT x0000 0000 0000 0008 Dcache Status Register
  ECC_Err_Ld[3] x1 Dcache ECC during load instruction
C_ADDR x0000 0000 260F 10C0 Cbox Read Erred Address Register
  Error_Ref[42:6] x98 3C43 Access Reference Location
  Io_M[43] x0 System Memory Access
C_SYNDROME_1 x0000 0000 0000 0000 Odd QW Data Syndrome
  QW_Upper[7:0] x0 No Syndrome
C_SYNDROME_0 x0000 0000 0000 005B Even QW Data Syndrome
  QW_Lower[7:0] x5B Data Bit 38
C_STAT x0000 0000 0000 0003 Cbox Read Status Register
  Cbox_Error[4:0] x3 Single-bit Memory ECC Fill to Dcache Error
C_STS x0000 0000 0000 000A Cache Block Access Status Register
  Cblock_Status[3:0] xA Dirty, Parity
MM_STAT x0000 0000 0000 0000 Memory Management Status Register

Logout_Frame_System_Section
SW_Sum_Flags x0000 0000 0000 0004 Software Summary Flags Register
  Pchip_Mem_Error[2] x1 Pchip 0/1 or CPU Detected Memory ECC Error
Cchip_DIR x0000 0000 0000 0000 Cchip Device Interrupt Request Register
Cchip_MISC x0000 0000 0000 0000
  Nxm[28] x0 Nxs[31:29] NOT Valid
  Nxs[31:29] x0 If Nxm[28] = 1 - CPU 0 Source Device
P0_Perror x0000 0000 0000 0000 No Errors Detected
  P0_PCI_Addr[47:18] x0 PCI Address Bits[31:2]
  P0_Sys_Addr[50:19] x0 System Memory Address Bits [34:3]
  P0_Cmd[55:52] x0 IACK - CPU PIO Read or DMA Read
  P0_Synd[63:56] x0 No Syndrome
P1_Perror x0000 0000 0000 0000 Pchip1 Error Detection Register
  P1_PCI_Addr[47:18] x0 PCI Address Bits[31:2]
  P1_Sys_Addr[50:19] x0 System Memory Address Bits[34:3]
  P1_Cmd[55:52] x0 IACK - CPU IO Read or DMA Read
  P1_Synd[63:56] x0 No Syndrome

START OF SUBPACKETS IN THIS EVENT
Cchip Config Subpacket, Version 1
AAR_0 x0000 0000 0000 7009 Memory Array 0 Configuration Register
  Sa0[8] x0 Non - Split Array
  Asiz0[15:12] x7 1 Gb
  Addr0[34:24] x0 Array0 Base Address [34:24] Bits
AAR_1 x0000 0000 4000 7009 Memory Array 1 Configuration Register
  Sa1[8] x0 Non - Split Array
  Asiz1[15:12] x7 1 Gb
  Addr1[34:24] x40 Array1 Base Address [34:24] Bits
AAR_2 x0000 0000 0000 0000 Memory Array 2 Configuration Register
  Sa2[8] x0 Non - Split Array
  Asiz2[15:12] x0 Array 2 Not Used
  Addr2[34:24] x0 Array2 Base Address [34:24] Bits
AAR_3 x0000 0000 0000 0000 Memory Array 3 Configuration Register
  Sa3[8] x0 Non - Split Array
  Asiz3[15:12] x0 Array 3 Not Used
  Addr3[34:24] x0 Array3 Base Address [34:24] Bits
P0_CTL x0000 1244 014C 0091 Pchip 0 Control Register
  ECCen0[18] x1 Pchip 0 ECC Detection Enabled
P1_CTL x2020 325A 434C 5008 Pchip 1 Control Register
  Eccen1[18] x1 Pchip 1 ECC Detection Enabled
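
The register arithmetic in steps 2 to 4 of the question can be checked with a short Python sketch. It only reproduces the poster's own reading of the registers: the helper names (`bits`, `array_base`, `containing_array`) are made up for illustration, the 1 GB array size is taken from the decoded Asiz lines in the dump rather than computed, and the rule "real array = original array XOR C_ADDR[8:7] when CSC bit 39 is clear" is the poster's interpretation of the documentation, which a later reply suspects may not apply to a DS10.

```python
def bits(value, hi, lo):
    """Return value[hi:lo] as an integer (inclusive bit range)."""
    return (value >> lo) & ((1 << (hi - lo + 1)) - 1)

# Register values copied from the wsea output in the post.
C_ADDR = 0x0000_0000_260F_10C0
CCHIP_CSC = 0x0142_4441_1515_2864
AAR = [0x0000_0000_0000_7009, 0x0000_0000_4000_7009]

# Assumption: both arrays are 1 GB, as the dump decodes Asiz = x7.
ARRAY_SIZE = 1 << 30

def array_base(aar):
    """Base address from the Addr[34:24] field of an AAR register."""
    return bits(aar, 34, 24) << 24

def containing_array(addr):
    """Index of the array whose address range holds addr, or None."""
    for index, aar in enumerate(AAR):
        base = array_base(aar)
        if base <= addr < base + ARRAY_SIZE:
            return index
    return None

original_array = containing_array(C_ADDR)   # step 2
csc_bit_39 = bits(CCHIP_CSC, 39, 39)        # step 3: clear means XORing on (poster's reading)
select = bits(C_ADDR, 8, 7)                 # step 4
real_array = original_array ^ select if csc_bit_39 == 0 else original_array

print(f"original array {original_array}, C_ADDR[8:7] = {select:02b}, real array {real_array}")
```

With the values from the log this agrees with the poster's hand calculation (array 0 before the XOR step, array 1 after it), and `bits(C_ADDR, 42, 6)` matches the Error_Ref[42:6] value x98 3C43 that wsea prints.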
Hi Zdenek,
I will quickly try to answer a few bits for you, but I don't have time to look at everything right now...

There is no point looking for DPR registers on a DS10; it doesn't have any. I believe your XORing step is also of little value, as I suspect you have been looking at an ES40 or ES45 service guide, and the memory configuration etc. is different for a DS10. I don't think DS10s do this, but I'm not 100% sure on that.

I'm not sure why you are looking at events that are a year old if the system is crashing or having problems today/this week???

Machine check 630 events are CPU internal [CPU register/CPU problems & Bcache errors] and non-fatal, so it is unlikely that this is a memory DIMM fault. Going by the info in your extract of the error log, it is the transfer of data from the Bcache [L2 cache] to the Dcache [data cache] that is the problem. This indicates that a data bit is being dropped when it is transferred from the Bcache to the Dcache, which is internal to the CPU. However, it has been corrected, so unless this continues [multiple errors over a short period of time/days] there should be no reason to change h/w, as the platform is doing what it has been designed to do. If you wish to change something for this error, then I would suggest the main logic board [the CPU/Bcache is part of this board], as it is a DS10.

The bit about throttling means that there has been a large burst of errors [typically in a very short period of time (seconds or less)] that has the potential to swamp the errorlog. This has therefore triggered a threshold which "turns the errorlog off" to these events for a short period of time. When it turns back on and the same errors are still being reported, it turns off again, and so on until things settle down. Note that only the event[s] that trigger the throttling will be disabled. This is where I suggest you spend a bit of time checking, as this is probably the problem.

You are using SEA, but if it is old I would first suggest that you update to the current version, which can be found: HERE. Just pick the correct version that you need. Then use the following commands from the command line to filter down the error log:

wsea n sum index
! This will give you one line for every entry in the current default errorlog; otherwise specify your input log.

wsea x filterlog XXX.errlog new.errlog "dtb=20-oct-2006,16:00 & et=mchk-all"
! Specify your input log file. This looks only for, in this case, all machine checks since the date & time specified. Obviously tailor this to suit the output of the first command.

wsea n ana input new.errlog out new.txt
! This will output to a file so you can search it etc., or just leave the "out new.txt" bit off & watch the screen.

If this doesn't give out anything, then try:

wsea n tra input new.errlog all
! The "all" bit is important, as this will show correctables, which may well be what you are missing from your output in the first place. The default is NOT to show correctables.

This may well help. Also, if you can shut the system down, then you could try "memexer" & "sys_exer" from the console prompt. You may also be able to get some information from the "info" cmd if the system has just crashed.

Hope this helps...
MEA[T]
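
To make the throttling behaviour described in this reply more concrete, here is a minimal Python sketch of a per-event-type throttle. It is an illustration only, not how OpenVMS or SEA implements it: the class name `ErrorThrottle` and the thresholds (`burst_limit`, `window`, `mute_for`) are invented for the example.

```python
from collections import defaultdict, deque

class ErrorThrottle:
    """Mute an event type for a while after a burst of that event (hypothetical values)."""

    def __init__(self, burst_limit=5, window=1.0, mute_for=10.0):
        self.burst_limit = burst_limit    # events allowed per window
        self.window = window              # seconds
        self.mute_for = mute_for          # seconds logging stays off after a burst
        self.recent = defaultdict(deque)  # event type -> recent timestamps
        self.muted_until = {}             # event type -> time logging resumes

    def should_log(self, event_type, now):
        # While muted, drop the event without recording it.
        if now < self.muted_until.get(event_type, float("-inf")):
            return False
        times = self.recent[event_type]
        times.append(now)
        # Forget timestamps that have fallen out of the window.
        while times and now - times[0] > self.window:
            times.popleft()
        # Too many events in the window: mute only this event type.
        if len(times) > self.burst_limit:
            self.muted_until[event_type] = now + self.mute_for
            times.clear()
            return False
        return True

throttle = ErrorThrottle()
# Ten correctable errors arriving 0.1 s apart.
burst = [throttle.should_log("correctable", t / 10) for t in range(10)]
# A different event type during the burst is not affected.
other = throttle.should_log("mchk", 0.55)
# Once the mute period has passed, correctables are logged again.
later = throttle.should_log("correctable", 20.0)
```

In this sketch the first five correctable events are logged, the sixth trips the threshold, and the rest of the burst is dropped, while the unrelated event type keeps logging. If the same burst resumes after the mute period, the cycle repeats, which matches the on/off pattern described above.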
Carrying on from above...

If you suspect memory errors and memexer, sys_exer and SEA don't pull anything out, then going by your original post the system has four 512MB DIMMs. Remove bank 1 [see the inside cover or the DS10 service guide for the location] for a few days/weeks and see what happens, assuming the customer is OK with this. If nothing happens, then remove the original bank 0 DIMMs and put the other ones in their place {bank 0} for a few days/weeks.

If one of the memory banks fails, then that is probably where your problem is. If it fails on both sets, then it is probably a main logic board problem, so replace as required. To track down an individual DIMM, then put one from the "good" pair in with one of the others to establish the faulty one.

The main problem with this memory testing is time, and how the customer reacts to you doing this in terms of down time, system crashing etc. If this memory option is not possible, then get four new DIMMs, replace the customer's original ones, and test the customer's original DIMMs in your own DS10. This assumes that you have a DS10 available, as well as enough DIMMs to swap them out in the first place!

Hope this helps...
MEA[T]
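
The swap procedure above amounts to a small decision table, sketched here in Python. The function names, the DIMM labels and the "inconclusive" outcome for the case where neither pair fails are assumptions added for illustration; the reply itself only covers the cases where one or both pairs fail.

```python
def diagnose(pair_a_fails, pair_b_fails):
    """Interpret the result of running each DIMM pair alone in bank 0."""
    if pair_a_fails and pair_b_fails:
        return "main logic board"
    if pair_a_fails:
        return "DIMM pair A"
    if pair_b_fails:
        return "DIMM pair B"
    # Assumption: not covered in the reply; the fault was not reproduced.
    return "inconclusive"

def isolate_dimm(good_dimm, suspect_pair, pair_fails):
    """Pair a known-good DIMM with each suspect in turn; return the one that fails."""
    for suspect in suspect_pair:
        if pair_fails((good_dimm, suspect)):
            return suspect
    return None

# Hypothetical run: DIMM "D" is bad, so any pair containing it fails.
def fake_soak_test(pair):
    return "D" in pair

verdict = diagnose(fake_soak_test(("A", "B")), fake_soak_test(("C", "D")))
faulty = isolate_dimm("A", ("C", "D"), fake_soak_test)
print(verdict, faulty)
```

In practice `pair_fails` stands for days or weeks of running the system on that pair, which is why the reply stresses time and customer tolerance as the real cost of this method.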