hard resetting link exception Emask 0x50 SAct 0x0 SErr 0x4090800 action 0xe frozen

8

The situation is as follows:

A production Debian 7 Linux server with kernel 3.2.0-4-amd64 #1 SMP Debian 3.2.68-1+deb7u2 x86_64 GNU/Linux

Manufacturer: Supermicro Product Name: X10SLL-F Version: 1.02

SATA controller: Intel Corporation Lynx Point 6-port SATA Controller 1 [AHCI mode] (rev 04)

2x SSD, 2x HDD

Every drive supports SATA Rev 3 (6.0 Gb/s):

hdparm -I /dev/sd[a-d]|egrep "Model|speed|Transport"
    Model Number:       TOSHIBA THNSNH128GBST                   
    Transport:          Serial, ATA8-AST, SATA 1.0a, SATA II Extensions, SATA Rev 2.5, SATA Rev 2.6, SATA Rev 3.0
       *    Gen1 signaling speed (1.5Gb/s)
       *    Gen2 signaling speed (3.0Gb/s)
       *    Gen3 signaling speed (6.0Gb/s)
       *    SMART Command Transport (SCT) feature set
    Model Number:       TOSHIBA THNSNH128GBST                   
    Transport:          Serial, ATA8-AST, SATA 1.0a, SATA II Extensions, SATA Rev 2.5, SATA Rev 2.6, SATA Rev 3.0
       *    Gen1 signaling speed (1.5Gb/s)
       *    Gen2 signaling speed (3.0Gb/s)
       *    Gen3 signaling speed (6.0Gb/s)
       *    SMART Command Transport (SCT) feature set
    Model Number:       ST2000VX000-1CU164                      
    Transport:          Serial, SATA Rev 3.0
       *    Gen1 signaling speed (1.5Gb/s)
       *    Gen2 signaling speed (3.0Gb/s)
       *    Gen3 signaling speed (6.0Gb/s)
       *    SMART Command Transport (SCT) feature set
    Model Number:       ST2000VX000-1CU164                      
    Transport:          Serial, SATA Rev 3.0
       *    Gen1 signaling speed (1.5Gb/s)
       *    Gen2 signaling speed (3.0Gb/s)
       *    Gen3 signaling speed (6.0Gb/s)
       *    SMART Command Transport (SCT) feature set

The kernel messages suggest (at least to me) a problem with all 4 drives, which leads me to believe that the SATA controller itself may be at fault.

ata1: exception Emask 0x50 SAct 0x0 SErr 0x4090800 action 0xe frozen
ata1: irq_stat 0x00400040, connection status changed
ata1: SError: { HostInt PHYRdyChg 10B8B DevExch }
ata1: hard resetting link
ata2: exception Emask 0x50 SAct 0x0 SErr 0x4090800 action 0xe frozen
ata2: irq_stat 0x00400040, connection status changed
ata2: SError: { HostInt PHYRdyChg 10B8B DevExch }
ata2: hard resetting link
ata4: exception Emask 0x50 SAct 0x0 SErr 0x4090800 action 0xe frozen
ata4: irq_stat 0x00400040, connection status changed
ata4: SError: { HostInt PHYRdyChg 10B8B DevExch }
ata4: hard resetting link
ata3: exception Emask 0x50 SAct 0x0 SErr 0x4090800 action 0xe frozen
ata3: irq_stat 0x00400040, connection status changed
ata3: SError: { HostInt PHYRdyChg 10B8B DevExch }
ata3: hard resetting link
ata3: SATA link up 1.5 Gbps (SStatus 113 SControl 310)
ata2: SATA link up 1.5 Gbps (SStatus 113 SControl 310)
ata4: SATA link up 1.5 Gbps (SStatus 113 SControl 310)
ata1: SATA link up 1.5 Gbps (SStatus 113 SControl 310)
ata4.00: ACPI cmd f5/00:00:00:00:00:00 (SECURITY FREEZE LOCK) filtered out
ata4.00: ACPI cmd b1/c1:00:00:00:00:00 (DEVICE CONFIGURATION OVERLAY) filtered out
ata2.00: ACPI cmd f5/00:00:00:00:00:00 (SECURITY FREEZE LOCK) filtered out
ata2.00: ACPI cmd b1/c1:00:00:00:00:00 (DEVICE CONFIGURATION OVERLAY) filtered out
ata1.00: ACPI cmd f5/00:00:00:00:00:00 (SECURITY FREEZE LOCK) filtered out
ata1.00: ACPI cmd b1/c1:00:00:00:00:00 (DEVICE CONFIGURATION OVERLAY) filtered out
ata3.00: ACPI cmd f5/00:00:00:00:00:00 (SECURITY FREEZE LOCK) filtered out
ata3.00: ACPI cmd b1/c1:00:00:00:00:00 (DEVICE CONFIGURATION OVERLAY) filtered out
ata2.00: ACPI cmd f5/00:00:00:00:00:00 (SECURITY FREEZE LOCK) filtered out
ata2.00: ACPI cmd b1/c1:00:00:00:00:00 (DEVICE CONFIGURATION OVERLAY) filtered out
ata1.00: ACPI cmd f5/00:00:00:00:00:00 (SECURITY FREEZE LOCK) filtered out
ata1.00: ACPI cmd b1/c1:00:00:00:00:00 (DEVICE CONFIGURATION OVERLAY) filtered out
ata2.00: configured for UDMA/33
ata2: EH complete
ata1.00: configured for UDMA/33
ata1: EH complete
ata3.00: ACPI cmd f5/00:00:00:00:00:00 (SECURITY FREEZE LOCK) filtered out
ata3.00: ACPI cmd b1/c1:00:00:00:00:00 (DEVICE CONFIGURATION OVERLAY) filtered out
ata4.00: ACPI cmd f5/00:00:00:00:00:00 (SECURITY FREEZE LOCK) filtered out
ata4.00: ACPI cmd b1/c1:00:00:00:00:00 (DEVICE CONFIGURATION OVERLAY) filtered out
ata3.00: configured for UDMA/33
ata3: EH complete
ata4.00: configured for UDMA/33
ata4: EH complete
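To make the SErr value in these messages easier to read, here is a minimal sketch that splits it into named flags. The bit names are the labels libata prints in its "SError: { ... }" line; the function name decode_serror is made up for this illustration.

```python
# Bit positions of the SATA SError register, labelled with the short names
# libata uses in its "SError: { ... }" log line.
SERROR_BITS = {
    0: "RecovData",
    1: "RecovComm",
    8: "UnrecovData",
    9: "Persist",
    10: "Proto",
    11: "HostInt",
    16: "PHYRdyChg",
    17: "PHYInt",
    18: "CommWake",
    19: "10B8B",
    20: "Dispar",
    21: "BadCRC",
    22: "Handshk",
    23: "LinkSeq",
    24: "TrStaTrns",
    25: "UnrecFIS",
    26: "DevExch",
}


def decode_serror(value):
    """Return the names of all known bits set in an SError value, lowest bit first."""
    return [name for bit, name in sorted(SERROR_BITS.items()) if value & (1 << bit)]


if __name__ == "__main__":
    # Value taken from the "exception Emask 0x50 ... SErr 0x4090800" lines.
    print(decode_serror(0x4090800))
```

For 0x4090800 this gives HostInt, PHYRdyChg, 10B8B and DevExch, the same four flags the kernel prints for every port. All of them describe the link itself (PHY ready change, 10b/8b decode error, device exchange) rather than a failed read or write command.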

What I have already figured out (or think I have figured out)

The SECURITY FREEZE LOCK and DEVICE CONFIGURATION OVERLAY commands are not relevant to the problem.

While reading about 20 bug reports and a lot of documentation, several of the linked pages suggested disabling NCQ, which I did.

I first did this for one device only. After waiting 1 day to see whether the error recurred, it happened again, so I disabled NCQ for all 4 devices:

echo "1" >/sys/block/sdc/device/queue_depth

No apparent change in the situation.
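Since values written to sysfs do not survive a reboot, it can be worth reading the setting back on all four drives before concluding that disabling NCQ made no difference. A minimal sketch, assuming the drives are sda to sdd as in the hdparm call above; the function names and the sysfs_root parameter are made up for this example.

```python
from pathlib import Path


def read_queue_depths(devices, sysfs_root="/sys/block"):
    """Return {device: queue_depth} for each device whose sysfs attribute exists."""
    depths = {}
    for dev in devices:
        attr = Path(sysfs_root) / dev / "device" / "queue_depth"
        if attr.exists():
            depths[dev] = int(attr.read_text().strip())
    return depths


def ncq_still_enabled(depths):
    """List the devices whose queue depth is above 1, i.e. NCQ is not disabled."""
    return sorted(dev for dev, depth in depths.items() if depth > 1)


if __name__ == "__main__":
    # Device names are an assumption; adjust them to the actual system.
    depths = read_queue_depths(["sda", "sdb", "sdc", "sdd"])
    print(depths)
    print("NCQ still enabled on:", ncq_still_enabled(depths))
```

An empty second list means every drive that was found is running with a queue depth of 1.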

https://ata.wiki.kernel.org/index.php/Libata_error_messages

https://wiki.archlinux.org/index.php/Solid_State_Drives#Resolving_NCQ_errors

Others suggest a bad SATA cable or even an incompatibility between the board and the drives.

However, since the problem either starts on one disk and then spreads to all 4, or hits all 4 devices directly, I am unable to narrow it down any further.
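One way to tell those two cases apart is to compare the timestamps of the "hard resetting link" lines per port. The sketch below assumes the log is read from dmesg or kern.log with kernel timestamps enabled; the function names and the one-second window are arbitrary choices for illustration, not anything libata defines.

```python
import re
from collections import defaultdict

# Matches e.g. "[ 1234.567890] ata1: hard resetting link" (timestamp optional).
RESET_RE = re.compile(
    r"^(?:\[\s*(?P<ts>\d+\.\d+)\]\s*)?(?P<port>ata\d+): hard resetting link"
)


def resets_by_port(log_lines):
    """Map each ATA port to the timestamps (or None) of its hard resets."""
    events = defaultdict(list)
    for line in log_lines:
        match = RESET_RE.match(line.strip())
        if match:
            ts = match.group("ts")
            events[match.group("port")].append(float(ts) if ts else None)
    return dict(events)


def simultaneous(events, window=1.0):
    """True if the first reset of every port falls within `window` seconds."""
    firsts = [s[0] for s in events.values() if s and s[0] is not None]
    if len(firsts) < 2 or len(firsts) != len(events):
        return False  # not enough timestamped data to decide
    return max(firsts) - min(firsts) <= window


if __name__ == "__main__":
    import sys
    events = resets_by_port(sys.stdin)
    print(events)
    print("all ports reset together:", simultaneous(events))
```

If all ports reset within the same instant, that would point at something they share (controller, board, power) rather than at individual cables or drives. That is only a heuristic, not proof.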

Since this is a production server, it is possible to take it down for maintenance (i.e. for BIOS / kernel parameter changes), but I would like to avoid that if possible.

According to the hoster, this might be related to power management:

https://bugzilla.kernel.org/show_bug.cgi?id=74961

https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1318218

echo "medium_power" >/sys/class/scsi_host/host0/link_power_management_policy

Before the change, this was set to max_performance.

That did not help either.
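The command above only touches host0, and each SATA port normally has its own entry under /sys/class/scsi_host. A small sketch that lists the current policy for every host, so it is clear which ports were actually changed; the function name and the sysfs_root parameter are assumptions made for this example.

```python
from pathlib import Path


def read_lpm_policies(sysfs_root="/sys/class/scsi_host"):
    """Return {host name: link power management policy} for every host that has one."""
    policies = {}
    pattern = "host*/link_power_management_policy"
    for attr in sorted(Path(sysfs_root).glob(pattern)):
        policies[attr.parent.name] = attr.read_text().strip()
    return policies


if __name__ == "__main__":
    for host, policy in read_lpm_policies().items():
        print(host, policy)
```

Hosts that do not expose the attribute are simply skipped.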

The SMART values of the HDDs / SSDs are OK, nothing obviously wrong.

Note that the UDMA mode now seems to be only 33.

When the server booted, these were the SATA link speeds:

[    3.161850] ata6: SATA link down (SStatus 0 SControl 300)
[    3.161867] ata2: SATA link up 6.0 Gbps (SStatus 133 SControl 300)
[    3.161882] ata4: SATA link up 3.0 Gbps (SStatus 123 SControl 300)
[    3.161894] ata1: SATA link up 6.0 Gbps (SStatus 133 SControl 300)
[    3.161907] ata5: SATA link down (SStatus 0 SControl 300)
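The SStatus and SControl numbers in these lines are printed as bare hex and can be split into fields. A minimal sketch follows, using the standard SATA register layout (DET in bits 0-3, SPD in bits 4-7, IPM in bits 8-11); the function and key names are invented for this example.

```python
# Speed generation codes used in the SPD field.
SPEEDS = {0: "none / no limit", 1: "1.5 Gbps", 2: "3.0 Gbps", 3: "6.0 Gbps"}


def decode_sstatus(text):
    """Split an SStatus value, given as the hex string the kernel prints."""
    value = int(text, 16)
    return {
        "det": value & 0xF,         # 3 = device present, PHY link established
        "spd": (value >> 4) & 0xF,  # negotiated speed generation
        "ipm": (value >> 8) & 0xF,  # 1 = interface in active state
    }


def decode_scontrol(text):
    """Split an SControl value, given as the hex string the kernel prints."""
    value = int(text, 16)
    return {
        "det": value & 0xF,
        "spd_limit": (value >> 4) & 0xF,  # 0 = no limit, 1 = limited to Gen1
        "ipm": (value >> 8) & 0xF,        # 3 = partial and slumber disabled
    }


if __name__ == "__main__":
    for sstatus, scontrol in (("133", "300"), ("113", "310")):
        status, control = decode_sstatus(sstatus), decode_scontrol(scontrol)
        print(sstatus, scontrol, SPEEDS[status["spd"]], SPEEDS[control["spd_limit"]])
```

Read this way, "SStatus 133 SControl 300" is a 6.0 Gbps link with no speed limit, while "SStatus 113 SControl 310" is a 1.5 Gbps link with the allowed speed capped at Gen1. In other words the cap is being set on the host side after the resets; the drives did not suddenly stop supporting higher speeds.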

The problem might occur only under high load on the HDDs, but I have not tested that yet, as it would obviously affect the server's performance.

There is no load on the SSDs; they are mounted but not used by any process.

The RAM is ECC as far as I know.

dmidecode -t 17
# dmidecode 2.11
SMBIOS 2.7 present.

Handle 0x0023, DMI type 17, 34 bytes
Memory Device
    Array Handle: 0x0022
    Error Information Handle: Not Provided
    Total Width: 72 bits
    Data Width: 64 bits
    Size: 8192 MB
    Form Factor: DIMM
    Set: None
    Locator: P1-DIMMA1
    Bank Locator: P0_Node0_Channel0_Dimm0
    Type: DDR3
    Type Detail: Synchronous
    Speed: 1600 MHz
    Manufacturer: Samsung
    Serial Number: 373A6427
    Asset Tag: 9876543210
    Part Number: M391B1G73QH0-CK0  
    Rank: 2
    Configured Clock Speed: 1600 MHz

Please let me know if I can provide any further information, as I am running out of ideas about what to do next.

— Dennis Nolte

I am asking the vendor, Supermicro, directly; they may be able to help if the hoster cannot.

— Dennis Nolte

1

Note that the system is renegotiating down to 1.5 Gbps. Try forcing 1.5 Gbps and see whether that makes the system stable. It's a data point. See askubuntu.com/a/146290/11751 for a brief write-up on how to do that.
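For reference, the usual way to force a link speed is the libata.force kernel parameter, which takes a comma-separated list of [ID:]VAL entries where ID is the ATA port number from the log. The sketch below only builds that string. The helper name is made up, and choosing ports 1 to 4 assumes the four drives sit on ata1 to ata4 as in the log above.

```python
def libata_force_param(ports, setting="1.5Gbps"):
    """Build a libata.force= kernel parameter applying `setting` to the given ports.

    With an empty port list no ID is written, which libata treats as
    "apply to all ports".
    """
    if not ports:
        return "libata.force=" + setting
    return "libata.force=" + ",".join(f"{port}:{setting}" for port in ports)


if __name__ == "__main__":
    # Ports 1-4 correspond to ata1..ata4 in the kernel messages.
    print(libata_force_param([1, 2, 3, 4]))
```

The resulting string goes on the kernel command line (on Debian typically via GRUB_CMDLINE_LINUX in /etc/default/grub followed by update-grub), so it only takes effect after a reboot.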

— a CVn

4

What is happening on your server is basically a SATA renegotiation to a lower link speed after some communication problems with the drives.

These factors may be at work here (ordered by probability):

Very high-latency I/O operations (e.g. caused by the SSD controller's garbage collection) resulting in SATA command timeouts. Does the drive support the SATA TRIM command? If so, try running fstrim /. Does anything change?
Faulty motherboard / memory: is your memory ECC-protected? If not, and if possible, run an extended (2+ hours) memtest86+ session.
Hardware / software driver incompatibility.
Faulty SATA controller: although quite unlikely, you cannot rule it out completely.
Faulty SATA cables / drives: since all four drives are giving problems, this is very unlikely.

— shodanshok

The SSDs are not currently in use, and ECC appears to be in use. From dmidecode -t 17: Total Width: 72 bits, Data Width: 64 bits
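The reasoning here is that a module with ECC carries 8 extra check bits on top of the 64 data bits. A minimal sketch that pulls this difference out of dmidecode -t 17 output; the function name is hypothetical, and the text is expected on standard input or as a string.

```python
import re


def ecc_check_bits(dmidecode_text):
    """Return Total Width minus Data Width for each memory device reporting both."""
    results = []
    total = None
    for line in dmidecode_text.splitlines():
        line = line.strip()
        match = re.match(r"Total Width: (\d+) bits", line)
        if match:
            total = int(match.group(1))
            continue
        match = re.match(r"Data Width: (\d+) bits", line)
        if match and total is not None:
            results.append(total - int(match.group(1)))
            total = None
    return results


if __name__ == "__main__":
    import sys
    # Example: dmidecode -t 17 | python3 this_script.py
    print(ecc_check_bits(sys.stdin.read()))
```

A result of 8 per module matches the 72/64 widths quoted above. This only shows that the DIMM has the extra bits, not that the board is actually correcting errors with them; the Error Correction Type field in dmidecode -t 16 is the place to check that.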

— Dennis Nolte

3

According to Supermicro support, the fault lies in the board:

Quote:

This board may need ECO 16238 update.

— Dennis Nolte