tidbits

PEBS disabled due to CPU errata

I noticed this in /var/log/messages on my freshly installed CentOS 7.5 system.

May 24 09:46:43 localhost kernel: smpboot: CPU0: Intel(R) Xeon(R) CPU E31220 @ 3.10GHz (fam: 06, model: 2a, stepping: 07)
May 24 09:46:43 localhost kernel: Performance Events: PEBS fmt1+, 16-deep LBR, SandyBridge events, full-width counters, Intel PMU driver.
May 24 09:46:43 localhost kernel: core: PEBS disabled due to CPU errata, please upgrade microcode
May 24 09:46:43 localhost kernel: ... version:                3
May 24 09:46:43 localhost kernel: ... bit width:              48
May 24 09:46:43 localhost kernel: ... generic registers:      8
May 24 09:46:43 localhost kernel: ... value mask:             0000ffffffffffff
May 24 09:46:43 localhost kernel: ... max period:             00007fffffffffff
May 24 09:46:43 localhost kernel: ... fixed-purpose events:   3
May 24 09:46:43 localhost kernel: ... event mask:             00000007000000ff

Red Hat has this to say about it (ref: https://access.redhat.com/solutions/634443):

Root Cause

  • Clovertown and SandyBridge processors have errata regarding PEBS functionality.

Diagnostic Steps

  • Look at /proc/cpuinfo for model number 15, 42 or 45

So I checked:

processor       : 3
vendor_id       : GenuineIntel
cpu family      : 6
model           : 42
model name      : Intel(R) Xeon(R) CPU E31220 @ 3.10GHz
stepping        : 7
microcode       : 0x29
cpu MHz         : 1599.951
cache size      : 8192 KB
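
The same check can be scripted. This is a minimal sketch, assuming /proc/cpuinfo-style text as input; the function names are my own, and the restriction to CPU family 6 is an assumption on top of the three model numbers the Red Hat article lists.

```python
# Model numbers listed in the Red Hat article's diagnostic steps.
AFFECTED_MODELS = {15, 42, 45}


def parse_cpuinfo(text):
    """Split /proc/cpuinfo-style text into one dict per processor block."""
    cpus, current = [], {}
    for line in text.splitlines():
        if not line.strip():
            # A blank line ends the current processor block.
            if current:
                cpus.append(current)
                current = {}
            continue
        key, _, value = line.partition(":")
        current[key.strip()] = value.strip()
    if current:
        cpus.append(current)
    return cpus


def affected_cpus(text):
    """Return (processor, model, microcode) for each CPU with a listed model."""
    hits = []
    for cpu in parse_cpuinfo(text):
        # Assumption: only family 6 parts are relevant to this erratum.
        if cpu.get("cpu family") != "6":
            continue
        model = int(cpu.get("model", -1))
        if model in AFFECTED_MODELS:
            hits.append((cpu.get("processor"), model, cpu.get("microcode")))
    return hits


SAMPLE = """processor       : 3
vendor_id       : GenuineIntel
cpu family      : 6
model           : 42
model name      : Intel(R) Xeon(R) CPU E31220 @ 3.10GHz
stepping        : 7
microcode       : 0x29
"""

for processor, model, microcode in affected_cpus(SAMPLE):
    print("processor %s: model %d, microcode %s" % (processor, model, microcode))
```

On a live system the text would come from reading /proc/cpuinfo instead of SAMPLE.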

Install a few packages

# yum install microcode_ctl.x86_64
# yum install iucode-tool
# reboot

After the reboot I see this in /var/log/messages:

May 26 18:55:50 wine kernel: microcode: microcode updated early to revision 0x2d, date = 2018-02-07

Confirmation that the processor has the microcode patch applied:

processor       : 3
vendor_id       : GenuineIntel
cpu family      : 6
model           : 42
model name      : Intel(R) Xeon(R) CPU E31220 @ 3.10GHz
stepping        : 7
microcode       : 0x2d

The microcode is being updated as the system loads.
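
To check this on more than one machine, the revision can be pulled out of the kernel log line. This is a rough sketch: the regular expression is written against the single message shown above, so it is an assumption that other kernels word it the same way.

```python
import re

# Pattern modelled on the one log line above; other kernels may word it differently.
UPDATE_RE = re.compile(
    r"microcode updated early to revision (0x[0-9a-fA-F]+), "
    r"date = (\d{4}-\d{2}-\d{2})"
)


def microcode_update(log_line):
    """Return (revision, date) from a kernel microcode message, or None."""
    match = UPDATE_RE.search(log_line)
    if match is None:
        return None
    return int(match.group(1), 16), match.group(2)


LINE = ("May 26 18:55:50 wine kernel: microcode: microcode updated early "
        "to revision 0x2d, date = 2018-02-07")
OLD_REVISION = 0x29  # the value /proc/cpuinfo showed before the update

revision, date = microcode_update(LINE)
print("revision 0x%x (%s), newer than 0x%x: %s"
      % (revision, date, OLD_REVISION, revision > OLD_REVISION))
```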

2018/05/26 19:04

Laptop Hard Reboot

Model: HP EliteBook 745 G3

For a number of months my laptop has done what I can only describe as a hard reboot. For no apparent reason it power cycles. No Windows shutdown message, no warning; it's as if somebody has come along and pulled the power cord or held down a reset button.

As this is the 2nd time it's happened in less than 2 weeks, this is really starting to cause me problems as I'm losing work. You can see the last event 4/24/2017 on the reliability history report below.

Clicking on "View technical details" reveals this snippet of information:

The computer has rebooted from a bugcheck.  
The bugcheck was: 0x00000124 (0x0000000000000000, 0xffffe000f3f6c838, 0x0000000000000000, 0x0000000000000000). 
A dump was saved in: C:\WINDOWS\Minidump\050517-14718-01.dmp. Report Id: e55192e8-8d9d-45c1-8c03-e14e66640510.

Well, at least Windows agrees with me that it was not shut down correctly. Let's see if we can find out why.

We are going to need some additional Windows tools to read that minidump.

I've been keeping a log, but the minidump directory has these occasions time stamped for me. As you can see, this is the 8th time this has happened.

After installing the WDK we need to fire up WinDbg, which can be found here.

Pulling the 0505 minidump into WinDbg, this is what it tells us:

Microsoft (R) Windows Debugger Version 10.0.15063.0 AMD64
Copyright (c) Microsoft Corporation. All rights reserved.


Loading Dump File [C:\Windows\Minidump\050517-14718-01.dmp]
Mini Kernel Dump File: Only registers and stack trace are available

Symbol search path is: srv*
Executable search path is: 
Windows 10 Kernel Version 10586 MP (4 procs) Free x64
Product: WinNt, suite: TerminalServer SingleUserTS
Built by: 10586.839.amd64fre.th2_release.170303-1605
Machine Name:
Kernel base = 0xfffff803`d0218000 PsLoadedModuleList = 0xfffff803`d04f5c90
Debug session time: Fri May  5 13:51:37.032 2017 (UTC - 4:00)
System Uptime: 0 days 0:00:02.748
Loading Kernel Symbols
..

Press ctrl-c (cdb, kd, ntsd) or ctrl-break (windbg) to abort symbol loads that take too long.
Run !sym noisy before .reload to track down problems loading symbols.

.............................................................
..
Loading User Symbols
Mini Kernel Dump does not contain unloaded driver list
*******************************************************************************
*                                                                             *
*                        Bugcheck Analysis                                    *
*                                                                             *
*******************************************************************************

Use !analyze -v to get detailed debugging information.
BugCheck 124, {0, ffffe000f3f6c838, 0, 0}
Probably caused by : AuthenticAMD
Followup:     MachineOwner
---------
2: kd> !analyze -v
*******************************************************************************
*                                                                             *
*                        Bugcheck Analysis                                    *
*                                                                             *
*******************************************************************************

WHEA_UNCORRECTABLE_ERROR (124)
A fatal hardware error has occurred. Parameter 1 identifies the type of error
source that reported the error. Parameter 2 holds the address of the
WHEA_ERROR_RECORD structure that describes the error conditon.
Arguments:
Arg1: 0000000000000000, Machine Check Exception
Arg2: ffffe000f3f6c838, Address of the WHEA_ERROR_RECORD structure.
Arg3: 0000000000000000, High order 32-bits of the MCi_STATUS value.
Arg4: 0000000000000000, Low order 32-bits of the MCi_STATUS value.

Debugging Details:
------------------

DUMP_CLASS: 1
DUMP_QUALIFIER: 400
BUILD_VERSION_STRING:  10.0.10586.839 (th2_release.170303-1605)
DUMP_TYPE:  2
BUGCHECK_P1: 0
BUGCHECK_P2: ffffe000f3f6c838
BUGCHECK_P3: 0
BUGCHECK_P4: 0
BUGCHECK_STR:  0x124_AuthenticAMD
CPU_COUNT: 4
CPU_MHZ: 705
CPU_VENDOR:  AuthenticAMD
CPU_FAMILY: 15
CPU_MODEL: 60
CPU_STEPPING: 1
CUSTOMER_CRASH_COUNT:  1
DEFAULT_BUCKET_ID:  WIN8_DRIVER_FAULT
PROCESS_NAME:  System
CURRENT_IRQL:  0
ANALYSIS_SESSION_HOST:  ENGLAND1
ANALYSIS_SESSION_TIME:  05-05-2017 14:34:37.0456
ANALYSIS_VERSION: 10.0.15063.0 amd64fre

STACK_TEXT:  
ffffd000`ab3245b0 fffff803`d05c77cd : 00000000`00000000 ffffe000`f3f6c810 fffff803`d04e96a0 fffff803`d05aa340 : nt!WheapCreateLiveTriageDump+0x81
ffffd000`ab324ae0 fffff803`d0428c94 : ffffe000`f3f6c810 ffffe000`f3f73030 ffffd000`ab324af8 00000000`00000000 : nt!WheapCreateTriageDumpFromPreviousSession+0x2d
ffffd000`ab324b10 fffff803`d0429dd9 : fffff803`d04e9640 fffff803`d04e9640 fffff803`d04e96a0 fffff803`d028d710 : nt!WheapProcessWorkQueueItem+0x48
ffffd000`ab324b50 fffff803`d025dcf9 : fffff803`d05aa200 ffffe000`f3bb8040 fffff803`00000000 ffffe000`f42fca48 : nt!WheapWorkQueueWorkerRoutine+0x25
ffffd000`ab324b80 fffff803`d02cd9b5 : 00000205`b4bbbdff 00000000`00000080 ffffe000`f2427680 ffffe000`f3bb8040 : nt!ExpWorkerThread+0xe9
ffffd000`ab324c10 fffff803`d035fae6 : fffff803`d0534180 ffffe000`f3bb8040 fffff803`d02cd974 00000000`00000000 : nt!PspSystemThreadStartup+0x41
ffffd000`ab324c60 00000000`00000000 : ffffd000`ab325000 ffffd000`ab31f000 00000000`00000000 00000000`00000000 : nt!KiStartSystemThread+0x16


STACK_COMMAND:  kb
THREAD_SHA1_HASH_MOD_FUNC:  26acd050bd9f055d0a04825d57b9e0e6be9c1a07
THREAD_SHA1_HASH_MOD_FUNC_OFFSET:  5e1e1a155874296ef3d407b143c830e84a016e94
THREAD_SHA1_HASH_MOD:  30a3e915496deaace47137d5b90c3ecc03746bf6
FOLLOWUP_NAME:  MachineOwner
MODULE_NAME: AuthenticAMD
IMAGE_NAME:  AuthenticAMD
DEBUG_FLR_IMAGE_TIMESTAMP:  0
FAILURE_BUCKET_ID:  0x124_AuthenticAMD_PROCESSOR_BUS_PRV
BUCKET_ID:  0x124_AuthenticAMD_PROCESSOR_BUS_PRV
PRIMARY_PROBLEM_CLASS:  0x124_AuthenticAMD_PROCESSOR_BUS_PRV
TARGET_TIME:  2017-05-05T17:51:37.000Z
OSBUILD:  10586
OSSERVICEPACK:  839
SERVICEPACK_NUMBER: 0
OS_REVISION: 0
SUITE_MASK:  272
PRODUCT_TYPE:  1
OSPLATFORM_TYPE:  x64
OSNAME:  Windows 10
OSEDITION:  Windows 10 WinNt TerminalServer SingleUserTS
OS_LOCALE:  
USER_LCID:  0
OSBUILD_TIMESTAMP:  2017-03-03 23:13:02
BUILDDATESTAMP_STR:  170303-1605
BUILDLAB_STR:  th2_release
BUILDOSVER_STR:  10.0.10586.839
ANALYSIS_SESSION_ELAPSED_TIME:  b84
ANALYSIS_SOURCE:  KM
FAILURE_ID_HASH_STRING:  km:0x124_authenticamd_processor_bus_prv
FAILURE_ID_HASH:  {6fd7875b-9a1b-9e09-d6d6-816026a875c8}

Followup:     MachineOwner
---------

Decoding that ARG2 from the WHEA_UNCORRECTABLE_ERROR (124)

2: kd> !errrec ffffe000f3f6c838
===============================================================================
Common Platform Error Record @ ffffe000f3f6c838
-------------------------------------------------------------------------------
Record Id     : 01d2c5c83ca54560
Severity      : Fatal (1)
Length        : 928
Creator       : Microsoft
Notify Type   : Machine Check Exception
Timestamp     : 5/5/2017 17:51:37 (UTC)
Flags         : 0x00000002 PreviousError

===============================================================================
Section 0     : Processor Generic
-------------------------------------------------------------------------------
Descriptor    @ ffffe000f3f6c8b8
Section       @ ffffe000f3f6c990
Offset        : 344
Length        : 192
Flags         : 0x00000001 Primary
Severity      : Fatal

Proc. Type    : x86/x64
Instr. Set    : x64
Error Type    : BUS error
Operation     : Generic
Flags         : 0x00
Level         : 3
CPU Version   : 0x0000000000660f01
Processor ID  : 0x0000000000000000

===============================================================================
Section 1     : x86/x64 Processor Specific
-------------------------------------------------------------------------------
Descriptor    @ ffffe000f3f6c900
Section       @ ffffe000f3f6ca50
Offset        : 536
Length        : 128
Flags         : 0x00000000
Severity      : Fatal

Local APIC Id : 0x0000000000000000
CPU Id        : 01 0f 66 00 00 08 04 00 - 0b 32 d8 7e ff fb 8b 17
                00 00 00 00 00 00 00 00 - 00 00 00 00 00 00 00 00
                00 00 00 00 00 00 00 00 - 00 00 00 00 00 00 00 00

Proc. Info 0  @ ffffe000f3f6ca50

===============================================================================
Section 2     : x86/x64 MCA
-------------------------------------------------------------------------------
Descriptor    @ ffffe000f3f6c948
Section       @ ffffe000f3f6cad0
Offset        : 664
Length        : 264
Flags         : 0x00000000
Severity      : Fatal

Error         : BUSLG_OBS_ERR_*_NOTIMEOUT_ERR (Proc 0 Bank 4)
  Status      : 0xfa000010000b0c0f
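
The "CPU Version" value in Section 0 is the CPUID signature, and decoding it explains the CPU_FAMILY 15 / CPU_MODEL 60 pair that !analyze printed earlier: those appear to be hexadecimal. This is a small sketch using the standard CPUID family/model/stepping layout as I understand it; the function name is mine.

```python
def decode_cpuid_signature(eax):
    """Decode a CPUID signature into (family, model, stepping)."""
    stepping = eax & 0xF
    base_model = (eax >> 4) & 0xF
    base_family = (eax >> 8) & 0xF
    ext_model = (eax >> 16) & 0xF
    ext_family = (eax >> 20) & 0xFF

    # The extended family is only added when the base family is 0xF.
    family = base_family + ext_family if base_family == 0xF else base_family
    # The extended model is only used for base family 6 or 0xF.
    if base_family in (0x6, 0xF):
        model = (ext_model << 4) | base_model
    else:
        model = base_model
    return family, model, stepping


family, model, stepping = decode_cpuid_signature(0x0000000000660F01)
print("family 0x%x, model 0x%x, stepping %d" % (family, model, stepping))
```

For 0x660f01 this gives family 0x15, model 0x60 and stepping 1, which lines up with the values in the dump.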

REF

https://answers.microsoft.com/en-us/windows/forum/windows_7-performance/help-windows-7-bsod-system-service-exception/7f165f52-d13b-4c1f-8160-f8483727c874?page=2

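Going by that thread, my options are either:

  • Your RAM is faulty (Bank 4 = 4th DIMM slot). Run Memtest for NO LESS than ~8 passes (several hours).
  • Your motherboard is faulty, and will need to be replaced.

(The "Bank 4" in the !errrec output is a machine-check register bank on the CPU, so the DIMM-slot reading in that thread may not be right.) Either way it's a hardware problem. Others are reporting the same.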

Digging deeper: ref https://davidcmoisan.wordpress.com/2010/07/01/bad-hardware-day-more-on-hardware-bluescreens/

2: kd> .formats 0xfa000010000b0c0f
Evaluate expression:
  Hex:     fa000010`000b0c0f
  Decimal: -432345495507366897
  Octal:   1750000001000002606017
  Binary:  11111010 00000000 00000000 00010000 00000000 00001011 00001100 00001111
  Chars:   ........
  Time:    ***** Invalid FILETIME
  Float:   low 1.01452e-039 high -1.66154e+035
  Double:  -4.53808e+279
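
Rather than reading the bits off the binary dump by eye, the status word can be split into fields. This is a sketch under assumptions: the flag positions (bits 63 down to 57) and the bus error code layout "0000 1PPT RRRR IILL" follow the x86 machine-check architecture as I understand it, and the vendor-specific bits in between are deliberately left undecoded.

```python
# Architectural flag bits at the top of an MCi_STATUS value.
STATUS_FLAGS = [
    (63, "VAL"), (62, "OVER"), (61, "UC"), (60, "EN"),
    (59, "MISCV"), (58, "ADDRV"), (57, "PCC"),
]

LEVEL = ["L0", "L1", "L2", "LG"]
PARTICIPATION = ["SRC", "RES", "OBS", "GEN"]
REQUEST = {0: "ERR", 1: "RD", 2: "WR", 3: "DRD", 4: "DWR",
           5: "IRD", 6: "PREFETCH", 7: "EVICT", 8: "SNOOP"}
TRANSACTION = ["MEM", "reserved", "IO", "GEN"]


def decode_status(status):
    """Split an MCi_STATUS value into flag names and, if present, bus error fields."""
    flags = [name for bit, name in STATUS_FLAGS if (status >> bit) & 1]
    code = status & 0xFFFF
    result = {"flags": flags, "error_code": code, "bus": None}
    # Bus/interconnect errors have the form 0000 1PPT RRRR IILL.
    if code & 0xF800 == 0x0800:
        request = (code >> 4) & 0xF
        result["bus"] = {
            "participation": PARTICIPATION[(code >> 9) & 0x3],
            "timeout": bool((code >> 8) & 0x1),
            "request": REQUEST.get(request, "reserved"),
            "transaction": TRANSACTION[(code >> 2) & 0x3],
            "level": LEVEL[code & 0x3],
        }
    return result


decoded = decode_status(0xFA000010000B0C0F)
print(decoded["flags"])
print(decoded["bus"])
```

The decoded fields (level LG, participation OBS, request ERR, no timeout) match the BUSLG_OBS_ERR_*_NOTIMEOUT_ERR string that !errrec printed, and the UC and PCC flags are consistent with the error being fatal.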

I wonder if this is heat related, as I had the cooling set to Passive because the fan was noisy. Probably not; that would imply that this laptop could never run on battery! I will change it back to Active and see if that helps, and I'll put up with the fan spinning for a while.

That did not help. I'm still suffering random reboots; the latest happened 16-Jun-2017.

This machine is being returned. I cannot tolerate a computer randomly rebooting.

2017/06/22 12:06

Windows patches and SHA1

Windows Update download URLs contain a SHA1 checksum as part of the file name:

http://www.download.windowsupdate.com/msdownload/update/v3-19990518/cabpool/windowsserver2003-kb824141-x86-enu_90853a52ea80f7da3c5460ef102ade3.exe

You can download the file and then use the SHA1 checksum from the URL itself to validate that the file downloaded correctly. Sounds like a good idea. It is, until MS screws up the SHA1 in the URL.

# openssl sha1 windowsserver2003-kb824141-x86-enu_90853a52ea80f7da3c5460ef102ade3.exe
SHA1(windowsserver2003-kb824141-x86-enu_90853a52ea80f7da3c5460ef102ade3.exe)= bfa8072aa29dbe552f952cdb42b1f635072ae081
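
The comparison is easy to automate with hashlib. This is a minimal sketch, assuming the digest is the hex string between the last underscore and the file extension, which is what the example URL suggests; the helper names are mine.

```python
import hashlib
import os
import re

# Assumption: the digest sits between the last "_" and the extension.
DIGEST_RE = re.compile(r"_([0-9a-f]+)\.[^.]+$")


def digest_from_name(filename):
    """Return the hex string embedded in the file name, or None."""
    match = DIGEST_RE.search(os.path.basename(filename).lower())
    return match.group(1) if match else None


def sha1_of_file(path, chunk_size=1 << 20):
    """Compute the SHA1 of a file without reading it into memory at once."""
    digest = hashlib.sha1()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_name(path):
    """True if the computed SHA1 equals the digest embedded in the name."""
    expected = digest_from_name(path)
    return expected is not None and sha1_of_file(path) == expected


print(digest_from_name(
    "windowsserver2003-kb824141-x86-enu_90853a52ea80f7da3c5460ef102ade3.exe"))
```

Running matches_name() over a download directory would give a list like the one that follows.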

This is a list of the filenames I've discovered where the SHA1 in the file name does not match the computed one:

['windowsserver2003-kb824141-x86-enu_90853a52ea80f7da3c5460ef102ade3.exe',
 'msjavwu_8073687b82d41db93f4c2a04af2b34d.exe',
 'windowsserver2003-kb835732-x86-enu_9c2348f833ade0cca439ec6b2a92179.exe',
 'windowsmedia9-kb819639-x86-enu_57af369562f19dc35e69681660521fb.exe',
 'windowsserver2003-kb828741-x86-enu_1e3156bf5ec0354f542c38f309bab49.exe',
 'windowsserver2003-kb819696-x86-enu_41cdc8619ebb756106ea383c055530d.exe',
 'windowsserver2003-kb825119-x86-enu_329e94ea193be4c2d2f8d9bfc4daf23.exe',
 'windowsserver2003-kb840374-x86-enu_eeafbc20c2402b1c951d155d3d2cb9c.exe',
 'windowsserver2003-kb837001-x86-enu_0a248bb59a71c52a288c837779ac98e.exe',
 'windowsserver2003-kb823980-x86-enu_7f97e0d2355f670acb9384ad0933515.exe',
 'windowsserver2003-kb824146-x86-enu_f759bdcfdc906b0b35ad697a29ed1a1.exe',
 'windowsserver2003-kb823559-x86-enu_d8d3b25c5678c692e29cf971a6c38fa.exe',
 'windowsserver2003-kb824105-x86-enu_c7fd830ee6b1c3bb594be4f7a61f43c.exe',
 'windowsserver2003-kb828028-x86-enu_52dce385c001ce81c2514c3fb1cac7e.exe',
 'windowsserver2003-kb828035-x86-enu_d1df77e311740d6c012bcda5a7f821f.exe',
 'directx9-kb819696-x86-enu_977f8cc86c1e151a0168d1296210913.exe',
 'windowsserver2003-kb830352-x86-enu_d67acb6c784dd87961c8070943dadd8.exe',
 'sql2000-kb815495-8.00.0818-enu_4c77bb3f492fb1670b90b477d674e7e.exe',
 'windowsserver2003-kb823182-x86-enu_c7ee6a3716815554656d98ed9bc85d5.exe',
 'windowsxp-kb883939-x64-enu_9e1efe32675530155c34f7af1172a6d496e1e5ee.exe',
 'ndp10_sp_q321884_en_0fc8b14a073e01a03c27c948d254feedaa79feae.exe']
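
One quick way to triage a list like this without the files at hand: a SHA1 digest is 40 hex characters, so an embedded string of any other length cannot match whatever the file contains. This sketch uses four names copied from the list above; the regular expression is my assumption about where the digest sits.

```python
import re

SHA1_HEX_LEN = 40
DIGEST_RE = re.compile(r"_([0-9a-f]+)\.exe$")

# Four names copied from the list above.
NAMES = [
    "windowsserver2003-kb824141-x86-enu_90853a52ea80f7da3c5460ef102ade3.exe",
    "msjavwu_8073687b82d41db93f4c2a04af2b34d.exe",
    "windowsxp-kb883939-x64-enu_9e1efe32675530155c34f7af1172a6d496e1e5ee.exe",
    "ndp10_sp_q321884_en_0fc8b14a073e01a03c27c948d254feedaa79feae.exe",
]


def digest_length(name):
    """Length of the hex string embedded in the name (0 if none is found)."""
    match = DIGEST_RE.search(name)
    return len(match.group(1)) if match else 0


for name in NAMES:
    length = digest_length(name)
    verdict = "ok" if length == SHA1_HEX_LEN else "wrong length for SHA1"
    print("%2d  %-22s %s" % (length, verdict, name))
```

The first two names carry 31 characters and the last two carry 40, so a short digest may explain some of the entries but not all of them.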
2015/05/19 12:07

Python decompress PACK_MAGIC

A file compressed with the pack format starts with the magic bytes \037\036 in octal, or 0x1f 0x1e in hex.

gzip can decode this, as can the pcat program. As an exercise I converted the unpack.c module in gzip into its Python equivalent.
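
For reference, detecting the format from the first two bytes might look like the sketch below. The magic values are the ones I recall from gzip's gzip.h (PACK_MAGIC, GZIP_MAGIC and LZW_MAGIC); the function and table names are mine.

```python
# Two-byte magic numbers as defined in gzip's gzip.h.
MAGIC = {
    b"\x1f\x1e": "pack",      # PACK_MAGIC  "\037\036"
    b"\x1f\x8b": "gzip",      # GZIP_MAGIC  "\037\213"
    b"\x1f\x9d": "compress",  # LZW_MAGIC   "\037\235"
}


def identify(header):
    """Name the compression format from the first two bytes of a file."""
    return MAGIC.get(bytes(header[:2]), "unknown")


print(identify(b"\037\036" + b"rest of a packed file"))
```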

The slowest part of the code is the look_bits function, and this is where you can see how slowly an interpreted language grinds along compared to C.

Using the excellent line profiler: https://pypi.python.org/pypi/line_profiler/

Timer unit: 1e-06 s = 1uS

Total time: 43.3399 s
File: unpack.py
Function: look_bits at line 36

Line #      Hits         Time  Per Hit   % Time  Line Contents
==============================================================
    36                                               @profile
    37                                               def look_bits(self,bits,mask):
    38    351442       265280      0.8      0.6          while(self.valid < bits):
    39    140575     10361348     73.7     23.9              self.bitbuf <<= 8
    40    140575     14576495    103.7     33.6              self.bitbuf |= next(self.get_byte)
    41    140575       189102      1.3      0.4              self.valid += 8
    42    210867     17947709     85.1     41.4          return (self.bitbuf >> (self.valid - bits)) & mask
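
For context, here is a self-contained sketch of the bit reader that look_bits belongs to. Only look_bits comes from the listing above; the class name, the constructor and skip_bits are my reconstruction, modelled on the macros in gzip's unpack.c. One guess at the cost of the shift on line 39: a Python int never overflows, so if the consumed bits are never dropped, bitbuf keeps growing and every shift gets slower. Masking in skip_bits keeps it small; whether that is the real bottleneck is an assumption I have not measured.

```python
class BitReader(object):
    """Most-significant-bit-first reader over a byte string."""

    def __init__(self, data):
        # bytearray yields ints on both Python 2 and Python 3.
        self.get_byte = iter(bytearray(data))
        self.bitbuf = 0
        self.valid = 0  # number of unread bits held in bitbuf

    def look_bits(self, bits, mask):
        """Peek at the next `bits` bits without consuming them."""
        while self.valid < bits:
            self.bitbuf = (self.bitbuf << 8) | next(self.get_byte)
            self.valid += 8
        return (self.bitbuf >> (self.valid - bits)) & mask

    def skip_bits(self, bits):
        """Consume `bits` bits and drop them so bitbuf stays small."""
        self.valid -= bits
        self.bitbuf &= (1 << self.valid) - 1


reader = BitReader(b"\xa5\x0f")
first = reader.look_bits(4, 0xF)    # top nibble of 0xa5
reader.skip_bits(4)
second = reader.look_bits(8, 0xFF)  # low nibble of 0xa5 followed by the top nibble of 0x0f
print(hex(first), hex(second))
```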

unpack.zip

2014/10/24 11:09

vCenter VM monitoring with Graphite

This was done on a CentOS 6.3 server; your mileage may vary on another platform.

The problem I was trying to solve was that I wanted to monitor the vitals of all my VMs without having to install collectd into each VM. By talking to vCenter we can pull out everything that we need. This work was inspired by collectd-vcenter, which I could not get to work as it assumes you have an ESX cluster. So I wrote my own.

See details on setting up Graphite, which I used for my xAP monitoring.

We don't use the collectd RPM from the EPEL repository as it's too old and does not have the write_graphite plugin.

You will need a version of collectd compiled with the python plugin enabled. This can be done by downloading the collectd source and building it, making sure to have the python-devel package installed.

Matthew has a good write-up on compiling collectd and configuring it for Graphite: http://blog.matthewdfuller.com/2014/06/sending-collectd-metrics-to-graphite.html

Be sure to have these installed before running “./configure”:

yum install python
yum install python-devel

The configuration of /opt/collectd/etc/collectd.conf requires the following entries. Adjust the Username, Vcenter and Password to suit your environment.

<LoadPlugin python>
  Globals true
</LoadPlugin>

<Plugin python>
        # vcenter.py is at /usr/lib/python2.6/site-packages
        LogTraces true
        Interactive false
        Import vcenter
        <Module vcenter>
             Username "root"
             Vcenter "vc.local"
             Password "vmware"
             Verbose false
        </Module>
</Plugin>

You'll need to have pysphere installed. Paraphrasing the installation:

yum install python-setuptools
easy_install -U pysphere

This is the script that pulls all the stats we need. It works in my ESX 5.1 lab, where I have a single vCenter instance.

vcenter.py

#!/usr/bin/python
# Collect basic stats about powered-on VMs on a single vCenter
# Brett England - 1-Oct-2014
 
import collectd
from pysphere import VIServer
NAME = 'vCenter'   # Plugin name
 
# Metric and reporting type as a value
# https://collectd.org/wiki/index.php/Data_source
# /opt/collectd/share/collectd/types.db
# https://www.vmware.com/support/developer/vc-sdk/visdk41pubs/ApiReference/vim.vm.Summary.QuickStats.html
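# A value of None means the property is fetched but not dispatched as a metric.
# Note: the three memory values come back from vCenter in MB and are sent
# unscaled under the collectd "bytes" type.
METRIC={"name":None,
        "runtime.powerState":None,
        "summary.quickStats.overallCpuUsage":"cpufreq",  # MHz
        "summary.quickStats.uptimeSeconds":"gauge",      # seconds
        "summary.quickStats.guestMemoryUsage":"bytes",   # MB
        "summary.quickStats.hostMemoryUsage":"bytes",    # MB
        "config.hardware.memoryMB":"bytes"}              # MB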
 
def connect():
    server = VIServer()
    try:
        server.connect(VCENTER, USERNAME, PASSWORD)
    except Exception as e:
        # Log the reason so a bad password and an unreachable host look different
        logger('warn', "failed to connect to %s: %s" % (VCENTER, e))
        return None
    return server
 
def get_stats():
    server = connect()
    if server is None:
        return []
    ret = []
    props = server._retrieve_properties_traversal(property_names=METRIC.keys(),
                                                  obj_type="VirtualMachine")
    for obj in props:
        skip_vm = False
        d={}
        for vm_property in obj.PropSet:
            if vm_property.Name == "runtime.powerState":
                if vm_property.Val != "poweredOn":
                    skip_vm = True
                    break
                continue
 
            d[vm_property.Name] = vm_property.Val
 
        if not skip_vm:
            ret.append(d)
    server.disconnect()
    return ret
 
# callback configuration for module
def configure_callback(conf):
  global VCENTER, USERNAME, PASSWORD, VERBOSE_LOGGING
  VCENTER = ''
  USERNAME = ''
  PASSWORD = ''
  VERBOSE_LOGGING = False
 
  for node in conf.children:
    if node.key == "Vcenter":
      VCENTER = node.values[0]
    elif node.key == "Username":
      USERNAME = node.values[0]
    elif node.key == "Password":
      PASSWORD = node.values[0]
    elif node.key == "Verbose":
      VERBOSE_LOGGING = bool(node.values[0])
    else:
      logger('warn', 'Unknown config key: %s' % node.key)
 
# https://collectd.org/wiki/index.php/Naming_schema
# The serialized form of the identifier is:
# host "/" plugin ["-" plugin instance] "/" type ["-" type instance]
#
# We want:   VCENTER / VMNAME / type...
def dispatch_value(vmname, value, key, type):
 
    logger('verb','%s: Sending value: %s=%s' % (vmname, key, value))
    # This is not intuitive but its what we want.
    val = collectd.Values()
    val.host = VCENTER
    val.plugin = vmname
    val.type = type
    val.type_instance = key
    val.values = [value]
    val.dispatch()
 
def read_callback():
  logger('verb', "beginning read_callback")
  info = get_stats()
 
  if not info:
    logger('warn', "No data received")
    return
 
  for vm in info:
      vmname = vm['name']
      for key,value in vm.items():
          type = METRIC[key]
          if type:
              dispatch_value(vmname, value, key, type)
 
# logging function
def logger(t, msg):
    if t == 'err':
        collectd.error('%s: %s' % (NAME, msg))
    elif t == 'warn':
        collectd.warning('%s: %s' % (NAME, msg))
    elif t == 'verb':
        if VERBOSE_LOGGING:
            collectd.info('%s: %s' % (NAME, msg))
    else:
        collectd.notice('%s: %s' % (NAME, msg))
 
# main
collectd.register_config(configure_callback)
collectd.register_read(read_callback)
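
To see what dispatch_value ends up sending, here is a small stand-alone sketch of the naming schema quoted in the script's comments (host "/" plugin ["-" plugin instance] "/" type ["-" type instance]). It does not need collectd; the identifier() helper and the VM name "web01" are hypothetical and only there for illustration.

```python
def identifier(host, plugin, type_, type_instance=None, plugin_instance=None):
    """Build the serialized collectd identifier for one value."""
    plugin_part = plugin
    if plugin_instance:
        plugin_part = "%s-%s" % (plugin, plugin_instance)
    type_part = type_
    if type_instance:
        type_part = "%s-%s" % (type_, type_instance)
    return "/".join([host, plugin_part, type_part])


# Mirrors dispatch_value(): host = vCenter name, plugin = VM name,
# type = collectd type, type_instance = vCenter property name.
print(identifier("vc.local", "web01", "bytes",
                 "summary.quickStats.guestMemoryUsage"))
```

Putting the VM name in the plugin slot is what gives the VCENTER / VMNAME / type layout the script is after.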
2014/10/01 02:44


  • tidbits.txt
  • Last modified: 2009/11/27 16:59
  • by 127.0.0.1