PEBS disabled due to CPU errata
I noticed this in /var/log/messages on my freshly installed CentOS 7.5 system.
May 24 09:46:43 localhost kernel: smpboot: CPU0: Intel(R) Xeon(R) CPU E31220 @ 3.10GHz (fam: 06, model: 2a, stepping: 07)
May 24 09:46:43 localhost kernel: Performance Events: PEBS fmt1+, 16-deep LBR, SandyBridge events, full-width counters, Intel PMU driver.
May 24 09:46:43 localhost kernel: core: PEBS disabled due to CPU errata, please upgrade microcode
May 24 09:46:43 localhost kernel: ... version: 3
May 24 09:46:43 localhost kernel: ... bit width: 48
May 24 09:46:43 localhost kernel: ... generic registers: 8
May 24 09:46:43 localhost kernel: ... value mask: 0000ffffffffffff
May 24 09:46:43 localhost kernel: ... max period: 00007fffffffffff
May 24 09:46:43 localhost kernel: ... fixed-purpose events: 3
May 24 09:46:43 localhost kernel: ... event mask: 00000007000000ff
Red Hat has this to say about it (ref: https://access.redhat.com/solutions/634443):
Root Cause
- Clovertown and SandyBridge processors have errata regarding PEBS functionality.
Diagnostic Steps
- Look at /proc/cpuinfo for model number 15, 42 or 45
So I did
processor : 3
vendor_id : GenuineIntel
cpu family : 6
model : 42
model name : Intel(R) Xeon(R) CPU E31220 @ 3.10GHz
stepping : 7
microcode : 0x29
cpu MHz : 1599.951
cache size : 8192 KB
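The diagnostic step above can also be scripted. This is a minimal Python 3 sketch, not anything from the Red Hat note: the function names and the SAMPLE text are made up for illustration, and it only looks at the "model" field because that is the only criterion the note gives.

```python
# Models named in the Red Hat note quoted above (decimal, as shown in /proc/cpuinfo).
AFFECTED_MODELS = {15, 42, 45}

# Illustrative sample, copied from the output above.
SAMPLE = """processor : 3
vendor_id : GenuineIntel
cpu family : 6
model : 42
model name : Intel(R) Xeon(R) CPU E31220 @ 3.10GHz
stepping : 7
microcode : 0x29
"""

def parse_cpuinfo(text):
    """Split /proc/cpuinfo-style text into one dict per processor block."""
    cpus = []
    for block in text.strip().split("\n\n"):
        fields = {}
        for line in block.splitlines():
            if ":" in line:
                key, _, value = line.partition(":")
                fields[key.strip()] = value.strip()
        if fields:
            cpus.append(fields)
    return cpus

def affected_cpus(text):
    """Return the processor blocks whose 'model' is one of AFFECTED_MODELS."""
    hits = []
    for cpu in parse_cpuinfo(text):
        model = cpu.get("model", "")
        if model.isdigit() and int(model) in AFFECTED_MODELS:
            hits.append(cpu)
    return hits
```

On a live system you would pass in open("/proc/cpuinfo").read() instead of SAMPLE and print the "microcode" field of each hit.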
Install a few packages
# yum install microcode_ctl.x86_64
# yum install iucode-tool
# reboot
After the reboot I see this in /var/log/messages:
May 26 18:55:50 wine kernel: microcode: microcode updated early to revision 0x2d, date = 2018-02-07
Confirmation the processor has microcode patches applied
processor : 3
vendor_id : GenuineIntel
cpu family : 6
model : 42
model name : Intel(R) Xeon(R) CPU E31220 @ 3.10GHz
stepping : 7
microcode : 0x2d
The microcode is being updated as the system loads.
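If you want to check this across several machines, the two pieces of evidence above (the kernel log line and the microcode field) are easy to compare in a script. A small sketch, assuming the log message is worded exactly as on my system; the function names are hypothetical.

```python
import re

# Matches the kernel message shown above; other kernels may word it differently.
EARLY_UPDATE = re.compile(
    r"microcode updated early to revision (0x[0-9a-f]+), date = (\d{4}-\d{2}-\d{2})"
)

def early_update(log_line):
    """Return (revision, date) from a kernel log line, or None if it does not match."""
    match = EARLY_UPDATE.search(log_line)
    if match is None:
        return None
    return int(match.group(1), 16), match.group(2)

def was_upgraded(before, after):
    """Compare two 'microcode' values from /proc/cpuinfo, e.g. '0x29' and '0x2d'."""
    return int(after, 16) > int(before, 16)
```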
Laptop Hard Reboot
Model: HP EliteBook 745 G3
For a number of months my laptop has done what I can only describe as a hard boot. For no apparent reason it would power cycle. No Windows shutdown message, no warning; it's as if somebody has come along and pulled the power cord or held down a reset button.
As this is the 2nd time it's happened in less than 2 weeks, this is really starting to cause me problems as I'm losing work. You can see the last event (4/24/2017) on the reliability history report below.
Clicking on the view technical details reveals this snippet of information
The computer has rebooted from a bugcheck. The bugcheck was: 0x00000124 (0x0000000000000000, 0xffffe000f3f6c838, 0x0000000000000000, 0x0000000000000000). A dump was saved in: C:\WINDOWS\Minidump\050517-14718-01.dmp. Report Id: e55192e8-8d9d-45c1-8c03-e14e66640510.
Well, at least Windows agrees with me that it was not shut down correctly. Let's see if we can find out why.
We are going to need some additional Windows tools to read that minidump.
I've been keeping a log, but the minidump directory has these occasions timestamped for me. As you can see, this is the 8th time this has happened.
After installing the WDK we need to fire up WinDbg, which can be found here.
Pulling the 0505 minidump into WinDbg, this is what it tells us.
Microsoft (R) Windows Debugger Version 10.0.15063.0 AMD64
Copyright (c) Microsoft Corporation. All rights reserved.

Loading Dump File [C:\Windows\Minidump\050517-14718-01.dmp]
Mini Kernel Dump File: Only registers and stack trace are available

Symbol search path is: srv*
Executable search path is:
Windows 10 Kernel Version 10586 MP (4 procs) Free x64
Product: WinNt, suite: TerminalServer SingleUserTS
Built by: 10586.839.amd64fre.th2_release.170303-1605
Machine Name:
Kernel base = 0xfffff803`d0218000 PsLoadedModuleList = 0xfffff803`d04f5c90
Debug session time: Fri May 5 13:51:37.032 2017 (UTC - 4:00)
System Uptime: 0 days 0:00:02.748
Loading Kernel Symbols
..
Press ctrl-c (cdb, kd, ntsd) or ctrl-break (windbg) to abort symbol loads that take too long.
Run !sym noisy before .reload to track down problems loading symbols.
.............................................................
..
Loading User Symbols
Mini Kernel Dump does not contain unloaded driver list
*******************************************************************************
*                                                                             *
*                        Bugcheck Analysis                                    *
*                                                                             *
*******************************************************************************

Use !analyze -v to get detailed debugging information.

BugCheck 124, {0, ffffe000f3f6c838, 0, 0}

Probably caused by : AuthenticAMD

Followup: MachineOwner
---------

2: kd> !analyze -v
*******************************************************************************
*                                                                             *
*                        Bugcheck Analysis                                    *
*                                                                             *
*******************************************************************************

WHEA_UNCORRECTABLE_ERROR (124)
A fatal hardware error has occurred. Parameter 1 identifies the type of error
source that reported the error. Parameter 2 holds the address of the
WHEA_ERROR_RECORD structure that describes the error conditon.
Arguments:
Arg1: 0000000000000000, Machine Check Exception
Arg2: ffffe000f3f6c838, Address of the WHEA_ERROR_RECORD structure.
Arg3: 0000000000000000, High order 32-bits of the MCi_STATUS value.
Arg4: 0000000000000000, Low order 32-bits of the MCi_STATUS value.

Debugging Details:
------------------

DUMP_CLASS: 1
DUMP_QUALIFIER: 400
BUILD_VERSION_STRING: 10.0.10586.839 (th2_release.170303-1605)
DUMP_TYPE: 2
BUGCHECK_P1: 0
BUGCHECK_P2: ffffe000f3f6c838
BUGCHECK_P3: 0
BUGCHECK_P4: 0
BUGCHECK_STR: 0x124_AuthenticAMD
CPU_COUNT: 4
CPU_MHZ: 705
CPU_VENDOR: AuthenticAMD
CPU_FAMILY: 15
CPU_MODEL: 60
CPU_STEPPING: 1
CUSTOMER_CRASH_COUNT: 1
DEFAULT_BUCKET_ID: WIN8_DRIVER_FAULT
PROCESS_NAME: System
CURRENT_IRQL: 0
ANALYSIS_SESSION_HOST: ENGLAND1
ANALYSIS_SESSION_TIME: 05-05-2017 14:34:37.0456
ANALYSIS_VERSION: 10.0.15063.0 amd64fre

STACK_TEXT:
ffffd000`ab3245b0 fffff803`d05c77cd : 00000000`00000000 ffffe000`f3f6c810 fffff803`d04e96a0 fffff803`d05aa340 : nt!WheapCreateLiveTriageDump+0x81
ffffd000`ab324ae0 fffff803`d0428c94 : ffffe000`f3f6c810 ffffe000`f3f73030 ffffd000`ab324af8 00000000`00000000 : nt!WheapCreateTriageDumpFromPreviousSession+0x2d
ffffd000`ab324b10 fffff803`d0429dd9 : fffff803`d04e9640 fffff803`d04e9640 fffff803`d04e96a0 fffff803`d028d710 : nt!WheapProcessWorkQueueItem+0x48
ffffd000`ab324b50 fffff803`d025dcf9 : fffff803`d05aa200 ffffe000`f3bb8040 fffff803`00000000 ffffe000`f42fca48 : nt!WheapWorkQueueWorkerRoutine+0x25
ffffd000`ab324b80 fffff803`d02cd9b5 : 00000205`b4bbbdff 00000000`00000080 ffffe000`f2427680 ffffe000`f3bb8040 : nt!ExpWorkerThread+0xe9
ffffd000`ab324c10 fffff803`d035fae6 : fffff803`d0534180 ffffe000`f3bb8040 fffff803`d02cd974 00000000`00000000 : nt!PspSystemThreadStartup+0x41
ffffd000`ab324c60 00000000`00000000 : ffffd000`ab325000 ffffd000`ab31f000 00000000`00000000 00000000`00000000 : nt!KiStartSystemThread+0x16

STACK_COMMAND: kb
THREAD_SHA1_HASH_MOD_FUNC: 26acd050bd9f055d0a04825d57b9e0e6be9c1a07
THREAD_SHA1_HASH_MOD_FUNC_OFFSET: 5e1e1a155874296ef3d407b143c830e84a016e94
THREAD_SHA1_HASH_MOD: 30a3e915496deaace47137d5b90c3ecc03746bf6
FOLLOWUP_NAME: MachineOwner
MODULE_NAME: AuthenticAMD
IMAGE_NAME: AuthenticAMD
DEBUG_FLR_IMAGE_TIMESTAMP: 0
FAILURE_BUCKET_ID: 0x124_AuthenticAMD_PROCESSOR_BUS_PRV
BUCKET_ID: 0x124_AuthenticAMD_PROCESSOR_BUS_PRV
PRIMARY_PROBLEM_CLASS: 0x124_AuthenticAMD_PROCESSOR_BUS_PRV
TARGET_TIME: 2017-05-05T17:51:37.000Z
OSBUILD: 10586
OSSERVICEPACK: 839
SERVICEPACK_NUMBER: 0
OS_REVISION: 0
SUITE_MASK: 272
PRODUCT_TYPE: 1
OSPLATFORM_TYPE: x64
OSNAME: Windows 10
OSEDITION: Windows 10 WinNt TerminalServer SingleUserTS
OS_LOCALE:
USER_LCID: 0
OSBUILD_TIMESTAMP: 2017-03-03 23:13:02
BUILDDATESTAMP_STR: 170303-1605
BUILDLAB_STR: th2_release
BUILDOSVER_STR: 10.0.10586.839
ANALYSIS_SESSION_ELAPSED_TIME: b84
ANALYSIS_SOURCE: KM
FAILURE_ID_HASH_STRING: km:0x124_authenticamd_processor_bus_prv
FAILURE_ID_HASH: {6fd7875b-9a1b-9e09-d6d6-816026a875c8}

Followup: MachineOwner
---------
Decoding that ARG2 from the WHEA_UNCORRECTABLE_ERROR (124)
2: kd> !errrec ffffe000f3f6c838
===============================================================================
Common Platform Error Record @ ffffe000f3f6c838
-------------------------------------------------------------------------------
Record Id     : 01d2c5c83ca54560
Severity      : Fatal (1)
Length        : 928
Creator       : Microsoft
Notify Type   : Machine Check Exception
Timestamp     : 5/5/2017 17:51:37 (UTC)
Flags         : 0x00000002 PreviousError

===============================================================================
Section 0     : Processor Generic
-------------------------------------------------------------------------------
Descriptor    @ ffffe000f3f6c8b8
Section       @ ffffe000f3f6c990
Offset        : 344
Length        : 192
Flags         : 0x00000001 Primary
Severity      : Fatal

Proc. Type    : x86/x64
Instr. Set    : x64
Error Type    : BUS error
Operation     : Generic
Flags         : 0x00
Level         : 3
CPU Version   : 0x0000000000660f01
Processor ID  : 0x0000000000000000

===============================================================================
Section 1     : x86/x64 Processor Specific
-------------------------------------------------------------------------------
Descriptor    @ ffffe000f3f6c900
Section       @ ffffe000f3f6ca50
Offset        : 536
Length        : 128
Flags         : 0x00000000
Severity      : Fatal

Local APIC Id : 0x0000000000000000
CPU Id        : 01 0f 66 00 00 08 04 00 - 0b 32 d8 7e ff fb 8b 17
                00 00 00 00 00 00 00 00 - 00 00 00 00 00 00 00 00
                00 00 00 00 00 00 00 00 - 00 00 00 00 00 00 00 00

Proc. Info 0  @ ffffe000f3f6ca50

===============================================================================
Section 2     : x86/x64 MCA
-------------------------------------------------------------------------------
Descriptor    @ ffffe000f3f6c948
Section       @ ffffe000f3f6cad0
Offset        : 664
Length        : 264
Flags         : 0x00000000
Severity      : Fatal

Error         : BUSLG_OBS_ERR_*_NOTIMEOUT_ERR (Proc 0 Bank 4)
Status        : 0xfa000010000b0c0f
REF
So my options are either:
- Your RAM is faulty (Bank 4 = 4th DIMM slot). Run Memtest for NO LESS than ~8 passes (several hours):
- Your motherboard is faulty, and will need to be replaced.
Either way it's a hardware problem. Others are reporting the same.
Digging deeper: ref https://davidcmoisan.wordpress.com/2010/07/01/bad-hardware-day-more-on-hardware-bluescreens/
2: kd> .formats 0xfa000010000b0c0f
Evaluate expression:
  Hex:     fa000010`000b0c0f
  Decimal: -432345495507366897
  Octal:   1750000001000002606017
  Binary:  11111010 00000000 00000000 00010000 00000000 00001011 00001100 00001111
  Chars:   ........
  Time:    ***** Invalid FILETIME
  Float:   low 1.01452e-039 high -1.66154e+035
  Double:  -4.53808e+279
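Rather than reading the binary by eye, the Status value reported by !errrec can be picked apart in a few lines of Python. This is only a sketch under my understanding of the machine-check status layout: the top flag bits (63 down to 57) and the "0000 1PPT RRRR IILL" bus error encoding in the low 16 bits. Bits 16 to 56 are vendor and model specific and are left undecoded here. The function name is made up.

```python
def decode_mci_status(status):
    """Decode the top flag bits and the 16-bit MCA error code of an MCi_STATUS value."""
    code = status & 0xFFFF
    fields = {
        "VAL":   bool((status >> 63) & 1),  # register contains a valid error
        "OVER":  bool((status >> 62) & 1),  # an earlier error was overwritten
        "UC":    bool((status >> 61) & 1),  # uncorrected error
        "EN":    bool((status >> 60) & 1),  # error reporting was enabled
        "MISCV": bool((status >> 59) & 1),  # MCi_MISC holds extra information
        "ADDRV": bool((status >> 58) & 1),  # MCi_ADDR holds a valid address
        "PCC":   bool((status >> 57) & 1),  # processor context corrupt
        "mca_error_code": code,
    }
    # Bus / interconnect errors are encoded as 0000 1PPT RRRR IILL.
    if (code & 0xF800) == 0x0800:
        fields["class"] = "bus"
        fields["PP"] = ("SRC", "RES", "OBS", "GEN")[(code >> 9) & 3]  # participation
        fields["T"] = ("NOTIMEOUT", "TIMEOUT")[(code >> 8) & 1]
        fields["RRRR"] = (code >> 4) & 0xF                            # request type, 0 = generic
        fields["II"] = (code >> 2) & 3                                # memory / I/O, 3 = generic
        fields["LL"] = ("L0", "L1", "L2", "LG")[code & 3]             # cache level
    return fields
```

For 0xfa000010000b0c0f this gives LL = LG, PP = OBS and T = NOTIMEOUT, which lines up with the BUSLG_OBS_ERR_*_NOTIMEOUT_ERR string WinDbg printed, with UC and PCC set and ADDRV clear.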
I wonder if this is heat related, as I had this set to Passive because the fan was noisy. Probably not, as that would imply this laptop could never run on battery! I will change it back to Active and see if that helps, and I'll put up with the fan spinning for a while.
That did not help. I'm still suffering random reboots; the latest happened on 16-Jun-2017.
This machine is being returned. I cannot tolerate a computer randomly rebooting.
Windows patches and SHA1
Windows download URLs contain a SHA1 checksum as part of the URL:
You can download the file and then use the SHA1 checksum from the URL itself to validate that the file downloaded correctly. Sounds like a good idea. It is, until MS screws up the SHA1 in the URL.
# openssl sha1 windowsserver2003-kb824141-x86-enu_90853a52ea80f7da3c5460ef102ade3.exe
SHA1(windowsserver2003-kb824141-x86-enu_90853a52ea80f7da3c5460ef102ade3.exe)= bfa8072aa29dbe552f952cdb42b1f635072ae081
This is the list of filenames I've discovered where the SHA1 embedded in the filename does not match the one computed from the file.
['windowsserver2003-kb824141-x86-enu_90853a52ea80f7da3c5460ef102ade3.exe', 'msjavwu_8073687b82d41db93f4c2a04af2b34d.exe', 'windowsserver2003-kb835732-x86-enu_9c2348f833ade0cca439ec6b2a92179.exe', 'windowsmedia9-kb819639-x86-enu_57af369562f19dc35e69681660521fb.exe', 'windowsserver2003-kb828741-x86-enu_1e3156bf5ec0354f542c38f309bab49.exe', 'windowsserver2003-kb819696-x86-enu_41cdc8619ebb756106ea383c055530d.exe', 'windowsserver2003-kb825119-x86-enu_329e94ea193be4c2d2f8d9bfc4daf23.exe', 'windowsserver2003-kb840374-x86-enu_eeafbc20c2402b1c951d155d3d2cb9c.exe', 'windowsserver2003-kb837001-x86-enu_0a248bb59a71c52a288c837779ac98e.exe', 'windowsserver2003-kb823980-x86-enu_7f97e0d2355f670acb9384ad0933515.exe', 'windowsserver2003-kb824146-x86-enu_f759bdcfdc906b0b35ad697a29ed1a1.exe', 'windowsserver2003-kb823559-x86-enu_d8d3b25c5678c692e29cf971a6c38fa.exe', 'windowsserver2003-kb824105-x86-enu_c7fd830ee6b1c3bb594be4f7a61f43c.exe', 'windowsserver2003-kb828028-x86-enu_52dce385c001ce81c2514c3fb1cac7e.exe', 'windowsserver2003-kb828035-x86-enu_d1df77e311740d6c012bcda5a7f821f.exe', 'directx9-kb819696-x86-enu_977f8cc86c1e151a0168d1296210913.exe', 'windowsserver2003-kb830352-x86-enu_d67acb6c784dd87961c8070943dadd8.exe', 'sql2000-kb815495-8.00.0818-enu_4c77bb3f492fb1670b90b477d674e7e.exe', 'windowsserver2003-kb823182-x86-enu_c7ee6a3716815554656d98ed9bc85d5.exe', 'windowsxp-kb883939-x64-enu_9e1efe32675530155c34f7af1172a6d496e1e5ee.exe', 'ndp10_sp_q321884_en_0fc8b14a073e01a03c27c948d254feedaa79feae.exe']
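A list like this can be produced with a short script. What follows is a sketch in Python 3 rather than the script I actually used; the function names are made up, and it assumes the checksum is the run of hex digits between the last underscore and ".exe". It also returns the embedded string as-is, which is worth looking at: in the first filename above it is only 31 hex characters long, whereas a SHA1 hex digest is 40.

```python
import hashlib
import os
import re

# Assumption: the checksum is the hex run between the last "_" and ".exe".
DIGEST_IN_NAME = re.compile(r"_([0-9a-f]+)\.exe$")

def embedded_digest(filename):
    """Return the hex string embedded in the filename, or None if there is none."""
    match = DIGEST_IN_NAME.search(filename.lower())
    return match.group(1) if match else None

def sha1_of(path, chunk_size=1 << 20):
    """Compute the SHA1 hex digest of a file, reading it in chunks."""
    digest = hashlib.sha1()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()

def check(path):
    """Return (embedded, computed, matches) for one downloaded file."""
    embedded = embedded_digest(os.path.basename(path))
    computed = sha1_of(path)
    return embedded, computed, embedded == computed
```

Running check() over a download directory and keeping the entries where the third element is False gives the mismatch list.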
Python decompress PACK_MAGIC
A file compressed with the pack format starts with the magic bytes \037\036 in octal, or 0x1f 0x1e in hex.
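A quick way to test for this in Python 3 is sketched below. The magic value follows the byte order gzip uses for its PACK_MAGIC constant (0x1f first, then 0x1e). The header layout after the magic, a 4-byte big-endian original length followed by a 1-byte maximum code length, is my reading of how gzip's unpack.c consumes the header, so treat it as an assumption. The function name is made up.

```python
PACK_MAGIC = b"\x1f\x1e"  # octal \037\036

def read_pack_header(data):
    """Return (original_length, max_code_length) for pack data, or None if the magic is absent."""
    if data[:2] != PACK_MAGIC:
        return None
    if len(data) < 7:
        raise ValueError("truncated pack header")
    original_length = int.from_bytes(data[2:6], "big")  # assumed: 4 bytes, most significant first
    max_code_length = data[6]                           # assumed: 1 byte
    return original_length, max_code_length
```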
gzip can decode this, as can the pcat program. As an exercise I converted the unpack.c module in gzip into its Python equivalent.
The slowest part of the code is the look_bits function, and this is where you can see how an interpreted language grinds compared to C.
Using the excellent line profiler: https://pypi.python.org/pypi/line_profiler/
Timer unit: 1e-06 s = 1uS
Total time: 43.3399 s
File: unpack.py
Function: look_bits at line 36

Line #      Hits         Time  Per Hit   % Time  Line Contents
==============================================================
    36                                           @profile
    37                                           def look_bits(self,bits,mask):
    38    351442       265280      0.8      0.6      while(self.valid < bits):
    39    140575     10361348     73.7     23.9          self.bitbuf <<= 8
    40    140575     14576495    103.7     33.6          self.bitbuf |= next(self.get_byte)
    41    140575       189102      1.3      0.4          self.valid += 8
    42    210867     17947709     85.1     41.4      return (self.bitbuf >> (self.valid - bits)) & mask
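One possible reason the shift, OR and return lines cost tens of microseconds each: in C the bit buffer is a fixed-width unsigned long, so old bits fall off the top, but a Python integer is arbitrary precision and keeps growing if it is never trimmed. I have not re-profiled this, so take it as a hypothesis. The sketch below (Python 3, with a hypothetical BitReader class and a skip_bits method modelled on the macro of the same name in unpack.c) masks off the consumed bits so the buffer stays small.

```python
class BitReader:
    """Minimal MSB-first bit reader in the style of gzip's look_bits/skip_bits macros."""

    def __init__(self, data):
        self.get_byte = iter(data)  # iterating over bytes yields ints in Python 3
        self.bitbuf = 0
        self.valid = 0              # number of unread bits held in bitbuf

    def look_bits(self, bits, mask):
        """Peek at the next `bits` bits without consuming them."""
        while self.valid < bits:
            self.bitbuf = (self.bitbuf << 8) | next(self.get_byte)
            self.valid += 8
        return (self.bitbuf >> (self.valid - bits)) & mask

    def skip_bits(self, bits):
        """Consume `bits` bits and drop them from the buffer so it cannot grow."""
        self.valid -= bits
        self.bitbuf &= (1 << self.valid) - 1
```

With the mask in skip_bits, bitbuf never holds more than a couple of bytes, so each shift works on a small integer regardless of how much of the file has been read.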
vCenter VM monitoring with Graphite
This was done on a CentOS 6.3 server; your mileage may vary on another platform.
The problem I was trying to solve was that I wanted to monitor the vitals of all my VMs without having to install collectd into each VM. By talking to vCenter we can pull out everything we need. This work was inspired by collectd-vcenter, which I could not get to work as it assumes you have an ESX cluster. So I wrote my own.
See details on setting up graphite which I used for my xAP monitoring
We don't use the collectd RPM from the EPEL repository as it's too old and does not have the write_graphite plugin.
You will need to have a version of collectd compiled that has the python plugin enabled. This can be done by downloading the collectd source and building it, making sure to have the python-devel package installed.
Matthew has a good write-up on the collectd compilation and configuration for graphite. http://blog.matthewdfuller.com/2014/06/sending-collectd-metrics-to-graphite.html
Be sure to have these installed before running "./configure"
yum install python
yum install python-devel
The configuration of /opt/collectd/etc/collectd.conf requires the following entries. Adjust the Username, Vcenter and Password to suit your environment.
<LoadPlugin python>
  Globals true
</LoadPlugin>

<Plugin python>
  # vcenter.py is at /usr/lib/python2.6/site-packages
  LogTraces true
  Interactive false
  Import vcenter
  <Module vcenter>
    Username "root"
    Vcenter "vc.local"
    Password "vmware"
    Verbose false
  </Module>
</Plugin>
You'll need to have pysphere installed. Paraphrasing the installation:
yum install python-setuptools
easy_install -U pysphere
The magic script that pulls all the stats we need. This works in my ESX 5.1 lab where I have a single vCenter instance.
vcenter.py
#!/usr/bin/python
# Collect basic stats about powered-on VMs on a single vCenter
# Brett England - 1-Oct-2014
import collectd
from pysphere import VIServer

NAME = 'vCenter'  # Plugin name

# Metric and reporting type as a value
# https://collectd.org/wiki/index.php/Data_source
# /opt/collectd/share/collectd/types.db
# https://www.vmware.com/support/developer/vc-sdk/visdk41pubs/ApiReference/vim.vm.Summary.QuickStats.html
METRIC = {"name": None,
          "runtime.powerState": None,
          "summary.quickStats.overallCpuUsage": "cpufreq",  # MHz
          "summary.quickStats.uptimeSeconds": "gauge",
          "summary.quickStats.guestMemoryUsage": "bytes",
          "summary.quickStats.hostMemoryUsage": "bytes",
          "config.hardware.memoryMB": "bytes"}

# Defaults so that logger() works even if configure_callback has not run yet
VCENTER = ''
USERNAME = ''
PASSWORD = ''
VERBOSE_LOGGING = False


def connect():
    server = VIServer()
    try:
        server.connect(VCENTER, USERNAME, PASSWORD)
    except Exception:
        logger('warn', "failed to connect to %s" % (VCENTER))
        return None
    return server


def get_stats():
    server = connect()
    if server is None:
        return []
    ret = []
    props = server._retrieve_properties_traversal(property_names=METRIC.keys(),
                                                  obj_type="VirtualMachine")
    for obj in props:
        skip_vm = False
        d = {}
        for vm_property in obj.PropSet:
            if vm_property.Name == "runtime.powerState":
                # Only report on VMs that are powered on
                if vm_property.Val != "poweredOn":
                    skip_vm = True
                    break
                continue
            d[vm_property.Name] = vm_property.Val
        if not skip_vm:
            ret.append(d)
    server.disconnect()
    return ret


# callback configuration for module
def configure_callback(conf):
    global VCENTER, USERNAME, PASSWORD, VERBOSE_LOGGING
    VCENTER = ''
    USERNAME = ''
    PASSWORD = ''
    VERBOSE_LOGGING = False
    for node in conf.children:
        if node.key == "Vcenter":
            VCENTER = node.values[0]
        elif node.key == "Username":
            USERNAME = node.values[0]
        elif node.key == "Password":
            PASSWORD = node.values[0]
        elif node.key == "Verbose":
            VERBOSE_LOGGING = bool(node.values[0])
        else:
            logger('warn', 'Unknown config key: %s' % node.key)


# https://collectd.org/wiki/index.php/Naming_schema
# The serialized form of the identifier is:
#   host "/" plugin ["-" plugin instance] "/" type ["-" type instance]
#
# We want: VCENTER / VMNAME / type...
def dispatch_value(vmname, value, key, type):
    logger('verb', '%s: Sending value: %s=%s' % (vmname, key, value))
    # This is not intuitive but it's what we want.
    val = collectd.Values()
    val.host = VCENTER
    val.plugin = vmname
    val.type = type
    val.type_instance = key
    val.values = [value]
    val.dispatch()


def read_callback():
    logger('verb', "beginning read_callback")
    info = get_stats()
    if not info:
        logger('warn', "No data received")
        return
    for vm in info:
        vmname = vm['name']
        for key, value in vm.items():
            type = METRIC[key]
            if type:
                dispatch_value(vmname, value, key, type)


# logging function
def logger(t, msg):
    if t == 'err':
        collectd.error('%s: %s' % (NAME, msg))
    elif t == 'warn':
        collectd.warning('%s: %s' % (NAME, msg))
    elif t == 'verb':
        if VERBOSE_LOGGING:
            collectd.info('%s: %s' % (NAME, msg))
    else:
        collectd.notice('%s: %s' % (NAME, msg))


# main
collectd.register_config(configure_callback)
collectd.register_read(read_callback)
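To see what the naming trick in dispatch_value produces, here is a small standalone sketch of the identifier serialization described in the comment in the script. It does not use collectd at all; the function name and the VM name "web01" are made up for illustration, and how write_graphite then turns the identifier into a Graphite path depends on your collectd configuration.

```python
def serialize_identifier(host, plugin, type_, plugin_instance=None, type_instance=None):
    """Build: host "/" plugin ["-" plugin instance] "/" type ["-" type instance]."""
    ident = host + "/" + plugin
    if plugin_instance:
        ident += "-" + plugin_instance
    ident += "/" + type_
    if type_instance:
        ident += "-" + type_instance
    return ident

# The script sets host=VCENTER, plugin=vmname, type=<types.db type>, type_instance=<property name>
example = serialize_identifier("vc.local", "web01", "gauge",
                               type_instance="summary.quickStats.uptimeSeconds")
print(example)
```

Putting the VM name in the plugin slot is what groups every metric for a VM under the vCenter host, at the cost of the plugin field no longer naming the plugin.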