Kevin Holman's System Center Blog

The 31552 event, or “why is my data warehouse server consuming so much CPU?”


A very common customer scenario – is where all of a sudden you start getting these 31552 events on the RMS, every 10 minutes.  This drives a monitor state and generates an alert when the monitor goes red.

image

 

However – most of the time my experience is that this alert gets “missed” in all the other alerts that OpsMgr raises throughout the day.  Eventually, customers will notice the state of the RMS is critical, or their availability reports take forever or start timing out, or they notice that CPU on the data warehouse server is pegged or very high.  It may be several days before they are even aware of the condition.

 

image

image

 

 

The 31552 event is similar to below:

Date and Time: 8/26/2010 11:10:10 AM
Log Name: Operations Manager
Source: Health Service Modules
Event Number: 31552
Level: 1
Logging Computer: OMRMS.opsmgr.net
User: N/A
Description:
Failed to store data in the Data Warehouse. Exception 'SqlException': Timeout expired. The timeout period elapsed prior to completion of the operation or the server is not responding. One or more workflows were affected by this. Workflow name: Microsoft.SystemCenter.DataWarehouse.StandardDataSetMaintenance Instance name: State data set Instance ID: {50F43FBB-3F59-10DA-AD1F-77E61C831E36} Management group: PROD1
 

The alert is:

Data Warehouse object health state data dedicated maintenance process failed to perform maintenance operation

Data Warehouse object health state data dedicated maintenance process failed to perform maintenance operation. Failed to store data in the Data Warehouse.
Exception 'SqlException': Timeout expired. The timeout period elapsed prior to completion of the operation or the server is not responding.

One or more workflows were affected by this.

Workflow name: Microsoft.SystemCenter.DataWarehouse.StandardDataSetMaintenance
Instance name: State data set
Instance ID: {50F43FBB-3F59-10DA-AD1F-77E61C831E36}
Management group: PROD1

 

 

Now – there can be MANY causes of getting this 31552 event and monitor state.  There is NO SINGLE diagnosis or solution.  Generally – we recommend you call into MS support when impacted by this so your specific issue can be evaluated. 

 

The most common issues causing the 31552 events seen are:

  • A sudden flood (or excessive sustained amounts) of data to the warehouse that is causing aggregations to fail moving forward.
  • The Exchange 2010 MP is imported into an environment with lots of statechanges happening.
  • Excessively large ManagedEntityProperty tables causing maintenance to fail because it cannot be parsed quickly enough in the time allotted.
  • Too many tables joined in a view or query (>256 tables) when using SQL 2005 as the DB Engine
  • SQL performance issues (typically disk I/O)
  • Using SQL Standard edition, you might see these randomly at night during maintenance, as online indexing is not supported in Standard edition.
  • Messed up SQL permissions
  • Too much data in the warehouse staging tables which was not processed due to an issue and is now too much to be processed at one time.
  • Random 31552’s caused by DBA maintenance, backup operations, etc.

If you think you are impacted with this, and have an excessively large ManagedEntityProperty table – the best bet is to open a support case.  This requires careful diagnosis and involves manually deleting data from the database which is only supported when directed by a Microsoft Support Professional.

 

The “too many tables” issue is EASY to diagnose – because the text of the 31552 event will state exactly that.  It is easily fixed by reducing data warehouse retention of the affected dataset type.

 

 

Now – the MOST common scenario I seem to run into – actually just happened to me in my lab environment, which prompted this article.  I see this happen in customer environments all too often.

I had a monitor which was based on Windows Events.  There was a “bad” event and a “good” event.  However – something broke in the application – and caused BOTH events to be written to the application log multiple times a second.  We could argue this is a bad monitor, or a defective logging module for the application…. but regardless, the condition is that a monitor of ANY type starts flapping, changing from good to bad to good WAY too many times.

What resulted – was 21,000 state changes for my monitor, within a 15 MINUTE period.

image

 

At the same time, all the aggregate rollup, and dependency monitors, were also having to process these statechanges…. which are also recorded as a statechange event in the database.  So you can see – a SINGLE bad monitor can wreak havoc on the entire system… affecting many more monitors in the health state rollup.

 

The Operations Database handles these inserts quite well, but the Data Warehouse does not.  Each statechange event is written to both databases.  The standard dataset maintenance job is kicked off every 60 seconds on the warehouse.  This is called by a rule (Standard Data Warehouse Data Set maintenance rule) which targets the “Standard Data Set” class, and executes a specialized write action to start maintenance on the warehouse.

What is failing here – is that the maintenance operation (which also handles hourly and daily dataset aggregations for reports) is failing to complete in the default time allotted.  Essentially – there are SO many statechanges in a given hour – that the maintenance operation cannot complete and times out, rolling back the transaction.  This is a never-ending loop, which is why it never seems to “catch up”… because a single large transaction that cannot complete blocks this being committed to the database.  Under normal circumstances – 10 minutes is plenty of time to complete these aggregations, but under a flood condition, there are too many statechanges to calculate the time-in-state for each monitor and instance before the timeout.
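You can see this backlog directly in the warehouse.  This is a sketch – I am assuming the StandardDatasetAggregationHistory table, where DirtyInd = 1 marks an aggregation that has not yet been processed:

USE [OperationsManagerDW]
-- Outstanding aggregations per dataset (AggregationTypeId: 20 = hourly, 30 = daily)
SELECT ds.SchemaName, ah.AggregationTypeId, COUNT(*) AS OutstandingAggregations
FROM StandardDatasetAggregationHistory ah WITH (NOLOCK)
JOIN StandardDataset ds ON ds.DatasetId = ah.DatasetId
WHERE ah.DirtyInd = 1
GROUP BY ds.SchemaName, ah.AggregationTypeId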

So – the solution here is fairly simple: 

  • First – solve the initial problem that caused the flood.  Ensure you don’t have too many statechanges constantly coming in that are contributing to this.  I discuss how to detect this condition and rectify it HERE, and there is a query sketch below.
  • Second – we need to disable the standard built-in maintenance that is failing, and run it manually, so it can complete with success. 
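For the first step, a quick way to find the flapping monitor(s) is to query the OperationsManager database for the noisiest monitors.  This is a sketch – it assumes the StateChangeEvent, State, and MonitorView tables, which are present in OpsMgr 2007 R2 and 2012:

USE [OperationsManager]
-- Top 20 unit monitors by state change volume over the last 7 days
SELECT TOP 20 m.DisplayName AS MonitorDisplayName,
  COUNT(sce.StateId) AS NumStateChanges
FROM StateChangeEvent sce WITH (NOLOCK)
JOIN State s WITH (NOLOCK) ON sce.StateId = s.StateId
JOIN MonitorView m WITH (NOLOCK) ON s.MonitorId = m.Id
WHERE m.IsUnitMonitor = 1
  AND sce.TimeGenerated > DATEADD(DAY, -7, GETUTCDATE())
GROUP BY m.DisplayName
ORDER BY NumStateChanges DESC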

For the second step above – here is the process:

1.  Using the instance name section in the 31552 event, find the dataset that is causing the timeout (See the highlighted section in the event below)

Date and Time: 8/26/2010 11:10:10 AM
Log Name: Operations Manager
Source: Health Service Modules
Event Number: 31552
Level: 1
Logging Computer: OMRMS.opsmgr.net
User: N/A
   Description:
Failed to store data in the Data Warehouse. Exception 'SqlException': Timeout expired. The timeout period elapsed prior to completion of the operation or the server is not responding. One or more workflows were affected by this.

Workflow name: Microsoft.SystemCenter.DataWarehouse.StandardDataSetMaintenance
Instance name: State data set
Instance ID: {50F43FBB-3F59-10DA-AD1F-77E61C831E36}
Management group: PROD1
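If it isn’t obvious which warehouse dataset the instance name maps to, you can list the datasets – this is the same StandardDataset table the maintenance query in step 5 uses:

USE [OperationsManagerDW]
-- List each dataset and its schema name (e.g. 'State', 'Perf', 'Event')
SELECT DatasetId, SchemaName FROM StandardDataset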

2.  Create an override to disable the maintenance procedure for this data set:

  • In the OpsMgr console go to Authoring-> Rules-> Change Scope to “Standard Data Set”
  • Right click the rule “Standard Data Warehouse Data Set maintenance rule” > Overrides > Override the rule > For a specific object of class: Standard Data Set
  • Select the data set that you found from the event in step 1.
  • Check the box next to Enabled and change the override value to “False”, and then apply the changes.
  • This will disable dataset maintenance from running automatically for the given dataset type.

3.  Restart the “System Center Management” service on the RMS.  This is done to kill any maintenance already running, and ensure the override is applied immediately.
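For example, from an elevated command prompt on the RMS (HealthService is the internal name of the “System Center Management” service):

net stop HealthService
net start HealthService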

4.  Wait 10 minutes and then connect to the SQL server that hosts the OperationsManagerDW database and open SQL Management Studio.

5. Run the query below, replacing the dataset schema name (‘State’ in this example) with the name of the dataset from step 1.

**Note: This query could take several hours to complete.  This is dependent on how much data has been flooded to the warehouse, and how far behind it is in processing.  Do not stop the query prior to completion.

USE [OperationsManagerDW]
DECLARE @DataSet uniqueidentifier
-- Look up the dataset ID by schema name - replace 'State' with the schema name of the dataset from step 1
SET @DataSet = (SELECT DatasetId FROM StandardDataset WHERE SchemaName = 'State')
-- Run the same maintenance the disabled rule would normally run, without the 10 minute timeout
EXEC StandardDatasetMaintenance @DataSet

6. Once the query finishes, delete the override configured in step 2.

7. Monitor the event log for any further timeout events.

 

 

In my case – my maintenance task ran for 25 minutes then completed.  In most customer environments – this can take several hours to complete, depending on how powerful their SQL servers are and how big the backlog is.  If the maintenance task returns immediately and does not appear to run, ensure your override is set correctly, and try again after 10 minutes.  Maintenance will not run if the warehouse thinks it is already running.

***Note:  Now – this seemed to clear up my issue, as immediately the 31552’s were gone.  However – at 2am, they came back, every 10 minutes again and my warehouse CPU was spiked again.  My assumption here – is that it got through the hourly aggregations flood, and now it was trying to get through the daily aggregations work and had the same issue.  So – when I discovered this was sick again – I used the same procedure above, and this time the job took the same 25 minutes.  I have seen this same behavior with a customer – where it took several days to “plow through” the flood of data to finally get to a state where the maintenance would always complete in the 10 minute time period.

 

This is a good, simple process to try to resolve the issue yourself, without having to log a call with Microsoft first.  There is no risk in attempting this process yourself – to see if it can resolve your issue.

If you are still seeing timeout events, there are other issues involved.  I’d recommend opening a call up with Microsoft at that point.

Again – this is just ONE TYPE of (very common) 31552 issue.  There are many others, and careful diagnosis is needed.  Never assume someone else's fix will resolve your specific problem, and NEVER edit an OpsMgr database directly unless under the direct support of a Microsoft support engineer.

 

 

(***Special thanks to Chris Wallen, a Sr. Support Escalation Engineer in Microsoft Support for assisting with the data for this article)


After moving your OperationsManager Database–you might find event 18054 errors in the SQL server application log


I recently wrote about My Experience Moving the Operations Database to New Hardware

Something I noticed today – is that the application event log on the SQL server was full of 18054 events, such as below:

Log Name:      Application
Source:        MSSQL$I01
Date:          10/23/2010 5:40:14 PM
Event ID:      18054
Task Category: Server
Level:         Error
Keywords:      Classic
User:          OPSMGR\msaa
Computer:      SQLDB1.opsmgr.net
Description:
Error 777980007, severity 16, state 1 was raised, but no message with that error number was found in sys.messages. If error is larger than 50000, make sure the user-defined message is added using sp_addmessage.

You might also notice some truncated events in the OpsMgr event log, on your RMS or management servers:

Event Type:    Warning
Event Source:    DataAccessLayer
Event Category:    None
Event ID:    33333
Date:        10/23/2010
Time:        5:40:13 PM
User:        N/A
Computer:    OMMS3
Description:
Data Access Layer rejected retry on SqlError:
Request: p_DiscoverySourceUpsert -- (DiscoverySourceId=f0c57af0-927a-335f-1f74-3a3f1f5ca7cd), (DiscoverySourceType=0), (DiscoverySourceObjectId=74fb2fa8-94e5-264d-5f7e-57839f40de0f), (IsSnapshot=True), (TimeGenerated=10/23/2010 10:37:36 PM), (BoundManagedEntityId=3304d59d-5af5-ba80-5ba7-d13a07ed21d4), (IsDiscoveryPackageStale=), (RETURN_VALUE=1)
Class: 16
Number: 18054
Message: Error 777980007, severity 16, state 1 was raised, but no message with that error number was found in sys.messages. If error is larger than 50000, make sure the user-defined message is added using sp_addmessage.

Event Type:    Error
Event Source:    Health Service Modules
Event Category:    None
Event ID:    10801
Date:        10/23/2010
Time:        5:40:13 PM
User:        N/A
Computer:    OMMS3
Description:
Discovery data couldn't be inserted to the database. This could have happened because  of one of the following reasons:

     - Discovery data is stale. The discovery data is generated by an MP recently deleted.
     - Database connectivity problems or database running out of space.
     - Discovery data received is not valid.

The following details should help to further diagnose:

DiscoveryId: 74fb2fa8-94e5-264d-5f7e-57839f40de0f
HealthServiceId: bf43c6a9-8f4b-5d6d-5689-4e29d56fed88
 Error 777980007, severity 16, state 1 was raised, but no message with that error number was found in sys.messages. If error is larger than 50000, make sure the user-defined message is added using sp_addmessage..

 

After a little research – apparently this is caused when following the guide to move the Operations Database to new hardware. 

Marnix blogged about this issue http://thoughtsonopsmgr.blogspot.com/2009/02/moving-scom-database-to-another-server.html which references Matt Goedtel’s article http://blogs.technet.com/b/mgoedtel/archive/2007/08/06/update-to-moving-operationsmanager-database-steps.aspx

 

Because in this process – we simply restore the Operations Database ONLY, we do not carry over some of the modifications to the MASTER database that are performed when you run the Database Installation during setup to create the original operations database.

For some OpsMgr events, which stem from database activity, we get the event data from SQL.  If these messages do not exist in SQL – you see the above issue.

What is bad about this – is that it will keep some event rules from actually alerting us to the condition!  For instance – the rule “Discovery Data Submission Failure” which will alert when there is a failure to insert discovery data – will not trigger now, because it is looking for specific information in parameter 3 of the event, which is part of the missing data:

 

image

 

To resolve this – we need to add back the missing information into the MASTER database. 

  • IF you have moved your OperationsManager database to new hardware

AND:

  • IF you are seeing 18054 events in the application log of the OpsDB SQL instance server.

Then you are impacted.  To resolve this – you should run the attached SQL script against the Master database of the SQL instance that hosts your OperationsManager Database.  You should ONLY consider running this if you are 100% sure that you are impacted by this issue.
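If you want to double check first, you can confirm the messages really are missing from the instance.  A sketch – 777980007 is the error number from the events above, and sys.messages is the catalog view that sp_addmessage populates:

USE [master]
-- Returns no rows if the OpsMgr error messages were never added to this instance
SELECT message_id, text FROM sys.messages WHERE message_id = 777980007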

See attached:  Fix_OpsMgrDB_ErrorMsgs.sql

How to collect performance data for SQL databases (multi-instance objects)


I have had several blog posts in the past discussing how to write rules and monitors against multi-instance objects.  Special care must always be taken when writing workflows against classes where an agent can host more than one instance of the same class type.  Examples would be Logical Disk, SQL DB Engine, SQL Database, etc.

Some of the previous articles:

How do I collect data from a multi-instance object – like a SQL DB instance-

Writing monitors to target Logical or Physical Disks

and most recently:

Collecting SQL Database size as a performance counter

On the last article – I discussed how to collect database size – for all databases, and targeting the SQL (200x) DB Engine as the target class for the collection.  This probably wasn’t the best idea.  This is because, for a database specific counter, we probably want to collect that performance data at the database class level – not the instance.  The reason for this, is to facilitate SQL performance views when scoping to the database objects, and for reporting down the road, when we add specific databases to a report.

 

So – the rest of this post will be an example on how to collect the database size.

 

I want to replicate the way the SQL MP’s work – so I will actually create two rules – one to collect for SQL 2005 database objects, and one to collect for SQL 2008 database objects.  The reason I am doing this – is because if I targeted generically “SQL Database” – the next version of SQL would be included in this parent class, but might use a different object/counter down the road.  So I will stick to known versions and perf counters, and create my rules targeting “SQL Server 2005 DB” and “SQL Server 2008 DB”.

 

The first step in creating a Management Pack – is to open your existing custom SQL workflow MP into the Authoring Console, or create a new empty Management Pack.

I will create a new empty MP and give it the ID of “Microsoft.SQLServer.2008.Monitoring.Addendum” and Display Name of “SQL Server 2008 (Monitoring) Addendum”.  Once you save it – go to File – Management Pack Properties.  We need to version our MP (increment by 1) since we will be changing it, or assign a new version number.

We also need to add a reference here if we don’t already have it – to ensure this MP has a reference for the SQL 2008 Discovery MP.  This will allow us to choose SQL Classes later on when targeting our collection rules.  Click the references tab, and add the SQL 2008 discovery MP if it isn't already present:

image
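For reference – the reference ends up in the MP XML manifest looking roughly like this.  This is a sketch – the alias is my own choice, and the version must match the SQL 2008 discovery MP actually imported in your management group:

<Reference Alias="SQL2008Discovery">
  <ID>Microsoft.SQLServer.2008.Discovery</ID>
  <Version>6.1.314.36</Version>
  <PublicKeyToken>31bf3856ad364e35</PublicKeyToken>
</Reference>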

 

 

Then go to Health Model, Rules, New, Collection, Performance Based, Windows Performance Collection.

image

 

The first step is to give my rule an ID.  This will be the ID of the management pack, plus some additional text.  It defaults to “NewElement” and we need to change that:

I will call mine “Microsoft.SQLServer.2008.Monitoring.Addendum.CollectSQL2008DBSize”

Give the rule a display name that is in accordance with your custom rule naming standard.

Under “Target” – Browse all classes and find the “Microsoft.SQLServer.2008.Database” class. 

Under Category – change to Performance Collection.

When completed – here is how mine appears:

image

 

Click Next.

On this screen – we have the option to type in the performance counter, object, and instance we want to collect.

Great care should be taken here.  This is because the SQL DB Engine is a multi-instance object, and each instance appears differently in Perfmon.  If we don’t choose the correct object here – then we won’t collect the data from all of our instances.  Let me explain.

In a “default instance” of SQL – the perf counter looks like this:

image

In a Named instance – it appears like the following:

image

If we typed in “SQLServer:Databases” we would only collect from the default instances of SQL in the environment.  If we typed in “MSSQL$I01:Databases” we would only collect the data from identically named instances in the environment.  However – we want to collect this from ALL instances.  In that case – we need to use a VARIABLE in the performance counter object – since the actual object names vary in Perfmon.  We can cheat by looking at some other perf collection rules in the SQL MP to see how they handled this…. or we can look in discovered inventory and see if there is a good class property of our chosen class to handle this.

It just so happens that the SQL DB Engine class – has a property called “Performance Counter Object Name” that was created specifically for this purpose!  If you look at this value in discovered inventory, you can see these correspond perfectly with what we need:

image

Sweet!  And if you spot check a few Perf Collection rules in the SQL MP using our same target class, you’d find they also use this.

So – back to the authoring console – we need to use this object, as a variable, for our Perfmon Object.  Here is how:  There is a fly-out on the right – this will show all the class properties based on our target.  In this specific case – our class target is “database”.  The Database is hosted by a SQL DB Engine – so in the flyout – select (Host=SQL DB Engine), and this will expose class properties from the host class.  From here we can choose the “Performance Counter Object Name”

image

 

That will drop the entire variable into the object.  We only need to add the actual perfmon object at the end (:Databases)

image

 

For the counter – that’s simple – just type in the counter exactly as it is by name in perfmon:  Data File(s) Size (KB)

image

Now – for the instance – this is another tricky part.  We don’t want to collect “all instances” when targeting all databases – that could potentially collect a TON of duplicate data depending on the datasource configuration.  It is best to use a variable again here – to match the database perfmon instance to the database name.  This will allow each instance of the rule – targeting each database, to collect performance data only about itself.  Here is an example:

Under “Instance” – again using the flyout on the right – choose a property of the targeted class which matches up in Perfmon.  In this class “Database Name” is perfect!

image

 

For the Interval – we don’t expect this to change often, so once an hour is fine.  (You could even do once or twice a day, but then our hourly reports would not be populated).

Here is our final configuration:

image
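For reference – here is roughly what the finished rule looks like in the MP XML.  This is a sketch – the reference aliases, and the internal property names (PerformanceCounterObjectName and DatabaseName), are assumptions based on how the wizard resolves the flyout selections:

<Rule ID="Microsoft.SQLServer.2008.Monitoring.Addendum.CollectSQL2008DBSize" Enabled="true" Target="SQL2008Discovery!Microsoft.SQLServer.2008.Database">
  <Category>PerformanceCollection</Category>
  <DataSources>
    <DataSource ID="DS" TypeID="Perf!System.Performance.DataProvider">
      <ComputerName>$Target/Host/Host/Property[Type="Windows!Microsoft.Windows.Computer"]/NetworkName$</ComputerName>
      <CounterName>Data File(s) Size (KB)</CounterName>
      <ObjectName>$Target/Host/Property[Type="SQL2008Discovery!Microsoft.SQLServer.2008.DBEngine"]/PerformanceCounterObjectName$:Databases</ObjectName>
      <InstanceName>$Target/Property[Type="SQL2008Discovery!Microsoft.SQLServer.2008.Database"]/DatabaseName$</InstanceName>
      <AllInstances>false</AllInstances>
      <Frequency>3600</Frequency>
    </DataSource>
  </DataSources>
  <WriteActions>
    <WriteAction ID="WriteToDB" TypeID="SC!Microsoft.SystemCenter.CollectPerformanceData" />
    <WriteAction ID="WriteToDW" TypeID="SCDW!Microsoft.SystemCenter.DataWarehouse.PublishPerformanceData" />
  </WriteActions>
</Rule>

The Frequency of 3600 seconds matches the once-per-hour interval chosen above, and the two write actions store the samples in both the Operations database and the warehouse.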

 

Click Next.  For the Optimization tab – this counter might be a good candidate for optimization – not to even collect the data unless there is significant change, but since I chose once per hour – I will not use optimization and get an actual perf record per hour, for each database.  If you wanted to collect this counter more frequently – you might consider optimization.

Done!  Now import this into your management group. 

To test if our new rule is working – go to My Workspace, create a new performance view, and scope it to “Collected by specific rules”.  Choose your rule from the list….

image

 

Once your SQL 2008 Servers have downloaded your new MP, applied the new config, and sent up their first performance data sample (takes up to the frequency of the collection rule), you will see this view populated:

image

 

You can also right click any database object in a state view – and choose “Open > Performance View” and see all the counters available for a given database.

 

image

 

 

Now – you can also run a “Performance Top Objects” report – and create a new one for “My Largest Databases”  (requires daily aggregation – so wait 24 hours for data to show up)

Now – you can repeat this process for the SQL Server 2005 DB objects, to ensure you are collecting DB size for SQL 2005 hosted databases as well.

 

I am attaching my sample MP below:

If you are using SQL 2008 R2 for OpsMgr DB’s, you need SQL 2008R2 CU5


 

There is an issue where your SQL server hosting the OperationsManager database might consume large amounts of CPU for extended periods. 

This is due to a security cache issue when a non-sysadmin creates a heavy workload on a TempDB database.  Read more about this on the support team blog:

http://blogs.technet.com/b/operationsmgr/archive/2011/02/03/fix-sql-server-hits-100-cpu-utilization-when-there-are-configuration-update-requests-in-scom-2007.aspx

 

 

This SQL issue was first fixed in SQL Server 2008 R2 Cumulative Update 5 (CU5):

http://support.microsoft.com/default.aspx?scid=kb;en-US;2438347

System Center Universe is coming – January 19th!


 

REGISTER NOW HERE:  http://www.systemcenteruniverse.com/

image

 

Read Cameron Fuller’s blog post on this here:  http://blogs.catapultsystems.com/cfuller/archive/2015/12/17/scuniverse-returns-to-dallas-tx-and-the-world-on-january-19th-2016/

 

 

SCU is an awesome day of sessions covering Microsoft System Center, Windows Server, and Azure technologies from top speakers including Microsoft experts and MVP’s in the field.

There are two tracks depending on your interests – Cloud and Datacenter Management, and Enterprise Client Management.

The sponsors for 2016 include:

  • Catapult Systems
  • Microsoft
  • Veeam
  • Adaptiva
  • Secunia
  • Heat Software
  • MPx Alliance
  • Squared Up
  • Cireson

If you cannot attend in person – you can still attend via simulcast!  If you want to attend virtually, there are user group based simulcast locations around the world. Registration is available at: http://www.systemcenteruniverse.com/venue.htm

Simulcast event locations include:

  • Austin, TX
  • Denver, CO
  • Houston, TX
  • Omaha, NE
  • Phoenix, AZ
  • San Antonio, TX
  • Seattle, WA
  • Tampa, FL
  • Amsterdam
  • Germany
  • Vienna
  • And of course our event location in Dallas, TX!

If you want to attend the in-person event, it is available in Dallas, Texas, and registration is available at: https://www.eventbrite.com/e/scu-2016-live-tickets-7970023555

UR8 for SCOM 2012 R2 – Step by Step


 

image

 

NOTE:  I get this question every time we release an update rollup:   ALL SCOM Update Rollups are CUMULATIVE.  This means you do not need to apply them in order, you can always just apply the latest update.  If you have deployed SCOM 2012 R2 and never applied an update rollup – you can go straight to the latest one available.  If you applied an older one (such as UR3) you can always go straight to the latest one!

 

 

KB Article for OpsMgr:  https://support.microsoft.com/en-us/kb/3096382

KB Article for all System Center components:  https://support.microsoft.com/en-us/kb/3096378

Download catalog site:  http://catalog.update.microsoft.com/v7/site/Search.aspx?q=3096382

 

Key fixes:

  • Slow load of alert view when it is opened by an operator
    Sometimes when the operators change between alert views, the views take up to two minutes to load. After this update rollup is installed, the reported performance issue is eradicated. The Alert View load for the Operator role is now almost the same as that for the Admin role user.
  • SCOMpercentageCPUTimeCounter.vbs causes enterprise wide performance issue
    Health Service encountered slow performance every five to six (5-6) minutes in a cyclical manner. This update rollup resolves this issue.
  • System Center Operations Manager Event ID 33333 Message: The statement has been terminated.
    This change filters out "statement has been terminated" warnings that SQL Server throws. These warning messages cannot be acted on. Therefore, they are removed.
  • System Center 2012 R2 Operations Manager: Report event 21404 occurs with error '0x80070057' after Update Rollup 3 or Update Rollup 4 is applied.
    In Update Rollup 3, a design change was made in the agent code that regressed and caused SCOM agent to report error ‘0x80070057’ and MonitoringHost.exe to stop responding/crash in some scenarios. This update rollup rolls back that UR3 change.
  • SDK service crashes because of Callback exceptions from event handlers being NULL
    In a connected management group environment in certain race condition scenarios, the SDK of the local management group crashes if there are issues during the connection to the different management groups. After this update rollup is installed, the SDK of the local management group should no longer crash.
  • Run As Account(s) Expiring Soon — Alert does not raise early enough
    The 14-day warning for the RunAs account expiration was not visible in the SCOM console. Customers received only an Error event in the console three days before the account expiration. After this update rollup is installed, customers will receive a warning in their SCOM console 14 days before the RunAs account expiration, and receive an Error event three (3) days before the RunAs account expiration.
  • Network Device Certification
    As part of Network device certification, we have certified the following additional devices in Operations Manager to make extended monitoring available for them:
    • Cisco ASA5515
    • Cisco ASA5525
    • Cisco ASA5545
    • Cisco IPS 4345
    • Cisco Nexus 3172PQ
    • Cisco ASA5515-IPS
    • Cisco ASA5545-IPS
    • F5 Networks BIG-IP 2000
    • Dell S4048
    • Dell S3048
    • Cisco ASA5515sc
    • Cisco ASA5545sc
  • French translation of APM abbreviation is misleading
    The French translation of “System Center Management APM service” is misleading. APM abbreviation is translated incorrectly in the French version of Microsoft System Center 2012 R2 Operations Manager. APM means “Application Performance Monitoring” but is translated as “Advanced Power Management." This fix corrects the translation.
  • p_HealthServiceRouteForTaskByManagedEntityId does not account for deleted resource pool members in System Center 2012 R2 Operations Manager
    If customers use Resource Pools and take some servers out of the pool, discovery tasks start failing in some scenarios. After this update rollup is installed, these issues are resolved.
  • Exception in the 'Managed Computer' view when you select Properties of a managed server in Operations Manager Console
    In the Operations Manager Server “Managed Computer” view on the Administrator tab, clicking the “Properties” button of a management server causes an error. After this update rollup is installed, a dialog box that contains a “Heart Beat” tab is displayed.
  • Duplicate entries for devices when network discovery runs
    When customers run discovery tasks to discover network devices, duplicate network devices that have alternative MAC addresses are discovered in some scenarios. After this update rollup is installed, customers will not receive any duplicate devices discovered in their environments.
  • Preferred Partner Program in Administration Pane
    This update lets customers view certified System Center Operations Manager partner solutions directly from the console. Customers can obtain an overview of the partner solutions and visit the partner websites to download and install the solutions.
There are no updates for Linux, and there are no updated MP’s for Linux in this update.

 

Let’s get started.

From reading the KB article – the order of operations is:

  1. Install the update rollup package on the following server infrastructure:
    • Management servers
    • Gateway servers
    • Web console server role computers
    • Operations console role computers
  2. Apply SQL scripts.
  3. Manually import the management packs.
  4. Update Agents

Now, NORMALLY we need to add another step – if we are using Xplat monitoring – we need to update the Linux/Unix MP’s and agents.   However, in UR8 for SCOM 2012 R2, there are no updates for Linux.

 

 

 

1.  Management Servers

image

Since there is no RMS anymore, it doesn’t matter which management server I start with.  There is no need to begin with whoever holds the RMSe role.  I simply make sure I only patch one management server at a time to allow for agent failover without overloading any single management server.

I can apply this update manually via the MSP files, or I can use Windows Update.  I have 3 management servers, so I will demonstrate both.  I will do the first management server manually.  This management server holds 3 roles, and each must be patched:  Management Server, Web Console, and Console.

The first thing I do when I download the updates from the catalog, is copy the cab files for my language to a single location:

image

Then extract the contents:

image

Once I have the MSP files, I am ready to start applying the update to each server by role.

***Note:  You MUST log on to each server role as a Local Administrator, SCOM Admin, AND your account must also have System Administrator (SA) role to the database instances that host your OpsMgr databases.

My first server is a management server, and the web console, and has the OpsMgr console installed, so I copy those update files locally, and execute them per the KB, from an elevated command prompt:

image
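In other words, something like the following – the .msp file name here is an example, so use the server update file extracted for your architecture and language:

msiexec.exe /update KB3096382-AMD64-Server.msp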

This launches a quick UI which applies the update.  It will bounce the SCOM services as well.  The update usually does not provide any feedback that it had success or failure. 

I got a prompt to restart:

image

I choose yes and allow the server to restart to complete the update.

 

You can check the application log for the MsiInstaller events to show completion:

Log Name:      Application
Source:        MsiInstaller
Event ID:      1036
Level:         Information
Computer:      SCOM01.opsmgr.net
Description:
Windows Installer installed an update. Product Name: System Center Operations Manager 2012 Server. Product Version: 7.1.10226.0. Product Language: 1033. Manufacturer: Microsoft Corporation. Update Name: System Center 2012 R2 Operations Manager UR8 Update Patch. Installation success or error status: 0.

You can also spot check a couple DLL files for the file version attribute. 

image
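A quick way to do that spot check with PowerShell – a sketch, where the path and DLL name are just examples (check the file list in the KB article for the versions to expect):

(Get-Item 'C:\Program Files\Microsoft System Center 2012 R2\Operations Manager\Server\MOMModules.dll').VersionInfo.FileVersion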

Next up – run the Web Console update:

image

This runs much faster.   A quick file spot check:

image

Lastly – install the console update (make sure your console is closed):

image

A quick file spot check:

image

 

 

Secondary Management Servers:

image

I now move on to my secondary management servers, applying the server update, then the console update. 

On this next management server, I will use the example of Windows Update as opposed to manually installing the MSP files.  I check online, and make sure that I have configured Windows Update to give me updates for additional products: 

Apparently when I tried this – the catalog was broken – because none of the system center stuff was showing up in Windows Updates.

So….. because of this – I elect to do manual updates like I did above.

I apply these updates, and reboot each management server, until all management servers are updated.

 

 

 

Updating Gateways:

image

I can use Windows Update or manual installation.

image

The update launches a UI and quickly finishes.

Then I will spot check the DLL’s:

image

I can also spot-check the \AgentManagement folder, and make sure my agent update files are dropped here correctly:

image

 

 

 

2. Apply the SQL Scripts

In the path on your management servers, where you installed/extracted the update, there are two SQL script files: 

%SystemDrive%\Program Files\Microsoft System Center 2012 R2\Operations Manager\Server\SQL Script for Update Rollups

(note – your path may vary slightly depending on whether you have an upgraded environment or a clean install)

image

First – let’s run the script to update the OperationsManager database.  Open a SQL management studio query window, connect it to your Operations Manager database, and then open the script file.  Make sure it is pointing to your OperationsManager database, then execute the script.

You should run this script with each UR, even if you ran it on a previous UR.  The script body can change, so as a best practice, always re-run it.

image

Click the “Execute” button in SQL mgmt. studio.  The execution could take a considerable amount of time and you might see a spike in processor utilization on your SQL database server during this operation.  I have had customers state this takes from a few minutes to as long as an hour. In MOST cases – you will need to shut down the SDK, Config, and Monitoring Agent (healthservice) on ALL your management servers in order for this to be able to run with success.

You will see the following (or similar) output:

image

or

image

IF YOU GET AN ERROR – STOP!  Do not continue.  Try re-running the script several times until it completes without errors.  In a production environment, you almost certainly have to shut down the services (sdk, config, and healthservice) on your management servers, to break their connection to the databases, to get a successful run.
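A quick way to do that across the management servers – OMSDK, cshost, and HealthService are the internal service names for the SDK, Config, and Monitoring Agent services:

# Run on each management server before executing the SQL script
Stop-Service -Name OMSDK, cshost, HealthService
# After the script completes, start them back up
Start-Service -Name OMSDK, cshost, HealthService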

Technical tidbit:   Even if you previously ran this script in UR1, UR2, UR3, UR4, UR5, UR6, or UR7, you should run this again for UR8, as the script body can change with updated UR’s.

image

Next, we have a script to run against the warehouse DB.  Do not skip this step under any circumstances.  The script is located in:

%SystemDrive%\Program Files\Microsoft System Center 2012 R2\Operations Manager\Server\SQL Script for Update Rollups

(note – your path may vary slightly depending on whether you have an upgraded environment or a clean install)

Open a SQL management studio query window, connect it to your OperationsManagerDW database, and then open the script file UR_Datawarehouse.sql.  Make sure it is pointing to your OperationsManagerDW database, then execute the script.

If you see a warning about line endings, choose Yes to continue.

image

Click the “Execute” button in SQL mgmt. studio.  The execution could take a considerable amount of time and you might see a spike in processor utilization on your SQL database server during this operation.

You will see the following (or similar) output:

image

 

 

 

3. Manually import the management packs

image

There are 26 management packs in this update!

The path for these is on your management server, after you have installed the “Server” update:

\Program Files\Microsoft System Center 2012 R2\Operations Manager\Server\Management Packs for Update Rollups

However, the majority of them are Advisor/OMS, and language specific.  Only import the ones you need, and that are correct for your language.  I will remove all the Advisor MP’s for other languages, and I am left with the following:

image

The TFS MP bundles are only used for specific scenarios, such as DevOps scenarios where you have integrated APM with TFS, etc.  If you are not currently using these MP’s, there is no need to import or update them.  I’d skip this MP import unless you already have these MP’s present in your environment.

The Advisor MP’s are only needed if you are using Microsoft Operations Management Suite cloud service, (Previously known as Advisor, and Operation Insights).

However, the Image and Visualization libraries deal with Dashboard updates, and these always need to be updated.

I import all of these shown without issue.

 

 

4.  Update Agents

image

Agents should be placed into pending actions by this update (mine worked great) for any agent that was not manually installed (remotely manageable = yes).   On the management servers where I used Windows Update to patch them, their agents did not show up in this list.  Only agents where I manually patched their management server showed up in this list.  FYI.

image

If your agents are not placed into pending management – this is generally caused by not running the update from an elevated command prompt, or having manually installed agents which will not be placed into pending.

In this case – my agents that were reporting to a management server that was updated using Windows Update – did NOT place agents into pending.  Only the agents reporting to the management server for which I manually executed the patch worked.

You can approve these – which will result in a success message once complete:

image
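You can also approve these in bulk from the Operations Manager Shell.  A sketch – I am assuming the Get-SCOMPendingManagement and Approve-SCOMPendingManagement cmdlets and the PatchAgent action type here; verify these in your shell first:

# Approve every agent that is pending an agent update
Get-SCOMPendingManagement | Where-Object { $_.AgentPendingActionType -eq 'PatchAgent' } | Approve-SCOMPendingManagement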

Soon you should start to see PatchList getting filled in from the Agents By Version view under Operations Manager monitoring folder in the console:

image

 

 

 

5.  Update Unix/Linux MPs and Agents

image

There are no updates for Linux in UR8.  Please see the instructions for UR7 if you are not updating from UR7 directly:

http://blogs.technet.com/b/kevinholman/archive/2015/08/17/ur7-for-scom-2012-r2-step-by-step.aspx

 

 

6.  Update the remaining deployed consoles

image

This is an important step.  I have consoles deployed around my infrastructure – on my Orchestrator server, SCVMM server, on my personal workstation, on all the other SCOM admins on my team, on a Terminal Server we use as a tools machine, etc.  These should all get the matching update version.

 

 

 

Review:

Now at this point, we would check the OpsMgr event logs on our management servers, check for any new or strange alerts coming in, and ensure that there are no issues after the update.

image

Known issues:

See the existing list of known issues documented in the KB article.

1.  Many people are reporting that the SQL script is failing to complete when executed.  You should attempt to run this multiple times until it completes without error.  You might need to stop the Exchange correlation engine, stop all the SCOM services on the management servers, and/or bounce the SQL server services in order to get a successful completion in a busy management group.  The errors reported appear as below:

——————————————————
(1 row(s) affected)
(1 row(s) affected)
Msg 1205, Level 13, State 56, Line 1
Transaction (Process ID 152) was deadlocked on lock resources with another process and has been chosen as the deadlock victim. Rerun the transaction.
Msg 3727, Level 16, State 0, Line 1
Could not drop constraint. See previous errors.
——————————————————–

Writing a service recovery script – Cluster service example


 

I had a customer request the ability to monitor the cluster service on clusters, and ONLY alert when a recovery attempt failed.

This is a fairly standard request for service monitoring when we use recoveries – we generally don’t want an alert to be generated from the Service Monitor, because that will be immediate upon service down detection.  We want the service monitor to detect the service down, then run a recovery, and then if the recovery fails to restore service, generate an alert.

Here is an example of that.

The cluster service monitor is unique, in that it already has a built in recovery.  However, it is too simple for our needs, as it only runs NET START.

image

 

So the first thing we will need to do, is create an override disabling this built in recovery:

image

 

Next – override the “Cluster service status” monitor to not generate alerts:

image

 

Now we can add our own script-based recovery to the monitor:

image

 

image

 

And paste in a script which I will provide below.  Here is the script:

'==========================================================================
'
' COMMENT: This is a recovery script to recover the Cluster Service
'
'==========================================================================
Option Explicit
SetLocale("en-us")

Dim StartTime,EndTime,sTime

'Capture script start time
StartTime = Now 'Time that the script starts so that we can see how long it has been watching to see if the service stops again.
Dim strTime
strTime = Time

Dim oAPI
Set oAPI = CreateObject("MOM.ScriptAPI")
Call oAPI.LogScriptEvent("ClusterServiceRecovery.vbs",3750,0,"Service Recovery script is starting")

Dim strComputer, strService, strStartMode, strState, objCount, strClusterService

'The script will always be run on the machine that generated the monitor error
strComputer = "."
strClusterService = "ClusSvc"

'Record the current state of each service before recovery in an event
Dim strClusterServicestate
ServiceState(strClusterService)
strClusterServicestate = strState
Call oAPI.LogScriptEvent("ClusterServiceRecovery.vbs",3751,0,"Current service state before recovery is: " & strClusterService & " : " & strClusterServicestate)

'Stop script if all services are running
If (strClusterServicestate = "Running") Then
  Call oAPI.LogScriptEvent("ClusterServiceRecovery.vbs",3752,2,"All services were found to be already running, recovery should not run, ending script")
  Wscript.Quit
End If

'Check to see if a specific event has been logged previously that means this recovery script should NOT run if event is present
'This section optional and not commonly used
Dim dtmStartDate, iCount, colEvents, objWMIService, objEvent
' Const CONVERT_TO_LOCAL_TIME = True
' Set dtmStartDate = CreateObject("WbemScripting.SWbemDateTime")
' dtmStartDate.SetVarDate dateadd("n", -60, now)' CONVERT_TO_LOCAL_TIME
'
' iCount = 0
' Set objWMIService = GetObject("winmgmts:" _
'     & "{impersonationLevel=impersonate,(Security)}!\\" _
'     & strComputer & "\root\cimv2")
' Set colEvents = objWMIService.ExecQuery _
'     ("Select * from Win32_NTLogEvent Where Logfile = 'Application' and " _
'     & "TimeWritten > '" & dtmStartDate & "' and EventCode = 100")
' For Each objEvent In colEvents
'   iCount = iCount+1
' Next
' If iCount => 1 Then
'   EndTime = Now
'   sTime = DateDiff("s", StartTime, EndTime)
'   Call oAPI.LogScriptEvent("ClusterServiceRecovery.vbs",3761,2,"script found event which blocks execution of this recovery. Recovery will not run. Script ending after " & sTime & " seconds")
'   WScript.Quit
' ElseIf iCount < 1 Then
'   Call oAPI.LogScriptEvent("ClusterServiceRecovery.vbs",3762,0,"script did not find any blocking events. Script will continue")
' End If

'At least one service is stopped to cause this recovery, stopping all three services so we can start them in order
'You would only use this section if you had multiple services and they needed to be started in a specific order
' Call oAPI.LogScriptEvent("ServiceRecovery.vbs",3753,0,"At least one service was found not running. Recovery will run. Attempting to stop all services now")
' ServiceStop(strService1)
' ServiceStop(strService2)
' ServiceStop(strService3)

'Check to make sure all services are actually in stopped state
' Optional Wait 15 seconds for slow services to stop
' Wscript.Sleep 15000
ServiceState(strClusterService)
strClusterServicestate = strState

'Stop script if all services are not stopped
If (strClusterServicestate <> "Stopped") Then
  Call oAPI.LogScriptEvent("ClusterServiceRecovery.vbs",3754,2,"Recovery script found service is not in stopped state. Manual intervention is required, ending script. Current service state is: " & strClusterService & " : " & strClusterServicestate)
  Wscript.Quit
Else
  Call oAPI.LogScriptEvent("ClusterServiceRecovery.vbs",3755,0,"Recovery script verified all services in stopped state. Continuing.")
End If

'Start services in order.
Call oAPI.LogScriptEvent("ClusterServiceRecovery.vbs",3756,0,"Attempting to start all services")
Dim errReturn

'Restart Services and watch to see if the command executed without error
ServiceStart(strClusterService)
Wscript.sleep 5000

'Check service state to ensure all services started
ServiceState(strClusterService)
strClusterServicestate = strState

'Log success or fail of recovery
If (strClusterServicestate = "Running") Then
  Call oAPI.LogScriptEvent("ClusterServiceRecovery.vbs",3757,0,"All services were successfully started and then found to be running")
Else
  Call oAPI.LogScriptEvent("ClusterServiceRecovery.vbs",3758,2,"Recovery script failed to start all services. Manual intervention is required. Current service state is: " & strClusterService & " : " & strClusterServicestate)
End If

'Check to see if this recovery script has been run three times in the last 60 minutes for loop detection
Set dtmStartDate = CreateObject("WbemScripting.SWbemDateTime")
dtmStartDate.SetVarDate dateadd("n", -60, now)' CONVERT_TO_LOCAL_TIME
iCount = 0
Set objWMIService = GetObject("winmgmts:" _
    & "{impersonationLevel=impersonate,(Security)}!\\" _
    & strComputer & "\root\cimv2")
Set colEvents = objWMIService.ExecQuery _
    ("Select * from Win32_NTLogEvent Where Logfile = 'Operations Manager' and " _
    & "TimeWritten > '" & dtmStartDate & "' and EventCode = 3750")
For Each objEvent In colEvents
  iCount = iCount+1
Next
If iCount => 3 Then
  EndTime = Now
  sTime = DateDiff("s", StartTime, EndTime)
  Call oAPI.LogScriptEvent("ClusterServiceRecovery.vbs",3759,2,"script restarted " & strClusterService & " service 3 or more times in the last hour, script ending after " & sTime & " seconds")
  WScript.Quit
ElseIf iCount < 3 Then
  EndTime = Now
  sTime = DateDiff("s", StartTime, EndTime)
  Call oAPI.LogScriptEvent("ClusterServiceRecovery.vbs",3760,0,"script restarted " & strClusterService & " service less than 3 times in the last hour, script ending after " & sTime & " seconds")
End If
Wscript.Quit

'==================================================================================
' Subroutine: ServiceState
' Purpose:    Gets the service state and startmode from WMI
'==================================================================================
Sub ServiceState(strService)
  Dim objWMIService, colRunningServices, objService
  Set objWMIService = GetObject("winmgmts:" _
      & "{impersonationLevel=impersonate}!\\" & strComputer & "\root\cimv2")
  Set colRunningServices = objWMIService.ExecQuery _
      ("Select * from Win32_Service where Name = '"& strService & "'")
  For Each objService in colRunningServices
    strState = objService.State
    strStartMode = objService.StartMode
  Next
End Sub

'==================================================================================
' Subroutine: ServiceStart
' Purpose:    Starts a service
'==================================================================================
Sub ServiceStart(strService)
  Dim objWMIService, colRunningServices, objService, colServiceList
  Set objWMIService = GetObject("winmgmts:" _
      & "{impersonationLevel=impersonate}!\\" & strComputer & "\root\cimv2")
  Set colServiceList = objWMIService.ExecQuery _
      ("Select * from Win32_Service where Name='"& strService & "'")
  For Each objService in colServiceList
    errReturn = objService.StartService()
  Next
End Sub

'==================================================================================
' Subroutine: ServiceStop
' Purpose:    Stops a service
'==================================================================================
Sub ServiceStop(strService)
  Dim objWMIService, colRunningServices, objService, colServiceList
  Set objWMIService = GetObject("winmgmts:" _
      & "{impersonationLevel=impersonate}!\\" & strComputer & "\root\cimv2")
  Set colServiceList = objWMIService.ExecQuery _
      ("Select * from Win32_Service where Name='"& strService & "'")
  For Each objService in colServiceList
    errReturn = objService.StopService()
  Next
End Sub

 

Here it is inserted into the UI.  I provide a 3 minute timeout for this one:

 

image

 

Here is how it will look once added:

image

 

Now – we need to generate an alert when the script detects that it failed to start the service:

image

 

Provide a name and we will target the same class as the service monitor:

image

 

For the expression – the ID comes from the event generated by the recovery script, and the string search makes sure we are only alerting on a Cluster service recovery; if we reuse the script for other services, we need to be able to distinguish between them:

image
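Under the covers, the expression portion of the rule looks roughly like this.  A sketch – event 3758 is the “failed to start” event the recovery script logs, and the substring match on the service name separates reused copies of the script:

<Expression>
  <And>
    <Expression>
      <SimpleExpression>
        <ValueExpression>
          <XPathQuery Type="UnsignedInteger">EventDisplayNumber</XPathQuery>
        </ValueExpression>
        <Operator>Equal</Operator>
        <ValueExpression>
          <Value Type="UnsignedInteger">3758</Value>
        </ValueExpression>
      </SimpleExpression>
    </Expression>
    <Expression>
      <RegExExpression>
        <ValueExpression>
          <XPathQuery Type="String">EventDescription</XPathQuery>
        </ValueExpression>
        <Operator>ContainsSubstring</Operator>
        <Pattern>ClusSvc</Pattern>
      </RegExExpression>
    </Expression>
  </And>
</Expression>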

 

 

Let’s test!

If we just simply stop the Cluster Service – the recovery kicks in, and we see evidence in the state changes and event log:

 

image

 

I like REALLY verbose logging in the scripts I write…. more is MUCH better than less, especially when troubleshooting, and recoveries should not be running often enough to clog up the logs.

image

image

image

image

 

image

image

 

 

If the recovery fails to start the service – the script detects this – drops a very specific event, and then an alert is generated for the service being down and manual intervention required:

 

image

 

image

 

 

There we have it – we only get alerts if the service is not recoverable.  This makes SCOM more actionable.  If we want a record of this for reporting – we can collect the events for recovery starting, and then report on those events.

You can download this example MP at:

https://gallery.technet.microsoft.com/Cluster-Service-Recovery-270ca2cd

UR9 for SCOM 2012 R2 – Step by Step


 

 

 

image

 

NOTE:  I get this question every time we release an update rollup:   ALL SCOM Update Rollups are CUMULATIVE.  This means you do not need to apply them in order, you can always just apply the latest update.  If you have deployed SCOM 2012 R2 and never applied an update rollup – you can go straight to the latest one available.  If you applied an older one (such as UR3) you can always go straight to the latest one!

 

 

KB Article for OpsMgr:  https://support.microsoft.com/en-us/kb/3129774

Download catalog site:  http://catalog.update.microsoft.com/v7/site/Search.aspx?q=3129774

 

Key fixes:

  • SharePoint workflows fail with an access violation under APM
    A certain sequence of events may trigger an access violation in APM code when it tries to read data from the cache during the Application Domain unload. This fix resolves this kind of behavior.
  • Application Pool worker process crashes under APM with heap corruption
    During the Application Domain unload two threads might try to dispose of the same memory block leading to DOUBLE FREE heap corruption. This fix makes sure that memory is disposed of only one time.
  • Some Application Pool worker processes become unresponsive if many applications are started under APM at the same time
    Microsoft Monitoring Agent APM service has a critical section around WMI queries it performs. If a WMI query takes a long time to complete, many worker processes are waiting for the active one to complete the call. Those application pools may become unresponsive, depending on the wait duration. This fix eliminates the need in WMI query and significantly improves the performance of this code path.
  • MOMAgent cannot validate RunAs Account if only RODC is available
    If there's a read-only domain controller (RODC), the MOMAgent cannot validate the RunAs account. This fix resolves this issue.
  • Missing event monitor does not warn within the specified time range in SCOM 2012 R2 the first time after restart
    When you create a monitor for a missed event, the first alert takes twice the amount of time specified in the monitor. This fix resolves the issue, and the alert is generated in the time specified.
  • SCOM cannot verify the User Account / Password expiration date if it is set by using Password Setting object
    Fine grained password policies are stored in a different container from the user object container in Active Directory. This fix resolves the problems in computing resultant set of policy (RSOP) from these containers for a user object.
  • SLO Detail report displays histogram incorrectly
    In some specific scenarios, the representation of the downtime graph is not displayed correctly. This fix resolves this kind of behavior.
  • APM support for IIS 10 and Windows Server 2016
    Support of IIS 10 on Windows Server 2016 is added for the APM feature in System Center 2012 R2 Operations Manager. An additional management pack Microsoft.SystemCenter.Apm.Web.IIS10.mp is required to enable this functionality. This management pack is located in %SystemDrive%\Program Files\System Center 2012 R2\Operations Manager\Server\Management Packs for Update Rollups alongside its dependencies after the installation of Update Rollup 9.
    Important Note One dependency is not included in Update Rollup 9 and should be downloaded separately:

    Microsoft.Windows.InternetInformationServices.2016.mp

  • APM Agent Modules workflow fail during workflow shutdown with Null Reference Exception
    The Dispose() method of Retry Manager of APM connection workflow is executed two times during the module shutdown. The second try to execute this Dispose() method may cause a Null Reference Exception. This fix makes sure that the Dispose() method can be safely executed one or more times.
  • AEM Data fills up SCOM Operational database and is never groomed out
    If you use SCOM’s Agentless Exception Monitoring to examine application crash data and report on it, the data never grooms out of the SCOM Operational database. The problem with this is that soon the SCOM environment will be overloaded with all the instances and relationships of the applications, error groups, and Windows-based computers, all of which are hosted by the management servers. This fix resolves this issue. Additionally, the following management packs must be imported in the following order:
    • Microsoft.SystemCenter.ClientMonitoring.Library.mp
    • Microsoft.SystemCenter.DataWarehouse.Report.Library.mp
    • Microsoft.SystemCenter.ClientMonitoring.Views.Internal.mp
    • Microsoft.SystemCenter.ClientMonitoring.Internal.mp
  • The DownTime report from the Availability report does not handle the Business Hours settings
    In the downtime report, the downtime table was not considering the business hours. This fix resolves this issue and business hours will be shown based on the specified business hour values.
    The updated RDL files are located in the following location:

    %SystemDrive%\Program Files\Microsoft System Center 2012 R2\Operations Manager\Server\Reporting

    To update the RDL file, follow these steps:

    1. Go to http://MachineName/Reports_INSTANCE1/Pages/Folder.aspx, where MachineName is the name of your Reporting Server.
    2. On this page, go to the folder to which you want to add the RDL file. In this case, click Microsoft.SystemCenter.DataWarehouse.Report.Library.
    3. Upload the new RDL files by clicking the upload button at the top. For more information, see https://msdn.microsoft.com/en-us/library/ms157332.aspx.
  • Adding a decimal sign in an SLT Collection Rule SLO in the ENU Console on a non-ENU OS does not work
    You run the System Center 2012 R2 Operations Manager Console in English on a computer that has the language settings configured to use a non-English (United States) language that uses a comma (,) as the decimal sign instead of a period (.). When you try to create Service Level Tracking, and you want to add a Collection Rule SLO, the value you enter as the threshold cannot be configured by using a decimal sign. This fix resolves the issue.
  • SCOM Agent issue while logging Operations Management Suite (OMS) communication failure
    An issue occurs when OMS communication failures are logged. This fix resolves this issue.

 

There are no updates for Linux, and there are no updated MP’s for Linux in this update at this time.  The most current Linux MP’s are available below in the Linux section.

 

Let’s get started.

From reading the KB article – the order of operations is:

  1. Install the update rollup package on the following server infrastructure:
    • Management servers
    • Gateway servers
    • Web console server role computers
    • Operations console role computers
  2. Apply SQL scripts.
  3. Manually import the management packs.
  4. Update Agents

Now, NORMALLY we would need to add another step – if we are using Xplat monitoring, we would also need to update the Linux/UNIX MP’s and agents.  However, in UR8 and UR9 for SCOM 2012 R2, there are no updates for Linux.

 

 

 

1.  Management Servers

image

Since there is no RMS anymore, it doesn’t matter which management server I start with.  There is no need to begin with whichever server holds the RMSe role.  I simply make sure I only patch one management server at a time, to allow for agent failover without overloading any single management server.

I can apply this update manually via the MSP files, or I can use Windows Update.  I have 3 management servers, so I will demonstrate both.  I will do the first management server manually.  This management server holds 3 roles, and each must be patched:  Management Server, Web Console, and Console.

The first thing I do when I download the updates from the catalog is copy the cab files for my language to a single location:

Then extract the contents:

image

Once I have the MSP files, I am ready to start applying the update to each server by role.

***Note:  You MUST log on to each server role as a Local Administrator and SCOM Admin, AND your account must also have the System Administrator (SA) role on the database instances that host your OpsMgr databases.

My first server is a management server, and the web console, and has the OpsMgr console installed, so I copy those update files locally, and execute them per the KB, from an elevated command prompt:

image

This launches a quick UI which applies the update.  It will bounce the SCOM services as well.  The update usually does not provide any feedback about success or failure. 
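If you prefer to script the installs rather than double-click, the MSP files can also be applied with msiexec from that same elevated prompt.  A minimal sketch – the KB number and file names below are illustrative of the UR package naming, so match them to the files you actually extracted:

msiexec.exe /update KB3129774-AMD64-Server.msp
msiexec.exe /update KB3129774-AMD64-ENU-Web-Console.msp
msiexec.exe /update KB3129774-AMD64-ENU-Console.msp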

I got a prompt to restart:

image

I choose yes and allow the server to restart to complete the update.

 

You can check the application log for the MsiInstaller events to show completion:

Log Name:      Application
Source:        MsiInstaller
Date:          1/27/2016 9:37:28 AM
Event ID:      1036
Description:
Windows Installer installed an update. Product Name: System Center Operations Manager 2012 Server. Product Version: 7.1.10226.0. Product Language: 1033. Manufacturer: Microsoft Corporation. Update Name: System Center 2012 R2 Operations Manager UR9 Update Patch. Installation success or error status: 0.
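If you would rather not scroll the Application log by hand, a quick PowerShell check for those MsiInstaller events works too (standard cmdlet, nothing SCOM-specific):

Get-EventLog -LogName Application -Source MsiInstaller -Newest 50 |
    Where-Object { $_.Message -like "*Operations Manager*" } |
    Select-Object TimeGenerated, EventID, Message | Format-List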

You can also spot check a couple DLL files for the file version attribute. 

image
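The spot check can also be scripted.  A sketch, assuming the default install path – adjust for your environment:

Get-ChildItem "C:\Program Files\Microsoft System Center 2012 R2\Operations Manager\Server\*.dll" |
    Sort-Object LastWriteTime -Descending |
    Select-Object -First 10 Name, @{N="FileVersion";E={$_.VersionInfo.FileVersion}}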

Next up – run the Web Console update:

image

This runs much faster.   A quick file spot check:

image

Lastly – install the console update (make sure your console is closed):

image

A quick file spot check:

image

 

 

Additional Management Servers:

image

I now move on to my additional management servers, applying the server update, then the console update and web console update where applicable.

On this next management server, I will use the example of Windows Update as opposed to manually installing the MSP files.  I check online, and make sure that I have configured Windows Update to give me updates for additional products: 

image

The applicable updates show up under optional – so I tick the boxes and apply these updates.

After a reboot – go back and verify the update was a success by spot checking some file versions like we did above.

 

 

Updating Gateways:

image

I can use Windows Update or manual installation.

image

The update launches a UI and quickly finishes.

Then I will spot check the DLL’s:

image

I can also spot-check the \AgentManagement folder, and make sure my agent update files are dropped here correctly:

image

 

***NOTE:  You can delete any older UR update files from the \AgentManagement directories.  The UR’s do not clean these up, and they serve no purpose once a newer UR is in place.

 

 

 

2. Apply the SQL Scripts

In the path on your management servers, where you installed/extracted the update, there are two SQL script files: 

%SystemDrive%\Program Files\Microsoft System Center 2012 R2\Operations Manager\Server\SQL Script for Update Rollups

(note – your path may vary slightly depending on whether you have an upgraded environment or a clean install)

image

First – let’s run the script to update the OperationsManager database.  Open a SQL management studio query window, connect it to your Operations Manager database, and then open the script file.  Make sure it is pointing to your OperationsManager database, then execute the script.

You should run this script with each UR, even if you ran this on a previous UR.  The script body can change so as a best practice always re-run this.

image

Click the “Execute” button in SQL mgmt. studio.  The execution could take a considerable amount of time, and you might see a spike in processor utilization on your SQL database server during this operation.  I have had customers state this takes anywhere from a few minutes to as long as an hour.  In MOST cases, you will need to shut down the SDK, Config, and Monitoring Agent (healthservice) services on ALL your management servers in order for this to run successfully.

You will see the following (or similar) output:

image

or

image

IF YOU GET AN ERROR – STOP!  Do not continue.  Try re-running the script several times until it completes without errors.  In a production environment, you will almost certainly have to shut down the services (SDK, Config, and healthservice) on your management servers, to break their connection to the databases, in order to get a successful run.

Technical tidbit:   Even if you previously ran this script in UR1, UR2, UR3, UR4, UR5, UR6, UR7, or UR8, you should run this again for UR9, as the script body can change with updated UR’s.

image

Next, we have a script to run against the warehouse DB.  Do not skip this step under any circumstances.    From:

%SystemDrive%\Program Files\Microsoft System Center 2012 R2\Operations Manager\Server\SQL Script for Update Rollups

(note – your path may vary slightly depending on whether you have an upgraded environment or a clean install)

Open a SQL management studio query window, connect it to your OperationsManagerDW database, and then open the script file UR_Datawarehouse.sql.  Make sure it is pointing to your OperationsManagerDW database, then execute the script.

If you see a warning about line endings, choose Yes to continue.

image

Click the “Execute” button in SQL mgmt. studio.  The execution could take a considerable amount of time and you might see a spike in processor utilization on your SQL database server during this operation.

You will see the following (or similar) output:

image
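Both scripts can also be run non-interactively with sqlcmd, if you prefer that over Management Studio.  A sketch – substitute your own server\instance, and note the OperationsManager script file name (update_rollup_mom_db.sql here) can change between URs:

sqlcmd -E -S SQLSERVER\INSTANCE -d OperationsManager -i "update_rollup_mom_db.sql"
sqlcmd -E -S SQLSERVER\INSTANCE -d OperationsManagerDW -i "UR_Datawarehouse.sql"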

 

 

 

3. Manually import the management packs

image

There are 55 management packs in this update!   Most of these we don’t need – so read carefully.

The path for these is on your management server, after you have installed the “Server” update:

\Program Files\Microsoft System Center 2012 R2\Operations Manager\Server\Management Packs for Update Rollups

However, the majority of them are Advisor/OMS, and language specific.  Only import the ones you need, and that are correct for your language.  I will remove all the MP’s for other languages (keeping only ENU), and I am left with the following:

image

 

What NOT to import:

The Advisor MP’s are only needed if you are using the Microsoft Operations Management Suite cloud service (previously known as Advisor, and Operational Insights).

The APM MP’s are only needed if you are using the APM feature in SCOM.

Note the APM MP with a red X.  This MP requires the IIS MP’s for Windows Server 2016 which are in Technical Preview at the time of this writing.  Only import this if you are using APM *and* you need to monitor Windows Server 2016.  If so, you will need to download and install the technical preview editions of that MP from https://www.microsoft.com/en-us/download/details.aspx?id=48256

The TFS MP bundle is only used for specific scenarios, such as DevOps scenarios where you have integrated APM with TFS, etc.  If you are not currently using these MP’s, there is no need to import or update them.  I’d skip this MP import unless you already have these MP’s present in your environment.

However, the Image and Visualization libraries deal with Dashboard updates, and these always need to be updated.

I import all of these shown without issue.
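If you prefer PowerShell for the import, prune the folder down to just the MP’s you want, copy them to a working folder, and import in bulk.  A sketch – the folder path is hypothetical:

$dir = "C:\Temp\UR9MPs"    # working folder holding ONLY the MP's you intend to import
Get-ChildItem $dir -Filter *.mp | ForEach-Object { Import-SCOMManagementPack -Fullname $_.FullName }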

 

 

4.  Update Agents

image

Agents should be placed into pending actions by this update for any agent that was not manually installed (remotely manageable = yes):  

 

On the management servers where I used Windows Update to patch, the agents did not show up in this list.  Only agents whose management server I patched manually showed up.  FYI – the experience is NOT the same when using Windows Update vs. manual installation.  If yours don’t show up, you can try running the update for that management server again – manually.

image

If your agents are not placed into pending management – this is generally caused by not running the update from an elevated command prompt, or having manually installed agents which will not be placed into pending.

In this case, the agents reporting to a management server that was updated using Windows Update did NOT get placed into pending.  Only the agents reporting to the management server where I manually executed the patch did.

I re-ran the server MSP file manually on these management servers, from an elevated command prompt, and they all showed up:

image

 

 

You can approve these – which will result in a success message once complete:

image
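The approval can also be done in bulk from PowerShell.  A sketch – the pending action type value here is from memory, so inspect what Get-SCOMPendingManagement returns in your environment before piping to the approval:

Get-SCOMPendingManagement | Where-Object { $_.AgentPendingActionType -eq "PatchAgent" } | Approve-SCOMPendingManagement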

Soon you should start to see PatchList getting filled in from the Agents By Version view under Operations Manager monitoring folder in the console:

image
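If you would rather check PatchList from the database than the console view, a query along these lines against the OperationsManager database works (treat it as a sketch – table and column names can vary by version):

SELECT bme.DisplayName, hs.PatchList
FROM MT_HealthService hs
INNER JOIN BaseManagedEntity bme ON hs.BaseManagedEntityId = bme.BaseManagedEntityId
ORDER BY hs.PatchList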

 

 

 

5.  Update Unix/Linux MPs and Agents

image

There are no updates for Linux in UR9 at the time of this writing.   The current Linux MP’s can be downloaded from:

https://www.microsoft.com/en-us/download/details.aspx?id=29696

7.5.1045.0 is current at this time for SCOM 2012 R2; these shipped with UR7.  If you are already running the 7.5.1045.0 version of the Linux MP’s and agents, no update is necessary.

****Note – take GREAT care when downloading to select the correct download for R2.  You must scroll down in the list and select the MSI for 2012 R2:

image

Download the MSI and run it.  It will extract the MP’s to C:\Program Files (x86)\System Center Management Packs\System Center 2012 R2 Management Packs for Unix and Linux\

Update any MP’s you are already using.   These are mine for RHEL, SUSE, and the Universal Linux libraries. 

image

You will likely observe VERY high CPU utilization on your management servers and database server during and immediately following these MP imports.  Give it plenty of time to complete the import and MPB deployments.

Next up – you would upgrade your agents on the Unix/Linux monitored agents.  You can now do this straight from the console:

image

image

You can input credentials or use existing RunAs accounts if those have enough rights to perform this action.
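The upgrade can also be driven from PowerShell with the UNIX/Linux cmdlets in the OperationsManager module.  This is a sketch only – I have not validated these parameter names, so check Get-Help Update-SCXAgent before relying on it:

$Cred = Get-Credential      # a privileged account allowed to upgrade the agent
Get-SCXAgent | Update-SCXAgent -WSManCredential $Cred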

Mine FAILED, with an SSH exception about copying the new agent.  It turned out my files were not updated on the management server – see the image below:

image

I had to restart the Healthservice on the management server, and within a few minutes all the new files were there.

Finally:

image

 

 

6.  Update the remaining deployed consoles

image

This is an important step.  I have consoles deployed around my infrastructure – on my Orchestrator server, SCVMM server, on my personal workstation, on all the other SCOM admins on my team, on a Terminal Server we use as a tools machine, etc.  These should all get the matching update version.

 

 

 

Review:

Now at this point, we would check the OpsMgr event logs on our management servers, check for any new or strange alerts coming in, and ensure that there are no issues after the update.

image

Known issues:

See the existing list of known issues documented in the KB article.

1.  Many people are reporting that the SQL script is failing to complete when executed.  You should attempt to run this multiple times until it completes without error.  You might need to stop the Exchange correlation engine, stop all the SCOM services on the management servers, and/or bounce the SQL server services in order to get a successful completion in a busy management group.  The errors reported appear as below:

——————————————————
(1 row(s) affected)
(1 row(s) affected)
Msg 1205, Level 13, State 56, Line 1
Transaction (Process ID 152) was deadlocked on lock resources with another process and has been chosen as the deadlock victim. Rerun the transaction.
Msg 3727, Level 16, State 0, Line 1
Could not drop constraint. See previous errors.
——————————————————–


Removing / Migrating old Management Servers to new ones


 

This is a common practice for rotating old physical servers coming off lease, or when moving VM based management servers to a new operating system. 

 

There are some generic instructions on TechNet here:  https://technet.microsoft.com/en-us/library/hh456439.aspx   However, these don’t really paint the whole picture of everything that should be checked first.  Customers sometimes run into orphaned objects, or management servers they cannot delete because the MS is hosting remote monitoring activities.

Here is a checklist I have put together.  The steps are not necessarily enforced in this order, so you can rearrange much of this as you see fit.

 

  • Install new management server(s)
  • Configure any registry modifications in place on existing management servers for the new MS
  • Patch new MS with current UR to bring parity with other management servers in the management group.
  • If you have gateways reporting to old management servers, install certificates from the same trusted publisher on the new MS, and then use PowerShell to change the GW-to-MS assignments (see the sketch after this list).
  • Inspect Resource pools. Make sure old management server is removed from any Resource pools with manual membership, and place new management servers in those resource pools.
  • If you have any 3rd party service installations, ensure they are installed as needed on the new MS (connector services, hardware monitoring add-ons, etc.).
  • If you have any hard coded script or EXE paths in place for notifications or scheduled tasks, ensure those are moved.
  • If you run the Exchange 2010 Correlation engine – ensure it is moved to a new MS.
  • If you use any URL watcher nodes hard coded to a management server – ensure those are moved to a new MS. (Web Transaction Monitoring)
  • If you have any other watcher nodes – migrate those templates (OLEDB probe, port, etc.)
  • If you have any custom registry keys in place on a MS, to discover it as a custom class for any reason, ensure these are migrated.
  • If you have any special roles, such as the RMSe - migrate them.
  • Ensure the new MS will host optional roles such as web console or console roles if required.
  • Migrate any agent assignments in the console or AD integration.
  • Ensure you have BOTH management servers online for a considerable time to allow all agents to get updated config – otherwise you will orphan the agents until they know about the new management server.
  • If you perform UNIX/LINUX monitoring, these should migrate with resource pools. You will need to import and export SCX certs for the new management servers that will take part in the pool.
  • If you use IM notifications, ensure the prerequisites are installed on the new MS.
  • Ensure any new management servers are allowed to send email notifications to your SMTP server if it uses an access list.
  • If you have any network devices, ensure the discovery is moved to another MS for any MS that is being removed.
  • If you are using AEM, ensure this role is reconfigured for any retiring MS.
  • If you are using ACS and the collector role needs to be migrated, perform this and update the forwarders to their new collector.
  • If you have customized heartbeat settings for your management servers, ensure these are consistent on the new ones.
  • If you have any agentless monitored systems (rare) move their proxy server.
  • If you were running a hardware load balancer for the SDK service connections – remove the old management servers and add new ones.
  • Review event logs on new management servers and ensure there aren't any major health issues.
  • Uninstall old management server gracefully.
  • Delete management server object in console if required post-uninstall.
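For the gateway bullet above, the OperationsManager module has cmdlets for the reassignment.  A sketch – the server names are placeholders:

$GW = Get-SCOMGatewayManagementServer -Name "gw01.domain.com"
$Primary = Get-SCOMManagementServer -Name "newms01.domain.com"
$Failover = Get-SCOMManagementServer -Name "newms02.domain.com"
Set-SCOMParentManagementServer -Gateway $GW -PrimaryServer $Primary
Set-SCOMParentManagementServer -Gateway $GW -FailoverServer $Failover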

 

If you have any additional steps you feel should be part of this list – feel free to comment.

Event 18054 errors in the SQL application log – in SCOM 2012 R2 deployments


 

I wrote about this issue for SCOM 2007 here:

http://blogs.technet.com/b/kevinholman/archive/2010/10/26/after-moving-your-operationsmanager-database-you-might-find-event-18054-errors-in-the-sql-server-application-log.aspx

When SCOM is installed, it doesn’t just create the databases on the SQL instance – it also adds data for different error scenarios to the sysmessages view in the master database for the instance.

This is why after moving a database, or restoring a DB backup to a rebuilt SQL server, we might end up missing this data. 

These are important because they give very good detailed data about the error and how to resolve it.  If you see these – you need to update your SQL instance with some scripts.
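You can confirm the messages are missing by checking the master database for one of the error numbers from the event.  For example, against the instance hosting the OperationsManager database:

SELECT * FROM sys.messages WHERE message_id = 777980007
-- no rows returned means the SCOM messages were never added to this instance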

Examples of these events on the SQL server:

Log Name:      Application
Source:        MSSQL$I01
Date:          10/23/2010 5:40:14 PM
Event ID:      18054
Task Category: Server
Level:         Error
Keywords:      Classic
User:          OPSMGR\msaa
Computer:      SQLDB1.opsmgr.net
Description:
Error 777980007, severity 16, state 1 was raised, but no message with that error number was found in sys.messages. If error is larger than 50000, make sure the user-defined message is added using sp_addmessage.

You might also notice some truncated events in the OpsMgr event log, on your RMS or management servers:

Event Type:    Warning
Event Source:    DataAccessLayer
Event Category:    None
Event ID:    33333
Date:        10/23/2010
Time:        5:40:13 PM
User:        N/A
Computer:    OMMS3
Description:
Data Access Layer rejected retry on SqlError:
Request: p_DiscoverySourceUpsert — (DiscoverySourceId=f0c57af0-927a-335f-1f74-3a3f1f5ca7cd), (DiscoverySourceType=0), (DiscoverySourceObjectId=74fb2fa8-94e5-264d-5f7e-57839f40de0f), (IsSnapshot=True), (TimeGenerated=10/23/2010 10:37:36 PM), (BoundManagedEntityId=3304d59d-5af5-ba80-5ba7-d13a07ed21d4), (IsDiscoveryPackageStale=), (RETURN_VALUE=1)
Class: 16
Number: 18054
Message: Error 777980007, severity 16, state 1 was raised, but no message with that error number was found in sys.messages. If error is larger than 50000, make sure the user-defined message is added using sp_addmessage.

Event Type:    Error
Event Source:    Health Service Modules
Event Category:    None
Event ID:    10801
Date:        10/23/2010
Time:        5:40:13 PM
User:        N/A
Computer:    OMMS3
Description:
Discovery data couldn't be inserted to the database. This could have happened because  of one of the following reasons:

     – Discovery data is stale. The discovery data is generated by an MP recently deleted.
     – Database connectivity problems or database running out of space.
     – Discovery data received is not valid.

The following details should help to further diagnose:

DiscoveryId: 74fb2fa8-94e5-264d-5f7e-57839f40de0f
HealthServiceId: bf43c6a9-8f4b-5d6d-5689-4e29d56fed88
Error 777980007, severity 16, state 1 was raised, but no message with that error number was found in sys.messages. If error is larger than 50000, make sure the user-defined message is added using sp_addmessage..

 

I have created some SQL scripts which are taken from the initial installation files, and you can download them below.  You simply run them in SQL Management studio to get this data back.

These are for SCOM 2012 R2 ONLY!!!!

 

Download link:   https://gallery.technet.microsoft.com/SQL-to-fix-event-18054-c4375367

Alert Lifecycle Management


 

Sometimes – this is almost a dirty word in some companies.  It is applying an ITSM process around monitoring, to ensure alerts are real, actionable, assigned, accountable, and reportable.

In my travels, I see companies with an excellent process around this.  I also see companies with ZERO process. 

My colleague Nathan Gau has a 3-part series on this topic – check it out over here:

 

http://blogs.technet.com/b/nathangau/archive/2016/02/04/the-anatomy-of-a-good-scom-alert-management-process-part-1-why-is-alert-management-necessary.aspx

The impact of moving databases in SCOM


 

I recently had an interesting customer issue.

We were deploying a new management group to do some performance testing of the impact to SCOM performance as we scale up agents.  This particular management group only had the default MP’s from installing SCOM, and the Base OS MP’s.  Nothing more.

When we scaled up to ~2000 agents, we took a checkpoint at performance.  The console was zippy, and the management servers were having no issues.  However – when we analyzed performance on the database, we saw really high CPU.

image

 

Zooming into a smaller time chunk – the CPU was pretty wild:

 

image

 

What we found was that the customer had moved the SCOM databases to a different server than they were originally installed on.  When they did this, they did not fully follow the TechNet instructions to ensure that SQL Broker and CLR are enabled.

You can check these settings:

SQL Broker:

SELECT is_broker_enabled FROM sys.databases WHERE name='OperationsManager'

CLR:

SELECT * FROM sys.configurations WHERE name = 'clr enabled'

Both should return a value of “1” to show they are enabled.

Changing these values is covered here:  https://technet.microsoft.com/en-ca/library/hh278848.aspx
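For reference, the fix is standard SQL.  Enabling the broker requires exclusive access to the database, so stop the SCOM services on all management servers first:

-- Enable Service Broker on the OperationsManager database
ALTER DATABASE OperationsManager SET SINGLE_USER WITH ROLLBACK IMMEDIATE
ALTER DATABASE OperationsManager SET ENABLE_BROKER
ALTER DATABASE OperationsManager SET MULTI_USER

-- Enable CLR on the instance
EXEC sp_configure 'clr enabled', 1
RECONFIGURE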

Always make sure you handle the other changes necessary when moving a database, and don’t forget to add the sysmessages back, documented here:  Event 18054 errors in the SQL application log – in SCOM 2012 R2 deployments

 

After making these changes – the impact was significant, going from 50% avg CPU consumption, to 11%.

 

24 hour snapshot:

image

One hour snapshot:

image

 

Whenever you visit a SCOM customer, or inherit a SCOM environment you don’t know the full history of, they might not have these settings optimized – and they might not even be aware they are impacted, especially if their agent count is low.  There are other symptoms you’d see, such as regular expressions failing in the logs without CLR enabled, and agent discovery not working without SQL Broker… but it is always a good thing to inspect when reviewing the health of a deployment.

Windows 10 Client MP’s are available


 

image

 

Download here:    https://www.microsoft.com/en-us/download/details.aspx?id=51189

The client OS MP’s are available when you need to monitor Windows clients in your SCOM management group.  These might be “light” monitoring of desktops and laptops in the organization, or these might be for mission critical roles such as Kiosks and ATM type machines running a Windows client OS.

 

image

The MP’s will upgrade your base client library (it still has a name referencing SCOM 2007, but these are applicable to SCOM 2012) and will import additional MP’s specific to discovering and monitoring Windows 10 clients.

 

image

If you are importing this MP for Windows 10 clients, and you also already monitor Windows 8 clients, make SURE you update your Windows 8 MP’s to the latest version, 6.0.7251.0, available here:  https://www.microsoft.com/en-us/download/details.aspx?id=38434    The 6.0.7251.0 MP’s contain a fix to stop discovering a Win10 client as a Windows 8 client; otherwise you will get duplicate monitoring and overload your Win10 clients unnecessarily.  Make sure you upgrade the Windows 8 MP’s FIRST, before installing the agents on any Windows 10 clients.  If you still have duplicate instances of Windows 8 Computer for a Windows 10 client, you need to delete the agent from Agent Managed in SCOM, then approve it again – this will clean up the old discovered objects from the Windows 8 client MP’s.

 

Individual workflows are enabled on every client computer, to discover and monitor disks, memory, CPU, etc.  However, the monitors are all set (via overrides) not to generate alerts.  You have to put clients in a “Business Critical” group in order to see alerts for these clients.  The monitors will still show health state for all clients – just not alerts.

The same goes for performance collection rules.  There are overrides to enable these (all disabled out of the box) and collect performance data for business critical computers.

The guide also discusses the use of aggregate client monitoring.  These load special workflows that fill the data warehouse with trending reports, and run SQL queries against the warehouse on a regular basis.  Make sure you DON’T import the Aggregate MP’s if you don’t want or need this type of monitoring, as it is optional.

See the MP guide for advanced details on how to configure this MP, and other client OS management packs.

Base OS MP’s have been updated – version 6.0.7303.0


 

***WARNING***  There are some significant issues in this release of the Base OS MP, I do not recommend applying this one until an updated version comes out.

Issues:

  • Cluster Disks on Server 2008R2 clusters are no longer discovered as cluster disks.
  • Cluster Disks on Server 2008 clusters are not discovered as logical disks.
  • Quorum (or small size) disks on clusters that ARE discovered as Cluster disks, do not monitor for free space correctly.
  • Cluster shared volumes are discovered twice, once as a Cluster Shared Volume instance, and once as a Logical Disk instance, with the latter likely caused by enabling mounted disk discovery.
  • On Hyper-V servers, I discover an extra disk, which has no properties:

image

 

 

What was changed?

 

From the guide:

MP used to discover physical CPU, which performance monitor instance name property was not correlated with Windows PerfMon object (expecting instance name in (socket, core) format). That affected related rules and monitors. With this release, MP discovers logical processors, rather than physical, and populates performance monitor instance name in proper format

That was a real problem for anyone trying to monitor individual CPU’s in the past – we actually discovered “sockets” not cores – so this didn’t jive with Perfmon at all.  I look forward to testing this.

Microsoft.Windows.Server.ClusterSharedVolumeMonitoring.mp and Microsoft.Windows.Server.Library.mp scripts code migration to PowerShell in scope of Windows Server 2016 Nano support (relevantly introduced in Windows Server 2016 MP version 10.0.1.0).

It is these changes that likely broke cluster disk discovery.

Updated Microsoft.Windows.Server.ClusterSharedVolumeMonitoring.ClusterSharedVolume.Monitoring.State monitor alert properties and description. The fix resolved property replacement failure warning been generated on monitor alert firing.

Exchange 2013 Addendum MP – for Exchange 2013 and 2016


 

image

 

 

 

The Exchange 2013 MP has been released for some time now.  The current version at this writing is 15.0.666.19 which you can get HERE

This MP can be used to discover and monitor Exchange Server 2013 and 2016.

 

 

 

 

However, one of the things I always disliked about this MP is that it does not use a seed class discovery.  Therefore, it runs a PowerShell script every 4 hours on EVERY machine in your management group, looking for Exchange servers.  The problem with this is that it doesn’t follow best practices.  As a general best practice, we should NOT run scripts on all servers unless truly necessary.  Another issue – many customers have servers running 2003 and 2008 that DON’T have PowerShell installed!  You will see nuisance events like the following:

 

Event Type:    Error
Event Source:    Health Service Modules
Event Category:    None
Event ID:    21400
Date:        3/2/2016
Time:        3:29:26 AM
User:        N/A
Computer:    WINS2003X64
Description:
Failed to create process due to error '0x80070003 : The system cannot find the path specified.
', this workflow will be unloaded.
Command executed:    "C:\WINDOWS\System32\WindowsPowerShell\v1.0\powershell.exe" -PSConsoleFile "bin\exshell.psc1" -Command "& '"C:\Program Files\Microsoft Monitoring Agent\Agent\Health Service State\Monitoring Host Temporary Files 85\26558\MicrosoftExchangeDiscovery.ps1"'" 0 '{3E7D658E-FA5E-924E-334E-97C84E068C4A}' '{B21B34F9-2817-4800-73BD-012E79609F7E}' 'wins2003x64.dmz.corp' 'wins2003x64' 'Default-First-Site-Name' 'dmz.corp' '' '' '0' 'false'
Working Directory:    C:\Program Files\Microsoft Monitoring Agent\Agent\Health Service State\Monitoring Host Temporary Files 85\26558\
One or more workflows were affected by this. 
Workflow name: Microsoft.Exchange.15.Server.DiscoveryRule
Instance name: wins2003x64.dmz.corp
Instance ID: {B21B34F9-2817-4800-73BD-012E79609F7E}
Management group: OMMG1

 

 

So, I have created an addendum MP which should resolve this.  My MP creates a class and discovery, looking for “HKEY_LOCAL_MACHINE\SOFTWARE\Microsoft\ExchangeServer\v15\Setup\MsiInstallPath” in the registry.  If the registry path is found, SCOM will add the computer as an instance of my seed class.

image
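The discovery itself uses the same filtered registry provider pattern you will see later in this blog.  A condensed sketch of the interesting parts (element names are my own, and surrounding discovery XML is omitted; PathType 1 reads a registry value, and AttributeType 0 checks existence and returns a boolean):

<RegistryAttributeDefinition>
  <AttributeName>ExchMsiInstallPathExists</AttributeName>
  <Path>SOFTWARE\Microsoft\ExchangeServer\v15\Setup\MsiInstallPath</Path>
  <PathType>1</PathType>
  <AttributeType>0</AttributeType>
</RegistryAttributeDefinition>
...
<Expression>
  <SimpleExpression>
    <ValueExpression>
      <XPathQuery Type="Boolean">Values/ExchMsiInstallPathExists</XPathQuery>
    </ValueExpression>
    <Operator>Equal</Operator>
    <ValueExpression>
      <Value Type="Boolean">true</Value>
    </ValueExpression>
  </SimpleExpression>
</Expression>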

 

Then, I created a group of Windows Computer objects that “contain” an instance of the seed class. 

image

 

Next, I added an override to disable the main script discovery in the Exchange 2013 MP.

Finally, I added an override to enable this same discovery for my custom group.  This has the effect that our Exchange discovery script ONLY runs on servers that actually have Exchange installed (based on the registry key).

image

 

 

This works for discovering Exchange 2013 and Exchange 2016 with the current Exchange 2013 MP.

 

You can download this sample MP at the following location:

https://gallery.technet.microsoft.com/Exchange-Server-2013-and-cfdfcf2f


How to generate an alert and make it look like it came from someone else


 

This capability has been around forever, but I have never seen it documented.  This is a really cool way to generate alerts as if they came from other agents, while targeting the workflow at a different agent.

Suppose a scenario:  You have a client/server application (such as a backup program) where a central server logs all the events about successful or failed jobs from clients.

In this scenario, we could simply generate alerts targeting the central server, reading the event log, and bubbling up the broken client name from the logs into the alert.  The challenge becomes, what if some agents are test, or dev, and some are prod?  What if we have already put in place “tiering” of servers by groupings, and we use this to filter which alerts from which servers get ticketed?

There is actually a way to target one instance of a class with a workflow, but to generate alerts as if they came from a different instance of a different class, EVEN if that instance is a different agent altogether!

Let me demonstrate:

The most common write action for generating alerts in rules is System.Health.GenerateAlert, which is the one used in just about every alert-generating rule you come across.  It is documented here:  https://msdn.microsoft.com/en-us/library/ee809352.aspx

HOWEVER – there is another write action you can use:  System.Health.GenerateAlertForType. 

This is documented here:  https://msdn.microsoft.com/en-us/library/jj130310.aspx  While we document the modules and a sample XML example, we don’t really give much guidance anywhere on use cases.

This is a really cool write action, which allows us to generate alerts “on behalf” of a different object type, or even a different object type from a different computer!  Let me show the difference:

A typical System.Health.GenerateAlert looks like this:

<WriteAction ID="GenerateAlert" TypeID="Health!System.Health.GenerateAlert"> <Priority>1</Priority> <Severity>1</Severity> <AlertMessageId>$MPElement[Name="Demo.Rule.AlertMessage"]$</AlertMessageId> <AlertParameters> <AlertParameter1>$Data/EventDescription$</AlertParameter1> </AlertParameters> </WriteAction>

As you can see – very simple.  It sets the priority and severity of the alert, references the Alert Message ID (which is the alert name and description configuration) and contains any alert parameters we want to use in the display output (in this case, Event Description is very common).

 

Now, see the System.Health.GenerateAlertForType:

<WriteAction ID="GenerateAlertForTypeWA" TypeID="Health!System.Health.GenerateAlertForType"> <Priority>1</Priority> <Severity>1</Severity> <ManagedEntityTypeId>$MPElement[Name="Example.Client.Class"]$</ManagedEntityTypeId> <KeyProperties> <KeyProperty> <PropertyId>$MPElement[Name="Windows!Microsoft.Windows.Computer"]/PrincipalName$</PropertyId> <IsCaseSensitive>false</IsCaseSensitive> <Value>servername.fqdn.local</Value> </KeyProperty> <KeyProperty> <PropertyId>$MPElement[Name="Example.Client.Class"]/ClientName$</PropertyId> <IsCaseSensitive>false</IsCaseSensitive> <Value>servername.fqdn.local</Value> </KeyProperty> </KeyProperties> <AlertMessageId>$MPElement[Name="Demo.Rule.AlertMessage"]$</AlertMessageId> <AlertParameters> <AlertParameter1>$Data/EventDescription$</AlertParameter1> </AlertParameters> </WriteAction>

The key section here is <ManagedEntityTypeId> and then some <KeyProperties>

In the <ManagedEntityTypeId> we need to reference the CLASS that we want the alert to appear as it is coming FROM.

Then, in the <KeyProperties> we need two sections:

The first key property is mapping the Windows Computer principal name to the fqdn of the agent we want the alert to “appear to be from”.  This part is easy.

The second key property is mapping the SAME fqdn, to a matching property on the CLASS we referenced in <ManagedEntityTypeId>, or a parent base class that has the key property defined.

The second key property is the tough one.  The criteria for this to work (from my testing) is that we MUST have a class with a key property first, and that key property MUST be the FQDN of the agent/server for each instance (or whatever value we are “matching” on).

In most of the classes I create, I don’t create key properties.  Key properties aren't required unless I have a class that will discover multiple instances on the same healthservice (agent).  For stuff I do, this is rarely the case.  However, it is EASY to create a key property for your custom classes, and many Microsoft classes already have key properties.  The big “gotcha” here is that in order to generate an alert for another instance of a class (not the targeted instance), the class we specify MUST have a key property defined for this to work.

So – I simply added a key property of “ClientName” to my custom class, and then to discover it, all I have to do is add some simple code to the discovery which maps the hosting Windows Computer principal name to the property.

Ok…. I know…. I probably lost a lot of you up to this point….. but it is easier to just do it than it is to understand it.  That’s why I will post my XML examples at a link below.

 

Here is an example of me adding a custom key property to my custom class:

<ClassType ID="Example.AlertFromAnotherInstance.Client.Class" Accessibility="Public" Abstract="false" Base="Windows!Microsoft.Windows.LocalApplication" Hosted="true" Singleton="false" Extension="false">
  <Property ID="ClientName" Type="string" AutoIncrement="false" Key="true" CaseSensitive="false" MaxLength="256" MinLength="0" Required="false" Scale="0" />
</ClassType>

And here is part of the discovery that I will use to map “ClientName” to the hosting Windows Computer principal name:

 

<Discovery ID="Example.AlertFromAnotherInstance.Client.Class.Discovery" Enabled="true" Target="Windows!Microsoft.Windows.Server.OperatingSystem" ConfirmDelivery="false" Remotable="true" Priority="Normal">
  <Category>Discovery</Category>
  <DiscoveryTypes>
    <DiscoveryClass TypeID="Example.AlertFromAnotherInstance.Client.Class" />
  </DiscoveryTypes>
  <DataSource ID="DS" TypeID="Windows!Microsoft.Windows.FilteredRegistryDiscoveryProvider">
    <ComputerName>$Target/Host/Property[Type="Windows!Microsoft.Windows.Computer"]/PrincipalName$</ComputerName>
    <RegistryAttributeDefinitions>
      <RegistryAttributeDefinition>
        <AttributeName>ClientExists</AttributeName>
        <Path>SOFTWARE\Demo\Client</Path>
        <PathType>0</PathType>
        <AttributeType>0</AttributeType>
      </RegistryAttributeDefinition>
    </RegistryAttributeDefinitions>
    <Frequency>86400</Frequency>
    <ClassId>$MPElement[Name="Example.AlertFromAnotherInstance.Client.Class"]$</ClassId>
    <InstanceSettings>
      <Settings>
        <Setting>
          <Name>$MPElement[Name="Windows!Microsoft.Windows.Computer"]/PrincipalName$</Name>
          <Value>$Target/Host/Property[Type="Windows!Microsoft.Windows.Computer"]/PrincipalName$</Value>
        </Setting>
        <Setting>
          <Name>$MPElement[Name="System!System.Entity"]/DisplayName$</Name>
          <Value>$Target/Host/Property[Type="Windows!Microsoft.Windows.Computer"]/PrincipalName$</Value>
        </Setting>
        <Setting>
          <Name>$MPElement[Name="Example.AlertFromAnotherInstance.Client.Class"]/ClientName$</Name>
          <Value>$Target/Host/Property[Type="Windows!Microsoft.Windows.Computer"]/PrincipalName$</Value>
        </Setting>
      </Settings>
    </InstanceSettings>

 

So – now all I need to do is write a rule, and use our new write action.

You can write the event rule like my example using the console or any other tool, then simply modify the write action section in the XML.

Here is my simple rule:

 

<Rule ID="Example.AlertFromAnotherInstance.Server.Event.Rule" Enabled="true" Target="Example.AlertFromAnotherInstance.CentralServer.Class" ConfirmDelivery="false" Remotable="true" Priority="Normal" DiscardLevel="100"> <Category>EventCollection</Category> <DataSources> <DataSource ID="Microsoft.Windows.EventCollector" TypeID="Windows!Microsoft.Windows.EventCollector"> <ComputerName>$Target/Host/Property[Type="Windows!Microsoft.Windows.Computer"]/NetworkName$</ComputerName> <LogName>Application</LogName> <AllowProxying>false</AllowProxying> <Expression> <And> <Expression> <SimpleExpression> <ValueExpression> <XPathQuery Type="UnsignedInteger">EventDisplayNumber</XPathQuery> </ValueExpression> <Operator>Equal</Operator> <ValueExpression> <Value Type="UnsignedInteger">999</Value> </ValueExpression> </SimpleExpression> </Expression> <Expression> <SimpleExpression> <ValueExpression> <XPathQuery Type="String">PublisherName</XPathQuery> </ValueExpression> <Operator>Equal</Operator> <ValueExpression> <Value Type="String">TEST</Value> </ValueExpression> </SimpleExpression> </Expression> </And> </Expression> </DataSource> </DataSources> <WriteActions> <WriteAction ID="GenerateAlertForTypeWA" TypeID="Health!System.Health.GenerateAlertForType"> <Priority>1</Priority> <Severity>2</Severity> <ManagedEntityTypeId>$MPElement[Name="Example.AlertFromAnotherInstance.Client.Class"]$</ManagedEntityTypeId> <KeyProperties> <KeyProperty> <PropertyId>$MPElement[Name="Windows!Microsoft.Windows.Computer"]/PrincipalName$</PropertyId> <IsCaseSensitive>false</IsCaseSensitive> <Value>$Data/Params/Param[1]$</Value> </KeyProperty> <KeyProperty> <PropertyId>$MPElement[Name="Example.AlertFromAnotherInstance.Client.Class"]/ClientName$</PropertyId> <IsCaseSensitive>false</IsCaseSensitive> <Value>$Data/Params/Param[1]$</Value> </KeyProperty> </KeyProperties> <AlertMessageId>$MPElement[Name="Example.AlertFromAnotherInstance.Server.Event.Rule.AlertMessage"]$</AlertMessageId> <AlertParameters> <AlertParameter1>$Data/Params/Param[1]$</AlertParameter1> <AlertParameter2>$Data/EventDescription$</AlertParameter2> </AlertParameters> </WriteAction> </WriteActions> </Rule>

 

The rule is simple.  It looks in the Application event log for an event ID 999, with an event source of “TEST”.  If found, it runs the write action.  If you scroll down, you can see the write action part, which I will explain:

 

In my rule, I am targeting the workflow to run on the “Server” class.  However, in my write action, I want the alert generated by instances of the “Client” class.  So on my <ManagedEntityTypeId> line, I am using Example.AlertFromAnotherInstance.Client.Class which is my client class ID.

Next, I map the key property for Windows Computer (Principal Name) to the machine I want to appear to generate the alert.  In this case, the name of the affected machine is in Param 1 of my test event, so I am mapping whatever name is in Param1 of the event to generate the alert.

Next, I map the key property of my custom class to the SAME FQDN value.

That’s it!

 

In this example – I create an event on my “Server”, and param 1 of the event will have the name of the client I want the alert to come from:

image

 

Note:  in the above image – the event was logged on a Server named “STORAGE.opsmgr.net” but param1 contained a name of “RD01.opsmgr.net”.

As long as RD01.opsmgr.net hosts an instance of my “Client” class, an alert will be generated as if it came from this server:

 

image

 

 

If you want to test my example XML out in your own environment, simply create some reg keys to be the “Server” and the “Client” instances to be discovered:

HKEY_LOCAL_MACHINE\SOFTWARE\Demo\Server

HKEY_LOCAL_MACHINE\SOFTWARE\Demo\Client
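From an elevated command prompt on the test machines, creating those keys is just:

reg add HKLM\SOFTWARE\Demo\Server
reg add HKLM\SOFTWARE\Demo\Client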

 

The example management pack is available for download at:  https://gallery.technet.microsoft.com/Management-pack-sample-How-8b6741e3

How to remove OMS and Advisor management packs


 

When testing OMS (Previously called Advisor) with SCOM, there is one side effect:  Once connected, the OMS rules import management packs into your management group with no notification or change control process for you.  Furthermore – if you want to remove OMS Management packs from a SCOM management group, there is a rule that will actually re-download them while you are trying to delete them!  This makes OMS very difficult to remove by default.

Brian Wren posted a method to control this behavior here, and I will demonstrate the same.

https://blogs.technet.microsoft.com/msoms/2016/03/16/control-management-pack-updates-between-ms-oms-and-operations-manager/

 

First, create a new management pack to store our temporary overrides – called “OMS Temp Overrides”

Then in the console, go to Authoring > Rules, and set your scope only to “Operations Manager Management Group”

Disable the following two rules:

image

 

This will stop new OMS/Advisor packs from coming down automatically.

 

Now you can start removing the packs as needed from your management group.    You can use PowerShell to do this in bulk, but it will fail for any MP’s with dependencies.  Here is a simple example:

Get-SCOMManagementPack -Name "*advisor*" | Remove-SCOMManagementPack

Get-SCOMManagementPack -Name "*IntelligencePack*" | Remove-SCOMManagementPack

Get-SCOMManagementPack -Name "Microsoft.EnterpriseManagement.Mom.Modules.AggregationModuleLibrary" | Remove-SCOMManagementPack

Be VERY careful using the above statements – they are provided as examples only.  Make SURE they return only the ones you wish to remove and not any custom packs you created that happen to match the naming scheme.
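A safe habit is to preview what a wildcard matches before piping it to Remove-SCOMManagementPack:

Get-SCOMManagementPack -Name "*advisor*" | Select-Object Name, DisplayName, Version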

Now – that should leave you with just the following MP’s:

 

image

 

Delete your temp Override MP you created, then (quickly) delete the above MP’s in the order above.

That’s it.

 

If you want to bring OMS back into a Management Group – simply import the Advisor Packs in whatever current UR (Update Rollup) you are on, such as these from UR9:

image

How to monitor for event logs and use a script to modify the output – a composite datasource


 

A common request I hear is the customer wants to monitor for events in a Windows Event log.  That part is easy.  We have simple event rules and monitors for that activity.

However – what if the data in the event log needed to be parsed, or modified in some way, before passing to the alert?

 

For instance, I had a customer who needed to monitor for an event from a central server (a backup software platform) about job failures from backup clients.  The event has just a single parameter – a big blob of text that has all the data – and the FQDN of the client is in the event description, surrounded by a lot of other information. 

 

The challenge is – that for ticketing, the customer needs to place ONLY the FQDN of the CLIENT machine (which is in the event body) into a custom field of an alert.

 

SCOM doesn’t have any good data manipulation capability in the native modules, so in this case we will execute a script in response to the event.  We do this by creating a composite datasource, combining the event log module and the script probe action module.

When an event shows up that matches our criteria, we then execute a script to parse the event description, and create a propertybag in order to output this customized data to the Alert write action.

Obviously – one must take care not to put something like this in place for events that might flood the log, because the SCOM agent will try to run a script for each and every event, which could overwhelm the system.  I actually tested this with a pretty significant event flood, and it was not a big deal at all; the system kept up very nicely.

 

For my “test” event – I have created a block of text with the FQDN of the remote client machine in the body of the description:

Log Name:      Application
Source:        TEST
Date:          4/1/2016 5:13:47 PM
Event ID:      888
Computer:      RD01.opsmgr.net
Description:
foo db01.opsmgr.net foo

Then I wrote a simple script to parse this event and gather the second block of text, which will reliably contain my FQDN:

 

'''''''''''''''''''''''''''''''
'
' Basic SCOM vbscript to accept event data and parse/modify it for output via propertybag
'
'''''''''''''''''''''''''''''''
Option Explicit
Dim oAPI, oBag, sParam1, StartTime, EndTime, ScriptTime, CompNameArr, CompName

'Capture script start time
StartTime = Now

'Gather the argument passed to the script and set it to a variable
sParam1 = WScript.Arguments(0)

'Split the event data into multiple delimited strings in an array
CompNameArr = Split(sParam1," ")

'Assume the FQDN is always the 2nd "word" in the event data
CompName = CompNameArr(1)

'Load the SCOM script API and propertybag
Set oAPI = CreateObject("MOM.ScriptAPI")
Set oBag = oAPI.CreatePropertyBag()

'Add the CompName and the original event description into the propertybag
oBag.AddValue "CompName", CompName
oBag.AddValue "EventDescription", sParam1

'Return the bag for output
oAPI.Return(oBag)

'Capture script runtime
EndTime = Now
ScriptTime = DateDiff("s", StartTime, EndTime)

'Log event with script outputs and runtime
Call oAPI.LogScriptEvent("EventParse.vbs from Example.EventAndScript.DS", 9877, 0, "Event Data Passed to script = " & sParam1 & " -- Output after parsing for CompName = " & CompName & " -- Script Execution Completed in " & ScriptTime & " seconds")
WScript.Quit

 

So first off – we need to create the data source.  The easiest tool for creating Composite data sources is the SCOM 2007 R2 Authoring Console.  I’ll just show snippets of XML that you can forklift into your own MP’s:

 

<DataSourceModuleType ID="Example.EventAndScript.DS" Accessibility="Internal" Batching="false">
  <Configuration>
    <xsd:element minOccurs="1" name="LogName" type="xsd:string" />
    <xsd:element minOccurs="1" name="EventID" type="xsd:integer" />
    <xsd:element minOccurs="1" name="EventSource" type="xsd:string" />
  </Configuration>
  <ModuleImplementation Isolation="Any">
    <Composite>
      <MemberModules>
        <DataSource ID="EventDS" TypeID="Windows!Microsoft.Windows.EventProvider">
          <ComputerName />
          <LogName>$Config/LogName$</LogName>
          <Expression>
            <And>
              <Expression>
                <SimpleExpression>
                  <ValueExpression>
                    <XPathQuery Type="UnsignedInteger">EventDisplayNumber</XPathQuery>
                  </ValueExpression>
                  <Operator>Equal</Operator>
                  <ValueExpression>
                    <Value Type="UnsignedInteger">$Config/EventID$</Value>
                  </ValueExpression>
                </SimpleExpression>
              </Expression>
              <Expression>
                <SimpleExpression>
                  <ValueExpression>
                    <XPathQuery Type="String">PublisherName</XPathQuery>
                  </ValueExpression>
                  <Operator>Equal</Operator>
                  <ValueExpression>
                    <Value Type="String">$Config/EventSource$</Value>
                  </ValueExpression>
                </SimpleExpression>
              </Expression>
            </And>
          </Expression>
        </DataSource>
        <ProbeAction ID="ScriptDS" TypeID="Windows!Microsoft.Windows.ScriptPropertyBagProbe">
          <ScriptName>EventParse.vbs</ScriptName>
          <Arguments>"$Data/Params/Param[1]$"</Arguments>
          <ScriptBody><![CDATA[
'''''''''''''''''''''''''''''''
'
' Basic SCOM vbscript to accept event data and parse/modify it for output via propertybag
'
'''''''''''''''''''''''''''''''
Option Explicit
Dim oAPI, oBag, sParam1, StartTime, EndTime, ScriptTime, CompNameArr, CompName

'Capture script start time
StartTime = Now

'Gather the argument passed to the script and set it to a variable
sParam1 = WScript.Arguments(0)

'Split the event data into multiple delimited strings in an array
CompNameArr = Split(sParam1," ")

'Assume the FQDN is always the 2nd "word" in the event data
CompName = CompNameArr(1)

'Load the SCOM script API and propertybag
Set oAPI = CreateObject("MOM.ScriptAPI")
Set oBag = oAPI.CreatePropertyBag()

'Add the CompName and the original event description into the propertybag
oBag.AddValue "CompName", CompName
oBag.AddValue "EventDescription", sParam1

'Return the bag for output
oAPI.Return(oBag)

'Capture script runtime
EndTime = Now
ScriptTime = DateDiff("s", StartTime, EndTime)

'Log event with script outputs and runtime
Call oAPI.LogScriptEvent("EventParse.vbs from Example.EventAndScript.DS", 9877, 0, "Event Data Passed to script = " & sParam1 & " -- Output after parsing for CompName = " & CompName & " -- Script Execution Completed in " & ScriptTime & " seconds")
WScript.Quit
]]></ScriptBody>
          <TimeoutSeconds>30</TimeoutSeconds>
        </ProbeAction>
      </MemberModules>
      <Composition>
        <Node ID="ScriptDS">
          <Node ID="EventDS" />
        </Node>
      </Composition>
    </Composite>
  </ModuleImplementation>
  <OutputType>System!System.PropertyBagData</OutputType>
</DataSourceModuleType>

 

The above composite datasource allows a rule or monitor to call on it, and pass the three required data items to the Microsoft.Windows.EventProvider DS:  Event ID, Event Log Name, and Event source.  Obviously you could modify this as needed.

Next, the Microsoft.Windows.ScriptPropertyBagProbe is called.  I place my script in here, along with the event argument from the event DS, for which I will just use “$Data/Params/Param[1]$”.  When an event contains just a single parameter, everything in the event description is treated as Param 1.

In the script – I am outputting two propertybag values:

1.  The original event description

2.  The parsed out “CompName” which is the fqdn I am after.

 

Next – I create my rule.  This is much easier to create from scratch using the SCOM 2007 R2 authoring console, or VSAE if you are used to that.  Most of the time I find myself just forklifting XML if I can find a close enough match to what I want.  Here is the rule:

 

 

<Rule ID="Example.EventAndScript.Rule" Enabled="true" Target="Windows!Microsoft.Windows.Server.OperatingSystem" ConfirmDelivery="true" Remotable="true" Priority="Normal" DiscardLevel="100"> <Category>Custom</Category> <DataSources> <DataSource ID="DS" TypeID="Example.EventAndScript.DS"> <LogName>Application</LogName> <EventID>888</EventID> <EventSource>TEST</EventSource> </DataSource> </DataSources> <WriteActions> <WriteAction ID="Alert" TypeID="Health!System.Health.GenerateAlert"> <Priority>1</Priority> <Severity>2</Severity> <AlertMessageId>$MPElement[Name="Example.EventAndScript.Event888.rule.AlertMessage"]$</AlertMessageId> <AlertParameters> <AlertParameter1>$Data/Property[@Name='EventDescription']$</AlertParameter1> <AlertParameter2>$Data/Property[@Name='CompName']$</AlertParameter2> </AlertParameters> <Custom1>$Data/Property[@Name='CompName']$</Custom1> </WriteAction> </WriteActions> </Rule>

 

 

VERY simple.  I pass my three required configuration items:

            <LogName>Application</LogName>
            <EventID>888</EventID>
            <EventSource>TEST</EventSource>

 

And I configured the Alert write action to output both the full event description, and the parsed FQDN to the alert description.

image

 

Furthermore – to meet the requirements of putting the FQDN in a consistent location for my Ticketing system, I put the parsed FQDN into Custom Field 1 in the alert.

image

 

Now the ticketing system can look for this and make sure the ticket is assigned to the correct server owner.

An interesting option with this kind of workflow is that you could also potentially make the alert in SCOM “appear” as if it originated from the FQDN in the event, as long as that server has a SCOM agent installed and is part of the same management group:    How to generate an alert and make it look like it came from someone else

 

 

You can download the entire sample MP which contains everything above on TechNet gallery:

https://gallery.technet.microsoft.com/SCOM-Management-Pack-to-dec108c6

Writing events with parameters using PowerShell


 

When we write scripts for SCOM workflows, we often log events as the output – for general logging, for debugging, or to trigger other rules for alerting.  One of the common things I need when logging is the ability to write parameters to the event.  This helps in making VERY granular criteria for SCOM alert rules to match on.

 

One of the things I HATE about the MOM Script API LogScriptEvent method, is that it places all the text into a single blob of text in the event description, all of this being Parameter 1.

Luckily – there is a fairly simple method to create parameterized events for output from your own PowerShell scripts.  I got this from Mark Manty, a fellow PFE.

 

Here is a basic script that demonstrates the capability:

 

#Script to create events with parameters

#Define the event log and your custom event source
$evtlog = "Application"
$source = "MyEventSource"

#These are just examples to pass as parameters to the event
$hostname = "computername.domain.net"
$timestamp = (get-date)

#Load the event source to the log if not already loaded.  This will fail if the event source is already assigned to a different log.
if ([System.Diagnostics.EventLog]::SourceExists($source) -eq $false) {
    [System.Diagnostics.EventLog]::CreateEventSource($source, $evtlog)
}

#Function to create the events with parameters
function CreateParamEvent ($evtID, $param1, $param2, $param3)
{
    $id = New-Object System.Diagnostics.EventInstance($evtID,1)       #INFORMATION EVENT
    #$id = New-Object System.Diagnostics.EventInstance($evtID,1,2)    #WARNING EVENT
    #$id = New-Object System.Diagnostics.EventInstance($evtID,1,1)    #ERROR EVENT
    $evtObject = New-Object System.Diagnostics.EventLog
    $evtObject.Log = $evtlog
    $evtObject.Source = $source
    $evtObject.WriteEvent($id, @($param1,$param2,$param3))
}

#Command line to call the function and pass whatever you like
CreateParamEvent 1234 "The server $hostname was logged at $timestamp" $hostname $timestamp

 

The script uses some variables to set which log you want to write to, and what your custom source is.

The rest is pretty self-explanatory from the comments.

You can add additional params if needed to the function and the command line calling the function.

 

Here is an event example:

 

image

 

 

But the neat stuff shows up in the XML view where you can see the parameters:

 

image

Update: Automating Run As Account distribution dynamically


 

Just an FYI – I have updated the automatic run as account distribution script I published, to make it more reliable in large environments, limit resources used and decrease the chance of a timeout, along with adding better debug logging.

 

Get the script and read more here:

Automating Run As Account Distribution – Finally!

 

I also published this script in a simple management pack with a rule, which will run the script once a day in your management group.  It targets the All Management Servers Resource Pool so this will have high availability and only run on the single Management Server that is hosting that object.

 

Get the Management Pack here:

https://gallery.technet.microsoft.com/Management-Pack-to-06730af3
