Hi All,
Just wanted to share some alerting logic with you.
Firstly, the problem: when an alert triggers it will eventually reset. My issue is that it could reset straight away and, before you know it, trigger again, ramping up the ticket count and leading to incorrectly logged tickets and missed alerts. SolarWinds ships with some cool features to help prevent noisy alerts with some advanced logic, but this requires making a few dozen additional alerts.
For the purpose of this article I'll just use node status as the alert condition.
The scenario is noisy alerts: the first trigger is at 9, then a reset; another alert at 11, then a reset; another alert at 2, and so on... You could simply say in the alert reset condition that the node must be Up for x minutes or hours before the alert resets, so it can only re-trigger some time later.
My issue is that I don't really like this static value, and to change it I would need to create another alert with different times. Opting for this type of alerting means making duplicates of interface alerts, volume alerts, node status alerts, basically everything. At that point things start to get messy, especially when you implement your workflow elsewhere. Like me, you may just want one single alert. On to the solution; I'll try to keep it as high level as possible.
Just to recap: I want an alert to trigger, then reset, then wait for a delay set by a custom property before the next trigger. Seems reasonable enough. The logic for the alert is simple really: check the events logged by previous alerts, and if the time since the last one is greater than the custom property value, then alert.
You'll need to create a custom property:
Name | Format | Description |
---|---|---|
n_hrs_DelayNxtTrigrStatus | integer | number of hours to delay node status alert after 1st trigger |
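
Once the property exists, a quick way to see which delay each node will effectively get is a query like the one below. This is only a sketch: it assumes the custom property is exposed as a column on `Nodes` (as the trigger SQL in this post also assumes), and the `EffectiveDelayHours` alias is just a label made up for the output.

```sql
-- Sketch: list each node with its configured delay and the delay actually used.
-- Assumption: n_hrs_DelayNxtTrigrStatus is readable as a column on Nodes.
SELECT n.Caption,
       n.NodeID,
       n.n_hrs_DelayNxtTrigrStatus AS ConfiguredDelayHours,
       -- nodes with no value set fall back to the hardcoded 16 hours
       ISNULL(n.n_hrs_DelayNxtTrigrStatus, 16) AS EffectiveDelayHours
FROM Nodes n WITH(NOLOCK)
ORDER BY n.Caption;
```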
SQL condition:

    -- Comment out the SELECT line when pasting this into the alert.
    SELECT Nodes.Caption, Nodes.NodeID FROM Nodes
    -- Join on the Events table and look for the RESET events
    -- that this alert wrote to the NPM event log.
    WITH(NOLOCK)
    LEFT JOIN Events p WITH(NOLOCK)
           ON p.NetObjectID = Nodes.NodeID
          AND p.NetObjectType = 'N'
          AND (p.Message LIKE '%Rule: "- Node Status" | Reset:%')
    WHERE Nodes.Status = 2
      -- n_mute is not covered in this post; remove this condition
      -- if your Nodes table has no such column.
      AND Nodes.n_mute <> '1'
    GROUP BY Nodes.Caption, Nodes.NodeID
    HAVING
        -- Compare the hours since the last matching event against the
        -- custom property; if the property is NULL, fall back to a
        -- hardcoded 16-hour delay.
        (
            DATEDIFF(HOUR, MAX(p.EventTime), GETDATE())
                > ISNULL(MAX(Nodes.n_hrs_DelayNxtTrigrStatus), 16)
        )
        OR
        -- No event has been logged for the above Message pattern.
        (MAX(p.EventTime) IS NULL)

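
To see how the HAVING clause behaves without touching the Orion database, here is a small sketch that runs the same comparison against made-up data. The table variables, node names, timestamps and event text are all hypothetical; only the join, the LIKE pattern and the DATEDIFF/ISNULL check mirror the condition above. The `n_mute` filter is left out to keep it short.

```sql
-- Hypothetical stand-ins for the Orion Nodes and Events tables (sample data only).
DECLARE @now datetime = GETDATE();

DECLARE @Nodes TABLE (
    NodeID int,
    Caption varchar(50),
    Status int,
    n_hrs_DelayNxtTrigrStatus int NULL
);
DECLARE @Events TABLE (
    NetObjectID int,
    NetObjectType varchar(10),
    EventTime datetime,
    Message varchar(200)
);

INSERT INTO @Nodes VALUES
    (1, 'rtr-quiet',   2, 4),     -- last reset 10h ago, delay 4h    -> alerts
    (2, 'rtr-noisy',   2, 24),    -- last reset 10h ago, delay 24h   -> suppressed
    (3, 'rtr-new',     2, NULL),  -- no reset event ever logged      -> alerts
    (4, 'rtr-default', 2, NULL);  -- last reset 10h ago, default 16h -> suppressed

-- The message text is invented; it only needs to match the LIKE pattern.
INSERT INTO @Events VALUES
    (1, 'N', DATEADD(HOUR, -10, @now), 'Rule: "- Node Status" | Reset: rtr-quiet'),
    (2, 'N', DATEADD(HOUR, -10, @now), 'Rule: "- Node Status" | Reset: rtr-noisy'),
    (4, 'N', DATEADD(HOUR, -10, @now), 'Rule: "- Node Status" | Reset: rtr-default');

SELECT n.Caption, n.NodeID
FROM @Nodes n
LEFT JOIN @Events p
       ON p.NetObjectID = n.NodeID
      AND p.NetObjectType = 'N'
      AND p.Message LIKE '%Rule: "- Node Status" | Reset:%'
WHERE n.Status = 2
GROUP BY n.Caption, n.NodeID
HAVING DATEDIFF(HOUR, MAX(p.EventTime), @now)
           > ISNULL(MAX(n.n_hrs_DelayNxtTrigrStatus), 16)
    OR MAX(p.EventTime) IS NULL;
-- Returns rtr-quiet and rtr-new.
```

Changing the delay value on `rtr-noisy` from 24 to something below 10 makes it show up as well, which is the per-node tuning the custom property is meant to give you.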
Because of how this trigger logic works, if the device stays down for longer than 16 hours the alert will reset regardless of node status, because the 16 hours have passed and the condition resets. So for the node status alert, the reset condition I use checks the actual node status instead.
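
A minimal sketch of a status-based reset condition written as custom SQL, assuming the usual Orion status values (1 = Up, 2 = Down). Treat it as an illustration of the idea rather than the exact condition from the original alert.

```sql
-- Sketch of a reset condition that looks only at the node's real status.
-- Assumption: Status = 1 means Up, matching Status = 2 (Down) in the trigger.
-- As with the trigger SQL, comment out the SELECT line when pasting into the alert.
SELECT Nodes.Caption, Nodes.NodeID FROM Nodes
WITH(NOLOCK)
WHERE Nodes.Status = 1   -- reset only once the node is actually reporting Up
```

Because this check ignores the Events table entirely, the delay in the custom property cannot cause a reset on its own; it only holds back the next trigger.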
For this to work the alert has to generate an NPM event. The text of that event is what you are looking for in the Message field (the p.Message LIKE pattern in the SQL).
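
Before relying on the pattern, it is worth confirming that matching events really are being logged. A quick sketch of a query you could run against the Orion database for that, reusing the Events columns and the LIKE pattern from the trigger SQL; the TOP 20 limit is arbitrary.

```sql
-- Sketch: show the most recent node events matching the alert's reset message.
-- If this returns nothing, the trigger's LEFT JOIN will never find an event
-- and the delay will never be applied.
SELECT TOP 20 p.EventTime, p.NetObjectID, p.Message
FROM Events p WITH(NOLOCK)
WHERE p.NetObjectType = 'N'
  AND p.Message LIKE '%Rule: "- Node Status" | Reset:%'
ORDER BY p.EventTime DESC;
```

If your alert has a different name, the rule name inside the quotes in the pattern would need to change to match.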
That's pretty much it. Once you set a value in the custom property 'n_hrs_DelayNxtTrigrStatus' on the node details, the alert will check the events before triggering in future.
Bear in mind that this article only shows node status. You might have 15 alerts, so you will need 15 custom properties, one for each alert, each with the above logic checking for the last NPM event generated by that alert.
I would like to thank alexslv for his report "Repetitive Email Alerts (Noise) - MUST HAVE REPORT!!!", because this was for me the next logical thing when finding noise lol
Please let me know if you have any issues/questions. I had to trim a lot of the alert I use, but it runs as expected.
thanks,
Dan