More Problems with Graphing and Alerting

[Deleted User]
Hi,

I am again having problems with the graphing and alerting features of ServersCheck...



Today the rules show that several checks on main infrastructure servers are down, and have been since 3:30am! - All showing as red!



I check the servers, however, and all is running fine.



I return to ServersCheck, open each rule and press the 'Test Settings' button... I get a status return of 'OK'.



I check the Rules View again. If the servers did indeed have a problem between 3:30 and 3:40am (unlikely, as I've checked logs and network devices), a further check should have run since then and they should show as OK.



I then checked the log file for the rule on the ServersCheck server. It shows that, right up until a few minutes ago, the rule IS being executed but returns a 'DOWN' error.



Now I'm confused.



I've not restarted the server yet (which normally resolves these types of issues) in case you want any logs etc., but I have noticed there are many instances of memory_check.exe running in Task Manager on the ServersCheck server.

Also, there are duplicate processes for monitoring_rule.exe.



I can't see why the rules return a status of OK when tested while the Rules View (currently) shows them as down, and on top of all this I haven't received a single alert to warn me of the problems.



I have a document with screenshots if it helps...?

(ServersCheck Enterprise Ver. 5.10.5)





Thank you

Comments

  • Administrator
    The issue is indeed related to the abnormally high number of memory_check.exe threads.

    The reason it is not alerting is that ServersCheck had a problem obtaining results and did not receive an error from the check itself, so it assumes an internal issue unrelated to the remote machine. Sending a DOWN alert in such a circumstance would be incorrect.



    I have asked development to investigate with high priority why the thread does not close by itself but remains open (probably a Windows dialog box waiting for input, but invisible because it is launched by a service).



    Please stop/restart the ServersCheck Monitoring service (this will normally also kill all open threads of memory_check.exe).



    You have multiple instances of monitoring_rule.exe because you are using an Enterprise version of ServersCheck.
  • [Deleted User]
    OK - I've restarted my server and all rules have (as expected) returned to Green, and are showing OK.



    Annoyingly for me, the graphs now show each critical server as being down between 3am and 10am this morning, and our service level has dropped to 99.88% - when in fact there was no problem! :(



    Can you let me know the outcome from Development as I really need to prevent this from happening in future.



    Thanks
  • Administrator
    We have sent you an update for memory_check.exe.



    It is possible to amend the stats and graphs to take that outage time out of the reports.
  • [Deleted User]
    Hi,

    Updated file received. I renamed the old file and copied the new one to the Agents subdirectory.



    Restarted the server so all services have now restarted too.



    How do I amend the graphs and stats to remove the unexplained downtime?



    Thanks for the prompt assistance and update by the way.
  • Administrator
    Stats: open the related dat file (in the data subdirectory) with Notepad. SLA is calculated as down checks over total checks. Modifying those numbers will result in a different downtime figure.
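
As a rough illustration of the "down checks over total checks" arithmetic, here is a minimal Python sketch. It assumes the reported service level is 100% minus the down-check ratio; the function name and the counts are hypothetical, chosen only so that the result matches the 99.88% quoted earlier in the thread. The actual layout of the dat file is not described here, so the sketch works on plain numbers rather than on the file.

```python
def sla_percent(down_checks: int, total_checks: int) -> float:
    """Service level in percent, assuming SLA = 100 * (1 - down / total)."""
    if total_checks <= 0:
        raise ValueError("total_checks must be positive")
    if not 0 <= down_checks <= total_checks:
        raise ValueError("down_checks must be between 0 and total_checks")
    return 100.0 * (1.0 - down_checks / total_checks)


# Hypothetical counts: 12 checks recorded as down out of 10,000 in total.
before = sla_percent(down_checks=12, total_checks=10_000)
# The same period after the false DOWN results are taken out of the count.
after = sla_percent(down_checks=0, total_checks=10_000)

print(f"before: {before:.2f}%  after: {after:.2f}%")
```

With these made-up counts, lowering the down-check number is what moves the percentage back up, which is the effect described above.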



    Regarding the graphs: you will need to manually update the RRD files. For this please read the RRD manual available here:

    http://people.ee.ethz.ch/~oetiker/webtools/rrdtool/
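
One common way to hand-edit an RRD file is to export it to XML with `rrdtool dump`, change the rows in a text editor or script, and rebuild it with `rrdtool restore`. The Python sketch below only covers the middle step, and it rests on assumptions: that each dumped row looks like `<!-- date / unix_timestamp --> <row><v>value</v></row>`, and that marking the affected samples as unknown (NaN) is an acceptable way to take the false outage out of the graph. The function name and the sample data are hypothetical.

```python
import re

# One data row of a dumped RRD, assuming the layout:
#   <!-- date / unix_timestamp --> <row><v>value</v>...</row>
ROW = re.compile(r"(<!--[^>]*?/\s*(\d+)\s*-->\s*<row>)(.*?)(</row>)")
VALUE = re.compile(r"<v>[^<]*</v>")


def blank_window(xml_text: str, start: int, end: int) -> str:
    """Set every value whose row timestamp lies in [start, end] to NaN."""
    def fix(match: re.Match) -> str:
        timestamp = int(match.group(2))
        if start <= timestamp <= end:
            blanked = VALUE.sub("<v>NaN</v>", match.group(3))
            return match.group(1) + blanked + match.group(4)
        return match.group(0)  # outside the window: leave the row untouched

    return ROW.sub(fix, xml_text)


# Hypothetical two-row excerpt of a dump file.
SAMPLE = (
    "<!-- 2009-01-01 03:30:00 UTC / 1230780600 --> "
    "<row><v>0.0000000000e+00</v></row>\n"
    "<!-- 2009-01-01 03:35:00 UTC / 1230780900 --> "
    "<row><v>1.0000000000e+00</v></row>\n"
)

print(blank_window(SAMPLE, 1230780600, 1230780600))
```

Work on a copy of the RRD file, and check the dump layout of your own files against the RRD manual before relying on the pattern above.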
  • [Deleted User]
    Hi,

    This morning I have the exact same problems as yesterday!!



    * ServersCheck showing checks as down, even though they are NOT down.



    * On running the 'Test Settings' - I get a status of OK.



    * No alerts received



    * Many Memory_Check.exe processes shown in Task Manager.



    The updated file you sent yesterday has not resolved the problem.
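
The build-up of memory_check.exe processes listed above can also be spotted without opening Task Manager. The following is a minimal Python sketch, assuming the process list comes from the Windows command `tasklist /FO CSV /NH`; the function name and the sample text are hypothetical and only stand in for real output.

```python
import csv
import io
from collections import Counter


def count_processes(tasklist_csv: str) -> Counter:
    """Count running processes by image name from tasklist CSV output."""
    counts = Counter()
    for row in csv.reader(io.StringIO(tasklist_csv)):
        if row:  # skip blank lines
            counts[row[0].lower()] += 1  # first column is the image name
    return counts


# Hypothetical excerpt in the shape of `tasklist /FO CSV /NH` output.
SAMPLE = (
    '"memory_check.exe","1204","Services","0","5,120 K"\n'
    '"memory_check.exe","1388","Services","0","5,096 K"\n'
    '"memory_check.exe","1520","Services","0","5,132 K"\n'
    '"monitoring_rule.exe","2044","Services","0","8,904 K"\n'
)

counts = count_processes(SAMPLE)
print(counts["memory_check.exe"], "instances of memory_check.exe")
```

On the monitoring server itself, the input text could be captured with `subprocess.run(["tasklist", "/FO", "CSV", "/NH"], capture_output=True, text=True).stdout` and the count compared against whatever number is normal for the installation.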
  • [Deleted User]
    Further to my last post...



    It is only the 'SERVICES' checks I have running that are affected. - All other checks appear to be fine.



    Every SERVICES rule is showing as Down, even though they are fine, and all seem to have gone down around 7am (UK time).



    Does ServersCheck do anything around 6-8am each morning?



    Thanks
  • Administrator
    ServersCheck has no built-in task that runs at a specific time of day.



    Could you run the Monitoring Service in interactive mode (meaning changing its settings to allow it to interact with the desktop) so that we can see which dialog boxes are shown by memory_check.exe?



    At the point where you see such a dialog box, I would also need all the *.conf and *.log files in the agents subdirectory.
  • [Deleted User]
    OK - I have changed the ServersCheck Monitoring Service logon properties to use the local system account and allow interaction with the desktop.



    The service has been restarted - although all the SERVICES checks still show as down, and lots of memory_check.exe processes remain.
  • Administrator
    Topic continued as per customer request in helpdesk.



    Resolution will be posted in this forum.
  • Administrator
    A new build of s-service.exe was sent to the customer.



    This provides a workaround that tracks hanging threads of memory_check.exe, which appear to cause the reported issue.



    Waiting on feedback.
  • [Deleted User]
    Hi,

    I replaced the file with the new one emailed to me and restarted the server...



    ..that was on Friday, today is Tuesday and I have had NO false readings!!



    Also, there are no memory_check.exe processes running in task manager.



    Something in that new file has solved the problem, thanks for your help.
This discussion has been closed.