How-To: Policy Server Trace Analysis Tool

Greetings: This article is an introduction to the Policy Server Trace Analysis Tool, which I feel many people in the SiteMinder community are still not aware of.  Just to preface, the tool was created by Mark O'Donohue, and is maintained in CA SiteMinder Support.  This is a high level overview on how to get started with this really great utility, and is not intended for those who already have significant experience with it.

First, I would like to point out that Mark has an instructional Youtube video on how to run the trace analysis at the following link: http://www.youtube.com/watch?v=LEYhw5g-Yog

There is also a discussion forum on the CA Communities page where you can ask Mark and other SiteMinder Support engineers questions on the utility.  Navigate to https://communities.ca.com/web/ca-identity-and-access-mgmt-distributed-global-user-community/message-board/-/message_boards/view_message/97560186?&.

The above link is also where you can download the latest binary.  Look for 'SMPolicyTraceAnalysis-1_5_0-BETA-349_bin.zip'.  Click to download

folder

Unzip the folder, and launch 'run.bat' to execute the program:If you would like to customize the JVM at all just modify the run.bat file and add any parameters you want.  You can also increase the JVM max size to accommodate memory needs.  By default this version starts with 1024megs, which should be sufficient.

run_bat

The drop down menu at the top allows you to choose what type of trace analysis you would like to perform.  We are going to be covering the Policy Server Trace Analysis portion, but you can also perform analysis on smps logs if they contain stats output, and you can perform analysis on the web agent trace logs as well.

ready_to_run

 

select_log_dir

Under 'Input Dir' click the 'Select' button and browse to the directory containing the policy server trace files you would like to use. In this case I put the smtrace logs I wanted to analyze under a directory called test:

 

 

When back on the main screen there is a button called 'Edit more Report Options'.  This gives you additional options to enable:

edit_more_report_options

  • Lock detect looks for deadlocks occurring because threads are waiting on each other for actions to be performed.
  • EnqueuAndProcessRequest just shows transaction times from the time the request was loaded into the message queue to the time the process was completed.
  • EnqueueAndHighPriRequest show same thing for high priority agent requests.
  • The JVMRequest option would only be applicable if you are running things that require a JVM such as Federation or Custom Java Code.
  • Click 'ok' to save

 

The two last options on the main screen are called Graph Timestamp, and Extract Transactions Over.  These are important. By default, graph timestamp is set to 1000 milliseconds (1 second).  This just means that the graph points for displaying data will be set every one second.  This can cause problems if your trace files span a long period of time such as days or even weeks.  If that is the case you may need to increase this number to a minute or so (in milliseconds) depending on the time period or else there will be too many graph points, which can cause slowness or even failure of the utility to run.

By default, 'Extract Transactions Over: ' is set to 10000 milliseconds (10 seconds).  All this means is that any transactions that take longer than 10 seconds to process are broken down with additional detail in the report.  You might want to lower this to 1000 ms to capture more potentially unhealthy transactions depending on your environment.

If the utility thinks either of these settings are going to be problematic it is pretty good about telling you. Below is a warning the utility displays if there is a large gap between files in the form of minutes:

error_timegap_files

If there is a large gap in the form of days it will warn you to fix the log files before proceeding:

error_large_gap_between_files

 

The first section of the report is the ProcessRequest section.  This represents statistics around processing the entire request.  Meaning, it tracks the whole request from the time it was dequeued to the time it was completed.  The whole transaction includes Authentication, Authorization, LDAP queries, formatting and returning of attributes, etc… When all that has been processed for one request it is considered complete.  This is pretty straightforward and shows the total number of requests, the amount of requests processed per second, and the average time taken to process a single request.

It is pretty easy to identify major performance degradation right away simply by the ProcessRequest throughput. If requests per second are low, and average transaction time is high then performance is poor.  If requests per second are high, and average transaction time is low then performance is healthy.

If you look at the CA Community Forum there is a discussion about establishing a performance baseline for your environment.  There is no exact number that defines whether your environment is operating at ideal performance, or when it has crossed the line to being problematic.  This needs to be determined on an individual environment basis.  I would recommend running the tool at busiest times of day to gather data and determine a healthy baseline for system performance.  You could then deduce a target threshold to identify when things may be becoming problematic.  This allows you to be proactive, and if/when a problem does occur you will know what to compare the numbers to.

One of the most important graphs in the report is the chart showing the top 25 functions where worker threads spent the most time while processing requests.  Again, this is pretty self explanatory, but it is important to understand the big picture.  The whole point of this utility is to break down performance and throughput, and to identify any potential bottlenecks in the policy server performance.

In the example below, load is very light and things look exactly as they should.  Most thread time is spent on authentication, authorization, and then a mixture of other functions.

[This was taken from a small test environment using scripts to produce load]

top_25_func_procRequest

A very common bottleneck that most people experience with SiteMinder at some point is Policy Server degradation caused by backend LDAP slowness.  In that scenario, you will most likely see the top functions being LDAP related.  Below shows Report data taken from a real environment experiencing LDAP issues. Worker threads spend the majority of their time in the SmDsLdapConnMgr (LDAP connection manager).  This would definitely indicate connection issues with LDAP, and it could be caused by several things such as a shortage of connections, network issues, etc.  Further deduction can be made based on data from the LDAP Request section.

slow_ldap_25_func

If you selected 'Lock Detect' under 'More Report Options' you will see another graph under section 1 called 'LockDetect'.  It takes the most heavily called functions, which we saw in the top 25 chart explained above, and it shows how many threads on average are simultaneously trying to process the same part of the transaction. Basically, it just shows which functions are consuming the most thread time.  A detailed description is given by the utility below the graph.

LockDetect

 

Section two focuses solely on Authentication Requests: AuthRequests.  Again, this is pretty straightforward, and shows statistical information regarding the Auth portion of requests. IE, how many Auth calls were made, throughput (Auth trans per second), and average transaction time.

Poor throughput on Auth transactions could be the result of LDAP slowness, user directory object configuration, search lookups, etc. Further deductions can be made in LDAP request section.  If Auth throughput is low but LDAP throughput looks normal then the problem is most likely on the SiteMinder end - If LDAP request throughput is low then problem is most likely on directory end.

Authentication requests can also be decreased by adjusting Web Agent User cache settings.  If a user already has a session and can be authenticated from cache the Agent will not even send an Auth request to the Policy Server.

 

Section three focuses solely on the Authorization portion of requests: AzRequests. IE, how many Az calls were made, throughput (Az trans per second), and average transaction time.  If throughput is poor on Az transactions it means you may need to review the design of your policies and possibly cache settings.

The High Priority Connection Request section shows stats for incoming Agent connections.

If LDAP slowness or another performance bottleneck has caused the policy server to fall short of it's throughput threshold, then you may see poor statistics in this category as Agents time out and try to reconnect.  Remember that the high priority thread pool is dedicated to incoming Agent connection requests.

The LDAP Requests section contains statistics pertaining to LDAP transactions only..

As I mentioned above, this section also contains a chart showing the top 25 functions where threads spent the most time.  These are for LDAP queries only.  If there is an issue with LDAP performance you may be able to deduce the problem from the function/s consuming the most time.

Below is what this looks like in the environment with the LDAP issue:

top_25_ldap_unhealthy

As you can see, the Policy Server spent over 104 seconds issuing the LDAP search ‘objectclass=*’.  This search is conducted by the Policy Server Ping Connection to monitor the health of the directory.  It sends a very basic search, and if it gets a response, then the directory is considered healthy, so the Policy Server will continue to send requests there.  Here we can see the ping connection is being issued heavily, and/or appears to be getting stuck awaiting responses.  Either way, this chart allows you to identify exactly where the problem is.

The report contains many more additional graphs, which I will not go over in detail.

LDAPWait: Contains transactional data around LDAP requests waiting to be processed.

LDAPRequestPlusWait: LDAP Request + LDAP Wait Time

SQLRequests: Only pertinent if using RDBMS for stores.  Provides transactional data around SQL Requests

EnqueuAndProcessRequest: This is optional and records transaction from the time it was first received by the policy server and enqueued to the time it was processed completely.

Normal and High Priority Queue Wait Time: Portrays data around transactions waiting to be picked up

Queue Depth: Data for High Priority, and Normal Priority Queue depths.  Spikes in queue depths can indicate bottlenecks

Concurrent Reports: These graphs average concurrent threads per second, and show how many threads are working simultaneously at a given time. I had max threads set to 6 in my environment so you can see the average count hovers right below 6 while the policy server is busy.

concurrent_threads

Concurrent Auth Threads hovers around 3, and concurrent Az threads hovers around 2.  That makes sense as the worker threads are spending most of their time processing Auth/Az, which leaves another 1-2 threads simultaneously working on the remaining transactions.

concurrent_threads_auth

concurrent_threads_az

 

Lastly, I would like to show how the Report breaks down any requests that exceed the Grace Time setting.  In this scenario I dropped the Grace Time to 1000 milliseconds from 10000.  As you can see the Report highlights transaction length by color.  Transactions over 10 seconds in red, over 5 seconds in orange, etc.

slow_transaction

The report highlights the sections of the log where the slow transactions occurred.  As you can see these slow transactions are LDAP queries.

slow_transaction1

I hope this helps provide a little insight on how you can benefit from this utility in your own environment.  Please reply to the post if you have any questions, or you can contact Mark on the CA Communities forum mentioned above.  There are times where you may not be able to determine exactly where the problem is, but the Reports should be able to at least point you in the right direction, and they should be able to provide some potentially invaluable insight on performance of your SiteMinder environment.