By Steve Parsons
Introduction
APM services have traditionally been difficult to configure optimally. Over time, an initially clean APM service configuration tends to accrue more and more services, until 20-30 is not uncommon. Such a layout carries a high level of operational risk and makes it difficult to identify old, redundant configuration that could be removed. Performance also suffers, resulting in slower data delivery times.
This article explains why the problem occurs and what to consider when rectifying it. Some familiarity with APM is assumed.
Why does the problem occur?
An initial APM configuration tends to have a handful of services. Two typical layouts exist.
- APM services by desk
This layout has the advantage of being intuitive for support staff: if a desk requires a change in configuration, it is easy to identify the services affected. It also allows desk-specific scheduling of APM server processes (e.g. American desk services can be scheduled at different times from European desk services). The disadvantage is that, as new desks or users are onboarded, the number of services tends to increase rapidly.
- APM services by Package (a package maps to a simulation result)
This layout allows a similar group of simulation results (e.g. delta and MTM) to be run together over all the portfolios of interest. Fewer services are configured initially. However, if there are a lot of portfolios, this layout can be more cumbersome and slower than a desk-based layout. For organisations with multiple trading desks, each with their own support staff, it can also be unclear who owns and is responsible for maintaining shared APM services.
There are also combinations and variations of these layouts.
In both cases, as time goes by and new businesses or users join, new services are added to avoid “destabilising” existing services. In a dynamic environment, a large number of services can quickly build up. Knowledge of why services exist is quickly lost and new support staff don’t want to risk breaking something by reconfiguration. It is not easy to get insight into how the APM services are configured via the Services Manager, and the APM Data Source Monitor only provides limited information.
There are several other factors that make adding more services an attractive option.
- Lack of grid licenses. To generate the initial baseline of data every day, a “batch” process has to be run to create the simulation results that APM requires. This typically happens as part of a “Start of Day” (SOD) process, which occurs after the date has been rolled. The process loops over the selected portfolios and generates the data for each in turn. Depending on which simulation results are being used (e.g. delta or Power positions), this can take a substantial amount of computation time. To shorten the calculation time, several APM customers without grid licenses split their APM services into multiple smaller services that run on different appservers. This allows the calculations to be parallelised, but makes more services inevitable.
- The version of Endur/Findur being run. In V12.1 and below, the batch processes can only be split across a grid within a portfolio. If the customer only has a few large portfolios this is a good splitting strategy, but if there are many small portfolios it is suboptimal. In V12.2+ the batch can be split within portfolio or by portfolio for a service, but not both within the same service. To work around this limitation, customers split the APM services into “large” and “small” services, allowing different splitting strategies to be used. In V14.1 and up it is possible to configure an APM service to split by portfolio and also within portfolio inside the same service, so this workaround is no longer necessary (although a custom splitter script is sometimes still required for absolutely optimal performance).
- Simulation result parameters. The physical packages in APM (the Gas package, which covers commodities, and the Power package) can potentially generate large numbers of rows of simulation results. To optimise these results, Openlink provides simulation result parameters that allow customers to generate data only for limited time ranges at the granularity required, e.g. hourly for 3 days, daily for 6 months and monthly for 3 years. However, depending on the desk and market, different granularities of data may be required for different timelines. This forces support staff to add more desk-based services.
- Shared H/W versus dedicated H/W. The standard Openlink recommendation is for APM services to be run on dedicated H/W for the time period during which they need to be online. This recommendation is in place to stop other processes interfering with (delaying) APM and vice versa. However, real-world considerations often mean that the H/W is shared intraday. The H/W is also often shared with end-of-day (EOD) processes, i.e. it is handed over to the SOD processes only once the EOD has finished. If the EOD runs late, the SOD starts late and support staff are under pressure to meet the SOD SLA. Support staff will then split the services up so that the important portfolios can be run. Those additional services tend to hang around in case the problem occurs again, and never get cleaned up.
- Intraday Market Data Updates. If a customer has an intraday universal price upload mechanism, the APM batches need to be rerun for price-sensitive portfolios and results (e.g. delta on options books). It is imperative that the batch runs as quickly as possible once the prices have been uploaded. This encourages a layout where the price-sensitive portfolios are split out onto their own services. Queries are also sometimes specified on the services to minimise the number of deals that have to be rerun.
- Deal update performance. The Openlink APM delta result (based on Tran gpt delta by leg) can take a significant amount of time to calculate. Sometimes it makes sense to isolate delta (or other computationally heavy calculations) onto their own services to ensure that deal updates can process as quickly as possible.
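To give a feel for why the simulation result parameters mentioned above matter, the following sketch counts the rows generated per delivery point for the example granularities (hourly for 3 days, daily for 6 months, monthly for 3 years) against a flat hourly horizon of 3 years. It is a rough illustration only: it assumes 30-day months, counts each time range in full, and the function names are invented for this example rather than taken from any Openlink API.

```python
# Rows generated per delivery point for one simulation result, under two
# parameter choices. All figures are illustrative; 30-day months are assumed.
HOURS_PER_DAY = 24
DAYS_PER_MONTH = 30  # simplifying assumption
MONTHS_PER_YEAR = 12


def tiered_rows(hourly_days, daily_months, monthly_years):
    """Rows when granularity coarsens with distance from today."""
    hourly = hourly_days * HOURS_PER_DAY
    daily = daily_months * DAYS_PER_MONTH
    monthly = monthly_years * MONTHS_PER_YEAR
    return hourly + daily + monthly


def flat_hourly_rows(years):
    """Rows when the whole horizon is generated at hourly granularity."""
    return years * MONTHS_PER_YEAR * DAYS_PER_MONTH * HOURS_PER_DAY


tiered = tiered_rows(hourly_days=3, daily_months=6, monthly_years=3)
flat = flat_hourly_rows(years=3)
print(f"tiered: {tiered} rows, flat hourly: {flat} rows, ratio: {flat // tiered}x")
```

Under these assumptions the tiered parameters produce 288 rows against 25,920 for a flat hourly horizon. A desk that needs hourly data further out loses much of that saving, which is why differing requirements tend to push support staff towards separate desk-based services.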
All of the above factors can (and usually do) result in an accumulation of APM services. At some point someone on the support desk will attempt to clean this up, only to hit the next problem: it is very difficult to identify which services are still useful. The reason for this is twofold:
- The APM pages (which consume the data from the services) can be held as pages in the Windows file system or in the Openlink database file system. In either case, users can create their own variants of pages and store them in their own folders (assuming the APM security privileges have not been locked down). Gathering a complete set of used pages is therefore difficult, as it relies on traders actually giving the pages to support staff.
- The APM page file format is an XML format. This makes it easy to read but some analysis has to take place to generate information about which data the pages are actually consuming, and to then compare it to the datasets being generated by the various APM services.
The effort involved in this cleanup is often not deemed worthwhile and so the situation is allowed to continue.
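The comparison described above, between the data that pages consume and the datasets that services generate, can be sketched in a few lines. The example below is a minimal sketch under loud assumptions: the real APM page schema is not described in this article, so the `<dataset>` element, its `package` and `portfolio` attributes, and the service inventory are all hypothetical stand-ins.

```python
import xml.etree.ElementTree as ET

# Hypothetical page format: the real APM page schema is not shown in this
# article, so the <dataset> element and its attributes are invented here.
SAMPLE_PAGES = {
    "gas_desk.xml": '<page><dataset package="Gas" portfolio="UK Gas"/></page>',
    "power.xml": (
        '<page><dataset package="Power" portfolio="DE Power"/>'
        '<dataset package="Gas" portfolio="UK Gas"/></page>'
    ),
}

# Hypothetical service inventory: service name -> (package, portfolio) pairs
# that the service generates.
SERVICES = {
    "APM_Gas_UK": {("Gas", "UK Gas")},
    "APM_Power_DE": {("Power", "DE Power")},
    "APM_Gas_Legacy": {("Gas", "NL Gas")},
}


def datasets_consumed(pages):
    """Collect every (package, portfolio) pair referenced by any page."""
    consumed = set()
    for xml_text in pages.values():
        root = ET.fromstring(xml_text)
        for node in root.iter("dataset"):
            consumed.add((node.get("package"), node.get("portfolio")))
    return consumed


def unused_services(services, consumed):
    """Services whose generated datasets are not referenced by any page."""
    return sorted(
        name for name, produced in services.items() if not (produced & consumed)
    )


print(unused_services(SERVICES, datasets_consumed(SAMPLE_PAGES)))
```

A result is only as good as the page set fed in: a service flagged as unused may simply be feeding a page that a trader never handed over.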
Recommended APM Service Layout
A package-based service layout is recommended in V14.1 and above. This should be split into physical services and financial services. It is currently optimal to separate the Gas and Power packages onto their own services, as this results in improved performance (memory consumption). The other packages can be combined into single services. The ability to configure the APM service to split within or by portfolio allows a much greater degree of flexibility than was previously available.
A package based service layout is also recommended in V12 and lower – but with caveats. The reason for the caveats is that the grid splitting strategy required for different types of portfolios may result in the ideal APM services having to be subdivided to allow this flexibility.
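Why different portfolio shapes call for different splitting strategies can be seen with a toy model. The sketch below is not how the Openlink grid schedules work: it assumes that splitting by portfolio assigns whole portfolios to nodes greedily (longest first), that splitting within portfolio divides each portfolio evenly over all nodes, and that each split carries a fixed overhead. The runtimes, node count and overhead are invented figures.

```python
import heapq


def by_portfolio_time(runtimes, nodes):
    """Whole portfolios are assigned to the least-loaded node, longest first."""
    loads = [0.0] * nodes
    heapq.heapify(loads)
    for runtime in sorted(runtimes, reverse=True):
        lightest = heapq.heappop(loads)
        heapq.heappush(loads, lightest + runtime)
    return max(loads)


def within_portfolio_time(runtimes, nodes, overhead):
    """Portfolios run one after another, each split evenly over all nodes.

    `overhead` is an assumed fixed cost per portfolio for splitting the work
    and collating the results.
    """
    return sum(runtime / nodes + overhead for runtime in runtimes)


NODES = 4
OVERHEAD = 20  # seconds, hypothetical

few_large = [600, 600]   # two portfolios of 10 minutes each
many_small = [30] * 40   # forty portfolios of 30 seconds each

for label, runtimes in (("few large", few_large), ("many small", many_small)):
    by_pf = by_portfolio_time(runtimes, NODES)
    within = within_portfolio_time(runtimes, NODES, OVERHEAD)
    print(f"{label}: by portfolio {by_pf:.0f}s, within portfolio {within:.0f}s")
```

In this model the few large portfolios finish sooner when split within portfolio, because splitting by portfolio leaves half the nodes idle, while the many small portfolios finish sooner when split by portfolio, because the per-split overhead dominates. A service that can only use one strategy therefore suits one shape of portfolio at the expense of the other.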
Reconfiguration of an existing APM installation
The following steps are recommended:
- The first step in any remediation strategy is to have a test environment in place, which is as similar to production as possible in terms of APM service hardware. Then baseline existing performance in that environment. This permits performance verification when configuration changes are made, before they go into production.
- Identify which pages and APM services are actually required. Openlink does provide some useful utilities which can be built upon to analyse an existing APM service configuration.
  - Identification of used pages
    - Whenever a page is sent online it is stored in a table in the Openlink schema, along with the time it was sent online and the user. These pages and associated information can be exported and then analysed.
  - “AnalyseConfiguration” utility
    - This takes as inputs a range of pages on the file system (ideally the exported pages) and the APM services. It compares the pages against the APM services to identify inefficiencies in the APM configuration. The utility is primitive but does provide useful information.
  - Batch statistics
    - The APM Data Source Monitor can be used to export statistics about the performance of the APM batches. The information is raw, but can be useful when thinking about whether additional services are really required.
  - Running subsets of portfolios
    - It is possible to rerun the batches for subsets of the portfolios on an APM service via APIs. This eliminates the need to split services up just to gain finer control over the scheduling of a batch.
- Make sure any identified APM service redundancies are eliminated. This will help the performance of the existing installation and clear up the noise caused by old services.
- The last step is to analyse the services that are left and combine them where commonalities exist. For example, if the same packages are being run across multiple services with very little difference in the service criteria, it makes sense to merge them (as long as the overall batch time is not excessive when tested). Multiple iterations of this task will often take place in the test environment as batch performance is verified and tweaked. Deal update performance should also be checked. Sizing of computation grid clusters to ensure optimal deal update performance will be tackled in a subsequent article.
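The final merging step can be started from a simple grouping of the remaining services. The sketch below assumes a hand-built service inventory, since no export format is described here: the service names, the `packages` field and the `granularity` label are all hypothetical. Services that run the same packages with the same parameters are reported as merge candidates.

```python
from collections import defaultdict

# Hypothetical service inventory; names, packages and parameters are invented.
SERVICES = [
    {"name": "APM_Gas_UK", "packages": {"Gas"}, "granularity": "hourly-3d"},
    {"name": "APM_Gas_NL", "packages": {"Gas"}, "granularity": "hourly-3d"},
    {"name": "APM_Gas_US", "packages": {"Gas"}, "granularity": "daily-6m"},
    {"name": "APM_Fin_1", "packages": {"Delta", "MTM"}, "granularity": None},
]


def merge_candidates(services):
    """Group services that share the same packages and parameters."""
    groups = defaultdict(list)
    for svc in services:
        key = (frozenset(svc["packages"]), svc["granularity"])
        groups[key].append(svc["name"])
    # Only groups with more than one member are worth reviewing.
    return sorted(sorted(names) for names in groups.values() if len(names) > 1)


print(merge_candidates(SERVICES))
```

Each candidate group still has to be tested in the test environment, because a merged service is only acceptable if its overall batch time and deal update performance hold up.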
Summary
It is possible, without a complete reimplementation, to reconfigure an existing APM installation and eliminate the many services that tend to accumulate. It is also possible to build monitoring utilities that can be rerun regularly to identify inefficiencies that have been introduced. As requirements change, this allows the APM configuration to remain optimal.
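As one example of such a monitoring utility, the sketch below flags pages that have not been sent online recently, working from the exported page records mentioned in the reconfiguration steps. The row layout (page name, user, time sent online), the sample data and the 90-day threshold are assumptions made for illustration; the actual export format is not described in this article.

```python
from datetime import datetime, timedelta

# Hypothetical export rows: (page name, user, time the page was sent online).
EXPORT = [
    ("gas_desk.xml", "trader_a", datetime(2024, 3, 1, 8, 0)),
    ("gas_desk.xml", "trader_b", datetime(2024, 5, 28, 7, 30)),
    ("old_power.xml", "trader_c", datetime(2023, 11, 15, 9, 0)),
]


def stale_pages(rows, as_of, max_age_days=90):
    """Pages whose most recent online time is older than the cutoff."""
    latest = {}
    for page, _user, sent_online in rows:
        if page not in latest or sent_online > latest[page]:
            latest[page] = sent_online
    cutoff = as_of - timedelta(days=max_age_days)
    return sorted(page for page, seen in latest.items() if seen < cutoff)


print(stale_pages(EXPORT, as_of=datetime(2024, 6, 1)))
```

Pages reported as stale are candidates for review with their owners, and the services that only feed such pages are candidates for removal.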