OpenLink & Logging Hell (Part 1)


By Greg Eves


In this, the first of several articles concentrating on logging, I intend to highlight some of the pitfalls users often experience when dealing with logs in the context of the Endur/Findur application and provide some guidelines around the implementation of a logging framework to help you avoid falling into the same traps.

As most of those who support, develop or maintain an Endur or Findur instance will know, the platform has the ability to overwhelm and confuse its users with copious logs of varying size, quality and format, often output to different locations. This can unnecessarily protract issue investigation and, in more extreme cases, can lead to the curtailment of investigations because a true picture of events cannot be extracted from the logs.

Consider the following (if slightly extreme) scenario. Bob is a front-office user who runs a task in Endur every day to produce a critical report. It’s Friday and Bob’s report isn’t working. All we have to work with is a text editor, access to the log files and an angry Bob.


After establishing that the report is running locally on Bob’s session, a sensible place to start might be the core OpenLink error logs. Navigating to the reports directory yields the following picture.

Something is clearly wrong here. Bob happens to have the largest log file, but other users are also producing large log files. Even if we conservatively estimate that 1 MB represents around 7,500 messages, that equates to over 17 million messages in these four files alone.

We can estimate our log throughput, and in turn how much processing time might have been wasted generating these logs, by writing a simple JVS script that we know will generate errors in the core log. The method below creates a table and attempts to retrieve a cell that does not exist 10,000 times, causing 10,000 writes to the core log.

    private void invalid() throws OException {
        Table tblTest = Table.tableNew();

        SimpleDateFormat sdf = new SimpleDateFormat("HH:mm:ss.SSS");

        OConsole.message("Start - " + sdf.format(new Date()) + "\n");

        // Each iteration requests a cell that does not exist, forcing a write to the core error log
        for (int loop = 1; loop <= 10000; loop++) {
            tblTest.getInt(1, 1);
        }

        OConsole.message("End - " + sdf.format(new Date()) + "\n");

        tblTest.destroy();
    }

 

We can then take the start and end times over a number of executions to find our average throughput. Running the same script across a number of instances logging to a network location gives a real-world average execution time of around 5 seconds for 10,000 messages, or 0.5 milliseconds per message. If we instead direct the output to a local disk, the total execution time drops to less than half a second, or 0.05 milliseconds per message, making it comparable to logging frameworks such as Log4J or LogBack. If we then edit the script so that the ‘getInt’ call is wrapped in a check for cell existence, the execution time drops to an average of 5 milliseconds for the entire method. Incidentally, there is no noticeable performance difference between pre-checking for cell existence and retrieving a valid row from the table in the above example.
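For illustration, the pre-check might look something like the sketch below; it assumes the OpenJVS Table.getNumRows() and Table.getNumCols() methods and keeps the argument order used in the original snippet.

    // Only read the cell if it actually exists; no core error message is generated,
    // so nothing is written to the log.
    int row = 1, col = 1;
    for (int loop = 1; loop <= 10000; loop++) {
        if (row <= tblTest.getNumRows() && col <= tblTest.getNumCols()) {
            tblTest.getInt(col, row);
        }
    }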

We can take our analysis one step further by removing the pre-check wrapping and configuring OpenLink to suppress the core error message instead. With this configured, the execution time is now around 140 milliseconds for the entire method, suggesting a cost for core message suppression of around 135 milliseconds. Putting these figures together (roughly 4,500 of the 5,000 milliseconds disappear when logging moves off the network, a further 360 milliseconds when it moves off the disk, and around 140 milliseconds remain within the application), we can attribute 90% of the logging cost to network performance, 7% to disk and around 3% to the OpenLink application.

These numbers were gathered from averages taken across several instances of OpenLink of varying size, version and maturity; all instances produced similar results and there were no obvious anomalies in the data skewing the results one way or another. It is also worth mentioning that the intentionally generated error message (Invalid row on table…) is arguably one of the simplest for the application to handle: it involves no round-trips to the database and all processing is carried out in-session, which makes it a good candidate for this type of test, since almost all of the application processing will be attributable to logging. For other types of error, particularly those generated after making database calls or other more time-consuming operations, it is difficult to separate the logging cost from the operation cost.

What does this mean for Bob and the other users of the system? Based on the above figures, the generation of 17 million messages would have taken around 2 hours 20 minutes on average (17 million messages at 0.5 milliseconds each is roughly 8,500 seconds), with around 40 minutes of this attributable to Bob’s own log. Without examining the logs it’s difficult to say how noticeable this behaviour might have been to the user; if the situation has always been like this, or if the system has been slowly degrading over time, they may not have noticed at all and simply put it down to the nature of the system. It is, however, highly likely that addressing these issues at their source to eradicate them from the logs would yield a noticeable performance boost for the front-end user, often at a relatively low development cost.

The reduction in log generation is not only beneficial to the user (or server) at the time of execution; it can also help during investigation. Grappling with a large log file over a slow network can be extremely frustrating and time consuming, especially when the file is filled with little more than the noise generated by an ill-designed script or badly defined date sequence.

Returning to our investigation of Bob’s issue: after waiting 10 minutes for the log to load we finally have it open, and after some cross-checking against the task execution times in the database we eventually find the pertinent error.

Whilst the degree of effort required to reduce the volume of log messages is likely to be low (extending a date sequence or converting it to a formula, wrapping potentially invalid function calls in pre-checks, and so on), there is another option that can be changed quickly and may make opening a log file less painful. OpenLink provides the environment variable AB_FLOG_FILE_TRUNCATE_SIZE. This variable takes a number representing the size in kilobytes the log file can grow to before it is truncated, and it behaves in much the same manner as the rolling file appenders provided by many logging frameworks, appending a timestamp to the file name of the truncated logs. There is one caveat with this environment variable: the truncation only occurs if the log file is larger than the specified size when the master process starts. For application servers where daily restarts are common this is likely to be acceptable; however, if your users stay logged in all week it is unlikely to be of much benefit.

Now that we have found the core error message related to the failure of Bob’s task, it is probably a good time to check the logs for the task itself. Bob happens to work with a fairly mature instance of Endur, so there is a happy mix of AVS, JVS and OC code spread across the database and we need to establish which language the report is using; a quick look at the task configuration tells us the report is written in AVS. Our experience tells us that over the years four different developers each thought they could do a better job of writing a logging include script, and some developers didn’t even know there was a logging include script, let alone four. Unless we had prior knowledge of this task, we are unlikely to know where the log file will be output or in what format.

It is important to establish a set of rules that dictate how logging should be accomplished across all languages in use. The developer should be able to initiate logging with a single, simple method call, passing only the name of the process (class name / script name) and perhaps the type (ops service, task, UDSR etc.) if it cannot be inferred. The intricacies of the logging framework, such as output location, message format and file name, should all be taken care of by the framework itself and hidden from the developer.
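To make the idea concrete, a purely hypothetical JVS-side sketch might look like the class below. The class name, the AB_OUTDIR environment variable fallback and the message format are all illustrative assumptions, not an OpenLink API; the point is that the calling script only ever names the process, while the framework decides where and how messages are written.

    import java.io.IOException;
    import java.nio.file.Files;
    import java.nio.file.Path;
    import java.nio.file.Paths;
    import java.nio.file.StandardOpenOption;
    import java.time.LocalDateTime;
    import java.time.format.DateTimeFormatter;

    // Hypothetical illustration only: a thin logging wrapper that hides output
    // location, file naming and message format from the calling script.
    public final class ScriptLogger {

        private static final DateTimeFormatter TS = DateTimeFormatter.ofPattern("yyyy-MM-dd HH:mm:ss.SSS");
        private static Path logFile;
        private static String processName;

        // The single call a developer makes before logging; the process type could
        // also be passed (or inferred) in a fuller implementation.
        public static void init(String name) {
            processName = name;
            // The base directory is resolved here, not in the calling script.
            String base = System.getenv().getOrDefault("AB_OUTDIR", ".");
            logFile = Paths.get(base, "ErrorLogs", name + ".log");
        }

        public static void info(String message)  { write("INFO", message); }
        public static void error(String message) { write("ERROR", message); }

        private static void write(String level, String message) {
            String line = String.format("%s [%s] [%s] %s%n",
                    TS.format(LocalDateTime.now()), level, processName, message);
            try {
                Files.createDirectories(logFile.getParent());
                Files.write(logFile, line.getBytes(),
                        StandardOpenOption.CREATE, StandardOpenOption.APPEND);
            } catch (IOException e) {
                // A real framework would fall back to the console rather than fail silently.
                System.err.print(line);
            }
        }
    }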

In this case our developer has decided to use the NinjaLogging include script. We know that this include script uses the OpenLink output directory as a base and then selects a folder based on the name of the task; in this case the output ends up in %OUTDIR%\ErrorLogs\FO\ in a file named after the task.

OK, so the log fails to tell us what the script might have been doing when the script engine crashed; in fact it fails to tell us very much at all. But we can infer a few things from its contents. On 25th Feb the number of ‘started’ messages is equal to the number of ‘ended’ messages, so we can assume that under normal operation no other log messages are generated. Looking at the 25th Feb messages it also appears that the report takes about 30 minutes to complete. However, we should be cautious with this assumption: the execution that started at 09:30:34 was initiated before the execution from 09:05:12 had finished, and we can’t be 100% sure which of the ‘ended’ messages generated at 09:32:27 and 10:01:34 corresponds to which ‘started’ message. The fact that two executions were in progress simultaneously also suggests that more than one person is running the report, but of course we won’t know who unless we dig around in the database. We can also see that the process completed once on 26th Feb; at this point we should be asking what that user is doing differently from Bob that means the report completes for them, but not for him.

The above log file gives us the ability to make some assumptions about the failing process and generates some leads worth pursuing, but it could have given us a lot more with a few small tweaks, saving us a lot of time. The first issue is that individual executions cannot be traced; the addition of a unique ID prepended to each message from the same process would immediately solve this. The next issue is that there is no way to identify who has initiated the process or where it is running; if the user ID were also prepended to each message it would make investigation much easier. Further to this, we know that the process is succeeding for at least one other user but we don’t know why, so it might be a good idea to include any user-defined parameters as part of the log, as well as pertinent session details such as process ID/name, machine architecture and available memory. If we consider the way the core OpenLink logs are generated, at least the first two of these items are included: the log file itself is named after a user and the messages in the logs are always prepended with the module name, so we do not have to make the same assumptions we are forced to make here.
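As a small illustration of the kind of prefix described above (the way the execution ID and user are obtained here is hypothetical; a real framework would take the user and any parameters from the session), each execution could generate its own ID once and prepend it, together with the user, to every message:

    import java.util.UUID;

    // Hypothetical sketch: one ID per execution, prepended to every message along
    // with the user, so individual runs can be traced and attributed.
    public final class LogPrefix {
        private static final String EXECUTION_ID = UUID.randomUUID().toString().substring(0, 8);

        public static String decorate(String user, String message) {
            return "[" + EXECUTION_ID + "] [" + user + "] " + message;
        }
    }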

There is one final issue with our Important Report log: it fails to tell us anything useful about what the script is actually doing. Everything we have determined so far could have been established by other means, such as querying the OpenLink database to build a picture of task execution; the log should be able to tell us something new that we cannot find out elsewhere. An execution time of 30 minutes suggests the script is doing something resource intensive, and this is backed up by the initial ‘out of memory’ error in the core logs. 30 minutes is a long time to wait without any form of feedback, so it is a good idea to log the start and end of any long-running or resource-intensive methods in order to provide feedback to the user during execution and some clues as to the point of failure should there be any issues.
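Using the hypothetical ScriptLogger sketched earlier, bracketing a long-running step might look like the following; the report class and its loadPositions step are invented purely for illustration.

    // Hypothetical usage only: start/end messages around a resource-intensive step
    // give feedback during execution and narrow down the point of failure.
    public class ImportantReport {
        public void run() {
            ScriptLogger.init("ImportantReport");
            ScriptLogger.info("Loading positions - start");
            loadPositions();
            ScriptLogger.info("Loading positions - end");
        }

        private void loadPositions() {
            // the resource-intensive work would go here
        }
    }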

Whilst much of the above will seem like a lesson in the obvious, the issues described in this article are a common problem for many OpenLink users, often putting busy support teams under significant strain. It is also worth noting that without a clear, rules-based logging framework and a set of core logs that are free from noise, it is very difficult to extract performance statistics and utilise some of the more interesting log analysis tools that can assist with real-time error detection and longer term capacity planning.

The perfect logging framework is not something that can be implemented without up-front consideration and a degree of foresight as to what will be beneficial at a later date. With this in mind, perhaps the most important trait of any logging framework is its flexibility. If a change in output location or message format is required, it should not be necessary to change any code that makes use of the framework; only the framework itself should require change.

New OpenLink users, and those venturing into JVS or OC for the first time, should consider using industry-proven logging frameworks such as LogBack, Log4J or log4net. Java users can go even further and use a framework like SLF4J, which acts as a façade to the logging framework and allows the underlying framework to be chosen at deployment. These frameworks make use of file-based configuration to control output level, message format, output location and rolling strategy. They also allow for more advanced configuration, enabling multiple strategies to be implemented for different types of activity. They can also be extended to route messages to the Error Log Viewer and OListen, and to provide the ability to dynamically control logging configuration from inside the OpenLink application. When making use of the framework, developers should just need to initiate logging, stop logging and log messages at a specific level.
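As a minimal sketch of how little the calling code needs to know when a façade such as SLF4J is used (the class and messages below are invented; the binding and appender configuration are supplied separately at deployment):

    import org.slf4j.Logger;
    import org.slf4j.LoggerFactory;

    // Minimal SLF4J usage: the code only obtains a logger and logs at a level; output
    // location, format and rolling strategy come from the deployed configuration.
    public class ReportTask {
        private static final Logger log = LoggerFactory.getLogger(ReportTask.class);

        public void run() {
            log.info("Report task started");
            log.debug("Loaded {} positions", 12345); // parameterised message, only rendered if debug is enabled
            log.info("Report task ended");
        }
    }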

For heavy AVS users, particularly where multiple logging mechanisms exist, the transition will be a little more fraught, but it should be possible to align all existing logging mechanisms to a single framework with minimal functional impact. A framework should first be created (or an existing mechanism extended or enhanced) to provide functionality that matches, as far as possible, that provided by your JVS and/or OC framework(s). As with JVS/OC logging, it should hide the implementation from developers: the developer should just be able to start logging, stop logging and log messages at a specific level. Once in place, existing code can be transitioned to the new framework. Ideally this should be delivered in a single release to avoid the confusion that would likely ensue if the same were done over a longer period in a piecemeal fashion.

Once in place, the rules should be set and enforced; it is very easy to miss a rogue print statement introduced during development that should really be a proper log message or be removed altogether. In AVS there are a finite number of ways to log or print, so these can easily be checked as part of your build or release process. The same is true of JVS and OC, although the options available to the developer are more varied, so the process of adding checks to your build process will likely evolve over time.
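As a sketch of the kind of build-time check described above (the directory layout and the patterns flagged are purely illustrative; each team would substitute the print and logging calls it wants to outlaw):

    import java.io.IOException;
    import java.nio.file.Files;
    import java.nio.file.Path;
    import java.nio.file.Paths;
    import java.util.List;
    import java.util.regex.Pattern;
    import java.util.stream.Stream;

    // Hypothetical build step: walk the source tree and report raw print calls that
    // bypass the logging framework. The pattern below is an illustrative example only.
    public class RoguePrintCheck {
        private static final Pattern ROGUE = Pattern.compile("OConsole\\.message|System\\.out\\.print");

        public static void main(String[] args) throws IOException {
            try (Stream<Path> files = Files.walk(Paths.get(args[0]))) {
                files.filter(p -> p.toString().endsWith(".java")).forEach(RoguePrintCheck::check);
            }
        }

        private static void check(Path file) {
            try {
                List<String> lines = Files.readAllLines(file);
                for (int i = 0; i < lines.size(); i++) {
                    if (ROGUE.matcher(lines.get(i)).find()) {
                        System.out.println(file + ":" + (i + 1) + " - raw print call found");
                    }
                }
            } catch (IOException e) {
                System.err.println("Could not read " + file);
            }
        }
    }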

Finally, and perhaps most importantly, developer education is vital when it comes to consistently implementing a logging framework. If more than one logging framework is in use, it is likely that education fell short at some point in the past: a developer wasn’t aware of the existing logging framework so decided to write their own. A developer’s guide should be written for the logging framework that covers all languages. It should explain how logging should be invoked, when to log and at what level. Log levels should be clearly defined, a default level (usually Info) should be chosen, and expectations around the content and quantity of messages should be set: if a process takes 30 minutes to complete but only provides a start and an end message this is probably not enough, whereas a Tran Field Notification script that takes less than half a second to execute and produces 200 log messages is probably logging too much. Generally speaking, long-running and complex operations should be logged immediately before and immediately after execution at the default level, with more granular details logged at a lower level so that they are only output when the debug level is increased.

For those supporting the application and making use of the log files, a support guide should be written that explains where the logs will be located, what the levels mean and how to change the level. Finally, a technical guide should be written that describes the framework in detail and can be used if issues with the framework materialise or changes to the framework are required.

Please feel free to post your comments, questions and experiences below. In my next article I’ll be taking a look at cleaning up noisy core OpenLink logs.

