
Blitz4j at a glance


What is Blitz4j?

Blitz4j is a logging framework built on top of log4j to reduce multithreaded contention and enable highly scalable logging without affecting application performance characteristics.

At Netflix, Blitz4j is used to log billions of events for monitoring, business intelligence reporting, debugging and other purposes. Blitz4j overcomes traditional log4j bottlenecks with a highly scalable and customizable asynchronous framework. Blitz4j also comes with the ability to convert existing log4j appenders to the asynchronous model without changing the existing log4j configurations.

Blitz4j makes runtime reconfigurations of log4j pretty easy. Blitz4j also tries to mitigate data loss and provides a way to summarize the log information during log storms.

Why is scalable logging important?

Logging is a critical part of any application infrastructure. At Netflix, we collect log data for monitoring, business intelligence reporting and other purposes. There is also a need to turn on finer-grained logging levels for debugging customer issues.

In addition, in a service-oriented architecture you depend on other central services, and if those services break unexpectedly, your applications tend to log orders of magnitude more than normal. This is where the scalability of the logging infrastructure comes to the fore. Any scalable logging infrastructure should be able to handle these kinds of log storms, providing useful information about the breakages without noticeably affecting application performance.

History of Blitz4j

At Netflix, log4j has been used as the logging framework for a few years. It worked fine for us until there was a real need to log lots of data. When our traffic increased and the need for per-instance logging went up, log4j's frailties started to get exposed.

Problems with Log4j

Contended Synchronization with Root Logger

Log4j follows a hierarchical model, which makes it easy to turn logging on or off at the package or class level (if your logger definitions follow that model). In this model, the root logger is at the top of the hierarchy. In most cases, all loggers have to go through the root logger to log to the appenders configured there. One of the biggest problems here is that a lock is needed on the root logger to write to the appenders. For a high-traffic application that logs a lot of data, this is a big contention point, as all application threads have to synchronize on the root logger.

public void callAppenders(LoggingEvent event) {
  int writes = 0;

  for (Category c = this; c != null; c = c.parent) {
    // Protected against simultaneous call to addAppender, removeAppender,...
    synchronized (c) {
      if (c.aai != null) {
        writes += c.aai.appendLoopOnAppenders(event);
      }
      if (!c.additive) {
        break;
      }
    }
  }

  if (writes == 0) {
    repository.emitNoAppenderWarning(this);
  }
}

This severely limits the application scalability. Even if the critical section is executed quickly, this is a huge bottleneck in high volume logging applications.

The reason for the lock seems to be protection against a potential change in the list of appenders, which should be a rare event. For us, thread dumps have exposed this contention numerous times.

Asynchronous appender to the rescue?

The longer the time spent logging to the appender, the more threads wait for the logging to complete. Any buffering here helps the scalability of the application tremendously. The buffering that comes with the log4j FileAppender helps quite a bit, and the built-in log4j asynchronous appender alleviates the problem further, but it still does not remove this synchronization. Thread dumps revealed another point of contention when logging to the appender.

It was quite clear that the built-in asynchronous appender was less scalable because of this synchronization.

public synchronized void doAppend(LoggingEvent event) {
  if (closed) {
    LogLog.error("Attempted to append to closed appender named [" + name + "].");
    return;
  }

  if (!isAsSevereAsThreshold(event.getLevel())) {
    return;
  }

  Filter f = this.headFilter;

  FILTER_LOOP:
  while (f != null) {
    switch (f.decide(event)) {
    case Filter.DENY: return;
    case Filter.ACCEPT: break FILTER_LOOP;
    case Filter.NEUTRAL: f = f.getNext();
    }
  }

  this.append(event);
}

Deadlock Vulnerability

This double locking (root logger and appender) also makes the application vulnerable to deadlocks: if your appender takes a lock on some resource while, at the same time, a thread holding that resource tries to log to the appender, the two threads acquire the locks in opposite order and can deadlock.
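To make the scenario concrete, here is a hypothetical illustration; DatabaseAppender, connectionLock and ConnectionManager are invented names for this example and are not part of log4j or Blitz4j.

import org.apache.log4j.AppenderSkeleton;
import org.apache.log4j.Logger;
import org.apache.log4j.spi.LoggingEvent;

// Hypothetical appender that guards a shared resource with its own lock.
public class DatabaseAppender extends AppenderSkeleton {
  static final Object connectionLock = new Object();

  @Override
  protected void append(LoggingEvent event) {
    // The calling thread already holds the Category lock (callAppenders)
    // and this appender's monitor (doAppend) when it arrives here.
    synchronized (connectionLock) {
      // write the event to the database
    }
  }

  @Override
  public void close() {
  }

  @Override
  public boolean requiresLayout() {
    return false;
  }
}

// Meanwhile, a thread that already owns connectionLock logs a message. It now
// needs the Category/appender locks held by the logging thread above, while
// that thread waits for connectionLock: a classic deadlock.
class ConnectionManager {
  private static final Logger logger = Logger.getLogger(ConnectionManager.class);

  void reconnect() {
    synchronized (DatabaseAppender.connectionLock) {
      logger.warn("reconnecting"); // blocks on the logger/appender locks
    }
  }
}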

Locking on Logger Cache

In log4j, loggers are cached in a Hashtable, which needs to be locked for any retrieval of a cached logger. When you want to change the log4j settings dynamically, there are two steps in the process:

  1. Reset and empty out all current log4j configurations
  2. Load all configurations including new configurations

During the reset process, locks have to be held on both the cache and the individual loggers. If any of your appenders tries to look up a logger from the cache at the same time, you have the classic case of locks being acquired in opposite order and the chance of a deadlock.

public void shutdown() {
  Logger root = getRootLogger();

  // begin by closing nested appenders
  root.closeNestedAppenders();

  synchronized (ht) {
    Enumeration cats = this.getCurrentLoggers();
    while (cats.hasMoreElements()) {
      Logger c = (Logger) cats.nextElement();
      c.closeNestedAppenders();
    }
    // ...
  }
}

Blitz4j to the rescue

Central to all the contention and deadlock vulnerabilities is the locking model in log4j. If log4j used the concurrent data structures available in JDK 1.5 and above, most of these problems would be solved. That is exactly what Blitz4j does.

Blitz4j overrides key parts of the log4j architecture to remove the locks and replace them with concurrent data structures. Blitz4j puts the emphasis on application performance and stability rather than on accuracy in logging. This means Blitz4j leans towards the asynchronous model of logging and tries to keep the logging useful by retaining the time order of log messages.
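As a minimal sketch of that idea (not Blitz4j's actual implementation; the class and method names below are invented), a copy-on-write list lets the hot logging path read the appender list without any lock, while the rare addAppender/removeAppender calls pay the copy cost:

import java.util.concurrent.CopyOnWriteArrayList;

import org.apache.log4j.Appender;
import org.apache.log4j.spi.LoggingEvent;

// Sketch only: a copy-on-write list removes the need for the synchronized(c)
// block seen in callAppenders() above.
class LockFreeAppenderAttachable {
  private final CopyOnWriteArrayList<Appender> appenders = new CopyOnWriteArrayList<Appender>();

  // The rare mutating operations pay the array-copy cost...
  void addAppender(Appender appender) {
    appenders.addIfAbsent(appender);
  }

  void removeAppender(Appender appender) {
    appenders.remove(appender);
  }

  // ...so the hot logging path can iterate without holding any lock.
  int appendLoopOnAppenders(LoggingEvent event) {
    int writes = 0;
    for (Appender appender : appenders) {
      appender.doAppend(event);
      writes++;
    }
    return writes;
  }
}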

While log4j's built-in asynchronous appender is similar in functionality to the one offered by Blitz4j, Blitz4j comes with the following differences:

  1. Removes all critical synchronizations by using concurrent data structures.
  2. Extreme configurability of the in-memory buffer and worker threads.
  3. Better isolation of application threads from logging threads by replacing the wait-notify model with an executor pool model.
  4. Better handling of log messages during log storms, with a configurable summary (a rough sketch of this model follows the list).
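
The sketch below illustrates the executor pool model described in points 2–4; AsyncDispatcher and its methods are invented names, not Blitz4j's actual API. Application threads only enqueue into a bounded in-memory buffer, a pool of worker threads drains it to the real appender, and events dropped during a storm are counted so a summary can be emitted instead:

import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.atomic.AtomicLong;

import org.apache.log4j.Appender;
import org.apache.log4j.spi.LoggingEvent;

// Sketch of the executor-pool model: the buffer size and worker count are the
// kind of knobs Blitz4j exposes; the class and field names here are invented.
class AsyncDispatcher {
  private final BlockingQueue<LoggingEvent> buffer;
  private final ExecutorService workers;
  private final AtomicLong discarded = new AtomicLong();

  AsyncDispatcher(final Appender delegate, int bufferSize, int workerThreads) {
    this.buffer = new ArrayBlockingQueue<LoggingEvent>(bufferSize);
    this.workers = Executors.newFixedThreadPool(workerThreads);
    for (int i = 0; i < workerThreads; i++) {
      workers.execute(new Runnable() {
        public void run() {
          try {
            while (!Thread.currentThread().isInterrupted()) {
              // Slow appender I/O happens off the application threads.
              delegate.doAppend(buffer.take());
            }
          } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
          }
        }
      });
    }
  }

  // Application threads never block on appender I/O: they enqueue and move on.
  void dispatch(LoggingEvent event) {
    if (!buffer.offer(event)) {
      // Log storm: the buffer is full, so count the drop for a later summary
      // message instead of stalling the caller.
      discarded.incrementAndGet();
    }
  }

  long discardedEventCount() {
    return discarded.get();
  }
}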

Apart from the above, Blitz4j also provides the following.

  1. Ability to dynamically configure log4j levels for debugging production problems without affecting application performance.
  2. Automatic conversion of any log4j appender to the asynchronous model, statically or at runtime (a configuration sketch follows this list).
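
For example, converting an existing appender to the asynchronous model is a configuration-level change. The snippet below is a minimal sketch, assuming the log4j.logger.asyncAppenders property and the LoggingConfiguration.getInstance().configure(Properties) entry point work as described in the Blitz4j documentation; verify both against your version.

import java.util.Properties;

import com.netflix.blitz4j.LoggingConfiguration;

public class LoggingBootstrap {
  public static void main(String[] args) {
    Properties props = new Properties();
    // Ordinary log4j appender definition, unchanged.
    props.setProperty("log4j.appender.stdout", "org.apache.log4j.ConsoleAppender");
    props.setProperty("log4j.appender.stdout.layout", "org.apache.log4j.PatternLayout");
    props.setProperty("log4j.rootCategory", "INFO,stdout");
    // Assumed Blitz4j convention: listing an appender under the special
    // "asyncAppenders" logger wraps it with the asynchronous model.
    props.setProperty("log4j.logger.asyncAppenders", "INFO,stdout");

    // Installs Blitz4j's log4j overrides and applies the configuration.
    // Levels can later be changed at runtime by reconfiguring the same way.
    LoggingConfiguration.getInstance().configure(props);
  }
}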

If your application is equipped with enough memory, it is possible to achieve both application and logging performance without any loss of logging data. The power of this model comes to the fore during log storms, when critical dependencies break unexpectedly and cause an orders-of-magnitude increase in logging.

Why not use LogBack?

For a new project, this may be the way to go. For existing projects, however, there seemed to be a considerable amount of work to achieve the promised scalability. Since Blitz4j achieves this without putting the onus on individual applications, and since the contention problems we used to encounter no longer exist, we deemed it prudent to stay with the well-understood and time-tested infrastructure.