Faster and Cleaner Code since Java 7 - codecentric AG Blog


Every Java developer with more than a few months of coding experience has written code like this before:

try {
  "Hello World".getBytes("UTF-8");
} catch (UnsupportedEncodingException e) {
  // Every implementation of the Java platform is required to support UTF-8
  // Why the $!?% do I have to catch an exception which can never happen
}


What I realized just recently is that Java 7 already provided a fix for this ugly code, which not many people have adopted:

"Hello World".getBytes(StandardCharsets.UTF_8);


Yay! No exception! And it is not only nicer, it is also faster! You will be surprised by how much!
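To make the contrast concrete, here is a small, self-contained example (the class name is mine, not from the JDK) showing that both overloads produce exactly the same bytes — the Charset overload just gets there without forcing a checked exception on you:

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

public class CharsetDemo {
    public static void main(String[] args) throws Exception {
        // Pre-Java-7 style: charset passed by name,
        // UnsupportedEncodingException forced on the caller
        byte[] legacy = "Hello World".getBytes("UTF-8");

        // Java 7 style: Charset constant, no checked exception
        byte[] modern = "Hello World".getBytes(StandardCharsets.UTF_8);

        // Both paths encode to the same UTF-8 bytes
        System.out.println(Arrays.equals(legacy, modern)); // prints "true"
    }
}
```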

Let us first look at the implementations for both getBytes() calls:

return StringCoding.encode(charsetName, value, 0, value.length);

and

return StringCoding.encode(charset, value, 0, value.length);

Not very exciting. So we dig deeper:

static byte[] encode(String charsetName, char[] ca, int off, int len)
    throws UnsupportedEncodingException
{
    StringEncoder se = deref(encoder);
    String csn = (charsetName == null) ? "ISO-8859-1" : charsetName;
    if ((se == null) || !(csn.equals(se.requestedCharsetName())
                          || csn.equals(se.charsetName()))) {
        se = null;
        try {
            Charset cs = lookupCharset(csn);
            if (cs != null)
                se = new StringEncoder(cs, csn);
        } catch (IllegalCharsetNameException x) {}
        if (se == null)
            throw new UnsupportedEncodingException (csn);
        set(encoder, se);
    }
    return se.encode(ca, off, len);
}


and

static byte[] encode(Charset cs, char[] ca, int off, int len) {
  CharsetEncoder ce = cs.newEncoder();
  int en = scale(len, ce.maxBytesPerChar());
  byte[] ba = new byte[en];
  if (len == 0)
      return ba;
  boolean isTrusted = false;
  if (System.getSecurityManager() != null) {
      if (!(isTrusted = (cs.getClass().getClassLoader0() == null))) {
          ca =  Arrays.copyOfRange(ca, off, off + len);
          off = 0;
      }
  }
  ce.onMalformedInput(CodingErrorAction.REPLACE)
    .onUnmappableCharacter(CodingErrorAction.REPLACE)
    .reset();
  if (ce instanceof ArrayEncoder) {
      int blen = ((ArrayEncoder)ce).encode(ca, off, len, ba);
      return safeTrim(ba, blen, cs, isTrusted);
  } else {
      ByteBuffer bb = ByteBuffer.wrap(ba);
      CharBuffer cb = CharBuffer.wrap(ca, off, len);
      try {
          CoderResult cr = ce.encode(cb, bb, true);
          if (!cr.isUnderflow())
              cr.throwException();
          cr = ce.flush(bb);
          if (!cr.isUnderflow())
              cr.throwException();
      } catch (CharacterCodingException x) {
          throw new Error(x);
      }
      return safeTrim(ba, bb.position(), cs, isTrusted);
  }
}


Wooha. It looks like the overload taking a Charset is the more complicated one, right? Wrong. The last line of encode(String charsetName, char[] ca, int off, int len) is se.encode(ca, off, len), and the body of that method looks mostly like the body of encode(Charset cs, char[] ca, int off, int len). Put simply, everything else in encode(String charsetName, char[] ca, int off, int len) is pure overhead.
Also worth noting is the line Charset cs = lookupCharset(csn);, which eventually ends up doing this:

private static Charset lookup(String charsetName) {
  if (charsetName == null)
      throw new IllegalArgumentException("Null charset name");
 
  Object[] a;
  if ((a = cache1) != null && charsetName.equals(a[0]))
      return (Charset)a[1];
  // We expect most programs to use one Charset repeatedly.
  // We convey a hint to this effect to the VM by putting the
  // level 1 cache miss code in a separate method.
  return lookup2(charsetName);
}
 
private static Charset lookup2(String charsetName) {
  Object[] a;
  if ((a = cache2) != null && charsetName.equals(a[0])) {
      cache2 = cache1;
      cache1 = a;
      return (Charset)a[1];
  }
 
  Charset cs;
  if ((cs = standardProvider.charsetForName(charsetName)) != null ||
      (cs = lookupExtendedCharset(charsetName))           != null ||
      (cs = lookupViaProviders(charsetName))              != null)
  {
      cache(charsetName, cs);
      return cs;
  }
 
  /* Only need to check the name if we didn't find a charset for it */
  checkName(charsetName);
  return null;
}


Wooha again. That's quite a bit of code. Also note the comment // We expect most programs to use one Charset repeatedly. That is not entirely true: we reach for charsets precisely when we have more than one and need to convert between them. But yes, for most internal usage the assumption holds.
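You can observe the caching yourself. On OpenJDK/HotSpot, Charset.forName() walks the two-level cache shown above and hands back the provider's singleton instance — the very same object as the StandardCharsets constant (a quick demo, assuming an OpenJDK runtime):

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class LookupDemo {
    public static void main(String[] args) {
        // Resolves "UTF-8" through the lookup/lookup2 cache shown above
        Charset byName = Charset.forName("UTF-8");

        // On OpenJDK both paths yield the same singleton instance, so the
        // name-based path buys you nothing but the lookup overhead itself.
        System.out.println(byName == StandardCharsets.UTF_8); // prints "true"
    }
}
```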

Equipped with this knowledge, I can easily write a JMH benchmark that will nicely show the performance difference between these two String.getBytes() calls.
The benchmark can be found in this gist. On my machine it produces this result:

Benchmark                Mean      Mean error  Units
preJava7CharsetLookup    3956.537  144.562     ops/ms
postJava7CharsetLookup   7138.064  179.101     ops/ms

The whole result can be found in the gist, or better: obtained by running the benchmark yourself.
But the numbers already speak for themselves: by using StandardCharsets, you not only avoid catching a pointless exception, you also almost double the throughput of the code 🙂
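For a quick sanity check without setting up JMH, a crude, self-contained timing loop can show the same tendency (this is my own illustrative sketch, not the gist's benchmark — JMH handles warmup, dead-code elimination, and statistics far more rigorously, so trust its numbers over these):

```java
import java.nio.charset.StandardCharsets;

public class CrudeCharsetTiming {
    public static void main(String[] args) throws Exception {
        String s = "Hello World";
        int iterations = 1_000_000;

        // Warm up both paths so the JIT compiles them before we measure
        for (int i = 0; i < iterations; i++) {
            s.getBytes("UTF-8");
            s.getBytes(StandardCharsets.UTF_8);
        }

        long t0 = System.nanoTime();
        for (int i = 0; i < iterations; i++) s.getBytes("UTF-8");
        long byName = System.nanoTime() - t0;

        t0 = System.nanoTime();
        for (int i = 0; i < iterations; i++) s.getBytes(StandardCharsets.UTF_8);
        long byCharset = System.nanoTime() - t0;

        System.out.printf("by name:    %d ms%n", byName / 1_000_000);
        System.out.printf("by Charset: %d ms%n", byCharset / 1_000_000);
    }
}
```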