# Saturday, 10 December 2016

On REST services performance

Recently, I had to investigate a “performance issue” a customer was having with one of their web services.


To make it simple, the service is a REST API to get information about points of interest. The response is quite large (hundreds of KBs) but nothing exceptional.

Several clients can perform multiple requests for the same POI, and the response for a single POI is almost the same every time: it varies a little with real-time updates (traffic, info, last-minute additions or cancellations), but it is roughly constant. So the code was already doing the right thing, and cached the answer for each POI.

Well… more or less the right thing. For a single POI, with ~1000 sub-items, the response time for the first request was ~39 seconds. Subsequent requests took half a second, so the caching was working.

The API is for consumption by a service, so there is no need to be “responsive” (as in “users will need quick feedback or they will walk away”), but still: 39 seconds!


The API is implemented in Java (JAX-RS + JPA, to be precise), so armed with my profiler of choice (VisualVM) I started hunting for hot spots. Here is a list of DOs and DON’Ts I compiled while investigating and fixing the issues, which may come in handy. The list is not tied to Java; it is very general!


  • DO instrument your code with log calls that record timings at the entry/exit of “hot” functions.

  • DO log at a level you can leave enabled in production (e.g. INFO. But then actually leave INFO on!)

  • If you didn’t do that… you don’t have timings :( But you need timings to see where you need to improve, so DO use a profiler!

  • DON’T just go for the function you believe is the slowest: a profiler trace may surprise you.

  • DO use a profiler with instrumentation, not with sampling. In my experience, sampling is never precise enough.

  • When you have found your hot spots, DO move all costly, repeated operations you find to a place where they are done once (a constructor or initialization method). In this case, the offender was an innocent-looking Config.get("db.name") method. Just to get the DB name from a config class. Which ended up opening, reading, and parsing a property file every single time. The method was doing a lot under the hood, but would you have looked at it without the hint from a profiler? See the previous point :)

  • DO cache data that does not change, if you are reading it from a DB or web service. Caching the final output is the obvious first step, but it is often not nearly enough: you also have to avoid multiple lookups for the same resource inside a single request!

  • DON’T do a DB (or even a cache!) lookup if you can find another way to get the same information, even when you need to re-compute a result (i.e. spend CPU time). In this service, each POI sub-item could be categorized in one of two classes using some of its attributes. The old implementation used a subset of attributes that needed to be checked with a DB lookup; I changed it to use a different set of attributes that needed a (simple) computation.

  • DO load the cache in bulk for small-ish sets of data. In this service, each string to be displayed to the user was looked up in a DB of “special cases” using complex fallback rules, generating up to 4 progressively less refined queries each time. If nothing was found (~80% of the time), a default string was loaded from a Web Service. This operation alone accounted for 10 seconds, or 25% of the total time. The “non-default” DB contains only around 4k items; a bulk query for all the rows takes just 100 ms, the result can easily be stored in memory, and doing the filtering and matching in memory costs only a few ms more.

  • DO use simple libraries: communication with other Web Services was done using a very easy to use but quite heavy library (Jersey + Jackson for JSON deserialization). I switched to a custom client written with OkHttp and GSON, and the net saving was 4 whole seconds.

  • DO enable compression on the response (if the user agent says it supports compression; most do!)

  • DO minimize copy and allocations: in this case (but this advice applies to Java in general), I used streams instead of lists whenever possible, down to the response buffer.

  • DON’T use the DB, especially NOT the same DB you use for your primary data, to store “logs”. In this case, it was access logs for rate limiting. A client hitting the service hard could consume a lot of resources just to generate a 429 Too Many Requests response.
    Recording such an event in your primary DB is the perfect opportunity for a DoS attack.
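The first points above (timing instrumentation at a level you can leave on in production) can be sketched with a small helper. This is a minimal sketch, not the code from the actual service; the class and method names are hypothetical, and it uses only java.util.logging from the standard library:

```java
import java.util.function.Supplier;
import java.util.logging.Level;
import java.util.logging.Logger;

public class Timed {
    private static final Logger LOG = Logger.getLogger(Timed.class.getName());

    // Run a task, logging entry and exit (with elapsed time) at INFO level,
    // so the timings are visible in production logs without a profiler.
    public static <T> T time(String name, Supplier<T> task) {
        long start = System.nanoTime();
        LOG.log(Level.INFO, "enter {0}", name);
        try {
            return task.get();
        } finally {
            long elapsedMs = (System.nanoTime() - start) / 1_000_000;
            LOG.log(Level.INFO, "exit {0} ({1} ms)", new Object[]{name, elapsedMs});
        }
    }

    public static void main(String[] args) {
        // Wrap any "hot" call site; the wrapped value is returned unchanged.
        int result = time("computeAnswer", () -> 6 * 7);
        System.out.println(result); // prints 42
    }
}
```

Wrapping only the few known-hot functions keeps the logging overhead negligible while still giving you real numbers from production.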

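The bulk-loading point can be sketched like this; `Row`, the class name, and the keys are hypothetical stand-ins for the ~4k-row “special cases” table, loaded once with a single query instead of up to 4 fallback queries per string:

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class SpecialCaseCache {
    // Hypothetical row type standing in for a record of the "special cases" table.
    record Row(String key, String text) {}

    private final Map<String, String> byKey = new HashMap<>();

    // Load the whole (small) table once, e.g. from "SELECT key, text FROM special_cases".
    public SpecialCaseCache(List<Row> allRows) {
        for (Row r : allRows) {
            byKey.put(r.key(), r.text());
        }
    }

    // Pure in-memory lookup; the default covers the ~80% "not special" case
    // that previously required a round trip to a Web Service.
    public String lookup(String key, String defaultText) {
        return byKey.getOrDefault(key, defaultText);
    }

    public static void main(String[] args) {
        SpecialCaseCache cache = new SpecialCaseCache(List.of(
                new Row("poi.closed", "Temporarily closed")));
        System.out.println(cache.lookup("poi.closed", "n/a"));  // prints Temporarily closed
        System.out.println(cache.lookup("poi.unknown", "n/a")); // prints n/a
    }
}
```

A few thousand short strings fit comfortably in memory, so trading one 100 ms bulk query for thousands of per-request lookups is an easy win.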

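On the last point: a minimal in-memory, fixed-window rate limiter (a sketch with hypothetical names, not the service’s actual code) shows how a 429 can be produced without ever touching the primary DB:

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

public class FixedWindowLimiter {
    private final int maxPerWindow;
    private final long windowMs;
    // Per-client state: [0] = window start (ms), [1] = requests in this window.
    // Kept in memory: a flood of requests never hits the database.
    private final Map<String, long[]> windows = new ConcurrentHashMap<>();

    public FixedWindowLimiter(int maxPerWindow, long windowMs) {
        this.maxPerWindow = maxPerWindow;
        this.windowMs = windowMs;
    }

    // Returns true if the request is allowed; false means "reply 429".
    public synchronized boolean allow(String clientId, long nowMs) {
        long[] w = windows.computeIfAbsent(clientId, k -> new long[]{nowMs, 0});
        if (nowMs - w[0] >= windowMs) { // window expired: start a new one
            w[0] = nowMs;
            w[1] = 0;
        }
        if (w[1] >= maxPerWindow) {
            return false;
        }
        w[1]++;
        return true;
    }

    public static void main(String[] args) {
        FixedWindowLimiter rl = new FixedWindowLimiter(2, 1000);
        System.out.println(rl.allow("client", 0));    // true
        System.out.println(rl.allow("client", 10));   // true
        System.out.println(rl.allow("client", 20));   // false: over the limit
        System.out.println(rl.allow("client", 1500)); // true: new window
    }
}
```

Rejecting an over-limit client this way costs a map lookup and a comparison, so abusive traffic cannot turn rate limiting itself into a resource drain.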
Remember the times?

  • 39 seconds for the first request

  • 0.5 seconds for subsequent requests on the same object

Now:

  • 1 second for the first request

  • 50 milliseconds (0.05 seconds) for subsequent requests on the same object


It is more than an order of magnitude. I’m quite happy, and so was the customer! The time could be brought down even further by ditching the ORM framework (JPA in this case) and going for native (JDBC) queries, changing some algorithms, or using a different exchange format (e.g. protobuf instead of JSON), but with increasing effort and diminishing returns. And for this customer, the result was already more than they asked for.

# Thursday, 08 December 2016

Containers, Windows, and minimal images

Recently I have watched with awe a couple of presentations on Docker on Windows. Wow… proper containers on the Windows kernel, I hadn’t seen it coming! I thought that “porting” cgroups and namespaces from Linux would be hard to accomplish. Surely, all the bits were already almost there: Windows has had something similar to cgroups for resource control (Jobs: sets of processes on which you can enforce limits such as working set size, process priority, and end-of-job time for each process associated with the job) since NT 5.1 (XP/Windows Server 2003), and NT has had kernel namespaces since its beginning (NT 3.1). For details I recommend reading this excellent article: Inside NT Object Manager.


However, seeing these bits put together and exposed to userland with a nice API, and Docker ported (not forked!) to use it, is something else.


Of course, I was instantly curious. How did they do that? The Windows Containers Documentation contains no clue: all you can find is a quick start.


There are a couple of videos of presentations given at DockerCon EU 2015 and DockerCon 2016, but documentation is really scarce. Non-existent, actually.

From the videos you understand that, as usual, Windows does not expose in an official way (at least, for now) the primitives needed to create the virtual environment for a container; instead, it exposes a user-mode DLL with a simplified (and, hopefully, stable) API to create “Compute Systems”. One of the exposed functions is, for example, HcsCreateComputeSystem.


A search on MSDN for vmcompute.dll or, for example, HcsCreateComputeSystem reveals nothing… the only documentation is found in a couple of GitHub projects from Microsoft: hcsshim, a shim used by Docker to support Windows Containers by calling into the vmcompute.dll API, and dotnet-computevirtualization, a .NET assembly to access the vmcompute.dll API from managed languages.


Once it is documented, this “Compute Systems” API is surely something I want to try out for Pumpkin.


Meanwhile… there is a passage in the presentations and in the official Introducing Docker for Windows Server 2016 announcement that left me with mixed feelings: you cannot use “FROM scratch” to build your own image; you have to start from a “minimal” Windows image.


Currently, Microsoft provides microsoft/windowsservercore or microsoft/nanoserver.

The Windows Server Core image comes with a mostly complete userland, with the processes and DLLs found on a standard Windows Server Core install. This image is very convenient: any Windows server software will run on it without modification, but it takes 10 GB of disk space! The other base-layer option is Nano Server, a new and very minimal Windows version with a pared-down Windows API. The API is not complete, but porting to it should be easy, and the image is less than 300 MB.

But why do I need at least a 300MB image? The point of containers is to share the kernel of the host, isn’t it?


The explanation is buried in one of the DockerCon presentations, and it makes a lot of sense: the Win32 API is exposed to Windows programs through DLLs, not directly as syscalls.

(Note: I will call it the “Win32 API” even on x64, because there really isn’t any “Win64” API: it is the same! Just look at the names of the libraries: kernel32, gdi32, user32, …)


Of course, internally those DLLs will make syscalls to transition to kernel mode (sort of, more on this later), but the surface you program against in Windows is through user mode DLLs.

What really hit me is the sheer number of “basic” components required by Windows nowadays. I started to program using the Win32 API when there were only 3 of these DLLs. OK, 4 if you count advapi32.dll.

Sure, then there was ole32 if you wanted to use OLE, comctl32 if you wanted “fancy” user controls, and ws2_32 if you wanted sockets… But they were “optional”, while now without csrss.exe, lsass.exe, smss.exe, svchost, wininit, etc. you cannot even run a “no-op” executable.


Or can you?


I will take an extremely simple application, one that “touches” (creates) a very specific file (no user input, to keep things easy), and try to remove dependencies, to see how far you can get (Spoiler alert: you can get very far!)


I will divide my findings in three blog posts, one for each “step” I went through, and update this post with links every time I post a new one.

  • Step 1: no C runtime
  • Step 2: no Win32 API
  • Step 3: no Native API (no ntdll.dll)

For the impatient: there is a GitHub repo with all the code here :)