# Saturday, 10 December 2016

On REST services performance

Recently, I had to investigate a “performance issue” a customer was having with one of their web services.


To make it simple, the service is a REST API to get information about points of interest. The response is quite large (hundreds of KBs) but nothing exceptional.

Several clients can perform multiple requests for the same POI, and the response for a single POI is almost the same: it varies a little over time with real time updates (traffic, info, last minute additions or cancellations) but it is roughly the same. So, the code was already doing the right thing, and cached the answer for each POI.

Well... more or less the right thing. For a single POI, with ~1000 sub-items, the response time for the first request was ~39 seconds. Subsequent requests required half a second, so the caching was working.

The API is for consumption by a service, so there is no need to be “responsive” (as in “users will need quick feedback or they will walk away”), but still: 39 seconds!


The API is implemented in Java (JAX-RS + JPA to be precise), so armed with the profiler of choice (VisualVM) I started hunting for hot spots. Here is a list of DOs and DON’Ts I compiled while investigating and fixing the issues, which can come in handy. The list is not tied to Java, it is very general!


  • DO instrument your code with Log calls with timings at enter/exit of “hot” functions (a minimal sketch follows this list).

  • DO it at a logging level you can leave on in production (e.g.: INFO. But then leave INFO on!)

  • If you didn’t do that… you don’t have times :( But you need timing to see where you need to improve, so DO use a profiler!

  • DON’T just go for the function you believe is the slowest: a profiler trace may surprise you.

  • DO use a profiler with instrumentation, not with sampling. In my experience, sampling is never precise enough.

  • When you have found your hot spots, DO move all costly, repeated operations you find to a place where they are done once (a constructor or initialization method). In this case, the offender was an innocent looking Config.get(“db.name”) method. Just to get the DB name from a config class. Which ended up opening a property file, reading it, parsing it every time. The method was doing a lot under the hood, but would you have looked at it without the hint from a profiler? See the previous point :)

  • DO cache data that does not change, if you are reading it from a DB or web service. Cache on the output is the basics, but it is often not nearly enough. You have to avoid multiple lookups for the same resource inside a single request!

  • DON’T do a DB (or even a cache!) lookup if you can find another way to get the same information, even when you need to re-compute a result (i.e. spend CPU time). In this service, each POI sub-item could be categorized in one of two classes using some of its attributes. The old implementation used a subset of attributes that needed to be checked with a DB lookup; I changed it to use a different set of attributes that needed a (simple) computation.

  • DO load the cache in bulk for small-ish sets of data. In this service, each string which contained information to be displayed to the user was looked up in a DB of “special cases” using complex fallback rules, each time generating a less refined query (up to 4). If nothing was found (~80% of the times), a default string was loaded from a Web Service. This operation alone accounted for 10 seconds, or 25% of the total time. The “not default” DB contains just around 4k items; a bulk query for all the rows requires only 100ms, can be easily stored in memory, and doing the filtering and matching in memory costs just a few ms more.

  • DO use simple libraries: communication with other Web Services was done using a very easy to use but quite heavy library (Jersey + Jackson for JSON deserialization). I switched to a custom client written with OkHttp and GSON, and the net saving was 4 whole seconds.

  • DO enable compression on the result (if the user agent says it supports compression - most do!)

  • DO minimize copy and allocations: in this case (but this advice applies to Java in general), I used streams instead of lists whenever possible, down to the response buffer.

  • DON’T use the DB, especially NOT the same DB you use for your primary data, to store “logs”. In this case, it was access logs for rate limiting. A client hitting the service hard could consume a lot of resources just to generate a 429 Too Many Requests response.
    Recording such an event to your primary DB is the perfect opportunity for a DoS attack.
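
As a concrete illustration of the first DO above: the service in question is Java, but the idea is language-agnostic, so here is a minimal C# sketch of enter/exit timing that is cheap enough to leave enabled at INFO level (Console stands in for a real logger, and the names are just examples):

   using System;
   using System.Diagnostics;

   // Times a "hot" call and logs on enter and on exit.
   sealed class TimedScope : IDisposable
   {
       private readonly string _name;
       private readonly Stopwatch _watch = Stopwatch.StartNew();

       public TimedScope(string name)
       {
           _name = name;
           Console.WriteLine($"ENTER {_name}");
       }

       public void Dispose() =>
           Console.WriteLine($"EXIT  {_name} after {_watch.ElapsedMilliseconds} ms");
   }

   // Usage around a hot function:
   // using (new TimedScope("LoadPointOfInterest")) { /* ... expensive call ... */ }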


Remember the times?

  • 39 seconds for the first request

  • 0.5 seconds for subsequent requests on the same object

Now:

  • 1 second for the first request

  • 50 milliseconds (0.05 seconds) for subsequent requests on the same object


That is an improvement of more than an order of magnitude. I’m quite happy, and so was the customer! The time could be brought down even further by ditching the ORM framework (JPA in this case) and going for native (JDBC) queries, changing some algorithms, using a different exchange format (e.g. protobuf instead of JSON), but with increasing effort and diminishing returns. And for this customer, the result was already more than they asked for.

# Thursday, 08 December 2016

Containers, Windows, and minimal images

Recently I have watched with awe a couple of presentations on Docker on Windows. Wow… proper containers on the Windows kernel, I hadn’t seen it coming! I thought that “porting” cgroups and namespaces from Linux was something hard to accomplish. Surely, all the bits were already almost there: Windows has had something similar to cgroups for resource control (Jobs: sets of processes on which you can enforce limits such as working set size, process priority, and end-of-job time limits) since NT 5.1 (XP/Windows Server 2003), and NT has had kernel namespaces since its beginning (NT 3.1). For details I recommend reading this excellent article: Inside NT Object Manager.


However, seeing these bits put together and exposed to the userland with a nice API, and Docker ported (not forked!) to use it, is something else.


Of course, I was instantly curious. How did they do that? The Windows Containers Documentation contains no clue: all you can find is a quick start.


There are a couple of videos of presentations from DockerCon EU 2015 and DockerCon 2016, but documentation is really scarce. Non-existent.

From the videos you understand that, as usual, Windows does not expose in an official way (at least, for now) the primitives needed to create the isolated environment for a container, but rather exposes a user-mode DLL with a simplified (and, hopefully, stable) API to create “Compute Systems”. One of the exposed functions is, for example, HcsCreateComputeSystem.


A search on MSDN for vmcompute.dll or, for example, HcsCreateComputeSystem, reveals nothing… the only documentation is found in a couple of GitHub projects from Microsoft: hcsshim, a shim used by Docker to support Windows Containers by calling into the vmcompute.dll API, and dotnet-computevirtualization, a .NET assembly to access the vmcompute.dll API from managed languages.


Once it is documented, this “Compute Systems” API is surely something I want to try out for Pumpkin.


Meanwhile… there is a passage in the presentations and in the official Introducing Docker for Windows Server 2016 that left me with mixed feelings. You cannot use “FROM scratch”  to build your own image; you have to start with a “minimal” windows image.


Currently, Microsoft provides microsoft/windowsservercore or microsoft/nanoserver.

The Windows Server Core image comes with a mostly complete userland with the processes and DLLs found on a standard Windows Server Core install.  This image is very convenient: any Windows server software will run on it without modification, but it takes 10GB of disk space! The other base layer option is Nano Server, a new and very minimal Windows version with a pared-down Windows API. The API is not complete, but porting should be easy and it is less than 300MB.

But why do I need at least a 300MB image? The point of containers is to share the kernel of the host, isn’t it?


The explanation is buried in one of the DockerCon presentations, and it makes a lot of sense: the Win32 API is exposed to Windows programs through DLLs, not directly as syscalls.

(Note: I will call it the “Win32 API” even on x64, because there really isn’t any “Win64” API: it is the same! Just look at the names of the libraries: kernel32, gdi32, user32, …)


Of course, internally those DLLs will make syscalls to transition to kernel mode (sort of, more on this later), but the surface you program against in Windows is through user mode DLLs.
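
To make this concrete: even a trivial “touch this file” program goes through kernel32.dll rather than issuing the syscall itself. A minimal C# P/Invoke sketch (the actual experiment in the linked repo is native code; the path here is just an example):

   using System;
   using System.Runtime.InteropServices;
   using Microsoft.Win32.SafeHandles;

   static class Touch
   {
       const uint GENERIC_WRITE = 0x40000000;
       const uint CREATE_ALWAYS = 2;
       const uint FILE_ATTRIBUTE_NORMAL = 0x80;

       // The Win32 surface we program against lives in a user-mode DLL (kernel32),
       // not in a syscall we invoke directly.
       [DllImport("kernel32.dll", CharSet = CharSet.Unicode, SetLastError = true)]
       static extern SafeFileHandle CreateFileW(
           string lpFileName, uint dwDesiredAccess, uint dwShareMode,
           IntPtr lpSecurityAttributes, uint dwCreationDisposition,
           uint dwFlagsAndAttributes, IntPtr hTemplateFile);

       static void Main()
       {
           using (var handle = CreateFileW(@"C:\temp\touched.txt", GENERIC_WRITE, 0,
                                           IntPtr.Zero, CREATE_ALWAYS, FILE_ATTRIBUTE_NORMAL, IntPtr.Zero))
           {
               if (handle.IsInvalid)
                   throw new System.ComponentModel.Win32Exception();
           }
       }
   }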

What really hit me is the sheer number of “basic” components required by Windows nowadays. I started to program using the Win32 API when there were only 3 of these DLLs. OK, 4 if you count advapi32.dll.

Sure, then there was ole32, if you wanted to use OLE, and comctl32, if you wanted “fancy” user controls, and Ws2_32 if you wanted sockets… But they were “optional”, while now without csrss.exe, lsass.exe, smss.exe, svchost, wininit, etc. you cannot even run a “no-op” executable.


Or can you?


I will take an extremely simple application, which “touches” (creates) a very specific file (no user input, to make things easier), and try to remove dependencies and see how far you can get (Spoiler alert: you can!)


I will divide my findings into three blog posts, one for each “step” I went through, and update this post with links every time I post a new one.

  • Step 1: no C runtime
  • Step 2: no Win32 API
  • Step 3: no Native API (no ntdll.dll)
For the impatient: there is a github repo with all the code here :)

# Friday, 07 October 2016

Integrate an "IoT" device with a web application: The Good, the Bad and the Ugly

An interesting scenario that I keep bumping into is: there is a device, which is typically "headless" (no significant UI), but it performs some specific, useful function ("IoT" device. Note the quotes). There is a web app, with a rich UI. Your customers want them to "talk".

I smell trouble at the "talk" part. 

  • Data collection + display? Sure, IoT is born to do exactly this.
  • Control the device, issuing commands? You have to be careful with this one, naive solutions bring a lot of trouble.
  • Getting interactive feedback? (Command - response - Web UI update) Let's talk about this, because it can be done, but it is not so straightforward.

The trouble with this scenario is that it looks so simple to non-technical people (and less technical people alike... I heard an IT manager once ask how he could wire up his scanner to our web application, but that's another story). However, it is so easy to come up with bad or ugly solutions!

Fortunately, with a couple of new-ish technology bits and some good patterns, it is possible to come up with a good solution.

The good 

Well... Yeah, if you follow The Good, The Bad and the Ugly trinity, you have to start with "The Good". But I don't want to talk about the good already, it will spoil all the fun!

Let's come back to the good later.


The bad

<do not do this!>

Sending commands to a device is quite easy. You open a socket on the device, listening. You somehow know where the device is, which address it has (which is an interesting problem in its own right, but I digress), so from your web server you just connect to that socket (address/port) and send your commands. You probably don't have a firewall, but if you have one, just punch a hole through it to let messages pass through.

</do not do this!>

UGH. BAD.

Even if you go through all the effort of making it secure (using an SSH tunnel, for example, but I have seen plain text sockets with ASCII protocols. Open to the Internet.), you are exposing a single port, probably on a low-power device (like an ARM embedded device), possibly using a low-bandwidth channel (like GPRS). How much does it take to DoS it? Probably you don't even need the first D (as in DDoS), you could do it from a single machine.

But let's say you somehow try and cover this hole, maybe with a VPN, inserting a field gateway in front of your "IoT" device(s), or putting a VPN client inside the devices themselves if they are powerful enough (and such a client exists for your platform/architecture. Good luck with some ARM v4 devices with 1MB (Mega) disk space I have seen, but I digress again).

Great, you are probably relieved because now you can have interactive feedback!

<do not do this!>

You see, it is easy. Your user clicks on the page. On the web server, inside the click handler (whatever this is: a controller, a handler, a servlet...) you open a socket, send a command through TCP, and wait for a response. The client receives the command, processes it, answers back through the socket and closes the connection. The web server receives it, prepares an HTTP response and returns it to the web browser. Convenient!

</do not do this!>


Now you have thread affinity all the way: the same thread of execution spans servers, programs, devices. Blocking threads is a performance bottleneck in any case, but it is a big issue on a server.

UGH. BAD.

If the network is slow (and it will be), you may end up keeping the web server thread hanging for seconds. Let's forget about hanging the web browser UI (you can put up a nice animation, or use Ajax), but keeping a web server thread hung for seconds doing nothing is BAD. Like in "2 minutes, and your web server will crash from resource exhaustion" bad.


The ugly

IoT is difficult, real-time web applications are difficult, so let's ditch them. 

We go back one or two decades, and write "desktop" applications. Functionalities provided by the former web applications are exposed as web services, and we consume them from our desktop app. Which is connected directly to the device. Which is not an "Internet of Things" device anymore, maybe an "Intranet of Things" device (I should probably register that term! :) ), if it is not connected by USB.

It makes sense in a lot of cases, if the device and the PC/Tablet/whatever are co-located. But it imposes a lot of physical constraints (there is a direct connection between the device and the PCs/Tablets that can control that device). Also, if the app was a web app to begin with, there are probably good reasons for that: ease of deployment, developers familiar with the framework, runs on any modern web browser, ...

Especially if you discover that half your clients are using Windows PCs, the other half Linux, and a third half Android tablets. Now you need to build and maintain three different desktop applications. Which is an ugly mess.

Besides, how do you reach your "IoT" device now, if it is on a private Intranet? How do you update it, collect diagnostics and logs in a central location? You cannot, or you have to set up complicated firewall rules, policies, local update servers. Again, feasible, but ugly.


The good (finally)

The solution I came up with is to use WebSockets (or better, a multi-transport library like SignalR) + AMQP "response" queues to make it good.

AMQP is a messaging protocol. It is a rising standard, and it is implemented by many (most) queuing servers and event hubs (see my previous post). An interesting usage for AMQP is to create "response queues". A hint on how this might work is given, for example, in the RabbitMQ tutorial. The last tutorial in the series describes how to code an RPC mechanism using AMQP.


The interesting part of the tutorial is in the client code:


   var consumer = new QueueingBasicConsumer(channel);
   var replyQueueName = channel.QueueDeclare(exclusive: true, autoDelete: true).QueueName;

   channel.BasicConsume(queue: replyQueueName,
                        noAck: true,
                        consumer: consumer);

   // Do "Something special" here with replyQueueName

   while (true)
   {
      var ea = (BasicDeliverEventArgs)consumer.Queue.Dequeue();
      if (ea.BasicProperties.CorrelationId == corrId)
      {
          return Encoding.UTF8.GetString(ea.Body);
      }
   }

The client declares a nameless (the name is auto-generated) queue. Plus, the queue should be deleted on disconnection. There are two flags for that: exclusive and autoDelete.

Then it does something with the queue name, and then continuously waits for and reads messages from this queue.

The "something special" is: communicate to the server the device availability, and specify the name of the queue. This queue will be like an inbox for our device: a place on a server where who wants to communicate with us will place a message. The device (or better, a thread/task on the device) will wait for any incoming message, read it, and then dispatch it. The device will act accordingly.

It is important to note that the client is establishing the connection to a server (the queueing system server), not the other way around. This prevents the problem highlighted in the "Bad" section.

WebSockets (and related transport mechanism, like Server-Sent events, long polling, etc.) allow code on the server side to push content to the connected clients as it happens, in real-time.

So the device communicates with the web server using some standard way (direct HTTP POST to the server, or even better posting to a queue, and then having a worker read from the queue and POST to the web server, so you have queue-based load levelling), and the server pushes the update to the client.
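
On the web side, the "push" leg can be as simple as a SignalR hub: when the worker sees a device update come off the queue, it forwards it to the browsers watching that device. A sketch using classic ASP.NET SignalR 2.x (hub, group and method names are hypothetical):

   using System.Threading.Tasks;
   using Microsoft.AspNet.SignalR;

   public class DeviceHub : Hub
   {
       // Browsers join a group per device they are watching.
       public Task Watch(string deviceId) => Groups.Add(Context.ConnectionId, deviceId);
   }

   // In the worker/handler that receives the device's status message:
   // var hub = GlobalHost.ConnectionManager.GetHubContext<DeviceHub>();
   // hub.Clients.Group(deviceId).deviceUpdated(statusPayload);   // pushed in real time to the browser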

Note that the server knows which devices are "on" and can be controlled by the client (because the first thing a device does is to announce to the server its availability, and where it can be contacted), and it can also know which client is talking to which device, because traffic passes through the server:

Put the two together, and you have a working system for real time command + control of "IoT"-like devices, with real-time feedback and response, from a standard web application.

# Thursday, 01 September 2016

Modernizing a legacy IoT device: AMQP, but how?


As in most IoT/cloud processing scenarios, we collect data from multiple producers, and then let a bunch of consumers process the data.
This is a very common design, every (professional) IoT solution I have seen recommends it, and for good reasons: when load increases (you have more producers, or each of them produces more traffic and/or bigger messages), you scale out: you have multiple, concurrent (competing) consumers.

You can do this easily if you decouple consumers and producers, by inserting something that "buffers" data in between the consumer and producer. The benefits are increased availability and load levelling.
An easy way to have this is to place a simple, durable data store in the middle. The design is clean: producers place data into the store, consumers take them out and process them. 
The data store is usually FIFO, and usually a Queue. 

In our case, it wasn't a queue. The design was similar, the implementation... different :)
Still durable, still FIFO-ish, but not a Queue in the sense of Azure Queues or Service Bus Queues or RabbitMQ (typical examples of queues used in IoT projects).

It was just a file server + a TCP client writing to an SQL DB in a very NoSQL-ish way (just a table of events, one event per row, sharded by date). Devices write data to an append-only file, copy them securely using an SSH/SCP tunnel, place them in a directory; a daemon (service) with a file-system watcher (inotify under Linux) takes new files and uploads them to a proper queue for processing; the TCP client notifies about relevant events (new file present, files queued but not uploaded yet, all uploaded, etc.).

Our goal is to replace this structure with something much more standard, where possible, so we can use some modern middleware and ditch some old, buggy software. Or even "get rid" of the middleware and let some Cloud Service handle the details for us.
If you have something similar, in order to modernize it and/or move it to a Cloud environment such as Azure (or if you plan to make your move in the future), you want to use a modern, standard messaging protocol (AMQP, for example), and a middleware that understands it natively (RabbitMQ, Azure Events Hub, ...). But how do you use this protocol? How do you "change" your flow from the legacy devices (producers) to the consumers?
You have three options:
  1. Insert a "stub/proxy" on the server (or cloud) side. Devices talk to this stub using their native protocol (a custom TCP protocol, HTTP + XML payload, whatever). This "stub" usually scales well: you can code it as lightweight web server (an Azure Web Role, for example) and just throw in more machines if needed, and let a load balance (like HAProxy) distribute the load. It is important that this layer just acts as a "collector": take data, do basic validation, log, and throw it in a queue for processing. No processing of messages here, no SQL inserts, so we do not block and we can have rate leveling, survive to bursts, etc.
    This is the only viable solution if you cannot touch the device code.
  2. Re-write all the code on the device that "calls out" using the legacy protocol/wire format, and substitute it with something that talks in a standard supported by various brokers, like AMQP or MQTT. In this way, you can directly talk to the broker (Azure Event Hub, IoT Hub, RabbitMQ, ..), without the need of a stub. 
    This solution is viable only if you fully control the device firmware.
  3. Insert a "broker" or gateway on the device, and then redirect all existing TCP traffic to the gateway (using local sockets), and move the file manager/watcher to the device. Have multiple queues, based on priority of transmission. Have connection-sensitive policies (transmit only the high-priority queue under GPRS, for example). Provide also a way to call directly the broker for new code, so the broker itself will store data and events to files and handle their lifetime. Then use AMQP as the message transport: to the external obsever (the Queue), the devices talks AMQP natively.
    This is a "in the middle solution": you can code / add your own programs to the device, but you do not have to change the existing software.
In our case, the 3rd option is the best one. It gives us flexibility, and the ability to work on a piece of functionality at a time while keeping some of the old software still running.
Plus, it makes it possible to implement some advanced controls over data transmission (a "built-in" way to transmit files in a reliable way, have messages with different priorities, transmission policies based on time/location/connection status, ...).
But why would you want to design a new piece of software that still writes to files, and not just keep a transmission (TX) queue in memory? For the same reason queues in the middleware or in the cloud are durable: fail and recover. Field devices are battery powered, work in harsh conditions, are operated by non-professional personnel. They can be shut down at any moment (no voltage, excess heat, manually turned off), and we have no guarantees all the messages have been transmitted already; GPRS connections can be really slow, or we may be in a location that has no connectivity at all at the moment.

I was surprised to discover that this kind of in-process, durable data structure is... scarce!
I was only able to locate a few:
  • BigQueue (JVM): based on memory mapped files. Tuned for size, not reliability, but claims to be persistent and reliable.
  • Rhino.Queues.Storage.Disk (.NET): Rhino Queues are an experiment from the creator of the (very good) RavenDB. There is a follow up post on persistent transactional queues, as a generic base for durable engines (DB base).
  • Apache Mnemonic (JVM): "an advanced hybrid memory storage oriented library ... a non-volatile/durable Java object model and durable computing"
  • Imms (.NET): "is a powerful, high-performance library of immutable and persistent collections for the .NET Framework."
  • Akka streams + Akka persistence (JVM): two Akka modules, reactive streams and persistence, make it possible to implement a durable queue with a minimal amount of code. A good article can be found here.
  • Redis (Any lang): the famous in-memory DB provides periodic snapshots. You need to forget snapshots and go for the append-only file alternative, which is the fully-durable persistence strategy for Redis.

The last one is a bit of a stretch... it is not in-process, but Redis is so lightweight, so common and so easy to port that it may be possible to run it (with some tweaks) on an embedded device. Not optimal, not my first choice (among the other problems, there is a RAM issue: what if the queue exceeds the memory size?), but probably viable if there is no alternative.

Most likely, given the memory and resource constraints of the devices, it would be wise to cook up our own alternative using C/Go and memory mapped files. This is an area of IoT where I have seen little work, so it would be an interesting new project to work on!
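
For what it's worth, the core of the idea is small. Here is a C# sketch of the append-only, write-side half (the real implementation would likely be C/Go over memory mapped files, with the consumer's read offset persisted in a sidecar file):

   using System;
   using System.IO;

   // Append-only, length-prefixed log: anything Enqueue() returns from has already hit the disk,
   // so a power cut loses at most the message being written.
   sealed class DurableQueue : IDisposable
   {
       private readonly FileStream _log;

       public DurableQueue(string path) =>
           _log = new FileStream(path, FileMode.Append, FileAccess.Write, FileShare.Read);

       public void Enqueue(byte[] payload)
       {
           var length = BitConverter.GetBytes(payload.Length);
           _log.Write(length, 0, length.Length);
           _log.Write(payload, 0, payload.Length);
           _log.Flush(flushToDisk: true);   // force the bytes to the device before acknowledging
       }

       public void Dispose() => _log.Dispose();
   }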


# Wednesday, 31 August 2016

What is an "architect" anyway?

A little break from the series of posts on Pumpkin. Today I had to explain what I do for a living, and it took longer than I expected. But it gave me the time and opportunity to think about what I really do.

I always pause a second when people ask me "what is your job?"
I usually go for a very simple "software developer" or "programmer". After all if it is good enough for Scott Hanselman, I should be fine with it.

Since I have some experience, I sometimes may add "Senior" to it. But I don't feel that "Senior" anyway: I feel young, and I feel like I have always something new to learn, something new to do. "Senior" seems a little too accomplished to me.

Unfortunately, if you speak to people in the same or in a related field, this is rarely enough.
"But which is you role?" "Don't you manage a team of 7?" yes I do, but "manager" is way too nontechnical; "managing" the team, for me, is a mix of architectural and code reviews, coaching, mentoring, everything necessary to ensure the team delivers great things and customers are happy.

"So you are a tech lead!"
Well... I love the technical side of my job. I turned down good offers in the past because the "step-up" role was a pure management one, a common evil in Italy - management is the only "way up".
But my current role is made of purely technical parts (code, design, run sanity checks - check that the proper patterns are used and anti-patterns avoided, for example) and also "soft" parts (be a bridge between customers and tech-speaking people, talk to upper management, customers and stakeholders, present figures and help them make informed decisions, advocate for my team).

This is why my "official" title of "Software architect" at my current job is kind of OK. It is technical, but not purely technical.

But what is an "architect", or a "team lead", anyway?

IMO, or at least in my case, it means being "Primus inter pares" and a "Servant leader".

"Primus inter pares" is a latin expression which roughly translates to "first among equals". It does not really matters if your leadership is sanctioned by the corporate ladder, or if it is an honorary title and you are formally equal to other members of their group. You act as a member of the group; you keep coding and share chores (debugging, bug fixing), otherwise you will lose what it really matters among developer: (unofficial) respect for your skills and knowledge.

Keeping your hands dirty is key for me. I try to keep the balance between the technical and soft side of my job 50/50, and I code whenever I can. Because I like it (I think I will never give up coding, even if I win a billion euros and can retire), and because I need it. It is like physical exercise, or training for a sport: both you and your body know when you need it.

A good objective of leadership is to help those who are doing poorly to do well and to help those who are doing well to do even better.
– Jim Rohn, American entrepreneur.

Robert K. Greenleaf first coined the phrase "servant leadership" in his 1970 essay, "The Servant as a Leader."

As a servant leader, you are a "servant first".
In practice, I try to focus on the needs of my team mates, before considering my own. I acknowledge other people's perspectives, give them the support they need to meet their work and personal goals, involve them in decisions where appropriate, and build a sense of community. I still call the shots (design-by-committee does not really work), but I listen before speaking, and use persuasion over authority.

A great side effect of acting in this way is that it gives you the necessary skills to deal with people "above" you: the ones you cannot use your authority upon, either because they are your boss, or because they are your peers (customers, for example). Your persuasion and reasoning skills are honed, and you are in fantastic shape to make your point, get your message across, and make them listen and consider what you say.

# Monday, 29 August 2016

Cloud, at last!

After some time experimenting, studying, designing (but mostly: presenting possible scenarios to management), we are preparing to move a central part of our systems to the cloud!
Cost savings (especially OPEX - especially linked to sourcing: finding and hiring a good DBA is very hard!), increased availability and resistance to HW failures/catastrophes are the key points I presented to management to help them decide.

On the downside, being ready to move will require a good engineering effort; our systems are very old, but the general architecture built over the years is sound. It was good (surprising and pleasant) to discover how we already used many of the patterns listed in the Azure Cloud Design Patterns Architecture Guidance in our systems.



The legacy components of the system have been extensively extended during the years, and the new parts and paths developed since I joined the company in 2012 always followed a classic pattern which you may recognize from several IoT designs:
  •   Field devices -> Queue (Inbox/Outbox)
  •   Queue -> Processing -> SQL
  •   Commands -> Queue (Inbox) <- Device
  
More precisely:
  • Field devices communicate to a "central" server, which just collects the data, buffers them on a durable (temporary) store. Little or no processing here (basic validation only)
  • On different machines, "consume" the items in the temporary store: pull things from there, persist each event in an "append-only" data store (Event Sourcing)
  • Process the events: generate domain objects through a series of steps (3), from the append-only store events to the final objects persisted in SQL tables (Pipes and Filters)
  • Generate "synthesized" data for reporting and statistics queries (Materialized View)
The back-end is already decomposed into several "medium" services: not really "micro" services, but several HTTP-based services talking through a REST API.
These services are already quite robust: they have to, since they are already exposed to the Internet. In particular, they implement Cache-aside for performance, Circuit Breaker/Retry with exp. backoff when they talk to external services (and, in most cases, even when they talk internally to each other), sharding for big data, throttling for some of the public-facing APIs.
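
As an aside, the retry-with-exponential-backoff policy mentioned above boils down to very little code; a minimal C# sketch (a real one would also distinguish transient from permanent failures, and cap the total delay):

   using System;
   using System.Threading.Tasks;

   // Retry an async call up to maxAttempts times, doubling the delay between attempts.
   static async Task<T> RetryWithBackoffAsync<T>(Func<Task<T>> call, int maxAttempts = 5)
   {
       for (var attempt = 1; ; attempt++)
       {
           try { return await call(); }
           catch (Exception) when (attempt < maxAttempts)
           {
               // 200ms, 400ms, 800ms, ...
               await Task.Delay(TimeSpan.FromMilliseconds(200 * Math.Pow(2, attempt - 1)));
           }
       }
   }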

Technically, the challenge is very interesting. The architecture lends itself well to being ported to the cloud, but to make it really competitive (and to minimize running costs), some pieces will have to be rewritten.
To make the transition as smooth as possible, initially most of the pieces will be less than optimal (mostly IaaS - VMs, SQL storage where NoSQL/Cloud storage would suffice, Compute instances, ..) but will be slowly rewritten to be more efficient, more "cloudy" (App Fabric, Tables, Functions, ...).

Really excited to have begun this journey!

# Saturday, 06 August 2016

Old school code writing (sort of)

As I mentioned in my previous post, online resources on Hosting are pretty scarce. 

Also, writing a Host for the CLR requires some in-depth knowledge of topics you do not usually see in your day-to-day programming, like for example IO completion ports. Same for AppDomains: there is plenty of documentation and resources compared to Hosting, but still, some of the more advanced features, and the underlying mechanisms (how does it work? how does a thread interact with and know about AppDomains?), are not something you can find in a forum.

Luckily, I have been coding for enough time to have a programming library at home. Also, I have always been the kind of guy that wants to know not only how to use stuff, but how they really work, so I had plenty of books on how the CLR (and Windows) work at a low level. All the books I am going to list were already in my library!

The first one, a mandatory read, THE book on hosting the CLR:



Then, a couple of books from Richter:

  

The first one is very famous. I have the third edition (in Italian! :) ) which used to be titled "Advanced Windows". It is THE reference for the Win32 API.
If you go anywhere near CreateProcess and CreateThread, you need to have and read this book.

The second one has a title which is a bit misleading. It is actually a "part 2" for the first one, focused on highly threaded, concurrent applications. It is the best explanation I have ever read of APCs and IO Completion Ports.

  

A couple of very good books on the CLR to understand Type Loading and AppDomains.
A "softer" read before digging into...

  

...the Internals. You need to know what a TEB is and how it works when you are chasing Threads as they cross AppDomains.
And you need all the insider knowledge you may get, if you need to debug cross-thread, managed-unmanaged transitions. And bugs spanning over asynchronous calls. 

My edition of the first book is actually called "Inside Windows NT". It is the second edition of the same book, which described the internals of NT3.1 (which was, despite the name, the first Windows running on the NT kernel), and was originally authored by Helen Custer. Helen worked closely with Dave Cutler's original NT team. My edition covers NT4, but it is still valid today. Actually, it is kind of fun to see how things evolved over the years: you can really see the evolution, how things changed with the transition from 32 to 64 bits (which my edition already covers, NT4 used to run on 64 bit Alphas), and how they changed it for security reasons. But the foundations and concepts are there: evolution, not revolution.

  

And finally two books that really helped me while writing replacements for the ITask API. The first one to tell me how it should work, the second one to show me how to look inside the SSCLI for the relevant parts (how and when the Hosting code is called).

Of course, I did not read all these books before setting to work! But I have read them over the years, and having them in my bookshelf provided a quick and valuable reference during the development of my host for Pumpkin.
This is one of the (few) times when I'm grateful to have learned to program "before Google", in the late '90s/early '00s. Reading a book was the only way to learn. It was slow, but it really fixed the concepts in my mind.

Or maybe I was just younger :)


# Thursday, 04 August 2016

IL rewriting + AppDomain sandboxing + Hosting

So, in the end, what went into Pumpkin?

Was control performed at compilation time or execution time? And if at execution time, using which technique?

In general, compilation has a big pro (you can immediately notify the snippet creator that he did something wrong, and even prevent the code block from becoming an executable snippet) and a big con (you control only the code that is written. What if the user code calls down some (legitimate) path in the BCL that results in an undesired behaviour?)

AppDomain sandboxing has some big pros (simple, designed with security in mind) and a big con (no "direct" way to control some resource usage, like thread time or CPU time).
Hosting has a big advantage (fine control of everything, also of "third" assemblies like the BCL) which is also the big disadvantage (you HAVE to do anything by yourself).

So each of them can handle the same issue with different efficacy. Consider the issue of controlling thread creation:
  • at compilation, you "catch" constructs that create a new thread (new Thread, Task.Factory.StartNew, ThreadPool.QueueUserWorkItem, ...)
    • you have to find all of them, and live with the code that creates a thread indirectly.
    • but you can do wonderful things, like intercepting calls to thread and sync primitives and substitute them - run them on your own scheduler!
  • at runtime, you:
    • (AppDomain) check periodically. Count new threads from the last check.
    • (hosting) you are notified of thread creation, so you monitor it.
    • (debugger) you are notified as well, and you can even suspend the user code immediately before/after.

Another example:
  • at compilation, you control which namespaces can be used (indirectly controlling the assembly)
  • at runtime you can control which assemblies are really loaded (you are either notified OR asked to load them - and you can prevent the loading)

What I ended up doing is to use a mix of techniques. 

In particular, I implemented some compiler checks.
Then, run the compiled IL in a separate AppDomain with a restricted PermissionSet (sandboxing).
Then, run all the managed code in a hosted CLR.

I am not crazy...
 

Guess who is using the same technique? (well, not compiler checks/rewriting, but AppDomain sandboxing + Hosting?)
A piece of software that has the same problem, i.e. running unknown, third party pieces of code from different parties in a reliable, efficient way: IIS.
There is very little information on the subject; it is not one of those things for which you have extensive documentation already available. Sure, MSDN has documented it (MSDN has documentation for everything, thankfully), but there is no tutorial, or Q&As on the subject on StackOverflow. But the pieces of information you find in blogs and articles suggests that this technology is used in two Microsoft products: SQL Server, for which the Hosting API was created, and IIS.

Also, this is a POC, so one of the goals is to let me explore different ways of doing the same thing, and assess robustness and speed of execution. Testing different technologies is part of the game :)

Building barriers: compilation time VS execution time

In order to obtain what we want, i.e. fine grained resource control for our "snippets", we can act at two levels:

  • compilation time
  • execution time

Furthermore, we can control execution in three ways:

  1. AppDomain sandboxing: "classical" way, tested, good for security
  2. Hosting the CLR: greater control on resource allocation
  3. Execute in a debugger: even greater control on the executed program. Can be slower, can be complex

Let's examine all the alternatives.

Control at compilation time

Here, as I mentioned, the perfect choice would be to use the new (and open-source) C# compiler.

It divides the compilation phases well, has a nice API, and can be used to recognize “unsafe” or undesired code, like unsafe blocks, pointers, creation of unwanted classes or calls to undesired methods.

Basically, the idea is to parse the program text into a SyntaxTree, extract the nodes matching some criteria (e.g. DeclarationModifiers.Unsafe, calls to File.Read, ...), and raise an error. Another possibility is to write a CSharpSyntaxRewriter that wraps (for diagnostics) or completely replaces some classes or methods.
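
Just to illustrate the idea (as explained next, Roslyn could not actually be used here), a purely syntactic check might look like this; the forbidden constructs are just examples:

using System.Linq;
using Microsoft.CodeAnalysis;
using Microsoft.CodeAnalysis.CSharp;
using Microsoft.CodeAnalysis.CSharp.Syntax;

static class SnippetChecks
{
    // Reject snippets that contain unsafe blocks or touch the File class (syntactic check only).
    public static bool LooksSafe(string code)
    {
        SyntaxNode root = CSharpSyntaxTree.ParseText(code).GetRoot();

        bool hasUnsafe = root.DescendantNodes().OfType<UnsafeStatementSyntax>().Any();
        bool touchesFiles = root.DescendantNodes()
                                .OfType<MemberAccessExpressionSyntax>()
                                .Any(m => m.Expression.ToString() == "File");

        return !hasUnsafe && !touchesFiles;
    }
}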

Unfortunately, Roslyn is not an option: StackOverflow requirements prevent the usage of this new compiler. Why? Well, users may want to show a bug, or ask about a particular behaviour they are seeing in version 1 of C# (no generics), or version 2 (no extension methods, no lambdas, etc.). So, for the sake of fidelity, it is required that the snippet can be compiled with an older version of the compiler (and no, the /langversion switch is not really the same thing).

An alternative is to act at a lower level: IL bytecode. 
It is possible to compile the program, and then inspect the bytecode and even modify it. You can detect all the kinds of unsafe code you do not want to execute (unsafe, pointers, ...), detect the usage of Types you do not want to load (e.g. through a whitelist), and insert "probes" into the code to help you catch runaway code.

I'm definitely NOT thinking about "solving" the halting problem with some fancy new static analysis technique... :) Don't worry!

I'm talking about intercepting calls to "problematic" methods and wrap them. So for example:

static void ThreadMethod() {
   while (true) {
      new Thread(ThreadMethod).Start();
   }
}
This is a sort of fork bomb

(funny aside: I really coded a fork bomb once, 15 years ago. It was on an old Digital Alpha machine running Digital UNIX we had at the university. The problem was that the machine was used as a terminal server powering all the dumb terminals in the class, so bringing it down meant the whole class halted... whoops!)

After passing it through the IL analyser/transpiler, the method is rewritten (compiled) to:


static void ThreadMethod() {
   while (true) {
      new Wrapped_Thread(ThreadMethod).Start();
   }
}

And in Wrapped_Thread.Start() you can add "probes", perform every check you need, and allow or disallow certain behaviours or patterns. For example, something like: 

if (Monitor[currentSnippet].ThreadCount > MAX_THREADS)
  throw new TooManyThreadException();

if (OtherConditionThatWeWantToEnforce)
  ...

innerThread.Start();


You intercept all the code that deals with threads and wrap it: thread creation, synchronization object creation (and wait), setting thread priority ... and replace them with wrappers that do checks before actually calling the original code.

You can even insert "probes" at predefined points: inside loops (when you parse a while, or a for, or (at IL level), before you jump), before functions calls (to have the ability to check execution status before recursion). These "probes" may be used to perform checks, to yield the thread quantum more often (Thread.Sleep(0)), and/or to check execution time, so you are sure snippets will not take the CPU all by themselves. 

An initial version of Pumpkin used this very approach. I used the great Cecil project from Mono/Xamarin. IL rewriting is not trivial, but at least Cecil makes it less cumbersome. This sub-project is also on GitHub as ManagedPumpkin.
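
For a flavour of what such a pass looks like with Cecil, here is a stripped-down sketch that only detects direct Thread creation; the real rewriter also swaps the operand for the wrapper's constructor and writes the assembly back out (the file name and messages are illustrative):

using System;
using System.Linq;
using Mono.Cecil;
using Mono.Cecil.Cil;

static class ThreadScanner
{
    static void Main()
    {
        var assembly = AssemblyDefinition.ReadAssembly("snippet.dll");
        foreach (var type in assembly.MainModule.Types)
        foreach (var method in type.Methods.Where(m => m.HasBody))
        foreach (var instruction in method.Body.Instructions)
        {
            var ctor = instruction.Operand as MethodReference;
            if (instruction.OpCode == OpCodes.Newobj && ctor != null &&
                ctor.DeclaringType.FullName == "System.Threading.Thread")
            {
                // Here the rewriter would replace the operand with Wrapped_Thread's constructor
                // (imported into this module) instead of just reporting it.
                Console.WriteLine("{0} creates a Thread directly", method.FullName);
            }
        }
    }
}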

And obviously, whatever solution we may choose, we do not let the user change thread priorities: we may even run all the snippets in a thread with *lower* priority, so the "snippet" manager/supervisor classes are always guaranteed to run.

Control at execution time

Let's start with the basics: AppDomain sandboxing is the bare minimum. We want to run the snippets in a separate AppDomain, with a custom PermissionSet. Possibly starting with an almost empty one. 

Why? Because AppDomains are a unit of isolation in the .NET CLI used to control the scope of execution and resource ownership. It is already there, with the explicit mission of isolating "questionable" assemblies into "partially trusted" AppDomains. You can select from a set of well-known permissions or customize them as appropriate. Sometimes you will hear this approach referred to as sandboxing.

There are plenty of examples on how to do that, it should be simple to implement (for example, the PTRunner project).
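
For reference, creating such a sandboxed AppDomain takes only a handful of lines; a simplified sketch of the usual pattern (loading and invoking the snippet through a MarshalByRefObject proxy is omitted, and the path is just an example):

using System;
using System.Security;
using System.Security.Permissions;

// Grant set that only allows execution: no file, registry or network access by default.
var permissions = new PermissionSet(PermissionState.None);
permissions.AddPermission(new SecurityPermission(SecurityPermissionFlag.Execution));

// The snippet assembly is deployed under its own ApplicationBase.
var setup = new AppDomainSetup { ApplicationBase = @"C:\pumpkin\sandbox" };

var sandbox = AppDomain.CreateDomain("SnippetSandbox", null, setup, permissions);

// ... create a MarshalByRefObject runner inside "sandbox", run the snippet, then:
AppDomain.Unload(sandbox);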

AppDomain sandboxing helps with the security aspect, but can do little about resource control. For that, we should look into some form of CLR hosting.

Hosting the CLR

"Hosting" the CLR means running it inside an executable, which is notified of several events and acts as a proxy between the managed code and the unmanaged runtime for some aspects of the execution. It can actually be done in two ways:

1. "Proper" hosting of the CLR, like ASP.NET and SQL Server do

Looking at what you can control through the hosting interface, you see that, for example, you can control and replace all the native implementations of "task-related" (thread) functions.
It MAY seem overkill. But it gives you complete control. For example, there was a time (a beta of CLR v2 IIRC) in which it was possible to run the CLR on fibers, instead of threads. This was dropped, but gives you an idea of the level of control that can be obtained.

2. Hosting through the CLR Profiling API (link1, link2)

You can monitor (and DO!) a lot of things with it: I used it in the past to do on-the-fly IL rewriting (you are notified when a method is JIT-ed and you can modify the IL stream before JIT). A past project of mine used it for a similar thing, to monitor thread synchronization... I should have talked about it on this blog years ago!

In particular, you can intercept all kind of events relative to memory usage, CPU usage, thread creation, assembly loading, ... (it is a profiler, after all!).
A hypothetical snippet manager running alongside the profiler (which you control, as it is part of your own executable) can then use a set of policies to say "enough!" and terminate the offending snippet's threads.

Debugging

Another project I did in the past involved using the managed debugging API to run code step-by-step.

This gives you plenty of control, even if you do not do step-by-step execution: you can make the debugged code "break into" the debugger at thread creation, exit, ... And you can issue a "break" at any time, effectively gaining complete control over the debugged process (after all, you are a debugger: it is your raison d'etre to inspect running code). It can be done at regular intervals, preventing resource depletion by the snippet.

# Sunday, 31 July 2016

Choices, choices, choices...

How would you design and write a system that takes some C# code and runs it "in the browser"?

In general, my answer would be: Roslyn. Roslyn was already quite hot and mature at the end of 2014; having something like scriptcs would give you complete control on each line of code you are going to execute.

But this particular project, being something that must work for StackOverflow, had several constraints, most of which were in stark contrast with one another:
  • High fidelity: if I am asking a question about a peculiar problem I am having with C# 1 on .NET 1.1, I want my "snippet" to behave as if it is compiled with C# 1 and run on .NET CLR 1.1
  • Safe: can you just compile and execute your snippet inside your IIS? Mmmm.. not a great idea...
  • High performance: can you spin up a VM (or a container), wait for it to be ready, "deploy" the snippet, execute it, get it back? That would be very safe, but a bit slow.

Safety/security is particularly important. For example: you do not want users to use WMI to shutdown the machine, or open a random port, install a torrent server, read configuration files from your machine, erase files...
For safety, we want to be able to handle dependencies in a sensible way. Also, some assemblies/classes/methods just do not make any sense in this scenario: Windows Forms? Workflow Foundations? Sql?
For safety and performance, we want to monitor and cap resource usage (no snippets that never terminate).

Going a bit deeper, I started to sketch out some constraints. It turns out that we need to disallow some things, even if this means going against the goal of "high-fidelity":
  • no "unsafe", no pointers
  • no p/invoke or unmanaged code
  • nothing from the server that runs the snippet is accessible: no file read, no access to local registry (read OR write!)
  • no arbitrary external dependency (assemblies): whitelist assemblies

Also, we need control over some "resources". We cannot allow snippets to get an unlimited or uncontrolled amount of them.
  1. limit execution time
    • per process/per thread?
    • running time/execution time
  2. limit kernel objects
    • thread creation (avoid "fork-bombs")
    • limit others too? Events, mutexes, semaphores...
    • deny (or handle in a sensible way) access to named kernel objects (e.g. named semaphores.. you do not want some casual interaction with them!)
  3. limit process creation (zero?)
  4. limit memory usage
  5. limit file usage (no files)
  6. limit network usage (no network)
    • in the future: virtual network, virtual files?
  7. limit output (Console.WriteLine, Debug.out...)
    • and of course redirect it
Does it sound familiar? For me, it did when I learned about something called cgroups. Too bad we don't have it in Windows! Yes, there are Job Objects, but they do not cover every aspect.

Could we have cgroups-like control for .NET applications?