# Tuesday, 10 January 2006

On openMosix


The first clustering architecture I am going to speak about is openMosix. openMosix is a Linux-only solution, for reasons that will be clear, but the concepts are applicable to every OS with virtual memory architecture. I think that a port of these ideas to the Windows OSes can be very interesting, but enormously challenging (at least for developers that cannot access the sources) and maybe not so paying for: other architectures that require a shift in the concurrent/distributed programming paradigm can bring more benefits at last.

Anyway, openMosix is unique for its (almost) complete tranparency: processes can be migrated to other nodes, and distributed computing could happen, without any intervent on the user or programmer side. openMosix turns a cluster into a big multi-processor machine.

The openMosix architecture consists of two parts:
  • a Preemptive Process Migration (PPM) mechanism and
  • a set of algorithms for adaptive resource sharing. 
Both parts are implemented at the kernel level, thus they are completely transparent to the application level.
The PPM can migrate any process, at anytime, to any available node. Usually, migrations are based on information provided by one of the resource sharing algorithms.
Each process has an home node, the machine where it was created. Every process seems to run at its home node, and all the processes of a user's session share the execution environment of the home node. Processes that migrate to other nodes use the new nodes resources (memory, files, etc.) whenever possible, but interact with the user's environment through the home node.
Until recently, the granularity of the work distribution in openMosix was the process. Users where able to run parallel applications by starting multiple processes in one node, and then the system distributed these processes to the best available nodes at that time; then the load-balancing algorithm running on each node decided when to relocate resources due to changes on nodes load. Thus, openMosix has no central control or master/slave relationship between nodes.

This model makes openMosix not so different from MPI-Beowulf clusters. Fortunately, recent work brought openMosix granularity down to thread level, enabling "migration of shared memory", i.e. the migration of pages of the process address space to other nodes. This feature permits to migrate multi-threaded applications.

Processes and threads in Linux
(Figures from the MigShm technical report and presentation: The MAASK team (Maya, Asmita, Anuradha, Snehal, Krushna) designed and implemented the migration of shared memory on openMosix)


For process migration, openMosix creates a new memory descriptor on the remote node. This is fine for normal processes, but could cause problems for threads. Because a thread shares almost all of its memory pages with its parent (all but the thread stack and TLS) when threads of the same parent process are migrated, they need to share a common memory descriptor. If they have different descriptors, these threads could point to false segments.
When a thread is migrated, openMosix migrates only the user mode stack of that particular thread. The heap is migrated "on demand", paying attention to the case in which the same node is already executing threads of the same process to ensure consistency.




openMosix + MigShm control flow
Other features of the process are the ridefinition of shared-memory primitives (shalloc() etc.) and linux thread primitives, a transparent Eager Release consistency policy, and the addition of an algorithm for adaptive resource sharing based on the frequency of shared memory usage and the load across the cluster, so that threads are migrated in a way that decreases the remote accesses to the shared memory.

Processes, Threads and Memory space

This piece of software is a very interesting and good technical quest, however the question is: it is really worth the effort? Could it scale well? Making processes, and above all developers, thinking that they only have to add threads can be misleading. And multi-thread programming requires locking, explicit synchronization, and to scale well a thoughtful administration of running threads. Threads and semaphores are starting to become uncomfortable even for multi-thread programming on a single machine.
My personal opinion is that the future is going in the other direction. There will be no shared memory, and distributed, multithreaded or clustered computation will all have the same interface, with no shared memory. The problem is that memory is lagging behind.

Processes where created for having different units of execution on the same CPU. When they were introduced, we had multiple processes all runnig in the same address space (directly into the physical address space, at that time).
Then, fortunately there was the advent of Virtual Memory, and of private virtual address spaces. We had a balanced situation: every process thought to be the only one in the machine, and to have a whole address space for its own purposes. Communication with other processes was possible, mainly message based. At that time, IPC was substantially the same if processes where on the same machine or in different machines: the main methods where sockets and named pipes.
The introduction of threads put again the system out of balance: every process had many threads of execution, sharing the same address space.

According to my historic Operating System textbook, a process is a program in execution
"with it’s current values of program counter, registers and variables; conceptually every process has it’s own virtual CPU" - A.S.Tanenbaum.
This is very close to the way modern OSes treat processes, running them in a context of execution virtually independent from the others.
Threads instead
"allow a lot of executions in the environment of a process, in wide measure independent the one from the others" - A.S. Tanenbaum.
However this definition for threads is not so close to reality: threads are not so independent among them, because they always share a primary resource (the common addressing space of the parent process).
openMosix "solves" this problem (making threads "independent" again) migrating trasparentely the required memory pages. 
But it is possible to restore the balance again? What about changing the affinity of memory from process to thread? Notice that here I am not talking about reintroducing the concept of virtual memory space for threads; modern OS uses the processor architecture to enforce and enable virtual memory for processes, at the overhead we all know; furthermore, you can't "box" addresses space one inside the other. What I am thinking about is a "light" thread that encapsulate both its code, its state (the stack) AND its data. If another thread want those data, it must ask them, and the thread that owns the data must be willing to share them. Like in the IPC case back in the '80, but without the burden of context switch unless necessary (i.e. when the thread and its data resides in another process or on another machine).

Application Domains

To complicate this design, .NET brought us Application Domains. Application Domains are designed to be "light-weight" precesses, as Chris Brumme explains. But they are "unrelated" to threads.

Wires?

In my opinion, we need light threads, let call them wires, that live in the managed space (so they not clobber the scheduler), have their memory and their message based  primitives for communication. Use should be simpler than threads; a good starting point may be Join-calculus, or C-omega, or any other language that support asynchronous or active functions. Those fuctions should map directly to wires, and the runtime will map them to native "tasks" (processes or threads or fibers) so that users can finally stop to worry about hacks to mitigate thread performance limitations (number of threads, thread pools, completion ports) and explicit synchronization (sempahores, mutexes, race conditions).
Wires could also adapt very well to a distributed environment: since they carry with them their data, they can be "detached" from a computational node and "re-attached" to a different destination node.