# Wednesday, 05 February 2014

Lambdas in Java 8

Today I will introduce a feature of the upcoming Java 8 that I really like: lambdas. Support for lambdas and higher-order (HO) functions is what made me switch to Scala when I went back to JVM-land a couple of years ago: after tasting lambdas in C#, I wasn't able to go back to a programming language without them.
Now Java 8 promises to bring them to Java: something I have been waiting for for a long time!

Lambdas, higher order functions... what?


First thing: do not get scared by terminology. The whole concept is borrowed from functional programming languages, where functions are king (just like objects are king in Object Oriented Programming).
Functional programmers love to define every theoretical aspect in detail, hence the fancy words.
But here I want to keep it simple; the whole concept around lambdas and HO functions is that you can pass functions as arguments to other functions.

Functional Java


Passing functions around is incredibly useful in many scenarios, but I want to focus on the very best one: handling collections.

Suppose we have a collection of Files, and we want to perform a very common operation: go through these files and do something with them. Perhaps we want to print all the directories:

static void printAllDirectoriesToStdout(List<File> files) {
  for (File f: files) {
      if (f.isDirectory()) {
          System.out.println(f.getName());
      }
  }
}


Or print all the “big” files:

static void printBigFilesToStdout(List<File> files) {
  for (File f: files) {
      if (f.length() > threshold) {
          System.out.println(f.getName());
      }
  }
}


Have you spotted the problem? Yes, there is some code duplication.
In Java, we already have a tool to work around it: object orientation.

interface IFilter<T> {
   boolean isOk(T file);
}

public static void printFilesToStdout(List<File> files, IFilter<File> filter) {
  for (File f: files) {
      if (filter.isOk(f)) {
          System.out.println(f.getName());
      }
  }
}


Now we can implement our original functions using specific “Filter” classes, or even anonymous classes:

printFilesToStdout(files, new IFilter<File>() {
  public boolean isOk(File file) { return file.isDirectory(); }
});


This is already quite close to passing a function around; what you pass instead is a functional interface: an interface that contains only one abstract method. Anonymous classes implementing functional interfaces are really just single inline functions... lambdas!

printFilesToStdout(files, (File f) -> f.isDirectory());

Aggregate Operations


So far... cool! Shorter, more general, more readable.
But what really matters is the ecosystem built around this simple concept. It is possible to write general functions that accept functions, and the new JDK already provides the most useful ones: aggregate operations.

As an example, take our “IFilter” functional interface. With it, you can build a “filter” function:

static <T> Collection<T> filter(Collection<T> c, IFilter<T> filter) { … }

which is one of these aggregate operations. The Stream interface defines it and many others as methods, making them even easier to compose through chaining.

I just want to give you a hint of how they can be used for complex collection processing.
Do you want to get all the big tar files, open them, and print every word in each text file you find inside?

files.stream().
      filter(f -> f.length() > 100 && isTar(f)).
      flatMap(f -> openTarStream(f).getEntries()).
      filter(entry -> isText(entry.getFile())).
      flatMap(entry -> ReadAllLines(entry.getFile())).
      flatMap(line -> Stream.of(line.split(" "))).
      forEach(word -> System.out.println(word));

Compact, clean.

Now try to do it in Java 7... look at the indentation! It is easy to get lost in the mechanics of collection handling.

We have only scratched the surface of what can be done with lambdas; together with aggregate operations and generics, they are a very powerful tool that makes most data transformation operations easy. And they are very efficient too! But that is something for another post.

# Thursday, 02 January 2014

The MiniPoint Workflow language

The workflow language was, for me, the most fun part of MiniPoint. I love working on languages; whether they are small or big is not important, as long as they are interesting. In fact, MiniPoint has more than one language/parser: workflows, document templates, AngularJS expressions, ... Most are tiny, but every one of them makes the code cleaner and the user experience more pleasant.

Take, as an example, the simple boolean language used to express the visibility of a field on a view; as I mentioned in my previous post, you can make individual fields visible or not (and required or not) using boolean expressions (or, and, not, <, >, ==, <>) over constants and other fields in the view (or in the schema).
The expression is parsed and then analyzed to produce two things: an AngularJS expression, which is inserted into the ng-required and ng-show/ng-hide attributes to make it work entirely on the client side, and the list of affected fields.
What is the purpose of this list? Remember that a view is only a subset of the schema, but in these visible/required expressions you can refer to other members of the schema as well (from previous views, usually).
AngularJS initializes its "viewmodel" (the $scope) with an ajax request (getting JSON data from an ASP.NET controller); for efficiency, we keep this data at a minimum, usually a subset of the fields in the view (read-only fields, for example, are rendered on the server and not transmitted). When we have an expression, however, the fields it references need to end up in the $scope too, hence the need to parse and analyze the expressions.
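
To give an idea of the analysis involved, here is a minimal sketch (not the actual MiniPoint code; the AST types, the "model." prefix and the method names are made up for illustration) of walking a parsed expression to collect the referenced fields and to emit an AngularJS-friendly string:

using System.Collections.Generic;

// Hypothetical AST for the visibility language: literals, field references
// and binary operators (and, or, <, >, ==, <>).
abstract class Expr
{
    public abstract string ToAngular();                      // client-side (ng-show/ng-required) form
    public abstract void CollectFields(ISet<string> fields);  // fields the $scope must contain
}

class Literal : Expr
{
    public string Text;   // already in an AngularJS-compatible form, e.g. "42" or "'abc'"
    public override string ToAngular() { return Text; }
    public override void CollectFields(ISet<string> fields) { }
}

class FieldRef : Expr
{
    public string Name;
    public override string ToAngular() { return "model." + Name; }
    public override void CollectFields(ISet<string> fields) { fields.Add(Name); }
}

class Binary : Expr
{
    public string Op;           // "and", "or", "<", ">", "==", "<>"
    public Expr Left, Right;

    public override string ToAngular()
    {
        return "(" + Left.ToAngular() + " " + MapOp(Op) + " " + Right.ToAngular() + ")";
    }

    public override void CollectFields(ISet<string> fields)
    {
        Left.CollectFields(fields);
        Right.CollectFields(fields);
    }

    static string MapOp(string op)
    {
        switch (op)
        {
            case "and": return "&&";
            case "or":  return "||";
            case "<>":  return "!=";
            default:    return op;
        }
    }
}

The list produced by CollectFields is what tells the controller which extra schema members have to be included in the JSON sent to the client.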

But I am digressing; I will write more about the interaction and integration of AngularJS and Razor (and MVC) in another post.

Now I would like to talk about some aspects of the workflow language that needed a bit of thinking on how to best implement them.

I wanted it to be simple, natural to use (i.e. you can use statements/expressions/constructs wherever it makes sense and expect them to work) but still powerful enough. And have a clean grammar too :)

I wrote some languages in the past, but this is the first one where statement terminators (';') are optional: you can just use line breaks.
The people who are going to write the schemas and workflows (so not the end users, but the "power users", or site administrators) have a strong background in ... VBA. Therefore, whenever a decision about the language came up, I tried to use a VBA-like syntax, to give it a familiar look. So, for example, If-EndIf instead of braces { }.
And I wanted to do it because it was interesting, of course! I had to structure my semantic actions a bit differently, as I was getting reduce-reduce conflicts using my usual approach.

On the surface, it seems that you have statements (very similar to other programming languages), choices (if-then-else-endif) and gotos. I know... Ugh! Gotos! Bear with me :)

step ("view1")

var i = 10

if (i + me.SomeField > 20)
  i = i - 20
  goto view1
else
  goto end
endif

//Generate a report, using the "report1" template
report ("report1")

step ("final"): end 

Under the hood, things are a bit... different. Remember, this is a textual language for a flowchart. So, "step" is actually an input/output block (parallelogram); statements and reports are generic processing steps (rectangles); the "if-then-else" is a choice (rhombus). Therefore, if-then-else has a stricter than usual syntax, and it's actually:
IF (condition) [statements] GOTO ELSE [statements] GOTO ENDIF
so that the two possible outcomes are always steps in the workflow.

Therefore, under the hood you have a list of "steps", each of which may be a statement list (like "var i = 10"), an input/output step ("step" or "delay"), or a branch.
As a consequence, the language somehow has two levels: at the first level you have "steps"; then, among "steps" or inside them (look at the if-then-else in the example) you can have expressions and statements like in most other programming languages. The two levels appear quite clearly in the grammar, but I think it is difficult to tell from the syntax, and this is exactly what I wanted to accomplish. The people who used it to author workflows were quite pleased, and used it with no problems after very little training.

Translation to WF activities was fun as well: I built a custom composite activity to schedule all the steps; also, statements (instead of receiving their own activity) were merged together and executed by the main composite activity, to improve efficiency (and to make it easier to add other statements: a new one does not require a new activity).

# Monday, 10 June 2013

A "Mini" workflow engine. Part 4: "out-of-order" composite activities, internal bookmarks, and execution.

Last time, we saw how a sequence of activities works with internal bookmarks to schedule the execution of the next step, introducing a queue to hold these bookmarks.
We left with a question: what if execution is not sequential?
In fact, one of the advantages of an extensible workflow framework is the ability to define custom statements; we are not limited to Sequence, IfThenElse, While, and so on, but we can invent our own, like Parallel:

public class Parallel : CompositeActivity
{
    protected internal override ActivityExecutionStatus Execute(WorkflowInstanceContext context) {
        foreach (var activity in children)
            context.RunProgramStatement(activity, this.ContinueAt);
        return ActivityExecutionStatus.Executing;
    }

    private void ContinueAt(WorkflowInstanceContext context, object arg) {
        foreach (var activity in children)
            // Check that every activity is closed...
            if (activity.ExecutionStatus != ActivityExecutionStatus.Closed)
                return;

        CloseActivity(context);
    }
}

And we could even implement different behaviours here (like speculative execution: try several alternatives, and choose the one that completes first).
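
For instance, a speculative composite might look like this rough sketch (same hypothetical runtime types as the Parallel above; cancelling the "losing" children is left out):

public class Speculative : CompositeActivity
{
    private bool completed;

    protected internal override ActivityExecutionStatus Execute(WorkflowInstanceContext context) {
        // Start every alternative, exactly like Parallel does
        foreach (var activity in children)
            context.RunProgramStatement(activity, this.ContinueAt);
        return ActivityExecutionStatus.Executing;
    }

    private void ContinueAt(WorkflowInstanceContext context, object arg) {
        // The first child to complete "wins"; the others should be cancelled
        // (cancellation is omitted in this sketch)
        if (completed)
            return;
        completed = true;
        CloseActivity(context);
    }
}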

Clearly, a queue does not work with such a statement. We can again use delegates to solve this issue, using a pattern that is used in several other scenarios (e.g. forms management).
When a composite activity schedules a child activity for execution, it subscribes to the activity to get notified of its completion (when the activity finishes and becomes "closed"). Upon completion, it can then resume execution (the "internal bookmark" mechanism).

child.Close += this.ContinueAt;
Basically, we are moving the bookmark and (above all) its continuation to the activity itself. It will be the role of CloseActivity to find the delegate and fire it. This is exactly how WF works...

context.CloseActivity() {
  ...
  if (currentExecutingActivity.Close != null)
     currentExecutingActivity.Close(this, null);

}
... and it is the reason why WF needs to keep explicit track of the current executing activity.

Personally, I find the "currentExecutingActivity" part a little unnerving. I would rather say explicitly which activity is closing:

context.CloseActivity(this);
Moreover, I would enforce a call to CloseActivity with the correct argument (this):

public class Activity {
   private void CloseActivity(WorkflowInstanceContext context) {
      context.CloseActivity(this); // internal
   }
}

Much better!

Now we have almost everything sorted out; the only remaining issue is recursion. Remember: when we call the continuation directly, we end up with recursive behaviour. This may (or may not) be a problem; but what can we do to make the two pieces really independent?

We can introduce a simple execution queue:

public class ExecutionQueue
{
    private readonly BlockingCollection<Action> workItemQueue = new BlockingCollection<Action>();

    public void Post(Action workItem) {
           ...
    }
}
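
The missing piece is just a pump that drains the queue; one possible shape (a sketch of mine, not necessarily the actual MiniWorkflow code) is a single dedicated thread consuming the collection:

using System;
using System.Collections.Concurrent;

public class ExecutionQueue
{
    private readonly BlockingCollection<Action> workItemQueue = new BlockingCollection<Action>();

    public void Post(Action workItem) {
        workItemQueue.Add(workItem);
    }

    // Runs on a single dedicated thread: takes work items in FIFO order
    // and executes them, blocking when the queue is empty.
    public void Pump() {
        foreach (var workItem in workItemQueue.GetConsumingEnumerable())
            workItem();
    }
}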
Now, each time we have a continuation (from an internal or external bookmark), we post it to the execution queue. This makes the code for RunProgramStatement much simpler:

internal void RunProgramStatement(Activity activity, Action<WorkflowContext, object> continueAt)
{
    logger.Debug("Context::RunProgramStatement");

    // Add the "bookmark"
    activity.Closed = continueAt;

    // Execute the activity
    Debug.Assert(activity.ExecutionStatus == ActivityExecutionStatus.Initialized);
    ExecuteActivity(activity);
} 
As you may have noticed, we are giving away a little optimization: a direct call when the activity completes immediately would be much faster.
Also, why would we need a queue? We could do just fine using Tasks:

Task.Factory.
    StartNew(() => activity.Execute(this)).
    ContinueWith((result) =>
    {
        // The activity already completed?
        if (result.Result == ActivityExecutionStatus.Closed)
            continueAt(this, null);
        else
        {
            // if it didn't complete, an external bookmark was created
            // and will be resumed. When this will happen, be ready!
            activity.OnClose = continueAt;
        }
    });


internal void ResumeBookmark(string bookmarkName, object payload) {            
    ...
    Task.Factory.
        StartNew(() => bookmark.ContinueAt(this, payload));
}

internal void CloseActivity(Activity activity) {
    if (activity.OnClose != null)
    Task.Factory.
        StartNew(() => activity.OnClose(this, null));
}
This makes things nicer (and potentially parallel too). Potentially, because in a sequential composite activity like Sequence, a new Task is fired only upon completion of the previous one. Things will be different for the Parallel activity, however. We will need to be careful with shared state and with input and output arguments though, as we will see in a future post.

Are there other alternatives to a (single threaded) execution queue or to Tasks?

Well, every time I see this pause-and-resume-from-here kind of execution, I think of IEnumerable. For example, I was really fond of the Microsoft CCR and how it used IEnumerable to make concurrent, data-flow style asynchronous execution easier to read:

    int totalSum = 0;
    var portInt = new Port<int>();

    // using CCR iterators we can write traditional loops
    // and still yield to asynchronous I/O !
    for (int i = 0; i < 10; i++)
    {
        // post current iteration value to a port
        portInt.Post(i);

        // yield until the receive is satisfied. No thread is blocked
        yield return portInt.Receive();               
        // retrieve value using simple assignment
        totalSum += portInt;
    }
    Console.WriteLine("Total:" + totalSum);

(Example code from MSDN)
I used the CCR and I have to say that while it is initially awkward, it is a style that grows on you; it is very efficient, and it was a way of representing asynchronous execution with a "sequential" look, well before async/await.

The code for ReadLine would become much clearer:

public class ReadLine : Activity
{
    public OutArgument<string> Text = new OutArgument<string>();       

    protected override IEnumerator<Bookmark> Execute(WorkflowInstanceContext context, object value)
    {
        // We need to wait? We just yield control to our "executor"
        yield return new Bookmark(this);

        this.Text.Value = (string)value;

        // Finish!
        yield break;
        // This is like returning ActivityExecutionStatus.Close,
        // or calling context.CloseActivity();            
    }
}
Code for composite activities like sequence could be even cleaner:

protected internal override IEnumerator<Bookmark> Execute(WorkflowInstanceContext context)
{
    // Empty statement block
    if (children.Count == 0)
    {
        yield break;
    }
    else
    {
        foreach (var child in children)
        {
            foreach (var bookmark in context.RunProgramStatement(child))
                yield return bookmark;
        }                
    }
}

Or, with some LINQ goodness:

return children.SelectMany(child => context.RunProgramStatement(child));
The IEnumerable/yield pair has a very simple, clever implementation: it is basically a compiler-generated state machine. The compiler transforms your class to include scaffolding that remembers which state the code is in, and jumps execution to the right place when the method is invoked again.

But remember, our purpose is exactly to save and persist this information: will this state machine be serialized as well?
According to this StackOverflow question, the answer seems to be positive, with the appropriate precautions; also, it seems that the idea is nothing new (actually, it is 5 years old!)
http://dotnet.agilekiwi.com/blog/2007/05/implementing-workflow-with-persistent.html

...and it also seems that the general consensus is that you cannot do it reliably.

In fact, I would say you can: even if you do not consider that 5 years have passed and the implementation of yield is still the same, the CLR will not break existing code. At the IL level, the CLR just sees some classes, one (or more) variables that hold the state, and a method with conditions based on those variables. It is no worse (or better) than serializing delegates, something we are already doing.

The iterator-based workflow code sits in a branch of my local git, and I will give it a more serious try in the future; for now, let's play on the safe side and stick to the Task-based execution.

Until now, we mimicked quite closely the design choices of WF; I personally like the approach based on continuations, as I really like the code-as-data philosophy. However, the serialization of a delegate can hardly be seen as "code-as-data": the code is there, in the assembly; it is only state information (the closure) that is serialized, and the problem is that you have little or no control over these compiler-generated classes.

Therefore, even without IEnumerable, serialization of our MiniWorkflow looks problematic: at the moment, it is tied to BinarySerializer. I want to break free from this dependency, so I will move towards a design without delegates.

Next time: bookmarks without delegates

# Friday, 07 June 2013

A "Mini" workflow engine. Part 3: execution of sequences and internal bookmarks

A couple of posts ago, I introduced a basic ("Mini") workflow engine.
The set of classes used an execution model based on bookmarks and continuations, and I mentioned that it may have problems.

Let's recap some basic points:
  • Activities are like "statements", basic building blocks.
  • Each activity has an "Execute" method, which can either complete immediately (returning a "Closed" status),
// CloseActivity can be made optional, we will see how and why in a later post
context.CloseActivity();
return ActivityExecutionStatus.Closed;

or create a bookmark. A bookmark has a name and a continuation, and it tells the runtime "I am not done, I need to wait for something. When that something is ready, call me back here (the continuation)". In this case, the Activity returns an "Executing" status.
  • There are two types of bookmarks: external and internal.
    External bookmarks are the "normal" ones:
context.CreateBookmark(this.Name, this.ContinueAt);
return ActivityExecutionStatus.Executing;
They are created when the activity needs to wait for an external stimulus (like user input from the command line, the completion of a service request, ...).
Internal bookmarks are used to chain activities: when the "something I need to wait for" is another activity, the activity asks the runtime to execute it and to call back when done:
context.RunProgramStatement(statements[0], ContinueAt);
return ActivityExecutionStatus.Executing;

In our first implementation, external bookmarks are really created, held in memory and serialized if needed:

public void CreateBookmark(string name, Action<WorkflowInstanceContext, object> continuation)
{
    bookmarks.Add(name, new Bookmark
        {
            ContinueAt = continuation,
            Name = name,
            ActivityExecutionContext = this
        });
}
From there, the runtime is able to resume a bookmark when the event that "unblocks" the continuation finally happens
(notice that WF uses queues instead of single-slot bookmarks, which is quite handy in many cases - but for now, this will do just right):

internal void ResumeBookmark(string bookmarkName, object payload)
{            
    var bookmark = bookmarks[bookmarkName];
    bookmarks.Remove(bookmarkName);
    bookmark.ContinueAt(this, payload);
}
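
As a concrete example of the external-bookmark flow, a ReadLine activity could be written like this (a sketch based on the API above; the real MiniWorkflow code may differ):

public class ReadLine : Activity
{
    public string Text;   // output value, filled when the bookmark is resumed

    protected override ActivityExecutionStatus Execute(WorkflowInstanceContext context)
    {
        // Wait for an external stimulus: create a bookmark and yield control
        context.CreateBookmark(this.Name, this.ContinueAt);
        return ActivityExecutionStatus.Executing;
    }

    private void ContinueAt(WorkflowInstanceContext context, object value)
    {
        // The host resumed us, e.g. with context.ResumeBookmark(Name, Console.ReadLine())
        this.Text = (string)value;
        context.CloseActivity();
    }
}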
Internal bookmarks instead are created only if needed:

internal void RunProgramStatement(Activity activity, Action<WorkflowInstanceContext, object> continueAt)
{
    var result = activity.Execute(this);
    // The activity already completed?
    if (result == ActivityExecutionStatus.Closed)
        continueAt(this, null);
    else
    {
        // Save for later...
        InternalBookmark = new Bookmark
        {
            ContinueAt = continueAt,
            Name = "",
            ActivityExecutionContext = this
        };
    }
}
But who resumes these bookmarks?
CreateBookmark and ResumeBookmark are nicely paired, but RunProgramStatement seems to have no counterpart.

When do we need to resume an internal bookmark? When the activity passed to RunProgramStatement completes. This seems like a perfect job for CloseActivity!
public void CloseActivity()
{
    logger.Debug("Context::CloseActivity");
    // Someone just completed an activity.
    // Do we need to resume something?
    if (InternalBookmark != null)
    {
        logger.Debug("Context: resuming internal bookmark");
        var continuation = InternalBookmark.ContinueAt;
        var context = InternalBookmark.ActivityExecutionContext;
        var value = InternalBookmark.Payload;
        InternalBookmark = null;
        continuation(context, value);
    }
}

However, this approach has two problems.
The first one is not really a problem, but we need to be aware of it: if the activities chained by internal bookmarks terminate immediately, the execution of our "sequence" of activities results in recursion:

public class Print : Activity
{
    protected override ActivityExecutionStatus Execute(WorkflowInstanceContext context)
    {
        Console.WriteLine("Hello");
        context.CloseActivity();
        return ActivityExecutionStatus.Closed;
    }
}

var sequence = new Sequence();
var p1 = new Print();
var p2 = new Print();
var p3 = new Print();
sequence.Statements.Add(p1);
sequence.Statements.Add(p2);
sequence.Statements.Add(p3);
Its execution will go like this:

sequence.Execute
-context.RunProgramStatement
--p1.Execute
--sequence.ContinueAt
---context.RunProgramStatement
----p2.Execute
----sequence.ContinueAt
-----context.RunProgramStatement
------p3.Execute
------sequence.ContinueAt
-------context.CloseActivity (do nothing)
The second problem is more serious; let's consider this scenario:

var s1 = new Sequence();
var p1 = new Print();
var s2 = new Sequence();
var r2 = new Read();
var p3 = new Print();
s1.Statements.Add(p1);
s1.Statements.Add(s2);
s2.Statements.Add(r2);
s2.Statements.Add(p3);

And a possible execution which stops at r2, waiting for input:

s1.Execute
-context.RunProgramStatement
--p1.Execute
--s1.ContinueAt
---context.RunProgramStatement(s2, s1.ContinueAt)
----s2.Execute
-----context.RunProgramStatement(r2, s2.ContinueAt)
------r2.Execute
------(External Bookmark @r2.ContinueAt)
------(return Executing)
-----(Internal Bookmark @s2.ContinueAt)
-----(return)
----(return Executing)
---(Internal Bookmark @s1.ContinueAt) <-- Oops!

What happens when the external bookmark r2.ContinueAt is executed?
r2 will complete and call context.CloseActivity. This will, in turn, resume the internal bookmark... which is not set to s2.ContinueAt, but to s1.ContinueAt.

We can avoid this situation by observing how RunProgramStatement/CloseActivity really look like the call/return instructions every programmer is familiar with (or, at least, every C/asm programmer is familiar with). Almost every general-purpose language uses a stack to handle function calls (even if this is an implementation detail); in particular, the stack is used to record where the execution will go next, by remembering the "return address": the point in the code where execution resumes when the function returns. Sound familiar?

| call f | RunProgramStatement(a) |
| --- | --- |
| pushes the current IP on the stack | calls a.Execute |
| jumps to the entry of the function | saves continueAt in an internal bookmark |
| ... at the end of f ... | ... at completion of a ... |
| return | CloseActivity |
| pops the return address from the stack | takes continueAt from the internal bookmark |
| jumps to the return address | executes continueAt |

Notice that we swapped the jump to execution and the saving of the return point; I did it to perform an interesting "optimization" (omitting the bookmark if a.Execute terminates immediately; think of it as the equivalent of inlining the function call - and not really an optimization, as it comes with its own problems, as we have just seen).

Is it a safe choice? Provided that we are swapping the stack used by function calls with a queue, yes.
In standard, stack-based function calls, what you have to do is pinpoint where to resume execution when the function being called returns. The function being called (the callee) can in turn call other functions, so the points to return to (return addresses) need to be saved in a LIFO (Last In, First Out) fashion. A stack is perfect.

Here, we provide the point where to resume execution directly, and this point has no relation to the return point of Execute (i.e. where we are now).
If the call to Execute does not close the activity, it means that execution is suspended. A suspended execution can, in turn, lead to the suspension of other parent activities, like in our nested sequences example. Now, when r2 is resumed, it terminates and calls its "return" (CloseActivity). The points to return to, in this case, need to be processed in a FIFO (First In, First Out) fashion.
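
In code, this means replacing the single InternalBookmark slot with a queue, and letting CloseActivity dequeue in FIFO order; a minimal sketch of the change (same names as above, details omitted):

// Instead of a single InternalBookmark slot, keep a FIFO queue
private readonly Queue<Bookmark> internalBookmarks = new Queue<Bookmark>();

internal void RunProgramStatement(Activity activity, Action<WorkflowInstanceContext, object> continueAt)
{
    var result = activity.Execute(this);
    // The activity already completed?
    if (result == ActivityExecutionStatus.Closed)
        continueAt(this, null);
    else
        // Save for later... at the back of the queue
        internalBookmarks.Enqueue(new Bookmark
        {
            ContinueAt = continueAt,
            Name = "",
            ActivityExecutionContext = this
        });
}

public void CloseActivity()
{
    // Someone just completed an activity: resume the oldest
    // pending internal bookmark, if there is one (FIFO)
    if (internalBookmarks.Count > 0)
    {
        var bookmark = internalBookmarks.Dequeue();
        bookmark.ContinueAt(bookmark.ActivityExecutionContext, bookmark.Payload);
    }
}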

Next time: a sequence is not the only kind of possible composite activity. Does a queue satisfy additional scenarios?

# Sunday, 02 June 2013

A "Mini" workflow engine. Part 2: on the serialization of Bookmarks

Last time, we saw that Activities are composed in a way that allows them to be paused and resumed, using continuations in the form of bookmarks:
[Serializable]
internal class Bookmark
{
    public string Name { get; set; }
    public Action<ActivityExecutionContext, object> ContinueAt { get; set; }  
    public ActivityExecutionContext ActivityExecutionContext { get; set; }
}
A Bookmark has a relatively simple structure: a Name, a continuation (in the form of an Action) and a Context. Bookmarks are created by the Context, and the continuation is always a ContinueAt method implemented in an activity. Thus, serializing a Bookmark (and, transitively, its Context) should lead to the serialization of an object graph that holds the entire workflow code, its state and its "Program Counter".

However, serializing a delegate sounds kind of scary to me: are we sure that we can safely serialize something that is, essentially, a function pointer?
That's why I started looking for an answer to a particular question: what happens when I serialize a delegate, a Func<T> or an Action<T>?

I found some good questions and answers on StackOverflow; in particular, a question about serialization of lambdas and closures (related to prevalence, a technique that provides transactional persistence by implementing checkpoints and redo journals in terms of serialization) and a couple of questions about delegate (anonymous and not) serialization (1, 2 and 3).

It looks like:
  • delegate objects (and Func<>) are not XML serializable, but they are BinarySerializable.
  • even anonymous delegates, and delegates with closures (i.e. anonymous classes) are serializable, provided that you do some work to serialize the compiler-generated classes (see for example this MSDN article, and this blog post)
  • still, there is a general agreement that the whole idea of serializing a delegate looks very risky

BinarySerializable looks like the only directly applicable approach; in our case, we are on the safe side: every function we "point to" from delegates and bookmarks is in classes that are serialized as well, and the objects and data accessed by the delegates are serialized too (they access only data defined in the activities themselves, which is saved completely in the context). Binary serialization will save everything we need to pause and resume a workflow in a different process.
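
Concretely, the round trip boils down to something like this minimal sketch (the helper names are mine; it assumes, as discussed, that Bookmark and everything it references is marked [Serializable]):

using System.IO;
using System.Runtime.Serialization.Formatters.Binary;

static void Persist(Bookmark bookmark, string path)
{
    // Serializes the bookmark, its ContinueAt delegate and, transitively,
    // the whole context graph it references
    var formatter = new BinaryFormatter();
    using (var stream = File.Create(path))
        formatter.Serialize(stream, bookmark);
}

static Bookmark Restore(string path)
{
    // Possibly in another process: load the graph back; the returned bookmark
    // can then be resumed by invoking its ContinueAt continuation
    var formatter = new BinaryFormatter();
    using (var stream = File.OpenRead(path))
        return (Bookmark)formatter.Deserialize(stream);
}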

I implemented a very simple persistence of workflow instances using BinarySerializer for MiniWorkflow, and it does indeed work as expected.
You can grab the code on github and see it yourself.

However, I would like to provide an alternative, less "risky" serialization mechanism for MiniWorkflow. I do not like the idea of not knowing how my Bookmarks are serialized; I would like something more "visible", to understand whether what I need really goes from memory to disk, and to be more in control.
Also, I am really curious to see how WF really serializes bookmarks; workflows are serialized/deserialized as XAML, but a workflow instance is something else. In fact, this is a crucial point: in the end, I will probably end up using WF4 (MiniWorkflow is - probably - just an experiment to understand the mechanics better), but I will need to implement serialization anyway. I do not want to take a direct dependency on SQL, so I will need to write a custom instance store as an alternative to the standard SQL instance store.

I will explore which alternatives we have to a BinarySerializer; but before digging into this problem, I want to work out a better way to handle the flow of execution. Therefore, next time: calling continuations - recursion, a work item queue or... call/cc?

# Friday, 24 March 2006

Rotor 2.0!

Finally! From JasonZ through Brad Adams. I was waiting for it... I'm downloading it right now, and this week (if work permits) I'm going to dig into some new cool features, like anonymous methods and delegates (C# compiler) and Lightweight Code Generation (LCG - BCL).


# Saturday, 04 February 2006

On anonymous delegates

Recently, while doing some searches on anonymous methods/delegates, I came across an interesting puzzle posted by Brad Adams.
The puzzle (and its explanation) is very useful for understanding what happens under the hood when an anonymous delegate is created. However, it took me some effort to understand what the compiler did using only the post and its comments, so I put Brad's code into Visual Studio, compiled an assembly, and inspected it with Reflector. The first case, the one with the delegate binding the loop variable i (num1, in the reverse-engineered code), is translated in the following way:



private static void Main(string[] args)
{
      for (int num1 = 0; num1 < 10; num1++)
      {
            Program.<>c__DisplayClass1 class1 = new Program.<>c__DisplayClass1();
            class1.j = num1;
            ThreadPool.QueueUserWorkItem(new WaitCallback(class1.<Main>b__0), null);
      }
      Console.ReadLine();
}

the anonymous delegate shares i between caller and callee. It is a closure, in a wide sense, in that it captures the state when it is created. However, since the C# compiler captures i "by reference", and in this case the closure is created before the for loop, the behaviour is different from what you would expect from real closures (in fact, anonymous delegates are not closures).

The second example, however (the delegate bound to a variable local to the for loop), is different: the lifetime of j (the for block) forces the compiler to create and initialize a state object for the delegate at each iteration inside the loop:


private static void Main(string[] args)
{
WaitCallback callback1 = null;
Program.<>c__DisplayClass2 class1 = new Program.<>c__DisplayClass2();
class1.i = 0;
while (class1.i < 10)
{
if (callback1 == null)
{
callback1 = new WaitCallback(class1.<Main>b__0);
}
ThreadPool.QueueUserWorkItem(callback1, null);
Thread.Sleep(100);
class1.i++;
}
Console.ReadLine();
}


public void <Main>b__0(object)
{
Console.WriteLine(this.i);
}

internal class Program
{
// Methods
public Program();
private static void Main(string[] args);

// Nested Types
[CompilerGenerated]
private sealed class <>c__DisplayClass2
{
// Methods
public <>c__DisplayClass2();
public void <Main>b__0(object)
{
Console.WriteLine(this.i);
}

// Fields
public int i;
}
}

In a comment on Brad's post, a member of the C# team pointed out that

"Anonymous methods do not capture read-only locals at a point in time, instead they actually capture the local by putting it on the heap, to share it between the outer method and all anonymous methods."

Surely this statement makes clear what is happening; however, this example proves a point I have always believed in: the only way to be sure of what a compiler does is to look at the code it generates.
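
To see the practical consequence of the two capture strategies, here is a small self-contained example of mine (not from Brad's post): the first loop captures the loop variable itself, the second copies it into a variable local to the loop body before capturing.

using System;
using System.Collections.Generic;

class CaptureDemo
{
    static void Main()
    {
        var actions = new List<Action>();

        // Case 1: capture the loop variable itself.
        // All delegates share the same hoisted 'i', so they all print 3.
        for (int i = 0; i < 3; i++)
            actions.Add(delegate { Console.Write(i + " "); });
        foreach (var a in actions) a();   // 3 3 3
        Console.WriteLine();

        actions.Clear();

        // Case 2: copy into a variable scoped to the loop body.
        // Each iteration gets its own display-class instance, so we get 0 1 2.
        for (int i = 0; i < 3; i++)
        {
            int j = i;
            actions.Add(delegate { Console.Write(j + " "); });
        }
        foreach (var a in actions) a();   // 0 1 2
    }
}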

# Friday, 03 February 2006

More on yield

Yesterday I discovered and introduced the yield keyword. Today, I wanted to see some examples and to discover how it is implemented. Let's start with a classic example, the Fibonacci function, both in Python and in C#.
def fib():
   a, b = 0, 1
   while 1:
       yield b
       a, b = b, a + b
 
static IEnumerable<int> Fib(int max)
{
    int a = 0;
    int b = 1;
    yield return 1;

    for (int i = 0; i < max - 1; i++)
    {
        int c = a + b;
        yield return c;

        a = b;
        b = c;
    }
}

The two methods are essentially equivalent. How are they implemented? The first thought that came to my mind when I saw the Fibonacci example was: the compiler translates it into static variables. But clearly this is not the case: if two clients call the same function, they would share its internal state, which is not desirable.
The second guess was: well, as with anonymous delegates, the current state is captured by a compiler-generated class. The compiler generates it by inserting one field for every local variable in the function, and then instantiates an object of that class for every call site. This explains the Fibonacci function, but what about this one?

static IEnumerable<string> gen2()
{
    yield return "A";

    string s = "A" + "B";
    yield return s;

    s = 1.ToString();
    yield return s;

    for (int i = 0; i < 5; ++i)
        yield return i.ToString();
}
Clearly, execution of the function body must stop after a yield statement, like after a return statement, but it must also be resumed at the same place, not at the beginning of the function. So we also have to save the point at which execution stopped, and resume from that very point (with a jump right at the beginning of the function). Using the excellent Reflector tool by Lutz Roeder, you can see in pseudo-code (reverse-engineered C#) how this class is generated and how the function is wrapped inside MoveNext, with a big switch right at the beginning that allows resuming execution at different points based on a state variable:
[CompilerGenerated]
private sealed class <gen>d__7 : IEnumerable<string>, IEnumerable,
   IEnumerator<string>, IEnumerator, IDisposable
{
      // Methods
      [DebuggerHidden]
      public <gen>d__7(int <>1__state);
      private bool MoveNext()
      {
         switch (this.<>1__state)
         {
            case 0:
            {
                  this.<>1__state = -1;
                  this.<>2__current = "A";
                  this.<>1__state = 1;
                  return true;
            }
            case 1:
            {
                  this.<>1__state = -1;
                  this.<s>5__8 = "AB";
                  this.<>2__current = this.<s>5__8;
                  this.<>1__state = 2;
                  return true;
            }
            case 2:
            {
                  this.<>1__state = -1;
                  int num2 = 1;
                  this.<s>5__8 = num2.ToString();
                  this.<>2__current = this.<s>5__8;
                  this.<>1__state = 3;
                  return true;
            }
            case 3:
            {
                  this.<>1__state = -1;
                  this.<i>5__9 = 0;
                  while (this.<i>5__9 < 5)
                  {
                        this.<>2__current = this.<i>5__9.ToString();
                        this.<>1__state = 4;
                        return true;
                  Label_00DE:
                        this.<>1__state = -1;
                        this.<i>5__9++;
                  }
                  break;
            }
            case 4:
            {
                  goto Label_00DE;
            }
         }
         return false;
      }
      [DebuggerHidden]
      IEnumerator<string> IEnumerable<string>.GetEnumerator();
      [DebuggerHidden]
      IEnumerator IEnumerable.GetEnumerator();
      [DebuggerHidden]
      void IEnumerator.Reset();
      void IDisposable.Dispose();

      // Properties
      string IEnumerator<string>.Current { [DebuggerHidden] get; }
      object IEnumerator.Current { [DebuggerHidden] get; }

      // Fields
      private int <>1__state;
      private string <>2__current;
      public int <i>5__9;
      public string <s>5__8;
}
All this reasoning happened yesterday with a guy who works on dynamic languages (I don't want to say who right now... I only want to say "cross your fingers for me!" I may have great news in the future). This guy correctly pointed out an issue with this approach: what if the yield statement is inside a try block? In CIL, you can't jump into the middle of a try block: you can only jump to its first instruction.

I was puzzled at first: what can we do? The simplest solution would be to not permit mixing try/catch and yield. But surely this is limiting (Python allows yield to be used almost everywhere, the only exception being that yield is not allowed in a try-finally).

The guy then gave me a hint: obviously, once you are inside the block, you can jump wherever you like inside that very block. So the execution can be resumed using a sort of two-stage dispatch. It was all clear at once: you make a first jump into the right try block, then a second one to the correct location inside that block. In a function like this one
def f():
   try:
       yield 1
       try:
           yield 2
           1/0
           yield 3  # never get here
       except ZeroDivisionError:
           yield 4
           yield 5
           raise
       except:
           yield 6
       yield 7     # the "raise" above stops this
   except:
       yield 8
   yield 9
   try:
       x = 12
   finally:
       yield 10
   yield 11

The compiler will translate this code into something like:

 L_0001: ldfld int32 YieldTest.Program/<gen>d__7::<>1__state
 L_0006: stloc.1 
 L_0007: ldloc.1 
 L_0008: switch (L_0031, L_0023, L_0025, L_0027, L_002c)
 L_0021: br.s L_0033
 L_0023: br.s L_0059
 L_0025: br.s L_0085
 L_0027: br L_00b2
...
 L_0033: switch (L_0034, L_0037)
 L_0034: ldc.i4 2 
 L_0035: stfld int32 YieldTest.Program/<gen>d__7::<>2__current
 L_0036: br L_0105


The first switch (at position L_0008) selects the try block, the second one (at position L_0033) selects the yield statement inside that block. I tried to translate this Python function into C#, to see what the IL code generated by the C# compiler looks like, and to my great surprise the following code

static IEnumerable<int> f()
{
    try
    {
        yield return 1;
        try
        {
            yield return 2;
            int k = 1 / 0;
            yield return 3; // never get here
        }
        catch (System.DivideByZeroException)
        {
            yield return 4;
            yield return 5;
            throw;
        }
        catch
        {
            yield return 6;
            yield return 7; // the "raise" above stops this
        }
    }
    catch
    {
        yield return 8;
        yield return 9;
    }
    try
    {
        int x = 12;
    }
    finally
    {
        yield return 10;
        yield return 11;
    }
}

did not compile at all! The compiler spat out the following errors:

error CS1626: Cannot yield a value in the body of a try block with a catch clause
error CS1631: Cannot yield a value in the body of a catch clause
error CS1625: Cannot yield in the body of a finally clause

Apparently, the C# team sidestepped the problem by not allowing the same behaviour as Python.

# Thursday, 02 February 2006

The yield statement

C# 2.0 introduced a new keyword, yield. I didn't pay much attention to this new keyword, assuming that anonymous delegates and generics were more interesting and that yield was only a way to wrap the "iterator pattern". I was wrong (but I have the excuse that the name given to this feature in C# is 'enumerators', and I now think that this name is a bit reductive and misleading). Yield exposes a feature known to the Python community as generators.

They are a bit like continuations, because they return (or better, yield) a value, and when they are called again they resume execution right after the last yield.

In C#, yield must be used inside an iterator block. An iterator block is more or less a function whose return type must be IEnumerable, IEnumerator, IEnumerable<T>, or IEnumerator<T> (see MSDN).

using System;
using System.Collections.Generic;
using System.Text;

namespace YieldTest
{
    class Program
    {
        static void Main(string[] args)
        {
            foreach (string s in gen())
                Console.WriteLine(s);

            Console.ReadLine();
        }

        static IEnumerable<string> gen()
        {
            Console.WriteLine("In the generator function");

            Console.WriteLine("Give A");
            yield return "A";

            Console.WriteLine("Give B");
            yield return "B";
        }
    }
}

It is easy to see why C# enumerators are very useful when implementing iteration over a sequence: as Raymond Chen points out in one of his blog posts there are two "models" for the enumerator-consumer pattern:


"The COM model for enumeration (enumeration objects) is biased towards making life easy for the consumer and hard for the producer. The enumeration object (producer) needs to be structured as a state machine, which can be quite onerous for complicated enumerators, for example, tree walking or composite enumeration."

"On the other hand, the callback model for producer (used by most Win32 functions) is biased towards making life easy for the enumerator and hard for the consumer. This time, it is the consumer that needs to be structured as a state machine, which is more work if the consumer is doing something complicated with each callback. (And even if not, you have to create a context structure to pass state from the caller, through the enumerator, to the callback.) "

In the first model, the caller calls Next() repeatedly and the enumerator has to find the next item and return it. Since the enumerator returns after each item, it has to record state information, using a stack data structure that mimics the call stack.

In the callback model, on the other hand, the producer performs the operations it needs on the data structure (walks a tree, for example) and calls the consumer back through the callback as it finds items. This makes the producer implementation straightforward (in the case of the tree, a simple recursive function will do), but life is made harder for the consumer: it needs to maintain state across each callback.

It would be great to have the simpler approach on both sides, with the caller seeing a simple enumerator that returns items in order, and the producer seeing a callback that it can throw items into. Raymond's solution is a great piece of software based on fibers (user-scheduled threads), but as he points out, fibers are hard to use, and it is very easy to make subtle errors that are difficult to debug.

The C# solution lies in the yield statement and in generators.
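
To see why, consider the tree-walking producer: with yield it stays a simple recursive function, while the consumer still gets a plain enumerator to loop over. A small self-contained sketch (mine, not Raymond's code):

using System;
using System.Collections.Generic;

class Node
{
    public string Value;
    public Node Left, Right;
}

class TreeWalk
{
    // Producer: written as a natural recursive traversal; the compiler turns
    // it into the state machine that the consumer drives with MoveNext().
    static IEnumerable<string> InOrder(Node node)
    {
        if (node == null) yield break;
        foreach (var v in InOrder(node.Left)) yield return v;
        yield return node.Value;
        foreach (var v in InOrder(node.Right)) yield return v;
    }

    static void Main()
    {
        var root = new Node
        {
            Value = "B",
            Left = new Node { Value = "A" },
            Right = new Node { Value = "C" }
        };

        // Consumer: a plain foreach, no hand-written state machine
        foreach (var v in InOrder(root))
            Console.WriteLine(v);   // A, B, C
    }
}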

# Wednesday, 04 January 2006

What am I reading right now?

  • What Every Computer Scientist Should Know About Floating-Point Arithmetic by David Goldberg. Eh, I have known since my first Fortran program that floats (the standard 32-bit REALs in that language) will give you rounding errors. And they accumulate pretty well! Very interesting reading, though.

  • Practical Common Lisp: my struggle to become good at this (for me, and for now) weird language. Very good up to this point.

  • Language Support for Lightweight Transactions, by Tim Harris and Keir Fraser. I am always interested in concurrency and distributed computing, so this is a must after MSDN Magazine January's End Bracket, by Joe Duffy.

  • Speaking of parallel computing, I learned a lot at work about current standards for grid computing. Basically, we are building a computing cluster, and two different projects come into play: openMosix and MPI (the second one is the protocol chosen for Windows Server 2003 CCE, too). The two use very different approaches, each with its own drawbacks and strengths. I want to study some more, especially on the openMosix front, and then write here about what I learned and my own ideas.

  • And, last but not least, Hackers and Painters, by Paul Graham. Very interesting and stimulating reading, made even more stimulating by the fact that I agree with many of his opinions, but totally disagree with many others. I find it difficult to understand how such an open-minded person could fall into the same mistakes he points out in other people. But maybe Paul wrote some of his pages only to please an audience... he was going to sell his book, after all. I want to discuss this topic more deeply in the future; it surely deserves a post.

# Tuesday, 03 January 2006

Happy new year!

Finally 2006 arrived!
Whew, it was a rather tiring December. I had a lot of "collateral" work: new servers, a web infrastructure to build, porting a web application to mysql (and Castor still refuses to collaborate!). I spent a very good Christmas with my family, did some good reading (my heap of to-read papers/articles/books diminished a little) and went to Salzburg with my girlfriend.
On the informatics side, I decided on my 2006 LOTY: Lisp! I will dedicate a post to my decision (Why Lisp?), but for the moment let me say that Practical Common Lisp is a very good book! I look forward to learning to use macros: in the second chapter Peter Seibel gives you a whole bunch of database bindings (with very easy-to-use select, update and delete functions) in only 50 lines of code! Amazing... It works only on property lists, but it resembles LINQ a lot.
I am really thrilled to see whether Lisp turns out to be the language for building languages that it claims to be. I have always thought that the most annoying thing for a programming language is to NOT expose the language, i.e. to not make the constructs of the language itself available to the programmer.
Take Java: I find it highly annoying that the language has cast operators for classes and primitive types, and operator overloading for classes in the Java class library (like + on String), and does not let the programmer use them! Until now, C++ was my language of choice for this very reason: it lets you "adapt" the language, building a meta-language that fits your needs and your application domain.
Maybe Lisp macros will get a hold on me... =)

# Thursday, 22 December 2005

LOTY:2005

My 2005 experience with languages was less interesting. I finished university and started to work at a research center on bioinformatics, trying to decide what I should do next (work? PhD? try to make a startup?).

I really liked bioinformatics as it was introduced to us back at the university: all about concurrency, synchronization and programming languages (as I learned later, this branch is called Systems Biology).
Well, in the real world (TM) the language of choice of bioinformaticians is... perl.
And they (too often) use it NOT as it was meant to be used: as the 'duct tape' of programming languages, to glue together different pieces written by different people in different environments. For this task, perl works like a charm. And it is very easy too... but Larry Wall's claim (Perl makes easy things easy, and difficult things not impossible) stops at the first part. The second part is not wrong, but having seen some perl hacks, it would be better phrased as "and difficult things very messy".

And biologists use it because it is messy. Declare variables? Why? Not use goto? It is one of the instructions, why can't I use it?

Altogether, I am glad I learned perl. It has some very interesting stuff (hash tables, file manipulation, a neat database interface that many should learn from) and, above all, it is the home of Perl regular expressions. It is unmatched in the field of text-file manipulation.
But surely, it is better to remember that you don't build a house out of duct tape...

# Wednesday, 21 December 2005

LOTY: 2004

2004 was my C# year. I think I did a nice job... Coming from C++ it was not so difficult; at first it was like using a "scripting C++", i.e. C++ as a two-level language. (Little digression: one of the things I like most about C++ is its ability to be used as a language for languages. Nobody should use C++ low-level features to build a commercial application directly; instead, she should create a very well designed and optimized library, with all the quick-and-dirty tricks of the language, and then use this newly created environment as her production language. That's the power of C++!)

But C# is much more than a simplified version of C++. It has functional seeds in it that are growing more and more through the various versions: first delegates, now full closures (anonymous delegates), iterators and yield, and in the future anonymous classes, extension methods, expression trees and lambda functions! I am very thrilled!

You may wonder why it took me one year (and why I think I did a nice job) to learn C#. Well, I was productive in an evening (it is a very simple language), but I wanted to go "under the hood". I think the best starting point is still Don Box's "Essential .NET". It is a .NET book, but it describes very well the interactions between the language and the compiler. Then I downloaded Rotor and took a peek inside (as you can see here... Boy, that was interesting!). And then I went for C# variations: Spec#, Comega... and my own contract language! =)

2004 was a nice year. I like C#, I found it very productive (truth be told, mainly thanks to .NET), and the functional flavour it is taking on is good!

# Monday, 19 December 2005

Language of the Year

"Learn at least one new [programming] language every year. Different languages solve the same problems in different ways. By learning several different approaches, you can help broaden your thinking and avoid getting stuck in a rut."
   --- The Pragmatic Programmer



I embraced this philosophy some years ago, and I have found it very fruitful. Learning a new programming language is surely simpler than learning a new natural language: you can write simple programs in a night, and be productive in a week (surely, it takes a lot more to "master" it, hence the "one language per year"). It has, however, the same advantages. Knowing many languages makes you able to speak directly with different people, easing your job; the same is true for programming languages: you can "speak" directly to a lot of different code, which will definitely ease your work!

Another advantage is that you can use the right tool for the right job. This was not entirely true in the past: if your program was in C and you had to write a little reasoner, it was hard to write it in another language (say Lisp) and then integrate it into the main program: often the choice did not pay off (because of performance, reliability, the difficulty of interfacing the two worlds using strange mechanisms, sometimes even the necessity of writing such interfaces yourself).

If .NET succeeded in making a really good impression on me, its ability to seamlessly integrate different programming languages without imposing a "standard language" on you was surely an important reason. I found myself re-opening my CS books on functional programming and using ML and Caml again, to my greatest pleasure.

I think my two points can be summed up with this code snippet I found in an interesting blog post:

sort           [] = []
sort (pivot:rest) = sort [y | y <- rest, y < pivot]
                    ++ [pivot] ++
                    sort [y | y <- rest, y >=pivot]


It's Haskell. If you don't know Haskell, or a similar programming language, you may have the wrong reaction: throw this code in the trash, and write a 20-line version of the algorithm in your "standard language". Or, if you have learned a different programming language at some point in your life, you can appreciate its beauty and simplicity.

Of course, Haskell is terrible for other things: but you can compile the code into an assembly, and reuse it from another language. The right tool for the right job.

Last but not least: as the opening quotation says, learning a new language is food for your mind.