Java expert, working on Parasuite product.
Error handling in JavaFBP
Up to and including revision 63 in SVN, error handling was very difficult in JavaFBP. In Java, errors are typically handled with exceptions, but the execute() method of the Component class did not allow the component developer to throw an exception and let the JavaFBP framework react accordingly. Instead, the developer was responsible for cleanly shutting down the component and making sure it did not get activated again (by closing and draining all incoming connections). To let the caller (or user) know what had happened, I came up with the following idea: I introduced a special component which was connected to all components of a network and collected all error messages (exceptions) that occurred. After the run, this component could be asked whether there had been an error during execution of the network. In short, error handling was done outside the framework because the framework did not support it properly.
- Sven has since decided to go a different route, described below. Here is what he has written about his thinking:
- ...I used the JavaFBP facilities to do error handling, i.e. added an error port to many components and created a component collecting exceptions. This all works fine until something happens to the network. You may argue that most or even all errors regarding the network are design errors (network or component design). But in our case the user is the designer, and therefore it is not acceptable that the framework does not tell us about the errors (System.out.println() does not count). Therefore I integrated the error handling into the framework, and now the Network.go() method throws an exception if something goes wrong.
- In an earlier correspondence (see below) we saw that there are at least two error handling strategies... So far, I've only implemented #1 as I am not really sure how to do #2 properly at the moment. But it may be a good idea to be able to choose between several strategies.
- To sum it all up, I think error handling belongs in the framework and should not be done around it as I did earlier. What I've done so far is provide a foundation for error handling in general and implement one of two or more possible strategies.
In my opinion, it would be nice if a developer could simply let the component throw an exception in case of an error and let the framework handle the rest. In conversation with Paul it turned out that it may be sensible to provide more than one error handling strategy:
- "All-Or-Nothing" strategy: In case of an error shut down the whole network immediately.
- "Graceful Degradation" strategy: In case of an error only shut down the parts of the network which depend on the erroneous component and let the rest keep on going.
The All-Or-Nothing strategy would normally be used for testing purposes or with small networks whose results are only of interest when they are complete. For production, especially when big networks are involved, one would use the Graceful Degradation strategy.
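One possible reading of "the parts of the network which depend on the erroneous component" is: the failed component plus everything reachable downstream of it. The sketch below illustrates only that reading. It is not how JavaFBP implements anything; the class DegradationSketch, the method affectedBy() and the map-based network description are all hypothetical.

```java
import java.util.ArrayDeque;
import java.util.Collections;
import java.util.Deque;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Hypothetical helper, not part of JavaFBP.
class DegradationSketch {

    // "downstream" maps a component name to the names of the components
    // that receive packets from it (assumed network description).
    static Set<String> affectedBy(String failed, Map<String, List<String>> downstream) {
        Set<String> affected = new LinkedHashSet<>();
        Deque<String> todo = new ArrayDeque<>();
        todo.add(failed);
        while (!todo.isEmpty()) {
            String current = todo.poll();
            // add() returns false for components already visited,
            // so cycles in the network do not cause endless looping.
            if (affected.add(current)) {
                todo.addAll(downstream.getOrDefault(current, Collections.emptyList()));
            }
        }
        return affected;
    }
}
```

Under All-Or-Nothing the set to shut down is simply every component; under this reading of Graceful Degradation it is only the returned set, and all other branches keep running.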
Implementation considerations and preparations
- Allow the Component.execute() method to throw exceptions. Exceptions thrown in this method are treated as component errors by the framework.
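From the component developer's point of view, this could look roughly like the sketch below. SketchComponent is a hypothetical, heavily reduced stand-in for the JavaFBP Component class (no ports, no packets), and FailingParser is an invented example component. The sketch only shows that the developer throws and the surrounding run() method records the error.

```java
import java.io.IOException;

// Hypothetical stand-in for the JavaFBP Component class.
abstract class SketchComponent implements Runnable {
    enum Status { NOT_STARTED, ACTIVE, TERMINATED, ERROR }

    volatile Status status = Status.NOT_STARTED;
    volatile Exception error;

    // The component developer may simply throw from here.
    protected abstract void execute() throws Exception;

    @Override
    public void run() {
        status = Status.ACTIVE;
        try {
            execute();
            status = Status.TERMINATED;
        } catch (Exception e) {
            // Anything escaping execute() counts as a component error.
            error = e;
            status = Status.ERROR;
        }
    }
}

// Invented example component: no manual cleanup code, it just throws.
class FailingParser extends SketchComponent {
    @Override
    protected void execute() throws Exception {
        throw new IOException("cannot parse input");
    }
}
```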
FBP networks can be considered as distributed systems. A component does not have knowledge about other components and networks only know about their direct descendants. Therefore we have to implement a distributed shutdown.
A component should behave like this:
If it encounters an error,
- it enters the error state and
- tells its mother that an error has occurred.
If it is asked to terminate, it interrupts its execution.
A subnet should behave like this:
If it encounters an error signal from one of its descendants,
- it enters the error state,
- tells its mother that an error has occurred and
- asks its descendants to terminate.
If it is asked to terminate, it shuts down its descendants.
To implement this, I did the following:
- Surrounded the contents of the run() method with a try-catch block.
- Set the component status to ERROR.
- Called mother.signalError(this, e) with e = the occurred exception
- Introduced a terminate() method receiving the new status as parameter which
- sets the new status and
- calls the thread's interrupt() method
- Introduced a signalError() method which receives the exception that occurred as parameter and checks whether an error has already occurred. If not, it
- stores the exception in a field called "error" and
- asks the network's descendants to terminate (comp.terminate(StatusValues.ERROR)).
- Skipped the deadlock detection in case of an error (waitForAll() method).
- Allowed go() to throw Exceptions.
- Let go() pass on FlowErrors.
- Let go() throw the exception that occurred, if any.
- Overrode signalError() with the following: If the subnet is not already in the error state,
- call the mother's signalError() method and
- call the subnet's terminate() method.
- Overrode terminate() with the following:
- Ask the components to terminate,
- set the new status and
- terminate the subnet.
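The steps above can be sketched as follows for the All-Or-Nothing case. This is not the JavaFBP source. Comp and Net are hypothetical, reduced stand-ins for Component and Network, StatusValues is modelled as a top-level enum for brevity, and the SubNet overrides are left out. One detail is my own assumption: signalError() skips the component that reported the error, since it is already in the error state.

```java
import java.util.ArrayList;
import java.util.List;

enum StatusValues { NOT_STARTED, ACTIVE, TERMINATED, ERROR }

// Hypothetical stand-in for Component.
abstract class Comp implements Runnable {
    private StatusValues status = StatusValues.NOT_STARTED; // guarded by "this"
    Net mother;
    Thread thread;

    protected abstract void execute() throws Exception;

    @Override
    public void run() {
        synchronized (this) {
            if (status != StatusValues.NOT_STARTED) return; // terminated before it started
            status = StatusValues.ACTIVE;
        }
        try {
            execute();
            synchronized (this) {
                if (status == StatusValues.ACTIVE) status = StatusValues.TERMINATED;
            }
        } catch (Exception e) {
            boolean ownError;
            synchronized (this) {
                // If terminate() already changed the status, this exception is
                // just the interrupt arriving, not an error of this component.
                ownError = (status == StatusValues.ACTIVE);
                if (ownError) status = StatusValues.ERROR;
            }
            if (ownError) mother.signalError(this, e);
        }
    }

    // Sets the new status and interrupts the component's thread.
    synchronized void terminate(StatusValues newStatus) {
        status = newStatus;
        if (thread != null) thread.interrupt();
    }

    synchronized StatusValues getStatus() { return status; }
}

// Hypothetical stand-in for Network.
class Net {
    final List<Comp> components = new ArrayList<>();
    private Exception error;

    synchronized void signalError(Comp source, Exception e) {
        if (error != null) return; // only the first error is kept
        error = e;
        for (Comp c : components) {
            if (c != source) c.terminate(StatusValues.ERROR);
        }
    }

    void go() throws Exception {
        for (Comp c : components) { c.mother = this; c.thread = new Thread(c); }
        for (Comp c : components) c.thread.start();
        for (Comp c : components) c.thread.join();
        synchronized (this) {
            if (error != null) throw error; // report the error to the caller
        }
    }
}

// Invented example components.
class Sleeper extends Comp {
    @Override protected void execute() throws Exception { Thread.sleep(60_000); }
}

class Failer extends Comp {
    @Override protected void execute() { throw new IllegalStateException("boom"); }
}
```

With a Sleeper and a Failer added to a Net, go() returns quickly: the Failer's exception reaches signalError(), the Sleeper is interrupted and ends in the ERROR state, and go() rethrows the original exception to the caller.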
- Calling the interrupt() method of a thread does not always stop the thread immediately. In particular, a component that is busy processing without running into a wait() or Lock.lockInterruptibly() call won't terminate immediately. Eventually it will of course stop, once the adjacent components have closed their ports. I don't know whether this is a problem; if it is, one solution would be to check the thread's interrupt flag on each invocation of a framework method. Alternatively, the component developer could do this check at suitable places in his code, using the interrupted() or isInterrupted() method of Thread.
- The new terminate() method does not put the component in the ERROR state. Instead it receives a parameter containing the new state. I have done this to allow networks to be shut down externally, e.g. when a user cancels the operation. For this to work, I've added a terminate() method (without parameters) to the Network class, which shuts down the network.
- The FlowError class does not extend Error anymore. Instead, it extends RuntimeException, which also does not need to be caught by application code. But now we can handle FlowErrors with the new error handling code. Extending Error should be reserved for really severe errors which leave the system in an unrecoverable state (out of memory, VM errors, etc.).
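Regarding the note on interrupt() above: a component doing long-running processing could check the interrupt flag itself, for example as in this minimal sketch. The class InterruptCheckSketch and the method sumUpTo() are invented for illustration; only Thread.interrupted() is a real API. Note that the static Thread.interrupted() clears the flag when it returns true, whereas isInterrupted() leaves it set.

```java
// Hypothetical example of a CPU-bound loop that cooperates with interrupt().
class InterruptCheckSketch {

    static long sumUpTo(long limit) throws InterruptedException {
        long sum = 0;
        for (long i = 0; i < limit; i++) {
            // Check the flag every 1024 iterations to keep the overhead small.
            if ((i & 1023) == 0 && Thread.interrupted()) {
                throw new InterruptedException("terminated after " + i + " iterations");
            }
            sum += i;
        }
        return sum;
    }
}
```

Throwing InterruptedException from such a loop fits the scheme described above, since an exception leaving execute() ends the component.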
Implementing Graceful Degradation
This needs some more thinking about.
- I added two helper methods, declareInputPort() and declareOutputPort(), to the SubNet class. We need this in our project because we turn XML network descriptions into JavaFBP network descriptions. For this to work we have to define a general subnet class without annotations and declare the ports at run-time.
- Tracing will now be disabled if the trace file cannot be created or opened.
- Added a static setTracePath() method to the Network class. This can be called to specify the path where the trace files should go.
Deadlock testing and subnets
As of revision 71 the deadlock test considered each network (main or subnet) separately. Now imagine a network consisting of a component producing packets very slowly (say one per second) connected to a subnet consisting of an input port IN and a packet-swallowing component (say Discard). After the first packet has been consumed by Discard, the status of the components in the subnet is the following:
Now the deadlock test would say: We have a deadlock, because nobody is active. But this is not true. The packet producer outside of the subnet is active, eventually producing the next packet and IN will get activated again. Thus, the deadlock test has to consider the whole network. Solution: Network.listCompStatus() now recursively queries the status of all components in all subnets.
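The difference between the per-network check and the recursive one can be illustrated with a small model. StatusNode is a hypothetical class, and the real component states are reduced to a single active flag; only the method name listCompStatus() is borrowed from the text, and its signature here is an assumption.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical model: a node is a component (no children) or a network/subnet.
class StatusNode {
    final String name;
    final boolean active; // simplification of the real component states
    final List<StatusNode> children = new ArrayList<>();

    StatusNode(String name, boolean active) {
        this.name = name;
        this.active = active;
    }

    StatusNode add(StatusNode child) {
        children.add(child);
        return this;
    }

    // Old behaviour: look only at the direct children of this one network.
    boolean anyDirectChildActive() {
        for (StatusNode c : children) {
            if (c.active) return true;
        }
        return false;
    }

    // New behaviour: descend into subnets and consider every component.
    boolean anyComponentActive() {
        for (StatusNode c : children) {
            if (c.children.isEmpty() ? c.active : c.anyComponentActive()) return true;
        }
        return false;
    }

    // Recursively lists the status of all components in all subnets.
    void listCompStatus(String prefix, List<String> out) {
        for (StatusNode c : children) {
            if (c.children.isEmpty()) {
                out.add(prefix + c.name + ": " + (c.active ? "Active" : "Inactive"));
            } else {
                c.listCompStatus(prefix + c.name + ".", out);
            }
        }
    }
}
```

For a main network holding an active producer and a subnet whose IN and Discard are both inactive, anyDirectChildActive() on the subnet returns false (the false deadlock), while anyComponentActive() on the main network returns true.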
When using subnets the main network and the subnets write concurrently to the trace file which may result in lines like these:
Network: Pass.IN: ReceivNetwork: Active: discard
A possible solution would be to let subnets call the traceFuncs method. But I think it would be nice if each subnet created its own trace file.
This would create a Thread-<x>-fulltrace.txt file for the main network and a <subnet name>-fulltrace.txt file for each subnet.
Tracing in JavaFBP-2.4 has now been enhanced with a number of new features, as has the latest (unreleased) version of C#FBP. These features are described in FBPTracing.
A future enhancement may be to route trace information over a (network-general) connection to a separate component which will look after writing out the trace data. This arises from the fact that sometimes production runs have to be run with tracing enabled, so you want tracing to slow down production components as little as possible.
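A rough sketch of that idea, using a plain BlockingQueue in place of an FBP connection: components only enqueue trace lines, and a single writer takes them off the queue one at a time. TraceWriterSketch and its methods are hypothetical, and an in-memory list stands in for the trace file.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;

// Hypothetical trace writer; not part of JavaFBP.
class TraceWriterSketch implements Runnable {
    // Unique sentinel object, compared by reference on purpose.
    private static final String END = new String("<end-of-trace>");

    private final BlockingQueue<String> queue = new LinkedBlockingQueue<>();
    final List<String> written = new ArrayList<>(); // stands in for the trace file

    // Called by components: cheap, never blocks on file I/O.
    void trace(String line) {
        queue.add(line);
    }

    // Tells the writer that no more trace lines will follow.
    void close() {
        queue.add(END);
    }

    @Override
    public void run() {
        try {
            // Lines are taken whole, one at a time, so they cannot interleave.
            for (String line = queue.take(); line != END; line = queue.take()) {
                written.add(line);
            }
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt(); // preserve the interrupt for the caller
        }
    }
}
```

In a real run the writer would sit on its own thread (new Thread(writer).start()), and the main network and all subnets would call trace() on the same instance. Because every line passes through the queue as a whole string, this would also avoid the interleaved output mentioned above.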