Stuff about Pypes by Eric Gaumer

The original intent was to address an area in which I spend most of my professional time: aggregating, normalizing, and enriching disparate content in preparation for indexing into a search engine.

It's no surprise that information retrieval (search) is a hot topic. Most organizations now have entire divisions devoted to search. I've been lucky enough to be part of a wide range of enterprise search projects. Regardless of the underlying search platform, the area that typically determines the project timeline is moving content from enterprise management systems into the search index.

Traditional ETL is often too rigid for this job. For example, most ETL platforms are unable to extract content from enterprise content management systems such as FileNet or Documentum. Traditional ETL applications are also batch oriented, meaning they run at pre-configured intervals. While this works fine for data warehousing, it doesn't work well for enterprise search.

Most organizations have near-real-time requirements when it comes to indexing. When an editor publishes a new story, it's expected to show up in the search results in a matter of minutes (if not seconds). For this we need to move from a batch-oriented system to a transaction-based one.

Pypes provides this transaction layer using a push notification model (eager data flow). It can be thought of as a publishing system that lets users apply business logic in the form of pluggable components. Pypes can also publish to multiple endpoints (i.e., users can design directed acyclic graphs, or DAGs).

In order to achieve its goals, pypes uses flow-based design principles. Documents (data) are submitted to pypes through a REST interface using an HTTP POST. Pypes uses Adapters to convert different content formats to an internal representation (packet). A packet in Pypes is a JSON object, which provides an efficient and standards-based representation of the data. This also allows metadata to be attached at the packet or field level.
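
Just to make the packet idea concrete, here is a rough sketch of what such a JSON object could look like. The schema shown is my own illustration, not the actual Pypes wire format:

    # Illustrative packet sketch; the exact schema is an assumption, but it
    # shows how JSON can carry metadata at the packet or field level.
    import json

    packet = {
        "meta": {"source": "filenet", "received": "2009-11-08T20:27:00Z"},
        "fields": {
            "title": {"value": "Quarterly Results", "meta": {"lang": "en"}},
            "body": {"value": "Revenue was up...", "meta": {"tokenized": False}},
        },
    }

    print(json.dumps(packet, indent=2))  # the wire format is plain JSON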

Any business logic then operates on this common structure as the packet flows through the system. Publisher components then convert the packet stream into a format expected by the systems they publish to.
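
As a sketch (the function name and packet layout are assumptions carried over from the example above), a business-logic component might look like this:

    # A component sees only the common packet structure,
    # never the original source format.
    def normalize_title(packet):
        """Trim and title-case the title field, tagging it as normalized."""
        field = packet["fields"]["title"]
        field["value"] = field["value"].strip().title()
        field["meta"]["normalized"] = True
        return packet

Because every component speaks this same structure, components compose freely regardless of where the data came from or where it is headed.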

Python was the language of choice for several reasons. I wanted to design a system that allowed users to write custom components, and Python is an excellent choice because it offers rich text-processing capabilities and is a very expressive language. The syntax is simple and easy to read. And because many of the core libraries are implemented in C, Python's performance on this kind of workload is competitive with languages like Java.

Python is also a dynamic language, so it fits the Web 2.0 model nicely. Many organizations are moving toward more RESTful platforms and away from things like ESB, SOAP, WSDL, RPC, etc. Python is also highly object oriented and provides metaprogramming facilities.

Once Python was chosen, Stackless seemed like a natural fit. Stackless Python provides some additional capabilities over the standard Python interpreter; most notably, it offers true coroutines. Coroutines go hand in hand with flow-based programming: the ability for one component to call another (pass a message) without the notion of a return value (i.e., stack winding/unwinding) makes flow-based designs seem almost effortless.
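
Here is a minimal sketch of that style using the actual Stackless channel and tasklet primitives (it requires the Stackless interpreter; the component logic itself is made up):

    # Two components joined by a Stackless channel. send() hands the packet
    # straight to the receiver -- no return value, no stack unwinding.
    import stackless

    def producer(ch):
        for i in range(3):
            ch.send({"id": i, "body": "doc %d" % i})  # blocks until received
        ch.send(None)  # sentinel: no more packets

    def consumer(ch):
        while True:
            packet = ch.receive()
            if packet is None:
                break
            print("processing", packet["id"])

    ch = stackless.channel()
    stackless.tasklet(producer)(ch)
    stackless.tasklet(consumer)(ch)
    stackless.run()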

We ran some tests of message-passing performance, and Stackless Python outperformed everything we tested. Consider a project I'm currently working on that requires us to index nearly 350 million documents. Given 20 components that perform business logic on the data stream, that equates to 20 × 350 million = 7 billion context switches (switches of execution between components). Although threads are lightweight compared to processes, they still incur a considerable amount of overhead at that scale.

In contrast, coroutine-based micro-threads have almost no overhead. They are scheduled at the VM level (as opposed to by the OS), and a switch amounts to little more than swapping stack pointers within the same address space.

In order to take full advantage of multi-core architectures, code was added to allow one instance of the DAG to be executed on each CPU/core. Documents pushed into the system are then load-balanced across this instance pool, so documents are operated on in parallel. This allows the system to scale up.
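
A rough sketch of the idea, assuming the DAG can be modelled as an ordered list of component functions (illustrative only, not the actual Pypes internals):

    import multiprocessing as mp

    def upper_title(packet):
        packet["title"] = packet["title"].upper()
        return packet

    COMPONENTS = [upper_title]  # each worker runs its own copy of the "DAG"

    def run_instance(inbox, outbox):
        # Drain the shared queue; None is the shutdown sentinel.
        for packet in iter(inbox.get, None):
            for component in COMPONENTS:
                packet = component(packet)
            outbox.put(packet)

    if __name__ == "__main__":
        inbox, outbox = mp.Queue(), mp.Queue()
        workers = [mp.Process(target=run_instance, args=(inbox, outbox))
                   for _ in range(mp.cpu_count())]
        for w in workers:
            w.start()
        for i in range(8):            # stand-in for documents POSTed over REST
            inbox.put({"id": i, "title": "story %d" % i})
        for _ in workers:
            inbox.put(None)           # one sentinel per worker
        for _ in range(8):
            print(outbox.get())
        for w in workers:
            w.join()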

Scaling out is simple because of the REST interface. One can add a new node and allow a load balancer to distribute requests equally across the cluster.

So now you have some background into how flow-based semantics are being applied today at the enterprise level.

I think it would be excellent to include some of this in the book. It would give people a concrete sense of flow-based principles applied to a very relevant problem facing many organizations today.

It seems like organizations are continually moving data around, and that is where flow-based programming really shines (in my opinion). I looked at some research on ETL tools a while back, and studies showed that nearly 70% of organizations surveyed rolled their own ETL solution. Many felt they could build an optimal system that was more specific to their needs. I haven't seen a commercial ETL system that boasts flow-based design principles; many require a lot of configuration to support custom components.

Since Python syntax is terse yet expressive, adding some code examples might even be feasible. Sometimes it helps to understand the theory behind something if you can see some code. Showing some sample components in Pypes could help demonstrate the notion of packets, message passing, loopers vs. non-loopers, batch- vs. loop-type networks, etc.
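
For instance, the looper / non-looper distinction maps naturally onto plain functions versus generators in Python (a sketch; the names and packet layout are made up):

    def upcase_body(packet):
        """Non-looper: activated once per packet, returns, keeps no state."""
        packet["body"] = packet["body"].upper()
        return packet

    def number_packets(stream):
        """Looper: keeps control across packets, so it can hold running state."""
        count = 0
        for packet in stream:
            count += 1
            packet["seq"] = count
            yield packet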

I'd be willing to help out with any/all of this.

BTW: Have you heard about CouchDB?

http://couchdb.apache.org/

CouchDB is a schema-free distributed database that provides a flat document model (much like a search engine). There's been a lot of buzz surrounding it because it's a truly distributed database model. It's written in Erlang, which offers concurrency inherent in the language. It offers a REST interface for document submission, and JavaScript is the query language. The interesting thing is that documents are submitted as JSON objects, which means that CouchDB and Pypes play very nicely together.
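
Here's a minimal sketch of that interface using only the Python standard library (it assumes a local CouchDB instance with a database named "stories" already created):

    import json
    import urllib.request

    doc = {"title": "Quarterly Results", "body": "Revenue was up..."}
    req = urllib.request.Request(
        "http://localhost:5984/stories",       # POST to the database URL
        data=json.dumps(doc).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(req) as resp:
        print(json.load(resp))  # CouchDB replies with ok, id, and rev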

I have some demos I'm working on that allow you to submit a spreadsheet to Pypes through a web interface, apply some normalization logic, and publish each row (packet) as a document in CouchDB, where you can then provide a structured interface offering search capabilities, views, etc.
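
The spreadsheet side of that demo reduces to something like this sketch (assuming the sheet is exported as CSV; the sample data and the normalization step are made up):

    import csv
    import io

    SAMPLE = "sku,price\nA100,9.99\nB200,19.50\n"

    def rows_as_packets(csv_text):
        # Turn each spreadsheet row into one packet (a plain dict).
        for row in csv.DictReader(io.StringIO(csv_text)):
            row["price"] = float(row["price"])  # normalization step
            yield row

    for packet in rows_as_packets(SAMPLE):
        print(packet)  # each packet would be POSTed to CouchDB as JSON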

Very cool, very cutting-edge technologies that leverage the best the web has to offer, yet still grounded in decades-old theory like flow-based programming and coroutines.

