[ Back to Sprockit home page ]

Introduction to Sprockit

Sprockit controls the execution of collections of ETL processes – SSIS packages and T-SQL stored procedures. It's free, open-source and written purely in T-SQL.

This page gives a brief overview of how Sprockit works. For more detailed information on how Sprockit differs from other ETL process control tools, take a look at What's different about Sprockit?. If you're ready to try it for yourself, have a go at the Sprockit tutorial.

Sprockit's process control is dependency-based – processes are executed in an order that respects dependencies between them. Dependencies between processes are not described directly (e.g. as “P1 depends on P2”) but in terms of resources used by and produced by each process. Resources are things like database tables and views.

Example If a process P1 uses table T, it cannot run until table T's data is up-to-date. If the data in T is prepared by another process, P2, then P1 cannot start until P2 has finished.

Sprockit refers to T as an output resource of P2, and as an input resource of P1. This information is enough for Sprockit to infer that P2 has to run before P1.

Resource dependencies are easier to maintain than process dependencies, because you don't need global dependency information – all you need is the list of resources that make up each individual process's inputs and outputs. They also allow index maintenance to be handled automatically by Sprockit.

Processes are executed by a process handler. A handler runs in a loop, in each iteration selecting a process ready for execution, starting it, then waiting for it to finish. If a handler finds nothing ready for execution at the end of an iteration, it quits.

Process handlers are created by the process manager. The process manager polls the list of ready processes at regular intervals, and whenever it finds ready processes waiting it creates a new handler. So:

  • a new handler is started when the process manager finds processes ready;
  • a handler stops when it looks for a new process to execute and finds nothing ready.

The combined effect of these two features is to cause the number of running handlers to rise and fall automatically in response to demand (up to a configured maximum).

When multiple processes are ready, a handler will select the one with the largest amount of work behind it (i.e. the work to complete not only the ready process but also any downstream processes that depend on the ready process, either directly or indirectly). Over successive runs this has the effect of load balancing a processing workload between multiple handlers, causing end-to-end processing time to fall.

When a process fails, its handler catches the underlying error and logs it. The process is marked as errored, and any processes dependent on it become blocked – but the handler is now free to look for its next ready process.

Some errors – for example transaction deadlocks – are recoverable. When a handler catches a recoverable error, the failed process is marked back to ready, re-joining the queue of processes for execution.

Unrecoverable errors eventually cause batch processing to fail to complete. After a batch failure – and after the underlying problem is resolved – processing can be resumed, picking up automatically from the point of the failed process.

A variety of diagnostic information is available:

Ready to try Sprockit for yourself? Have a go at the Sprockit tutorial!

You can find more detailed information on how Sprockit differs from other ETL process control tools in What's different about Sprockit?.