[ Back to Sprockit home page ]

What's different about Sprockit?

Sprockit differs from other ETL process control approaches in two principal areas – decentralised process handling and resource-based dependency management – each of which brings advantages for managing ETL processes:

Although Sprockit execution is overseen by a single process manager, the process manager itself is not responsible for the execution of ETL processes. Instead, the process manager creates and monitors a pool of independent process handlers running in parallel. Each process handler selects processes for execution from the list of Ready processes.

The process manager is the stored procedure sprockit.usp_ProcessManager, running in a SQL Server Agent job1). The SP loops indefinitely, inspecting the process list for Ready processes at each iteration. If any are found, and if fewer handlers are running than the configured maximum, the manager creates and starts a new handler. The process manager does not explicitly allocate the ready process to the new handler; instead the handler itself inspects the process list for any ready process – the process manager uses the existence of Ready processes merely as an indicator that more handler capacity may be required.

Handlers continue to run only as long as there are processes Ready – a handler exits if there is no work to be done, so is never left idle. If more processes later become Ready for execution in parallel, the process manager responds by starting new handlers to ramp up parallelism again. When there are no more Ready processes and every handler has finished, the process manager itself quits.

TL;DR: Just restart the process manager.

Because handlers operate solely by looking for Ready processes, restart after failure is very straightforward. For a process to have status Errored it must have been Running, and to have been Running it must have been Ready. Restart after error simply consists of marking an Errored process back to Ready, then starting a new process handler to execute it.

This is all handled automatically by the process manager – all you need to do is restart it.

TL;DR: No more ETL failures due to transaction deadlocks.

Because a handler can run any Ready process, setting the status of a failed process back to Ready will cause it to be retried. In practice this is useful for tolerating intermittent faults, in particular if a stored procedure is killed by the database engine to resolve a transaction deadlock (error 1205). When a process handler catches a 1205 error emerging from a stored procedure, it sets the process's status back to Ready so that it can be retried.

TL;DR: Increase or decrease parallelism by changing a single configuration value.

The maximum number of handlers operating in parallel is controlled by Sprockit's MaxConcurrentHandlers property, which can be increased or decreased as required, even during execution.

This means that when you implement a new, more powerful database server, you can immediately take advantage of your additional resources – simply by increasing the value of MaxConcurrentHandlers. If your ETL testing environment is significantly less powerful than production, you may be able to accelerate processing by specifying a lower degree of parallelism, without otherwise deviating from the production environment being simulated.

TL;DR: Sprockit automatically reduces end-to-end processing time.

Once started, handlers are completely independent of the process manager (and of one another) and are permitted to select any process for execution as long as it has [Status] = Ready. This means that the absolute order of process execution can change, within the constraints of the underlying process dependencies. For example, any of the following three orderings is permitted for the dependency graph to the right:

  • A, B, C, D, E, F, G, H, I, J
  • A, D, E, B, F, I, C, G, J, H
  • C, H, B, G, J, F, E, I, A, D

When multiple handlers are running in parallel, it is often possible to find an execution order that serves to reduce end-to-end ETL processing time. Decentralised, independent handlers can produce this ordering because they have freedom to select processes in any order (as long as that order respects the underlying process dependencies).

Furthermore, if the optimum ordering changes over time – because new processes are added to the ETL routine, or because certain processes' execution time changes – independent process handlers allow the absolute order of execution to change without requiring user intervention.

Selection of processes from the ready list is handled by the sprockit.usp_ReserveProcess stored procedure. This SP uses processes' branch weight to induce an absolute ordering with a lower end-to-end execution time.

Sprockit's process handlers need information about process dependencies to be able to execute processes in the correct order. Process dependencies are the arrows on the dependency graph above – e.g. process A has to finish before process D can start; process I cannot begin until processes E and F are complete.

Sprockit describes dependencies in terms of resources – the set of SQL Server objects that make up the inputs and outputs of a process. By joining an output resource to its subsequent uses as an input resource, handlers infer process dependencies from the underlying resource dependencies.

TL;DR: Rebuild your indexes at exactly the right time without writing a line of code.

For tables updated by ETL processes, the best time to perform index maintenance is immediately after the table has been updated. Being able to determine when that happens is a natural consequence of Sprockit's resource dependency model – a process's output resources are ready for index maintenance as soon as the process is complete.

This allows index maintenance to be wholly devolved to Sprockit, without needing to rely on scheduled maintenance plans or embedding index maintenance into your ETL processes.

TL;DR: Get new developers working quickly.

When adding a new process to your ETL, you may be used to having to think carefully about exactly where it belongs – working out which processes it has to follow, which others can't run until it finishes. This needs you to be very well-acquainted with your ETL landscape, and can be an obstacle to getting new developers up to speed.

When adding a new process, Sprockit's resource dependency model requires only that you provide the process's inputs and outputs (i.e. the SQL objects it uses when it runs, and the objects it updates). This information is immediately available from your process definition – you can read it directly out of your SQL code or SSIS package tasks – so new developers can integrate new processes with very little background knowledge about the landscape.

TL;DR: Automatically generate ETL pipeline visualisations.

Because Sprockit's dependencies are stored as structured data in tables, they are readily available for use in the visualisation tool of your choice. For example, the SQL below left builds the Graphviz script on the right (for the set of dependencies we've been using above) to generate a simple process dependency graph.

Don't want to build your own visualisations? Get sprockitviz to generate them for you!

-- nodes (processes)
  @gv += '
  n' + CAST(ProcessId AS VARCHAR) + ' [label="' + ProcessName + '"];'
FROM sprockit.Process
SET @gv += '
-- edges (process dependencies)
  @gv += '
  n' + CAST(PredecessorId AS VARCHAR) + ' -> n' + CAST(SuccessorId AS VARCHAR) + ';'
FROM sprockit.uvw_ProcessDependency
PRINT 'digraph G {' 
  + @gv 
  + '}'
digraph G {
  n1 [label="A"];
  n2 [label="B"];
  n3 [label="C"];
  n4 [label="D"];
  n5 [label="E"];
  n7 [label="G"];
  n8 [label="H"];
  n9 [label="I"];
  n6 [label="F"];
  n10 [label="J"];
  n1 -> n4;
  n2 -> n6;
  n2 -> n7;
  n3 -> n7;
  n3 -> n8;
  n5 -> n9;
  n6 -> n9;
  n7 -> n10;}

Using the Graphviz script at Webgraphviz produces the diagram below. It looks different to the one further up this page because that one was created by hand – the one below is drawn automatically by Graphviz from Sprockit's dependency information.

Ready to try Sprockit for yourself? Have a go at the Sprockit tutorial!

The Introduction to Sprockit provides a technical overview of how Sprockit operates.

The process manager job name is specified in Sprockit's ProcessManagerAgentJobName property