CloudConnect - Using lookups

Related Tags: cloudconnect best practices etl

One of more advanced graph elements are so-called lookups. You can imagine lookups as another way for data to enter the graph, apart from readers and other components for data input. A lookup is simply another layer of abstraction between your graph and the data you work with, allowing you to always use the same set of components and CTL functions, no matter what the real datasource behind the lookup is. I would also like to point out that I will only be covering a couple of most examples in this article - you might want to read documentation to learn about lookups in a more detailed manner.

Hint

When is it the right time to use lookups in your graph? If the data edges in your transformation cross too often and you access the same data files more than once, you should definitely give it a try!

Creating lookups

To access the lookup wizard, right click the Lookups in the graph Outline and continue to Lookup Tables -> Create Internal. There are currently seven types of lookups available, but I will only describe the creation of the first one, the Simple Lookup.

LookupWizard

The wizard has two panels, one for parameters and one for source data.

LookupWizard

Table definition

Parameter nameParameter description
NameLookup identifier.
MetadataMetadata describing the table structure. If your data are in a flat file, you might want to use metadata for the particular file.
KeyPrimary key of the lookup. Records will be joined according to this key and also deduplicated by default.

Data source

This panel allows to load data from file during graph initialization phase or even enter data manually if it fits your use case. It can be left blank in case you want to load the table during the graph run.

Components accessing lookups

LookupTableReaderWriter

LookupTableReaderWriterAs it name implies, this component can both write data to and read them from particular lookup (even at the same time if both input and output ports are connected). It can be also used to truncate the lookup.

Warning

Keeping large amounts of data within simple lookup may increase your graph memory requirements significantly! Keep this in mind if your transformation will be executed in a memory constrained environment.

LookupJoin

LookupJoinThe LookupJoin component is where you actually merge the lookup content with the data flow. As with other joiners, you need to specify the key and the join type. However, you don’t have to care about which lookup type is being used, unless you hit memory limit or some other performance issue.

Parameter nameParameter description
Join keyKey according to which the incoming data flows are joined.
Left outer joinIf set to true, also driver records without corresponding slave are parsed. Otherwise, inner join is performed.
Lookup tableID of the lookup table to be used as the resource of slave records. Number of lookup key fields and their data types must be the same as those of Join key. These fields values are compared and matched records are joined.
TransformTransformation in CTL or Java defined in the graph.

Using Simple lookup

Probably the most popular and also the easiest to set and use is the Simple lookup. Its core feature is the fact it is an in-memory table which is only persistent during the single graph run. The most popular usecases can be usually boiled down to one of the following:

  1. Reusing data
  2. If some local data are to be used more than once in the particular graph, loading it into simple lookup can decrease the number of components as well as save I/O, thus speeding execution.

  3. Appending data
  4. LookupAppendIf you are creating a dataset that is appended from several sources during the graph run, you might save a lot of time by appending it into in-memory lookup and only outputting it when the processing is finalized.

  5. Referential integrity assurance
  6. LookupIntegrity

    As the LookupJoin allows outputting unmatched records from an input port onto a slave output port, it can be well used to trap the unreferenced rows.

  7. Data deduplication
  8. One of the side-effects of the simple lookup is the fact it automatically deduplicates its content by key by default. This means it can also serve as a quick replacement for Ext/FastSort and Dedup component pair.

Using lookups in CTL

Apart from accessing the data stored in lookups using the components mentioned above, it is also possible to access the stored records from within CTL scripts. The get() function example below will search the lookup for specified key and return the whole record or the specified field.

recordName1 testRecord;
string testField;
testRecord = lookup(simpleLookup0).get($0.ID);
testField = lookup(simpleLookup0).get($0.ID).text;

Example project

Feel free to download the example project with practical examples on use cases mentioned!