|Related Tags: cloudconnect etl|
You must be familiar with basic CloudConnect concepts and components usage. Feel free to find more information in the reference documentation.
In some cases, you need to join multiple incoming data streams to the single output. There are multiple components and approaches that you can select. This article shows you best practice in using ExtMergeJoin, ExtHashJoin components and Lookup tables with CTL2 functions.
It uses no internal memory for joining of slave data to master data. Data isare used for processing in the order that they are flowing through edges. This serial processing does not allow a “rewind” of data going through edge; when new record is used on the slave port, the previous record is “forgotten”. This component has the following properties:
Data on each port (master, slave) must be sorted according to the key used in the join. These sorting activities can be memory and time exhaustive. There is the only one “master key” which could be used in the ExtMergeJoin component, all slave ports join to the same master key of the ExtMergeJoin component. RThe required sorting of all input data is the most important condition affecting the use of the ExtMergeJoin component. This sorting operation consumes time and memory in the CloudConnect engine. In some cases, these sort components must have their own phase inside of a graph, and processing of large data volumes(number of records x number of fields) can last too long.
If you need to use more than one master join key for the same data, you need to chain ExtMergeJoin components together, each with its own unique master join key and repeated sorts according to master join key used.
It uses internal memory for the slave data. There is direct access to each value of slave join key on each slave input port.
The basic characteristics are:
- There is no sorting necessary on the master and slave ports.
- Any slave input port can have its own master join on the master port, which may differ from master keys used on other slave ports.
- Total memory consumption is based on the sum of the memory needed for data on each slave port. The memory is freed after finishing of work of this component.
The overall memory consumption by the slave port data is the most important property of the
ExtHashJoin component. On the other side, there is no volume limit placed on the master data, as it does not need to be sorted.
Memory consumption can be reduced by narrowing the metadata on slave ports. This will limit input to only include the slave key fields and the fields that are being transformed into output data.
- A Lookup table is the “backend” for the LookupJoin component and for the lookup function.
- A Lookup table is an in-memory table which can be filled in multiple ways: an explicit list, a file, or by using the `LookupReaderWriter` component. This table resides in the memory until the time that it is explicitly freed, so it is convenient for small volume lookup operations.
It can be used in the LookupJoin component and in the lookup function, as described below, or in a
LookupReaderWriter component (which operates differently and is not included in this article).
Most used variant of the lookup table is Simple Lookup Table, which is located in-memory as a whole. If memory usage is a concern, a lookup table can be emptied, which will release all of the memory used. It is important to note that referencing and empty lookup table will not throw an error, but the operation will just return no results. If there is need to put more records into the lookup table, the Persistent Lookup Table is convenient replacement, as it serves a greater number of records than the Simple Lookup Table. It uses internal memory only for storing of the lookup table keys, as table data is stored on disc.
LookupJoin uses a master join key. The slave key is defined as the key of the Lookup table that is used in a LookupJoin component, no change is possible.
LookupJoin can be described in short as an
ExtHashJoin with only one slave port, where the slave port is populated by data from a designated lookup table, or as a call of the lookup function within a
Reformat component at a convenient point in the related data flow.
CTL2 lookup function
The CTL2 lookup function uses a lookup table and its key. The lookup function has the following properties:
A Lookup table must be fully populated before it is used by the lookup function.
There is clear indication of which master key is used for selection and which lookup table is referenced (although the slave key is not apparent within call context). This lookup table consumes memory during the graph run from the moment of population until the phase when it is explicitly freed. The volume of memory used for the lookup table does not depend of the number of related calls of that particular lookup.
A Lookup function is highly recommended for repeated use of small lookup tables throughout a whole graph. One typical example is replacing RecordTypeId with the related RecordType.Name in more SalesForce datasets simultaneously.
The LookupJoin can be considered as a redundant component, because it can be replaced easily by the other join components or by a lookup function (since both the
LookupJoin and the lookup function need a lookup table).
ExtHashJoin component is highly recommended only in cases where it can be provided in-memory.
ExtMergeJoin component requires careful planning of graph run phases and preceding sorts. It is recommended to use it only in the case that other solutions can not handle the required join operation.
The Simple Lookup Table is recommended for small volume datasets. The Persistent Lookup Table can serve bigger datasets, which advantage is paid by more time needed for each individual lookup operation.