Multiload of CSV data


The GoodData platform supports loading multiple datasets from a set of CSV files in a single task. Instead of loading one CSV file at a time, you can upload all of your CSV files, provide a JSON manifest file, and then execute the data load through a single API call. This method is particularly useful if your project contains many datasets, uses incremental data loading, or requires a long time to synchronize data.

We recommend that you use the batch upload functionality to increase the overall throughput of your customers' ETL. When the batch upload parameter is set to TRUE, all dataset uploads from a single CloudConnect graph execution are grouped into one batch and executed directly on our data loading infrastructure. The CloudConnect worker is then freed to move on to the next customer's graph execution instead of waiting for each dataset to be loaded separately and independently.

  • Multiload of CSV data was introduced in Release 98.3.
  • This feature is backward compatible with all of your existing ETL scripts.
  • Support for multiload through CloudConnect is available since Release 100.7.

Overview

When a project is refreshed through scripted methods, ETL developers must start a separate CSV data loading task for each dataset. Multiload simplifies this process by combining all of the data loading tasks into a single task.

When the appropriate API call is executed, the data loading interface manages the entire process. The loading process consists of two phases. The first phase is common to all datasets, so an error during this phase fails the load for all datasets. In the second phase, each dataset is processed independently, in the order specified in the manifest file; if an error occurs while loading one dataset, the load continues with the next dataset.
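The two-phase semantics can be sketched as follows. This is an illustrative model of the behavior described above, not GoodData code; the function and parameter names are hypothetical.

```python
# Hypothetical sketch of the multiload error-handling semantics:
# phase 1 runs once for all datasets, phase 2 runs per dataset.

def run_multiload(manifests, common_phase, load_dataset):
    """Phase 1 is shared; an error there fails every dataset.
    Phase 2 loads each dataset independently, in manifest order."""
    common_phase()  # an exception here propagates: all datasets fail
    results = {}
    for manifest in manifests:  # processed in the order given in the manifest file
        name = manifest["dataSet"]
        try:
            load_dataset(manifest)
            results[name] = "OK"
        except Exception as exc:  # a failure here does not stop the remaining datasets
            results[name] = f"ERROR: {exc}"
    return results
```

Note that a dataset-level error in phase 2 is recorded and skipped over, which is why a multiload can finish with some datasets loaded and others not.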

Process

The following steps outline the basic process of using multiload:

  1. Build your manifest file to reflect the fields of each dataset. Name this file upload_info.json. See Manifest File below.
  2. Upload all CSV data files and the manifest to a directory on the GoodData platform. See Project-Specific Storage.
  3. Execute a POST to the following API endpoint:
    /gdc/md/[project-id]/etl/pull2
    
  4. The data loading interface (SLI) handles the loading of each dataset with the appropriate CSV file, as specified in the manifest. If errors are encountered with one dataset, loading continues with the next CSV file.
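The API call in step 3 can be assembled as shown below. This is a sketch only: the request-body shape (a "pullIntegration" key naming the upload directory) and the host name are assumptions to verify against the GoodData API documentation, and authentication is omitted.

```python
# Build (but do not send) the POST that triggers the multiload.
# The "pullIntegration" payload key is an assumption; check the API docs.
import json
import urllib.request

def build_pull2_request(host, project_id, upload_dir):
    url = f"{host}/gdc/md/{project_id}/etl/pull2"
    body = json.dumps({"pullIntegration": upload_dir}).encode("utf-8")
    return urllib.request.Request(
        url,
        data=body,
        headers={"Content-Type": "application/json",
                 "Accept": "application/json"},
        method="POST",
    )

req = build_pull2_request("https://secure.gooddata.com", "myproject", "multiload-dir")
# urllib.request.urlopen(req) would execute the load (requires authentication)
```

The request is only constructed here; sending it requires an authenticated session against your GoodData domain.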

Manifest File

The structure of the manifest is very similar to the single-file version. The file is an array of individual manifests, and it must be named upload_info.json:

{
   "dataSetSLIManifestList" : [
        {
            "dataSetSLIManifest": {
                "parts": [
                    {
                        "columnName": "team_id",
                        "mode": "FULL",
                        "populates": [
                            "label.dim_team.id"
                        ],
                        "referenceKey": 1
                    },
                    {
                        "columnName": "name",
                        "mode": "FULL",
                        "populates": [
                            "label.dim_team.name"
                        ]
                    }
                ],
                "file": "dim_team.csv",
                "dataSet": "dataset.dim_team"
            }
        },
        {
            "dataSetSLIManifest": {
                "parts": [
                    {
                        "columnName": "assignee_id",
                        "mode": "FULL",
                        "populates": [
                            "label.dim_assignee.id"
                        ],
                        "referenceKey": 1
                    },
                    {
                        "columnName": "team_id",
                        "mode": "FULL",
                        "populates": [
                            "label.dim_assignee.team_id"
                        ],
                        "referenceKey": 1
                    },
                    {
                        "columnName": "name",
                        "mode": "FULL",
                        "populates": [
                            "label.dim_assignee.name"
                        ]
                    }
                ],
                "file": "dim_assignee.csv",
                "dataSet": "dataset.dim_assignee"
            }
        }
    ]
}
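If you generate the manifest from metadata rather than writing it by hand, a small helper keeps the nesting consistent. The helpers below are illustrative, not part of any GoodData SDK; they reproduce the structure of the example above for one of the two datasets.

```python
# Illustrative helpers for assembling upload_info.json; not a GoodData SDK.
import json

def part(column, labels, reference_key=None):
    p = {"columnName": column, "mode": "FULL", "populates": labels}
    if reference_key is not None:
        p["referenceKey"] = reference_key
    return p

def dataset_manifest(dataset, csv_file, parts):
    return {"dataSetSLIManifest": {"parts": parts,
                                   "file": csv_file,
                                   "dataSet": dataset}}

manifest = {"dataSetSLIManifestList": [
    dataset_manifest("dataset.dim_team", "dim_team.csv", [
        part("team_id", ["label.dim_team.id"], reference_key=1),
        part("name", ["label.dim_team.name"]),
    ]),
]}

# The file must be named upload_info.json.
with open("upload_info.json", "w") as f:
    json.dump(manifest, f, indent=4)
```

Appending further `dataset_manifest(...)` entries to the `dataSetSLIManifestList` array covers the multi-dataset case.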

Configure CloudConnect Graphs in Data Integration Console

NOTE: The following procedure applies to you only if your token was created before January 16, 2016. If your token was created on January 16, 2016, or later, batch mode is already set as default for you. For more details, see Data Loading Modes in CloudConnect.

In the Data Integration Console, you can configure your ETL graphs to batch load data by adding a new parameter to the process schedule.

Steps:

  1. Click on your name in the GoodData Portal, and select Data Integration Console from the dropdown.
  2. In Data Integration Console, select and open your schedule.
  3. Click Add parameter.
  4. For the name of the parameter, enter GDC_USE_BATCH_SLI_UPLOAD. Set the value to TRUE.
  5. Save your changes.
  6. Run the graph to validate.

Remember

  • If the data loading process fails in the middle of a running graph, no data is loaded.
  • A finished Dataset Writer does not mean that the data load finished successfully. The process waits until all dataset writers have finished, and only then is the data loaded.

Please let us know how this feature works for you.