Move data to and from Azure Table using Azure Data Factory
Note
This article applies to version 1 of Data Factory. If you are using the current version of the Data Factory service, see Azure Table Storage connector in V2.
This article explains how to use the Copy Activity in Azure Data Factory to move data to/from Azure Table Storage. It builds on the Data Movement Activities article, which presents a general overview of data movement with the copy activity.
You can copy data from any supported source data store to Azure Table Storage or from Azure Table Storage to any supported sink data store. For a list of data stores supported as sources or sinks by the copy activity, see the Supported data stores table.
Note
We recommend that you use the Azure Az PowerShell module to interact with Azure. See Install Azure PowerShell to get started. To learn how to migrate to the Az PowerShell module, see Migrate Azure PowerShell from AzureRM to Az.
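If you do not have the Az module installed yet, a typical installation from the PowerShell Gallery looks like this (a minimal sketch):
# Install the Az module for the current user
Install-Module -Name Az -Scope CurrentUser -Repository PSGallery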
Getting started
You can create a pipeline with a copy activity that moves data to/from Azure Table Storage by using different tools/APIs.
The easiest way to create a pipeline is to use the Copy Wizard. See Tutorial: Create a pipeline using Copy Wizard for a quick walkthrough on creating a pipeline using the Copy data wizard.
You can also use the following tools to create a pipeline: Visual Studio, Azure PowerShell, Azure Resource Manager template, .NET API, and REST API. See Copy activity tutorial for step-by-step instructions to create a pipeline with a copy activity.
Whether you use the tools or APIs, you perform the following steps to create a pipeline that moves data from a source data store to a sink data store:
- Create linked services to link input and output data stores to your data factory.
- Create datasets to represent input and output data for the copy operation.
- Create a pipeline with a copy activity that takes a dataset as an input and a dataset as an output.
When you use the wizard, JSON definitions for these Data Factory entities (linked services, datasets, and the pipeline) are automatically created for you. When you use tools/APIs (except the .NET API), you define these Data Factory entities by using the JSON format. For samples with JSON definitions for Data Factory entities that are used to copy data to/from Azure Table Storage, see the JSON examples section of this article.
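For example, if you author these JSON definitions in local files, a minimal Az PowerShell sketch for deploying them to an existing data factory might look like the following (the resource group, data factory, and file names are placeholders, not part of the samples in this article):
# Deploy the linked service, datasets, and pipeline defined in local JSON files (illustrative file names)
New-AzDataFactoryLinkedService -ResourceGroupName <resourceGroupName> -DataFactoryName <dataFactoryName> -File .\StorageLinkedService.json
New-AzDataFactoryDataset -ResourceGroupName <resourceGroupName> -DataFactoryName <dataFactoryName> -File .\AzureTableInput.json
New-AzDataFactoryDataset -ResourceGroupName <resourceGroupName> -DataFactoryName <dataFactoryName> -File .\AzureBlobOutput.json
New-AzDataFactoryPipeline -ResourceGroupName <resourceGroupName> -DataFactoryName <dataFactoryName> -File .\SamplePipeline.json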
The following sections provide details about JSON properties that are used to define Data Factory entities specific to Azure Table Storage:
Linked service properties
There are two types of linked services you can use to link an Azure storage account to an Azure data factory: the AzureStorage linked service and the AzureStorageSas linked service. The Azure Storage linked service provides the data factory with global access to the Azure Storage account, whereas the Azure Storage SAS (shared access signature) linked service provides the data factory with restricted/time-bound access. There are no other differences between these two linked services. Choose the linked service that suits your needs. The following sections provide more details on these two linked services.
Azure Storage Linked Service
The Azure Storage linked service allows you to link an Azure storage account to an Azure data factory by using the account key, which provides the data factory with global access to the Azure Storage account. The following table provides descriptions of the JSON elements specific to the Azure Storage linked service.
Property | Description | Required |
---|---|---|
type | The type property must be set to: AzureStorage | Yes |
connectionString | Specify information needed to connect to Azure storage for the connectionString property. | Yes |
For information about how to retrieve the storage account access keys, see Manage storage account access keys.
Example:
{
"name": "StorageLinkedService",
"properties": {
"type": "AzureStorage",
"typeProperties": {
"connectionString": "DefaultEndpointsProtocol=https;AccountName=<accountname>;AccountKey=<accountkey>"
}
}
}
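To retrieve the account key used in the connection string above, one option is the Get-AzStorageAccountKey cmdlet (a minimal sketch; the resource group and account names are placeholders):
# List the access keys for the storage account and take the first one
$keys = Get-AzStorageAccountKey -ResourceGroupName <resourceGroupName> -Name <accountName>
$keys[0].Value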
Azure Storage SAS Linked Service
A shared access signature (SAS) provides delegated access to resources in your storage account. It allows you to grant a client limited permissions to objects in your storage account for a specified period of time and with a specified set of permissions, without having to share your account access keys. The SAS is a URI that encompasses in its query parameters all the information necessary for authenticated access to a storage resource. To access storage resources with the SAS, the client only needs to pass in the SAS to the appropriate constructor or method. For more information about SAS, see Grant limited access to Azure Storage resources using shared access signatures (SAS).
Important
Azure Data Factory currently supports only service SAS, not account SAS. Note that the SAS URL that can be generated from the Azure portal or Azure Storage Explorer is an account SAS, which is not supported.
Tip
You can run the following PowerShell commands to generate a service SAS for your storage account (replace the placeholders and grant the needed permissions):
$context = New-AzStorageContext -StorageAccountName <accountName> -StorageAccountKey <accountKey>
New-AzStorageContainerSASToken -Name <containerName> -Context $context -Permission rwdl -StartTime <startTime> -ExpiryTime <endTime> -FullUri
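Because this article targets Azure Table Storage, you may want a table-scoped service SAS instead. A sketch that reuses the $context created above (the permission letters are illustrative; grant only what you need):
# Generate a service SAS scoped to a single table (r = query, a = add, u = update, d = delete)
New-AzStorageTableSASToken -Name <tableName> -Context $context -Permission raud -StartTime <startTime> -ExpiryTime <endTime> -FullUri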
The Azure Storage SAS linked service allows you to link an Azure storage account to an Azure data factory by using a shared access signature (SAS). It provides the data factory with restricted/time-bound access to all or specific resources (blob/container) in the storage account. The following table provides descriptions of the JSON elements specific to the Azure Storage SAS linked service.
Property | Description | Required |
---|---|---|
type | The type property must be set to: AzureStorageSas | Yes |
sasUri | Specify the shared access signature URI to the Azure Storage resource, such as a blob, container, or table. | Yes |
Example:
{
"name": "StorageSasLinkedService",
"properties": {
"type": "AzureStorageSas",
"typeProperties": {
"sasUri": "<Specify SAS URI of the Azure Storage resource>"
}
}
}
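For an Azure Table resource, the sasUri is typically the table endpoint followed by the SAS token query string. An illustrative (not real) value, whose exact query parameters depend on how the SAS was generated:
https://<accountName>.table.core.windows.net/<tableName>?tn=<tableName>&sv=<storageServiceVersion>&st=<startTime>&se=<expiryTime>&sp=r&sig=<signature>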
When creating a SAS URI, consider the following:
- Set appropriate read/write permissions on objects, based on how the linked service is used (read, write, or read/write) in your data factory.
- Set the expiry time appropriately. Make sure that the access to Azure Storage objects does not expire within the active period of the pipeline.
- The URI should be created at the right container/blob or table level based on the need. A SAS URI to an Azure blob allows the Data Factory service to access that particular blob. A SAS URI to an Azure blob container allows the Data Factory service to iterate through blobs in that container. If you need to provide access to more or fewer objects later, or to update the SAS URI, remember to update the linked service with the new URI.
Dataset properties
For a full list of sections & properties available for defining datasets, see the Creating datasets article. Sections such as structure, availability, and policy of a dataset JSON are similar for all dataset types (Azure SQL, Azure blob, Azure table, etc.).
The typeProperties section is different for each type of dataset and provides information about the location of the data in the data store. The typeProperties section for the dataset of type AzureTable has the following properties.
Property | Description | Required |
---|---|---|
tableName | Name of the table in the Azure Table storage instance that the linked service refers to. | Yes. When a tableName is specified without an azureTableSourceQuery, all records from the table are copied to the destination. If an azureTableSourceQuery is also specified, records from the table that satisfy the query are copied to the destination. |
Schema by Data Factory
For schema-free data stores such as Azure Table, the Data Factory service infers the schema in one of the following ways:
- If you specify the structure of data by using the structure property in the dataset definition, the Data Factory service honors this structure as the schema. In this case, if a row does not contain a value for a column, a null value is provided for it.
- If you don't specify the structure of data by using the structure property in the dataset definition, Data Factory infers the schema by using the first row in the data. In this case, if the first row does not contain the full schema, some columns are missing from the result of the copy operation.
Therefore, for schema-free data sources, the best practice is to specify the structure of data using the structure property.
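For example, a sketch of an AzureTable dataset that declares an explicit structure (the dataset name and the column names/types here are illustrative and not part of the samples later in this article):
{
  "name": "AzureTableInputWithStructure",
  "properties": {
    "type": "AzureTable",
    "linkedServiceName": "StorageLinkedService",
    "structure": [
      { "name": "userid", "type": "Int64" },
      { "name": "name", "type": "String" },
      { "name": "lastlogindate", "type": "Datetime" }
    ],
    "typeProperties": {
      "tableName": "MyTable"
    },
    "external": true,
    "availability": {
      "frequency": "Hour",
      "interval": 1
    }
  }
}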
Copy activity properties
For a full list of sections & properties available for defining activities, see the Creating Pipelines article. Properties such as name, description, input and output datasets, and policies are available for all types of activities.
Properties available in the typeProperties section of the activity, on the other hand, vary with each activity type. For the Copy activity, they vary depending on the types of sources and sinks.
AzureTableSource supports the following properties in the typeProperties section:
Property | Description | Allowed values | Required |
---|---|---|---|
azureTableSourceQuery | Use the custom query to read data. | An Azure Table query string. See examples in the next section. | No. When a tableName is specified without an azureTableSourceQuery, all records from the table are copied to the destination. If an azureTableSourceQuery is also specified, records from the table that satisfy the query are copied to the destination. |
azureTableSourceIgnoreTableNotFound | Indicates whether to swallow the exception when the table does not exist. | TRUE, FALSE | No |
azureTableSourceQuery examples
If Azure Table column is of string type:
azureTableSourceQuery": "$$Text.Format('PartitionKey ge \\'{0:yyyyMMddHH00_0000}\\' and PartitionKey le \\'{0:yyyyMMddHH00_9999}\\'', SliceStart)"
If Azure Table column is of datetime type:
"azureTableSourceQuery": "$$Text.Format('DeploymentEndTime gt datetime\\'{0:yyyy-MM-ddTHH:mm:ssZ}\\' and DeploymentEndTime le datetime\\'{1:yyyy-MM-ddTHH:mm:ssZ}\\'', SliceStart, SliceEnd)"
AzureTableSink supports the following properties in the typeProperties section:
Property | Description | Allowed values | Required |
---|---|---|---|
azureTableDefaultPartitionKeyValue | Default partition key value that can be used by the sink. | A string value. | No |
azureTablePartitionKeyName | Specify the name of the column whose values are used as partition keys. If not specified, azureTableDefaultPartitionKeyValue is used as the partition key. | A column name. | No |
azureTableRowKeyName | Specify the name of the column whose values are used as the row key. If not specified, a GUID is used for each row. | A column name. | No |
azureTableInsertType | The mode used to insert data into the Azure table. This property controls whether existing rows in the output table with matching partition and row keys have their values replaced or merged. To learn how these settings (merge and replace) work, see the Insert or Merge Entity and Insert or Replace Entity topics. This setting applies at the row level, not the table level, and neither option deletes rows in the output table that do not exist in the input. | merge (default), replace | No |
writeBatchSize | Inserts data into the Azure table when writeBatchSize or writeBatchTimeout is hit. | Integer (number of rows) | No (default: 10000) |
writeBatchTimeout | Inserts data into the Azure table when writeBatchSize or writeBatchTimeout is hit. | timespan. Example: "00:20:00" (20 minutes) | No (default: the storage client default timeout of 90 seconds) |
azureTablePartitionKeyName
Before you can use a destination column as the azureTablePartitionKeyName, map a source column to the destination column by using the translator JSON property.
In the following example, the source column DivisionID is mapped to the destination column DivisionID.
"translator": {
"type": "TabularTranslator",
"columnMappings": "DivisionID: DivisionID, FirstName: FirstName, LastName: LastName"
}
The DivisionID is specified as the partition key.
"sink": {
"type": "AzureTableSink",
"azureTablePartitionKeyName": "DivisionID",
"writeBatchSize": 100,
"writeBatchTimeout": "01:00:00"
}
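A fuller sink sketch that combines the optional properties from the table above (the column names, such as EmployeeID, are illustrative):
"sink": {
  "type": "AzureTableSink",
  "azureTableInsertType": "replace",
  "azureTablePartitionKeyName": "DivisionID",
  "azureTableRowKeyName": "EmployeeID",
  "writeBatchSize": 100,
  "writeBatchTimeout": "01:00:00"
}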
JSON examples
The following examples provide sample JSON definitions that you can use to create a pipeline by using Visual Studio or Azure PowerShell. They show how to copy data to and from Azure Table Storage and Azure Blob Storage. However, data can be copied directly from any of the supported sources to any of the supported sinks. For more information, see the section "Supported data stores and formats" in Move data by using Copy Activity.
Example: Copy data from Azure Table to Azure Blob
The following sample shows:
- A linked service of type AzureStorage (used for both table & blob).
- An input dataset of type AzureTable.
- An output dataset of type AzureBlob.
- The pipeline with Copy activity that uses AzureTableSource and BlobSink.
The sample copies data belonging to the default partition in an Azure Table to a blob every hour. The JSON properties used in these samples are described in sections following the samples.
Azure storage linked service:
{
"name": "StorageLinkedService",
"properties": {
"type": "AzureStorage",
"typeProperties": {
"connectionString": "DefaultEndpointsProtocol=https;AccountName=<accountname>;AccountKey=<accountkey>"
}
}
}
Azure Data Factory supports two types of Azure Storage linked services: AzureStorage and AzureStorageSas. For the first one, you specify the connection string that includes the account key; for the latter one, you specify the shared access signature (SAS) URI. See the Linked service properties section for details.
Azure Table input dataset:
The sample assumes you have created a table "MyTable" in Azure Table.
Setting "external": "true" informs the Data Factory service that the dataset is external to the data factory and is not produced by an activity in the data factory.
{
"name": "AzureTableInput",
"properties": {
"type": "AzureTable",
"linkedServiceName": "StorageLinkedService",
"typeProperties": {
"tableName": "MyTable"
},
"external": true,
"availability": {
"frequency": "Hour",
"interval": 1
},
"policy": {
"externalData": {
"retryInterval": "00:01:00",
"retryTimeout": "00:10:00",
"maximumRetry": 3
}
}
}
}
Azure Blob output dataset:
Data is written to a new blob every hour (frequency: hour, interval: 1). The folder path for the blob is dynamically evaluated based on the start time of the slice that is being processed. The folder path uses year, month, day, and hours parts of the start time.
{
"name": "AzureBlobOutput",
"properties": {
"type": "AzureBlob",
"linkedServiceName": "StorageLinkedService",
"typeProperties": {
"folderPath": "mycontainer/myfolder/yearno={Year}/monthno={Month}/dayno={Day}/hourno={Hour}",
"partitionedBy": [
{
"name": "Year",
"value": {
"type": "DateTime",
"date": "SliceStart",
"format": "yyyy"
}
},
{
"name": "Month",
"value": {
"type": "DateTime",
"date": "SliceStart",
"format": "MM"
}
},
{
"name": "Day",
"value": {
"type": "DateTime",
"date": "SliceStart",
"format": "dd"
}
},
{
"name": "Hour",
"value": {
"type": "DateTime",
"date": "SliceStart",
"format": "HH"
}
}
],
"format": {
"type": "TextFormat",
"columnDelimiter": "\t",
"rowDelimiter": "\n"
}
},
"availability": {
"frequency": "Hour",
"interval": 1
}
}
}
Copy activity in a pipeline with AzureTableSource and BlobSink:
The pipeline contains a Copy Activity that is configured to use the input and output datasets and is scheduled to run every hour. In the pipeline JSON definition, the source type is set to AzureTableSource and the sink type is set to BlobSink. The query specified with the AzureTableSourceQuery property selects the data from the default partition to copy every hour.
{
"name":"SamplePipeline",
"properties":{
"start":"2014-06-01T18:00:00",
"end":"2014-06-01T19:00:00",
"description":"pipeline for copy activity",
"activities":[
{
"name": "AzureTabletoBlob",
"description": "copy activity",
"type": "Copy",
"inputs": [
{
"name": "AzureTableInput"
}
],
"outputs": [
{
"name": "AzureBlobOutput"
}
],
"typeProperties": {
"source": {
"type": "AzureTableSource",
"AzureTableSourceQuery": "PartitionKey eq 'DefaultPartitionKey'"
},
"sink": {
"type": "BlobSink"
}
},
"scheduler": {
"frequency": "Hour",
"interval": 1
},
"policy": {
"concurrency": 1,
"executionPriorityOrder": "OldestFirst",
"retry": 0,
"timeout": "01:00:00"
}
}
]
}
}
Example: Copy data from Azure Blob to Azure Table
The following sample shows:
- A linked service of type AzureStorage (used for both table & blob)
- An input dataset of type AzureBlob.
- An output dataset of type AzureTable.
- The pipeline with Copy activity that uses BlobSource and AzureTableSink.
The sample copies time-series data from an Azure blob to an Azure table hourly. The JSON properties used in these samples are described in sections following the samples.
Azure storage (for both Azure Table & Blob) linked service:
{
"name": "StorageLinkedService",
"properties": {
"type": "AzureStorage",
"typeProperties": {
"connectionString": "DefaultEndpointsProtocol=https;AccountName=<accountname>;AccountKey=<accountkey>"
}
}
}
Azure Data Factory supports two types of Azure Storage linked services: AzureStorage and AzureStorageSas. For the first one, you specify the connection string that includes the account key; for the latter one, you specify the shared access signature (SAS) URI. See the Linked service properties section for details.
Azure Blob input dataset:
Data is picked up from a new blob every hour (frequency: hour, interval: 1). The folder path and file name for the blob are dynamically evaluated based on the start time of the slice that is being processed. The folder path uses the year, month, and day parts of the start time, and the file name uses the hour part of the start time. The "external": "true" setting informs the Data Factory service that the dataset is external to the data factory and is not produced by an activity in the data factory.
{
"name": "AzureBlobInput",
"properties": {
"type": "AzureBlob",
"linkedServiceName": "StorageLinkedService",
"typeProperties": {
"folderPath": "mycontainer/myfolder/yearno={Year}/monthno={Month}/dayno={Day}",
"fileName": "{Hour}.csv",
"partitionedBy": [
{
"name": "Year",
"value": {
"type": "DateTime",
"date": "SliceStart",
"format": "yyyy"
}
},
{
"name": "Month",
"value": {
"type": "DateTime",
"date": "SliceStart",
"format": "MM"
}
},
{
"name": "Day",
"value": {
"type": "DateTime",
"date": "SliceStart",
"format": "dd"
}
},
{
"name": "Hour",
"value": {
"type": "DateTime",
"date": "SliceStart",
"format": "HH"
}
}
],
"format": {
"type": "TextFormat",
"columnDelimiter": ",",
"rowDelimiter": "\n"
}
},
"external": true,
"availability": {
"frequency": "Hour",
"interval": 1
},
"policy": {
"externalData": {
"retryInterval": "00:01:00",
"retryTimeout": "00:10:00",
"maximumRetry": 3
}
}
}
}
Azure Table output dataset:
The sample copies data to a table named "MyTable" in Azure Table. Create an Azure table with the same number of columns as you expect the Blob CSV file to contain. New rows are added to the table every hour.
{
"name": "AzureTableOutput",
"properties": {
"type": "AzureTable",
"linkedServiceName": "StorageLinkedService",
"typeProperties": {
"tableName": "MyOutputTable"
},
"availability": {
"frequency": "Hour",
"interval": 1
}
}
}
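If MyOutputTable does not already exist in the storage account, one way to create it is with Az PowerShell (a sketch; the storage context is built the same way as in the Tip earlier in this article):
# Create the target table before the first pipeline run
$context = New-AzStorageContext -StorageAccountName <accountName> -StorageAccountKey <accountKey>
New-AzStorageTable -Name MyOutputTable -Context $context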
Copy activity in a pipeline with BlobSource and AzureTableSink:
The pipeline contains a Copy Activity that is configured to use the input and output datasets and is scheduled to run every hour. In the pipeline JSON definition, the source type is set to BlobSource and sink type is set to AzureTableSink.
{
"name":"SamplePipeline",
"properties":{
"start":"2014-06-01T18:00:00",
"end":"2014-06-01T19:00:00",
"description":"pipeline with copy activity",
"activities":[
{
"name": "AzureBlobtoTable",
"description": "Copy Activity",
"type": "Copy",
"inputs": [
{
"name": "AzureBlobInput"
}
],
"outputs": [
{
"name": "AzureTableOutput"
}
],
"typeProperties": {
"source": {
"type": "BlobSource"
},
"sink": {
"type": "AzureTableSink",
"writeBatchSize": 100,
"writeBatchTimeout": "01:00:00"
}
},
"scheduler": {
"frequency": "Hour",
"interval": 1
},
"policy": {
"concurrency": 1,
"executionPriorityOrder": "OldestFirst",
"retry": 0,
"timeout": "01:00:00"
}
}
]
}
}
Type Mapping for Azure Table
As mentioned in the data movement activities article, Copy activity performs automatic type conversions from source types to sink types with the following two-step approach.
- Convert from native source types to .NET types
- Convert from .NET types to native sink types
When moving data to and from Azure Table, the following mappings defined by the Azure Table service are used to convert between Azure Table OData types and .NET types.
OData Data Type | .NET Type | Details |
---|---|---|
Edm.Binary | byte[] | An array of bytes up to 64 KB. |
Edm.Boolean | bool | A Boolean value. |
Edm.DateTime | DateTime | A 64-bit value expressed as Coordinated Universal Time (UTC). The supported DateTime range begins from 12:00 midnight, January 1, 1601 A.D. (C.E.), UTC. The range ends at December 31, 9999. |
Edm.Double | double | A 64-bit floating point value. |
Edm.Guid | Guid | A 128-bit globally unique identifier. |
Edm.Int32 | Int32 | A 32-bit integer. |
Edm.Int64 | Int64 | A 64-bit integer. |
Edm.String | String | A UTF-16-encoded value. String values may be up to 64 KB. |
Type Conversion Sample
The following sample is for copying data from an Azure Blob to Azure Table with type conversions.
Suppose the Blob dataset is in CSV format and contains three columns. One of them is a datetime column with a custom datetime format using abbreviated French names for day of the week.
Define the Blob Source dataset as follows along with type definitions for the columns.
{
"name": " AzureBlobInput",
"properties":
{
"structure":
[
{ "name": "userid", "type": "Int64"},
{ "name": "name", "type": "String"},
{ "name": "lastlogindate", "type": "Datetime", "culture": "fr-fr", "format": "ddd-MM-YYYY"}
],
"type": "AzureBlob",
"linkedServiceName": "StorageLinkedService",
"typeProperties": {
"folderPath": "mycontainer/myfolder",
"fileName":"myfile.csv",
"format":
{
"type": "TextFormat",
"columnDelimiter": ","
}
},
"external": true,
"availability":
{
"frequency": "Hour",
"interval": 1,
},
"policy": {
"externalData": {
"retryInterval": "00:01:00",
"retryTimeout": "00:10:00",
"maximumRetry": 3
}
}
}
}
Given the type mapping from Azure Table OData type to .NET type, you would define the table in Azure Table with the following schema.
Azure Table schema:
Column name | Type |
---|---|
userid | Edm.Int64 |
name | Edm.String |
lastlogindate | Edm.DateTime |
Next, define the Azure Table dataset as follows. You do not need to specify the "structure" section with type information, because the type information is already specified in the underlying data store.
{
"name": "AzureTableOutput",
"properties": {
"type": "AzureTable",
"linkedServiceName": "StorageLinkedService",
"typeProperties": {
"tableName": "MyOutputTable"
},
"availability": {
"frequency": "Hour",
"interval": 1
}
}
}
In this case, Data Factory automatically performs the type conversions, including converting the Datetime field with the custom datetime format using the "fr-fr" culture, when moving data from Blob to Azure Table.
Note
To map columns from source dataset to columns from sink dataset, see Mapping dataset columns in Azure Data Factory.
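For this sample, an explicit mapping (should you choose to add one to the copy activity) could reuse the translator property shown earlier in this article; a sketch assuming the Blob columns and table columns share the same names:
"translator": {
  "type": "TabularTranslator",
  "columnMappings": "userid: userid, name: name, lastlogindate: lastlogindate"
}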
Performance and Tuning
To learn about key factors that impact performance of data movement (Copy Activity) in Azure Data Factory and various ways to optimize it, see Copy Activity Performance & Tuning Guide.