Copy data to and from Azure Synapse Analytics using Azure Data Factory
Note
This article applies to version 1 of Data Factory. If you are using the current version of the Data Factory service, see Azure Synapse Analytics connector in V2.
This article explains how to use the Copy Activity in Azure Data Factory to move data to/from Azure Synapse Analytics. It builds on the Data Movement Activities article, which presents a general overview of data movement with the copy activity.
Tip
To achieve best performance, use PolyBase to load data into Azure Synapse Analytics. The Use PolyBase to load data into Azure Synapse Analytics section has details. For a walkthrough with a use case, see Load 1 TB into Azure Synapse Analytics under 15 minutes with Azure Data Factory.
Supported scenarios
You can copy data from Azure Synapse Analytics to the following data stores:
You can copy data from the following data stores to Azure Synapse Analytics:
Tip
When copying data from SQL Server or Azure SQL Database to Azure Synapse Analytics, if the table does not exist in the destination store, Data Factory can automatically create the table in Azure Synapse Analytics by using the schema of the table in the source data store. See Auto table creation for details.
Supported authentication type
The Azure Synapse Analytics connector supports basic authentication.
Getting started
You can create a pipeline with a copy activity that moves data to/from Azure Synapse Analytics by using different tools/APIs.
The easiest way to create a pipeline that copies data to/from Azure Synapse Analytics is to use the Copy data wizard. See Tutorial: Load data into Azure Synapse Analytics with Data Factory for a quick walkthrough on creating a pipeline using the Copy data wizard.
You can also use the following tools to create a pipeline: Visual Studio, Azure PowerShell, Azure Resource Manager template, .NET API, and REST API. See Copy activity tutorial for step-by-step instructions to create a pipeline with a copy activity.
Whether you use the tools or APIs, you perform the following steps to create a pipeline that moves data from a source data store to a sink data store:
- Create a data factory. A data factory may contain one or more pipelines.
- Create linked services to link input and output data stores to your data factory. For example, if you are copying data from an Azure blob storage to an Azure Synapse Analytics, you create two linked services to link your Azure storage account and Azure Synapse Analytics to your data factory. For linked service properties that are specific to Azure Synapse Analytics, see linked service properties section.
- Create datasets to represent input and output data for the copy operation. In the example mentioned in the last step, you create a dataset to specify the blob container and folder that contains the input data. And, you create another dataset to specify the table in the Azure Synapse Analytics that holds the data copied from the blob storage. For dataset properties that are specific to Azure Synapse Analytics, see dataset properties section.
- Create a pipeline with a copy activity that takes a dataset as an input and a dataset as an output. In the example mentioned earlier, you use BlobSource as a source and SqlDWSink as a sink for the copy activity. Similarly, if you are copying from Azure Synapse Analytics to Azure Blob Storage, you use SqlDWSource and BlobSink in the copy activity. For copy activity properties that are specific to Azure Synapse Analytics, see copy activity properties section. For details on how to use a data store as a source or a sink, click the link in the previous section for your data store.
When you use the wizard, JSON definitions for these Data Factory entities (linked services, datasets, and the pipeline) are automatically created for you. When you use tools/APIs (except .NET API), you define these Data Factory entities by using the JSON format. For samples with JSON definitions for Data Factory entities that are used to copy data to/from Azure Synapse Analytics, see JSON examples section of this article.
The following sections provide details about JSON properties that are used to define Data Factory entities specific to Azure Synapse Analytics:
Linked service properties
The following table describes the JSON elements specific to the Azure Synapse Analytics linked service.
Property | Description | Required |
---|---|---|
type | The type property must be set to: AzureSqlDW | Yes |
connectionString | Specify information needed to connect to the Azure Synapse Analytics instance for the connectionString property. Only basic authentication is supported. | Yes |
Important
Configure the Azure SQL Database firewall and the server to allow Azure services to access the server. Additionally, if you are copying data to Azure Synapse Analytics from outside Azure, including from on-premises data sources with the Data Factory gateway, configure an appropriate IP address range for the machine that sends data to Azure Synapse Analytics.
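A linked service definition that uses these properties looks like the following example (the connection string values are placeholders); the same definition also appears in the JSON examples later in this article:
{
    "name": "AzureSqlDWLinkedService",
    "properties": {
        "type": "AzureSqlDW",
        "typeProperties": {
            "connectionString": "Server=tcp:<servername>.database.windows.net,1433;Database=<databasename>;User ID=<username>@<servername>;Password=<password>;Trusted_Connection=False;Encrypt=True;Connection Timeout=30"
        }
    }
}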
Dataset properties
For a full list of sections & properties available for defining datasets, see the Creating datasets article. Sections such as structure, availability, and policy of a dataset JSON are similar for all dataset types (Azure SQL, Azure blob, Azure table, etc.).
The typeProperties section is different for each type of dataset and provides information about the location of the data in the data store. The typeProperties section for the dataset of type AzureSqlDWTable has the following properties:
Property | Description | Required |
---|---|---|
tableName | Name of the table or view in the Azure Synapse Analytics database that the linked service refers to. | Yes |
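For example, a dataset definition that points to a table named MyTable (a placeholder) uses the tableName property as follows; the availability and policy sections are shown in full in the JSON examples later in this article:
{
    "name": "AzureSqlDWInput",
    "properties": {
        "type": "AzureSqlDWTable",
        "linkedServiceName": "AzureSqlDWLinkedService",
        "typeProperties": {
            "tableName": "MyTable"
        },
        "availability": {
            "frequency": "Hour",
            "interval": 1
        }
    }
}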
Copy activity properties
For a full list of sections & properties available for defining activities, see the Creating Pipelines article. Properties such as name, description, input and output tables, and policy are available for all types of activities.
Note
The Copy Activity takes only one input and produces only one output.
In contrast, the properties available in the typeProperties section of the activity vary with each activity type. For the Copy activity, they vary depending on the types of sources and sinks.
SqlDWSource
When the source is of type SqlDWSource, the following properties are available in the typeProperties section:
Property | Description | Allowed values | Required |
---|---|---|---|
sqlReaderQuery | Use the custom query to read data. | SQL query string. For example: select * from MyTable. | No |
sqlReaderStoredProcedureName | Name of the stored procedure that reads data from the source table. | Name of the stored procedure. The last SQL statement must be a SELECT statement in the stored procedure. | No |
storedProcedureParameters | Parameters for the stored procedure. | Name/value pairs. Names and casing of parameters must match the names and casing of the stored procedure parameters. | No |
If the sqlReaderQuery is specified for the SqlDWSource, the Copy Activity runs this query against the Azure Synapse Analytics source to get the data.
Alternatively, you can specify a stored procedure by specifying the sqlReaderStoredProcedureName and storedProcedureParameters (if the stored procedure takes parameters).
If you do not specify either sqlReaderQuery or sqlReaderStoredProcedureName, the columns defined in the structure section of the dataset JSON are used to build a query to run against Azure Synapse Analytics (for example: select column1, column2 from mytable). If the dataset definition does not have the structure, all columns are selected from the table.
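For example, a dataset whose structure section is defined as in the following sketch (the column names are illustrative) causes the copy activity to generate a query such as select column1, column2 from MyTable when neither sqlReaderQuery nor sqlReaderStoredProcedureName is specified:
"structure": [
    { "name": "column1", "type": "String" },
    { "name": "column2", "type": "Int32" }
],
"typeProperties": {
    "tableName": "MyTable"
}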
SqlDWSource example
"source": {
"type": "SqlDWSource",
"sqlReaderStoredProcedureName": "CopyTestSrcStoredProcedureWithParameters",
"storedProcedureParameters": {
"stringData": { "value": "str3" },
"identifier": { "value": "$$Text.Format('{0:yyyy}', SliceStart)", "type": "Int"}
}
}
The stored procedure definition:
CREATE PROCEDURE CopyTestSrcStoredProcedureWithParameters
(
@stringData varchar(20),
@identifier int
)
AS
SET NOCOUNT ON;
BEGIN
select *
from dbo.UnitTestSrcTable
where dbo.UnitTestSrcTable.stringData != @stringData
and dbo.UnitTestSrcTable.identifier != @identifier
END
GO
SqlDWSink
SqlDWSink supports the following properties:
Property | Description | Allowed values | Required |
---|---|---|---|
sqlWriterCleanupScript | Specify a query for Copy Activity to execute such that data of a specific slice is cleaned up. For details, see repeatability section. | A query statement. | No |
allowPolyBase | Indicates whether to use PolyBase (when applicable) instead of the BULKINSERT mechanism. Using PolyBase is the recommended way to load data into Azure Synapse Analytics. See the Use PolyBase to load data into Azure Synapse Analytics section for constraints and details. | True, False (default) | No |
polyBaseSettings | A group of properties that can be specified when the allowPolyBase property is set to true. | | No |
rejectValue | Specifies the number or percentage of rows that can be rejected before the query fails. Learn more about PolyBase's reject options in the Arguments section of the CREATE EXTERNAL TABLE (Transact-SQL) topic. | 0 (default), 1, 2, … | No |
rejectType | Specifies whether the rejectValue option is specified as a literal value or a percentage. | Value (default), Percentage | No |
rejectSampleValue | Determines the number of rows to retrieve before PolyBase recalculates the percentage of rejected rows. | 1, 2, … | Yes, if rejectType is Percentage |
useTypeDefault | Specifies how to handle missing values in delimited text files when PolyBase retrieves data from the text file. Learn more about this property from the Arguments section in CREATE EXTERNAL FILE FORMAT (Transact-SQL). | True, False (default) | No |
writeBatchSize | Inserts data into the SQL table when the buffer size reaches writeBatchSize. | Integer (number of rows) | No (default: 10000) |
writeBatchTimeout | Wait time for the batch insert operation to complete before it times out. | timespan. Example: "00:30:00" (30 minutes). | No |
SqlDWSink example
"sink": {
"type": "SqlDWSink",
"allowPolyBase": true
}
Use PolyBase to load data into Azure Synapse Analytics
Using PolyBase is an efficient way of loading a large amount of data into Azure Synapse Analytics with high throughput. You can see a large gain in throughput by using PolyBase instead of the default BULKINSERT mechanism. See the copy performance reference numbers for a detailed comparison. For a walkthrough with a use case, see Load 1 TB into Azure Synapse Analytics under 15 minutes with Azure Data Factory.
- If your source data is in Azure Blob or Azure Data Lake Store, and the format is compatible with PolyBase, you can directly copy to Azure Synapse Analytics using PolyBase. See Direct copy using PolyBase with details.
- If your source data store and format are not originally supported by PolyBase, you can use the Staged Copy using PolyBase feature instead. It also provides better throughput by automatically converting the data into a PolyBase-compatible format, storing the data in Azure Blob storage, and then loading it into Azure Synapse Analytics.
Set the allowPolyBase property to true, as shown in the following example, for Azure Data Factory to use PolyBase to copy data into Azure Synapse Analytics. When you set allowPolyBase to true, you can specify PolyBase-specific properties by using the polyBaseSettings property group. See the SqlDWSink section for details about the properties that you can use with polyBaseSettings.
"sink": {
"type": "SqlDWSink",
"allowPolyBase": true,
"polyBaseSettings":
{
"rejectType": "percentage",
"rejectValue": 10.0,
"rejectSampleValue": 100,
"useTypeDefault": true
}
}
Direct copy using PolyBase
Azure Synapse Analytics PolyBase directly supports Azure Blob and Azure Data Lake Store (using service principal authentication) as sources, with specific file format requirements. If your source data meets the criteria described in this section, you can copy directly from the source data store to Azure Synapse Analytics by using PolyBase. Otherwise, you can use Staged Copy using PolyBase.
Tip
To copy data from Data Lake Store to Azure Synapse Analytics efficiently, learn more from Azure Data Factory makes it even easier and convenient to uncover insights from data when using Data Lake Store with Azure Synapse Analytics.
If the requirements are not met, Azure Data Factory checks the settings and automatically falls back to the BULKINSERT mechanism for the data movement.
- The source linked service is of type AzureStorage or AzureDataLakeStore with service principal authentication.
- The input dataset is of type AzureBlob or AzureDataLakeStore, and the format type under type properties is OrcFormat, ParquetFormat, or TextFormat with the following configurations:
  - rowDelimiter must be \n.
  - nullValue is set to empty string (""), or treatEmptyAsNull is set to true.
  - encodingName is set to utf-8, which is the default value.
  - escapeChar, quoteChar, firstRowAsHeader, and skipLineCount are not specified.
  - compression can be no compression, GZip, or Deflate.

  "typeProperties": {
      "folderPath": "<blobpath>",
      "format": {
          "type": "TextFormat",
          "columnDelimiter": "<any delimiter>",
          "rowDelimiter": "\n",
          "nullValue": "",
          "encodingName": "utf-8"
      },
      "compression": {
          "type": "GZip",
          "level": "Optimal"
      }
  },

- There is no skipHeaderLineCount setting under BlobSource or AzureDataLakeStore for the Copy activity in the pipeline.
- There is no sliceIdentifierColumnName setting under SqlDWSink for the Copy activity in the pipeline. (PolyBase guarantees that all data is updated or nothing is updated in a single run. To achieve repeatability, you could use sqlWriterCleanupScript.)
- There is no columnMapping being used in the associated Copy activity.
Staged Copy using PolyBase
When your source data doesn't meet the criteria introduced in the previous section, you can enable copying data via an interim staging Azure Blob Storage (it cannot be Premium Storage). In this case, Azure Data Factory automatically transforms the data to meet the data format requirements of PolyBase, then uses PolyBase to load the data into Azure Synapse Analytics, and finally cleans up your temporary data from the Blob storage. See Staged Copy for details on how copying data via a staging Azure Blob works in general.
Note
When copying data from an on-premises data store into Azure Synapse Analytics using PolyBase and staging, if your Data Management Gateway version is below 2.4, JRE (Java Runtime Environment) is required on the gateway machine, which is used to transform your source data into the proper format. We suggest that you upgrade your gateway to the latest version to avoid this dependency.
To use this feature, create an Azure Storage linked service that refers to the Azure Storage account that has the interim blob storage, and then specify the enableStaging and stagingSettings properties for the Copy Activity as shown in the following code:
"activities":[
{
"name": "Sample copy activity from SQL Server to Azure Synapse Analytics via PolyBase",
"type": "Copy",
"inputs": [{ "name": "OnpremisesSQLServerInput" }],
"outputs": [{ "name": "AzureSQLDWOutput" }],
"typeProperties": {
"source": {
"type": "SqlSource",
},
"sink": {
"type": "SqlDwSink",
"allowPolyBase": true
},
"enableStaging": true,
"stagingSettings": {
"linkedServiceName": "MyStagingBlob"
}
}
}
]
Best practices when using PolyBase
The following sections provide best practices in addition to those mentioned in Best practices for Azure Synapse Analytics.
Required database permission
To use PolyBase, the user that is used to load data into Azure Synapse Analytics must have the "CONTROL" permission on the target database. One way to achieve that is to add the user as a member of the db_owner role. Learn how to do that by following this section.
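For example, a minimal T-SQL sketch that adds a hypothetical load user (LoaderUser, which must already exist in the target database) to the db_owner role:
-- Run against the target Azure Synapse Analytics database.
-- 'LoaderUser' is a placeholder for the user referenced in the linked service connection string.
EXEC sp_addrolemember 'db_owner', 'LoaderUser';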
Row size and data type limitation
PolyBase loads are limited to rows smaller than 1 MB, and PolyBase cannot load to VARCHAR(MAX), NVARCHAR(MAX), or VARBINARY(MAX) columns. For more information, see the Azure Synapse Analytics service capacity limits.
If you have source data with rows of size greater than 1 MB, you may want to split the source tables vertically into several small ones where the largest row size of each of them does not exceed the limit. The smaller tables can then be loaded using PolyBase and merged together in Azure Synapse Analytics.
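As a hedged sketch, assuming a wide source table was split and loaded into two hypothetical staging tables (dbo.Stage_Part1 and dbo.Stage_Part2) that share a key column, the pieces can be merged in Azure Synapse Analytics with a CTAS statement:
-- Hypothetical example: merge two vertically split staging tables, loaded via PolyBase, back into one table.
CREATE TABLE dbo.MyMergedTable
WITH (DISTRIBUTION = ROUND_ROBIN)
AS
SELECT p1.Id, p1.ColA, p1.ColB, p2.LargeTextCol
FROM dbo.Stage_Part1 AS p1
JOIN dbo.Stage_Part2 AS p2
    ON p1.Id = p2.Id;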
Azure Synapse Analytics resource class
To achieve the best possible throughput, consider assigning a larger resource class to the user that is used to load data into Azure Synapse Analytics via PolyBase. Learn how to do that by following the Change a user resource class example.
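For example, a sketch that assigns a hypothetical load user to the xlargerc resource class role (LoaderUser is a placeholder; pick the resource class appropriate for your workload):
-- Assign the load user to a larger resource class; run against the target database.
EXEC sp_addrolemember 'xlargerc', 'LoaderUser';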
tableName in Azure Synapse Analytics
The following table provides examples on how to specify the tableName property in dataset JSON for various combinations of schema and table name.
DB Schema | Table name | tableName JSON property |
---|---|---|
dbo | MyTable | MyTable or dbo.MyTable or [dbo].[MyTable] |
dbo1 | MyTable | dbo1.MyTable or [dbo1].[MyTable] |
dbo | My.Table | [My.Table] or [dbo].[My.Table] |
dbo1 | My.Table | [dbo1].[My.Table] |
If you see the following error, it could be an issue with the value you specified for the tableName property. See the table for the correct way to specify values for the tableName JSON property.
Type=System.Data.SqlClient.SqlException,Message=Invalid object name 'stg.Account_test'.,Source=.Net SqlClient Data Provider
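One common fix, assuming the table actually lives in the stg schema, is to bracket the schema and table name separately in the dataset definition, as in this sketch:
"typeProperties": {
    "tableName": "[stg].[Account_test]"
}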
Columns with default values
Currently, the PolyBase feature in Data Factory accepts only the same number of columns as in the target table. Say you have a table with four columns, and one of them is defined with a default value. The input data should still contain four columns. Providing a 3-column input dataset yields an error similar to the following message:
All columns of the table must be specified in the INSERT BULK statement.
NULL value is a special form of default value. If the column is nullable, the input data (in the blob) for that column can be empty, but it cannot be missing from the input dataset. PolyBase inserts NULL for those values in Azure Synapse Analytics.
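As a hedged illustration (the table and input rows are hypothetical), consider a four-column target table whose last column is nullable:
-- Hypothetical target table: four columns, ModifiedDate is nullable.
CREATE TABLE dbo.ProductInventory
(
    ID int NOT NULL,
    Product varchar(50) NOT NULL,
    Quantity int NOT NULL,
    ModifiedDate datetime NULL
);
-- Each row in the input blob must still carry all four delimited fields;
-- an empty last field (for example "6,Flat Washer,3,") loads as NULL.
-- An input dataset that exposes only three columns produces the INSERT BULK error shown above.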
Auto table creation
If you are using Copy Wizard to copy data from SQL Server or Azure SQL Database to Azure Synapse Analytics and the table that corresponds to the source table does not exist in the destination store, Data Factory can automatically create the table in the data warehouse by using the source table schema.
Data Factory creates the table in the destination store with the same table name as in the source data store. The data types for columns are chosen based on the following type mapping. If needed, Data Factory performs type conversions to fix any incompatibilities between the source and destination stores. It also uses Round Robin table distribution.
Source SQL Database column type | Destination Azure Synapse Analytics column type (size limitation) |
---|---|
Int | Int |
BigInt | BigInt |
SmallInt | SmallInt |
TinyInt | TinyInt |
Bit | Bit |
Decimal | Decimal |
Numeric | Decimal |
Float | Float |
Money | Money |
Real | Real |
SmallMoney | SmallMoney |
Binary | Binary |
Varbinary | Varbinary (up to 8000) |
Date | Date |
DateTime | DateTime |
DateTime2 | DateTime2 |
Time | Time |
DateTimeOffset | DateTimeOffset |
SmallDateTime | SmallDateTime |
Text | Varchar (up to 8000) |
NText | NVarChar (up to 4000) |
Image | VarBinary (up to 8000) |
UniqueIdentifier | UniqueIdentifier |
Char | Char |
NChar | NChar |
VarChar | VarChar (up to 8000) |
NVarChar | NVarChar (up to 4000) |
Xml | Varchar (up to 8000) |
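For example, under this mapping a source table with Int, NVarChar, and DateTime columns results in an auto-created destination table roughly equivalent to the following sketch (the names are illustrative; the exact DDL is generated by Data Factory):
-- Illustrative approximation of a table auto-created by Data Factory.
-- Column types follow the mapping above; the distribution is Round Robin.
CREATE TABLE dbo.MyTable
(
    Id int,
    Name nvarchar(200),
    ModifiedDate datetime
)
WITH (DISTRIBUTION = ROUND_ROBIN);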
Repeatability during Copy
When copying data to Azure SQL/SQL Server from other data stores, you need to keep repeatability in mind to avoid unintended outcomes.
When copying data to an Azure SQL/SQL Server database, the copy activity APPENDs the data set to the sink table by default. For example, when copying data from a CSV (comma-separated values) file source containing two records to an Azure SQL/SQL Server database, the table looks like the following:
ID Product Quantity ModifiedDate
... ... ... ...
6 Flat Washer 3 2015-05-01 00:00:00
7 Down Tube 2 2015-05-01 00:00:00
Suppose you found errors in the source file and updated the quantity of Down Tube from 2 to 4 in the source file. If you re-run the data slice for that period, you find two new records appended to the Azure SQL/SQL Server database. The following assumes that none of the columns in the table have a primary key constraint.
ID Product Quantity ModifiedDate
... ... ... ...
6 Flat Washer 3 2015-05-01 00:00:00
7 Down Tube 2 2015-05-01 00:00:00
6 Flat Washer 3 2015-05-01 00:00:00
7 Down Tube 4 2015-05-01 00:00:00
To avoid this, you need to specify UPSERT semantics by using one of the following two mechanisms.
Note
A slice can be re-run automatically in Azure Data Factory as per the retry policy specified.
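For reference, the retry behavior is controlled by the activity's policy section; the following sketch (the values are illustrative) enables three retries:
"policy": {
    "concurrency": 1,
    "executionPriorityOrder": "OldestFirst",
    "retry": 3,
    "timeout": "01:00:00"
}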
Mechanism 1
You can use the sqlWriterCleanupScript property to first perform a cleanup action when a slice is run.
"sink":
{
"type": "SqlSink",
"sqlWriterCleanupScript": "$$Text.Format('DELETE FROM table WHERE ModifiedDate >= \\'{0:yyyy-MM-dd HH:mm}\\' AND ModifiedDate < \\'{1:yyyy-MM-dd HH:mm}\\'', WindowStart, WindowEnd)"
}
The cleanup script is executed first during copy for a given slice, which deletes from the SQL table the data corresponding to that slice. The activity subsequently inserts the data into the SQL table.
If the slice is now re-run, then you will find the quantity is updated as desired.
ID Product Quantity ModifiedDate
... ... ... ...
6 Flat Washer 3 2015-05-01 00:00:00
7 Down Tube 4 2015-05-01 00:00:00
Suppose the Flat Washer record is removed from the original CSV. Re-running the slice then produces the following result:
ID Product Quantity ModifiedDate
... ... ... ...
7 Down Tube 4 2015-05-01 00:00:00
Nothing new had to be done. The copy activity ran the cleanup script to delete the corresponding data for that slice. Then it read the input from the csv (which then contained only 1 record) and inserted it into the Table.
Mechanism 2
Important
sliceIdentifierColumnName is not supported for Azure Synapse Analytics at this time.
Another mechanism to achieve repeatability is to have a dedicated column (sliceIdentifierColumnName) in the target table. Azure Data Factory uses this column to ensure that the source and destination stay synchronized. This approach works when there is flexibility in changing or defining the destination SQL table schema.
Azure Data Factory uses this column purely for repeatability purposes, and in the process it does not make any schema changes to the table. To use this approach:
Define a column of type binary(32) in the destination SQL table. There should be no constraints on this column. Let's name this column 'ColumnForADFuseOnly' for this example.
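For example, a minimal T-SQL sketch that adds such a column to a hypothetical destination table named dbo.MyTable:
-- Reserved for Azure Data Factory; no constraints, no default value.
ALTER TABLE dbo.MyTable
ADD ColumnForADFuseOnly binary(32) NULL;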
Use it in the copy activity as follows:
"sink": { "type": "SqlSink", "sliceIdentifierColumnName": "ColumnForADFuseOnly" }
Azure Data Factory will populate this column as per its need to ensure the source and destination stay synchronized. The values of this column should not be used outside of this context by the user.
Similar to mechanism 1, the Copy Activity automatically cleans up the data for the given slice from the destination SQL table first, and then runs the copy normally to insert the data from source to destination for that slice.
Type mapping for Azure Synapse Analytics
As mentioned in the data movement activities article, the Copy activity performs automatic type conversions from source types to sink types with the following two-step approach:
- Convert from native source types to .NET type
- Convert from .NET type to native sink type
When moving data to & from Azure Synapse Analytics, the following mappings are used from SQL type to .NET type and vice versa.
The mapping is same as the SQL Server Data Type Mapping for ADO.NET.
SQL Server Database Engine type | .NET Framework type |
---|---|
bigint | Int64 |
binary | Byte[] |
bit | Boolean |
char | String, Char[] |
date | DateTime |
Datetime | DateTime |
datetime2 | DateTime |
Datetimeoffset | DateTimeOffset |
Decimal | Decimal |
FILESTREAM attribute (varbinary(max)) | Byte[] |
Float | Double |
image | Byte[] |
int | Int32 |
money | Decimal |
nchar | String, Char[] |
ntext | String, Char[] |
numeric | Decimal |
nvarchar | String, Char[] |
real | Single |
rowversion | Byte[] |
smalldatetime | DateTime |
smallint | Int16 |
smallmoney | Decimal |
sql_variant | Object * |
text | String, Char[] |
time | TimeSpan |
timestamp | Byte[] |
tinyint | Byte |
uniqueidentifier | Guid |
varbinary | Byte[] |
varchar | String, Char[] |
xml | Xml |
You can also map columns from source dataset to columns from sink dataset in the copy activity definition. For details, see Mapping dataset columns in Azure Data Factory.
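For example, a hedged sketch of a copy activity typeProperties section that maps source columns to differently named sink columns (the column names are placeholders); see Mapping dataset columns in Azure Data Factory for the authoritative syntax:
"typeProperties": {
    "source": { "type": "BlobSource" },
    "sink": { "type": "SqlDWSink" },
    "translator": {
        "type": "TabularTranslator",
        "columnMappings": "UserId: MyUserId, Group: MyGroup, Name: MyName"
    }
}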
JSON examples for copying data to and from Azure Synapse Analytics
The following examples provide sample JSON definitions that you can use to create a pipeline by using Visual Studio or Azure PowerShell. They show how to copy data to and from Azure Synapse Analytics and Azure Blob Storage. However, data can be copied directly from any of the supported sources to any of the supported sinks by using the Copy Activity in Azure Data Factory.
Example: Copy data from Azure Synapse Analytics to Azure Blob
The sample defines the following Data Factory entities:
- A linked service of type AzureSqlDW.
- A linked service of type AzureStorage.
- An input dataset of type AzureSqlDWTable.
- An output dataset of type AzureBlob.
- A pipeline with Copy Activity that uses SqlDWSource and BlobSink.
The sample copies time-series (hourly, daily, etc.) data from a table in an Azure Synapse Analytics database to a blob every hour. The JSON properties used in these samples are described in sections following the samples.
Azure Synapse Analytics linked service:
{
"name": "AzureSqlDWLinkedService",
"properties": {
"type": "AzureSqlDW",
"typeProperties": {
"connectionString": "Server=tcp:<servername>.database.windows.net,1433;Database=<databasename>;User ID=<username>@<servername>;Password=<password>;Trusted_Connection=False;Encrypt=True;Connection Timeout=30"
}
}
}
Azure Blob storage linked service:
{
"name": "StorageLinkedService",
"properties": {
"type": "AzureStorage",
"typeProperties": {
"connectionString": "DefaultEndpointsProtocol=https;AccountName=<accountname>;AccountKey=<accountkey>"
}
}
}
Azure Synapse Analytics input dataset:
The sample assumes you have created a table "MyTable" in Azure Synapse Analytics and it contains a column called "timestampcolumn" for time series data.
Setting "external": "true" informs the Data Factory service that the dataset is external to the data factory and is not produced by an activity in the data factory.
{
"name": "AzureSqlDWInput",
"properties": {
"type": "AzureSqlDWTable",
"linkedServiceName": "AzureSqlDWLinkedService",
"typeProperties": {
"tableName": "MyTable"
},
"external": true,
"availability": {
"frequency": "Hour",
"interval": 1
},
"policy": {
"externalData": {
"retryInterval": "00:01:00",
"retryTimeout": "00:10:00",
"maximumRetry": 3
}
}
}
}
Azure Blob output dataset:
Data is written to a new blob every hour (frequency: hour, interval: 1). The folder path for the blob is dynamically evaluated based on the start time of the slice that is being processed. The folder path uses the year, month, day, and hour parts of the start time.
{
"name": "AzureBlobOutput",
"properties": {
"type": "AzureBlob",
"linkedServiceName": "StorageLinkedService",
"typeProperties": {
"folderPath": "mycontainer/myfolder/yearno={Year}/monthno={Month}/dayno={Day}/hourno={Hour}",
"partitionedBy": [
{
"name": "Year",
"value": {
"type": "DateTime",
"date": "SliceStart",
"format": "yyyy"
}
},
{
"name": "Month",
"value": {
"type": "DateTime",
"date": "SliceStart",
"format": "MM"
}
},
{
"name": "Day",
"value": {
"type": "DateTime",
"date": "SliceStart",
"format": "dd"
}
},
{
"name": "Hour",
"value": {
"type": "DateTime",
"date": "SliceStart",
"format": "HH"
}
}
],
"format": {
"type": "TextFormat",
"columnDelimiter": "\t",
"rowDelimiter": "\n"
}
},
"availability": {
"frequency": "Hour",
"interval": 1
}
}
}
Copy activity in a pipeline with SqlDWSource and BlobSink:
The pipeline contains a Copy Activity that is configured to use the input and output datasets and is scheduled to run every hour. In the pipeline JSON definition, the source type is set to SqlDWSource and sink type is set to BlobSink. The SQL query specified for the SqlReaderQuery property selects the data in the past hour to copy.
{
"name":"SamplePipeline",
"properties":{
"start":"2014-06-01T18:00:00",
"end":"2014-06-01T19:00:00",
"description":"pipeline for copy activity",
"activities":[
{
"name": "AzureSQLDWtoBlob",
"description": "copy activity",
"type": "Copy",
"inputs": [
{
"name": "AzureSqlDWInput"
}
],
"outputs": [
{
"name": "AzureBlobOutput"
}
],
"typeProperties": {
"source": {
"type": "SqlDWSource",
"sqlReaderQuery": "$$Text.Format('select * from MyTable where timestampcolumn >= \\'{0:yyyy-MM-dd HH:mm}\\' AND timestampcolumn < \\'{1:yyyy-MM-dd HH:mm}\\'', WindowStart, WindowEnd)"
},
"sink": {
"type": "BlobSink"
}
},
"scheduler": {
"frequency": "Hour",
"interval": 1
},
"policy": {
"concurrency": 1,
"executionPriorityOrder": "OldestFirst",
"retry": 0,
"timeout": "01:00:00"
}
}
]
}
}
Note
In the example, sqlReaderQuery is specified for the SqlDWSource. The Copy Activity runs this query against the Azure Synapse Analytics source to get the data.
Alternatively, you can specify a stored procedure by specifying the sqlReaderStoredProcedureName and storedProcedureParameters (if the stored procedure takes parameters).
If you do not specify either sqlReaderQuery or sqlReaderStoredProcedureName, the columns defined in the structure section of the dataset JSON are used to build a query (select column1, column2 from mytable) to run against Azure Synapse Analytics. If the dataset definition does not have the structure, all columns are selected from the table.
Example: Copy data from Azure Blob to Azure Synapse Analytics
The sample defines the following Data Factory entities:
- A linked service of type AzureSqlDW.
- A linked service of type AzureStorage.
- An input dataset of type AzureBlob.
- An output dataset of type AzureSqlDWTable.
- A pipeline with Copy activity that uses BlobSource and SqlDWSink.
The sample copies time-series data (hourly, daily, etc.) from Azure blob to a table in an Azure Synapse Analytics database every hour. The JSON properties used in these samples are described in sections following the samples.
Azure Synapse Analytics linked service:
{
"name": "AzureSqlDWLinkedService",
"properties": {
"type": "AzureSqlDW",
"typeProperties": {
"connectionString": "Server=tcp:<servername>.database.windows.net,1433;Database=<databasename>;User ID=<username>@<servername>;Password=<password>;Trusted_Connection=False;Encrypt=True;Connection Timeout=30"
}
}
}
Azure Blob storage linked service:
{
"name": "StorageLinkedService",
"properties": {
"type": "AzureStorage",
"typeProperties": {
"connectionString": "DefaultEndpointsProtocol=https;AccountName=<accountname>;AccountKey=<accountkey>"
}
}
}
Azure Blob input dataset:
Data is picked up from a new blob every hour (frequency: hour, interval: 1). The folder path and file name for the blob are dynamically evaluated based on the start time of the slice that is being processed. The folder path uses the year, month, and day parts of the start time, and the file name uses the hour part of the start time. The "external": "true" setting informs the Data Factory service that this table is external to the data factory and is not produced by an activity in the data factory.
{
"name": "AzureBlobInput",
"properties": {
"type": "AzureBlob",
"linkedServiceName": "StorageLinkedService",
"typeProperties": {
"folderPath": "mycontainer/myfolder/yearno={Year}/monthno={Month}/dayno={Day}",
"fileName": "{Hour}.csv",
"partitionedBy": [
{
"name": "Year",
"value": {
"type": "DateTime",
"date": "SliceStart",
"format": "yyyy"
}
},
{
"name": "Month",
"value": {
"type": "DateTime",
"date": "SliceStart",
"format": "MM"
}
},
{
"name": "Day",
"value": {
"type": "DateTime",
"date": "SliceStart",
"format": "dd"
}
},
{
"name": "Hour",
"value": {
"type": "DateTime",
"date": "SliceStart",
"format": "HH"
}
}
],
"format": {
"type": "TextFormat",
"columnDelimiter": ",",
"rowDelimiter": "\n"
}
},
"external": true,
"availability": {
"frequency": "Hour",
"interval": 1
},
"policy": {
"externalData": {
"retryInterval": "00:01:00",
"retryTimeout": "00:10:00",
"maximumRetry": 3
}
}
}
}
Azure Synapse Analytics output dataset:
The sample copies data to a table named "MyTable" in Azure Synapse Analytics. Create the table in Azure Synapse Analytics with the same number of columns as you expect the Blob CSV file to contain. New rows are added to the table every hour.
{
"name": "AzureSqlDWOutput",
"properties": {
"type": "AzureSqlDWTable",
"linkedServiceName": "AzureSqlDWLinkedService",
"typeProperties": {
"tableName": "MyOutputTable"
},
"availability": {
"frequency": "Hour",
"interval": 1
}
}
}
Copy activity in a pipeline with BlobSource and SqlDWSink:
The pipeline contains a Copy Activity that is configured to use the input and output datasets and is scheduled to run every hour. In the pipeline JSON definition, the source type is set to BlobSource and sink type is set to SqlDWSink.
{
"name":"SamplePipeline",
"properties":{
"start":"2014-06-01T18:00:00",
"end":"2014-06-01T19:00:00",
"description":"pipeline with copy activity",
"activities":[
{
"name": "AzureBlobtoSQLDW",
"description": "Copy Activity",
"type": "Copy",
"inputs": [
{
"name": "AzureBlobInput"
}
],
"outputs": [
{
"name": "AzureSqlDWOutput"
}
],
"typeProperties": {
"source": {
"type": "BlobSource",
"blobColumnSeparators": ","
},
"sink": {
"type": "SqlDWSink",
"allowPolyBase": true
}
},
"scheduler": {
"frequency": "Hour",
"interval": 1
},
"policy": {
"concurrency": 1,
"executionPriorityOrder": "OldestFirst",
"retry": 0,
"timeout": "01:00:00"
}
}
]
}
}
For a walkthrough, see Load 1 TB into Azure Synapse Analytics under 15 minutes with Azure Data Factory and the Load data with Azure Data Factory article in the Azure Synapse Analytics documentation.
Performance and Tuning
See Copy Activity Performance & Tuning Guide to learn about key factors that impact performance of data movement (Copy Activity) in Azure Data Factory and various ways to optimize it.