Move data by using Copy Activity
Note
This article applies to version 1 of Data Factory. If you are using the current version of the Data Factory service, see Copy Activity in V2.
Overview
In Azure Data Factory, you can use Copy Activity to copy data between on-premises and cloud data stores. After the data is copied, it can be further transformed and analyzed. You can also use Copy Activity to publish transformation and analysis results for business intelligence (BI) and application consumption.
Copy Activity is powered by a secure, reliable, scalable, and globally available service. This article provides details on data movement in Data Factory and Copy Activity.
First, let's see how data migration occurs between two cloud data stores, and between an on-premises data store and a cloud data store.
Note
To learn about activities in general, see Understanding pipelines and activities.
Copy data between two cloud data stores
When both source and sink data stores are in the cloud, Copy Activity goes through the following stages to copy data from the source to the sink. The service that powers Copy Activity:
- Reads data from the source data store.
- Performs serialization/deserialization, compression/decompression, column mapping, and type conversion. It does these operations based on the configurations of the input dataset, output dataset, and Copy Activity.
- Writes data to the destination data store.
The service automatically chooses the optimal region to perform the data movement. This region is usually the one closest to the sink data store.
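The column mapping and type conversion in the second step are driven entirely by JSON configuration. As a minimal, hedged sketch (the dataset and column names are placeholders, not part of this article's tutorial), an optional `translator` section in the Copy Activity `typeProperties` is where source columns are mapped to sink columns:

```json
{
    "name": "CopyFromBlobToSQLWithMapping",
    "type": "Copy",
    "inputs": [ { "name": "InputBlobTable" } ],
    "outputs": [ { "name": "OutputSQLTable" } ],
    "typeProperties": {
        "source": { "type": "BlobSource" },
        "sink": { "type": "SqlSink" },
        "translator": {
            "type": "TabularTranslator",
            "columnMappings": "UserId: MyUserId, Group: MyGroup, Name: MyName"
        }
    }
}
```

Serialization/deserialization and compression are configured on the input and output datasets instead; see the Supported file formats section later in this article.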
Copy data between an on-premises data store and a cloud data store
To securely move data between an on-premises data store and a cloud data store, install Data Management Gateway on your on-premises machine. Data Management Gateway is an agent that enables hybrid data movement and processing. You can install it on the same machine as the data store itself, or on a separate machine that has access to the data store.
In this scenario, Data Management Gateway performs the serialization/deserialization, compression/decompression, column mapping, and type conversion. Data does not flow through the Azure Data Factory service. Instead, Data Management Gateway directly writes the data to the destination store.
See Move data between on-premises and cloud data stores for an introduction and walkthrough. See Data Management Gateway for detailed information about this agent.
You can also move data from/to supported data stores that are hosted on Azure IaaS virtual machines (VMs) by using Data Management Gateway. In this case, you can install Data Management Gateway on the same VM as the data store itself, or on a separate VM that has access to the data store.
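For reference, the gateway is wired into a copy by name in the linked service that points to the on-premises (or IaaS) store. The following is a minimal sketch of an on-premises SQL Server linked service; the server, database, credentials, and gateway name are placeholders that you replace with your own values:

```json
{
    "name": "OnPremSqlServerLinkedService",
    "properties": {
        "type": "OnPremisesSqlServer",
        "typeProperties": {
            "connectionString": "Data Source=<servername>;Initial Catalog=<databasename>;Integrated Security=False;User ID=<username>;Password=<password>;",
            "gatewayName": "<gateway name>"
        }
    }
}
```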
Supported data stores and formats
Copy Activity in Data Factory copies data from a source data store to a sink data store. Data Factory supports the following data stores. Data from any source can be written to any sink. Click a data store to learn how to copy data to and from that store.
Note
If you need to move data to/from a data store that Copy Activity doesn't support, use a custom activity in Data Factory with your own logic for copying/moving data. For details on creating and using a custom activity, see Use custom activities in an Azure Data Factory pipeline.
| Category | Data store | Supported as a source | Supported as a sink |
| --- | --- | --- | --- |
| Azure | Azure Blob storage | ✓ | ✓ |
| | Azure Cosmos DB for NoSQL | ✓ | ✓ |
| | Azure Data Lake Storage Gen1 | ✓ | ✓ |
| | Azure SQL Database | ✓ | ✓ |
| | Azure Synapse Analytics | ✓ | ✓ |
| | Azure Cognitive Search Index | | ✓ |
| | Azure Table storage | ✓ | ✓ |
| Databases | Amazon Redshift | ✓ | |
| | DB2* | ✓ | |
| | MySQL* | ✓ | |
| | Oracle* | ✓ | ✓ |
| | PostgreSQL* | ✓ | |
| | SAP Business Warehouse* | ✓ | |
| | SAP HANA* | ✓ | |
| | SQL Server* | ✓ | ✓ |
| | Sybase* | ✓ | |
| | Teradata* | ✓ | |
| NoSQL | Cassandra* | ✓ | |
| | MongoDB* | ✓ | |
| File | Amazon S3 | ✓ | |
| | File System* | ✓ | ✓ |
| | FTP | ✓ | |
| | HDFS* | ✓ | |
| | SFTP | ✓ | |
| Others | Generic HTTP | ✓ | |
| | Generic OData | ✓ | |
| | Generic ODBC* | ✓ | |
| | Salesforce | ✓ | |
| | Web Table (table from HTML) | ✓ | |
Note
Data stores with * can be on-premises or on Azure IaaS, and require you to install Data Management Gateway on an on-premises/Azure IaaS machine.
Supported file formats
You can use Copy Activity to copy files as-is between two file-based data stores; in that case, you can skip the format section in both the input and output dataset definitions, and the data is copied efficiently without any serialization or deserialization.
Copy Activity can also read from and write to files in specified formats: Text, JSON, Avro, ORC, and Parquet. The GZip, Deflate, BZip2, and ZipDeflate compression codecs are supported. See Supported file and compression formats for details.
For example, you can do the following copy operations:
- Copy data from a SQL Server database and write it to Azure Data Lake Store in ORC format.
- Copy files in text (CSV) format from an on-premises file system and write them to Azure Blob storage in Avro format.
- Copy zipped files from an on-premises file system, decompress them on the fly, and land the extracted files in Azure Data Lake Store.
- Copy data in GZip-compressed text (CSV) format from Azure Blob storage and write it to Azure SQL Database.
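As a sketch of the last scenario, the input dataset below tells Copy Activity to read GZip-compressed text (CSV) files from Azure Blob storage. The linked service name, folder path, and schedule are placeholders:

```json
{
    "name": "GzipCsvBlobInput",
    "properties": {
        "type": "AzureBlob",
        "linkedServiceName": "AzureStorageLinkedService",
        "typeProperties": {
            "folderPath": "mycontainer/input/",
            "format": {
                "type": "TextFormat",
                "columnDelimiter": ","
            },
            "compression": {
                "type": "GZip",
                "level": "Optimal"
            }
        },
        "external": true,
        "availability": {
            "frequency": "Day",
            "interval": 1
        }
    }
}
```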
Globally available data movement
Azure Data Factory is available only in the West US, East US, and North Europe regions. However, the service that powers Copy Activity is available globally in the following regions and geographies. The globally available topology ensures efficient data movement that usually avoids cross-region hops. See Services by region for availability of Data Factory and Data Movement in a region.
Copy data between cloud data stores
When both source and sink data stores are in the cloud, Data Factory uses a service deployment in the region that is closest to the sink in the same geography to move the data. Refer to the following table for mapping:
| Geography of the destination data store | Region of the destination data store | Region used for data movement |
| --- | --- | --- |
| United States | East US | East US |
| | East US 2 | East US 2 |
| | Central US | Central US |
| | North Central US | North Central US |
| | South Central US | South Central US |
| | West Central US | West Central US |
| | West US | West US |
| | West US 2 | West US 2 |
| Canada | Canada East | Canada Central |
| | Canada Central | Canada Central |
| Brazil | Brazil South | Brazil South |
| Europe | North Europe | North Europe |
| | West Europe | West Europe |
| United Kingdom | UK West | UK South |
| | UK South | UK South |
| Asia Pacific | Southeast Asia | Southeast Asia |
| | East Asia | Southeast Asia |
| Australia | Australia East | Australia East |
| | Australia Southeast | Australia Southeast |
| India | Central India | Central India |
| | West India | Central India |
| | South India | Central India |
| Japan | Japan East | Japan East |
| | Japan West | Japan East |
| Korea | Korea Central | Korea Central |
| | Korea South | Korea Central |
Alternatively, you can explicitly indicate the region of the Data Factory service to be used to perform the copy by specifying the `executionLocation` property under the Copy Activity `typeProperties` section. Supported values for this property are listed in the Region used for data movement column above. Note that your data goes through that region over the wire during the copy. For example, to copy between Azure stores in Korea, you can specify `"executionLocation": "Japan East"` to route the data through the Japan region (see the sample JSON later in this article as a reference).
Note
If the region of the destination data store is not in the preceding list or is undetectable, Copy Activity fails by default instead of going through an alternative region, unless `executionLocation` is specified. The supported region list will be expanded over time.
Copy data between an on-premises data store and a cloud data store
When data is being copied between on-premises (or Azure virtual machines/IaaS) and cloud stores, Data Management Gateway performs data movement on an on-premises machine or virtual machine. The data does not flow through the service in the cloud, unless you use the staged copy capability. In this case, data flows through the staging Azure Blob storage before it is written into the sink data store.
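Staged copy is turned on per activity in the Copy Activity `typeProperties`. A hedged sketch (the source/sink types, staging linked service name, and path shown here are placeholders, not part of this article's tutorial):

```json
"typeProperties": {
    "source": { "type": "SqlSource" },
    "sink": { "type": "SqlDWSink" },
    "enableStaging": true,
    "stagingSettings": {
        "linkedServiceName": "MyStagingBlobStorage",
        "path": "stagingcontainer/path",
        "enableCompression": true
    }
}
```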
Create a pipeline with Copy Activity
You can create a pipeline with Copy Activity in a couple of ways:
By using the Copy Wizard
The Data Factory Copy Wizard helps you to create a pipeline with Copy Activity. This pipeline allows you to copy data from supported sources to destinations without writing JSON definitions for linked services, datasets, and pipelines. See Data Factory Copy Wizard for details about the wizard.
By using JSON scripts
You can use Data Factory Editor in Visual Studio, or Azure PowerShell to create a JSON definition for a pipeline (by using Copy Activity). Then, you can deploy it to create the pipeline in Data Factory. See Tutorial: Use Copy Activity in an Azure Data Factory pipeline for a tutorial with step-by-step instructions.
JSON properties (such as name, description, input and output tables, and policies) are available for all types of activities. Properties that are available in the `typeProperties` section of the activity vary with each activity type. For Copy Activity, the `typeProperties` section varies depending on the types of sources and sinks. Click a source/sink in the Supported data stores and formats section to learn about the type properties that Copy Activity supports for that data store.
Here's a sample JSON definition:
```json
{
    "name": "ADFTutorialPipeline",
    "properties": {
        "description": "Copy data from Azure blob to Azure SQL table",
        "activities": [
            {
                "name": "CopyFromBlobToSQL",
                "type": "Copy",
                "inputs": [
                    {
                        "name": "InputBlobTable"
                    }
                ],
                "outputs": [
                    {
                        "name": "OutputSQLTable"
                    }
                ],
                "typeProperties": {
                    "source": {
                        "type": "BlobSource"
                    },
                    "sink": {
                        "type": "SqlSink"
                    },
                    "executionLocation": "Japan East"
                },
                "policy": {
                    "concurrency": 1,
                    "executionPriorityOrder": "NewestFirst",
                    "retry": 0,
                    "timeout": "01:00:00"
                }
            }
        ],
        "start": "2016-07-12T00:00:00Z",
        "end": "2016-07-13T00:00:00Z"
    }
}
```
The schedule that is defined in the output dataset determines when the activity runs (for example: daily, with frequency set to Day and interval set to 1). The activity copies data from an input dataset (source) to an output dataset (sink).
You can specify more than one input dataset for Copy Activity. The additional datasets are used to verify dependencies before the activity runs, but only the data from the first dataset is copied to the destination dataset. For more information, see Scheduling and execution.
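For instance, an output dataset with the following `availability` section runs the copy once a day (a minimal fragment; the rest of the dataset definition is omitted):

```json
"availability": {
    "frequency": "Day",
    "interval": 1
}
```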
Performance and tuning
See the Copy Activity performance and tuning guide, which describes key factors that affect the performance of data movement (Copy Activity) in Azure Data Factory. It also lists the observed performance during internal testing and discusses various ways to optimize the performance of Copy Activity.
Fault tolerance
By default, Copy Activity stops copying data and returns a failure when it encounters data that is incompatible between source and sink. You can explicitly configure it to skip and log the incompatible rows and copy only the compatible data, so that the copy succeeds. See Copy Activity fault tolerance for more details.
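The skip-and-log behavior is configured in the Copy Activity `typeProperties`, as the fault tolerance article describes. A sketch, assuming the skipped rows are logged to Azure Blob storage through a linked service named AzureStorageLinkedService (the linked service name and path are placeholders):

```json
"typeProperties": {
    "source": { "type": "BlobSource" },
    "sink": { "type": "SqlSink" },
    "enableSkipIncompatibleRow": true,
    "redirectIncompatibleRowSettings": {
        "linkedServiceName": "AzureStorageLinkedService",
        "path": "redirectcontainer/erroroutput"
    }
}
```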
Security considerations
See Security considerations, which describes the security infrastructure that data movement services in Azure Data Factory use to secure your data.
Scheduling and sequential copy
See Scheduling and execution for detailed information about how scheduling and execution works in Data Factory. It is possible to run multiple copy operations one after another in a sequential/ordered manner. See the Copy sequentially section.
Type conversions
Different data stores have different native type systems. Copy Activity performs automatic type conversions from source types to sink types with the following two-step approach:
- Convert from native source types to a .NET type.
- Convert from a .NET type to a native sink type.
The mapping from a native type system to a .NET type for a data store is in the respective data store article. (Click the specific link in the Supported data stores table). You can use these mappings to determine appropriate types while creating your tables, so that Copy Activity performs the right conversions.
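Those declared column types live in the dataset `structure` section. As a short, illustrative sketch (the column names and types are placeholders), a sink dataset might declare:

```json
"structure": [
    { "name": "userid", "type": "Int64" },
    { "name": "name", "type": "String" },
    { "name": "lastlogindate", "type": "Datetime" }
]
```

Copy Activity converts each source value to the corresponding .NET type and then to the sink's native type for that column.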
Next steps
- To learn more about Copy Activity, see Copy data from Azure Blob storage to Azure SQL Database.
- To learn about moving data from an on-premises data store to a cloud data store, see Move data from on-premises to cloud data stores.