Move data from an SFTP server using Azure Data Factory
Note
This article applies to version 1 of Data Factory. If you are using the current version of the Data Factory service, see SFTP connector in V2.
This article outlines how to use the Copy Activity in Azure Data Factory to move data from an on-premises/cloud SFTP server to a supported sink data store. This article builds on the data movement activities article that presents a general overview of data movement with copy activity and the list of data stores supported as sources/sinks.
Data Factory currently supports only moving data from an SFTP server to other data stores; it does not support moving data from other data stores to an SFTP server. It supports both on-premises and cloud SFTP servers.
Note
Copy Activity does not delete the source file after it is successfully copied to the destination. If you need to delete the source file after a successful copy, create a custom activity to delete the file and use the activity in the pipeline.
Supported scenarios and authentication types
You can use this SFTP connector to copy data from both cloud SFTP servers and on-premises SFTP servers. Basic and SshPublicKey authentication types are supported when connecting to the SFTP server.
When copying data from an on-premises SFTP server, you need to install Data Management Gateway in the on-premises environment/Azure VM. See Data Management Gateway for details on the gateway. See the moving data between on-premises locations and cloud article for step-by-step instructions on setting up the gateway and using it.
Getting started
You can create a pipeline with a copy activity that moves data from an SFTP source by using different tools/APIs.
The easiest way to create a pipeline is to use the Copy Wizard. See Tutorial: Create a pipeline using Copy Wizard for a quick walkthrough on creating a pipeline using the Copy data wizard.
You can also use the following tools to create a pipeline: Visual Studio, Azure PowerShell, Azure Resource Manager template, .NET API, and REST API. See the Copy activity tutorial for step-by-step instructions to create a pipeline with a copy activity. For JSON samples to copy data from an SFTP server to Azure Blob Storage, see the JSON Example: Copy data from SFTP server to Azure blob section of this article.
Linked service properties
The following table provides descriptions of the JSON elements specific to the SFTP linked service.

| Property | Description | Required |
| --- | --- | --- |
| type | The type property must be set to Sftp. | Yes |
| host | Name or IP address of the SFTP server. | Yes |
| port | Port on which the SFTP server is listening. The default value is 22. | No |
| authenticationType | Specify the authentication type. Allowed values: Basic, SshPublicKey. See the Using basic authentication and Using SSH public key authentication sections for more properties and JSON samples respectively. | Yes |
| skipHostKeyValidation | Specify whether to skip host key validation. | No. The default value is false. |
| hostKeyFingerprint | Specify the fingerprint of the host key. | Yes if skipHostKeyValidation is set to false. |
| gatewayName | Name of the Data Management Gateway to connect to an on-premises SFTP server. | Yes if copying data from an on-premises SFTP server. |
| encryptedCredential | Encrypted credential to access the SFTP server. Auto-generated when you specify basic authentication (username + password) or SshPublicKey authentication (username + private key path or content) in the Copy Wizard or the ClickOnce popup dialog. | No. Apply only when copying data from an on-premises SFTP server. |
Using basic authentication
To use basic authentication, set authenticationType to Basic, and specify the following properties in addition to the generic SFTP connector properties introduced in the previous section:

| Property | Description | Required |
| --- | --- | --- |
| username | User who has access to the SFTP server. | Yes |
| password | Password for the user (username). | Yes |
Example: Basic authentication
```json
{
"name": "SftpLinkedService",
"properties": {
"type": "Sftp",
"typeProperties": {
"host": "mysftpserver",
"port": 22,
"authenticationType": "Basic",
"username": "xxx",
"password": "xxx",
"skipHostKeyValidation": false,
"hostKeyFingerPrint": "ssh-rsa 2048 xx:00:00:00:xx:00:x0:0x:0x:0x:0x:00:00:x0:x0:00",
"gatewayName": "mygateway"
}
}
}
```
Example: Basic authentication with encrypted credential
```json
{
"name": "SftpLinkedService",
"properties": {
"type": "Sftp",
"typeProperties": {
"host": "mysftpserver",
"port": 22,
"authenticationType": "Basic",
"username": "xxx",
"encryptedCredential": "xxxxxxxxxxxxxxxxx",
"skipHostKeyValidation": false,
"hostKeyFingerPrint": "ssh-rsa 2048 xx:00:00:00:xx:00:x0:0x:0x:0x:0x:00:00:x0:x0:00",
"gatewayName": "mygateway"
}
}
}
```
Using SSH public key authentication
To use SSH public key authentication, set authenticationType to SshPublicKey, and specify the following properties in addition to the generic SFTP connector properties introduced in the previous section:

| Property | Description | Required |
| --- | --- | --- |
| username | User who has access to the SFTP server. | Yes |
| privateKeyPath | Specify the absolute path to the private key file that the gateway can access. | Specify either privateKeyPath or privateKeyContent. Applies only when copying data from an on-premises SFTP server. |
| privateKeyContent | A serialized string of the private key content. The Copy Wizard can read the private key file and extract the private key content automatically. If you are using any other tool/SDK, use the privateKeyPath property instead. | Specify either privateKeyPath or privateKeyContent. |
| passPhrase | Specify the pass phrase/password to decrypt the private key if the key file is protected by a pass phrase. | Yes if the private key file is protected by a pass phrase. |
Note
The SFTP connector supports RSA/DSA OpenSSH keys. Make sure your key file content starts with "-----BEGIN [RSA/DSA] PRIVATE KEY-----". If the private key file is in .ppk format, use the PuTTY tool to convert it from .ppk to OpenSSH format.
Example: SshPublicKey authentication using private key filePath
```json
{
"name": "SftpLinkedServiceWithPrivateKeyPath",
"properties": {
"type": "Sftp",
"typeProperties": {
"host": "mysftpserver",
"port": 22,
"authenticationType": "SshPublicKey",
"username": "xxx",
"privateKeyPath": "D:\\privatekey_openssh",
"passPhrase": "xxx",
"skipHostKeyValidation": true,
"gatewayName": "mygateway"
}
}
}
```
Example: SshPublicKey authentication using private key content
```json
{
"name": "SftpLinkedServiceWithPrivateKeyContent",
"properties": {
"type": "Sftp",
"typeProperties": {
"host": "mysftpserver.westus.cloudapp.azure.com",
"port": 22,
"authenticationType": "SshPublicKey",
"username": "xxx",
"privateKeyContent": "<base64 string of the private key content>",
"passPhrase": "xxx",
"skipHostKeyValidation": true
}
}
}
```
Dataset properties
For a full list of sections & properties available for defining datasets, see the Creating datasets article. Sections such as structure, availability, and policy of a dataset JSON are similar for all dataset types.
The typeProperties section is different for each type of dataset. It provides information that is specific to the dataset type. The typeProperties section for a dataset of type FileShare has the following properties:
| Property | Description | Required |
| --- | --- | --- |
| folderPath | Sub path to the folder. Use the escape character '\' for special characters in the string. See Sample linked service and dataset definitions for examples. You can combine this property with partitionedBy to have folder paths based on slice start/end date-times. | Yes |
| fileName | Specify the name of the file in the folderPath if you want the table to refer to a specific file in the folder. If you do not specify any value for this property, the table points to all files in the folder. When fileName is not specified for an output dataset, the name of the generated file is in the following format: `Data.<Guid>.txt` (Example: `Data.0a405f8a-93ff-4c6f-b3be-f69616f1df7a.txt`). | No |
| fileFilter | Specify a filter to be used to select a subset of files in the folderPath rather than all files. Allowed values are: * (multiple characters) and ? (single character). Example 1: "fileFilter": "*.log". Example 2: "fileFilter": "2014-1-?.txt". fileFilter is applicable for an input FileShare dataset. This property is not supported with HDFS. | No |
| partitionedBy | partitionedBy can be used to specify a dynamic folderPath and fileName for time-series data. For example, folderPath parameterized for every hour of data. | No |
| format | The following format types are supported: TextFormat, JsonFormat, AvroFormat, OrcFormat, ParquetFormat. Set the type property under format to one of these values. For more information, see the Text Format, Json Format, Avro Format, Orc Format, and Parquet Format sections. If you want to copy files as-is between file-based stores (binary copy), skip the format section in both input and output dataset definitions. | No |
| compression | Specify the type and level of compression for the data. Supported types are: GZip, Deflate, BZip2, and ZipDeflate. Supported levels are: Optimal and Fastest. For more information, see File and compression formats in Azure Data Factory. | No |
| useBinaryTransfer | Specify whether to use binary transfer mode. True for binary mode, false for ASCII. Default value: True. This property can only be used when the associated linked service type is FtpServer. | No |
Note
fileName and fileFilter cannot be used simultaneously.
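For example, the following input dataset is a minimal sketch that combines several of the properties above: it uses fileFilter to select only .log files in a folder and declares GZip compression. The dataset name (SftpLogFilesInput) and the folder path are hypothetical placeholders; adjust them to your environment and verify compression support for your scenario before use.

```json
{
    "name": "SftpLogFilesInput",
    "properties": {
        "type": "FileShare",
        "linkedServiceName": "SftpLinkedService",
        "typeProperties": {
            "folderPath": "mysharedfolder/logs",
            "fileFilter": "*.log",
            "compression": {
                "type": "GZip",
                "level": "Optimal"
            }
        },
        "external": true,
        "availability": {
            "frequency": "Hour",
            "interval": 1
        }
    }
}
```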
Using partitionedBy property
As mentioned in the previous section, you can specify a dynamic folderPath and fileName for time-series data with partitionedBy. You can do so with the Data Factory macros and the system variables SliceStart and SliceEnd, which indicate the logical time period for a given data slice.
To learn about time-series datasets, scheduling, and slices, see the Creating Datasets, Scheduling & Execution, and Creating Pipelines articles.
Sample 1:
"folderPath": "wikidatagateway/wikisampledataout/{Slice}",
"partitionedBy":
[
{ "name": "Slice", "value": { "type": "DateTime", "date": "SliceStart", "format": "yyyyMMddHH" } },
],
```
In this example, {Slice} is replaced with the value of the Data Factory system variable SliceStart in the specified format (yyyyMMddHH). SliceStart refers to the start time of the slice. The folderPath is different for each slice. Example: wikidatagateway/wikisampledataout/2014100103 or wikidatagateway/wikisampledataout/2014100104.
Sample 2:
"folderPath": "wikidatagateway/wikisampledataout/{Year}/{Month}/{Day}",
"fileName": "{Hour}.csv",
"partitionedBy":
[
{ "name": "Year", "value": { "type": "DateTime", "date": "SliceStart", "format": "yyyy" } },
{ "name": "Month", "value": { "type": "DateTime", "date": "SliceStart", "format": "MM" } },
{ "name": "Day", "value": { "type": "DateTime", "date": "SliceStart", "format": "dd" } },
{ "name": "Hour", "value": { "type": "DateTime", "date": "SliceStart", "format": "hh" } }
],
```
In this example, year, month, day, and time of SliceStart are extracted into separate variables that are used by folderPath and fileName properties.
Copy activity properties
For a full list of sections & properties available for defining activities, see the Creating Pipelines article. Properties such as name, description, input and output tables, and policies are available for all types of activities.
The properties available in the typeProperties section of the activity, on the other hand, vary with each activity type. For Copy activity, the type properties vary depending on the types of sources and sinks.
In Copy Activity, when the source is of type FileSystemSource, the following property is available in the typeProperties section (an example follows the table):

| Property | Description | Allowed values | Required |
| --- | --- | --- | --- |
| recursive | Indicates whether the data is read recursively from the sub folders or only from the specified folder. | True, False (default) | No |
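As a minimal sketch, the typeProperties section of the copy activity shown later in this article could be written as follows to read files from sub folders recursively; only the recursive flag is added relative to that example:

```json
"typeProperties": {
    "source": {
        "type": "FileSystemSource",
        "recursive": true
    },
    "sink": {
        "type": "BlobSink"
    }
}
```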
Supported file and compression formats
See the File and compression formats in Azure Data Factory article for details.
JSON Example: Copy data from SFTP server to Azure blob
The following example provides sample JSON definitions that you can use to create a pipeline by using Visual Studio or Azure PowerShell. They show how to copy data from an SFTP source to Azure Blob Storage. However, data can be copied directly from any of the supported sources to any of the supported sinks by using the Copy Activity in Azure Data Factory.
Important
This sample provides JSON snippets. It does not include step-by-step instructions for creating the data factory. See the moving data between on-premises locations and cloud article for step-by-step instructions.
The sample has the following data factory entities:
- A linked service of type Sftp.
- A linked service of type AzureStorage.
- An input dataset of type FileShare.
- An output dataset of type AzureBlob.
- A pipeline with Copy Activity that uses FileSystemSource and BlobSink.
The sample copies data from an SFTP server to an Azure blob every hour. The JSON properties used in these samples are described in sections following the samples.
SFTP linked service
This example uses basic authentication with a user name and password in plain text. You can also use one of the following ways:
- Basic authentication with encrypted credentials
- SSH public key authentication
See the Linked service properties section for the different types of authentication you can use.
```json
{
"name": "SftpLinkedService",
"properties": {
"type": "Sftp",
"typeProperties": {
"host": "mysftpserver",
"port": 22,
"authenticationType": "Basic",
"username": "myuser",
"password": "mypassword",
"skipHostKeyValidation": false,
"hostKeyFingerPrint": "ssh-rsa 2048 xx:00:00:00:xx:00:x0:0x:0x:0x:0x:00:00:x0:x0:00",
"gatewayName": "mygateway"
}
}
}
```
Azure Storage linked service
```json
{
"name": "AzureStorageLinkedService",
"properties": {
"type": "AzureStorage",
"typeProperties": {
"connectionString": "DefaultEndpointsProtocol=https;AccountName=<accountname>;AccountKey=<accountkey>"
}
}
}
```
SFTP input dataset
This dataset refers to the SFTP folder mysharedfolder and file test.csv. The pipeline copies the file to the destination.
Setting "external": "true" informs the Data Factory service that the dataset is external to the data factory and is not produced by an activity in the data factory.
```json
{
"name": "SFTPFileInput",
"properties": {
"type": "FileShare",
"linkedServiceName": "SftpLinkedService",
"typeProperties": {
"folderPath": "mysharedfolder",
"fileName": "test.csv"
},
"external": true,
"availability": {
"frequency": "Hour",
"interval": 1
}
}
}
```
Azure Blob output dataset
Data is written to a new blob every hour (frequency: hour, interval: 1). The folder path for the blob is dynamically evaluated based on the start time of the slice that is being processed. The folder path uses the year, month, day, and hour parts of the start time.
```json
{
"name": "AzureBlobOutput",
"properties": {
"type": "AzureBlob",
"linkedServiceName": "AzureStorageLinkedService",
"typeProperties": {
"folderPath": "mycontainer/sftp/yearno={Year}/monthno={Month}/dayno={Day}/hourno={Hour}",
"format": {
"type": "TextFormat",
"rowDelimiter": "\n",
"columnDelimiter": "\t"
},
"partitionedBy": [
{
"name": "Year",
"value": {
"type": "DateTime",
"date": "SliceStart",
"format": "yyyy"
}
},
{
"name": "Month",
"value": {
"type": "DateTime",
"date": "SliceStart",
"format": "MM"
}
},
{
"name": "Day",
"value": {
"type": "DateTime",
"date": "SliceStart",
"format": "dd"
}
},
{
"name": "Hour",
"value": {
"type": "DateTime",
"date": "SliceStart",
"format": "HH"
}
}
]
},
"availability": {
"frequency": "Hour",
"interval": 1
}
}
}
```
Pipeline with Copy activity
The pipeline contains a Copy Activity that is configured to use the input and output datasets and is scheduled to run every hour. In the pipeline JSON definition, the source type is set to FileSystemSource and sink type is set to BlobSink.
```json
{
"name": "pipeline",
"properties": {
"activities": [{
"name": "SFTPToBlobCopy",
"inputs": [{
"name": "SFTPFileInput"
}],
"outputs": [{
"name": "AzureBlobOutput"
}],
"type": "Copy",
"typeProperties": {
"source": {
"type": "FileSystemSource"
},
"sink": {
"type": "BlobSink"
}
},
"scheduler": {
"frequency": "Hour",
"interval": 1
},
"policy": {
"concurrency": 1,
"executionPriorityOrder": "NewestFirst",
"retry": 1,
"timeout": "00:05:00"
}
}],
"start": "2017-02-20T18:00:00Z",
"end": "2017-02-20T19:00:00Z"
}
}
```
Performance and Tuning
See Copy Activity Performance & Tuning Guide to learn about key factors that impact performance of data movement (Copy Activity) in Azure Data Factory and various ways to optimize it.
Next steps
See the following articles:
- Copy Activity tutorial for step-by-step instructions for creating a pipeline with a Copy Activity.