Copying Data to the Cloud from the Command Line

Copying data to the Cloud from the command line: that is what this post is all about. Over the last couple of years we have seen vendors offering different object storage platforms/services in the Cloud; cloud storage typically refers to a hosted object storage service. The following are considered the best-known and most widely used Cloud Object Storage platforms, and we are going to demonstrate how to copy data to each of them from the command line.

In no particular order, I strongly recommend you read up on each of their offerings and register for a free subscription:

  * Google Cloud Storage
  * Amazon S3
  * Azure Blob Storage

Now, for each platform we will implement the same 4 actions (commands):

  1. create a bucket/container (placeholder)
  2. copy object(s)
  3. delete object(s)
  4. delete a bucket/container (placeholder)

We will be using the 2016 Jersey City – Citi Bike trip data. Details and info about the data can be found at https://www.citibikenyc.com/system-data, but for this sample demo I have already downloaded and packaged the data for you, which you can download here.
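If you prefer to stay in the terminal for this step as well, here is a minimal sketch; the URL and archive name below are placeholders for the download link above:

# Download the packaged trip data (placeholder URL/file name) and unpack it
curl -L -o citibike-jc-2016.zip "https://example.com/citibike-jc-2016.zip"
unzip citibike-jc-2016.zip -d /data/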

Once downloaded and unzipped, the data should look like this:
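Something along these lines; the exact monthly file names are illustrative:

/data/citibike/
└── jc/
    └── 2016/
        ├── JC-201601-citibike-tripdata.csv
        ├── JC-201602-citibike-tripdata.csv
        ├── ...
        └── JC-201612-citibike-tripdata.csv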

Warning!
I am on macOS, but everything works on Windows as well; just don’t forget to change the file paths (i.e. from “/data/citibike/jc/2016/” to “C:\data\citibike\jc\2016\”). The goal is to load/copy the same “file” structure from the local source (shown above) to the cloud, but with a root “bucket/container” named citibike-tripdata instead of citibike. Let’s get started…


Google Cloud Storage

Access to Google Cloud Storage from the command line is done with the gsutil tool. The assumption is that you have already selected or created a Cloud Platform project and enabled billing for it. Next you need to install the Cloud SDK and Python 2.7 (if not already installed). To install and initialize, follow these Quickstarts.
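If the SDK is freshly installed, initialization boils down to one command, which walks you through authentication and default project selection:

# Authenticate and choose a default project
gcloud init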

Create a bucket

mb – Make buckets. The following command creates a bucket named citibike-tripdata with the Nearline storage class in the us-east1 region:

gsutil mb -c nearline -l us-east1 gs://citibike-tripdata
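A quick sanity check confirms the bucket, its storage class, and its location:

gsutil ls -L -b gs://citibike-tripdata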

Copy object(s)

cp – Copy files and objects. The following command performs a parallel (multi-threaded/multi-processing) copy of files recursively, and applies the gzip content-encoding to each:

gsutil -m cp -r -Z "/data/citibike/jc/2016/" gs://citibike-tripdata/jc/
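You can verify the upload with a recursive listing:

gsutil ls -r gs://citibike-tripdata/jc/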

Delete object(s)

rm – Remove objects. The following command causes bucket or bucket subdirectory contents (all objects and subdirectories that it contains) to be removed recursively:

gsutil -m rm -r gs://citibike-tripdata/jc/2016/

Delete bucket

rb – Remove buckets. Buckets must be empty before you can delete them. The following command deletes the citibike-tripdata bucket (and all of its subdirectories, because we already deleted all of the objects):

gsutil rb gs://citibike-tripdata


Amazon S3

The AWS CLI is an open source tool built on top of the AWS SDK for Python that provides commands for interacting with AWS services. It gives you access to functionality equivalent to that of the AWS Management Console, right from your favorite terminal program. Start with Installing the AWS Command Line Interface and then configure your settings.
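If you don't have the CLI installed yet, a minimal sketch (pip is just one of several supported install options):

# Install the AWS CLI and set your credentials, region, and output format
pip install awscli
aws configure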

Create a bucket

mb – Creates an S3 bucket. The following command creates a bucket named citibike-tripdata in the ca-central-1 region:

aws s3 mb s3://citibike-tripdata --region ca-central-1
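Listing your buckets confirms the new bucket was created:

aws s3 ls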

Copy object(s)

cp – Copies a local file or S3 object to another location locally or in S3. The following command copies files recursively with the STANDARD_IA (infrequent access) storage class and applies gzip content-encoding to each:

aws s3 cp "/data/citibike/jc/2016/" s3://citibike-tripdata/jc/2016/ --storage-class STANDARD_IA --recursive --content-encoding gzip
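As with gsutil, a recursive listing verifies the copy:

aws s3 ls s3://citibike-tripdata/jc/2016/ --recursive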

Delete object(s)

rm – Deletes an S3 object. The following command causes bucket or bucket subdirectory contents (all objects and subdirectories that it contains) to be removed recursively:

aws s3 rm s3://citibike-tripdata/jc/2016/ --recursive

Delete bucket

rb – Deletes an empty S3 bucket. A bucket must be completely empty of objects and versioned objects before it can be deleted. The following command deletes the citibike-tripdata bucket (and all of its subdirectories, because we already deleted all of the objects):

aws s3 rb s3://citibike-tripdata


Azure Blob Storage

The Azure CLI and the Azure PowerShell module are used to create and manage Azure resources from the command line or in scripts. AzCopy, available on Windows and Linux, is also an interesting command-line utility designed for high-performance copying of data to and from Azure Storage.

Both the Azure CLI and Azure PowerShell are available for macOS, Windows, and Linux, but at the time of writing PowerShell 6 and Azure PowerShell for .NET Core are still in beta on macOS and Linux, so the Azure PowerShell sample commands below are run on the Windows platform.

If you are using the Azure CLI, run the login command az login (or launch the Cloud Shell); if you are using Azure PowerShell, run the login command Login-AzureRmAccount.

Create a container

az storage container. The following commands create a resource group citibiketripdata_rg, a storage account citibiketripdata, and a container citibike-tripdata:

# Create a resource group
az group create --name citibiketripdata_rg --location eastus

# Create a storage account
az storage account create \
    --name citibiketripdata \
    --resource-group citibiketripdata_rg \
    --location eastus \
    --sku Standard_LRS \
    --encryption blob

# The Azure CLI needs your storage account credentials...
export AZURE_STORAGE_ACCOUNT='PUT_YOUR_STORAGE_ACCOUNT_HERE'
export AZURE_STORAGE_ACCESS_KEY='PUT_YOUR_ACCESS_KEY_HERE'

# Create a container
az storage container create --name citibike-tripdata
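To confirm the container exists (credentials are picked up from the environment variables above):

az storage container show --name citibike-tripdata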

New-AzureStorageContainer. The following commands create a resource group citibiketripdata_rg, a storage account citibiketripdata, and a container citibike-tripdata:

# Create a resource group
New-AzureRmResourceGroup -Name citibiketripdata_rg -Location eastus 

# Create a storage account
$storageAccount = New-AzureRmStorageAccount -ResourceGroupName citibiketripdata_rg `
  -Name "citibiketripdata" `
  -Location eastus `
  -SkuName Standard_LRS `
  -Kind Storage `
  -EnableEncryptionService Blob

$ctx = $storageAccount.Context

# Create a container
New-AzureStorageContainer -Name "citibike-tripdata" -Context $ctx -Permission Off

Copy object(s)

az storage blob upload-batch. The following command uploads files from the local directory /data/citibike/ to the blob container citibike-tripdata with gzip content-encoding:

az storage blob upload-batch \
    --destination citibike-tripdata \
    --source /data/citibike/ \
    --content-encoding gzip
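A quick listing shows the uploaded blobs:

az storage blob list --container-name citibike-tripdata --output table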

Set-AzureStorageBlobContent does the job; however, it does not preserve the folder structure of our local source. So here is a PowerShell script that does exactly what we need:

$StorageAccountName = "citibiketripdata"
$StorageAccountKey = "PUT_YOUR_ACCESS_KEY_HERE"
$ContainerName = "citibike-tripdata"
$sourceFileRootDirectory = "C:\data\citibike"
 
function UploadFilesToAzureStorageContainer {
    [cmdletbinding()]
    param(
        $StorageAccountName,
        $StorageAccountKey,
        $ContainerName,
        $sourceFileRootDirectory,
        $Force
    )

    # Build a storage context from the account credentials and look up the container
    $ctx = New-AzureStorageContext -StorageAccountName $StorageAccountName -StorageAccountKey $StorageAccountKey
    $container = Get-AzureStorageContainer -Name $ContainerName -Context $ctx

    # Echo the container URI so you can see where the files are going
    $container.CloudBlobContainer.Uri.AbsoluteUri
    if ($container) {
        # Gather every file under the local root, including subdirectories
        $filesToUpload = Get-ChildItem $sourceFileRootDirectory -Recurse -File

        foreach ($x in $filesToUpload) {
            # Turn the local path into a relative blob path (backslashes -> forward slashes)
            $targetPath = ($x.fullname.Substring($sourceFileRootDirectory.Length + 1)).Replace("\", "/")

            Write-Verbose "Uploading $("\" + $x.fullname.Substring($sourceFileRootDirectory.Length + 1)) to $($container.CloudBlobContainer.Uri.AbsoluteUri + "/" + $targetPath)"
            Set-AzureStorageBlobContent -File $x.fullname -Container $container.Name -Blob $targetPath -Context $ctx -Properties @{"ContentEncoding" = "gzip"} -Force:$Force | Out-Null
        }
    }
}
 
UploadFilesToAzureStorageContainer -StorageAccountName $StorageAccountName -StorageAccountKey $StorageAccountKey -ContainerName $ContainerName -sourceFileRootDirectory $sourceFileRootDirectory -Verbose

As mentioned earlier, the AzCopy utility is another option for high-performance, scriptable data transfer to and from Azure Storage. The following command recursively copies data from the local folder C:\data\citibike to the citibike-tripdata container:

AzCopy /Source:C:\data\citibike /Dest:https://citibiketripdata.blob.core.windows.net/citibike-tripdata /DestKey:PUT_YOUR_ACCESS_KEY_HERE /S

Delete object(s)

az storage blob delete-batch. Deletes blobs from a blob container recursively.

Note: it’s documented, but at the time of writing this post it is not yet implemented in the Azure CLI. The following GitHub pull request has been approved and should be released soon. Details here -> https://github.com/Azure/azure-cli/pull/4781
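Until delete-batch ships, here is a workaround sketch with the Azure CLI; it reuses the storage account credentials exported earlier and deletes blobs one at a time, so it is slower than a true batch delete:

# List all blob names and delete them one at a time
az storage blob list \
    --container-name citibike-tripdata \
    --query "[].name" \
    --output tsv \
| xargs -I {} az storage blob delete \
    --container-name citibike-tripdata \
    --name "{}"

With Azure PowerShell, the following one-liner does the job today: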

Get-AzureStorageBlob -Container $ContainerName -Context $ctx | ForEach-Object {Remove-AzureStorageBlob -Blob $_.Name -Container $ContainerName -Context $ctx}

Delete container

az storage container delete. The following command marks the specified container for deletion:

az storage container delete --name citibike-tripdata

Remove-AzureStorageContainer. The following command removes the specified storage container:

Remove-AzureStorageContainer -Name "citibike-tripdata"

But wait! It can be easier… If you no longer need any of the resources in your resource group, including the storage account and blobs, delete the entire resource group with the az group delete command in the Azure CLI, or the Remove-AzureRmResourceGroup command in Azure PowerShell:

az group delete --name citibiketripdata_rg

Remove-AzureRmResourceGroup -Name "citibiketripdata_rg"

Voilà! We are done. If you need further assistance or guidance, post a comment below.
Enjoy!
