AWS Big Data Blog
Moving Big Data Into The Cloud with ExpeDat Gateway for HAQM S3
Matt Yanchyshyn is a Principal Solutions Architect with HAQM Web Services
Introduction
A previous blog post (Moving Big Data Into the Cloud with Tsunami UDP) discussed how Tsunami UDP is a fast and easy way to move large amounts of data to and from AWS. Specifically, we showed how you can use it to move data quickly into HAQM EC2 from another instance in a distant AWS Region. From there, using the multipart upload functionality built into the AWS CLI, we moved the data into HAQM Simple Storage Service (HAQM S3).
Tsunami UDP has no software license fees and is easy to set up, but it has drawbacks. It doesn’t support native encryption—important to know if you’re working with sensitive data. It’s also a single-threaded application that hasn’t been updated since 2009 and commercial support is not available. In addition, due to the lack of an SDK or plugins, Tsunami UDP can be hard to automate for tasks like creating watch folders with complex rules. And Tsunami UDP doesn’t support native HAQM S3 integration, so transfers must first be terminated on an HAQM Elastic Compute Cloud (HAQM EC2) instance and then re-transmitted to HAQM S3 manually using tools like the AWS CLI.
ExpeDat, by Data Expedition Inc., addresses these shortfalls. It also provides features that make moving large amounts of data into HAQM S3 from on-premises or HAQM EC2 instances in other regions a seamless experience. Unlike Tsunami UDP, ExpeDat is an actively maintained and fully supported product that employs AES encryption and has lightweight, cross-platform clients with GUIs. ExpeDat also has Object Handlers that let you integrate with any external script or program, making automation easy to set up. If lower-level integration is required, SDKs are available. The ExpeDat S3 Gateway product can also automatically stream data into HAQM S3 – data never touches HAQM Elastic Block Store (HAQM EBS) or ephemeral storage, but instead lives only in memory as it’s transmitted via the ExpeDat gateway on HAQM EC2 to the bucket of your choice in HAQM S3.
One of the easiest ways to get started with ExpeDat is to install the ExpeDat Gateway for HAQM S3 via the AWS Marketplace. This product can transmit ~300GB per hour and is available as a monthly subscription. The ExpeDat S3 Gateway runs on an HAQM EC2 instance, which can be set up in a couple of minutes.
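To put the ~300GB per hour figure in perspective, here is a rough back-of-the-envelope sketch. It assumes decimal units (1 GB = 1000 MB) and a perfectly sustained rate, neither of which is stated by the product listing, so treat the results as estimates only. The function names are made up for illustration.

```shell
#!/bin/sh
# Estimate transfer time and average line rate for a sustained throughput.
# Assumptions: decimal units (1 GB = 1000 MB) and a constant rate.

# transfer_hours SIZE_GB RATE_GB_PER_HOUR -> hours, two decimals
transfer_hours() {
    awk -v size="$1" -v rate="$2" 'BEGIN { printf "%.2f\n", size / rate }'
}

# rate_mbit_per_s RATE_GB_PER_HOUR -> megabits per second, rounded
rate_mbit_per_s() {
    awk -v rate="$1" 'BEGIN { printf "%.0f\n", rate * 8 * 1000 / 3600 }'
}

echo "650 GB at 300 GB/hour: $(transfer_hours 650 300) hours"
echo "300 GB/hour is about $(rate_mbit_per_s 300) Mbit/s"
```

Under these assumptions, a 650 GB dataset would take a little over two hours, and the source instance would need roughly two thirds of a gigabit per second of sustained outbound bandwidth.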
Getting Started
For this example, we’ll use the same dataset that we used in the earlier blog post: the Wikipedia Traffic Statistics V2 from AWS Public Data Sets. We’ll move this 650GB compressed dataset over the Internet from an HAQM EC2 instance in the AWS Tokyo Region (ap-northeast-1) to an ExpeDat S3 Gateway “free trial” HAQM EC2 instance launched from the AWS Marketplace into the AWS N. Virginia Region (us-east-1). For convenience, we’ve placed a copy of the data in an HAQM S3 bucket in ap-northeast-1: s3://accel-file-tx-demo-tokyo.
Launch the ExpeDat Gateway for HAQM S3 (server)
- Go to the AWS Marketplace page for the ExpeDat Gateway for HAQM S3 trial.
Note: You won’t be charged software fees for 21 days, but the usual AWS infrastructure fees apply.
- Launch an instance of the ExpeDat Gateway for HAQM S3 into the US East (Virginia) Region.
Note: We will use an ExpeDat command line client on a separate HAQM EC2 instance for our tests, but feel free to download others such as the graphical clients for Mac or Windows for your own tests.
You should now have a working ExpeDat Gateway for HAQM S3 instance running the servedat server application and pointing to an HAQM S3 bucket.
Set up the ExpeDat Client
- Launch an HAQM Linux instance in ap-northeast-1 (Tokyo). For testing purposes, this instance should be the same type as the one you launched from the AWS Marketplace in us-east-1. For convenience, we’ve prepared an AWS CloudFormation template that launches an HAQM EC2 instance running the 64-bit 2014.03.01 HAQM Linux PV AMI on instance types with two or more large ephemeral drives. The bootstrap script in the template creates a RAID0 array of two of the ephemeral volumes and mounts it to /mnt/bigephemeral.
- Download the ExpeDat client from the web interface of the ExpeDat Gateway for HAQM S3 instance that you just launched:
If you’d like to try ExpeDat without installing the S3 Gateway from the AWS Marketplace, you can download a trial version of the Linux x86-64 ExpeDat client.
- Copy the ExpeDat client that you downloaded to the instance created by the AWS CloudFormation template. You can use the scp utility for this because it uses the same port as SSH, TCP 22, which should already be open in the instance’s Security Group.
- SSH onto the instance created by the AWS CloudFormation template.
- Tune the operating system by increasing the UDP buffers, which can result in overall faster throughput:
sysctl -w net.core.wmem_max=4194304
sysctl -w net.core.rmem_max=4194304
See Data Expedition’s UDP Tuning guidance for notes about this optimization.
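Before changing kernel settings, it can help to check what the instance is already using. The following is a minimal sketch that compares the current limits with the 4194304-byte target from the commands above. The helper name needs_raise is hypothetical, and the script only reports; it does not change anything.

```shell
#!/bin/sh
# Check whether the UDP buffer limits are already at least the target value.
TARGET=4194304   # value used in the sysctl commands in this post

# needs_raise CURRENT TARGET -> succeeds (exit 0) when CURRENT is below TARGET
needs_raise() {
    [ "$1" -lt "$2" ]
}

for key in net.core.wmem_max net.core.rmem_max; do
    # sysctl -n prints only the value; fall back to 0 if the key is unreadable
    current=$(sysctl -n "$key" 2>/dev/null || echo 0)
    current=${current:-0}
    if needs_raise "$current" "$TARGET"; then
        echo "$key=$current is below $TARGET; raise it with: sysctl -w $key=$TARGET"
    else
        echo "$key=$current is already sufficient"
    fi
done
```

Values set with sysctl -w do not survive a reboot, so a check like this is also handy after restarting the instance.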
- Install the ExpeDat “movedat” file transfer client. To do this, uncompress the ExpeDat archive that you downloaded from your ExpeDat Gateway for HAQM S3 instance and run the install-movedat.sh script:
- Use the “fallocate” command to create a test file, replacing 650 with the size in gigabytes that you prefer for testing:
fallocate -l 650G bigfile.img
movedat bigfile.img [user]@[ExpeDat S3 Gateway IP]:=S3
Alternative option: Create a tarball from the dataset files, such as those discussed in the previous post, and pipe the output to the ExpeDat movedat transfer application:
tar -cf - /mnt/bigephemeral | movedat [user]@[ExpeDat S3 Gateway IP]:=S3
- The file(s) should appear in your HAQM S3 bucket shortly after the transfer completes.
Easy! No more watch folders or manual HAQM S3 uploads. Files sent to the ExpeDat Gateway for HAQM S3 from an ExpeDat client (movedat) move straight into your HAQM S3 bucket.
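The two transfer styles used in the steps above (a single file, or a tarball streamed through a pipe) can be wrapped in a small script. This is only a sketch: GATEWAY_USER, GATEWAY_HOST, DRY_RUN and the function names are hypothetical, the IP address is from a documentation-only range, and the movedat syntax is copied from the examples in this post rather than from the ExpeDat manual. By default it prints the commands instead of running them.

```shell
#!/bin/sh
# Dry-run wrapper around the file and tarball transfers shown in this post.
# GATEWAY_USER and GATEWAY_HOST are placeholders; set DRY_RUN=0 to execute.
GATEWAY_USER=${GATEWAY_USER:-user}
GATEWAY_HOST=${GATEWAY_HOST:-203.0.113.10}   # documentation-range IP, not a real gateway
DRY_RUN=${DRY_RUN:-1}

# s3_target -> destination string in the user@host:=S3 form used in this post
s3_target() {
    printf '%s@%s:=S3\n' "$GATEWAY_USER" "$GATEWAY_HOST"
}

# send_file FILE -> upload a single file
send_file() {
    if [ "$DRY_RUN" = 1 ]; then
        echo "movedat $1 $(s3_target)"
    else
        movedat "$1" "$(s3_target)"
    fi
}

# send_tree DIR -> stream a directory as a tarball through a pipe
send_tree() {
    if [ "$DRY_RUN" = 1 ]; then
        echo "tar -cf - $1 | movedat $(s3_target)"
    else
        tar -cf - "$1" | movedat "$(s3_target)"
    fi
}

send_file bigfile.img
send_tree /mnt/bigephemeral
```

Printing the command first makes it easy to confirm the destination before committing to a multi-hour transfer.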
To use AES-128 encryption with ExpeDat, add a -K argument after the movedat command. This addresses Tsunami UDP’s lack of encryption, one of its major shortcomings. Enabling encrypted transfers with ExpeDat increases the CPU load of both the server and the client computers and may cause a reduction in performance on very fast networks or very busy CPUs:
movedat -K * test@[ExpeDat S3 Gateway IP]:=S3
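To see how much encryption costs on your own hardware, one option is to time the same transfer with and without -K and compare average throughput. The helpers below are a hypothetical sketch: timed uses whole-second resolution from date, and the numbers in the example are illustrative placeholders, not measurements.

```shell
#!/bin/sh
# throughput_mbit BYTES SECONDS -> average megabits per second, one decimal
throughput_mbit() {
    awk -v b="$1" -v s="$2" 'BEGIN { printf "%.1f\n", b * 8 / s / 1000000 }'
}

# timed CMD... -> run a command quietly and print elapsed wall-clock seconds
timed() {
    start=$(date +%s)
    "$@" >/dev/null 2>&1
    end=$(date +%s)
    echo $((end - start))
}

# Illustrative numbers only: 1 GB moved in 10 s versus 12.5 s
echo "plain:     $(throughput_mbit 1000000000 10) Mbit/s"
echo "encrypted: $(throughput_mbit 1000000000 12.5) Mbit/s"
```

In practice you would capture the duration with something like secs=$(timed movedat -K bigfile.img [user]@[ExpeDat S3 Gateway IP]:=S3) and pass the file size in bytes and secs to throughput_mbit.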
Another great use for ExpeDat Gateway for HAQM S3 is to list, rename, delete and download files in HAQM S3 buckets. Manipulating objects in HAQM S3 via ExpeDat makes automation a lot easier since complex file workflows can be scripted without having to use multiple tools. For example, to list a bucket:
movedat -o [user]@[server]:=S3
List a subdirectory in S3:
movedat -o [user]@[server]:=subdir/S3
Rename:
movedat -o -m [user]@[server]:pagecounts-2008-1001-000000.gz=S3 newfile.gz
Download the renamed file and give it a new local name:
movedat -o [user]@[server]:newfile.gz=S3 localfile.gz
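Because these operations all go through the same client, they are straightforward to chain in a script. Here is a minimal sketch that mirrors the list, rename, and download commands above. REMOTE, DRY_RUN and the function names are hypothetical, and the flags are taken from this post's examples as-is, so check them against the ExpeDat documentation before relying on them. It prints the commands unless DRY_RUN=0.

```shell
#!/bin/sh
# Dry-run helpers mirroring the list, rename, and download commands in this post.
# REMOTE is a placeholder user@server pair.
REMOTE=${REMOTE:-user@server}
DRY_RUN=${DRY_RUN:-1}

# run CMD... -> print the command in dry-run mode, otherwise execute it
run() {
    if [ "$DRY_RUN" = 1 ]; then
        echo "$*"
    else
        "$@"
    fi
}

# s3_list -> list the bucket
s3_list()     { run movedat -o "$REMOTE:=S3"; }
# s3_rename OLD NEW -> rename an object in the bucket
s3_rename()   { run movedat -o -m "$REMOTE:$1=S3" "$2"; }
# s3_download OBJECT LOCAL -> download an object under a new local name
s3_download() { run movedat -o "$REMOTE:$1=S3" "$2"; }

s3_list
s3_rename pagecounts-2008-1001-000000.gz newfile.gz
s3_download newfile.gz localfile.gz
```

Keeping the remote in one variable means the same script can be pointed at a different gateway without editing each command.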
Conclusion
The use cases for the big data world continue to evolve, and in some industries batch processing is giving way to real-time or nearly real-time analytics. Tools like ExpeDat S3 Gateway make it easier to meet your business needs when you must quickly move a ton of data into AWS. Its direct HAQM S3 integration makes this a fast, seamless experience. This is especially true if you need to move large files from on-premises or other AWS Regions into HAQM S3 for analysis with HAQM EMR or HAQM Redshift.
ExpeDat Gateway for HAQM S3 improves on free tools such as Tsunami UDP by providing AES 128-bit encryption and a rich selection of graphical and command line clients. It’s easy to automate complex file workflows thanks to the many options provided by its command line tools. This includes the HAQM S3 object manipulation as demonstrated in this blog post and more advanced techniques like using Object Handlers to trigger actions on the ExpeDat Gateway for HAQM S3 server or using the ExpeDat Client SDK in your own programs. Perhaps most importantly, ExpeDat and the suite of associated products offered by Data Expedition, Inc. are actively maintained and fully supported. It’s easy to install them via the AWS Marketplace and the overall cost per GB is very low – there’s even a free trial, so give it a try today!
If you have questions or suggestions, please leave a comment below.