Importing and exporting data#

This document outlines common solutions for importing data into your Lambda On-Demand Cloud (ODC) instances and 1-Click Clusters (1CCs). The document also provides guidance on backing up your data so that it persists beyond the life of your instance or 1CC.

Importing data#

You can use rsync to copy data to and from your Lambda instances and their attached filesystems. rsync allows you to copy files from your local environment to your ODC instance, between ODC instances, from instances to 1CCs, and more. If you need to import data from AWS S3 or an S3-compatible object storage service like Cloudflare R2, Google Cloud Storage, or MinIO, you can use s5cmd or rclone.

Importing data from your local environment#

To copy files from your local environment to a Lambda Cloud instance or cluster, run the following rsync command from your local terminal. Replace the variables as follows:

  • Replace <FILES> with the files or directories you want to copy to the remote instance. If you're copying multiple files or directories, separate them using spaces—for example, foo.md bar/ baz/.
  • Replace <USERNAME> with your username on the remote instance.
  • Replace <SERVER-IP> with the IP address of the remote instance.
  • Replace <REMOTE-PATH> with the directory into which you want to copy files.
rsync -av --info=progress2 <FILES> <USERNAME>@<SERVER-IP>:<REMOTE-PATH>
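
For example, the following command copies foo.md and the bar/ directory to /home/ubuntu/data on an instance at 203.0.113.10, using ubuntu as the username (all values here are hypothetical):

rsync -av --info=progress2 foo.md bar/ ubuntu@203.0.113.10:/home/ubuntu/data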

Copying data between instances#

To copy files directly between remote servers using rsync, you must use public key authentication for SSH with an SSH agent. To add your private key to the SSH agent, run ssh-add, replacing <PRIVATE-KEY-PATH> with the path to your SSH private key (for example, ~/.ssh/id_ed25519):

ssh-add <PRIVATE-KEY-PATH>

You can confirm your key was added to the SSH agent by running:

ssh-add -L

You should see your public key in the output.

After you add your private key to your SSH agent, you can copy files directly between remote servers:

  1. Establish an SSH connection to the server you're copying files from. Replace <SERVER-IP> with the IP address of the server, and replace <USERNAME> with your username on that server.

    ssh -A <USERNAME>@<SERVER-IP>
    
  2. On that server, start a tmux session for your copy operation, replacing <SESSION-NAME> with an appropriate session name. tmux lets you create and manage multiple terminal sessions within a single terminal window or tab.

    tmux new-session -s <SESSION-NAME>
    
  3. Copy files to your remote destination server. Replace <FILES> with the files or directory you want to copy, <USERNAME> with your username on the server, <SERVER-IP> with the server's IP address, and <REMOTE-PATH> with the directory into which you want to copy your files.

    rsync -av --info=progress2 <FILES> <USERNAME>@<SERVER-IP>:<REMOTE-PATH>
    
  4. Optionally, detach your tmux session by pressing Ctrl + B, then D. You can resume the session by running tmux attach-session -t <SESSION-NAME>, replacing <SESSION-NAME> with the session name you chose in step 2.
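
Putting these steps together, a complete session might look like the following, with hypothetical usernames, IP addresses, and paths:

# On your local machine: connect to the source server with agent forwarding
ssh -A ubuntu@203.0.113.10

# On the source server: start a tmux session, then copy data to the destination server
tmux new-session -s copy-data
rsync -av --info=progress2 ~/datasets/ ubuntu@198.51.100.20:/home/ubuntu/datasets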

Importing data from S3 or S3-compatible object storage#

To import data from AWS S3 or S3-compatible object storage services like Google Cloud Storage, Cloudflare R2, or MinIO, you can use a command line tool that supports parallelized transfers, such as s5cmd or rclone. These tools are optimized to move large amounts of data and are typically much faster than rsync, scp, the aws s3 CLI, and other common file transfer methods.

Import with s5cmd#

First, install s5cmd on your instance or node:

  1. Navigate to the s5cmd releases page. Copy the link for the latest AMD64 .deb release.
  2. Establish an SSH connection to your instance or node.
  3. In your SSH terminal, download the release to your instance or node. Replace <RELEASE-URL> with the URL you copied in step 1:

    wget <RELEASE-URL>
    
  4. Install the release. Replace <DEB-FILENAME> with the filename of your downloaded Debian package. Make sure to keep the ./ in front of the filename:

    sudo apt install ./<DEB-FILENAME>
    

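For example, if the latest AMD64 release were version 2.2.2, steps 3 and 4 would look like the following. The URL and filename shown here are hypothetical; use the actual ones you copied from the releases page:

wget https://github.com/peak/s5cmd/releases/download/v2.2.2/s5cmd_2.2.2_linux_amd64.deb
sudo apt install ./s5cmd_2.2.2_linux_amd64.deb
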
After you install s5cmd, set up your environment and then begin importing your data:

  1. Open your .bashrc file for editing:

    nano ~/.bashrc
    
  2. At the bottom of the file, set the required environment variables, and then save and exit:

    export AWS_ACCESS_KEY_ID='<ACCESS-KEY-ID>'
    export AWS_SECRET_ACCESS_KEY='<SECRET-ACCESS-KEY>'
    export AWS_PROFILE='<PROFILE-NAME>'
    export AWS_REGION='<BUCKET-REGION>'
    

    Note

    s5cmd supports other methods of specifying credentials as well. For details, see the Specifying credentials section of the s5cmd documentation.

  3. Update your environment with your new environment variables:

    source ~/.bashrc
    
  4. Verify that your credentials are working as expected by listing the files in your source bucket. Replace <S3-BUCKET> with the S3-compatible bucket from which you're importing data:

    s5cmd ls s3://<S3-BUCKET>
    
  5. Navigate to the directory into which you want to import your data.

  6. Use s5cmd to import the data. Replace <S3-BUCKET-PATH> with the path to your files inside your S3-compatible bucket:

    s5cmd cp '<S3-BUCKET-PATH>' .
    

    You can use wildcards to filter the content you import. For example, the following command imports the files and file structure of foo and its subdirectories:

    s5cmd cp 's3://example-bucket/foo/*' .
    

To help optimize your data transfer, you can add the following flags to the s5cmd command:

  • --concurrency N: Sets the number of parts that will be uploaded or downloaded in parallel for a single file. Default is 5.
  • --dry-run: Outputs which operations will be performed without actually carrying out those operations.
  • --numworkers N: Sets the size of the global worker pool. In practice, this acts as an upper bound on how many files can be uploaded or downloaded concurrently. Default is 256.
  • --retry-count N: Sets the maximum number of retries for failed operations. Default is 10.

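For example, the following command raises both the worker pool size and the per-file concurrency (the values shown are arbitrary). Note that --numworkers is a global flag and goes before cp, while --concurrency is a flag of the cp command itself:

s5cmd --numworkers 64 cp --concurrency 10 's3://example-bucket/foo/*' .
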
For more guidance on using s5cmd to import your data, see s5cmd on GitHub.

Import with rclone#

To import data from an S3-compatible storage solution using rclone:

  1. Establish an SSH connection to your instance or node.
  2. Install rclone. For installation instructions, see Install in the Rclone documentation.
  3. Configure rclone. Follow the series of prompts to create a new remote and then set the source storage service, your credentials for that service, the region in which your data is stored, and other details:

    rclone config
    
  4. After you complete the configuration process, verify your connection by listing the contents of your source storage bucket. Replace <REMOTE> with the name of your remote and <BUCKET-NAME> with the name of your source bucket:

    rclone ls <REMOTE>:<BUCKET-NAME>
    

    For example, the following command lists the contents of a bucket named example-bucket:

    rclone ls my-remote:example-bucket
    
  5. Navigate to the directory into which you want to import your data.

  6. Import your data. Replace <REMOTE> with the name of your remote, <BUCKET-NAME> with the name of your source bucket, and <LOCAL-DIR> with the path to the directory to which you're importing the data:

    rclone -P copy <REMOTE>:<BUCKET-NAME> <LOCAL-DIR>
    

You can optimize your data transfer by adding flags to your rclone command. Particularly useful flags include:

  • --checkers N: Number of file integrity and status checkers to run in parallel. Default is 8.
  • --dry-run: Outputs which operations will be performed without actually carrying out those operations.
  • --low-level-retries N: Sets the maximum number of retries for failed API-level operations, such as reads. Default is 10.
  • --transfers N: Number of file transfers to run in parallel. Default is 4.
  • --retries N: Sets the maximum number of retries for failed transfers. Default is 3.
  • --timeout duration: Sets the timeout for blocking network failures. Default is 5m0s.

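For example, the following command increases parallelism while displaying transfer progress (the values shown are arbitrary):

rclone copy -P --transfers 8 --checkers 16 my-remote:example-bucket ./data
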
For a list of additional flags that might be useful, see Global Flags in the Rclone docs. For general information about rclone, see Rclone.

Exporting data#

When you terminate an ODC instance or your 1CC reservation ends, all local, non-filesystem data is destroyed. To preserve your data, you should perform regular backups.

Warning

We cannot recover your data once you've terminated your instance. Before terminating an instance, make sure to back up any data that you want to keep.

Tip

Virtual environments can help simplify the backup process by isolating and centralizing your system state into a small set of directories. For details on setting up an isolated virtual environment, see Managing your system environment > Isolating environments on your instance.

Backing up data to a Lambda filesystem#

You can persist your data on Lambda Cloud by backing the data up to an attached filesystem. In the default configuration, your Lambda filesystem is mounted in the following location:

/home/ubuntu/<FILESYSTEM-NAME>

Important

To back up to a filesystem, the filesystem must be attached to your instance before you start the instance. You can't attach a filesystem after the instance has started.
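
For example, assuming an attached filesystem named my-filesystem (a hypothetical name) mounted at the default location, the following rsync command backs up a local results directory to the filesystem:

rsync -av ~/results/ /home/ubuntu/my-filesystem/results-backup/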

Backing up data to S3-compatible object storage#

If you'd prefer to back up your data to an S3-compatible object storage service, you can use the same tools and commands outlined in the Importing data from S3 or S3-compatible object storage section. Instead of copying data from an S3-compatible bucket into your Lambda instance or node, you copy your data from the instance or node into your bucket.
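
For example, the following commands upload a local results directory with s5cmd and with rclone, respectively (the bucket name, remote name, and paths are hypothetical):

s5cmd cp ./results/ 's3://example-bucket/backups/'
rclone copy -P ./results my-remote:example-bucket/backups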