Overview

In this notebook we take a brief look at AWS S3 storage (a high-level overview of the storage types in AWS can be found in AWS Storage services; it is literally a two-minute read). Concretely, we are going to discuss the following:

  • How to create an AWS S3 Bucket
  • How to upload and download items
  • How to do multi-part file transfer
  • How to generate pre-signed URLs
  • How to set up bucket policies

Moreover, we will work with AWS S3 buckets using the Boto3 Python package.

S3 Buckets

AWS S3 is an object storage system designed to store any type of data, and it is a serverless service. Items in S3 are stored in buckets; as we add items to a bucket, its size grows, and in theory there is no limit on how large a bucket can get. By design, S3 offers 11 9's of durability and stores data for millions of applications. Files stored in S3 are referred to as objects. You can find more information in the official S3 documentation.

The user/application must create a bucket before storing any data on S3. The bucket name must

  • Be globally unique
  • Be at least 3 and no more than 63 characters long
  • Not be formatted as an IP address
  • Contain only lowercase characters; uppercase characters and underscores are not allowed

Below we use the Boto3 Python package to interact with AWS S3; that is, to create a bucket and to upload and download files to and from it.

import logging
import boto3
from botocore.exceptions import ClientError

Create S3 Bucket

# Use your own credentials here; avoid hard-coding real keys in production code
AWS_ACCESS_KEY_ID = 'Use your own credentials'
AWS_SECRET_ACCESS_KEY = 'Use your own credentials'

# Create a low-level S3 client in the us-west-2 region
s3_client = boto3.client('s3', region_name='us-west-2',
                         aws_access_key_id=AWS_ACCESS_KEY_ID,
                         aws_secret_access_key=AWS_SECRET_ACCESS_KEY)

# Regions other than us-east-1 require an explicit location constraint
location = {'LocationConstraint': 'us-west-2'}
s3_client.create_bucket(Bucket='coursera-s3-bucket',
                        CreateBucketConfiguration=location)

The response of the function call above is shown below:

{'ResponseMetadata': {'RequestId': '355VX5QNYSQBTSCM',
  'HostId': '7jXN853VP175Fw/il1Zvx8UXkfRsdQRXH3VrAFOcCYZl4y2ZTF6zNPp6tXvwnpBGlmAKTCP9RFA=',
  'HTTPStatusCode': 200,
  'HTTPHeaders': {'x-amz-id-2': '7jXN853VP175Fw/il1Zvx8UXkfRsdQRXH3VrAFOcCYZl4y2ZTF6zNPp6tXvwnpBGlmAKTCP9RFA=',
   'x-amz-request-id': '355VX5QNYSQBTSCM',
   'date': 'Thu, 11 Nov 2021 11:01:14 GMT',
   'location': 'http://coursera-s3-bucket.s3.amazonaws.com/',
   'server': 'AmazonS3',
   'content-length': '0'},
  'RetryAttempts': 0},
 'Location': 'http://coursera-s3-bucket.s3.amazonaws.com/'}
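The ClientError exception imported earlier can be used to handle failures; create_bucket raises it, for example, when the bucket name is already taken or the request is rejected. Below is a minimal sketch of such error handling (the helper name create_bucket_safe is only an illustration, not part of Boto3):

def create_bucket_safe(client, bucket_name, region='us-west-2'):
    """Create a bucket and return True on success, False on a client error."""
    try:
        client.create_bucket(
            Bucket=bucket_name,
            CreateBucketConfiguration={'LocationConstraint': region})
    except ClientError as error:
        # e.g. BucketAlreadyExists, BucketAlreadyOwnedByYou, AccessDenied
        logging.error(error)
        return False
    return True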

Access to S3 can be controlled via the following mechanisms (a small ACL sketch follows the list):

  • IAM policies
  • Bucket policies
  • Access control lists
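
As a small illustration of the last item, a canned ACL can be applied to a bucket with put_bucket_acl and inspected with get_bucket_acl. The sketch below simply (re)sets the example bucket to private:

# Apply the 'private' canned ACL to the bucket: only the owner has access
s3_client.put_bucket_acl(Bucket='coursera-s3-bucket', ACL='private')

# Inspect the grants currently attached to the bucket
acl = s3_client.get_bucket_acl(Bucket='coursera-s3-bucket')
print(acl['Grants'])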

Upload an object to a bucket
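
A minimal sketch of uploading and downloading an object with the client and bucket created above (the local file names and object key are placeholders). Note that upload_file automatically switches to a multi-part transfer for large files:

# Upload a local file as an object; the Key is the object's name inside the bucket
s3_client.upload_file(Filename='local-data.csv',
                      Bucket='coursera-s3-bucket',
                      Key='data/data.csv')

# Download the object back to a local file
s3_client.download_file(Bucket='coursera-s3-bucket',
                        Key='data/data.csv',
                        Filename='downloaded-data.csv')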

Bucket policies

The policy attached to a bucket can be retrieved as follows:

result = s3_client.get_bucket_policy(Bucket='bucket-name')

The call above fails because, by default, no policy is attached to a bucket. A bucket's policy can be set by calling the put_bucket_policy method. A policy is defined in the same JSON format as an IAM policy.

The Sid (statement ID) is an optional identifier that you provide for the policy statement. You can assign a Sid value to each statement in a statement array.

The Effect element is required and specifies whether the statement results in an allow or an explicit deny. Valid values for Effect are Allow and Deny.

By default, access to resources is denied.

Use the Principal element in a policy to specify the principal that is allowed or denied access to a resource.

You can specify any of the following principals in a policy:

  • AWS account and root user
  • IAM users
  • Federated users (using web identity or SAML federation)
  • IAM roles
  • Assumed-role sessions
  • AWS services
  • Anonymous users

The Action element describes the specific action or actions that will be allowed or denied.

We specify a value using a service namespace as an action prefix (iam, ec2, sqs, sns, s3, etc.) followed by the name of the action to allow or deny.

The Resource element specifies the object or objects that the statement covers. We specify a resource using an ARN. Amazon Resource Names (ARNs) uniquely identify AWS resources.
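
Putting these elements together, a bucket policy can be attached with put_bucket_policy, which expects the policy as a JSON string. The sketch below grants anonymous read access to the objects of the example bucket and is meant only as an illustration:

import json

# Allow anyone (Principal '*') to read objects from the bucket
bucket_policy = {
    'Version': '2012-10-17',
    'Statement': [{
        'Sid': 'PublicReadGetObject',
        'Effect': 'Allow',
        'Principal': '*',
        'Action': 's3:GetObject',
        'Resource': 'arn:aws:s3:::coursera-s3-bucket/*'
    }]
}

# put_bucket_policy expects the policy as a JSON string
s3_client.put_bucket_policy(Bucket='coursera-s3-bucket',
                            Policy=json.dumps(bucket_policy))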

CORS Configuration

A bucket's CORS rules can be read with get_bucket_cors and set with put_bucket_cors. Note that get_bucket_cors raises a ClientError if no CORS configuration has been set on the bucket yet.

# Retrieve the current CORS rules (fails if none have been configured)
response = s3_client.get_bucket_cors(Bucket=bucket_name)
print(response['CORSRules'])

# Define a CORS configuration that allows GET and PUT requests from any origin
cors_configuration = {
    'CORSRules': [{'AllowedHeaders': ['Authorization'],
                   'AllowedMethods': ['GET', 'PUT'],
                   'AllowedOrigins': ['*'],
                   'ExposeHeaders': ['ETag'],
                   'MaxAgeSeconds': 3000}]
}

# Attach the CORS configuration to the bucket
response = s3_client.put_bucket_cors(Bucket=bucket_name,
                                     CORSConfiguration=cors_configuration)

AWS Glacier

Glacier is the cheapest storage option in AWS and is used for long-term storage of data. Just like S3, it is a serverless service; however, data in Glacier is not as readily accessible as data in S3. Given this, Glacier is typically used for archiving data rather than for data that is accessed on a daily basis.

It provides three retrieval options (a sketch of requesting a retrieval follows the list):

  • Expedited (1-5 minutes)
  • Standard (3-5 hours)
  • Bulk (5-12 hours)
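
Objects can be written directly to the Glacier storage class by passing StorageClass to put_object, and a temporary copy of an archived object can be requested with restore_object, where the Tier parameter corresponds to the retrieval options above. A minimal sketch, assuming the client and bucket from earlier (the object key is a placeholder):

# Store an object directly in the Glacier storage class
s3_client.put_object(Bucket='coursera-s3-bucket',
                     Key='archive/report.csv',
                     Body=b'archived data',
                     StorageClass='GLACIER')

# Request a temporary restore of the archived object;
# Tier can be 'Expedited', 'Standard' or 'Bulk'
s3_client.restore_object(Bucket='coursera-s3-bucket',
                         Key='archive/report.csv',
                         RestoreRequest={'Days': 7,
                                         'GlacierJobParameters': {'Tier': 'Standard'}})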

Summary

References