Optimize S3 API cost for Data Lake


In the previous article, we discussed how to identify the S3 data sets causing high S3 API costs. Once we have that list of tables, the next step is to see whether we can reduce the costs associated with them.

How S3 API cost is calculated

While discussing API cost, the most important thing to remember is that the number of API calls depends on the number of objects added or retrieved, irrespective of object size. This means a large number of small objects can escalate the cost of GET and PUT API calls.
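
As a rough worked example (using approximate us-east-1 S3 Standard pricing at the time of writing, about $0.0004 per 1,000 GET requests): a table stored as 1,000,000 small objects costs roughly $0.40 in GET requests every time it is fully scanned, and a pipeline that scans it hourly spends close to $290 per month on GET requests alone, regardless of how small the objects are. The same data stored as a few hundred larger objects would cost a fraction of a cent per scan.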

To reduce the cost, we need to reduce the number of objects scanned when reading a table from the Data Lake using Athena or any other query engine. This can be done in two ways: partitioning and bucketing. We will discuss both, but bucketing is the technique that worked well for me.

Partitioning

Partitioning is a very well-known topic, and a detailed discussion of it is out of the scope of this article. You can read how to partition S3 data for a Data Lake in the AWS doc.

To partition an existing non-partitioned table, Athena's CTAS command can be used (AWS doc).

If the table has more than 100 partitions, you can use INSERT INTO (AWS doc), as in the sketch below.
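
As a minimal sketch of that approach (the target table name merged_tables.ticket_partitioned and the S3 location are hypothetical, and the column list is trimmed to the columns used in this article), a CTAS that partitions an existing table, followed by INSERT INTO for the remaining data, could look like this:

-- CTAS: create a partitioned copy of the raw table.
-- Note: partition columns must be the last columns in the SELECT list,
-- and a single CTAS query can write at most 100 partitions.
CREATE TABLE merged_tables.ticket_partitioned
WITH (
  format = 'PARQUET',
  external_location = 's3://your-bucket/databases/merged_tables/ticket_partitioned/',
  partitioned_by = ARRAY['year', 'month', 'day', 'hour'])
AS SELECT device_id, year, month, day, hour
FROM raw_tables.ticket
WHERE year = '2022' AND month = '11' AND day = '6';

-- Load the rest of the data in batches of up to 100 partitions each.
INSERT INTO merged_tables.ticket_partitioned
SELECT device_id, year, month, day, hour
FROM raw_tables.ticket
WHERE year = '2022' AND month = '11' AND day = '7';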

Confirm partitions are used in queries

If the API cost from a partitioned table is high, the first thing to do is to make sure partitions are actually used in the queries:

  • Confirm the partition metadata is up to date using SHOW PARTITIONS table_name. It lists all the partitions the table currently has.
  • Make sure partition keys are used while scanning data. This is very important because there are many edge cases where partition keys are ignored. For example, if the partition column is referenced only inside the WHERE clause of a subquery, Athena currently does not filter on the partition. The EXPLAIN command in Athena can be used to confirm that partition keys are used:
-- Table is partitioned by (year, month, day, hour)
EXPLAIN SELECT device_id FROM ticket
WHERE year = '2022' AND month = '11' AND day = '6' AND hour = '13'

-- Result
Query Plan
Fragment 0 [SOURCE]
    Output layout: [device_id]
    Output partitioning: SINGLE []
    Output[columnNames = [device_id]]
    │   Layout: [device_id:varchar]
    │   Estimates:
    └─ TableScan[table = awsdatacatalog:<athena_db_name>:ticket]
           Layout: [device_id:varchar]
           Estimates:
           device_id := device_id:string:REGULAR
           year:string:PARTITION_KEY
               :: [[2022]]
           day:string:PARTITION_KEY
               :: [[6]]
           month:string:PARTITION_KEY
               :: [[11]]
           hour:string:PARTITION_KEY
               :: [[13]]

The command output clearly shows which partition keys and values will be used during query execution. This can be done for any SQL on Athena: if no partition keys are mentioned as shown above, or if a key is listed with all of its possible values, then no partition pruning is happening and a full table scan will occur.

If partition keys are not used during query execution, try to change the SQL if possible; otherwise, bucketing may be the best solution, which is the next topic.

Bucketing basics

Bucketing is a technique that groups data based on specific columns together within a single partition. These columns are known as bucket keys. While creating the table, we can specify the number of files the table should have in its S3 path. As the number of files decreases drastically, the S3 API cost reduces in the same proportion. This is the most effective way to tackle huge API costs.

Athena supports the CTAS command for creating bucketed tables. Use a column with high cardinality as the bucket key:

CREATE TABLE merged_tables.ticket
WITH (
  format = 'PARQUET',
  external_location = 's3://de-experements/databases/merged_tables/ticket',
  bucketed_by = ARRAY['device_id'],
  bucket_count = 10)
AS SELECT *
FROM raw_tables.ticket;
-- The target table will have exactly 10 files in S3. If the source table has
-- 10K files and queries did a full table scan, we will see an API call count
-- reduction of ~1000x for this data set.

Confirm bucketed table is more cost-effective

Now we know how to create bucketed tables using SQL. The best way to make sure bucketed tables are more cost-effective is to use them when backfilling SQL pipelines.

Say you have a new SQL query from an analyst. They want the target table backfilled for the last three years, which means you have to run the SQL hundreds of times (if the SQL processes one day at a time and the result is appended):

  • Create bucketed tables for each source table. Use a separate S3 bucket for the bucketed tables' location.
  • Point the SQL to the bucketed tables and run the backfill SQLs from the scheduler (we use Airflow).
  • After the backfill is completed, check the S3 API cost incurred by the S3 bucket used for the bucketed tables.

If no backfills are to be done, you can create bucketed tables, repeatedly run some daily SQLs on top of them, and compare the API cost, for example with a query like the one sketched below.
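
As a minimal sketch of that comparison, assuming you already have the AWS Cost and Usage Report exposed as an Athena table (the name cur_db.cur_table below is hypothetical; the column names follow the standard CUR schema), you could aggregate S3 request costs per bucket and compare the bucketed-table bucket against the raw one:

-- Compare S3 request costs per bucket for a given day.
-- line_item_resource_id holds the bucket name for S3 line items;
-- usage types containing 'Requests' cover GET/PUT/LIST request charges.
SELECT line_item_resource_id         AS bucket_name,
       line_item_usage_type          AS usage_type,
       SUM(line_item_usage_amount)   AS request_count,
       SUM(line_item_unblended_cost) AS request_cost_usd
FROM cur_db.cur_table
WHERE line_item_product_code = 'AmazonS3'
  AND line_item_usage_type LIKE '%Requests%'
  AND line_item_usage_start_date >= TIMESTAMP '2022-11-06 00:00:00'
  AND line_item_usage_start_date <  TIMESTAMP '2022-11-07 00:00:00'
GROUP BY 1, 2
ORDER BY request_cost_usd DESC;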

How to bucket an existing table

In most cases, the data is written to S3 from batch or streaming pipelines one file at a time, and it is not practical to re-create the bucketed table each time a write happens to S3.

Here is a simple strategy to use. Assume we have one raw_table and a corresponding bucketed_table (created from raw_table):

  • We keep raw_table intact. Writes to raw_table's S3 location happen as usual.
  • bucketed_table is created from raw_table every week (or at any frequency you decide). Any files written to raw_table's S3 location after this table creation are also copied to bucketed_table's S3 location.
  • So bucketed_table will have the bucketed files plus additional files for the last N days.
  • The next week, we drop bucketed_table, clean its S3 location, and re-create it from raw_table (see the sketch after this list). Now bucketed_table again has only bucketed files.
  • If raw_table is partitioned, its partition keys are available as additional columns in the table, so while writing individual files to bucketed_table, make sure those columns are added to the files.
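
A minimal sketch of the weekly re-create step, reusing the table names from the CTAS example above (note that DROP TABLE in Athena removes only the metadata, so the S3 location has to be cleaned separately before the CTAS can write to it again):

-- 1. Drop the old bucketed table (metadata only; S3 objects remain).
DROP TABLE IF EXISTS merged_tables.ticket;

-- 2. Clean the S3 location outside Athena, e.g.:
--    aws s3 rm --recursive s3://de-experements/databases/merged_tables/ticket/

-- 3. Re-create the bucketed table from the raw table.
CREATE TABLE merged_tables.ticket
WITH (
  format = 'PARQUET',
  external_location = 's3://de-experements/databases/merged_tables/ticket',
  bucketed_by = ARRAY['device_id'],
  bucket_count = 10)
AS SELECT *
FROM raw_tables.ticket;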

Point all SQLs to run on bucketed_table and you will see a drastic reduction in S3 API cost; as a bonus, most of the queries will also be much faster.

Conclusion

Bucketing tables is a very useful technique for reducing S3 API cost, and it can be implemented for production pipelines without much effort.
