For some use cases it makes real sense to integrate advanced machine learning models, such as Large Language Models (LLMs), into your data analytics workflow. Athena lets us analyze data directly in Amazon S3 using standard SQL; here we go beyond traditional SQL queries and leverage the power of LLMs to gain deeper insights. This blog walks you through calling LLMs from Athena using User-Defined Functions (UDFs) implemented in Python via AWS Lambda (the full Lambda code is also shared).
Example Use Case
The table has tyre size information entered by users, and it varies a lot in format. The LLM is used to correct the values into a standard format:
Example data:
SELECT distinct tyre_size_front_left
FROM "<db>"."table";
-- Result:
-- 245/45/R19
-- 295/45R20
-- 225/ZR20
-- 275/60/R20
-- 122213
-- 275/35/R20
Now let's walk through the steps to create the Lambda and call it from SQL:
Create Lambda
First, make sure the IAM principal (user, role, etc.) has the proper permissions; check here and confirm. Nothing complicated: Athena query execution and Lambda invocation permissions are enough.
Then create a Lambda using the code from this repo:
- The Lambda requires the pyarrow module; for that, use the Lambda layer arn:aws:lambda:us-east-1:336392948345:layer:AWSSDKPandas-Python311:18.
- BaseAthenaUDF is the base class the UDF Lambda needs in order to work with Athena, and the handle_athena_record method is implemented to handle each request.
- I used AWS Bedrock-hosted LLMs, but you can use any provider you wish.
Athena sends a batch of records (rows) to the function. Inside the function we can split those rows further and send them to the LLM as individual requests. This chunking and the concurrent requests are handled by the code; you can adjust chunk_size in the base class. A simplified sketch of the handler is shown below.
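To make the flow concrete, here is a minimal sketch of what such a UDF Lambda can look like. It is a simplification, not the code from the repo: it assumes BaseAthenaUDF comes from the athena-udf package, processes one value per Bedrock call with an assumed Claude model ID and prompt, and leaves out the chunking and concurrency that the shared code handles.

# Minimal sketch of the UDF Lambda (simplified; the shared repo also chunks
# records and fires concurrent LLM requests).
import json

import boto3
import athena_udf  # provides BaseAthenaUDF; pyarrow comes from the Pandas layer

bedrock = boto3.client("bedrock-runtime")

# Assumed model ID and prompt template; adjust for your account and region.
MODEL_ID = "anthropic.claude-3-haiku-20240307-v1:0"
PROMPT = (
    "Correct this tyre size string into the standard format (e.g. 255/50R20). "
    'Answer with a JSON list, using "No correction needed" if it is already valid: {value}'
)

class TyreSizeUDF(athena_udf.BaseAthenaUDF):
    @staticmethod
    def handle_athena_record(input_schema, output_schema, arguments):
        # Athena passes the UDF arguments for one record:
        # arguments[0] is the analysis type, arguments[1] the raw tyre size value.
        value = arguments[1]
        body = json.dumps({
            "anthropic_version": "bedrock-2023-05-31",
            "max_tokens": 100,
            "messages": [{"role": "user", "content": PROMPT.format(value=value)}],
        })
        response = bedrock.invoke_model(modelId=MODEL_ID, body=body)
        payload = json.loads(response["body"].read())
        # Claude 3 models return a list of content blocks; the first holds the text.
        return payload["content"][0]["text"]

# Entry point configured as the Lambda handler.
lambda_handler = TyreSizeUDF().lambda_handler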
Use the Lambda in SQL
Once the Lambda is created successfully, we can call it from Athena like this:
-- Name of the lambda is athena_external_udf
USING EXTERNAL FUNCTION tyre_ai_suggest(col1 varchar,col2 varchar) RETURNS varchar LAMBDA 'athena_external_udf'
with tbl1 as (SELECT distinct tyre_size_front_left
FROM "<db>"."table" limit 1330)
select tyre_size_front_left,tyre_ai_suggest('tyre_size_validation',tyre_size_front_left) as ai_suggest_tyre_size_front_left
from tbl1 where length(tyre_size_front_left)>3 ;
-- Result
-- tyre_size_front_left,ai_suggest_tyre_size_front_left
-- 245/60R18,["No correction needed"]
-- 255/50/R20,["255/50R20"]
-- 275/60r20,["275/60R20"]
-- 215/55/R17,["No correction needed"]
-- 235/55 R20,["235/55R20"]
-- 205/60/R16,["No correction needed"]
-- 205/55/R17,["No correction needed"]
-- P25550r20,["255/50R20"]
-- 215 55r17,["215/55R17"]
-- 285/45/R22,["No correction needed"]
-- 265/50/20,["265/50R20"]
-- 255/45 R20,["255/45R20"]
-- 225 40zr18 92 y,["225/40ZR18 92Y"]
-- 205/50R17,["No correction needed"]
-- 255/70R18,["No correction needed"]
-- 265/60R18 110H,["No correction needed"]
-- LT315/70R17,["No correction needed"]
-- 225/60R18 100H,["No correction needed"]
-- 235/55R20,["No correction needed"]
Things to Consider
Keep these points in check to avoid surprise AWS bills:
- The batching of records is handled by Athena, and sometimes the batches can be very small. For example, Athena may call the Lambda with just 10 records, which ends up as one LLM call for those 10 records (even if you set chunk_size to, say, 1000). If you add a big context to the LLM along with the records, this results in far higher token usage. Always keep Lambda throttling and other alerts in place so that LLM cost is tracked.
- You can use a single Lambda for multiple use cases and pass the type of analysis as an argument (the first one in the example above); see the sketch below.
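As an illustration, the Lambda can keep a small registry of prompts keyed by that analysis-type argument. This is a hypothetical sketch, not code from the shared repo: the prompt texts, the vin_validation entry, and the build_prompt helper are made-up examples.

# Hypothetical prompt registry: one Lambda, several analyses, selected by the
# analysis-type argument passed from the SQL query.
PROMPTS = {
    "tyre_size_validation": "Correct this tyre size into the standard format: {value}",
    "vin_validation": "Check whether this VIN looks plausible and return it cleaned: {value}",  # made-up example
}

def build_prompt(analysis_type: str, value: str) -> str:
    # Fail loudly on unknown analysis types instead of spending tokens on a bad prompt.
    template = PROMPTS.get(analysis_type)
    if template is None:
        raise ValueError(f"Unknown analysis type: {analysis_type}")
    return template.format(value=value)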