From Excel Chaos to Data Clarity: Transform and Deploy with Python and AWS
Data transformation is crucial for leveraging multiple Excel files as sources for forecasting models, machine learning models, or software applications. This post explores how to efficiently transform data using Python and Pandas, ensuring your data is clean, well-structured, and ready for deployment. The post will also cover how to deploy your transformation code to AWS using a Serverless Application Model (SAM) and set up a CI/CD pipeline for automatic deployment whenever new code is committed to GitHub.
Before diving into the technicalities, let’s understand why data transformation is essential:
Consistency: Ensures data from various sources follows a uniform structure.
Quality: Cleanses and filters out any inconsistencies or errors.
Compatibility: Makes data ready for various forecasting or machine learning models or for use within software applications.
Tools of the Trade: Python and Pandas
Before starting, ensure you have the following installed:
- Python 3.x
- Pandas library (pip install pandas)
- AWS CLI (pip install awscli)
- AWS SAM CLI (pip install aws-sam-cli)
Step 1: Reading and Combining Excel Files
Let’s assume we have multiple Excel files. If the files have similar structures, we can read them directly into a single Pandas DataFrame. However, if they have different structures, we need to handle each file according to its unique schema before combining them.
Handling Files with Similar Structure
import pandas as pd
import glob
# Define the path to the Excel files
file_path = '<path>/*.xlsx'
# Use glob to get all file paths
excel_files = glob.glob(file_path)
# Initialize an empty DataFrame
combined_df = pd.DataFrame()
# Loop through each file and append to the combined DataFrame
for file in excel_files:
df = pd.read_excel(file)
combined_df = pd.concat([combined_df, df], ignore_index=True)
# Display the combined DataFrame
print(combined_df.head())
Handling Files with Different Structures
If the files have different structures, follow these steps:
- Identify Common Columns: Determine the columns that are common across all files.
- Normalize Structures: Standardize the structure of each file to match a common format.
- Combine Data: Merge the standardized DataFrames.
import pandas as pd
import glob
# Define the path to the Excel files
file_path = '<path>/*.xlsx'
# Use glob to get all file paths
excel_files = glob.glob(file_path)
# Define a function to normalize each DataFrame
def normalize_df(df):
# Select common columns and fill missing columns with NaN
common_columns = ['CommonColumn1', 'CommonColumn2', 'CommonColumn3']
for column in common_columns:
if column not in df.columns:
df[column] = pd.NA
return df[common_columns]
# Initialize an empty DataFrame
combined_df = pd.DataFrame()
# Loop through each file, normalize, and append to the combined DataFrame
for file in excel_files:
df = pd.read_excel(file)
df = normalize_df(df)
combined_df = pd.concat([combined_df, df], ignore_index=True)
# Display the combined DataFrame
print(combined_df.head())
Step 2: Data Cleaning and Validation
Now that we have a combined DataFrame, let’s clean and validate the data to ensure it meets quality standards. Each process will have its own specific rules, but here are some of the most common practices.
Data Cleaning
Remove Unnecessary Columns
# Drop unnecessary columns
columns_to_drop = ['UnwantedColumn1', 'UnwantedColumn2']
cleaned_df = combined_df.drop(columns=columns_to_drop)
Handle Missing Values
# Handle missing values
cleaned_df = cleaned_df.fillna(method='ffill') # Forward fill missing values
Data Validation
Check for Duplicates
# Remove duplicate rows
cleaned_df = cleaned_df.drop_duplicates()
Validate Data Types
# Convert data types if necessary
cleaned_df['DateColumn'] = pd.to_datetime(cleaned_df['DateColumn'])
cleaned_df['NumericColumn'] = cleaned_df['NumericColumn'].astype(float)
Consistency Checks
# Ensure no negative values in a specific column
cleaned_df = cleaned_df[cleaned_df['NumericColumn'] >= 0]
Step 3: Data Transformation
After cleaning and validating the data, we can transform it to match our predefined format, making it ready for machine learning models or software applications. Here are several common transformation methods:
Renaming Columns
# Rename columns to match the predefined format
columns_rename = {
'OldName1': 'NewName1',
'OldName2': 'NewName2'
}
cleaned_df = cleaned_df.rename(columns=columns_rename)
Aggregating Data
If your model requires aggregated data (e.g., weekly or monthly summaries), you can perform the aggregation:
# Aggregate data by a time period, for example, monthly
transformed_df = cleaned_df.resample('M', on='DateColumn').sum().reset_index()
Feature Engineering
Creating new features can enhance the predictive power of your models:
# Create new features
transformed_df['NewFeature'] = transformed_df['NumericColumn'] * transformed_df['AnotherColumn']
Encoding Categorical Variables
Machine learning models often require categorical variables to be encoded as numerical values
# One-hot encode categorical variables
transformed_df = pd.get_dummies(cleaned_df, columns=['CategoricalColumn'])
Scaling Numerical Data
Scaling numerical data ensures that all features contribute equally to the model. Scaling transforms the data to fit within a certain range or distribution. This helps in:
- Normalization: Scaling the data to a range of [0, 1] or [-1, 1].
- Standardization: Scaling the data to have a mean of 0 and a standard deviation of 1.
from sklearn.preprocessing import StandardScaler
# Scale numerical columns
scaler = StandardScaler()
transformed_df[['NumericColumn1', 'NumericColumn2']] = scaler.fit_transform(transformed_df[['NumericColumn1', 'NumericColumn2']])
Handling Outliers
Outliers as data points that are significantly different from others can skew the results of a model and may need to be addressed. Outliers are values that lie outside the overall pattern of distribution. They can be caused by variability in the data or errors in measurement. Handling outliers is important to:
- Prevent them from skewing the analysis.
- Ensure the model is robust and reliable.
# Cap outliers
def cap_outliers(series, lower_quantile=0.01, upper_quantile=0.99):
lower_bound = series.quantile(lower_quantile)
upper_bound = series.quantile(upper_quantile)
return series.clip(lower_bound, upper_bound)
transformed_df['NumericColumn'] = cap_outliers(transformed_df['NumericColumn'])
Step 4: Deploying to AWS Using SAM
AWS Serverless Application Model (SAM) allows us to deploy our transformation module as a Lambda function, making it scalable and efficient. Let’s dive deeper into how SAM works and how to set it up for our data transformation module.
AWS SAM is an open-source framework for building serverless applications. It provides a simple and clean syntax to describe the resources needed for your application. SAM extends AWS CloudFormation to simplify the setup and deployment of serverless functions, APIs, databases, and event source mappings.
Create a SAM Template
A SAM template is written in YAML and defines the resources required by serverless applications. Here is a detailed example of a template.yaml file for our data transformation Lambda function.
AWSTemplateFormatVersion: '2010-09-09'
Transform: 'AWS::Serverless-2016-10-31'
Resources:
DataTransformationFunction:
Type: 'AWS::Serverless::Function'
Properties:
Handler: app.lambda_handler
Runtime: python3.8
CodeUri: .
Timeout: 900
MemorySize: 128
Policies:
- AWSLambdaBasicExecutionRole
Environment:
Variables:
VARIABLE_NAME: "value"
Package and Deploy with SAM
To package and deploy your SAM application, follow these steps:
Package the Application
This step uploads your code to an S3 bucket and prepares a packaged template file for deployment.
sam package \
- template-file template.yaml \
- output-template-file packaged.yaml \
- s3-bucket <your-s3-bucket>
Deploy the Application
This step deploys the packaged application to AWS, creating the necessary resources defined in your SAM template.
sam deploy \
- template-file packaged.yaml \
- stack-name DataTransformationStack \
- capabilities CAPABILITY_IAM
Testing Locally with SAM Local
Before deploying to AWS, you can test your Lambda function locally using sam local. This is particularly useful for debugging and ensuring your function behaves as expected.
To see how it performs, you can invoke your Lambda function locally with a test event.
sam local invoke DataTransformationFunction - event event.json
Here, event.json is a JSON file containing a sample event payload.
If your function is triggered via API Gateway, you can start a local API to test it.
sam local start-api
This command will start a local instance of API Gateway, and you can send HTTP requests to it as you would with the deployed API.
Setting Up Environment Variables
Environment variables allow you to pass configuration settings to your Lambda function. In the template.yaml file, you can define environment variables under the Environment property.
Environment:
Variables:
VARIABLE_NAME: "value"
Configuring IAM Roles and Policies
Your Lambda function will need permission to perform specific actions. The Policies property in the SAM template allows you to attach managed policies or inline policies directly to your Lambda function’s execution role.
Policies:
- AWSLambdaBasicExecutionRole
- PolicyName: CustomPolicy
PolicyDocument:
Statement:
- Effect: Allow
Action:
- s3:GetObject
- s3:PutObject
Resource: arn:aws:s3:::<your-s3-bucket>/*
Testing the Deployment
After deploying the application, you can test your Lambda function using the AWS Management Console, AWS CLI, or AWS SDKs. Ensure that the function executes as expected and processes your data correctly.
Step 5: Setting Up CI/CD for Automatic Deployment
To automate the deployment process, we can set up a CI/CD pipeline using GitHub Actions. This ensures that every time new code is committed to the repository, it is automatically packaged and deployed to AWS.
Create a GitHub Actions Workflow
- Create a .github/workflows folder
mkdir -p .github/workflows
- Create a deploy.yml file
name: CI/CD Pipeline
on:
push:
branches:
- main
jobs:
build-and-deploy:
runs-on: ubuntu-latest
steps:
- name: Checkout code
uses: actions/checkout@v4
- name: Set up Python
uses: actions/setup-python@v5
with:
python-version: 3.8
- name: Install dependencies
run: |
python -m pip install --upgrade pip
pip install awscli aws-sam-cli
- name: Package application
run: |
sam package \
--template-file template.yaml \
--output-template-file packaged.yaml \
--s3-bucket ${{ secrets.S3_BUCKET }}
- name: Deploy application
run: |
sam deploy \
--template-file packaged.yaml \
--stack-name DataTransformationStack \
--capabilities CAPABILITY_IAM
env:
AWS_ACCESS_KEY_ID: ${{ secrets.AWS_ACCESS_KEY_ID }}
AWS_SECRET_ACCESS_KEY: ${{ secrets.AWS_SECRET_ACCESS_KEY }}
AWS_DEFAULT_REGION: ${{ secrets.AWS_DEFAULT_REGION }}
Explanation of the Workflow
- Trigger: The workflow is triggered on every push to the main branch.
- Jobs: The build-and-deploy job runs on an ubuntu-latest virtual machine.
- Checkout code: Check out the code from the repository.
- Set up Python: Set up Python 3.8.
- Install dependencies: Installs awscli and aws-sam-cli.
- Package application: Packages the application using SAM.
- Deploy application: Deploys the application to AWS using SAM.
To securely manage AWS credentials and other sensitive information, use GitHub Secrets. Add the following secrets to your GitHub repository:
- AWS_ACCESS_KEY_ID
- AWS_SECRET_ACCESS_KEY
- AWS_DEFAULT_REGION
- S3_BUCKET
Conclusion
Data transformation from multiple Excel files to a predefined format is a vital step for effective forecasting or machine learning models and software applications. By leveraging Python, Pandas, and AWS SAM, you can automate, version control, and scale your data processing pipeline efficiently. Additionally, setting up a CI/CD pipeline ensures that your deployment process is smooth and automatic, enhancing productivity and reducing manual errors.
Feel free to explore and adapt the provided code and templates to suit your specific requirements. Happy transforming!