From Excel Chaos to Data Clarity: Transform and Deploy with Python and AWS

Data transformation is crucial when you want to use multiple Excel files as sources for forecasting models, machine learning models, or software applications. This post explores how to transform data efficiently using Python and Pandas, ensuring your data is clean, well-structured, and ready for deployment. It also covers how to deploy your transformation code to AWS using the AWS Serverless Application Model (SAM) and how to set up a CI/CD pipeline that deploys automatically whenever new code is committed to GitHub.

Before diving into the technicalities, let’s understand why data transformation is essential:

Consistency: Ensures data from various sources follows a uniform structure.

Quality: Cleanses and filters out any inconsistencies or errors.

Compatibility: Makes data ready for various forecasting or machine learning models or for use within software applications.

Tools of the Trade: Python and Pandas

Before starting, ensure you have the following installed:

  • Python 3.x
  • Pandas library (pip install pandas)
  • AWS CLI (pip install awscli)
  • AWS SAM CLI (pip install aws-sam-cli)

Step 1: Reading and Combining Excel Files

Let’s assume we have multiple Excel files. If the files have similar structures, we can read them directly into a single Pandas DataFrame. However, if they have different structures, we need to handle each file according to its unique schema before combining them.

Handling Files with Similar Structure

import pandas as pd
import glob

# Define the path to the Excel files
file_path = '<path>/*.xlsx'

# Use glob to get all file paths
excel_files = glob.glob(file_path)

# Read each file and collect the DataFrames in a list
dataframes = [pd.read_excel(file) for file in excel_files]

# Concatenate them into a single DataFrame
combined_df = pd.concat(dataframes, ignore_index=True)

# Display the combined DataFrame
print(combined_df.head())

Handling Files with Different Structures

If the files have different structures, follow these steps:

  1. Identify Common Columns: Determine the columns that are common across all files.
  2. Normalize Structures: Standardize the structure of each file to match a common format.
  3. Combine Data: Merge the standardized DataFrames.

import pandas as pd
import glob

# Define the path to the Excel files
file_path = '<path>/*.xlsx'

# Use glob to get all file paths
excel_files = glob.glob(file_path)

# Define a function to normalize each DataFrame
def normalize_df(df):
    # Select common columns and fill missing columns with NaN
    common_columns = ['CommonColumn1', 'CommonColumn2', 'CommonColumn3']
    for column in common_columns:
        if column not in df.columns:
            df[column] = pd.NA
    return df[common_columns]

# Read each file, normalize it, and collect the DataFrames in a list
dataframes = [normalize_df(pd.read_excel(file)) for file in excel_files]

# Concatenate them into a single DataFrame
combined_df = pd.concat(dataframes, ignore_index=True)

# Display the combined DataFrame
print(combined_df.head())

Step 2: Data Cleaning and Validation

Now that we have a combined DataFrame, let’s clean and validate the data to ensure it meets quality standards. Each process will have its own specific rules, but here are some of the most common practices.

Data Cleaning

Remove Unnecessary Columns

# Drop unnecessary columns
columns_to_drop = ['UnwantedColumn1', 'UnwantedColumn2']
cleaned_df = combined_df.drop(columns=columns_to_drop)

Handle Missing Values

# Handle missing values
cleaned_df = cleaned_df.ffill() # Forward-fill missing values

Data Validation

Check for Duplicates

# Remove duplicate rows
cleaned_df = cleaned_df.drop_duplicates()

Validate Data Types

# Convert data types if necessary
cleaned_df['DateColumn'] = pd.to_datetime(cleaned_df['DateColumn'])
cleaned_df['NumericColumn'] = cleaned_df['NumericColumn'].astype(float)

Consistency Checks

# Ensure no negative values in a specific column
cleaned_df = cleaned_df[cleaned_df['NumericColumn'] >= 0]

Step 3: Data Transformation

After cleaning and validating the data, we can transform it to match our predefined format, making it ready for machine learning models or software applications. Here are several common transformation methods:

Renaming Columns

# Rename columns to match the predefined format
columns_rename = {
  'OldName1': 'NewName1',
  'OldName2': 'NewName2'
}
cleaned_df = cleaned_df.rename(columns=columns_rename)

Aggregating Data

If your model requires aggregated data (e.g., weekly or monthly summaries), you can perform the aggregation:

# Aggregate data by a time period, for example, monthly
transformed_df = cleaned_df.resample('M', on='DateColumn').sum().reset_index()

Feature Engineering

Creating new features can enhance the predictive power of your models:

# Create new features
transformed_df['NewFeature'] = transformed_df['NumericColumn'] * transformed_df['AnotherColumn']

Encoding Categorical Variables

Machine learning models often require categorical variables to be encoded as numerical values.

# One-hot encode categorical variables
transformed_df = pd.get_dummies(transformed_df, columns=['CategoricalColumn'])

Scaling Numerical Data

Scaling numerical data keeps features with large ranges from dominating features with smaller ranges, so that each one contributes more evenly to the model. Scaling transforms the data to fit within a certain range or distribution. This helps in:

  • Normalization: Scaling the data to a range of [0, 1] or [-1, 1].
  • Standardization: Scaling the data to have a mean of 0 and a standard deviation of 1.

from sklearn.preprocessing import StandardScaler

# Scale numerical columns
scaler = StandardScaler()
transformed_df[['NumericColumn1', 'NumericColumn2']] = scaler.fit_transform(transformed_df[['NumericColumn1', 'NumericColumn2']])
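
The snippet above applies standardization. For normalization to a [0, 1] range, a sketch using scikit-learn's MinMaxScaler (reusing the column names from above for illustration) could look like this:

from sklearn.preprocessing import MinMaxScaler

# Scale numerical columns to the [0, 1] range
min_max_scaler = MinMaxScaler()
transformed_df[['NumericColumn1', 'NumericColumn2']] = min_max_scaler.fit_transform(
    transformed_df[['NumericColumn1', 'NumericColumn2']]
)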

Handling Outliers

Outliers are data points that differ significantly from the rest of the distribution. They can be caused by natural variability in the data or by measurement errors, and they can skew the results of a model. Handling outliers is important to:

  • Prevent them from skewing the analysis.
  • Ensure the model is robust and reliable.

# Cap outliers at the 1st and 99th percentiles
def cap_outliers(series, lower_quantile=0.01, upper_quantile=0.99):
    lower_bound = series.quantile(lower_quantile)
    upper_bound = series.quantile(upper_quantile)
    return series.clip(lower_bound, upper_bound)

transformed_df['NumericColumn'] = cap_outliers(transformed_df['NumericColumn'])

Step 4: Deploying to AWS Using SAM

AWS Serverless Application Model (SAM) allows us to deploy our transformation module as a Lambda function, making it scalable and efficient. Let’s dive deeper into how SAM works and how to set it up for our data transformation module.

AWS SAM is an open-source framework for building serverless applications. It provides a simple and clean syntax to describe the resources needed for your application. SAM extends AWS CloudFormation to simplify the setup and deployment of serverless functions, APIs, databases, and event source mappings.

Create a SAM Template

A SAM template is written in YAML and defines the resources required by serverless applications. Here is a detailed example of a template.yaml file for our data transformation Lambda function.

AWSTemplateFormatVersion: '2010-09-09'
Transform: 'AWS::Serverless-2016-10-31'
Resources:
  DataTransformationFunction:
    Type: 'AWS::Serverless::Function'
    Properties:
      Handler: app.lambda_handler
      Runtime: python3.12
      CodeUri: .
      Timeout: 900
      MemorySize: 128
      Policies:
        - AWSLambdaBasicExecutionRole
      Environment:
        Variables:
          VARIABLE_NAME: "value"

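The template above expects a file named app.py containing a lambda_handler function. As a rough sketch of what that handler could look like, wrapping the transformation logic from the earlier steps, here is one possibility; the event shape, bucket name, object keys, and run_transformation helper are hypothetical placeholders, and reading .xlsx files also requires an Excel engine such as openpyxl among the function's dependencies:

import io

import boto3
import pandas as pd

s3 = boto3.client('s3')

def run_transformation(df):
    # Placeholder for the cleaning and transformation steps shown earlier
    return df.drop_duplicates()

def lambda_handler(event, context):
    # Hypothetical event shape: {"bucket": "...", "input_key": "...", "output_key": "..."}
    bucket = event['bucket']
    response = s3.get_object(Bucket=bucket, Key=event['input_key'])
    df = pd.read_excel(io.BytesIO(response['Body'].read()))

    transformed_df = run_transformation(df)

    # Write the transformed data back to S3 as CSV
    s3.put_object(
        Bucket=bucket,
        Key=event['output_key'],
        Body=transformed_df.to_csv(index=False).encode('utf-8'),
    )
    return {'statusCode': 200, 'rows_processed': len(transformed_df)}
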
Package and Deploy with SAM

To package and deploy your SAM application, follow these steps:

Package the Application

This step uploads your code to an S3 bucket and prepares a packaged template file for deployment.

sam package \
  --template-file template.yaml \
  --output-template-file packaged.yaml \
  --s3-bucket <your-s3-bucket>

Deploy the Application

This step deploys the packaged application to AWS, creating the necessary resources defined in your SAM template.

sam deploy \
  --template-file packaged.yaml \
  --stack-name DataTransformationStack \
  --capabilities CAPABILITY_IAM

Testing Locally with SAM Local

Before deploying to AWS, you can test your Lambda function locally using sam local. This is particularly useful for debugging and ensuring your function behaves as expected.

To see how it performs, you can invoke your Lambda function locally with a test event.

sam local invoke DataTransformationFunction --event event.json

Here, event.json is a JSON file containing a sample event payload.
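
For example, a minimal event.json matching the hypothetical handler sketched earlier might look like this (the bucket and keys are placeholders):

{
  "bucket": "<your-s3-bucket>",
  "input_key": "raw/sales_data.xlsx",
  "output_key": "processed/sales_data.csv"
}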

If your function is triggered via API Gateway, you can start a local API to test it.

sam local start-api

This command will start a local instance of API Gateway, and you can send HTTP requests to it as you would with the deployed API.
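
Note that this only works if the function has an API event source defined. As a sketch, you could add one under the function's Properties in template.yaml (the path and method here are just examples):

Events:
  TransformApi:
    Type: Api
    Properties:
      Path: /transform
      Method: post

With sam local start-api running, you can then send requests to http://127.0.0.1:3000/transform (port 3000 is the default). Keep in mind that for API events the request payload arrives JSON-encoded in event['body'], so the handler would need to parse it rather than reading keys from the event directly.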

Setting Up Environment Variables

Environment variables allow you to pass configuration settings to your Lambda function. In the template.yaml file, you can define environment variables under the Environment property.

Environment:
  Variables:
    VARIABLE_NAME: "value"
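
Inside the function, these variables can be read with Python's standard os module:

import os

# Read the environment variable defined in template.yaml
variable_value = os.environ.get('VARIABLE_NAME', 'default-value')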

Configuring IAM Roles and Policies

Your Lambda function will need permission to perform specific actions. The Policies property in the SAM template allows you to attach managed policies or inline policies directly to your Lambda function’s execution role.

Policies:
  - AWSLambdaBasicExecutionRole
  - Version: '2012-10-17'
    Statement:
      - Effect: Allow
        Action:
          - s3:GetObject
          - s3:PutObject
        Resource: arn:aws:s3:::<your-s3-bucket>/*

Testing the Deployment

After deploying the application, you can test your Lambda function using the AWS Management Console, AWS CLI, or AWS SDKs. Ensure that the function executes as expected and processes your data correctly.
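
For example, you can invoke the deployed function with the AWS CLI; replace the function name below with the physical name generated for your stack (visible in the Lambda console or the stack's resources):

aws lambda invoke \
  --function-name <deployed-function-name> \
  --payload file://event.json \
  response.json

If you are using AWS CLI v2, add --cli-binary-format raw-in-base64-out so the JSON payload is sent as-is instead of being treated as base64.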

Step 5: Setting Up CI/CD for Automatic Deployment

To automate the deployment process, we can set up a CI/CD pipeline using GitHub Actions. This ensures that every time new code is committed to the repository, it is automatically packaged and deployed to AWS.

Create a GitHub Actions Workflow

  1. Create a .github/workflows folder:

mkdir -p .github/workflows

  2. Create a deploy.yml file:

name: CI/CD Pipeline

on:
  push:
    branches:
      - main

jobs:
  build-and-deploy:
    runs-on: ubuntu-latest
    env:
      AWS_ACCESS_KEY_ID: ${{ secrets.AWS_ACCESS_KEY_ID }}
      AWS_SECRET_ACCESS_KEY: ${{ secrets.AWS_SECRET_ACCESS_KEY }}
      AWS_DEFAULT_REGION: ${{ secrets.AWS_DEFAULT_REGION }}

    steps:
      - name: Checkout code
        uses: actions/checkout@v4

      - name: Set up Python
        uses: actions/setup-python@v5
        with:
          python-version: '3.12'

      - name: Install dependencies
        run: |
          python -m pip install --upgrade pip
          pip install awscli aws-sam-cli

      - name: Package application
        run: |
          sam package \
            --template-file template.yaml \
            --output-template-file packaged.yaml \
            --s3-bucket ${{ secrets.S3_BUCKET }}

      - name: Deploy application
        run: |
          sam deploy \
            --template-file packaged.yaml \
            --stack-name DataTransformationStack \
            --capabilities CAPABILITY_IAM \
            --no-fail-on-empty-changeset

Explanation of the Workflow

  • Trigger: The workflow is triggered on every push to the main branch.
  • Jobs: The build-and-deploy job runs on an ubuntu-latest virtual machine, with the AWS credentials exposed as job-level environment variables so both the package and deploy steps can use them.
  • Checkout code: Checks out the code from the repository.
  • Set up Python: Sets up Python 3.12.
  • Install dependencies: Installs awscli and aws-sam-cli.
  • Package application: Packages the application using SAM.
  • Deploy application: Deploys the application to AWS using SAM.

To securely manage AWS credentials and other sensitive information, use GitHub Secrets. Add the following secrets to your GitHub repository:

  • AWS_ACCESS_KEY_ID
  • AWS_SECRET_ACCESS_KEY
  • AWS_DEFAULT_REGION
  • S3_BUCKET
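
You can add them under Settings → Secrets and variables → Actions in your repository, or from the command line with the GitHub CLI, for example:

gh secret set AWS_ACCESS_KEY_ID
gh secret set AWS_SECRET_ACCESS_KEY
gh secret set AWS_DEFAULT_REGION
gh secret set S3_BUCKET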

Conclusion

Data transformation from multiple Excel files to a predefined format is a vital step for effective forecasting or machine learning models and software applications. By leveraging Python, Pandas, and AWS SAM, you can automate, version control, and scale your data processing pipeline efficiently. Additionally, setting up a CI/CD pipeline ensures that your deployment process is smooth and automatic, enhancing productivity and reducing manual errors.

Feel free to explore and adapt the provided code and templates to suit your specific requirements. Happy transforming!