How to handle user actions collected in Mixpanel and/or Segment
This article will explain how to extract collected user actions from Mixpanel and Segment to AWS S3 bucket(s), how to configure Mixpanel/Segment pipelines and later how to query extracted data within Redshift.
Business owners wants to know as much as possible about user behavior within an application(s) or wants to have a set of reports that will give a clear picture about user engagement and retention, the activities that provide the information important for planning new features or preparing a pitch for investors.
This article will explain how to extract collected user actions from Mixpanel and Segment to AWS S3 bucket(s), how to configure Mixpanel/Segment pipelines and later how to query extracted data within Redshift. What this article will not cover is Mixpanel/Segment tracking configuration and Redshift cluster configuration because the documentation is self explanatory.
Data collection
The first step is to define user actions that will be tracked. Let’s imagine that we have a mobile application that is an index of all public companies in a country and the end user can follow some of them and check their insights and monthly reports. In this use case, one of the user activities that may be monitored is the followed action, hence each click on the Follow button within the application should be tracked. Apart from the clicks, we have to know which company is followed and from which application screen.
For user actions tracking we can use different 3rd party services like Mixpanel (https://mixpanel.com/) or Segment (https://segment.com/).
The image above illustrates a collection of custom defined information that will be utilised for the followed action tracking. Except for custom defined metadata, every third-party service can automatically monitor some common user data such as device details, operating system details, user location, and so on.
Here is an example of Mixpanel properties that are attached to each track event:
A similar set of common user data is attached to each Segment tracking event.
However, this article will not focus on the ways of tracking as documentation is pretty clear and integration is pretty straight forward.
Data Extraction
Mixpanel
In the stage when Mixpanel contains all information about the tracked actions let’s think about how to reuse the collected data. The first idea will be to query this data directly from Mixpanel. However, as Mixpanel is not designed to be used in this way and the applications will reach different rate limits, we have to think about a different approach.
For any kind of decision systems we have to execute a couple of queries and do a deeper analysis of data. So the natural approach is the usage of Data warehouses, especially in cases when, for some data analysis users need data from different data sources. For this purpose, Mixpanel Export Data API allows export of raw data either manually or through pipeline.
In order to continue, let’s explain how to configure Mixpanel pipelines for raw data extraction to AWS S3 bucket.
Before pipeline creation it is necessary to create an S3 bucket that will receive and store exported raw data and give Mixpanel the required permissions to write to the bucket. Details about S3 bucket creation and permissions are provided on this page.
Pipeline creation requests can be generated and executed via this page.
Most of the properties in the process of pipeline creation are intuitive and well described and can be selected based on your needs. An important note in this process is to set the `use_glue` parameter to `false`.
The reasons for this will be explained later in the section about data transformation.
Also, a suggestion is to use `multischema` schema type and enable load of every event type in its own dedicated database table. This brings better structuring and easier data access.
In our use case Mixpanel data will be located within:
Segment
Segment can be the main or additional source of raw data that is going to be used in data analysis. Segment also offers the extraction of raw data to AWS S3 bucket. Configuration of the export pipeline is more simple than Mixpanel but data format that is stored within the S3 bucket is harder to process. That will be explained in detail in the next sections.
In our use case we will use the same S3 bucket to store Segment raw data, so the same set of AWS permissions and roles can be used for export bucket configuration.
Export from Segment is a simple selection of new destination from Catalog (as shown on picture below)
and manual entering of connection settings (AWS Region, Bucket Name and IAM Role ARN).
In this case Segment data will be stored within
Segment does not have an option for multischema export and all tracked events are exported to the same folder, which is making data processing more complex.
Data Transformation
Data stored within S3 bucket is raw data and is not useful at the moment. To make it readable we have to structure it and transform it in a readable format.
For this purpose we will use AWS Glue that can be interpreted as a data integration service. Glue Data catalog is actually designed to structure user data and it behaves as a NoSql database with registered database and tables. Each table is described by schema and reference a path to S3 bucket where data is stored.
Let’s explain why AWS Glue configuration should be done manually and why we should not enable `use_glue` in the process of pipeline creation as mentioned in the previous section.
Exported Mixpanel data stored in S3 bucket is partitioned base on dates of tracked events so for instance all follow events that are tracked on the 1st of August 2022 will be located in s3://…/company_follow_action/2022/08/01/<fileName>.json.gz.
In case when `use_glue` is set to `true` in the process of pipeline creation and Mixpanel did configuration of Glue Data Catalog, Mixpanel creates a database, table with schema and link to the main event directory (s3://…/company_follow_action) without any partition key for data. Later, using this approach, any tool it needs to access is unable to look for the exact data set, as there is not any partition or structure associated with it.
Manual configuration on Glue Data Catalog allows data partitioning. A partition in which a record is located is symbolized by the partition key inside of each record in the table.
As a first step in the process of configuration, it’s necessary to set up AWS Glue Crawlers that are workers responsible for browsing of the directory and all subdirectories with the goal to discover how data is structured and than to create Data Catalog Table with paths and schemas according to the stored raw data.
Based on pipeline setup, that can be daily/hourly Glue Crawlers can be scheduled to run also daily or hourly. Also, per Mixpanel and Segment documentation, both sources can export data multiple times within the same directory so it is necessary to configure Glue Crawler to crawl all subfolders on each execution.
Recommendation is to create separate crawlers for each tacked event and each event folder within S3 bucket. In the Mixpanel case with multischema configured, the crawler will be configured for the company_follow_event folder, while for the Segment case it’s necessary to create separate crawlers for each source type (Android, iOS. Web, etc).
Data Access
Now when AWS Glue is configured and data is structured, the way of accessing this data is simplified. The first step will be connection to the Redshift cluster.
The whole process of cluster configuration and access to external data is explained in AWS documentation. The following instructions are an overview of process of proper Redshift configuration:
- Create an IAM role with read access to Glue and the S3 bucket with collected data
- Assign that role to the created Redshift cluster
- Run next query that should create an external database
CREATE EXTERNAL SCHEMA schema_name FROM DATA CATALOG
DATABASE ‘selected-glue-database-name’ // defined when you configured Glue
IAM_ROLE ‘created-role-arn’ // this is the ARN for the role with access to Glue + S3
CREATE EXTERNAL DATABASE IF NOT EXISTS;
This connection should be created only once. Configured pipelines will export data automatically in S3 bucket and Redshift will automatically have access to all the new data and schema changes. This means that we can write any query to extract a specific set of data.
To get information about company follow actions from Mixpanel data, it is possible to write a simple query:
SELECT companyid, distict_id, mp_os, mp_country_code, mp_processing_time as time
FROM mixpanel.mp_company_follow_actions;
To access the same set of data from Segment data set query will include the access to nested properties and condition that will include the selection of specific tracking event by name:
SELECT s.properties.companyid as companyid, s.userId, s.context.os.name as os, s.context.ip, s.timestamp as time
FROM segment.sg_android_source s
WHERE s.event = ‘company_follow_action’;
What we can notice in this approach is pretty slow query execution, especially in cases when it’s necessary to do the data extraction that includes a combination of tables and different aggregations. To improve performances it is possible to introduce Redshift materialized views that will be treated as snapshots of extracted sets of data necessary for data analyses.
Let’s say that we have to produce the following set of reports:
- Total number of company followers
- Daily distribution of company followers
- Location distribution of company followers
Queries that are shown above contain all necessary details for these types of reports and in this case it is possible to create a materialized view that will keep all these details up to date with extracted data located in S3 bucket.
CREATE MATERIALIZED VIEW sg_followers_view
BACKUP YES
AS
SELECT s.properties.companyid as companyid, s.userId, s.context.os.name, s.context.ip, s.timestamp as time
FROM segment.sg_android_source s
WHERE s.event = ‘company_follow_action’;
Created materialized view reflects current data set exported to S3 bucket. To provide continuous updates and include new data that is coming to S3 bucket it is necessary to schedule the refresh of materialized view. This can be achieved by using Redshift scheduler that will execute the following query:
REFRESH MATERIALIZED VIEW sg_followers_view;
Per Mixpanel and Segment documentation exports happen once per hour so the refreshing of materialized view should be scheduled hourly.
The created materialized view can be treated as any Redshift database table and queried the same way. In our case the total number of company followers can be extracted as:
SELECT companyid, count(distinct userid)
FROM sg_followers_view
GROUP BY companyid
Daily and location distributions of company followers can be achieved with aggregation per time and location. All these operations are still simple aggregations. More often it is necessary to perform complex aggregation on a very large set of data and in this case it is suggested to do the second level of data extraction which can create aggregated data tables that will contain aggregated values selected from materialized views.
For example
CREATE TABLE data.company_aggregated_followers_table
AS
SELECT companyid, to_date(cast(time as timestamp), ‘YYYY-MM-DD’) as formattedDate, count(userid) as totalFollowers
FROM sg_followers_view
GROUP BY companyId, formattedDate
However, it is necessary to provide updates of these aggregated tables and this can also be achieved by running a set of queries within Redshift scheduler:
CREATE TEMP TABLE company_aggregated_followers_stage (like data.company_aggregated_followers_table);
INSERT INTO
company_aggregated_followers_stage
SELECT
companyId,
to_date(cast(time as timestamp), ‘YYYY-MM-DD’) as formattedDate,
count(userid) as totalFollowers
FROM
public.sg_followers_view
WHERE
formattedDate BETWEEN to_date(dateadd(day, -1, current_date), ‘YYYY-MM-DD’)
AND current_date
GROUP BY
companyId,
formattedDate;
DELETE FROM
data.company_aggregated_followers_table using company_aggregated_followers_stage
WHERE
data.company_aggregated_followers_table.companyid = company_aggregated_followers_stage.companyid
AND company_aggregated_followers_table.formattedDate = company_aggregated_followers_stage.formattedDate;
INSERT INTO
data.company_aggregated_followers_table
SELECT *
FROM
company_aggregated_followers_stage;
DROP TABLE company_aggregated_followers_stage;
Conclusion
Reports and analytics are a nightmare for most developers while for customers and stakeholders they are the heart of any business and their needs about different types of information are changing on daily basis. This can be classified as the main reason why it’s necessary to have access to all data within one location.
The goal of this article was to bring ideas on how to collect all data from different sources in one place. Sources and the final location can be differently chosen but the principals are the same. Also, different types of optimisation and performance improvements can be applied. It all depends on your needs and technology stack on which is your business based.