Overview
Snowplow is an open-source platform that allows businesses to capture granular, event-level data on user behavior across multiple touchpoints and store it in a single location. The platform is designed to operate at enterprise scale, and Snowplow events can be plugged into almost any analytics tool.
Because Snowplow already collects event-level data structured as events and properties, Snowplow data needs next-to-no adjustment before being connected to Indicative for analysis.
After the Snowplow Enrich step, Snowplow events are stored in an AWS S3 bucket or streamed to AWS Kinesis. Indicative reads Snowplow data from either source (we will refer to both as the Unified Log). Users can extend their data model with other data sources at the Data Modeling step (see Advanced Data Modeling) before conducting analysis in Indicative.
An implementation of Snowplow can track a range of predefined events, custom structured events, and custom unstructured events that users model themselves.
By default, a wide range of common properties are logged with any implementation of Snowplow. In addition, customers can define both structured contexts and more flexible custom contexts.
Deriving Indicative Events and Properties
Overview
Because Snowplow inherently uses an event-based model, no transformation is needed to plug Snowplow data into Indicative for analysis. See below for how events and properties are derived from specific types of Snowplow events.
By default, for all event generation the ‘domain_userid’ field in the Snowplow Unified Log is used as the unique user identifier in Indicative, and the ‘collector_tstamp’ field in the Unified Log is used as the event timestamp.
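To make the default mapping concrete, here is a minimal, purely illustrative sketch of how a single enriched event row could be translated. The function name and output keys are hypothetical and do not reflect Indicative's internal code; they only restate the field assignments described above.

```python
# Illustrative only: which Unified Log fields Indicative reads by default.
# The function name and output keys are hypothetical; the input is assumed to
# be one enriched event already parsed into a dict keyed by Unified Log field.
def to_indicative_event(row: dict) -> dict:
    return {
        "user_id": row["domain_userid"],       # unique user identifier
        "timestamp": row["collector_tstamp"],  # event timestamp
        "name": row["event"],                  # event name for predefined events
    }
```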
Event and Properties Summary Table
Snowplow Entity | Unique Identifier Used | Timestamp | Properties |
---|---|---|---|
Predefined events (page views, page pings, ecommerce transactions, errors) | ‘domain_userid’ | ‘collector_tstamp’ | Common, platform-specific fields, and applicable Custom Contexts |
Structured Events | ‘domain_userid’ | ‘collector_tstamp’ | Common and platform-specific fields, ‘se_category’, ‘se_label’, ‘se_property’, ‘se_value’, and applicable Custom Contexts |
Unstructured Events | ‘domain_userid’ | ‘collector_tstamp’ | Common, platform-specific fields, and applicable Custom Contexts |
Predefined Events
Snowplow has a set of predefined event types that can be instrumented:
- Page views
- Page pings
- Ecommerce transactions
- Errors
If instrumented, the integration will generate these events, where the ‘event’ field in the Unified Log is used as the event name in Indicative.
Common and Platform-Specific Properties
For all Snowplow events, a range of datetime, user, and device fields is recorded along with the event. If instrumented, all of these fields are generated as Indicative properties. Any platform-specific fields are recorded as well, such as page referrer and URL information for a web instrumentation.
Structured Events
Snowplow custom structured events are generated using the ‘se_action’ field as the event name for Indicative (or the ‘event_name’ if ‘se_action’ is not populated). The ‘se_category’, ‘se_label’, ‘se_property’ and ‘se_value’ fields are added as Indicative properties in addition to all common and platform-specific properties.
Unstructured Events
Snowplow allows customers to model flexible custom unstructured events as needed. The ‘event_name’ field is used for the event name in Indicative (or it defaults to ‘unstruct’ if ‘event_name’ is not populated).
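The naming rules for the three event types can be summarized in a short, purely illustrative sketch. It assumes the enriched event has been parsed into a dict and that structured and unstructured events carry ‘struct’ and ‘unstruct’ in the ‘event’ field, as in the standard Snowplow enriched format.

```python
# Hypothetical sketch of the event-name fallback order described above.
def indicative_event_name(row: dict) -> str:
    if row.get("event") == "struct":
        # Structured events: 'se_action', else fall back to 'event_name'
        return row.get("se_action") or row.get("event_name")
    if row.get("event") == "unstruct":
        # Unstructured events: 'event_name', else fall back to 'unstruct'
        return row.get("event_name") or "unstruct"
    # Predefined events (page views, page pings, ecommerce transactions,
    # errors): the 'event' field itself is used as the event name
    return row["event"]
```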
Custom Contexts
Snowplow allows customers to define their own context around events, such as extra user properties for a customer (membership information, age, etc.) or extra properties about a product for a purchase event (SKU, tags, product name, etc.).
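For reference, a custom context is attached to an event as a self-describing JSON, and its fields surface as Indicative properties. The example below is hypothetical (the schema URI and fields are invented) and only illustrates the shape.

```python
# Hypothetical custom context (self-describing JSON) a customer might attach
# to events; the schema URI and fields are illustrative only.
membership_context = {
    "schema": "iglu:com.example/membership/jsonschema/1-0-0",
    "data": {
        "membership_tier": "gold",
        "member_since": "2017-03-01",
    },
}
```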
Aliasing
For S3 or batch-based Snowplow integrations where the data contains both authenticated and unauthenticated sessions for users, Indicative automatically aliases these sessions by tying the ‘user_id’ field to the associated ‘domain_userid’, so that analysis can draw on a shared user history. This is only available to Snowplow Relay customers on the Pro or Enterprise plans. For further reading on the Indicative aliasing process, please see the Aliasing Documentation.
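Conceptually, aliasing builds a mapping from each ‘domain_userid’ that has been seen alongside a ‘user_id’ back to that ‘user_id’. The sketch below is a simplified illustration of that idea, not Indicative's actual implementation.

```python
# Simplified illustration of aliasing, not Indicative's actual code: events
# from unauthenticated sessions (only a 'domain_userid') are attributed to the
# authenticated 'user_id' once the two identifiers have been seen together.
def build_alias_map(rows):
    alias = {}
    for row in rows:
        if row.get("user_id") and row.get("domain_userid"):
            alias[row["domain_userid"]] = row["user_id"]
    return alias

def resolve_user(row, alias):
    return row.get("user_id") or alias.get(row["domain_userid"], row["domain_userid"])
```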
Advanced Data Modeling
Overview
In addition to generating events from predefined, custom structured, and custom unstructured Snowplow events, the Snowplow integration with Indicative allows customers to apply another layer of data modeling using all fields available in the Snowplow Unified Log as well as custom data tables provided by the customer.
For example, customers can generate a ‘Splash Page View’ event whenever the Snowplow ‘page view’ event fires and the ‘page_url’ contains ‘landing’. Customers can combine any subset of Snowplow events and properties with logical operators to generate new Indicative events and properties.
In addition, if customers have extra data tables to join against Snowplow data, new events and properties can be modeled and generated at this step.
Generation Logic for New Events and Properties
If a customer needs a set of events and properties that are not captured in the default Snowplow integration, the customer can provide flexible logic-based rules to generate new events and properties based on Snowplow data or custom data tables.
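As a concrete illustration, the sketch below derives the ‘Splash Page View’ event described in the overview above. The rule format is hypothetical; actual rules are defined together with Indicative at the Data Modeling step.

```python
# Hypothetical modeling rule: derive a new Indicative event from Snowplow data.
# Field names come from the Unified Log; the output layout is illustrative.
def derive_splash_page_view(row: dict):
    if row.get("event") == "page_view" and "landing" in (row.get("page_url") or ""):
        return {
            "name": "Splash Page View",
            "user_id": row["domain_userid"],
            "timestamp": row["collector_tstamp"],
        }
    return None
```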
BigQuery Integration
The BigQuery integration with Indicative is available for Enterprise customers only. If interested, please contact us. You must grant 'bigquery.dataViewer' access to Indicative for your BigQuery project.
In order to perform the following steps you must have administrative access to the BigQuery console as well as your BigQuery database.
1. In Indicative, click on Settings and select Data Sources.
2. Click on New Data Source.
3. Select the Snowplow BigQuery icon and click on Next.
4. You will need to enter these values, which you can get from your BigQuery console.
a. Open the BigQuery console on Google Cloud Platform and select a project.
b. Enter the GCP Project ID containing your Snowplow data.
c. Enter the Dataset Name.
d. Enter the Table Name, then click Next in Indicative.
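Before entering these values in Indicative, you can optionally sanity-check them with the google-cloud-bigquery client. The project, dataset, and table names below are placeholders; replace them with your own.

```python
# Optional sanity check (all names are placeholders): confirm the project,
# dataset, and table you are about to enter in Indicative exist and are
# readable with the credentials you intend to use.
from google.cloud import bigquery

project_id = "my-gcp-project"   # GCP Project ID containing your Snowplow data
dataset_name = "snowplow"       # Dataset Name
table_name = "events"           # Table Name

client = bigquery.Client(project=project_id)
table = client.get_table(f"{project_id}.{dataset_name}.{table_name}")
print(table.full_table_id, table.num_rows)
```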
S3 Integration
The Snowplow S3 integration with Indicative is available for Enterprise customers only. If interested, please contact us. The Snowplow Unified Log is stored in an S3 bucket and the customer is required to write an IAM policy to grant Indicative programmatic access to the respective S3 bucket.
To get set up with S3, follow these instructions:
1. In Indicative, click on Settings and select Data Sources.
2. Click on New Data Source.
3. Select the Snowplow S3 icon and click on Next.
4. You will need to enter the Bucket Name and the File Path to Enriched/Archived. These values can be accessed from your AWS Management Console at https://console.aws.amazon.com/iam/.
- To get the Bucket Name from your AWS Console, click on the Services dropdown and select S3 under Storage.
- Copy your bucket name and paste it into the Bucket Name field in Indicative.
- Navigate the folder structure to get to the enriched data and refer to the image below for the value to enter in the File Path To Enriched/Archived field in the Indicative UI. In this example, the value is /main/enriched/good/
5. Click Next in Indicative. You will need to copy this policy in step 8.
6. In your AWS console, click on the appropriate bucket containing the Snowplow unshredded logs and then click into Permissions. We recommend the enriched/archive bucket set up through your Snowplow configuration.
7. Click Bucket Policy
8. Copy the policy from step 5 and paste into the editor in the AWS Console and click Save.
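Alternatively, if you manage your buckets programmatically, the policy generated in step 5 can be applied with boto3 instead of the console editor. The bucket name below is a placeholder, and the policy string must be the exact JSON copied from the Indicative UI.

```python
# Alternative to steps 6-8: apply the Indicative-generated bucket policy with
# boto3. The bucket name is a placeholder; paste the policy JSON exactly as
# shown in the Indicative UI in step 5.
import boto3

bucket_name = "my-snowplow-enriched-bucket"                      # placeholder
policy_json = """{ ... policy copied from Indicative step 5 ... }"""

s3 = boto3.client("s3")
s3.put_bucket_policy(Bucket=bucket_name, Policy=policy_json)
```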
9. Click Validate Integration. If successful you will see a scheduling screen. Please select a date and time and enter your contact info to schedule a meeting with an Indicative Specialist.
10. Provide Indicative with the information needed to move forward with your integration by completing this Data Integration Questionnaire.
Snowflake Integration
The Snowplow Snowflake integration with Indicative is available for Enterprise customers only. To integrate with Snowplow Snowflake, follow these instructions:
1. Click on Settings and select Data Sources.
2. Click New Data Source.
3. Select Snowplow Snowflake.
4. Click Next and you should see the following screen. To get the values for this page, please log into your Snowflake account.
*Note: The Auto-Generated Password is a password that Indicative has randomly generated. If you would like to use your own password, please place your own value in that field.
a. The Account ID is everything to the left of .gcp.snowflakecomputing.com/... in your Snowflake URL.
b. Enter the Warehouse name.
c. Enter the Database name.
d. Click into Warehouses and copy the Schema.
e. Enter the Table name.
5. Click Next
6. You will need to copy and paste these code snippets into your Snowflake console; the last snippet is optional. Navigate to the Worksheets tab, paste the snippets into the SQL runner, check the All Queries checkbox, and click Run.
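If you prefer to run these snippets outside the Worksheets UI, they can also be executed with the snowflake-connector-python library. All connection values below are placeholders, and the SQL itself must be the snippets copied from the Indicative UI.

```python
# Optional: execute the snippets from step 6 with the Snowflake Python
# connector instead of the Worksheets UI. All values are placeholders; replace
# the list entries with the SQL copied from Indicative.
import snowflake.connector

conn = snowflake.connector.connect(
    account="xy12345",        # everything left of .snowflakecomputing.com
    user="MY_ADMIN_USER",
    password="********",
    warehouse="MY_WAREHOUSE",
    database="MY_DATABASE",
)
snippets = [
    "-- replace with snippet 1 copied from Indicative",
    "-- replace with snippet 2 copied from Indicative",
]
cur = conn.cursor()
for sql in snippets:
    cur.execute(sql)
```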
7. Click Validate Integration in Indicative.
8. If the validation is successful, you will see a screen to schedule a call with an Indicative specialist. Please select a date and time and enter your contact info.
Kinesis Integration
To connect your real time Snowplow data to Indicative, follow the instructions below:
1. In Indicative, click on Settings and select Data Sources.
2. Click New Data Source.
3. Select Snowplow Kinesis.
4. Click Next. You will need to use this API Key in step 4 of Create the Lambda Function.
Create an IAM Role for the Lambda
Your AWS Lambda needs to have an Execution Role that allows it to use the Kinesis Stream and CloudWatch. (For more information on setting up IAM Roles, please see the official AWS tutorial.)
1. Go to IAM Management in the Console and choose Roles from the sidebar.
2. Click Create role.
3. For the type of trusted entity select AWS Service and for the service that will use this role choose Lambda. Click Next: Permissions at the bottom of the screen.
4. Now you need to choose a permission policy for the role. The Lambda needs read access to Kinesis and write access to CloudWatch Logs, so we will choose AWSLambdaKinesisExecutionRole. Search for AWSLambdaKinesisExecutionRole and mark the checkbox as shown below.
5. Click Next: Review at the bottom of the screen.
6. On the next screen provide a name for the newly created role under Role Name, then click Create role to finish the process.
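If you script your AWS setup rather than use the console, the same role can be created with boto3. The role name below is a placeholder; the trust policy and attached managed policy mirror the console steps above.

```python
# Console alternative: create the Lambda execution role with boto3.
# The role name is a placeholder.
import json
import boto3

iam = boto3.client("iam")

trust_policy = {
    "Version": "2012-10-17",
    "Statement": [{
        "Effect": "Allow",
        "Principal": {"Service": "lambda.amazonaws.com"},
        "Action": "sts:AssumeRole",
    }],
}

iam.create_role(
    RoleName="indicative-relay-role",    # placeholder
    AssumeRolePolicyDocument=json.dumps(trust_policy),
)
iam.attach_role_policy(
    RoleName="indicative-relay-role",
    PolicyArn="arn:aws:iam::aws:policy/service-role/AWSLambdaKinesisExecutionRole",
)
```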
Create the Lambda Function
The Lambda function can be created either directly through AWS Console or through other tools like the AWS CLI. For this integration, the recommended memory setting is 256 MB and because the JVM has to cold start when the function is called for the first time on a new instance, you should set a high timeout value; 90 seconds should be safe.
As with the IAM Role, we will be using the AWS Console to get our Lambda function up and running. Make sure you are in the same region as where your Kinesis streams are defined.
1. On the Console navigate to the Lambda section and click Create a function (runtime should be Java 8).
2. Write a name for your function in Name. In the Role dropdown pick Choose an existing role; then in the dropdown below choose the name of the role you created in the previous step. Click Create function.
3. The Lambda has been created, although it does not do anything yet. We need to provide the code and configure the function:
a. Take a look at the Function code box. In the Handler textbox paste: com.snowplowanalytics.indicative.LambdaHandler::recordHandler
b. From the Code entry type dropdown pick Upload a file from Amazon S3. A textbox labeled S3 Link URL will appear. We host the relay code in our hosted-assets buckets, and you will need to choose the S3 bucket in the same region as your AWS Lambda function: for example, if your Lambda is in the us-east-1 region, paste the following URL into the textbox: s3://snowplow-hosted-assets-us-east-1/relays/indicative/indicative-relay-0.4.0.jar. Take a look at this table to pick the right bucket name for your region. Make sure Runtime is Java 8.
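As noted at the start of this section, the function can also be created with tooling rather than the console. The boto3 sketch below mirrors steps 1-3; the function name and role ARN are placeholders, while the handler and JAR location are the ones given above (us-east-1 shown; substitute your region's hosted-assets bucket).

```python
# Console alternative to steps 1-3: create the relay Lambda with boto3.
# Function name and role ARN are placeholders.
import boto3

lambda_client = boto3.client("lambda", region_name="us-east-1")

lambda_client.create_function(
    FunctionName="indicative-relay",                                  # placeholder
    Runtime="java8",
    Role="arn:aws:iam::123456789012:role/indicative-relay-role",      # placeholder
    Handler="com.snowplowanalytics.indicative.LambdaHandler::recordHandler",
    Code={
        "S3Bucket": "snowplow-hosted-assets-us-east-1",
        "S3Key": "relays/indicative/indicative-relay-0.4.0.jar",
    },
    MemorySize=256,   # recommended memory setting
    Timeout=90,       # high timeout to absorb JVM cold starts
)
```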
4. Get the API Key from step 4 in the Indicative UI.
5. Below Function code settings you will find a section called Environment variables.
a. In the first row, first column (the key), type INDICATIVE_API_KEY. In the second column (the value), paste your API Key.
b. The relay lets you configure the following filters:
- UNUSED_EVENTS: events that will not be relayed to Indicative;
- UNUSED_ATOMIC_FIELDS: fields of the canonical Snowplow event that will not be relayed to Indicative;
- UNUSED_CONTEXTS: contexts whose fields will not be relayed to Indicative.
Out of the box, the relay is configured to use the following defaults:
Unused events | Unused atomic fields | Unused contexts |
---|---|---|
app_heartbeat | etl_tstamp | application_context |
app_initialized | collector_tstamp | application_error |
app_shutdown | dvce_created_tstamp | duplicate |
app_warning | event | geolocation_context |
create_event | txn_id | instance_identity_document |
emr_job_failed | name_tracker | java_context |
emr_job_started | v_tracker | jobflow_step_status |
emr_job_status | v_collector | parent_event |
emr_job_succeeded | v_etl | performance_timing |
incident | user_fingerprint | timing |
incident_assign | geo_latitude | |
incident_notify_of_close | geo_longitude | |
incident_notify_user | ip_isp | |
job_update | ip_organization | |
load_failed | ip_domain | |
load_succeeded | ip_netspeed | |
page_ping | page_urlscheme | |
s3_notification_event | page_urlport | |
send_email | page_urlquery | |
send_message | page_urlfragment | |
storage_write_failed | refr_urlscheme | |
stream_write_failed | refr_urlport | |
task_update | refr_urlquery | |
wd_access_log | refr_urlfragment | |
| pp_xoffset_min | |
| pp_xoffset_max | |
| pp_yoffset_min | |
| pp_yoffset_max | |
| br_features_pdf | |
| br_features_flash | |
| br_features_java | |
| br_features_director | |
| br_features_quicktime | |
| br_features_realplayer | |
| br_features_windowsmedia | |
| br_features_gears | |
| br_features_silverlight | |
| br_cookies | |
| br_colordepth | |
| br_viewwidth | |
| br_viewheight | |
| dvce_ismobile | |
| dvce_screenwidth | |
| dvce_screenheight | |
| doc_charset | |
| doc_width | |
| doc_height | |
| tr_currency | |
| mkt_clickid | |
| etl_tags | |
| dvce_sent_tstamp | |
| refr_domain_userid | |
| refr_device_tstamp | |
| derived_tstamp | |
| event_vendor | |
| event_name | |
| event_format | |
| event_version | |
| event_fingerprint | |
| true_tstamp | |
To change the defaults, you can pass in your own lists of events, atomic fields or contexts to be filtered out. For example:
Environment variable key | Environment variable value |
---|---|
UNUSED_EVENTS | page_ping,file_download |
UNUSED_ATOMIC_FIELDS | name_tracker,event_vendor |
UNUSED_CONTEXTS | performance_timing,client_context |
Similarly to setting up the API key, the first column (key) needs to be set to the specified environment variable name in ALLCAPS. The second column (value) is your own list as a comma-separated string with no spaces.
If you only specify the environment variable name but do not provide a list of values, then nothing will be filtered out.
If you do not set any of the environment variables, the defaults will be used.
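If you manage the function programmatically, the API key and filter lists can be set the same way with boto3. The function name and the example filter values below are placeholders.

```python
# Optional: set the API key and filter lists with boto3 instead of the console.
# The function name and example filter values are placeholders.
import boto3

lambda_client = boto3.client("lambda", region_name="us-east-1")

lambda_client.update_function_configuration(
    FunctionName="indicative-relay",                     # placeholder
    Environment={"Variables": {
        "INDICATIVE_API_KEY": "YOUR_API_KEY_FROM_STEP_4",
        "UNUSED_EVENTS": "page_ping,file_download",
        "UNUSED_ATOMIC_FIELDS": "name_tracker,event_vendor",
        "UNUSED_CONTEXTS": "performance_timing,client_context",
    }},
)
```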
6. Scroll down a bit and take a look at the Basic settings box, where you can set memory and timeout limits for the Lambda. As mentioned earlier, we recommend setting 256 MB of memory or higher (on AWS Lambda, CPU performance scales linearly with the amount of memory) and a high timeout of 1 minute 30 seconds.
7. As a final step, add your Snowplow enriched Kinesis stream as an event source for the Lambda function. You can follow the official AWS tutorial if you are using the AWS CLI, or do it directly from the AWS Console using the following instructions. Scroll to the top of the page and, from the list of triggers in the Designer configuration, choose Kinesis.
Take a look at the Configure triggers section that appears below. Choose the Kinesis stream that contains your Snowplow enriched events. Set the batch size to your liking; 100 is a reasonable setting. Note that this is a maximum batch size, so the function can be triggered with fewer records. For the starting position we recommend Trim horizon, which starts processing from the oldest available records in the stream (alternatively, you can select At timestamp to start sending data from a particular date). Make sure Enable trigger is selected, then click the Add button to finish the trigger configuration.
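The same trigger can be added with boto3 if you prefer not to use the console. The stream ARN and function name below are placeholders; the batch size and starting position mirror the recommendations above.

```python
# Console alternative to step 7: attach the enriched Kinesis stream as an
# event source. The stream ARN and function name are placeholders.
import boto3

lambda_client = boto3.client("lambda", region_name="us-east-1")

lambda_client.create_event_source_mapping(
    EventSourceArn="arn:aws:kinesis:us-east-1:123456789012:stream/snowplow-enriched-good",  # placeholder
    FunctionName="indicative-relay",    # placeholder
    BatchSize=100,
    StartingPosition="TRIM_HORIZON",
    Enabled=True,
)
```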
8. Save the changes by clicking the Save button in the top-right part of the page.
Validate Your Data
Go to your Indicative project to check if you are receiving data. You can also go to the debug console to troubleshoot the relay in real time.