Prerequisites:
The Snowplow Unified Log is stored in an S3 bucket, and you will need to write an IAM policy that grants Analytics programmatic access to that bucket.
If there are additional enrichments required, such as joining with user property tables or deriving custom user_ids, please contact us.
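If you want to script this prerequisite, the sketch below (Python with boto3) creates a minimal read-only policy for the bucket. The bucket and policy names are placeholders, and the exact statements required may differ from what is shown here:

```python
import json

import boto3

iam = boto3.client("iam")

# Placeholder: substitute the bucket that holds your Snowplow Unified Log.
BUCKET = "my-snowplow-unified-log"

# Read-only access: list the bucket and fetch its objects.
policy_document = {
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Action": ["s3:ListBucket"],
            "Resource": f"arn:aws:s3:::{BUCKET}",
        },
        {
            "Effect": "Allow",
            "Action": ["s3:GetObject"],
            "Resource": f"arn:aws:s3:::{BUCKET}/*",
        },
    ],
}

iam.create_policy(
    PolicyName="snowplow-unified-log-read",  # placeholder name
    PolicyDocument=json.dumps(policy_document),
)
```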
Instructions:
To connect your real time Snowplow data to Analytics, follow the instructions below:
1. In Analytics, click on the gear icon and select Project Settings.
2. Select the Data Sources tab.
3. Select New Data Source.
4. Select Snowplow Kinesis.
5. Click Next. You will need to use this API Key in step 4 of Create the Lambda Function below.
Create an IAM Role for the Lambda
Your AWS Lambda function needs an Execution Role that allows it to read from the Kinesis stream and write to CloudWatch Logs. (For more information on setting up IAM roles, see the official AWS tutorial.)
1. Go to IAM Management in the Console and choose Roles from the sidebar.
2. Click Create role.
3. For the type of trusted entity select AWS Service and for the service that will use this role choose Lambda. Click Next: Permissions at the bottom of the screen.
4. Now choose a permissions policy for the role. The Lambda needs read access to Kinesis and write access to CloudWatch Logs; the AWS managed policy AWSLambdaKinesisExecutionRole covers both. Search for AWSLambdaKinesisExecutionRole and mark its checkbox.
5. Click Next: Review at the bottom of the screen.
6. On the next screen provide a name for the newly created role under Role Name, then click Create role to finish the process.
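For reference, the same role can also be created programmatically. The sketch below (Python with boto3) uses a placeholder role name; the policy ARN is the standard AWS managed AWSLambdaKinesisExecutionRole:

```python
import json

import boto3

iam = boto3.client("iam")

# Trust policy that lets the Lambda service assume this role.
trust_policy = {
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Principal": {"Service": "lambda.amazonaws.com"},
            "Action": "sts:AssumeRole",
        }
    ],
}

role = iam.create_role(
    RoleName="indicative-relay-role",  # placeholder name
    AssumeRolePolicyDocument=json.dumps(trust_policy),
)

# Kinesis read + CloudWatch Logs write, as in the console steps above.
iam.attach_role_policy(
    RoleName="indicative-relay-role",
    PolicyArn="arn:aws:iam::aws:policy/service-role/AWSLambdaKinesisExecutionRole",
)

# You will need this ARN when creating the Lambda function.
print(role["Role"]["Arn"])
```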
Create the Lambda Function
The Lambda function can be created either directly through the AWS Console or through other tools such as the AWS CLI. For this integration, the recommended memory setting is 256 MB, and because the JVM has to cold start when the function is first called on a new instance, you should set a generous timeout; 90 seconds should be safe.
As with the IAM Role, we will use the AWS Console to get the Lambda function up and running. Make sure you are in the same region as your Kinesis streams. (If you would rather script the setup, a boto3 sketch of the equivalent API calls follows the numbered steps below.)
1. On the Console navigate to the Lambda section and click Create a function (runtime should be Java 8).
2. Write a name for your function in Name. In the Role dropdown pick Choose an existing role; then in the dropdown below choose the name of the role you created in the previous step. Click Create function.
3. The Lambda has been created, although it does not do anything yet. We need to provide the code and configure the function:
a. Take a look at the Function code box. In the Handler textbox paste: com.snowplowanalytics.indicative.LambdaHandler::recordHandler
b. From the Code entry type dropdown, pick Upload a file from Amazon S3. A textbox labeled S3 Link URL will appear. The relay code is distributed through the Snowplow hosted-assets buckets, and you must use the bucket in the same region as your Lambda function: for example, if your Lambda is in the us-east-1 region, paste s3://snowplow-hosted-assets-us-east-1/relays/indicative/indicative-relay-0.4.0.jar into the textbox. Refer to this table to pick the right bucket name for your region. Make sure Runtime is Java 8.
4. Get the API Key that you obtained from the Analytics UI earlier.
5. Below Function code settings you will find a section called Environment variables.
a. In the first row, first column (the key), type INDICATIVE_API_KEY. In the second column (the value), paste your API Key.
b. The relay lets you configure the following filters:
- UNUSED_EVENTS: events that will not be relayed to Analytics;
- UNUSED_ATOMIC_FIELDS: fields of the canonical Snowplow event that will not be relayed to Analytics;
- UNUSED_CONTEXTS: contexts whose fields will not be relayed to Analytics.
Out of the box, the relay is configured to use the following defaults:
Unused events | Unused atomic fields | Unused contexts |
---|---|---|
app_heartbeat | etl_tstamp | application_context |
app_initialized | collector_tstamp | application_error |
app_shutdown | dvce_created_tstamp | duplicate |
app_warning | event | geolocation_context |
create_event | txn_id | instance_identity_document |
emr_job_failed | name_tracker | java_context |
emr_job_started | v_tracker | jobflow_step_status |
emr_job_status | v_collector | parent_event |
emr_job_succeeded | v_etl | performance_timing |
incident | user_fingerprint | timing |
incident_assign | geo_latitude | |
incident_notify_of_close | geo_longitude | |
incident_notify_user | ip_isp | |
job_update | ip_organization | |
load_failed | ip_domain | |
load_succeeded | ip_netspeed | |
page_ping | page_urlscheme | |
s3_notification_event | page_urlport | |
send_email | page_urlquery | |
send_message | page_urlfragment | |
storage_write_failed | refr_urlscheme | |
stream_write_failed | refr_urlport | |
task_update | refr_urlquery | |
wd_access_log | refr_urlfragment | |
| pp_xoffset_min | |
| pp_xoffset_max | |
| pp_yoffset_min | |
| pp_yoffset_max | |
| br_features_pdf | |
| br_features_flash | |
| br_features_java | |
| br_features_director | |
| br_features_quicktime | |
| br_features_realplayer | |
| br_features_windowsmedia | |
| br_features_gears | |
| br_features_silverlight | |
| br_cookies | |
| br_colordepth | |
| br_viewwidth | |
| br_viewheight | |
| dvce_ismobile | |
| dvce_screenwidth | |
| dvce_screenheight | |
| doc_charset | |
| doc_width | |
| doc_height | |
| tr_currency | |
| mkt_clickid | |
| etl_tags | |
| dvce_sent_tstamp | |
| refr_domain_userid | |
| refr_device_tstamp | |
| derived_tstamp | |
| event_vendor | |
| event_name | |
| event_format | |
| event_version | |
| event_fingerprint | |
| true_tstamp | |
To change the defaults, you can pass in your own lists of events, atomic fields or contexts to be filtered out. For example:
Environment variable key | Environment variable value |
---|---|
UNUSED_EVENTS | page_ping,file_download |
UNUSED_ATOMIC_FIELDS | name_tracker,event_vendor |
UNUSED_CONTEXTS | performance_timing,client_context |
As when setting up the API key, the first column (the key) must be the specified environment variable name in ALL CAPS. The second column (the value) is your own list as a comma-separated string with no spaces.
If you only specify the environment variable name but do not provide a list of values, then nothing will be filtered out.
If you do not set any of the environment variables, the defaults will be used.
6. Scroll down a bit to the Basic settings box, where you can set memory and timeout limits for the Lambda. As mentioned earlier, we recommend 256 MB of memory or more (on AWS Lambda, CPU performance scales linearly with the amount of memory) and a generous timeout of 1 minute 30 seconds.
7. As the final step, add your Snowplow enriched Kinesis stream as an event source for the Lambda function. You can follow the official AWS tutorial if you are using the AWS CLI, or do it directly from the AWS Console using the following instructions. Scroll to the top of the page and, from the list of triggers in the Designer configuration, choose Kinesis.
Take a look at the Configure triggers section that appears below. Choose the Kinesis stream that contains your Snowplow enriched events. Set the batch size to your liking; 100 is a reasonable setting. Note that this is a maximum batch size: the function can be triggered with fewer records. For the starting position we recommend Trim horizon, which starts processing from the oldest available record in the stream. (Alternatively, select At timestamp to start sending data from a particular date.) Make sure Enable trigger is selected, then click the Add button to finish the trigger configuration.
8. Save the changes by clicking the Save button in the top-right part of the page.
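If you would rather script steps 3 through 7 than click through the console, the following boto3 sketch applies the same settings. The function name, account IDs, stream name, and API key are placeholders; substitute your own values and use the hosted-assets bucket that matches your Lambda's region:

```python
import boto3

lambda_client = boto3.client("lambda", region_name="us-east-1")

# Placeholders: the role ARN from the IAM step and your enriched stream.
ROLE_ARN = "arn:aws:iam::123456789012:role/indicative-relay-role"
STREAM_ARN = "arn:aws:kinesis:us-east-1:123456789012:stream/enriched-good"

lambda_client.create_function(
    FunctionName="indicative-relay",  # placeholder name
    Runtime="java8",
    Role=ROLE_ARN,
    Handler="com.snowplowanalytics.indicative.LambdaHandler::recordHandler",
    Code={
        # Hosted-assets bucket for the Lambda's own region (see step 3b).
        "S3Bucket": "snowplow-hosted-assets-us-east-1",
        "S3Key": "relays/indicative/indicative-relay-0.4.0.jar",
    },
    MemorySize=256,  # recommended minimum for this integration
    Timeout=90,      # generous, to absorb JVM cold starts
    Environment={
        "Variables": {
            "INDICATIVE_API_KEY": "YOUR_API_KEY",  # placeholder
            # Optional filters; omit these to keep the defaults above.
            "UNUSED_EVENTS": "page_ping,file_download",
        }
    },
)

# Wire the enriched Kinesis stream to the function (step 7).
lambda_client.create_event_source_mapping(
    EventSourceArn=STREAM_ARN,
    FunctionName="indicative-relay",
    BatchSize=100,
    StartingPosition="TRIM_HORIZON",
    Enabled=True,
)
```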
Validate Your Data
Go to your Analytics project to check whether data is arriving. You can also use the debug console to troubleshoot the relay in real time.