A tutorial for building search sessions in a streaming context


Building Sessions from Search Logs

  • Skill level

    Basic, no prior knowledge requirements

  • Time to complete

    Approx. 25 min

Introduction: Here is a basic example of using Bytewax to turn an incoming stream of event logs from a hypothetical search engine into metrics over search sessions. In this example, we focus on the dataflow itself and on aggregating state.

Prerequisites

Python modules

  • bytewax==0.19.*

Your Takeaway

This guide will teach you how to use Bytewax to group streaming search events into custom session windows and then calculate the Click-Through Rate (CTR) for each session downstream.

Table of Contents

  • Resources
  • Introduction and problem statement
  • Strategy
  • Assumptions
  • Imports and Setup
  • Data Model
  • Defining user events, adding events and calculating CTR
  • Creating our Dataflow
  • Execution
  • Summary

Resources

GitHub repository

Introduction and problem statement

One of the most critical metrics in evaluating the effectiveness of online platforms, particularly search engines, is the Click-Through Rate (CTR). The CTR is a measure of how frequently users engage with search results or advertisements, making it an indispensable metric for digital marketers, web developers, and data analysts.

This relevance of CTR extends to any enterprise aiming to understand user behavior, refine content relevancy, and ultimately, increase the profitability of online activities. As such, efficiently calculating and analyzing CTR is not only essential for enhancing user experience but also for driving strategic business decisions. The challenge, however, lies in accurately aggregating and processing streaming data to generate timely and actionable insights.

Our focus on developing a dataflow using Bytewax—an open-source Python framework for streaming data processing—addresses this challenge head-on. Bytewax allows for the real-time processing of large volumes of event data, which is particularly beneficial for organizations dealing with continuous streams of user interactions. This tutorial is specifically relevant for:

  • Digital Marketers: Who need to analyze user interaction to optimize ad placements and content strategy effectively.
  • Data Analysts and Scientists: Who require robust tools to process and interpret user data to derive insights that drive business intelligence.
  • Web Developers: Focused on improving site architecture and user interface to enhance user engagement and satisfaction.
  • Product Managers: Who oversee digital platforms and are responsible for increasing user engagement and retention through data-driven methodologies.

Strategy

In this tutorial, we will demonstrate how to build a dataflow using Bytewax to process streaming data from a hypothetical search engine. The dataflow will be designed to calculate the Click-Through Rate (CTR) for each search session, providing a comprehensive overview of user engagement with search results. The key steps involved in this process include:

  1. Defining a data model/schema for incoming events.
  2. Generating input data to simulate user interactions.
  3. Implementing logic functions to calculate CTR for each search session.
  4. Creating a dataflow that incorporates windowing to process the incoming event stream.
  5. Executing the dataflow to generate actionable insights.

Assumptions

  • Searches are per-user, so we need to divvy up events by user.
  • Searches don't span user sessions, so we should calculate user sessions first.
  • Sessions without a search shouldn't contribute.
  • We calculate one metric: click-through rate (CTR), i.e. whether a user clicked on any result in a search.

Imports and Setup

Before we begin, let's import the necessary modules and set up the environment for building the dataflow.

from datetime import datetime, timedelta, timezone
from dataclasses import dataclass
from typing import List

from bytewax.connectors.stdio import StdOutSink
from bytewax.operators.window import EventClockConfig, SessionWindow
from bytewax.operators import window as wop
import bytewax.operators as op
from bytewax.dataflow import Dataflow
from bytewax.testing import TestingSource

In this example, we will define a data model for the incoming events, generate input data to simulate user interactions, and implement logic functions to calculate the Click-Through Rate (CTR) for each search session. We will then create a dataflow to process the incoming event stream and execute it to generate actionable insights.

Data Model

Let's start by defining a data model / schema for our incoming events. We'll make model classes for all the relevant events we'd want to monitor.

@dataclass
class AppOpen:
    user: int
    time: datetime


@dataclass
class Search:
    user: int
    query: str
    time: datetime


@dataclass
class Results:
    user: int
    items: List[str]
    time: datetime


@dataclass
class ClickResult:
    user: int
    item: str
    time: datetime

In a production system, these might come from an external schema or be auto-generated.
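As a sketch of what that deserialization might look like, here is a hypothetical parser that turns a JSON-encoded log line into a Search event. The field names in the sample line are assumptions for illustration; a real pipeline would follow its own log schema.

```python
import json
from dataclasses import dataclass
from datetime import datetime


@dataclass
class Search:
    user: int
    query: str
    time: datetime


def parse_event(line: str) -> Search:
    # Assumed log format: one JSON object per line with
    # "user", "query", and an ISO-8601 "time" field.
    raw = json.loads(line)
    return Search(
        user=raw["user"],
        query=raw["query"],
        time=datetime.fromisoformat(raw["time"]),
    )


event = parse_event('{"user": 1, "query": "dogs", "time": "2024-01-01T00:00:05+00:00"}')
print(event.query)  # -> dogs
```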

Once the data model is defined, we can generate input data to simulate user interactions. This will allow us to test our dataflow and logic functions before deploying them in a live environment. Let's create two users and simulate their click activity as follows:

# The time at which we want all of our windows to align to
align_to = datetime(2024, 1, 1, tzinfo=timezone.utc)

# Simulated events to emit into our dataflow.
# We have two users, each of which performs a search,
# gets results, and clicks on a result:
# User 1 searches for dogs, clicks on rover.
# User 2 searches for cats, clicks on fluffy and kathy.
client_events = [
    Search(user=1, time=align_to + timedelta(seconds=5), query="dogs"),
    Results(
        user=1,
        time=align_to + timedelta(seconds=6),
        items=["fido", "rover", "buddy"],
    ),
    ClickResult(user=1, time=align_to + timedelta(seconds=7), item="rover"),
    Search(user=2, time=align_to + timedelta(seconds=5), query="cats"),
    Results(
        user=2,
        time=align_to + timedelta(seconds=6),
        items=["fluffy", "burrito", "kathy"],
    ),
    ClickResult(user=2, time=align_to + timedelta(seconds=7), item="fluffy"),
    ClickResult(user=2, time=align_to + timedelta(seconds=8), item="kathy"),
]

The client events will constitute the data input for our dataflow, simulating user interactions with the search engine. The events will include user IDs, search queries, search results, and click activity. This data will be used to calculate the Click-Through Rate (CTR) for each search session.

Defining user events, adding events and calculating CTR

We will define two helper functions, user_event and calc_ctr, to process the incoming events and calculate the CTR for each search session.

  1. The user_event function will extract the user ID from the incoming event and use it as the key for grouping the events by user.

def user_event(event):
    return str(event.user), event

  2. The calc_ctr function will calculate the Click-Through Rate (CTR) for each search session based on the click activity in the session.

def calc_ctr(user__search_session):
    user, (window_metadata, search_session) = user__search_session
    searches = [event for event in search_session if isinstance(event, Search)]
    clicks = [event for event in search_session if isinstance(event, ClickResult)]
    # Show the counts of searches and clicks for this session
    print(f"User {user}: {len(searches)} searches, {len(clicks)} clicks")
    if len(searches) == 0:
        return (user, 0)
    return (user, len(clicks) / len(searches))
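Before wiring this into the dataflow, we can sanity-check the ratio logic on plain Python objects. The sketch below re-declares minimal versions of the event classes so it runs standalone; the None passed in place of window metadata is a stand-in for the metadata object the window operator would normally supply.

```python
from dataclasses import dataclass


@dataclass
class Search:
    user: int
    query: str


@dataclass
class ClickResult:
    user: int
    item: str


def calc_ctr(user__search_session):
    user, (window_metadata, search_session) = user__search_session
    searches = [e for e in search_session if isinstance(e, Search)]
    clicks = [e for e in search_session if isinstance(e, ClickResult)]
    if len(searches) == 0:
        return (user, 0)
    return (user, len(clicks) / len(searches))


# One search followed by two clicks yields a CTR of 2.0
session = [
    Search(user=2, query="cats"),
    ClickResult(user=2, item="fluffy"),
    ClickResult(user=2, item="kathy"),
]
print(calc_ctr(("2", (None, session))))  # -> ('2', 2.0)
```

Note that because we count clicks per search, a session with multiple clicks on one search yields a CTR above 1.0, as user 2 demonstrates.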

Creating our Dataflow

A dataflow is the unit of work in Bytewax. Dataflows are data-parallel directed acyclic graphs that are made up of processing steps.

Let's start by creating an empty dataflow.

# Create the Dataflow
flow = Dataflow("search_ctr")

Generating Input Data

Bytewax has a TestingSource class that takes a list of events and emits them, one at a time, into our dataflow. We initialize TestingSource with the client_events list we created earlier.

# Add input source
inp = op.input("inp", flow, TestingSource(client_events))

Mapping user events

We can use op.map("user_event", inp, user_event) to take each event from the input and apply the user_event function. This transforms each event into a format suitable for grouping by user: key-value pairs where the key is the user ID.

# Map user events function
user_event_map = op.map("user_event", inp, user_event)

The role of windowed data in analysis for CTR

We will now turn our attention to windowing the data. In a dataflow pipeline, the role of collecting windowed data, particularly after mapping user events, is crucial for segmenting the continuous stream of events into manageable, discrete chunks based on time or event characteristics. This step enables the aggregation and analysis of events within specific time frames or sessions, which is essential for understanding patterns, behaviors, and trends over time.

After user events are mapped, typically transforming each event into a tuple of (user_id, event_data), the next step is to group these events into windows. In this example, we will use a SessionWindow to group events by user sessions. We will also use an EventClockConfig to manage the timing and order of events as they are processed through the dataflow.

# Configuration for the event-time clock, which reads each
# event's timestamp from its `time` field and waits up to one
# second of system time for late events
event_time_config = EventClockConfig(
    dt_getter=lambda e: e.time,
    wait_for_system_duration=timedelta(seconds=1),
)

# Configuration for the windowing operator: close a session
# after a 10-second gap with no events
clock_config = SessionWindow(gap=timedelta(seconds=10))

  • The EventClockConfig is responsible for managing the timing and order of events as they are processed through the dataflow. It's crucial for ensuring that events are handled accurately in real-time or near-real-time streaming applications.

  • The SessionWindow specifies how to group these timestamped events into sessions. A session window collects all events that occur within a specified gap of each other, allowing for dynamic window sizes based on the flow of incoming data.

These configurations ensure that your dataflow can handle streaming data effectively, capturing user behavior in sessions and calculating relevant metrics like CTR in a way that is timely and reflective of actual user interactions. This setup is ideal for scenarios where user engagement metrics over time are critical, such as in digital marketing analysis, website optimization, or interactive application monitoring.
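To build intuition for what the session windower does, here is a plain-Python sketch of gap-based grouping: timestamps are walked in order, and a new session starts whenever the gap since the previous event exceeds the threshold. This illustrates the concept only; it is not Bytewax's implementation.

```python
from datetime import datetime, timedelta, timezone


def sessionize(times, gap):
    """Group sorted timestamps into sessions separated by more than `gap`."""
    sessions = []
    for t in sorted(times):
        if sessions and t - sessions[-1][-1] <= gap:
            # Within the gap: extend the current session
            sessions[-1].append(t)
        else:
            # Gap exceeded (or first event): start a new session
            sessions.append([t])
    return sessions


align_to = datetime(2024, 1, 1, tzinfo=timezone.utc)
times = [align_to + timedelta(seconds=s) for s in (5, 6, 7, 30)]
grouped = sessionize(times, gap=timedelta(seconds=10))
print([len(s) for s in grouped])  # -> [3, 1]
```

The events at seconds 5, 6, and 7 fall within 10 seconds of each other and form one session; the event at second 30 arrives 23 seconds later, so it opens a new session.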

Once the events are grouped into windows, further processing can be performed on these grouped events, such as calculating metrics like CTR within each session. This step often involves applying additional functions to the windowed data to extract insights, such as counting clicks and searches to compute the CTR.

We can do this as follows:

# Collect events into per-user session windows
window = wop.collect_window(
    "windowed_data",
    user_event_map,
    clock=event_time_config,
    windower=clock_config,
)

# Calculate search CTR per session
calc = op.map("calc_ctr", window, calc_ctr)

Here, we add data windowing as a step in the dataflow after the user events are keyed, then apply our calc_ctr function to each windowed session.

Returning results

Finally, we can add an output step to our dataflow to return the results of the CTR calculation. This step will emit the CTR for each search session, providing a comprehensive overview of user engagement with search results.

# Output the results
op.output("out", calc, StdOutSink())

Execution

Now we're done with defining the dataflow. Let's run it!

> python -m bytewax.run dataflow:flow
User 1: 1 searches, 1 clicks
User 2: 1 searches, 2 clicks
('1', 1.0)
('2', 2.0)

Summary

That's it! You now know how to build custom session windows, define dataclasses for use in Bytewax, and calculate the click-through rate on a stream of search logs.

We want to hear from you!

If you have any trouble with the process or have ideas about how to improve this document, come talk to us in the #troubleshooting Slack channel!

See our full gallery of tutorials →

Share your tutorial progress!
