NYC Taxi Dataset¶
In this tutorial, we will walk through how to leverage Epsio's incremental materialized views to query the NYC Taxi dataset efficiantly. We will simulate a live workload, explore the dataset, and speed up query performances using Epsio views.
Prerequisites¶
Before you start, make sure you have:
- An empty PostgreSQL instance
- An Epsio instance connected to the PostgreSQL instance
Steps¶
Step 1: Simulate Workload¶
How do we simulate the workload?¶
In the real world, the taxi database isn't static and is updated in real-time. To mimic this situation, we'll use a Python script that replays pre-recorded NYC Taxi data into our database, as if the rides are happening live.
Download the Simulation Script¶
First, download the script to your local machine. This script populates your PostgreSQL database with the NYC Taxi dataset:
Run the Simulation¶
Run the script to simulate taxi data population & streaming into the database:
Step 2: Explore the dataset¶
The dataset is comprised of three main tables: trips
, rates
, and nyc_zones
.
-
trips
: Contains detailed records of each taxi ride. -
rates
: Includes information about different rate codes. -
nyc_zones
: Provides details about various pickup and drop-off zones in NYC.
The rides
table contains the actual rides data, and has the following schema:
Table "public.trips"
Column | Type | Collation | Nullable | Default
-----------------------+-----------------------------+-----------+----------+---------
id | integer | | |
vendorid | integer | | |
tpep_pickup_datetime | timestamp without time zone | | |
tpep_dropoff_datetime | timestamp without time zone | | |
passenger_count | numeric(16,2) | | |
trip_distance | numeric(16,2) | | |
ratecodeid | numeric(16,2) | | |
store_and_fwd_flag | character(1) | | |
pulocationid | integer | | |
dolocationid | integer | | |
payment_type | integer | | |
fare_amount | numeric(16,2) | | |
extra | numeric(16,2) | | |
mta_tax | numeric(16,2) | | |
tip_amount | numeric(16,2) | | |
tolls_amount | numeric(16,2) | | |
improvement_surcharge | numeric(16,2) | | |
total_amount | numeric(16,2) | | |
congestion_surcharge | numeric(16,2) | | |
airport_fee | numeric(16,2) | | |
You can find the number of rows (which will be continuously increasing due to the simulation) by running:
3. Accelerate Queries with Epsio¶
Complex SQL queries can be slow to run, especially as data accumulates over time. Epsio can significantly speed up these queries by incremetaly maintaining these querie's results.
Example 1: What is the average trip distance?¶
To find out, you can run a simple SQL query:
Instead of running this query every time you need this information and waiting for the database to recalculate the result from scratch, you can create an Episio view.
Using Epsio to query average trip distance
First, we create a view for the average trip distance:From now on, we can query the view to get instant and up-to-date result:
Example 2: How many rides of each rate type were taken?¶
In New York City, taxis apply various fare structures depending on the trip type. To find out how many rides of each rate type were taken, we can run the following query:
SELECT rates.description, COUNT(*) AS num_trips
FROM trips
JOIN rates ON trips.ratecodeid = rates.rate_code
GROUP BY rates.description
ORDER BY LOWER(rates.description);
To make this query significantly faster, we can create an Epsio view that maintains the results of the query.
Using Epsio to maintain rides count per rate type
First, we create a view for the rides count per rate type:CALL epsio.create_view('rides_per_rate',
'SELECT rates.description, COUNT(*) AS num_trips
FROM trips
JOIN rates ON trips.ratecodeid = rates.rate_code
GROUP BY rates.description');
From now on, we can query the view to get instant and up-to-date result:
Example 3: What are the top 10 zones where the most taxi rides originate from?¶
To find out the top 10 zones where the most taxi rides originate from, we can run the following query:
SELECT nz.zone, COUNT(*) AS ride_count
FROM trips AS t
JOIN nyc_zones AS nz ON t.pulocationid = nz.locationid
GROUP BY nz.zone
ORDER BY ride_count DESC LIMIT 10;
Using Epsio to maintain top zones where the most rides originate from
First, we create a view for the top 10 zones where the most taxi rides originate from:CALL epsio.create_view('top_10_zones',
'SELECT nz.zone, COUNT(*) AS ride_count
FROM trips AS t
JOIN nyc_zones AS nz ON t.pulocationid = nz.locationid
GROUP BY nz.zone
ORDER BY ride_count DESC LIMIT 10');
From now on, we can query the view to get instant and up-to-date result:
Summary¶
In this tutorial, we explored the NYC Taxi dataset and demonstrated how Epsio can dramatically reduce query times.
Here are some sample performance results:
PostgreSQL | Epsio | |
---|---|---|
Q1 | 3005.418 ms | 0.734 ms |
Q2 | 5964.768 ms | 1.108 ms |
Q3 | 4989.151 ms | 2.145 ms |