Skip to content

NYC Taxi Dataset

In this tutorial, we will walk through how to leverage Epsio's incremental materialized views to query the NYC Taxi dataset efficiantly. We will simulate a live workload, explore the dataset, and speed up query performances using Epsio views.

Prerequisites

Before you start, make sure you have:

  • An empty PostgreSQL instance
  • An Epsio instance connected to the PostgreSQL instance

Steps

Step 1: Simulate Workload

How do we simulate the workload?

In the real world, the taxi database isn't static and is updated in real-time. To mimic this situation, we'll use a Python script that replays pre-recorded NYC Taxi data into our database, as if the rides are happening live.

Download the Simulation Script

First, download the script to your local machine. This script populates your PostgreSQL database with the NYC Taxi dataset:

$ curl -O https://epsio-nyc-taxi-demo.s3.us-west-2.amazonaws.com/simulate.py

Run the Simulation

Run the script to simulate taxi data population & streaming into the database:

$ python3 simulate.py <host> <username> <password>


Step 2: Explore the dataset

The dataset is comprised of three main tables: trips, rates, and nyc_zones.

  • trips: Contains detailed records of each taxi ride.

  • rates: Includes information about different rate codes.

  • nyc_zones: Provides details about various pickup and drop-off zones in NYC.

The rides table contains the actual rides data, and has the following schema:

                             Table "public.trips"
        Column         |            Type             | Collation | Nullable | Default 
-----------------------+-----------------------------+-----------+----------+---------
 id                    | integer                     |           |          | 
 vendorid              | integer                     |           |          | 
 tpep_pickup_datetime  | timestamp without time zone |           |          | 
 tpep_dropoff_datetime | timestamp without time zone |           |          | 
 passenger_count       | numeric(16,2)               |           |          | 
 trip_distance         | numeric(16,2)               |           |          | 
 ratecodeid            | numeric(16,2)               |           |          | 
 store_and_fwd_flag    | character(1)                |           |          | 
 pulocationid          | integer                     |           |          | 
 dolocationid          | integer                     |           |          | 
 payment_type          | integer                     |           |          | 
 fare_amount           | numeric(16,2)               |           |          | 
 extra                 | numeric(16,2)               |           |          | 
 mta_tax               | numeric(16,2)               |           |          | 
 tip_amount            | numeric(16,2)               |           |          | 
 tolls_amount          | numeric(16,2)               |           |          | 
 improvement_surcharge | numeric(16,2)               |           |          | 
 total_amount          | numeric(16,2)               |           |          | 
 congestion_surcharge  | numeric(16,2)               |           |          | 
 airport_fee           | numeric(16,2)               |           |          | 

You can find the number of rows (which will be continuously increasing due to the simulation) by running:

SELECT COUNT(*) FROM trips;
Run >
Results:



3. Accelerate Queries with Epsio

Complex SQL queries can be slow to run, especially as data accumulates over time. Epsio can significantly speed up these queries by incremetaly maintaining these querie's results.

Example 1: What is the average trip distance?

To find out, you can run a simple SQL query:

SELECT AVG(trip_distance) FROM trips;
Run >
Results:

Instead of running this query every time you need this information and waiting for the database to recalculate the result from scratch, you can create an Episio view.


Using Epsio to query average trip distance

First, we create a view for the average trip distance:
CALL epsio.create_view('avg_trip_distance', 'SELECT AVG(trip_distance) FROM trips');
Run >
Results:

From now on, we can query the view to get instant and up-to-date result:
SELECT * FROM avg_trip_distance;
Run >
Results:


Example 2: How many rides of each rate type were taken?

In New York City, taxis apply various fare structures depending on the trip type. To find out how many rides of each rate type were taken, we can run the following query:

SELECT rates.description, COUNT(*) AS num_trips FROM trips JOIN rates ON trips.ratecodeid = rates.rate_code GROUP BY rates.description ORDER BY LOWER(rates.description);
Run >
Results:

To make this query significantly faster, we can create an Epsio view that maintains the results of the query.


Using Epsio to maintain rides count per rate type

First, we create a view for the rides count per rate type:
CALL epsio.create_view('rides_per_rate', 'SELECT rates.description, COUNT(*) AS num_trips FROM trips JOIN rates ON trips.ratecodeid = rates.rate_code GROUP BY rates.description');
Run >
Results:

From now on, we can query the view to get instant and up-to-date result:
SELECT * FROM rides_per_rate ORDER BY LOWER(description);
Run >
Results:


Example 3: What are the top 10 zones where the most taxi rides originate from?

To find out the top 10 zones where the most taxi rides originate from, we can run the following query:

SELECT nz.zone, COUNT(*) AS ride_count FROM trips AS t JOIN nyc_zones AS nz ON t.pulocationid = nz.locationid GROUP BY nz.zone ORDER BY ride_count DESC LIMIT 10;
Run >
Results:


Using Epsio to maintain top zones where the most rides originate from

First, we create a view for the top 10 zones where the most taxi rides originate from:
CALL epsio.create_view('top_10_zones', 'SELECT nz.zone, COUNT(*) AS ride_count FROM trips AS t JOIN nyc_zones AS nz ON t.pulocationid = nz.locationid GROUP BY nz.zone ORDER BY ride_count DESC LIMIT 10');
Run >
Results:

From now on, we can query the view to get instant and up-to-date result:
SELECT * FROM top_10_zones ORDER BY ride_count DESC;
Run >
Results:

Summary

In this tutorial, we explored the NYC Taxi dataset and demonstrated how Epsio can dramatically reduce query times.

Here are some sample performance results:

PostgreSQL Epsio
Q1 3005.418 ms 0.734 ms
Q2 5964.768 ms 1.108 ms
Q3 4989.151 ms 2.145 ms