Lesson 5	What is an index?
Objective	Use Database Table Index to speed up your Queries.

Database Table Index (How Indexes Speed Up SQL Queries)

As your database tables grow from hundreds of rows to millions, the speed of your SQL queries becomes a critical concern. A query that runs in milliseconds on a small table can take seconds or minutes on a large one — unless the table has the right indexes. This lesson explains what a database index is, how it works internally, when to create one, and how to write the CREATE INDEX statement. Lesson 6 covers the specifics of creating indexes on individual columns.

What Is a Database Index?

The Book Index Analogy

A database index works in the same way as the index at the back of a textbook. When you want to find every page that discusses "normalization," you do not read the entire book from page 1. You turn to the index, look up "normalization" in the alphabetically sorted list, find the page numbers, and go directly to those pages. The index makes the lookup fast because the entries are sorted — you can find any term in seconds using binary search rather than reading every page sequentially.

A database index works the same way. It is a separate, sorted data structure that the database engine maintains alongside the table. When a query includes a WHERE clause on an indexed column, the engine consults the index first — finding the matching values in the sorted index structure — and then jumps directly to the corresponding rows in the table. Without an index, the engine must read every row in the table to find the ones that match the WHERE condition.

The File Cabinet Analogy

Consider a file cabinet where folders are stored in random order. Finding a specific folder requires checking every folder in the cabinet from front to back — a sequential scan that gets slower as the cabinet fills up. Now imagine sorting the folders alphabetically and adding labeled dividers at each letter. To find a folder starting with "K," you go directly to the K divider and search only that section. The dividers are the index — a sorted guide that eliminates the need to check every folder.
In a SQL database, the same principle applies. You decide how you will be querying the data — which columns appear in WHERE clauses, JOIN conditions, and ORDER BY clauses — and you create indexes on those columns to give the engine a sorted guide to follow instead of scanning every row.

How the Database Engine Uses an Index

Once an index is created, the database engine manages it automatically. You do not call the index in your SQL statements — you continue writing the same SELECT, WHERE, and JOIN syntax you always use. The engine's query planner examines your query, determines whether an available index can speed up the execution, and uses it if appropriate. The index is invisible to the SQL developer during normal query writing; its effect is felt only in execution speed.

As you work with SQL, you will find yourself paying increasing attention to query performance. You can optimize queries by selecting only the columns you need, reducing the number of rows returned, and simplifying joins — but for large tables, an index on the right column is often the single most effective performance improvement available.

How Indexes Work — The B-Tree Structure

Full Table Scan vs Index Lookup

When no index exists on the column referenced in a WHERE clause, the database engine performs a full table scan — it reads every row in the table from the first to the last, evaluating the WHERE condition against each one. For a table with five rows, this is trivial. For a table with five million rows, a full table scan that examines every row to find a handful of matches is extremely expensive in both time and I/O.

The diagram below (Figure 4-5.1) shows the contrast directly. On the left, the engine scans all five rows sequentially to find D. Kim — even though the target is row 4, the engine has no choice but to check rows 1, 2, and 3 first. On the right, the engine consults the INDEX (CUSTOMER) sorted list, finds "D. Kim → Row 4" immediately, and jumps directly to row 4 in the table. Rows 1, 2, 3, and 5 are never read.

Why Sorted Order Enables Fast Lookup

The reason an index makes lookup fast is that sorted data supports binary search. In a sorted list of n entries, binary search finds any entry in at most log₂(n) comparisons — a table with one million rows requires at most 20 comparisons to locate any value. An unsorted full scan requires up to one million comparisons. The index achieves this by maintaining its entries in sorted order and organizing them into a B-tree (balanced tree) structure, where each comparison eliminates half the remaining candidates.

Most relational databases implement indexes as B-trees by default — PostgreSQL, MySQL InnoDB, SQL Server, and Oracle all use B-tree indexes unless a different index type is explicitly requested. The B-tree structure guarantees that lookup, insert, and delete operations on the index all complete in O(log n) time regardless of table size.

Two-panel diagram comparing a full table scan with no index against a direct index lookup — the index jumps to row 4 (D. Kim) without reading the other rows — Figure 4-5.1: Without an index, the database reads every row to find matches. With an index, it jumps directly to the relevant rows.

The CREATE INDEX Statement

Basic Syntax

The CREATE INDEX statement creates a new index on one or more columns of an existing table. The basic syntax is:

CREATE INDEX index_name
ON table_name (column1, column2, ...);

The index name should be descriptive and follow a consistent naming convention. A common convention is to prefix the index name with idx_ followed by the table name and column name, making it clear what the index covers at a glance.

Single-Column Index

A single-column index is the most common type. It indexes one column and speeds up queries that filter, sort, or join on that column:

CREATE INDEX idx_user_email
ON users (email);

This creates an index named idx_user_email on the email column in the users table. After this index is created, any query with WHERE email = 'someone@example.com' can use the index to locate matching rows directly rather than scanning the entire users table.

CREATE INDEX idx_product_description
ON Products (Product_Description);

This creates an index on the Product_Description column in the Products table. Queries that search for products by description — WHERE Product_Description LIKE 'Widget%' — benefit from this index.

Composite Index — Column Order Matters

A composite index covers multiple columns. It is useful when queries frequently filter or sort on a combination of columns together:

CREATE INDEX idx_user_name_birthday
ON users (last_name, first_name, birthday);

This composite index speeds up queries that filter on last_name alone, or on last_name and first_name together, or on all three columns together. The order of columns in the index definition matters: the index is sorted first by last_name, then by first_name within each last_name group, then by birthday within each name group. A query that filters only on first_name — without last_name in the WHERE clause — cannot use this index efficiently, because first_name is not the leading column.

The general rule for composite index column order: place the most selective column first (the one that eliminates the most rows), and align the column order with the most common query patterns against that table.

CREATE INDEX idx_product
ON Products (Product_ID, Product_Description);

Types of Indexes

Unique Index

A unique index enforces uniqueness on the indexed column — no two rows can have the same value in that column. If you attempt to insert a duplicate value into a column with a unique index, the database engine rejects the INSERT with a constraint violation error:

CREATE UNIQUE INDEX idx_unique_email
ON users (email);

A unique index on email guarantees that no two users share the same email address. This is enforced automatically at the database level, independent of application code.

Note that defining a PRIMARY KEY in CREATE TABLE automatically creates a unique index on the primary key column — you do not need to create a separate unique index for the primary key.

Index on Expressions

Some databases support creating indexes on expressions or functions applied to column values. PostgreSQL and MySQL 8.0+ support functional indexes:

CREATE INDEX idx_upper_lastname
ON users (UPPER(last_name));

This index is used by queries that search on the uppercase version of last_name — for example, WHERE UPPER(last_name) = 'SMITH' — enabling case-insensitive name searches to use an index rather than scanning every row and applying the UPPER function to each one.

Clustered vs Non-Clustered (SQL Server Specific)

SQL Server distinguishes between two index types that other databases handle differently. A clustered index physically reorders the table's rows to match the index order — SQL Server allows only one clustered index per table, and the primary key is clustered by default. A non-clustered index maintains a separate sorted structure with pointers back to the table rows, without altering the physical row order.

-- SQL Server only — not standard SQL
CREATE CLUSTERED INDEX idx_clustered_id
ON users (id);

CREATE NONCLUSTERED INDEX idx_nonclustered_name
ON users (last_name);

In PostgreSQL, MySQL, and most other databases, the distinction is handled differently — the primary key is the clustered key implicitly, and all other indexes are non-clustered by default. The CLUSTERED and NONCLUSTERED keywords are SQL Server-specific and do not transfer to other database engines.

The Primary Key as an Automatic Index

When you define a PRIMARY KEY in CREATE TABLE — as covered in lesson 4 — the database engine automatically creates a unique index on the primary key column. This is why primary key lookups are always fast: the primary key column is always indexed, and the index is unique, so the engine can find any row by its primary key in O(log n) time without any additional CREATE INDEX statement.

When to Create an Index

Columns That Benefit from Indexing

The columns that benefit most from indexes are those that appear frequently in WHERE clauses, JOIN conditions, and ORDER BY clauses — the columns the engine uses to filter, connect, and sort data. Analyzing your slow queries and identifying which columns appear in these positions is the most reliable way to determine where indexes will help.

Specifically, consider creating an index on a column when: it is used in a WHERE clause with equality or range conditions; it is the join column in a JOIN operation; it appears in an ORDER BY clause and the query returns a large result set; or it is a foreign key column that is frequently used to look up related records in the parent table.

Columns That Do NOT Benefit from Indexing

Not every column is a good index candidate. Low-cardinality columns — columns with very few distinct values — rarely benefit from indexes. A column that stores only three possible values (for example, an order status of PENDING, PROCESSING, or SHIPPED) has low cardinality. An index on a low-cardinality column may actually be slower than a full table scan, because the index points to a large fraction of the table's rows and the engine ends up reading nearly everything anyway.

Columns that are rarely used in WHERE, JOIN, or ORDER BY clauses offer no query performance benefit from an index — they only add write overhead. Very small tables — where a full scan reads only a few pages — also rarely benefit from indexes, because the overhead of consulting the index structure exceeds the cost of the scan itself.

How Many Indexes Is Too Many

Every index on a table must be updated whenever a row is inserted, updated, or deleted. A table with ten indexes requires ten index updates for every single INSERT — the same data modification that updates one row in the base table also modifies ten separate index structures. On tables with high write volumes (high-frequency INSERT, UPDATE, or DELETE operations), excessive indexes slow write performance significantly and consume additional storage space.

The practical guideline is to index the columns your most important and most frequent queries depend on, and no more. Measure query performance before and after adding an index to confirm it has the desired effect. Remove indexes that are never used by the query planner — most database systems provide system views or tools that report index usage statistics.

Index Maintenance and the Write Overhead

INSERT, UPDATE, DELETE Cost

When a row is inserted into a table, the database engine inserts a corresponding entry into every index defined on that table. When a row is updated and an indexed column's value changes, the engine removes the old index entry and inserts a new one. When a row is deleted, the engine removes its entry from every index. These index maintenance operations are performed automatically and transparently — but they consume time and I/O proportional to the number of indexes on the table.

A table used primarily for reading — a reporting table, a historical archive, a read-only reference table — can be indexed aggressively because writes are infrequent and index maintenance cost is low. A table with extremely high insert rates — a transaction log, a clickstream table, a real-time sensor data table — benefits from minimal indexing to avoid the write overhead becoming a bottleneck.

Keeping Statistics Up to Date

The query planner uses statistics about the distribution of data in each column to decide whether to use an index for a given query. These statistics include the number of distinct values in a column, the distribution of those values, and the total row count. If statistics are stale — reflecting the table's state from weeks ago rather than its current state — the planner may make poor decisions, choosing a full table scan when an index would be faster, or using an index when a full scan would be faster.

Most database systems update statistics automatically in the background (PostgreSQL autovacuum, MySQL InnoDB automatic statistics, SQL Server auto update statistics). After large bulk data loads or significant table modifications, it is good practice to manually trigger a statistics update — ANALYZE table_name in PostgreSQL, ANALYZE TABLE table_name in MySQL, or UPDATE STATISTICS table_name in SQL Server — to ensure the planner has accurate information.

Using EXPLAIN to Verify Index Usage

The EXPLAIN statement (or EXPLAIN ANALYZE in PostgreSQL) shows the query execution plan the engine chose for a given SQL statement — including whether it used an index or performed a full table scan. Running EXPLAIN before and after creating an index is the definitive way to confirm that the index is being used and is having the intended effect:

-- PostgreSQL / MySQL
EXPLAIN SELECT * FROM users WHERE email = 'gauss@example.com';

-- PostgreSQL with actual execution statistics
EXPLAIN ANALYZE SELECT * FROM users WHERE email = 'gauss@example.com';

The output shows the access method (Index Scan vs Seq Scan in PostgreSQL, index vs ALL in MySQL), the index name used, the estimated and actual row counts, and the execution time. An Index Scan in the EXPLAIN output confirms the index is being used; a Seq Scan (sequential scan) means the planner chose a full table scan — either because no suitable index exists, the index is not selective enough, or the table is too small for the index overhead to be worthwhile.

Lesson 6 covers the CREATE INDEX statement in detail, including how to create indexes on specific columns of the tables you have built in this module.