A DuckDB extension for reading Microsoft PST files with rich schemas for common MAPI types, built on Microsoft's official PST SDK. Query emails, contacts, appointments (and others). Use it to analyze PST data in-place (locally, or on object storage), import to DuckDB tables, or export to Parquet.
Quickly count all messages or folders in a directory full of PSTs (171 files, 77.4 GiB):
D select count(*) from read_pst_messages('enron/*.pst');
┌────────────────┐
│ count_star() │
│ int64 │
├────────────────┤
│ 1227193 │
│ (1.23 million) │
└────────────────┘
Run Time (s): real 0.534 user 0.466194 sys 0.650549What kinds of objects are in this PST?
select message_class, count(*) as c from read_pst_messages('test/*.pst') group by message_class order by c desc;
┌─────────────────┬───────┐
│ message_class │ c │
│ varchar │ int64 │
├─────────────────┼───────┤
│ IPM.Note │ 5 │
│ IPM.Contact │ 2 │
│ IPM.StickyNote │ 2 │
│ IPM.DistList │ 1 │
│ IPM.Task │ 1 │
│ IPM.Appointment │ 1 │
└─────────────────┴───────┘Read the first 5 messages (with limit applied during planning -- for large files):
memory D select * from read_pst_messages('enron/*.pst', read_limit=5);
┌────────────────────────┬────────────────┬──────────────────────┬─────────┬───┬──────────────┬──────────────────────┬──────────────────────┐
│ pst_path │ pst_name │ record_key │ node_id │ … │ message_size │ conversation_topic │ internet_message_id │
│ varchar │ varchar │ blob │ uint32 │ … │ uint64 │ varchar │ varchar │
├────────────────────────┼────────────────┼──────────────────────┼─────────┼───┼──────────────┼──────────────────────┼──────────────────────┤
│ enron/zl_delainey-d_0… │ delainey-d_000 │ \xD5\xEF\xD5\x86\x8… │ 2097316 │ … │ 15603 │ Construction Servic… │ <8a905b300bb6184fa2… │
│ enron/zl_delainey-d_0… │ delainey-d_000 │ \xD5\xEF\xD5\x86\x8… │ 2097284 │ … │ 2671 │ Mike McConnell 3-14… │ <fcd814da82d47f4795… │
│ enron/zl_delainey-d_0… │ delainey-d_000 │ \xD5\xEF\xD5\x86\x8… │ 2097252 │ … │ 2641 │ Mark Lindsey Mtg │ <fcd814da82d47f4795… │
│ enron/zl_delainey-d_0… │ delainey-d_000 │ \xD5\xEF\xD5\x86\x8… │ 2097220 │ … │ 2659 │ Jim Fallon (called) │ <fcd814da82d47f4795… │
│ enron/zl_delainey-d_0… │ delainey-d_000 │ \xD5\xEF\xD5\x86\x8… │ 2097188 │ … │ 2653 │ Call Jeff Shankman │ <fcd814da82d47f4795… │
├────────────────────────┴────────────────┴──────────────────────┴─────────┴───┴──────────────┴──────────────────────┴──────────────────────┤
│ 5 rows 26 columns (7 shown) │
└───────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────┘
Run Time (s): real 0.012 user 0.015968 sys 0.015498Have a remote URI for your PST? No problem:
select given_name, surname from read_pst_contacts('https://github.com/intellekthq/duckdb-pst/raw/refs/heads/main/test/unittest.pst');
┌────────────┬─────────┐
│ given_name │ surname │
│ varchar │ varchar │
├────────────┼─────────┤
│ Hopper │ Cat │
│ Linus │ Cat │
└────────────┴─────────┘See Common Queries for more examples of common PST analysis tasks.
PSTs have many database-like properties, allowing us to leverage advanced DuckDB features to enable performant reads:
- Query pushdown: projection and statistics pushdown
- Concurrent planning: parallel partition planning for directories with many PST files
- Late materialization: filter on virtual columns before expanding full projections (example)
- Progress tracking: implements progress API for monitoring large scans
The extension provides specialized table functions for different MAPI message types:
read_pst_messages - Returns all messages with the base IPM.Note schema. Use this for aggregate queries or when you need all message types (check the message_class column to determine specific types).
Type-specific functions (read_pst_contacts, read_pst_appointments, etc.) - Filter messages by type during query planning. These inherit all base IPM.Note fields plus additional fields for their specific type. Planning is slower due to filtering (unfortunately, this requires a comparison of the PR_MESSAGE_CLASS string), but you get a richer schema and reduced result set.
| Table Function | MAPI Message Class | Description |
|---|---|---|
read_pst_folders |
* |
Folders |
read_pst_messages |
* |
All messages, with base IPM.Note email projection |
read_pst_notes |
IPM.Note |
(Filtered) only IPM.Note (and unimplemented types) |
read_pst_contacts |
IPM.Contact |
(Filtered) only contacts with contact-specific fields |
read_pst_distribution_lists |
IPM.DistList |
(Filtered) distribution lists with member information |
read_pst_appointments |
IPM.Appointment |
(Filtered) only calendar appointments and meetings |
read_pst_sticky_notes |
IPM.StickyNote |
(Filtered) only sticky note items |
read_pst_tasks |
IPM.Task |
(Filtered) task items with task-specific fields |
To observe any read issues, enable the DuckDB logger. Errors typically reference an individual file or column:
CALL enable_logging(level = 'debug');
SELECT * from duckdb_logs();Note: Large directories with large PST files may trip your ulimit. Check the logs and adjust accordingly. If unable to adjust the system ulimit, you can try using the planning_concurrency parameter, or DuckDB's SET threads = n;.
All table functions accept the following named parameters. Note that by default message bodies are truncated to 1M and attachment contents are not read.
| Parameter | Default | Description |
|---|---|---|
read_body_size_bytes |
1000000 |
Maximum bytes to read into body and body_html. Set to 0 to read all. |
read_attachment_body |
false |
Whether to read attachment bytes into the bytes field |
read_limit |
NULL |
Maximum number of items to read (applied during planning, stops crawling fs) |
planning_concurrency |
UINT32_MAX |
Maximum concurrent async tasks during partition planning (applies when globbing multiple PST files). |
All table functions return PST metadata fields. Message-based functions inherit base IPM.Note fields plus type-specific additions.
Table of Contents:
- Common PST Metadata - Fields present in all table functions
- Folders - Folder-specific fields
- Base Messages - Core message fields (emails, notes)
- Contacts - Contact-specific fields
- Distribution Lists - Distribution list fields
- Appointments - Calendar/meeting fields
- Sticky Notes - Sticky note fields
- Tasks - Task management fields
- Struct Schemas - Schemas for recipients, attachments, and one-off members
- Virtual Columns - PST internal metadata for filtering and optimization
| Field | Type | Description |
|---|---|---|
pst_path |
VARCHAR |
Full path to the PST file |
pst_name |
VARCHAR |
PST filename without extension |
record_key |
BLOB |
Unique record identifier |
node_id |
UINTEGER |
Node ID within PST |
parent_node_id |
UINTEGER |
Parent node ID |
Includes PST metadata plus:
| Field | Type | Description |
|---|---|---|
container_class |
VARCHAR |
Folder container class |
display_name |
VARCHAR |
Folder display name |
subfolder_count |
UINTEGER |
Number of subfolders |
message_count |
BIGINT |
Total messages in folder |
unread_message_count |
BIGINT |
Unread message count |
Includes PST metadata plus:
| Field | Type | Description |
|---|---|---|
subject |
VARCHAR |
Message subject |
body |
VARCHAR |
Plain text body |
body_html |
VARCHAR |
HTML body |
display_name |
VARCHAR |
Display name |
comment |
VARCHAR |
Comment field |
sender_name |
VARCHAR |
Sender display name |
sender_email_address |
VARCHAR |
Sender email |
recipients |
LIST(STRUCT) |
List of recipients |
has_attachments |
BOOLEAN |
Whether message has attachments |
attachment_count |
UINTEGER |
Number of attachments |
attachments |
LIST(STRUCT) |
List of attachments |
importance |
ENUM |
Message importance: LOW, NORMAL, HIGH |
priority |
ENUM |
Message priority: NONURGENT, NORMAL, URGENT |
sensitivity |
ENUM |
Message sensitivity: NONE, PERSONAL, PRIVATE, CONFIDENTIAL |
creation_time |
TIMESTAMP_S |
Creation timestamp |
last_modified |
TIMESTAMP_S |
Last modification timestamp |
message_delivery_time |
TIMESTAMP_S |
Delivery timestamp |
message_class |
VARCHAR |
MAPI message class (e.g., IPM.Note, IPM.Contact) |
message_flags |
INTEGER |
Message flags |
message_size |
UBIGINT |
Message size in bytes |
conversation_topic |
VARCHAR |
Conversation topic |
internet_message_id |
VARCHAR |
Internet message ID |
Includes PST metadata + base message fields + contact-specific fields:
78 contact-specific fields (click to expand)
| Field | Type | Description |
|---|---|---|
display_name_prefix |
VARCHAR |
Name prefix (Mr., Mrs., etc.) |
given_name |
VARCHAR |
First name |
middle_name |
VARCHAR |
Middle name |
surname |
VARCHAR |
Last name |
generation_suffix |
VARCHAR |
Suffix (Jr., Sr., etc.) |
initials |
VARCHAR |
Initials |
nickname |
VARCHAR |
Nickname |
preferred_by_name |
VARCHAR |
Preferred name |
account_name |
VARCHAR |
Account name |
original_display_name |
VARCHAR |
Original display name |
transmittable_display_name |
VARCHAR |
Transmittable display name |
mhs_common_name |
VARCHAR |
MHS common name |
government_id_number |
VARCHAR |
Government ID number |
organizational_id_number |
VARCHAR |
Organizational ID number |
birthday |
TIMESTAMP_S |
Birthday |
wedding_anniversary |
TIMESTAMP_S |
Wedding anniversary |
spouse_name |
VARCHAR |
Spouse's name |
childrens_names |
VARCHAR |
Children's names |
gender |
SMALLINT |
Gender |
hobbies |
VARCHAR |
Hobbies |
profession |
VARCHAR |
Profession |
language |
VARCHAR |
Language |
location |
VARCHAR |
Location |
keyword |
VARCHAR |
Keyword |
company_name |
VARCHAR |
Company name |
title |
VARCHAR |
Job title |
department_name |
VARCHAR |
Department name |
office_location |
VARCHAR |
Office location |
manager_name |
VARCHAR |
Manager's name |
assistant |
VARCHAR |
Assistant's name |
customer_id |
VARCHAR |
Customer ID |
primary_telephone |
VARCHAR |
Primary telephone |
business_telephone |
VARCHAR |
Business telephone |
business_telephone_2 |
VARCHAR |
Business telephone 2 |
home_telephone |
VARCHAR |
Home telephone |
home_telephone_2 |
VARCHAR |
Home telephone 2 |
mobile_telephone |
VARCHAR |
Mobile telephone |
car_telephone |
VARCHAR |
Car telephone |
radio_telephone |
VARCHAR |
Radio telephone |
pager_telephone |
VARCHAR |
Pager telephone |
callback_number |
VARCHAR |
Callback number |
other_telephone |
VARCHAR |
Other telephone |
assistant_telephone |
VARCHAR |
Assistant telephone |
company_main_phone |
VARCHAR |
Company main phone |
ttytdd_phone |
VARCHAR |
TTY/TDD phone |
isdn_number |
VARCHAR |
ISDN number |
telex_number |
VARCHAR |
Telex number |
primary_fax |
VARCHAR |
Primary fax |
business_fax |
VARCHAR |
Business fax |
home_fax |
VARCHAR |
Home fax |
business_address_street |
VARCHAR |
Business street address |
business_address_city |
VARCHAR |
Business city |
business_address_state |
VARCHAR |
Business state |
business_postal_code |
VARCHAR |
Business postal code |
business_address_country |
VARCHAR |
Business country |
business_po_box |
VARCHAR |
Business PO box |
home_address_street |
VARCHAR |
Home street address |
home_address_city |
VARCHAR |
Home city |
home_address_state |
VARCHAR |
Home state |
home_address_postal_code |
VARCHAR |
Home postal code |
home_address_country |
VARCHAR |
Home country |
home_address_po_box |
VARCHAR |
Home PO box |
other_address_street |
VARCHAR |
Other street address |
other_address_city |
VARCHAR |
Other city |
other_address_state |
VARCHAR |
Other state |
other_address_postal_code |
VARCHAR |
Other postal code |
other_address_country |
VARCHAR |
Other country |
other_address_po_box |
VARCHAR |
Other PO box |
postal_address |
VARCHAR |
Postal address |
personal_home_page |
VARCHAR |
Personal home page |
business_home_page |
VARCHAR |
Business home page |
ftp_site |
VARCHAR |
FTP site |
computer_network_name |
VARCHAR |
Computer network name |
mail_permission |
BOOLEAN |
Mail permission |
send_rich_info |
BOOLEAN |
Send rich info |
conversation_prohibited |
BOOLEAN |
Conversation prohibited |
disclose_recipients |
BOOLEAN |
Disclose recipients |
Includes PST metadata + base message fields + distribution list-specific fields:
| Field | Type | Description |
|---|---|---|
member_node_ids |
LIST(UINTEGER) |
Node IDs of contact members within the PST |
one_off_members |
LIST(STRUCT) |
List of (one-off member struct) |
Includes PST metadata + base message fields + appointment-specific fields:
| Field | Type | Description |
|---|---|---|
location |
VARCHAR |
Meeting location |
start_time |
TIMESTAMP_S |
Start time |
end_time |
TIMESTAMP_S |
End time |
duration |
INTEGER |
Duration in minutes |
all_day_event |
BOOLEAN |
All-day event flag |
is_meeting |
BOOLEAN |
Is a meeting |
organizer_name |
VARCHAR |
Organizer's name |
required_attendees |
VARCHAR |
Required attendees |
optional_attendees |
VARCHAR |
Optional attendees |
meeting_workspace_url |
VARCHAR |
Meeting workspace URL |
busy_status |
INTEGER |
Busy status |
response_status |
INTEGER |
Response status |
is_recurring |
BOOLEAN |
Recurring event flag |
recurrence_pattern |
VARCHAR |
Recurrence pattern |
is_private |
BOOLEAN |
Private event flag |
Includes PST metadata + base message fields + sticky note-specific fields:
| Field | Type | Description |
|---|---|---|
note_color |
INTEGER |
Note color code |
note_width |
INTEGER |
Note width |
note_height |
INTEGER |
Note height |
note_x |
INTEGER |
X position |
note_y |
INTEGER |
Y position |
Includes PST metadata + base message fields + task-specific fields:
| Field | Type | Description |
|---|---|---|
task_status |
INTEGER |
Task status code |
is_complete |
BOOLEAN |
Completion flag |
percent_complete |
DOUBLE |
Completion percentage |
start_date |
TIMESTAMP_S |
Start date |
due_date |
TIMESTAMP_S |
Due date |
date_completed |
TIMESTAMP_S |
Completion date |
last_update |
TIMESTAMP_S |
Last update time |
estimated_effort |
INTEGER |
Estimated effort (minutes) |
actual_effort |
INTEGER |
Actual effort (minutes) |
task_owner |
VARCHAR |
Task owner |
task_assigner |
VARCHAR |
Task assigner |
ownership |
INTEGER |
Ownership code |
last_user |
VARCHAR |
Last user to modify |
is_team_task |
BOOLEAN |
Team task flag |
is_recurring |
BOOLEAN |
Recurring task flag |
The following struct types are used in list fields throughout the message schemas:
Used in the recipients field. Each recipient contains:
| Field | Type | Description |
|---|---|---|
display_name |
VARCHAR |
Display name of the recipient |
account_name |
VARCHAR |
Account name |
email_address |
VARCHAR |
Email address |
address_type |
VARCHAR |
Address type (e.g., "SMTP", "EX") |
recipient_type |
ENUM |
Recipient type: TO, CC, BCC |
recipient_type_raw |
INTEGER |
Raw recipient type value |
Used in the attachments field. Each attachment contains:
| Field | Type | Description |
|---|---|---|
filename |
VARCHAR |
Attachment filename |
mime_type |
VARCHAR |
MIME type of the attachment |
size |
UBIGINT |
Attachment size in bytes |
attach_content_id |
VARCHAR |
Content ID for inline attachments |
attach_method |
ENUM |
Attachment method: NO_ATTACHMENT, BY_VALUE, BY_REFERENCE, BY_REF_RESOLVE, BY_REF_ONLY, EMBEDDED_MESSAGE, OLE |
is_message |
BOOLEAN |
Whether attachment is an embedded message |
bytes |
BLOB |
Raw attachment data |
Used in the one_off_members field of distribution lists. Each one-off member contains:
| Field | Type | Description |
|---|---|---|
display_name |
VARCHAR |
Display name (e.g., "John Doe (john@example.com)") |
address_type |
VARCHAR |
Address type (typically "SMTP") |
email_address |
VARCHAR |
Email address |
Virtual columns provide access to PST internal metadata without materializing them in the default projection. These columns are available in all table functions and are primarily used by the late materialization optimizer to cull scans. However, if you know the specific NIDs of one or more objects within a PST, you can use __node_id in a filter predicate to push down an efficient scan for only those objects.
| Field | Type | Description |
|---|---|---|
__partition |
UBIGINT |
Nondeterministic internal partition number assigned during query planning |
__node_id |
UINTEGER |
Node ID within the PST (duplicate of the node_id metadata field) |
These queries represent common use-cases for traversing PST files or doing basic e-discovery.
Table of Contents:
- Find messages matching a conversation topic
- Select directory tree from a given folder
- Find all parent directories of a given folder
This pushes down a late-materialized scan against a large directory of PST files:
select pst_name, conversation_topic, message_size, creation_time
from read_pst_messages('enron/*.pst')
where conversation_topic LIKE '%EnronOnline%' order by creation_time asc limit 10;
┌────────────────┬────────────────────────────────────────────────┬──────────────┬─────────────────────┐
│ pst_name │ conversation_topic │ message_size │ creation_time │
│ varchar │ varchar │ uint64 │ timestamp_s │
├────────────────┼────────────────────────────────────────────────┼──────────────┼─────────────────────┤
│ skilling-j_000 │ EnronOnline Executive Summary for May 15, 2001 │ 26953 │ 2010-06-17 04:39:00 │
│ skilling-j_000 │ EnronOnline - Dec 14 Mgt Report │ 10255 │ 2010-06-17 04:39:08 │
│ skilling-j_000 │ EnronOnline - Dec 14 Mgt Report │ 10253 │ 2010-06-17 04:39:08 │
│ rogers-b_000 │ EnronOnline Desk to Desk ID and Password │ 7266 │ 2010-06-17 04:39:23 │
│ rogers-b_000 │ EnronOnline Desk to Desk ID and Password │ 8750 │ 2010-06-17 04:39:26 │
│ jones-t_000 │ EnronOnline Discussion │ 17672 │ 2010-06-17 04:39:32 │
│ beck-s_000 │ Storing of data on EnronOnline │ 6833 │ 2010-06-17 04:39:41 │
│ jones-t_000 │ Design Agency for EnronOnline │ 7442 │ 2010-06-17 04:40:16 │
│ jones-t_000 │ Design Agency for EnronOnline │ 11830 │ 2010-06-17 04:40:17 │
│ jones-t_000 │ EnronOnline and Offline NDA Lists │ 387401 │ 2010-06-17 04:41:21 │
├────────────────┴────────────────────────────────────────────────┴──────────────┴─────────────────────┤
│ 10 rows 4 columns │
└──────────────────────────────────────────────────────────────────────────────────────────────────────┘
Run Time (s): real 12.695 user 22.185424 sys 7.577354To trigger late materialization, a LIMIT clause must generally be present in your query. Limit, Top N, and sample plans are the only ones eligible for LM pushdown. To see whether or not a query will push down a late-materialized scan, use EXPLAIN and look for a HASH_JOIN operator with READ_PST_* scan children.
┌─────────────┴─────────────┐
│ HASH_JOIN │
│ ──────────────────── │
│ Join Type: SEMI │
│ │
│ Conditions: ├──────────────┐
│ __node_id = __node_id │ │
│ __partition = __partition │ │
│ │ │
│ ~0 rows │ │
└─────────────┬─────────────┘ │
This recursively crawls all children of folder with NID 32802.
with recursive dirtree as (
select display_name, node_id, parent_node_id
from read_pst_folders('test/unittest.pst')
where node_id = 32802
union
select f.display_name, f.node_id, f.parent_node_id
from read_pst_folders('test/unittest.pst') f
inner join dirtree d on d.node_id = f.parent_node_id
)
select * from dirtree;
┌──────────────────────────────┬─────────┬────────────────┐
│ display_name │ node_id │ parent_node_id │
│ varchar │ uint32 │ uint32 │
├──────────────────────────────┼─────────┼────────────────┤
│ Top of Outlook data file │ 32802 │ 290 │
│ Deleted Items │ 32866 │ 32802 │
│ Calendar │ 32994 │ 32802 │
│ Sent Items │ 32962 │ 32802 │
│ Outbox │ 32930 │ 32802 │
│ Inbox │ 32898 │ 32802 │
│ Quick Step Settings │ 33250 │ 32802 │
│ Conversation Action Settings │ 33218 │ 32802 │
│ RSS Feeds │ 33186 │ 32802 │
│ Drafts │ 33154 │ 32802 │
│ Tasks │ 33122 │ 32802 │
│ Notes │ 33090 │ 32802 │
│ Journal │ 33058 │ 32802 │
│ Contacts │ 33026 │ 32802 │
├──────────────────────────────┴─────────┴────────────────┤
│ 14 rows 3 columns │
└─────────────────────────────────────────────────────────┘This traverses the directory tree upward, to the root of the PST file.
with recursive parent_tree as (
select display_name, node_id, parent_node_id
from read_pst_folders('test/unittest.pst')
where node_id = 33058
union
select f.display_name, f.node_id, f.parent_node_id
from read_pst_folders('test/unittest.pst') f
inner join parent_tree d on d.parent_node_id = f.node_id
)
select * from parent_tree;
┌──────────────────────────┬─────────┬────────────────┐
│ display_name │ node_id │ parent_node_id │
│ varchar │ uint32 │ uint32 │
├──────────────────────────┼─────────┼────────────────┤
│ Journal │ 33058 │ 32802 │
│ Top of Outlook data file │ 32802 │ 290 │
│ │ 290 │ 290 │
└──────────────────────────┴─────────┴────────────────┘git submodule update --init --recursive
# GEN=ninja make release
# Build with UI extension
GEN=ninja make debug
./build/debug/duckdb -uicmake -G "Ninja" -S duckdb -B build ^
-DFORCE_COLORED_OUTPUT=1 ^
-DEXTENSION_STATIC_BUILD=1 ^
-DDUCKDB_EXTENSION_CONFIGS='C:/path/to/duckdb-pst/extension_config.cmake' ^
-DCMAKE_CXX_STANDARD=17 ^
-DVCPKG_BUILD=1 ^
-DCMAKE_TOOLCHAIN_FILE='C:/path/to/vcpkg/scripts/buildsystems/vcpkg.cmake' ^
-DVCPKG_MANIFEST_DIR='C:/path/to/duckdb-pst/' ^
-DCMAKE_BUILD_TYPE=Release
cd build
NinjaBuilt with love, by Intellekt.
This extension is built on top of the microsoft-pst-sdk by Terry Mahaffey, Microsoft's official C++ SDK for reading PST files.