Discussion: SIMD strategy for pandas C/C++ code #64884

@jbrockmendel

Everything below this line is Claude, opened upon request in #64515.

Summary

SIMD intrinsics have come up in multiple PRs now (#64515, and @Alvaro-Kothe's work). Before merging any of these, we should align on an approach. This issue collects the tradeoffs discussed so far.

Current state

  • #64515 (PERF: Use SIMD for read_csv C tokenizer) uses hand-written SSE2 (x86-64) and NEON (arm64) intrinsics in the C tokenizer to scan for special characters 16 bytes at a time
  • These are baseline instruction sets: GCC and Clang predefine __SSE2__ on x86-64 and __ARM_NEON on arm64, so no special flags are needed. MSVC does not define __SSE2__ for x64 targets, so a guard that must work there also has to test _M_X64
  • CI exercises both paths: x86-64 (Linux, macOS Intel, Windows) and arm64 (Linux, macOS Apple Silicon, Windows ARM64)
  • cpp_std=c++17 is now in the build (from fast_float), so C++ is available
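
For readers who have not opened #64515, the "scan 16 bytes at a time" pattern can be sketched roughly as follows. This is not the code from the PR: the name sketch_find_either, its signature, and the choice of exactly two target bytes are assumptions made for illustration. To stay short it only uses SIMD to detect that a 16-byte block contains a match and lets a scalar loop pinpoint the byte; a real implementation would more likely extract a bitmask and count trailing zeros.

```c
#include <stddef.h>
#include <stdint.h>

/* Compile-time selection, as described above. _M_X64 is checked because
   MSVC does not define __SSE2__ on x64. The NEON path is limited to
   AArch64 because vmaxvq_u8 is an A64-only intrinsic. */
#if defined(__SSE2__) || defined(_M_X64)
#include <emmintrin.h>
#define SKETCH_HAVE_SSE2 1
#elif defined(__aarch64__)
#include <arm_neon.h>
#define SKETCH_HAVE_NEON 1
#endif

/* Hypothetical helper (not from the PR): index of the first byte in
   buf[0..len) equal to a or b, or len if there is none. */
static size_t sketch_find_either(const char *buf, size_t len, char a, char b) {
    size_t i = 0;
#if defined(SKETCH_HAVE_SSE2)
    const __m128i va = _mm_set1_epi8(a);
    const __m128i vb = _mm_set1_epi8(b);
    for (; i + 16 <= len; i += 16) {
        __m128i chunk = _mm_loadu_si128((const __m128i *)(buf + i));
        __m128i hit = _mm_or_si128(_mm_cmpeq_epi8(chunk, va),
                                   _mm_cmpeq_epi8(chunk, vb));
        /* One bit per byte; nonzero means this block contains a match. */
        if (_mm_movemask_epi8(hit) != 0) break;
    }
#elif defined(SKETCH_HAVE_NEON)
    const uint8x16_t va = vdupq_n_u8((uint8_t)a);
    const uint8x16_t vb = vdupq_n_u8((uint8_t)b);
    for (; i + 16 <= len; i += 16) {
        uint8x16_t chunk = vld1q_u8((const uint8_t *)(buf + i));
        uint8x16_t hit = vorrq_u8(vceqq_u8(chunk, va), vceqq_u8(chunk, vb));
        /* Matching lanes are 0xFF; a nonzero maximum means a match. */
        if (vmaxvq_u8(hit) != 0) break;
    }
#endif
    /* Scalar loop: handles the tail and locates the exact byte inside a
       block flagged above. It is also the whole implementation when
       neither SIMD path is compiled in. */
    for (; i < len; i++) {
        if (buf[i] == a || buf[i] == b) return i;
    }
    return len;
}
```

A caller would use it along the lines of `sketch_find_either(line, n, ',', '\n')` and treat a return value of `n` as "no special character in this buffer".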

Options

1. Hand-written intrinsics (status quo in #64515)

  • Pros: no new dependencies, minimal code (~100 lines for 4 functions), compile-time selection via #ifdef
  • Cons: must handle compiler portability ourselves (e.g. __builtin_ctz vs MSVC _BitScanForward), duplicated logic per architecture
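
To make the portability cost concrete, here is a minimal sketch of the kind of wrapper the __builtin_ctz vs _BitScanForward difference calls for. The name sketch_ctz32 is hypothetical and this is not necessarily how #64515 handles it.

```c
#include <stdint.h>

#if defined(_MSC_VER) && !defined(__clang__)
#include <intrin.h>
#endif

/* Hypothetical wrapper: number of trailing zero bits in mask.
   Precondition: mask != 0 (both underlying primitives leave the
   result undefined for zero). */
static inline int sketch_ctz32(uint32_t mask) {
#if defined(_MSC_VER) && !defined(__clang__)
    unsigned long index;
    _BitScanForward(&index, mask); /* writes the lowest set bit position */
    return (int)index;
#else
    return __builtin_ctz(mask);    /* GCC and Clang builtin */
#endif
}
```

With an SSE2 byte mask, where bit k is set when byte k of the block matched, `sketch_ctz32(mask)` gives the offset of the first matching byte within the block.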

2. xsimd (used by Arrow C++)

  • Header-only C++14 library (~4.7 MB), would need vendoring or a build dependency
  • Provides a unified API across SSE2/AVX2/AVX-512/NEON/SVE/etc.
  • Would require extracting SIMD code into .cpp files with extern "C" linkage (since tokenizer.c is C)
  • Tested across many architectures by the xsimd project itself
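
The extern "C" requirement would probably look something like the header sketch below, seen from the C side. The function name sketch_simd_find_delim and its signature are invented for illustration, and the body is a scalar stand-in so the example links on its own; under option 2 that definition would instead live in a .cpp file and be written against xsimd.

```c
#include <stddef.h>

/* Shared header section: C++ translation units see extern "C" and emit
   an unmangled symbol, while C translation units such as tokenizer.c
   see a plain declaration. */
#ifdef __cplusplus
extern "C" {
#endif

/* Hypothetical entry point: index of the first delim in buf[0..len),
   or len if there is none. */
size_t sketch_simd_find_delim(const char *buf, size_t len, char delim);

#ifdef __cplusplus
}
#endif

/* Scalar stand-in for the definition (assumption: the real one would be
   C++ code in a separate .cpp file). */
size_t sketch_simd_find_delim(const char *buf, size_t len, char delim) {
    for (size_t i = 0; i < len; i++) {
        if (buf[i] == delim) return i;
    }
    return len;
}
```

The point of the pattern is that the tokenizer itself stays C and only calls through a narrow function boundary, at the cost of losing inlining across that boundary unless LTO is enabled.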

3. Google Highway (used by NumPy)

  • C++17 library, not header-only (needs ~10 compiled source files, ~31 MB repo)
  • NumPy vendors it as a git submodule
  • Designed for runtime dispatch across many ISA levels — more machinery than we currently need
  • Heavier integration cost

4. Compiler vector extensions / autovectorization

  • GCC/Clang support __attribute__((vector_size(N))) portable vector types
  • Compiler does the architecture mapping, no library needed
  • Less control over generated code; may not handle the "find first matching byte" pattern well
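
For comparison, a rough sketch of the same kind of scan written with GCC/Clang vector extensions. All names here are hypothetical. The extensions offer no portable equivalent of a movemask, so this sketch reduces the comparison result by copying it into two 64-bit integers, which is one illustration of the "less control" concern; how well that compiles on a given target is something that would need to be measured.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Portable 16-byte vector type (GCC/Clang extension, not MSVC). */
typedef uint8_t sketch_v16u8 __attribute__((vector_size(16)));

/* Hypothetical helper: index of the first byte in buf[0..len) equal to
   a or b, or len if there is none. */
static size_t sketch_find_either_vecext(const char *buf, size_t len,
                                        char a, char b) {
    sketch_v16u8 va, vb;
    memset(&va, (unsigned char)a, sizeof va); /* broadcast a to all lanes */
    memset(&vb, (unsigned char)b, sizeof vb); /* broadcast b to all lanes */

    size_t i = 0;
    for (; i + 16 <= len; i += 16) {
        sketch_v16u8 chunk;
        memcpy(&chunk, buf + i, sizeof chunk); /* unaligned load */
        /* Lane-wise comparison: matching lanes are all ones, others zero.
           __typeof__ sidesteps naming the compiler-specific result type. */
        __typeof__(chunk == va) hit = (chunk == va) | (chunk == vb);
        uint64_t halves[2];
        memcpy(halves, &hit, sizeof halves);
        if ((halves[0] | halves[1]) != 0) break; /* some lane matched */
    }
    /* Scalar loop: handles the tail and pinpoints the match in a block. */
    for (; i < len; i++) {
        if (buf[i] == a || buf[i] == b) return i;
    }
    return len;
}
```

Because MSVC does not support these extensions, this route would still need a separate path or a scalar fallback for Windows builds.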

5. Meson SIMD module

  • Designed for compiling separate source files with non-baseline flags (e.g. -mavx2) and runtime dispatch
  • Not applicable to the current use case (SSE2/NEON are baseline, no runtime dispatch needed)
  • Could become relevant if we wanted optional AVX2/AVX-512 paths in the future

Questions to resolve

  1. Is the scope of SIMD usage in pandas likely to grow beyond the tokenizer, or is this a one-off?
  2. If one-off, do hand-written intrinsics suffice? The maintenance burden so far has been one portability fix (__builtin_ctz on MSVC).
  3. If we expect growth, is xsimd the right choice given that C++17 is already in the build?
  4. Should we block #64515 (PERF: Use SIMD for read_csv C tokenizer) on this decision, or merge the hand-written version and migrate later if needed?
