Description
Which component has the problem?
CUTLASS C++
Bug Report
Summary
While reviewing the RegularTileIterator specialization for layout::PitchLinear, I encountered a potential inconsistency regarding how pointer offsets are interpreted (Element units vs. byte units), along with an apparent asymmetry between the load() and store() tile offset calculations when kElementsPerAccess > 1.
I would like to clarify if this behavior is intended for specific SIMT configurations or if it represents a latent issue for vectorized paths.
1. Observation: Unit contract in add_pointer_offset()
In the CUTLASS TileIterator concept, add_pointer_offset(LongIndex offset) is generally documented to operate in units of Elements. However, the PitchLinear specialization appears to apply the offset directly to a uint8_t*:
// regular_tile_iterator_pitch_linear.h
void add_pointer_offset(LongIndex pointer_offset) {
pointer_ += pointer_offset; // pointer_ is uint8_t*
}
This effectively treats the input as bytes. Furthermore, add_tile_offset() pre-calculates a byte-sized offset before passing it to this function:
int offset = sizeof_bits<Element>::value * (...) / 8;
add_pointer_offset(offset);
This suggests that while the internal usage is consistent with byte semantics, it may deviate from the broader Element-based contract expected by generic abstractions.
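To make the unit question concrete, here is a minimal standalone sketch of the two interpretations. The helper names `distance_if_bytes` and `distance_if_elements` are hypothetical and are not CUTLASS APIs. The sketch also uses `sizeof(Element)` rather than `sizeof_bits`, so it assumes whole-byte element types.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical helpers (not part of CUTLASS). They model how far a uint8_t*
// moves for a given offset argument under each unit convention.

// Byte semantics: the offset is added to the byte pointer unchanged,
// as in `pointer_ += pointer_offset` on a uint8_t*.
template <typename Element>
constexpr std::ptrdiff_t distance_if_bytes(std::ptrdiff_t offset) {
  return offset;
}

// Element semantics: the offset is scaled by the element size first.
// Assumption: Element occupies a whole number of bytes.
template <typename Element>
constexpr std::ptrdiff_t distance_if_elements(std::ptrdiff_t offset) {
  return offset * static_cast<std::ptrdiff_t>(sizeof(Element));
}

static_assert(sizeof(float) == 4, "sketch assumes a 4-byte float");

// A generic caller that means "16 floats" moves 16 bytes under byte
// semantics but 64 bytes under element semantics.
static_assert(distance_if_bytes<float>(16) == 16, "byte interpretation");
static_assert(distance_if_elements<float>(16) == 64, "element interpretation");
```

The two results agree only when the element is one byte wide, which is why the distinction matters to any generic code that assumes the Element-based contract.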
2. Potential Asymmetry in Contiguous Offset Handling
There appears to be a divergence in how load() and store() handle their contiguous coordinates.
In the load() path (Line 142):
The contiguous coordinate is divided by ThreadMap::kElementsPerAccess:
tile_offset.contiguous() * Shape::kContiguous / ThreadMap::kElementsPerAccess
In the store() path (Line 175):
No such division is applied:
tile_offset.contiguous() * Shape::kContiguous
Since both results are eventually passed to load_with_pointer_offset (or store_with_...), which applies a sizeof(Element) multiplier, this leads to different base addresses for the same logical tile offset when kElementsPerAccess > 1.
3. Example illustrating the mismatch
Assume a configuration where:
- tile_offset.contiguous() * Shape::kContiguous = 64 elements
- Element = float (4 bytes)
- kElementsPerAccess = 4
Resulting address offsets:
- store() path: 64 × 4 = 256 bytes
- load() path: (64 / 4) × 4 = 64 bytes
This suggests that for vectorized configurations, load() and store() may reference different memory locations for the same tile offset.
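The arithmetic behind this example can be checked with a small standalone sketch. It models only the contiguous term of the two expressions quoted above and ignores the strided term. The constants and function names are illustrative assumptions, not code from the header, and the final multiplication by the element size follows the description in this report.

```cpp
#include <cassert>
#include <cstdint>

// Illustrative constants taken from the example configuration above.
constexpr std::int64_t kContiguousElements = 64;  // tile_offset.contiguous() * Shape::kContiguous
constexpr std::int64_t kElementBytes = 4;         // Element = float
constexpr std::int64_t kElementsPerAccess = 4;    // ThreadMap::kElementsPerAccess

// store() path as described: no division by kElementsPerAccess,
// then scaled by the element size.
constexpr std::int64_t store_path_bytes() {
  return kContiguousElements * kElementBytes;
}

// load() path as described: divided by kElementsPerAccess first,
// then scaled by the element size.
constexpr std::int64_t load_path_bytes() {
  return (kContiguousElements / kElementsPerAccess) * kElementBytes;
}

static_assert(store_path_bytes() == 256, "64 elements * 4 bytes");
static_assert(load_path_bytes() == 64, "(64 / 4) elements * 4 bytes");
static_assert(store_path_bytes() != load_path_bytes(), "paths diverge");
```

Setting `kElementsPerAccess` to 1 makes both functions return 256, which matches the observation that the discrepancy would be invisible in non-vectorized configurations.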
4. Context and Possible Directions
This behavior might be masked in many SIMT kernels where ThreadMap::kElementsPerAccess defaults to 1, making the division a no-op. However, it could lead to unexpected results in vectorized paths.
Possible directions for resolution:
- Align load() and store() to use a consistent unit convention for contiguous offsets.
- Clarify the intended unit contract for add_pointer_offset() (Elements vs. Bytes).
- If byte semantics are intended, consider renaming the internal API to add_byte_offset() to prevent ambiguity in generic contexts.
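As a rough illustration of the renaming direction, the sketch below separates the two unit conventions into distinct member functions. `ElementCursor` is a hypothetical type written for this report, not a proposed patch to RegularTileIterator, and it assumes whole-byte element types.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Hypothetical cursor (not a CUTLASS type) that stores a byte pointer but
// exposes one explicitly named entry point per unit convention.
template <typename Element>
struct ElementCursor {
  std::uint8_t* pointer_;

  // Byte semantics, named so that callers cannot mistake the unit.
  void add_byte_offset(std::ptrdiff_t bytes) { pointer_ += bytes; }

  // Element semantics: scales by the element size, then delegates.
  // Assumption: Element occupies a whole number of bytes.
  void add_pointer_offset(std::ptrdiff_t elements) {
    add_byte_offset(elements * static_cast<std::ptrdiff_t>(sizeof(Element)));
  }
};

static_assert(sizeof(float) == 4, "sketch assumes a 4-byte float");
```

With this split, internal code that already computes byte offsets would call `add_byte_offset()`, while generic callers keep the Element-based meaning of `add_pointer_offset()`. For example, advancing an `ElementCursor<float>` by `add_pointer_offset(2)` moves the pointer 8 bytes, and a following `add_byte_offset(4)` moves it 4 more.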