Common OFI Mistakes to Avoid

Erik Paulson edited this page Oct 11, 2017 · 3 revisions

Purpose

This document highlights commonly misunderstood aspects of the OFI documentation to help application developers avoid potential problems. It is not a substitute for a thorough reading of the man pages.

Areas of Interest

Completion Queues

  • All endpoints that issue asynchronous operations must be bound to a relevant CQ, even if they don't report completions.
    • This is for error reporting purposes. Consider an endpoint that is configured with FI_SELECTIVE_COMPLETION and bound to a counter, where the application only reads the counter. If the counter read reports an error, the application can then read a more detailed error entry from the completion queue.
    • The endpoint needs only to be bound to a CQ for the operation types it will initiate. For example, if an endpoint will only issue receive operations, it only needs to be bound to a CQ using the FI_RECV flag.
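
The binding rule above can be expressed as a small pre-flight check that an application might run before enabling an endpoint. This is only a sketch of the bookkeeping, not libfabric code: OP_TRANSMIT and OP_RECV are hypothetical stand-ins for the FI_TRANSMIT and FI_RECV bind flags, and struct ep_plan and missing_cq_bindings() are names invented for illustration. In a real application the bindings themselves are made with fi_ep_bind().

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-ins for the libfabric bind flags FI_TRANSMIT and
 * FI_RECV, so the sketch builds without the libfabric headers. */
#define OP_TRANSMIT (1u << 0)
#define OP_RECV     (1u << 1)

/* Hypothetical per-endpoint bookkeeping kept by the application. */
struct ep_plan {
    uint32_t initiates; /* operation types the endpoint will issue */
    uint32_t cq_bound;  /* operation types that already have a CQ bound */
};

/* Returns the operation types that will be initiated but have no CQ
 * bound yet. A result of 0 means the rule above is satisfied. */
static uint32_t missing_cq_bindings(const struct ep_plan *plan)
{
    assert(plan != NULL);
    return plan->initiates & ~plan->cq_bound;
}
```

Under these assumptions, a receive-only endpoint with a CQ bound for receives yields 0, while an endpoint that also transmits but has only a receive CQ yields OP_TRANSMIT, signalling that a transmit CQ binding is still needed even if transmit completions are counted rather than reported.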

FI_CONTEXT

  • When the FI_CONTEXT (or FI_CONTEXT2) mode bit is specified, the application must pass in a valid 'struct fi_context' (or 'struct fi_context2') when initiating an operation such as a send, receive, or RMA write. The memory pointed to by the structure must remain valid for the entire duration of the operation, that is, until the application reads the corresponding completion from the completion queue. Additionally, no other operation may reuse the struct fi_context while the original operation is outstanding. If the application reuses the struct fi_context, or lets its memory become invalid, before that completion has been read, the behavior is undefined. In some cases the error appears as stack corruption, which can be hard to debug.
    • The exception to this rule: when FI_SELECTIVE_COMPLETION is enabled to suppress completion entries and an operation is initiated without the FI_COMPLETION flag set, the context parameter is ignored.
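
One way to respect the lifetime rule above is to give every outstanding operation its own context from storage that outlives the call that starts the operation, and to hand a context back only after its completion has been read. The following is a minimal sketch of that idea, not libfabric code: struct op_context is a hypothetical stand-in for struct fi_context, and struct request, request_acquire() and request_release() are names invented for illustration.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in playing the role of struct fi_context: opaque
 * scratch space that the provider owns while the operation is
 * outstanding. Defined here so the sketch builds without libfabric. */
struct op_context {
    void *internal[4];
};

/* Hypothetical request wrapper: one context per outstanding operation. */
struct request {
    struct op_context ctx; /* first member, so &req->ctx also points at req */
    int in_flight;         /* 1 from submission until the completion is read */
};

#define POOL_SIZE 4

/* Static storage, so a context never lives in a stack frame that may
 * return while its operation is still outstanding. */
static struct request pool[POOL_SIZE];

/* Hand out a request whose context is not owned by any outstanding
 * operation, or NULL if every context is still in flight. */
static struct request *request_acquire(void)
{
    for (size_t i = 0; i < POOL_SIZE; i++) {
        if (!pool[i].in_flight) {
            pool[i].in_flight = 1;
            return &pool[i];
        }
    }
    return NULL;
}

/* Call only after the completion carrying this context pointer has been
 * read from the completion queue. */
static void request_release(void *op_context)
{
    struct request *req = (struct request *)op_context;
    assert(req->in_flight); /* catches a double release */
    req->in_flight = 0;
}
```

In this sketch the application would pass &req->ctx as the context argument when initiating an operation and call request_release() with the context pointer reported by the completion entry. When request_acquire() returns NULL, the application should read more completions before starting another operation, rather than reusing a context that is still in flight.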