Update BitBlt support (primarily for 64-bit ARM) #565

Open
wants to merge 17 commits into
base: Cog

Commits on May 4, 2021

  1. 9ebc245
  2. Correct various "#if ENABLE_FAST_BLT" to "#ifdef"

    ENABLE_FAST_BLT is typically not assigned a value even when it is defined,
    so the "#if" form is technically a syntax error.
    bavison committed May 4, 2021
    38c4283
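
A minimal sketch of the difference, assuming the macro is defined with an empty value as the commit message describes. The variable and the helper `fast_blt_enabled` are hypothetical and not part of the plugin.

```c
/* Defined with no value, as the commit message describes. */
#define ENABLE_FAST_BLT

/*
 * "#if ENABLE_FAST_BLT" would expand to a bare "#if" here, leaving the
 * directive with no expression to evaluate, which compilers reject.
 * "#ifdef" only asks whether the name is defined at all.
 */
#ifdef ENABLE_FAST_BLT
static const int fastBltEnabled = 1;
#else
static const int fastBltEnabled = 0;
#endif

/* Hypothetical helper, used only to make the result observable. */
int fast_blt_enabled(void)
{
    return fastBltEnabled;
}
```

One consequence of the "#ifdef" form is that `#define ENABLE_FAST_BLT 0` would also enable the fast paths, since only the existence of the name is tested.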
  3. Don't assume sourcePPW is valid on entry to copyBitsFallback

    This is not the case when being called from "fuzz" or "bench" test
    applications. It may also not be accurate if a fast path has been
    synthesised from a combination of copyBitsFallback and one or more other
    fast paths.
    bavison committed May 4, 2021
    43ce975
  4. Fallback routines need extra help to detect intra-image operations

    In some places, sourceForm and destForm were being compared to determine
    which code path to follow. However, when called from fuzz or other test
    tools, these structures aren't used to pass parameters, so the pointers are
    never initialised and default to 0, and the wrong code path is followed.
    Detect such cases and initialise them from sourceBits and destBits instead,
    since these will perform the same under equality tests.
    bavison committed May 4, 2021
    df667ca
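
The fix-up described above can be sketched as follows. The `blit_params` structure and both function names are hypothetical stand-ins for illustration; the plugin's real parameter block is laid out differently.

```c
#include <stddef.h>

/* Hypothetical parameter block; only the four relevant fields are shown. */
typedef struct {
    const void *sourceForm;  /* left as NULL by fuzz/bench callers */
    const void *destForm;    /* left as NULL by fuzz/bench callers */
    const void *sourceBits;  /* start of the source pixel data */
    const void *destBits;    /* start of the destination pixel data */
} blit_params;

/* If neither form was supplied, substitute the pixel-data pointers. */
void blit_fix_up_forms(blit_params *p)
{
    if (p->sourceForm == NULL && p->destForm == NULL) {
        p->sourceForm = p->sourceBits;
        p->destForm   = p->destBits;
    }
}

/* The equality test that selects the intra-image code path. */
int blit_is_intra_image(const blit_params *p)
{
    return p->sourceForm == p->destForm;
}
```

Without the fix-up, two unset forms compare equal (both 0), so an operation between two distinct buffers would be treated as intra-image.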
  5. Remove invalid shortcut in rgbComponentAlphawith

    This shortcut is triggered more frequently than it used to be, due to
    improvements in copyLoop() that avoid buffer overruns.
    bavison committed May 4, 2021
    51df83e
  6. Fix bug in 32-bit ARM fast paths

    When classed as "wide" because each line is long enough to warrant pipelined
    prefetching as we go along, the inner loop is unrolled enough that there is
    at least one prefetch instruction per iteration. Loading the source image
    can only be done in atoms of 32 bit words due to big-endian packing, so
    when destination pixels are 8 times wider (or more) than source pixels, the
    loads happen less frequently than the store atoms (quadwords) and a
    conditional branch per subblock is required to decide whether to do a load
    or not, depending on the skew and the number of pixels remaining to process.
    The 'x' register is only updated once per loop, so an assembly-time constant
    derived from the unrolling subblock number needs to be factored in, but
    since the number of pixels remaining decreases as the subblock number
    increases, this should have been a subtraction.
    
    In practice, since only the least-significant bits of the result matter,
    addition and subtraction behave the same when the source:destination pixel
    ratio is 8, so the only operations affected were 1->16bpp, 2->32bpp and
    1->32bpp. The exact threshold that counts as "wide" depends on the prefetch
    distance that was selected empirically, but typically would require an
    operation that is several hundreds of pixels wide.
    bavison committed May 4, 2021
    a64c5b6
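
The remark that addition and subtraction agree when only the low bits matter is a general property of modular arithmetic: x + c and x - c share their low k bits exactly when 2c is a multiple of 2^k. A small sketch of that check follows; the function name and the sample constants in the note below are illustrative assumptions, not values taken from the assembly.

```c
/*
 * Returns non-zero when (x + c) and (x - c) agree in their low `bits` bits.
 * Unsigned arithmetic is used so that wrap-around is well defined.
 * Assumes bits is smaller than the width of unsigned int.
 */
int add_sub_agree(unsigned int x, unsigned int c, unsigned int bits)
{
    unsigned int mask = (1u << bits) - 1u;
    return ((x + c) & mask) == ((x - c) & mask);
}
```

For example, with c = 16 the two forms agree in the low 5 bits for every x, but differ once 6 bits are significant, which is the kind of situation where a sign error stays hidden for one ratio and surfaces for others.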
  7. Fix buffer overflow bugs

    fastPathDepthConv (which combines sourceWord colour-depth conversion with
    another fast path for another combinationRule at a constant colour depth)
    and fastPathRightToLeft could overflow the temporary buffer and thereby
    corrupt other local variables if the last chunk of a pixel row was 2048
    bytes (or just under). This was most likely to happen with 32bpp destination
    images and widths of about 512 pixels.
    bavison committed May 4, 2021
    b577ab2
  8. Fix corruption bugs with wide 1bpp source images

    For images that were wide enough to invoke intra-line preloads, there was a
    register clash between the preload address calculation and one of the
    registers holding the deskewed source pixels (this only occurred once per
    destination cacheline).
    bavison committed May 4, 2021
    9084c17
  9. Fix type of halftone array for 64-bit targets

    The halftone array is accessed using a hard-coded multiplier of 4 bytes,
    therefore the type of each element needs to be 32 bit on every platform.
    `sqInt` is not appropriate for this use, since it is a 64-bit type on
    64-bit platforms. Rather than unilaterally introduce C99 stdint types,
    use `unsigned int` since this will be 32-bit on both current fast path
    binary targets.
    bavison committed May 4, 2021
    2b0279a
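
To illustrate why the element type matters, here is a sketch of an accessor that uses a fixed 4-byte stride, together with a C89-style compile-time check that the chosen type really is 32 bits wide. The names `halftone_word_t` and `halftone_at` are hypothetical; the plugin's own accessor is not shown in this commit message.

```c
#include <limits.h>
#include <string.h>

/* Hypothetical element type: 32 bits on the targets discussed above. */
typedef unsigned int halftone_word_t;

/* Compile-time check without C99/C11 features: negative array size on failure. */
typedef char halftone_word_is_32_bit[
    (sizeof(halftone_word_t) * CHAR_BIT == 32) ? 1 : -1];

/* Reads element `index` using a hard-coded 4-byte multiplier. */
unsigned int halftone_at(const void *base, unsigned int index)
{
    halftone_word_t w;
    memcpy(&w, (const unsigned char *) base + 4u * index, sizeof w);
    return w;
}
```

With a 64-bit element type, the same 4-byte stride would land in the middle of an element for every odd index, which is the mismatch this commit avoids.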
  10. Detect and add a new fast path flag for effective-1bpp colour maps

    Sometimes, colour maps are used such that all entries except the first
    contain the same value. Combined with the fact that only source colour 0
    uses colour map entry 0 (any other colours for which all non-0 bits would
    otherwise be discarded during index generation are forced to use entry 1
    instead), this effectively acts as a 2-entry (or 1bpp) map, depending on
    whether the source colour is 0 or not. This is far more efficiently coded
    in any fast path by a test against zero than by a table lookup: it frees
    up 2 KB, 16 KB or 128 KB of data cache space, depending on whether a 9-,
    12- or 15-bit colour map was used. There is an up-front cost to scanning
    the colour map to see if its entries are of this nature, but in most
    "normal" colour maps this scan will rapidly be aborted.
    bavison committed May 4, 2021
    e22ae0b
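
The detection scan and the resulting test against zero might look like the sketch below. It assumes 32-bit colour map entries, which matches the quoted sizes (512, 4096 or 32768 entries of 4 bytes give 2 KB, 16 KB or 128 KB). Both function names are hypothetical.

```c
#include <stddef.h>

/*
 * Returns non-zero when every entry after the first holds the same value,
 * i.e. the map behaves like a 2-entry map. The scan stops at the first
 * entry that differs, so ordinary maps are rejected quickly.
 */
int cmap_is_effective_1bpp(const unsigned int *cmap, size_t entries)
{
    size_t i;
    if (entries < 2)
        return 0;
    for (i = 2; i < entries; i++) {
        if (cmap[i] != cmap[1])
            return 0;
    }
    return 1;
}

/* Replacement for the table lookup once the map is known to qualify. */
unsigned int cmap_map_effective_1bpp(unsigned int sourcePixel,
                                     const unsigned int *cmap)
{
    return sourcePixel == 0 ? cmap[0] : cmap[1];
}
```

Only the first two entries are ever read after detection, so the rest of the table no longer competes for data cache space.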
  11. C fast path for 32bpp alphaBlend

    This runs approximately 2.6x faster when benchmarked on Cortex-A72 in AArch64.
    bavison committed May 4, 2021
    e44e2c8
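
For readers unfamiliar with the operation, a generic per-pixel sketch of 32bpp alpha blending in C is given below. It is not the code from this commit: the rounding, the treatment of the source alpha channel as fully opaque when computing the result alpha, and all function names are assumptions made for illustration, and it assumes `unsigned int` is 32 bits wide.

```c
/* Rounded division by 255, exact for 0 <= x <= 255 * 255. */
static unsigned int div255(unsigned int x)
{
    x += 128u;
    return (x + (x >> 8)) >> 8;
}

/* Blends one 8-bit channel: s weighted by alpha, d by (255 - alpha). */
static unsigned int blend_channel(unsigned int s, unsigned int d,
                                  unsigned int alpha)
{
    return div255(s * alpha + d * (255u - alpha));
}

/* Blends one ARGB source pixel over one ARGB destination pixel. */
unsigned int alpha_blend_32(unsigned int src, unsigned int dst)
{
    unsigned int alpha = (src >> 24) & 0xFFu;
    /* Assumption: the source alpha channel itself is blended as 255. */
    unsigned int a = blend_channel(255u, (dst >> 24) & 0xFFu, alpha);
    unsigned int r = blend_channel((src >> 16) & 0xFFu, (dst >> 16) & 0xFFu, alpha);
    unsigned int g = blend_channel((src >> 8) & 0xFFu, (dst >> 8) & 0xFFu, alpha);
    unsigned int b = blend_channel(src & 0xFFu, dst & 0xFFu, alpha);
    return (a << 24) | (r << 16) | (g << 8) | b;
}
```

A fast path would typically process several channels per multiply and special-case alpha values of 0 and 255; this sketch keeps one channel per step for clarity.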
  12. 45649aa
  13. 405f35b
  14. dac723f
  15. Apply scalar halftoning to colour map entries instead for 32bpp destination
    
    This makes better use of existing fast paths, and applies to all platforms.
    bavison committed May 4, 2021
    80cd2da
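
A sketch of the idea, under the assumption that a scalar halftone is a single 32-bit word that is ANDed with each pixel: at 32bpp one destination pixel fills one word, so masking every colour map entry once gives the same result as masking every looked-up pixel. The function name is hypothetical.

```c
#include <stddef.h>

/*
 * Builds a copy of the colour map with the halftone word already applied,
 * so a later lookup needs no separate halftoning step.
 * Assumption: halftoning is a bitwise AND with a single constant word.
 */
void cmap_apply_scalar_halftone(const unsigned int *cmap,
                                unsigned int *halftonedMap,
                                size_t entries,
                                unsigned int halftoneWord)
{
    size_t i;
    for (i = 0; i < entries; i++)
        halftonedMap[i] = cmap[i] & halftoneWord;
}
```

Once the map is pre-masked, the operation looks like a plain colour-mapped copy, which is presumably what allows the existing fast paths to be reused.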
  16. e4a27ec
  17. 10d8a11