    gpu: always allocate page-sized chunks, then use LinearAllocator
    Daniel Krebs authored
    This was necessary in order to make the memory available via GDRcopy
    when multiple small allocations were made. cudaMalloc() would return
    multiple memory chunks located in the same GPU page, which GDRcopy
    does not support (`gdrdrv:offset != 0 is not supported`).
    As a side effect, this will keep the number of BAR mappings done
    via GDRcopy low, which helps because the number of available
    mappings seems to be quite limited.
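
    The idea can be sketched roughly as follows. This is a host-only
    illustration, not the code from this commit: the names `LinearChunk`,
    `roundUpToPage` and `kGpuPageSize` are hypothetical, and the 64 KiB
    GPU page size is an assumption. The sketch only tracks offsets inside
    one page-sized chunk; the real chunk would come from cudaMalloc().

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Assumption: GPU pages as seen by GDRcopy are 64 KiB.
constexpr std::size_t kGpuPageSize = 64 * 1024;

// Round a requested size up to the next multiple of the GPU page size.
constexpr std::size_t roundUpToPage(std::size_t size)
{
    return (size + kGpuPageSize - 1) / kGpuPageSize * kGpuPageSize;
}

// Hypothetical linear (bump) allocator that hands out offsets into a
// single page-sized chunk. It never frees individual blocks.
class LinearChunk {
public:
    static constexpr std::size_t kNoSpace = SIZE_MAX;

    // The chunk capacity is always a whole number of GPU pages.
    explicit LinearChunk(std::size_t requested)
        : capacity_(roundUpToPage(requested)), next_(0) {}

    // Returns the offset of the new block inside the chunk, or kNoSpace
    // if the chunk is exhausted. alignment must be a power of two.
    std::size_t allocate(std::size_t size, std::size_t alignment = 64)
    {
        assert(alignment != 0 && (alignment & (alignment - 1)) == 0);
        const std::size_t start = (next_ + alignment - 1) & ~(alignment - 1);
        if (size > capacity_ || start > capacity_ - size)
            return kNoSpace;
        next_ = start + size;
        return start;
    }

    std::size_t capacity() const { return capacity_; }
    std::size_t used() const { return next_; }

private:
    std::size_t capacity_;
    std::size_t next_;
};
```

    With this scheme, several small allocations share one chunk, so only
    the chunk itself (starting at offset 0 of its pages) would need to be
    mapped via GDRcopy, and each small block is addressed as chunk base
    plus the returned offset.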