Parallel shellsort algorithm for many-core GPUs with CUDA

Chun Yuan Lin*, Wei Sheng Lee, Chuan Yi Tang

*Corresponding author for this work

    Research output: Contribution to journalJournal Article peer-review

    Abstract

    Sorting is a classic algorithmic problem and its importance has led to the design and implementation of various sorting algorithms on many-core graphics processing units (GPUs). CUDPP Radix sort is the most efficient sorting on GPUs and GPU Sample sort is the best comparison-based sorting. Although the implementations of these algorithms are efficient, they either need an extra space for the data rearrangement or the atomic operation for the acceleration. Sorting applications usually deal with a large amount of data, thus the memory utilization is an important consideration. Furthermore, these sorting algorithms on GPUs without the atomic operation support can result in the performance degradation or fail to work. In this paper, an efficient implementation of a parallel shellsort algorithm, CUDA she llsort, is proposed for many-core GPUs with CUDA. Experimental results show that, on average, the performance of CUDA shellsort is nearly twice faster than GPU quicksort and 37% faster than Thrust merge sort under uniform distribution. Moreover, its performance is the same as GPU sample sort up to 32 million data elements, but only needs a constant space usage. CUDA shellsortis also robustovervarious data distributions and could be suitable for ot hermany-core architectures.

    Original languageEnglish
    Pages (from-to)1-16
    Number of pages16
    JournalInternational Journal of Grid and High Performance Computing
    Volume4
    Issue number2
    DOIs
    StatePublished - 04 2012

    Keywords

    • Atomic operation
    • CUDA
    • Graphics processing units (GPUs)
    • In-place algorithm
    • Shellsort

    Fingerprint

    Dive into the research topics of 'Parallel shellsort algorithm for many-core GPUs with CUDA'. Together they form a unique fingerprint.

    Cite this