Class SortingCollection<T>

  • All Implemented Interfaces:
    Iterable<T>

    public class SortingCollection<T>
    extends Object
    implements Iterable<T>
    A collection to which many records can be added. After all records have been added, the collection can be iterated, and the records are returned in the order defined by the comparator. Records may be spilled to a temporary directory if more records are added than fit in memory. As a result, the objects returned may not be identical to the objects added to the collection, but they should be equal, as determined by the codec used to write them to disk and read them back.
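
    To illustrate why a returned object can be equal to, but not identical to, the object that was added, here is a minimal sketch of a record making a round trip through a byte stream. The Read class and the roundTrip helper are hypothetical, and plain DataOutputStream/DataInputStream stand in for a codec; this is not the library's codec interface.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;

// Hypothetical record type and helper, for illustration only.
final class RoundTripSketch {

    static final class Read {
        final String name;
        final int position;

        Read(String name, int position) {
            this.name = name;
            this.position = position;
        }

        @Override
        public boolean equals(Object o) {
            if (!(o instanceof Read)) {
                return false;
            }
            Read other = (Read) o;
            return position == other.position && name.equals(other.name);
        }

        @Override
        public int hashCode() {
            return 31 * name.hashCode() + position;
        }
    }

    // Encode the record to bytes and decode a fresh instance, as a spill to disk would.
    static Read roundTrip(Read rec) {
        try {
            ByteArrayOutputStream bytes = new ByteArrayOutputStream();
            try (DataOutputStream out = new DataOutputStream(bytes)) {
                out.writeUTF(rec.name);
                out.writeInt(rec.position);
            }
            try (DataInputStream in =
                     new DataInputStream(new ByteArrayInputStream(bytes.toByteArray()))) {
                // Arguments are evaluated left to right, matching the write order.
                return new Read(in.readUTF(), in.readInt());
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        Read original = new Read("read1", 100);
        Read restored = roundTrip(original);
        System.out.println(original == restored);      // false: a different object
        System.out.println(original.equals(restored)); // true: same field values
    }
}
```

    Code that relies on reference identity (==) of records added to the collection may therefore break once records have been spilled; comparing with equals() avoids that.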

    When iterating over the collection, the number of file handles required is numRecordsInCollection/maxRecordsInRam. If this becomes a limiting factor, a file handle cache could be added.
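
    The ratio above can be turned into a quick estimate. The helper below is a hypothetical sketch, not part of the class; rounding up to account for a final partial batch is an assumption.

```java
// Hypothetical helper, for illustration only.
final class FileHandleEstimate {

    // Estimates the file handles needed during iteration as
    // numRecordsInCollection / maxRecordsInRam, rounded up (assumption:
    // a final partial batch also occupies one temporary file).
    static long estimateFileHandles(long numRecordsInCollection, int maxRecordsInRam) {
        if (numRecordsInCollection < 0) {
            throw new IllegalArgumentException("numRecordsInCollection must be non-negative");
        }
        if (maxRecordsInRam <= 0) {
            throw new IllegalArgumentException("maxRecordsInRam must be positive");
        }
        return (numRecordsInCollection + maxRecordsInRam - 1) / maxRecordsInRam;
    }

    public static void main(String[] args) {
        // 10 million records with 500,000 held in RAM at a time.
        System.out.println(estimateFileHandles(10_000_000L, 500_000)); // 20
    }
}
```

    Comparing this estimate with the operating system's per-process limit on open files gives a rough idea of whether maxRecordsInRam is large enough for a given input.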

    If the Snappy DLL is available and the snappy.disable system property is not set to true, Snappy is used to compress temporary files.
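
    The condition above can be sketched as a simple predicate. The class and method names are hypothetical, and whether the native library is available is passed in as a plain boolean; how the library actually detects Snappy is not shown here.

```java
// Hypothetical helper, for illustration only.
final class SnappyPropertySketch {

    // Returns true when Snappy is available and has not been disabled
    // through the snappy.disable system property.
    static boolean shouldUseSnappy(boolean snappyLibraryAvailable) {
        // Boolean.getBoolean returns true only if the named system property
        // exists and equals "true" (ignoring case).
        boolean disabled = Boolean.getBoolean("snappy.disable");
        return snappyLibraryAvailable && !disabled;
    }

    public static void main(String[] args) {
        System.out.println(shouldUseSnappy(true));
    }
}
```

    Under this reading, starting the JVM with -Dsnappy.disable=true turns compression of temporary files off.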

    • Method Detail

      • add

        public void add​(T rec)
      • doneAdding

        public void doneAdding()
        This method can be called after the caller is done adding to the collection, in order to possibly free up memory. If iterator() is called immediately after the caller is done adding, calling this method is not necessary, because iterator() triggers the same freeing.
      • isDestructiveIteration

        public boolean isDestructiveIteration()
        Returns:
        True if this collection is allowed to discard data during iteration in order to reduce memory footprint, precluding a second iteration over the collection.
      • setDestructiveIteration

        public void setDestructiveIteration​(boolean destructiveIteration)
        Tells this collection whether it is allowed to discard data during iteration in order to reduce memory footprint, precluding a second iteration. The default is true.
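
        One way to picture destructive iteration is an iterator that drops its reference to each record as soon as it has been returned. The sketch below is hypothetical and works on a plain in-memory array; it is not the class's own iterator.

```java
import java.util.Iterator;
import java.util.NoSuchElementException;

// Hypothetical helper, for illustration only.
final class DestructiveIterationSketch {

    // Returns an iterator that nulls out each slot after handing the record
    // to the caller, so the array cannot be iterated a second time.
    static <T> Iterator<T> destructiveIterator(T[] sortedRecords) {
        return new Iterator<T>() {
            private int next = 0;

            @Override
            public boolean hasNext() {
                return next < sortedRecords.length;
            }

            @Override
            public T next() {
                if (!hasNext()) {
                    throw new NoSuchElementException();
                }
                T rec = sortedRecords[next];
                sortedRecords[next] = null; // release the reference so it can be garbage collected
                next++;
                return rec;
            }
        };
    }

    public static void main(String[] args) {
        String[] records = {"a", "b", "c"};
        Iterator<String> it = destructiveIterator(records);
        while (it.hasNext()) {
            System.out.println(it.next());
        }
        System.out.println(records[0]); // null: the data was discarded during iteration
    }
}
```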
      • spillToDisk

        public void spillToDisk()
        Sort the records in memory, write them to a file, and clear the buffer of records in memory.
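
        The three steps above (sort, write, clear) can be sketched with the standard library alone. SpillSketch is a hypothetical class that handles only String records without line breaks and writes them as UTF-8 text lines; the real class uses a codec and is not limited in this way.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

// Hypothetical class, for illustration only.
final class SpillSketch {
    private final List<String> ramRecords = new ArrayList<>();
    private final List<Path> spillFiles = new ArrayList<>();
    private final Comparator<String> comparator;
    private final int maxRecordsInRam;

    SpillSketch(Comparator<String> comparator, int maxRecordsInRam) {
        this.comparator = comparator;
        this.maxRecordsInRam = maxRecordsInRam;
    }

    // Spill first if the buffer is full, then buffer the new record.
    void add(String rec) {
        if (ramRecords.size() == maxRecordsInRam) {
            spillToDisk();
        }
        ramRecords.add(rec);
    }

    // Sort the buffered records, write them to a new temporary file, clear the buffer.
    void spillToDisk() {
        if (ramRecords.isEmpty()) {
            return;
        }
        ramRecords.sort(comparator);
        try {
            Path file = Files.createTempFile("spill-sketch", ".tmp");
            file.toFile().deleteOnExit();
            Files.write(file, ramRecords, StandardCharsets.UTF_8);
            spillFiles.add(file);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        ramRecords.clear();
    }

    int spillFileCount() {
        return spillFiles.size();
    }

    int recordsInRam() {
        return ramRecords.size();
    }

    // Reads one spill file back as a list of records.
    List<String> readSpill(int index) {
        try {
            return Files.readAllLines(spillFiles.get(index), StandardCharsets.UTF_8);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

        Each spill file is sorted on its own, so producing a globally sorted sequence still requires merging the files at iteration time.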
      • iterator

        public CloseableIterator<T> iterator()
        Prepare to iterate through the records in order. This method may be called more than once, but add() may not be called after this method has been called.
        Specified by:
        iterator in interface Iterable<T>
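
        Returning records in order from several sorted spill files amounts to a k-way merge. The sketch below shows one common way to do this with a PriorityQueue over in-memory lists; MergeSketch and its method are hypothetical, and this is not necessarily how the class implements its iterator.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.Iterator;
import java.util.List;
import java.util.PriorityQueue;

// Hypothetical helper, for illustration only.
final class MergeSketch {

    // Pairs the next unread record of a run with the iterator it came from.
    private static final class Head<T> {
        final T value;
        final Iterator<T> source;

        Head(T value, Iterator<T> source) {
            this.value = value;
            this.source = source;
        }
    }

    // Merges runs that are each already sorted by the comparator.
    static <T> List<T> mergeSortedRuns(List<List<T>> sortedRuns, Comparator<T> comparator) {
        PriorityQueue<Head<T>> heap =
            new PriorityQueue<>((a, b) -> comparator.compare(a.value, b.value));
        for (List<T> run : sortedRuns) {
            Iterator<T> it = run.iterator();
            if (it.hasNext()) {
                heap.add(new Head<>(it.next(), it));
            }
        }
        List<T> merged = new ArrayList<>();
        while (!heap.isEmpty()) {
            // Take the smallest head, then refill from the run it came from.
            Head<T> smallest = heap.poll();
            merged.add(smallest.value);
            if (smallest.source.hasNext()) {
                heap.add(new Head<>(smallest.source.next(), smallest.source));
            }
        }
        return merged;
    }

    public static void main(String[] args) {
        List<List<Integer>> runs = Arrays.asList(
            Arrays.asList(1, 4, 9),
            Arrays.asList(2, 3, 10),
            Arrays.asList(5));
        System.out.println(mergeSortedRuns(runs, Comparator.naturalOrder()));
        // prints [1, 2, 3, 4, 5, 9, 10]
    }
}
```

        With file-backed runs, each entry in the heap would hold an open reader, which is why the number of file handles grows with the number of spill files.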
      • cleanup

        public void cleanup()
        Delete any temporary files. After this method is called, iterator() may not be called.
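
        The call-order rules stated for add(), iterator(), and cleanup() can be summarized as a small state check. LifecycleSketch is a hypothetical stand-in that tracks only the state flags; the use of IllegalStateException and the messages are assumptions, not taken from this page.

```java
// Hypothetical class, for illustration only.
final class LifecycleSketch {
    private boolean iterationStarted = false;
    private boolean cleanedUp = false;
    private int recordCount = 0;

    // Adding is allowed only before the first call to iterator().
    void add(Object rec) {
        if (iterationStarted) {
            throw new IllegalStateException("add() called after iterator()");
        }
        recordCount++;
    }

    // May be called more than once, but not after cleanup().
    void iterator() {
        if (cleanedUp) {
            throw new IllegalStateException("iterator() called after cleanup()");
        }
        iterationStarted = true;
    }

    // A real implementation would also delete the temporary files here.
    void cleanup() {
        cleanedUp = true;
    }

    int recordCount() {
        return recordCount;
    }
}
```

        In caller code, a typical order under these rules is: add every record, iterate as needed, then call cleanup() once no further iteration is required.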
      • newInstance

        @Deprecated
        public static <T> SortingCollection<T> newInstance​(Class<T> componentType,
                                                           SortingCollection.Codec<T> codec,
                                                           Comparator<T> comparator,
                                                           int maxRecordsInRAM,
                                                           File... tmpDir)
        Deprecated.
        Syntactic sugar around the constructor, to save some typing of type parameters.
        Parameters:
        componentType - Class of the record to be sorted. Necessary because Java erases generic type information at runtime.
        codec - For writing records to file and reading them back into RAM
        comparator - Defines output sort order
        maxRecordsInRAM - how many records to accumulate in memory before spilling to disk
        tmpDir - Where to write files of records that will not fit in RAM
      • newInstance

        @Deprecated
        public static <T> SortingCollection<T> newInstance​(Class<T> componentType,
                                                           SortingCollection.Codec<T> codec,
                                                           Comparator<T> comparator,
                                                           int maxRecordsInRAM,
                                                           Collection<File> tmpDirs)
        Deprecated.
        Syntactic sugar around the constructor, to save some typing of type parameters.
        Parameters:
        componentType - Class of the record to be sorted. Necessary because Java erases generic type information at runtime.
        codec - For writing records to file and reading them back into RAM
        comparator - Defines output sort order
        maxRecordsInRAM - how many records to accumulate in memory before spilling to disk
        tmpDirs - Where to write files of records that will not fit in RAM
      • newInstance

        public static <T> SortingCollection<T> newInstance​(Class<T> componentType,
                                                           SortingCollection.Codec<T> codec,
                                                           Comparator<T> comparator,
                                                           int maxRecordsInRAM,
                                                           boolean printRecordSizeSampling)
        Syntactic sugar around the constructor, to save some typing of type parameters. Writes files to the directory named by the java.io.tmpdir system property.
        Parameters:
        componentType - Class of the record to be sorted. Necessary because Java erases generic type information at runtime.
        codec - For writing records to file and reading them back into RAM
        comparator - Defines output sort order
        maxRecordsInRAM - how many records to accumulate in memory before spilling to disk
        printRecordSizeSampling - If true, record sizes will be sampled and logged at the DEBUG level
      • newInstance

        public static <T> SortingCollection<T> newInstance​(Class<T> componentType,
                                                           SortingCollection.Codec<T> codec,
                                                           Comparator<T> comparator,
                                                           int maxRecordsInRAM,
                                                           boolean printRecordSizeSampling,
                                                           Path... tmpDir)
        Syntactic sugar around the constructor, to save some typing of type parameters.
        Parameters:
        componentType - Class of the record to be sorted. Necessary because Java erases generic type information at runtime.
        codec - For writing records to file and reading them back into RAM
        comparator - Defines output sort order
        maxRecordsInRAM - how many records to accumulate in memory before spilling to disk
        printRecordSizeSampling - If true, record sizes will be sampled and logged at the DEBUG level
        tmpDir - Where to write files of records that will not fit in RAM
      • newInstance

        public static <T> SortingCollection<T> newInstance​(Class<T> componentType,
                                                           SortingCollection.Codec<T> codec,
                                                           Comparator<T> comparator,
                                                           int maxRecordsInRAM)
        Syntactic sugar around the constructor, to save some typing of type parameters. Writes files to the directory named by the java.io.tmpdir system property.
        Parameters:
        componentType - Class of the record to be sorted. Necessary because Java erases generic type information at runtime.
        codec - For writing records to file and reading them back into RAM
        comparator - Defines output sort order
        maxRecordsInRAM - how many records to accumulate in memory before spilling to disk
      • newInstance

        public static <T> SortingCollection<T> newInstance​(Class<T> componentType,
                                                           SortingCollection.Codec<T> codec,
                                                           Comparator<T> comparator,
                                                           int maxRecordsInRAM,
                                                           Path... tmpDir)
        Syntactic sugar around the constructor, to save some typing of type parameters.
        Parameters:
        componentType - Class of the record to be sorted. Necessary because Java erases generic type information at runtime.
        codec - For writing records to file and reading them back into RAM
        comparator - Defines output sort order
        maxRecordsInRAM - how many records to accumulate in memory before spilling to disk
        tmpDir - Where to write files of records that will not fit in RAM
      • newInstanceFromPaths

        public static <T> SortingCollection<T> newInstanceFromPaths​(Class<T> componentType,
                                                                    SortingCollection.Codec<T> codec,
                                                                    Comparator<T> comparator,
                                                                    int maxRecordsInRAM,
                                                                    Collection<Path> tmpDirs)
        Syntactic sugar around the constructor, to save some typing of type parameters.
        Parameters:
        componentType - Class of the record to be sorted. Necessary because Java erases generic type information at runtime.
        codec - For writing records to file and reading them back into RAM
        comparator - Defines output sort order
        maxRecordsInRAM - how many records to accumulate in memory before spilling to disk
        tmpDirs - Where to write files of records that will not fit in RAM
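
        The componentType parameter shared by all of the factory methods above reflects a general Java limitation: an array of a generic type T cannot be created with new T[n], but it can be created reflectively from a Class<T>. The sketch below shows that technique; ComponentTypeSketch and newBuffer are hypothetical names, and the assumption that the class uses componentType to allocate its in-memory buffer this way is not confirmed by this page.

```java
import java.lang.reflect.Array;

// Hypothetical helper, for illustration only.
final class ComponentTypeSketch {

    // Creates a T[] of the given length. The cast is safe because
    // Array.newInstance returns an array whose runtime component type
    // is exactly componentType.
    @SuppressWarnings("unchecked")
    static <T> T[] newBuffer(Class<T> componentType, int maxRecordsInRam) {
        return (T[]) Array.newInstance(componentType, maxRecordsInRam);
    }

    public static void main(String[] args) {
        String[] buffer = newBuffer(String.class, 4);
        System.out.println(buffer.length);                       // 4
        System.out.println(buffer.getClass().getComponentType()); // class java.lang.String
    }
}
```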