Package htsjdk.variant.bcf2
Class BCF2Utils
- java.lang.Object
-
- htsjdk.variant.bcf2.BCF2Utils
-
public final class BCF2Utils extends Object
Common utilities for working with BCF2 files. Includes convenience methods for encoding and decoding BCF2 type descriptors (size + type).
- Since:
- 5/12
-
-
Field Summary
Fields
Modifier and Type, Field
static BCF2Type[] ID_TO_ENUM
static BCF2Type[] INTEGER_TYPES_BY_SIZE
static int MAX_ALLELES_IN_GENOTYPES
static int MAX_INLINE_ELEMENTS
static int OVERFLOW_ELEMENT_MARKER
-
Method Summary
All Methods, Static Methods, Concrete Methods
Modifier and Type, Method, Description
static String collapseStringList(List<String> strings)
Collapses multiple strings into a comma-separated list: ["s1", "s2", "s3"] => ",s1,s2,s3"
static int decodeSize(byte typeDescriptor)
static BCF2Type decodeType(byte typeDescriptor)
static int decodeTypeID(byte typeDescriptor)
static BCF2Type determineIntegerType(int value)
static BCF2Type determineIntegerType(int[] values)
static BCF2Type determineIntegerType(List<Integer> values)
static byte encodeTypeDescriptor(int nElements, BCF2Type type)
static List<String> explodeStringList(String collapsed)
Inverse operation of collapseStringList.
static boolean headerLinesAreOrderedConsistently(VCFHeader outputHeader, VCFHeader genotypesBlockHeader)
Are the elements and their order in the output and input headers consistent, so that we can write out the raw genotypes block without decoding and recoding it? If the order of INFO, FILTER, or contig elements in the output header is different from that in the input header, we must decode the blocks using the input header and then recode them based on the new output order.
static boolean isCollapsedString(String s)
static ArrayList<String> makeDictionary(VCFHeader header)
Creates a strings dictionary from the VCF header. The dictionary is an ordered list of common VCF identifiers (FILTER, INFO, and FORMAT fields).
static BCF2Type maxIntegerType(BCF2Type t1, BCF2Type t2)
Returns the maximum BCF2 integer size of t1 and t2. For example, if t1 == INT8 and t2 == INT16, returns INT16.
static byte readByte(InputStream stream)
static File shadowBCF(File vcfFile)
Returns a good name for a shadow BCF file for vcfFile.
static boolean sizeIsOverflow(byte typeDescriptor)
static <T> List<T> toList(Class<T> c, Object o)
Helper function that takes an object and returns a list representation of it: o == null => []; o is a list => o; else => [o]
-
-
-
Field Detail
-
MAX_ALLELES_IN_GENOTYPES
public static final int MAX_ALLELES_IN_GENOTYPES
- See Also:
- Constant Field Values
-
OVERFLOW_ELEMENT_MARKER
public static final int OVERFLOW_ELEMENT_MARKER
- See Also:
- Constant Field Values
-
MAX_INLINE_ELEMENTS
public static final int MAX_INLINE_ELEMENTS
- See Also:
- Constant Field Values
-
INTEGER_TYPES_BY_SIZE
public static final BCF2Type[] INTEGER_TYPES_BY_SIZE
-
ID_TO_ENUM
public static final BCF2Type[] ID_TO_ENUM
-
-
Method Detail
-
makeDictionary
public static ArrayList<String> makeDictionary(VCFHeader header)
Creates a strings dictionary from the VCF header. The dictionary is an ordered list of common VCF identifiers (FILTER, INFO, and FORMAT fields). Note that it is critical that the list be deduplicated and sorted in a consistent manner each time: the BCF2 offsets are encoded relative to this dictionary, so if it is not determined exactly the same way as in the header each time, the encoded offsets will point at the wrong entries.
- Parameters:
header - the VCFHeader from which to build the dictionary
- Returns:
- a non-null dictionary of elements, may be empty
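The deduplicate-and-keep-a-stable-order requirement described above can be illustrated with a minimal sketch. It does not use VCFHeader: the class DictionarySketch and its List<String> input (standing in for the FILTER, INFO, and FORMAT IDs in header order) are hypothetical, and the actual method may order or seed the dictionary differently.

```java
import java.util.ArrayList;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;

// Hypothetical stand-in for the dictionary-building step; not the htsjdk implementation.
final class DictionarySketch {
    /**
     * Builds an ordered, duplicate-free list of identifiers.
     * Assumption: idsInHeaderOrder holds the header's identifier strings in header order.
     */
    static ArrayList<String> makeDictionary(List<String> idsInHeaderOrder) {
        // LinkedHashSet drops duplicates while keeping first-seen order,
        // so the same input always yields the same dictionary.
        Set<String> seen = new LinkedHashSet<>(idsInHeaderOrder);
        return new ArrayList<>(seen);
    }
}
```

Because offsets are positions in this list, two calls on the same header must return identical lists; a deterministic, insertion-ordered set is one simple way to get that property.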
-
encodeTypeDescriptor
public static byte encodeTypeDescriptor(int nElements, BCF2Type type)
-
decodeSize
public static int decodeSize(byte typeDescriptor)
-
decodeTypeID
public static int decodeTypeID(byte typeDescriptor)
-
decodeType
public static BCF2Type decodeType(byte typeDescriptor)
-
sizeIsOverflow
public static boolean sizeIsOverflow(byte typeDescriptor)
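The descriptor methods above are undocumented on this page. As a rough sketch of how a (size + type) descriptor byte can be packed and unpacked, the example below assumes the layout given in the BCF2 specification: the element count in the high 4 bits, the type ID in the low 4 bits, and a count of 15 marking an overflow whose real size is stored separately. The class name, the int type ID parameter, and the constant values are assumptions of this sketch, not values confirmed by this page.

```java
// Hypothetical sketch of BCF2-style type descriptor packing; not the htsjdk implementation.
final class TypeDescriptorSketch {
    static final int MAX_INLINE_ELEMENTS = 14;      // assumed value
    static final int OVERFLOW_ELEMENT_MARKER = 15;  // assumed value

    /** Packs an element count and a 4-bit type ID into one byte. */
    static byte encodeTypeDescriptor(int nElements, int typeID) {
        // Counts above MAX_INLINE_ELEMENTS are replaced by the overflow marker.
        int encodedSize = Math.min(nElements, OVERFLOW_ELEMENT_MARKER);
        return (byte) (((encodedSize & 0x0F) << 4) | (typeID & 0x0F));
    }

    /** Extracts the high 4 bits (element count or overflow marker). */
    static int decodeSize(byte typeDescriptor) {
        return (typeDescriptor & 0xF0) >> 4;
    }

    /** Extracts the low 4 bits (type ID). */
    static int decodeTypeID(byte typeDescriptor) {
        return typeDescriptor & 0x0F;
    }

    /** True when the size field holds the overflow marker. */
    static boolean sizeIsOverflow(byte typeDescriptor) {
        return decodeSize(typeDescriptor) == OVERFLOW_ELEMENT_MARKER;
    }
}
```

Under these assumptions, encoding 3 elements of type ID 1 gives 0x31, and any count of 15 or more encodes with the overflow marker in the size field.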
-
readByte
public static byte readByte(InputStream stream) throws IOException
- Throws:
IOException
-
collapseStringList
public static String collapseStringList(List<String> strings)
Collapses multiple strings into a comma-separated list: ["s1", "s2", "s3"] => ",s1,s2,s3"
- Parameters:
strings - a list of strings of size > 1
- Returns:
-
explodeStringList
public static List<String> explodeStringList(String collapsed)
Inverse operation of collapseStringList: ",s1,s2,s3" => ["s1", "s2", "s3"]
- Parameters:
collapsed -
- Returns:
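The round trip between collapseStringList and explodeStringList can be sketched as follows. This is a minimal illustration based only on the examples given here; the class StringListSketch is hypothetical, and treating a leading comma as the test for a collapsed string is an assumption drawn from the ",s1,s2,s3" format.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical sketch of the collapse/explode round trip; not the htsjdk implementation.
final class StringListSketch {
    /** ["s1", "s2", "s3"] => ",s1,s2,s3" (every element is prefixed with a comma). */
    static String collapse(List<String> strings) {
        StringBuilder sb = new StringBuilder();
        for (String s : strings) {
            sb.append(',').append(s);
        }
        return sb.toString();
    }

    /** Assumption: a collapsed string is recognised by its leading comma. */
    static boolean isCollapsed(String s) {
        return !s.isEmpty() && s.charAt(0) == ',';
    }

    /** ",s1,s2,s3" => ["s1", "s2", "s3"] */
    static List<String> explode(String collapsed) {
        if (!isCollapsed(collapsed)) {
            throw new IllegalArgumentException("not a collapsed string: " + collapsed);
        }
        // Drop the leading comma, then split; limit -1 keeps trailing empty elements.
        return new ArrayList<>(Arrays.asList(collapsed.substring(1).split(",", -1)));
    }
}
```

The round trip only holds when no element itself contains a comma.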
-
isCollapsedString
public static boolean isCollapsedString(String s)
-
shadowBCF
public static final File shadowBCF(File vcfFile)
Returns a good name for a shadow BCF file for vcfFile:
foo.vcf => foo.bcf
foo.xxx => foo.xxx.bcf
If the resulting BCF file cannot be written, returns null. This happens when vcfFile = /dev/null, for example.
- Parameters:
vcfFile -
- Returns:
- the BCF file, or null if it cannot be written
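The naming rule above can be sketched in a few lines. This is an illustration of the two documented mappings only: the class ShadowBcfSketch is hypothetical, and the sketch deliberately omits the writability check that makes the real method return null.

```java
import java.io.File;

// Hypothetical sketch of the shadow-file naming rule; not the htsjdk implementation.
final class ShadowBcfSketch {
    /** foo.vcf => foo.bcf, foo.xxx => foo.xxx.bcf (no writability check). */
    static File shadowName(File vcfFile) {
        String path = vcfFile.getPath();
        if (path.endsWith(".vcf")) {
            // Replace the .vcf extension with .bcf.
            return new File(path.substring(0, path.length() - ".vcf".length()) + ".bcf");
        }
        // Any other extension is kept and .bcf is appended.
        return new File(path + ".bcf");
    }
}
```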
-
determineIntegerType
public static BCF2Type determineIntegerType(int value)
-
determineIntegerType
public static BCF2Type determineIntegerType(int[] values)
-
maxIntegerType
public static BCF2Type maxIntegerType(BCF2Type t1, BCF2Type t2)
Returns the maximum BCF2 integer size of t1 and t2. For example, if t1 == INT8 and t2 == INT16, returns INT16.
- Parameters:
t1 -
t2 -
- Returns:
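A minimal sketch of the "wider of two integer types" rule follows. The enum IntType is a hypothetical stand-in for the integer members of BCF2Type, declared from narrowest to widest so that declaration order can serve as the size ordering.

```java
// Hypothetical sketch of picking the wider integer type; not the htsjdk implementation.
final class IntegerTypeSketch {
    /** Stand-in for the integer BCF2 types, declared narrowest to widest. */
    enum IntType { INT8, INT16, INT32 }

    /** Returns whichever of t1 and t2 is the wider type. */
    static IntType maxIntegerType(IntType t1, IntType t2) {
        // ordinal() follows declaration order, so a larger ordinal means a wider type.
        return t1.ordinal() >= t2.ordinal() ? t1 : t2;
    }
}
```

Folding this function over the per-value types of an array is one way a single type wide enough for every value could be chosen.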
-
toList
public static <T> List<T> toList(Class<T> c, Object o)
Helper function that takes an object and returns a list representation of it:
o == null => []
o is a list => o
else => [o]
- Parameters:
c - the class of the object
o - the object to convert to a Java List
- Returns:
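The three cases listed above can be written out as a short sketch. The class ToListSketch is hypothetical; whether the real method returns immutable lists, or checks element types when o is already a list, is not stated here, so those details are assumptions.

```java
import java.util.Collections;
import java.util.List;

// Hypothetical sketch of the null / list / single-object cases; not the htsjdk implementation.
final class ToListSketch {
    @SuppressWarnings("unchecked")
    static <T> List<T> toList(Class<T> c, Object o) {
        if (o == null) {
            return Collections.emptyList();      // o == null => []
        }
        if (o instanceof List) {
            return (List<T>) o;                  // o is a list => o (unchecked, returned as-is)
        }
        return Collections.singletonList(c.cast(o)); // else => [o]
    }
}
```

For example, toList(String.class, "x") gives a one-element list, while passing an existing list returns that same list instance.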
-
headerLinesAreOrderedConsistently
public static boolean headerLinesAreOrderedConsistently(VCFHeader outputHeader, VCFHeader genotypesBlockHeader)
Are the elements and their order in the output and input headers consistent, so that we can write out the raw genotypes block without decoding and recoding it?
If the order of INFO, FILTER, or contig elements in the output header is different from that in the input header, we must decode the blocks using the input header and then recode them based on the new output order. If they are consistent, we can simply pass through the raw genotypes block bytes, which is a *huge* performance win for large blocks.
Many common operations on BCF2 files (merging them for -nt, selecting a subset of records, etc.) don't modify the ordering of the header fields and so can safely pass through the genotypes undecoded. Some operations (those that add filters or info fields) can change the ordering of the header fields and so produce invalid BCF2 files if the genotypes aren't decoded.
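To make the pass-through decision concrete, here is one plausible form of the consistency check, reduced to lists of identifier strings. The class HeaderOrderSketch and its list parameters are hypothetical, and the rule used (every input ID must sit at the same position in the output header) is an assumption of this sketch rather than a documented description of the method.

```java
import java.util.List;

// Hypothetical sketch of an order-consistency check; not the htsjdk implementation.
final class HeaderOrderSketch {
    /**
     * Returns true when every input ID appears at the same index in the output IDs,
     * i.e. dictionary offsets written with the input header still resolve correctly.
     */
    static boolean orderedConsistently(List<String> outputIds, List<String> inputIds) {
        if (outputIds.size() < inputIds.size()) {
            return false; // the output header is missing lines
        }
        for (int i = 0; i < inputIds.size(); i++) {
            if (!inputIds.get(i).equals(outputIds.get(i))) {
                return false; // an ID moved, so raw blocks must be decoded and recoded
            }
        }
        return true;
    }
}
```

A caller would copy the raw genotypes bytes when this returns true and fall back to decoding with the input header and recoding with the output header otherwise.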
-
-