package spark

Provides classes and routines for running on Apache Spark. The main entry point is the Discount class. Once configured, it can be used to generate other classes of interest, such as GroupedSegments and CountedKmers.

Type Members

  1. type AnyMinSplitter = MinSplitter[MinimizerPriorities]
  2. sealed trait CountMethod extends AnyRef

    Defines a strategy for counting k-mers in Spark.

  3. class CountedKmers extends AnyRef

    A collection of counted k-mers represented in encoded form. Each k-mer is represented individually, making this dataset large if cached or persisted.

  4. final case class Discount(k: Int, minimizers: MinimizerSource = Bundled, m: Int = 10, ordering: MinimizerOrdering = Frequency, sample: Double = 0.01, maxSequenceLength: Int = 1000000, normalize: Boolean = false, method: CountMethod = Auto, partitions: Int = 200)(implicit spark: SparkSession) extends Product with Serializable

    Main API entry point for Discount. Also see the command line examples in the documentation for more information on these options.

    k

    k-mer length

    minimizers

    source of minimizers. See MinimizerSource

    m

    minimizer width

    ordering

    minimizer ordering. See MinimizerOrdering

    sample

    sample fraction for frequency orderings

    maxSequenceLength

    max length of a single sequence (for short reads)

    normalize

    whether to normalize k-mer orientation during counting. Causes every sequence to be scanned in both forward and reverse, after which only forward orientation k-mers are kept.

    method

    counting method to use (or Auto for automatic selection). See CountMethod

    partitions

    number of shuffle partitions/index buckets

    spark

    the SparkSession

  5. class FastaOutputFormat[K, V] extends TextOutputFormat[K, V]
  6. class FastaShortInput extends InputReader

    Input reader for FASTA sequences of a fixed maximum length. Uses FASTAshortInputFileFormat

  7. class FastqShortInput extends InputReader

    Input reader for FASTQ short reads. Uses FASTQInputFileFormat

  8. class GroupedSegments extends AnyRef

    A collection of counted super-mers grouped into bins (by minimizer). Super-mers are segments of length >= k where every k-mer shares the same minimizer.

    Unlike with the Index, every k-mer in the super-mers is guaranteed to be present.
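The super-mer definition above can be illustrated with a minimal sketch in plain Scala. It assumes a lexicographic minimizer ordering and works on plain strings. `SuperMerSketch`, `minimizer`, and `superMers` are hypothetical names for illustration only, not part of the Discount API, which operates on encoded data in Spark.

```scala
object SuperMerSketch {
  // ASSUMPTION: the minimizer of a k-mer is its lexicographically smallest m-mer.
  def minimizer(kmer: String, m: Int): String =
    kmer.sliding(m).min

  // Split a sequence into super-mers: maximal runs of consecutive k-mers
  // that share the same minimizer. Returns (minimizer, superMer) pairs.
  def superMers(seq: String, k: Int, m: Int): List[(String, String)] = {
    require(m <= k && k <= seq.length, "need m <= k <= sequence length")
    // Minimizer of the k-mer starting at each position.
    val mins = seq.sliding(k).map(minimizer(_, m)).toVector
    // Each run is (minimizer, first k-mer start, last k-mer start).
    val runs = mins.indices.foldLeft(List.empty[(String, Int, Int)]) {
      case ((min, start, _) :: rest, i) if min == mins(i) => (min, start, i) :: rest
      case (acc, i) => (mins(i), i, i) :: acc
    }
    // A run of n k-mers covers n + k - 1 bases, so every super-mer has length >= k.
    runs.reverse.map { case (min, start, end) => (min, seq.substring(start, end + k)) }
  }
}
```

For example, `SuperMerSketch.superMers("TTACGGTT", 5, 2)` yields `List(("AC", "TTACGGT"), ("CG", "CGGTT"))`: the first three 5-mers share the minimizer `AC`, and the last one has minimizer `CG`.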

  9. final case class HashSegment(hash: BucketId, segment: ZeroNTBitArray) extends Product with Serializable

    A single hashed sequence segment (super-mer) with its minimizer.

    hash

    The minimizer

    segment

    The super-mer

  10. class Index extends AnyRef

    A bucketed k-mer index. Indexes store super-mers in a Dataset of ReducibleBucket, where each k-mer is associated with a tag. Typically tags are k-mer counts, in which case the Index becomes a multiset of counted k-mers. Indexes are immutable, like other Spark data structures, so operations like filtering return a new Index rather than changing the existing one in place. Indexes can be combined using operations like union, intersect, and subtract, and can be written to disk in various formats. The default format used by the write() and read() methods is bucketed parquet files, which gives good data compression and avoids shuffling when the same Index is used repeatedly.

  11. case class IndexParams(bcSplit: Broadcast[AnyMinSplitter], buckets: Int, location: String) extends Product with Serializable

    Parameters for a k-mer index.

    bcSplit

    The broadcast splitter (minimizer scheme/ordering)

    buckets

    The number of buckets (Spark partitions) to partition the index into. Note that this is not the same as the number of minimizer bins.

    location

    The location (directory/prefix name) where the index is stored

  12. class IndexedFastaInput extends InputReader

    Input reader for FASTA files containing potentially long sequences, with a .fai index. FAI indexes can be created with tools such as seqkit. Uses IndexedFastaFormat

  13. abstract class InputReader extends AnyRef

    A reader that reads input data from one file using a specific Hadoop format.

  14. class Inputs extends AnyRef

    A set of input files that can be parsed into com.jnpersson.discount.hash.InputFragment

  15. class Kmers extends AnyRef

    Convenience methods for interacting with k-mers from a set of input files.

  16. trait MinimizerSource extends AnyRef

    A method for obtaining a set of minimizers for given values of k and m. Except for the case of All, the sets obtained should be universal hitting sets (UHSs).

  17. final case class Path(path: String) extends MinimizerSource with Product with Serializable

    A file, or a directory containing multiple files with names like minimizers_{k}_{m}.txt, in which case the best file will be selected. These files may specify an ordering.

    path

    the file, or directory to scan

  18. sealed trait Rule extends Serializable

    k-mer combination (reduction) rules for combining indexes. Most of these support both intersection and union. An intersection is an operation that requires the k-mer to be present in every input index, or it will not be present in the output. A union may preserve the k-mer even if it is present in only one input index. Except for the case of the union Sum reduction, indexes must be compacted prior to reduction, that is, each k-mer must occur in each index with a nonzero value only once.

    These rules were inspired by the design of KMC3: https://github.com/refresh-bio/KMC
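The difference between intersection and union described above can be sketched on ordinary Scala maps. This is only an illustration under stated assumptions: `ReductionSketch`, `intersect`, and `union` are hypothetical helpers, not the Discount Rule API, and a `Map` stands in for a compacted index because it holds each k-mer at most once.

```scala
object ReductionSketch {
  // A compacted "index": each k-mer occurs once with a single count.
  type Counts = Map[String, Long]

  // Intersection: a k-mer survives only if it is present in both inputs.
  def intersect(a: Counts, b: Counts)(reduce: (Long, Long) => Long): Counts =
    a.collect { case (kmer, x) if b.contains(kmer) => kmer -> reduce(x, b(kmer)) }

  // Union: a k-mer survives if it is present in at least one input.
  // The reduction is applied only where both inputs contain the k-mer.
  def union(a: Counts, b: Counts)(reduce: (Long, Long) => Long): Counts =
    (a.keySet ++ b.keySet).iterator.map { kmer =>
      kmer -> List(a.get(kmer), b.get(kmer)).flatten.reduce(reduce)
    }.toMap
}
```

With `a = Map("ACG" -> 2L, "CGT" -> 1L)` and `b = Map("ACG" -> 3L, "GTT" -> 5L)`, a sum intersection keeps only `ACG` (with value 5), while a sum union keeps all three k-mers.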

  19. class Sampling extends AnyRef

    Routines for creating and managing frequency sampled minimizer orderings.
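As a rough illustration of a frequency sampled minimizer ordering, the sketch below counts m-mers in a sample of sequences and ranks them. It assumes that rarer m-mers get higher priority, with ties broken lexicographically; that choice, along with the names `FrequencyOrderingSketch` and `ordering`, is an assumption for illustration and is not taken from the Sampling class itself.

```scala
object FrequencyOrderingSketch {
  // Rank m-mers seen in the sampled sequences.
  // ASSUMPTION: rarest m-mers come first; ties are broken lexicographically
  // so that the resulting ordering is deterministic.
  def ordering(sample: Seq[String], m: Int): Vector[String] = {
    val counts = sample
      .flatMap(_.sliding(m))
      .groupBy(identity)
      .map { case (mmer, occurrences) => mmer -> occurrences.size }
    counts.toVector.sortBy { case (mmer, n) => (n, mmer) }.map(_._1)
  }
}
```

For the sample `Seq("ACAC", "ACGT")` with `m = 2`, the m-mer `AC` occurs three times and all others once, so `AC` is ranked last.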

Value Members

  1. object All extends MinimizerSource with Product with Serializable

    Use all m-mers as minimizers. Can be auto-generated for any m. The initial ordering is lexicographic.

  2. object Auto extends CountMethod with Product with Serializable

    Indicates that a strategy should be auto-selected.

  3. object Bundled extends MinimizerSource with Product with Serializable

    Bundled minimizers on the classpath (only available for some values of k and m).

  4. object Discount extends SparkTool with Serializable

    Main command-line interface to Discount.

  5. object GroupedSegments
  6. object HDFSUtil

    HDFS helper routines

  7. object Helpers
  8. object Index
  9. object IndexParams extends Serializable
  10. object Output

    Output format helper methods

  11. object Pregrouped extends CountMethod with Product with Serializable

    Pregrouped counting: groups and counts identical super-mers before counting k-mers. Faster for datasets with high redundancy.

  12. object Rule extends Serializable
  13. object Sampling
  14. object Simple extends CountMethod with Product with Serializable

    Non-pregrouped: counts k-mers immediately. Faster for datasets with low redundancy.
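The two counting strategies, Pregrouped and Simple, can be contrasted with a small sketch on in-memory collections. It is a simplification under stated assumptions: `CountingSketch`, `simple`, and `pregrouped` are hypothetical names, super-mers are plain strings, and no Spark or minimizer binning is involved. Both functions return the same counts and differ only in when identical super-mers are merged.

```scala
object CountingSketch {
  // Simple: extract every k-mer from every super-mer immediately, then count.
  def simple(superMers: Seq[String], k: Int): Map[String, Long] =
    superMers
      .flatMap(_.sliding(k))
      .groupBy(identity)
      .map { case (kmer, occurrences) => kmer -> occurrences.size.toLong }

  // Pregrouped: first count identical super-mers, then extract k-mers once per
  // distinct super-mer, weighting each k-mer by that super-mer's multiplicity.
  def pregrouped(superMers: Seq[String], k: Int): Map[String, Long] = {
    val grouped = superMers.groupBy(identity).map { case (s, xs) => s -> xs.size.toLong }
    grouped.toSeq
      .flatMap { case (s, n) => s.sliding(k).map(_ -> n) }
      .groupBy(_._1)
      .map { case (kmer, weighted) => kmer -> weighted.map(_._2).sum }
  }
}
```

When many super-mers are identical, the pregrouped variant extracts k-mers from far fewer strings, which is consistent with the note that it is faster for highly redundant data.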
