binarymap¶

Defines BinaryMap objects for handling binary representations of variants and their functional scores.

class dms_variants.binarymap.BinaryMap(func_scores_df, *, substitutions_col='aa_substitutions', func_score_col='func_score', func_score_var_col='func_score_var', n_pre_col='pre_count', n_post_col='post_count', cols_optional=True, alphabet=('A', 'C', 'D', 'E', 'F', 'G', 'H', 'I', 'K', 'L', 'M', 'N', 'P', 'Q', 'R', 'S', 'T', 'V', 'W', 'Y', '*'), expand=False, wtseq=None)[source]¶

Bases: object

Binary representations of variants and their functional scores.

Note

These maps represent variants as arrays of 0 and 1 integers indicating whether a particular variant has a substitution. The wildtype is all 0. Such representations are useful for fitting estimates of the effect of each substitution.

Unless you are using the expand option, the binary maps only cover substitutions relative to wildtype that are present in at least one of the variants used to create the map.

Parameters

func_scores_df (pandas.DataFrame) – Data frame of variants and their functional scores. Each row is a different variant, defined by a space-delimited list of substitutions. Data frames of this type are returned by dms_variants.codonvarianttable.CodonVariantTable.func_scores().
substitutions_col (str) – Column in func_scores_df giving substitutions for each variant.
func_score_col (str or None) – Column in func_scores_df giving functional score for each variant, or None if no functional scores available.
func_score_var_col (str or None) – Column in func_scores_df giving variance on functional score estimate, or None if no variance available.
n_pre_col (str or None) – Column in func_scores_df giving pre-selection counts for each variant, or None if counts not available.
n_post_col (str or None) – Column in func_scores_df giving post-selection counts for each variant, or None if counts not available.
cols_optional (bool) – All of the *_col parameters are optional except substitutions_col. If cols_optional is True, a missing column is treated the same as setting that column’s parameter to None: the corresponding attribute is set to None.
alphabet (list or tuple) – Allowed characters (e.g., amino acids or codons).
expand (bool) – If False (the default) the encoding only covers substitutions relative to wildtype that are observed in the set of variants. If True then the encoding covers all allowed characters at each site regardless of whether they are wildtype or observed. In this latter case, each binary representation is of length (alphabet size) \(\times\) (sequence length), and sums to the sequence length.
wtseq (None or str) – Only set this option if expand is True. In that case, it should be the wildtype sequence.
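
The expanded encoding described for expand can be illustrated with a small pure-Python sketch. This is not the library's implementation: the helper name expanded_encoding and the index layout (one block of alphabet-sized entries per site, in alphabet order) are assumptions, chosen because they are consistent with the expand=True output in the Example below.

```python
def expanded_encoding(sub_str, wtseq, alphabet):
    """One-hot encode a variant over every (site, character) pair.

    Hypothetical helper for illustration; not part of dms_variants.
    """
    # Start from the wildtype sequence and apply each substitution.
    seq = list(wtseq)
    for sub in sub_str.split():
        wt, site, mut = sub[0], int(sub[1:-1]), sub[-1]
        if wtseq[site - 1] != wt:
            raise ValueError(f"wildtype mismatch in {sub}")
        seq[site - 1] = mut
    # One block of len(alphabet) entries per site, with a single 1 per block
    # (assumed layout: site-major, characters in alphabet order).
    binary = [0] * (len(wtseq) * len(alphabet))
    for isite, char in enumerate(seq):
        binary[isite * len(alphabet) + alphabet.index(char)] = 1
    return binary


wtseq = "MAKG"
alphabet = ["A", "C", "G", "K", "M", "*"]
wildtype = expanded_encoding("", wtseq, alphabet)
double_mutant = expanded_encoding("M1C K3A", wtseq, alphabet)

# Length is (alphabet size) x (sequence length); entries sum to sequence length.
print(len(wildtype), sum(wildtype), sum(double_mutant))
```

Because every site contributes exactly one 1, the wildtype is no longer all zeros in this encoding, unlike the default expand=False representation.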

binarylength¶

Length of the binary representation of each variant.

Type: int

nvariants¶

Number of variants.

Type: int

binary_variants¶

Sparse matrix of shape nvariants by binarylength. Row binary_variants[ivariant] gives the binary representation of variant ivariant, and binary_variants[ivariant, i] is 1 if the variant has the substitution returned by BinaryMap.i_to_sub(i) and 0 otherwise. To convert to a dense numpy.ndarray, use the toarray method of the sparse matrix.

Type: scipy.sparse.csr.csr_matrix of dtype int8

substitution_variants¶

All variants as substitution strings as provided in substitutions_col of func_scores_df.

Type: list

func_scores¶

A 1D array of length nvariants giving score for each variant.

Type: numpy.ndarray of floats

func_scores_var¶

A 1D array of length nvariants giving variance on score for each variant, or None if no variance estimates provided.

Type: numpy.ndarray of floats, or None

n_pre¶

A 1D array of length nvariants giving pre-selection counts for each variant, or None if counts not provided.

Type: numpy.ndarray of integers, or None

n_post¶

A 1D array of length nvariants giving post-selection counts for each variant, or None if counts not provided.

Type: numpy.ndarray of integers, or None

alphabet¶

Allowed characters (e.g., amino acids or codons).

Type: tuple

substitutions_col¶

Value set when initializing object.

Type: str

Example

Create a binary map:

>>> import numpy
>>> import pandas as pd
>>> from dms_variants.binarymap import BinaryMap
>>> func_scores_df = pd.DataFrame.from_records(
...         [('', 0.0, 0.2),
...          ('M1A', -0.2, 0.1),
...          ('M1C K3A', -0.4, 0.3),
...          ('', 0.01, 0.15),
...          ('A2C K3A', -0.05, 0.1),
...          ('A2*', -1.2, 0.4),
...          ],
...         columns=['aa_substitutions', 'func_score', 'func_score_var'])
>>> binmap = BinaryMap(func_scores_df)

The length of the binary representation equals the number of unique substitutions:

>>> binmap.binarylength
5
>>> binmap.all_subs
['M1A', 'M1C', 'A2*', 'A2C', 'K3A']

Scores, score variances, binary and string representations:

>>> binmap.nvariants
6
>>> binmap.func_scores
array([ 0.  , -0.2 , -0.4 ,  0.01, -0.05, -1.2 ])
>>> binmap.func_scores_var
array([0.2 , 0.1 , 0.3 , 0.15, 0.1 , 0.4 ])
>>> type(binmap.binary_variants)
<class 'scipy.sparse.csr.csr_matrix'>
>>> binmap.binary_variants.toarray()
array([[0, 0, 0, 0, 0],
       [1, 0, 0, 0, 0],
       [0, 1, 0, 0, 1],
       [0, 0, 0, 0, 0],
       [0, 0, 0, 1, 1],
       [0, 0, 1, 0, 0]], dtype=int8)
>>> binmap.substitution_variants
['', 'M1A', 'M1C K3A', '', 'A2C K3A', 'A2*']
>>> binmap.substitutions_col
'aa_substitutions'
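
To make the default (expand=False) encoding concrete, here is a minimal pure-Python sketch that reproduces the substitution order and the binary rows shown above. The function build_binary_map is hypothetical and is not the library's code; in particular, sorting by site number and then by mutant character is an assumption that happens to match this example's output.

```python
def build_binary_map(variants):
    """Return (all_subs, rows) for a list of space-delimited variants.

    Hypothetical helper for illustration; not part of dms_variants.
    """
    # Collect every substitution seen in at least one variant.
    subs = {sub for variant in variants for sub in variant.split()}
    # Assumed ordering: by site number, then by mutant character.
    all_subs = sorted(subs, key=lambda s: (int(s[1:-1]), s[-1]))
    sub_to_i = {sub: i for i, sub in enumerate(all_subs)}
    # Each variant becomes a 0/1 row; wildtype ('') stays all zeros.
    rows = []
    for variant in variants:
        row = [0] * len(all_subs)
        for sub in variant.split():
            row[sub_to_i[sub]] = 1
        rows.append(row)
    return all_subs, rows


variants = ["", "M1A", "M1C K3A", "", "A2C K3A", "A2*"]
all_subs, rows = build_binary_map(variants)
print(all_subs)
for row in rows:
    print(row)
```

The sketch uses dense lists for readability; the real binary_variants attribute is a sparse matrix, which matters when there are many variants and substitutions.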

Validate that the binary map interconverts binary representations and substitutions:

>>> for ivar in range(binmap.nvariants):
...     binvar = binmap.binary_variants.toarray()[ivar]
...     subs_from_df = func_scores_df.at[ivar, 'aa_substitutions']
...     assert subs_from_df == binmap.binary_to_sub_str(binvar)
...     assert all(binvar == binmap.sub_str_to_binary(subs_from_df))

Demonstrate BinaryMap.sub_str_to_indices():

>>> for sub in binmap.substitution_variants:
...     print(binmap.sub_str_to_indices(sub))
[]
[0]
[1, 4]
[]
[3, 4]
[2]
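
One way such index lists might be used when estimating substitution effects is an additive model, where a variant's predicted score is the sum of the effects of its substitutions. The sketch below is only an illustration: the function additive_score is hypothetical, and the effect values are made up for the example rather than fitted to the data above.

```python
# Made-up per-substitution effects, one per index in the binary map
# (order: 'M1A', 'M1C', 'A2*', 'A2C', 'K3A'). Not fitted values.
effects = [-0.2, -0.3, -1.2, 0.0, -0.1]


def additive_score(indices, effects, wildtype_score=0.0):
    """Sum the effects of the substitutions at the given indices."""
    return wildtype_score + sum(effects[i] for i in indices)


# Index lists as printed by sub_str_to_indices in the example above.
index_lists = [[], [0], [1, 4], [], [3, 4], [2]]
predicted = [additive_score(ix, effects) for ix in index_lists]
print(predicted)
```

Working from index lists avoids touching the zero entries of each row, which is the same reason the binary variants are stored sparsely.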

Now do a similar operation but using expand to include the full alphabet (although to keep the size manageable, we use an alphabet smaller than all amino acids):

>>> wtseq = 'MAKG'
>>> alphabet = ['A', 'C', 'G', 'K', 'M', '*']
>>> binmap_expand = BinaryMap(func_scores_df,
...                           alphabet=alphabet,
...                           expand=True,
...                           wtseq=wtseq)
>>> binmap_expand.binarylength == len(wtseq) * len(alphabet)
True

>>> binmap_expand.all_subs
... 
['M1A', 'M1C', 'M1G', 'M1K', 'M1*',
 'A2C', 'A2G', 'A2K', 'A2M', 'A2*',
 'K3A', 'K3C', 'K3G', 'K3M', 'K3*',
 'G4A', 'G4C', 'G4K', 'G4M', 'G4*']

>>> binmap_expand.binary_variants.toarray()
... 
array([[0, 0, 0, 0, 1, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 1, 0,
        0, 0],
       [1, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 1, 0,
        0, 0],
       [0, 1, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 1, 0,
        0, 0],
       [0, 0, 0, 0, 1, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 1, 0,
        0, 0],
       [0, 0, 0, 0, 1, 0, 0, 1, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 1, 0,
        0, 0],
       [0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 1, 0, 0, 0, 0, 1, 0,
        0, 0]], dtype=int8)

>>> all(numpy.sum(binmap_expand.binary_variants.toarray(), axis=1) ==
...     numpy.full(binmap_expand.nvariants, len(wtseq)))
True

>>> binmap_expand.substitution_variants
['', 'M1A', 'M1C K3A', '', 'A2C K3A', 'A2*']

>>> for ivar in range(binmap_expand.nvariants):
...     binvar = binmap_expand.binary_variants.toarray()[ivar]
...     subs_from_df = func_scores_df.at[ivar, 'aa_substitutions']
...     assert subs_from_df == binmap_expand.binary_to_sub_str(binvar)
...     assert all(binvar == binmap_expand.sub_str_to_binary(subs_from_df))

Note that binmap does not have n_pre and n_post attributes set:

>>> binmap.n_pre == binmap.n_post == None
True

We would not have been able to initialize binmap if we weren’t using the cols_optional flag:

>>> BinaryMap(func_scores_df, alphabet=alphabet, cols_optional=False)
Traceback (most recent call last):
  ...
ValueError: `func_scores_df` lacks column pre_count

Now assign values to n_pre and n_post attributes:

>>> func_scores_df_counts = (
...         func_scores_df.assign(pre_count=[10, 20, 15, 5, 6, 8],
...                               post_count=[0, 3, 12, 11, 9, 8])
...         )
>>> binmap_counts = BinaryMap(func_scores_df_counts, alphabet=alphabet)
>>> binmap_counts.n_pre
array([10, 20, 15,  5,  6,  8])
>>> binmap_counts.n_post
array([ 0,  3, 12, 11,  9,  8])

property all_subs¶

Substitutions in order encoded in binary map.

Type: list

binary_to_sub_str(binary)[source]¶

Convert binary representation to space-delimited substitutions.

Note

This method is the inverse of BinaryMap.sub_str_to_binary().

Parameters: binary (numpy.ndarray) – Binary representation.
Returns: Space-delimited substitutions.
Return type: str
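
As a rough illustration of this conversion for the default (non-expanded) encoding, the sketch below joins the substitutions whose entries are 1. The function decode_binary is hypothetical and takes the ordered substitution list explicitly; it is not how the method is necessarily implemented.

```python
def decode_binary(binary, all_subs):
    """Join the substitutions whose entry in `binary` is 1.

    Hypothetical helper for illustration; not part of dms_variants.
    """
    if len(binary) != len(all_subs):
        raise ValueError("binary representation has the wrong length")
    return " ".join(sub for bit, sub in zip(binary, all_subs) if bit)


# Substitution order taken from binmap.all_subs in the example above.
all_subs = ["M1A", "M1C", "A2*", "A2C", "K3A"]
double_mutant = decode_binary([0, 1, 0, 0, 1], all_subs)
wildtype = decode_binary([0, 0, 0, 0, 0], all_subs)
print(repr(double_mutant), repr(wildtype))
```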

i_to_sub(i)[source]¶

Substitution corresponding to index in binary representation.

Parameters: i (int) – Index in binary representation, 0 <= i < binarylength.
Returns: The substitution corresponding to that index.
Return type: str

sub_str_to_binary(sub_str)[source]¶

Convert space-delimited substitutions to binary representation.

Parameters: sub_str (str) – Space-delimited substitutions.
Returns: Binary representation.
Return type: numpy.ndarray of dtype int8

sub_str_to_indices(sub_str)[source]¶

Convert space-delimited substitutions to list of non-zero indices.

Parameters: sub_str (str) – Space-delimited substitutions.
Returns: The binary representation index of each substitution, so wildtype gives an empty list.
Return type: list

sub_to_i(sub)[source]¶

Index in binary representation corresponding to substitution.

Parameters: sub (str) – The substitution.
Returns: Index in binary representation, will be >= 0 and < binarylength.
Return type: int
