binarymap
Defines BinaryMap
objects for handling binary representations
of protein/nucleotide variants and their functional scores.
Specifically, let \(v\) be a variant. We convert
\(v\) into a binary representation with respect to some wildtype
sequence. This representation is a vector \(\mathbf{b}\left(v\right)\)
with element \(b\left(v\right)_m\) equal to 1 if the variant has mutation
\(m\) and 0 otherwise, and \(m\) ranging over all \(M\) mutations
observed in the overall set of variants (so \(\mathbf{b}\left(v\right)\)
is of length \(M\)). Variants can be converted into this binary form
using a BinaryMap.
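The encoding described above can be sketched by hand in a few lines. This is a minimal illustration in plain Python, not the package's implementation: the helper names `parse_sub`, `observed_subs`, and `to_binary` are hypothetical, and ordering mutations by site and then by alphabet position is an assumption chosen to match the ordering shown in the BinaryMap examples.

```python
# Illustrative sketch only; this is not the binarymap package's code.
ALPHABET = "ACDEFGHIKLMNPQRSTVWY*"  # amino acids plus stop, as in AAS_WITHSTOP


def parse_sub(sub):
    """Split a substitution such as 'M1A' into (wildtype, site, mutant)."""
    return sub[0], int(sub[1:-1]), sub[-1]


def observed_subs(variants):
    """Unique substitutions across all variants, ordered by site then mutant."""
    subs = {sub for variant in variants for sub in variant.split()}
    return sorted(
        subs,
        key=lambda s: (parse_sub(s)[1], ALPHABET.index(parse_sub(s)[2])),
    )


def to_binary(variant, subs):
    """Vector b(v): element m is 1 if the variant carries mutation m, else 0."""
    present = set(variant.split())
    return [1 if sub in present else 0 for sub in subs]


variants = ["", "M1A", "M1C K3A", "A2C K3A", "A2*"]
subs = observed_subs(variants)  # the M observed mutations
vectors = [to_binary(v, subs) for v in variants]  # each of length M
```

Here the wildtype (the empty string) maps to the all-zero vector, and every vector has length equal to the number of distinct mutations observed across the set.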
- binarymap.binarymap.AAS_NOSTOP = ('A', 'C', 'D', 'E', 'F', 'G', 'H', 'I', 'K', 'L', 'M', 'N', 'P', 'Q', 'R', 'S', 'T', 'V', 'W', 'Y')
Amino-acid one-letter codes, alphabetized, not including stop.
- Type
tuple
- binarymap.binarymap.AAS_WITHGAP = ('A', 'C', 'D', 'E', 'F', 'G', 'H', 'I', 'K', 'L', 'M', 'N', 'P', 'Q', 'R', 'S', 'T', 'V', 'W', 'Y', '-')
Amino-acid one-letter codes, alphabetized, plus gap as -.
- Type
tuple
- binarymap.binarymap.AAS_WITHSTOP = ('A', 'C', 'D', 'E', 'F', 'G', 'H', 'I', 'K', 'L', 'M', 'N', 'P', 'Q', 'R', 'S', 'T', 'V', 'W', 'Y', '*')
Amino-acid one-letter codes, alphabetized, plus stop as *.
- Type
tuple
- binarymap.binarymap.AAS_WITHSTOP_WITHGAP = ('A', 'C', 'D', 'E', 'F', 'G', 'H', 'I', 'K', 'L', 'M', 'N', 'P', 'Q', 'R', 'S', 'T', 'V', 'W', 'Y', '*', '-')
Amino-acid one-letter codes plus stop as * and gap as -.
- Type
tuple
- class binarymap.binarymap.BinaryMap(func_scores_df, *, substitutions_col='aa_substitutions', func_score_col='func_score', func_score_var_col='func_score_var', n_pre_col='pre_count', n_post_col='post_count', cols_optional=True, alphabet=('A', 'C', 'D', 'E', 'F', 'G', 'H', 'I', 'K', 'L', 'M', 'N', 'P', 'Q', 'R', 'S', 'T', 'V', 'W', 'Y', '*'), allowed_subs=None, sites_as_str=False, expand=False, wtseq=None)
Bases: object
Binary representations of variants and their functional scores.
Note
These maps represent variants as arrays of 0 and 1 integers indicating whether a particular variant has a substitution. The wildtype is all 0. Such representations are useful for fitting estimates of the effect of each substitution.
Unless you are using the expand option, the binary maps only cover substitutions relative to wildtype that are present in at least one of the variants used to create the map.
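To see why a 0/1 encoding is convenient for estimating substitution effects, consider a simple additive model in which a variant's score is an intercept plus the summed effects of the substitutions it carries. The sketch below assumes such a model purely for illustration; the effect values are made-up placeholders rather than fitted estimates, and `predict_additive` is a hypothetical helper, not part of this package.

```python
# Hypothetical per-substitution effects, listed in the same order as the
# binary encoding. These numbers are placeholders, not fitted values.
subs = ["M1A", "M1C", "A2C", "A2*", "K3A"]
effects = [-0.2, -0.3, 0.05, -1.2, -0.1]


def predict_additive(binary, effects, intercept=0.0):
    """Intercept plus the sum of effects for substitutions marked 1."""
    return intercept + sum(b * e for b, e in zip(binary, effects))


wildtype = [0, 0, 0, 0, 0]       # wildtype is all 0
double_mutant = [0, 1, 0, 0, 1]  # carries M1C and K3A

wildtype_score = predict_additive(wildtype, effects)
double_mutant_score = predict_additive(double_mutant, effects)
```

Under this assumption the prediction is a dot product between the binary vector and the effect vector, so fitting the effects reduces to a regression on the rows of the binary matrix.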
- Parameters
func_scores_df (pandas.DataFrame) – Data frame of variants and their functional scores. Each row is a different variant, defined by space-delimited list of substitutions.
substitutions_col (str) – Column in func_scores_df giving substitutions for each variant.
func_score_col (str or None) – Column in func_scores_df giving functional score for each variant, or None if no functional scores available.
func_score_var_col (str or None) – Column in func_scores_df giving variance on functional score estimate, or None if no variance available.
n_pre_col (str or None) – Column in func_scores_df giving pre-selection counts for each variant, or None if counts not available.
n_post_col (str or None) – Column in func_scores_df giving post-selection counts for each variant, or None if counts not available.
cols_optional (bool) – All of the *_col parameters are optional except substitutions_col. If cols_optional is True, the absence of any of these columns is treated the same as setting that column’s parameter to None: the corresponding attribute is set to None.
alphabet (list or tuple) – Allowed characters (e.g., amino acids or codons).
allowed_subs (array-like) – The created binary map will include exactly this set of substitutions, and an error will be raised on any attempt to initialize with a variant containing a substitution not in this set. Incompatible with the expand option.
sites_as_str (bool) – Site numbers are str rather than int. If you use this option, you are allowed to have sites with a lowercase letter suffix (e.g., “214a”), as sometimes arises when a protein is numbered in alignment with a reference.
expand (bool) – If False (the default), the encoding only covers substitutions relative to wildtype that are observed in the set of variants. If True, the encoding covers all allowed characters at each site regardless of whether they are wildtype or observed. In this latter case, each binary representation is of length (alphabet size) \(\times\) (sequence length), and sums to the sequence length. You cannot use this option in conjunction with sites_as_str.
wtseq (None or str) – Only set this option if expand is True. In that case, it should be the wildtype sequence.
- binarylength
Length of the binary representation of each variant.
- Type
int
- nvariants
Number of variants.
- Type
int
- binary_variants
Sparse matrix of shape nvariants by binarylength. Row binary_variants[ivariant] gives the binary representation of variant ivariant, and binary_variants[ivariant, i] is 1 if the variant has the substitution given by BinaryMap.i_to_sub(i) and 0 otherwise. To convert to a dense numpy.ndarray, use the toarray method of the sparse matrix.
- Type
scipy.sparse.csr_matrix of dtype int8
- binary_sites
Array of length binarylength giving the site number corresponding to each mutation in the binary order. Entries are int or str depending on the value of sites_as_str.
- Type
numpy.ndarray
- substitution_variants
All variants as substitution strings as provided in substitutions_col of func_scores_df.
- Type
list
- func_scores
A 1D array of length nvariants giving score for each variant.
- Type
numpy.ndarray of floats
- func_scores_var
A 1D array of length nvariants giving variance on score for each variant, or None if no variance estimates provided.
- Type
numpy.ndarray of floats, or None
- n_pre
A 1D array of length nvariants giving pre-selection counts for each variant, or None if counts not provided.
- Type
numpy.ndarray of integers, or None
- n_post
A 1D array of length nvariants giving post-selection counts for each variant, or None if counts not provided.
- Type
numpy.ndarray of integers, or None
- alphabet
Allowed characters (e.g., amino acids or codons).
- Type
tuple
- substitutions_col
Value set when initializing object.
- Type
str
- sites_as_str
Value set when initializing object.
- Type
bool
Example
Create a binary map:
>>> func_scores_df = pd.DataFrame.from_records(
...     [('', 0.0, 0.2),
...      ('M1A', -0.2, 0.1),
...      ('M1C K3A', -0.4, 0.3),
...      ('', 0.01, 0.15),
...      ('A2C K3A', -0.05, 0.1),
...      ('A2*', -1.2, 0.4),
...      ],
...     columns=['aa_substitutions', 'func_score', 'func_score_var'])
>>> binmap = BinaryMap(func_scores_df)
The length of the binary representation equals the number of unique substitutions, and we can also see which entries correspond to which substitution:
>>> binmap.binarylength
5
>>> binmap.all_subs
['M1A', 'M1C', 'A2C', 'A2*', 'K3A']
>>> binmap.binary_sites
array([1, 1, 2, 2, 3])
Scores, score variances, binary and string representations:
>>> binmap.nvariants
6
>>> binmap.func_scores
array([ 0. , -0.2 , -0.4 , 0.01, -0.05, -1.2 ])
>>> binmap.func_scores_var
array([0.2 , 0.1 , 0.3 , 0.15, 0.1 , 0.4 ])
>>> type(binmap.binary_variants) == scipy.sparse.csr_matrix
True
>>> binmap.binary_variants.toarray()
array([[0, 0, 0, 0, 0],
       [1, 0, 0, 0, 0],
       [0, 1, 0, 0, 1],
       [0, 0, 0, 0, 0],
       [0, 0, 1, 0, 1],
       [0, 0, 0, 1, 0]], dtype=int8)
>>> binmap.substitution_variants
['', 'M1A', 'M1C K3A', '', 'A2C K3A', 'A2*']
>>> binmap.substitutions_col
'aa_substitutions'
Validate binary map interconverts binary representations and substitutions:
>>> for ivar in range(binmap.nvariants):
...     binvar = binmap.binary_variants.toarray()[ivar]
...     subs_from_df = func_scores_df.at[ivar, 'aa_substitutions']
...     assert subs_from_df == binmap.binary_to_sub_str(binvar)
...     assert all(binvar == binmap.sub_str_to_binary(subs_from_df))
Demonstrate BinaryMap.sub_str_to_indices():
>>> for sub in binmap.substitution_variants:
...     print(binmap.sub_str_to_indices(sub))
[]
[0]
[1, 4]
[]
[2, 4]
[3]
Specify allowed substitutions including one not in func_scores_df:
>>> allowed_subs = ['K3G', 'M1A', 'M1C', 'A2C', 'A2*', 'K3A']
>>> BinaryMap(func_scores_df, allowed_subs=allowed_subs).all_subs
['M1A', 'M1C', 'A2C', 'A2*', 'K3A', 'K3G']
But we cannot initialize if any substitution in the variants is not in allowed_subs:
>>> BinaryMap(func_scores_df, allowed_subs=['M1A', 'M1C', 'A2*'])
Traceback (most recent call last):
...
ValueError: substitutions not in `allowed_subs`: ['A2C', 'K3A']
Now do similar operation but using expand to include full alphabet (although to keep size manageable, we use an alphabet smaller than all amino acids):
>>> wtseq = 'MAKG'
>>> alphabet = ['A', 'C', 'G', 'K', 'M', '*']
>>> binmap_expand = BinaryMap(func_scores_df,
...                           alphabet=alphabet,
...                           expand=True,
...                           wtseq=wtseq)
>>> binmap_expand.binarylength == len(wtseq) * len(alphabet)
True
>>> binmap_expand.binary_sites
array([1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2, 3, 3, 3, 3, 3, 3, 4, 4, 4, 4, 4, 4])
>>> binmap_expand.all_subs
['M1A', 'M1C', 'M1G', 'M1K', 'M1*', 'A2C', 'A2G', 'A2K', 'A2M', 'A2*', 'K3A', 'K3C', 'K3G', 'K3M', 'K3*', 'G4A', 'G4C', 'G4K', 'G4M', 'G4*']
>>> binmap_expand.binary_variants.toarray()
array([[0, 0, 0, 0, 1, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 1, 0, 0, 0],
       [1, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 1, 0, 0, 0],
       [0, 1, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0],
       [0, 0, 0, 0, 1, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 1, 0, 0, 0],
       [0, 0, 0, 0, 1, 0, 0, 1, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0],
       [0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 1, 0, 0, 0, 0, 1, 0, 0, 0]], dtype=int8)
>>> all(numpy.sum(binmap_expand.binary_variants.toarray(), axis=1) ==
...     numpy.full(binmap_expand.nvariants, len(wtseq)))
True
>>> binmap_expand.substitution_variants
['', 'M1A', 'M1C K3A', '', 'A2C K3A', 'A2*']
>>> for ivar in range(binmap_expand.nvariants):
...     binvar = binmap_expand.binary_variants.toarray()[ivar]
...     subs_from_df = func_scores_df.at[ivar, 'aa_substitutions']
...     assert subs_from_df == binmap_expand.binary_to_sub_str(binvar)
...     assert all(binvar == binmap_expand.sub_str_to_binary(subs_from_df))
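The layout of the expanded encoding can be reproduced by hand. The sketch below is an illustration in plain Python rather than the package's implementation: `expand_encode` is a hypothetical helper, and the index rule (site offset times alphabet size, plus the character's position in the alphabet) is inferred from the arrays shown in the example above.

```python
def expand_encode(variant, wtseq, alphabet):
    """One-hot encode every site of the variant's full sequence.

    Assumed index rule: (site - 1) * len(alphabet) + alphabet.index(char).
    """
    seq = list(wtseq)
    for sub in variant.split():
        wt, site, mut = sub[0], int(sub[1:-1]), sub[-1]
        if seq[site - 1] != wt:
            raise ValueError(f"wildtype mismatch in {sub}")
        seq[site - 1] = mut  # apply the substitution to the wildtype sequence
    binary = [0] * (len(wtseq) * len(alphabet))
    for i, char in enumerate(seq):
        binary[i * len(alphabet) + alphabet.index(char)] = 1
    return binary


wtseq = "MAKG"
alphabet = ["A", "C", "G", "K", "M", "*"]
wt_binary = expand_encode("", wtseq, alphabet)
double_binary = expand_encode("M1C K3A", wtseq, alphabet)
```

Because exactly one character is set per site, every vector sums to the sequence length, and the wildtype is no longer all 0.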
Note that binmap does not have n_pre and n_post attributes set:
>>> binmap.n_pre == binmap.n_post == None
True
We would not have been able to initialize binmap if we weren’t using the cols_optional flag:
>>> BinaryMap(func_scores_df, alphabet=alphabet, cols_optional=False)
Traceback (most recent call last):
...
ValueError: `func_scores_df` lacks column pre_count
Now assign values to n_pre and n_post attributes:
>>> func_scores_df_counts = (
...     func_scores_df.assign(pre_count=[10, 20, 15, 5, 6, 8],
...                           post_count=[0, 3, 12, 11, 9, 8])
... )
>>> binmap_counts = BinaryMap(func_scores_df_counts, alphabet=alphabet)
>>> binmap_counts.n_pre
array([10, 20, 15, 5, 6, 8])
>>> binmap_counts.n_post
array([ 0, 3, 12, 11, 9, 8])
Use an alphabet that allows gaps:
>>> func_scores_gap_df = pd.concat(
...     [
...         func_scores_df,
...         pd.DataFrame([("M1-", 0, 0.1)], columns=func_scores_df.columns),
...     ]
... )
>>> bmap_gap = BinaryMap(func_scores_gap_df, alphabet=AAS_WITHSTOP_WITHGAP)
>>> bmap_gap.all_subs
['M1A', 'M1C', 'M1-', 'A2C', 'A2*', 'K3A']
Use str as sites to enable letter suffixes on sites:
>>> func_scores_sitestr_df = pd.concat(
...     [
...         func_scores_df,
...         pd.DataFrame([("L3aT", 0.3, 0.1)], columns=func_scores_df.columns),
...     ]
... )
>>> BinaryMap(func_scores_sitestr_df)
Traceback (most recent call last):
...
ValueError: substitution L3aT is invalid for alphabet ('A', 'C', 'D', 'E', 'F', 'G', 'H', 'I', 'K', 'L', 'M', 'N', 'P', 'Q', 'R', 'S', 'T', 'V', 'W', 'Y', '*')
>>> bmap_sitestr = BinaryMap(func_scores_sitestr_df, sites_as_str=True)
>>> bmap_sitestr.all_subs
['M1A', 'M1C', 'A2C', 'A2*', 'K3A', 'L3aT']
>>> bmap_sitestr.binary_sites
array(['1', '1', '2', '2', '3', '3a'], dtype='<U2')
>>> type(bmap_sitestr.binary_variants) == scipy.sparse.csr_matrix
True
>>> bmap_sitestr.binary_variants.toarray()
array([[0, 0, 0, 0, 0, 0],
       [1, 0, 0, 0, 0, 0],
       [0, 1, 0, 0, 1, 0],
       [0, 0, 0, 0, 0, 0],
       [0, 0, 1, 0, 1, 0],
       [0, 0, 0, 1, 0, 0],
       [0, 0, 0, 0, 0, 1]], dtype=int8)
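One way to picture why L3aT is rejected with integer sites but accepted with string sites is to compare two parsing patterns. The regular expressions and the `parse_sub` helper below are assumptions made for illustration; they are not taken from the package's source, which may validate substitutions differently.

```python
import re

# Assumed patterns: wildtype character, site, mutant character.
SUB_INT_SITE = re.compile(r"^(?P<wt>\S)(?P<site>\d+)(?P<mut>\S)$")
SUB_STR_SITE = re.compile(r"^(?P<wt>\S)(?P<site>\d+[a-z]?)(?P<mut>\S)$")


def parse_sub(sub, sites_as_str=False):
    """Return (wildtype, site, mutant); site is str if sites_as_str else int."""
    pattern = SUB_STR_SITE if sites_as_str else SUB_INT_SITE
    match = pattern.match(sub)
    if match is None:
        raise ValueError(f"invalid substitution {sub}")
    site = match["site"] if sites_as_str else int(match["site"])
    return match["wt"], site, match["mut"]


plain = parse_sub("M1A")                         # integer site
suffixed = parse_sub("L3aT", sites_as_str=True)  # site keeps its letter suffix
```

With the integer-only pattern the same call on "L3aT" raises ValueError, since the site portion must consist of digits alone.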
- property all_subs
Substitutions in order encoded in binary map.
- Type
list
- binary_to_sub_str(binary)
Convert binary representation to space-delimited substitutions.
Note
This method is the inverse of BinaryMap.sub_str_to_binary().
- Parameters
binary (numpy.ndarray) – Binary representation.
- Returns
Space-delimited substitutions.
- Return type
str
- i_to_sub(i)
Substitution corresponding to index in binary representation.
- Parameters
i (int) – Index in binary representation, 0 <= i < binarylength.
- Returns
The substitution corresponding to that index.
- Return type
str
- sub_str_to_binary(sub_str)
Convert space-delimited substitutions to binary representation.
- Parameters
sub_str (str) – Space-delimited substitutions.
- Returns
Binary representation.
- Return type
numpy.ndarray of dtype int8
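The relationship between these conversion methods can be sketched with plain lists. The functions below are hypothetical stand-ins that mirror the method names for readability; they are not the class's implementation, and the substitution order is simply copied from binmap.all_subs in the example above.

```python
# Substitution order copied from binmap.all_subs in the example above.
ALL_SUBS = ["M1A", "M1C", "A2C", "A2*", "K3A"]
SUB_TO_I = {sub: i for i, sub in enumerate(ALL_SUBS)}


def sub_str_to_indices(sub_str):
    """Indices in the binary representation that are 1 for this variant."""
    return sorted(SUB_TO_I[sub] for sub in sub_str.split())


def sub_str_to_binary(sub_str):
    """Space-delimited substitutions to a 0/1 list."""
    binary = [0] * len(ALL_SUBS)
    for i in sub_str_to_indices(sub_str):
        binary[i] = 1
    return binary


def binary_to_sub_str(binary):
    """A 0/1 list back to space-delimited substitutions, in encoding order."""
    if len(binary) != len(ALL_SUBS):
        raise ValueError("binary representation has the wrong length")
    return " ".join(ALL_SUBS[i] for i, b in enumerate(binary) if b)


round_trips = [
    binary_to_sub_str(sub_str_to_binary(s)) == s
    for s in ["", "M1A", "M1C K3A", "A2C K3A", "A2*"]
]
```

In this sketch the round trip returns the input string unchanged only when its substitutions are already listed in encoding order, as they are in the example data.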