Basic example¶
Import packages / modules¶
Import dmslogo
along with the other Python packages used in these examples:
[1]:
# NBVAL_IGNORE_OUTPUT
import os
import random
from IPython.display import display, Image
import matplotlib.pyplot as plt
import numpy
import pandas as pd
import dmslogo
from dmslogo.colorschemes import CBPALETTE
Set options to display pandas DataFrames:
[2]:
pd.set_option("display.max_columns", 20)
pd.set_option("display.width", 500)
Simple example on toy data¶
Draw a basic logo¶
Simple plotting can be done using the dmslogo.logo.draw_logo function.
This function takes in as input a pandas DataFrame that has columns with:
site in sequential integer numbering
letter (i.e., amino acid or nucleotide)
height of letter (can be any positive number)
Here make a simple data frame that fits these specs:
[3]:
data = pd.DataFrame.from_records(
data=[
(1, "A", 1),
(1, "C", 0.7),
(2, "C", 0.1),
(2, "D", 1.2),
(5, "A", 0.4),
(5, "K", 0.4),
],
columns=["site", "letter", "height"],
)
data
[3]:
site | letter | height | |
---|---|---|---|
0 | 1 | A | 1.0 |
1 | 1 | C | 0.7 |
2 | 2 | C | 0.1 |
3 | 2 | D | 1.2 |
4 | 5 | A | 0.4 |
5 | 5 | K | 0.4 |
Use dmslogo.logo.draw_logo to draw the logo plot, passing the names of the columns with each piece of required data:
[4]:
# NBVAL_IGNORE_OUTPUT
fig, ax = dmslogo.draw_logo(
data=data, x_col="site", letter_col="letter", letter_height_col="height"
)
Add a title:
[5]:
# NBVAL_IGNORE_OUTPUT
fig, ax = dmslogo.draw_logo(
data=data,
x_col="site",
letter_col="letter",
letter_height_col="height",
title="basic example",
)
Note that the call to dmslogo.logo.draw_logo returns matplotlib Figure and Axis instances, which we have called fig
and ax
. We can save the figure to a file using the savefig command of fig
. Below we show an example of how to do this saving to a temporary file:
[6]:
outputdir = "output_files"
os.makedirs(outputdir, exist_ok=True)
pngfile = os.path.join(outputdir, "basic_example_logo.png")
print(f"Saving figure to {pngfile}")
fig.savefig(pngfile, dpi=450, bbox_inches="tight")
print(f"Here is a rendering of the saved figure:")
display(Image(pngfile, width=200))
Saving figure to output_files/basic_example_logo.png
Here is a rendering of the saved figure:
Drawing without breaks¶
Note how the above plot has a “break” (gap and dashed line) to indicate a break in the sequential numbering in x_col
between 2 and 5. This is useful as it indicates when we are breaking the sequence when drawing just snippets of a protein. If you do not want to indicate breaks in this way, turn off the addbreaks
option. Now the logo just goes directly from 2 to 5 without indicating a break:
[7]:
# NBVAL_IGNORE_OUTPUT
fig, ax = dmslogo.draw_logo(
data=data,
x_col="site",
letter_col="letter",
letter_height_col="height",
addbreaks=False,
)
Setting letter colors¶
The above plot colored letters using a default amino-acid coloring scheme. You can set a different coloring scheme using colorscheme
and missing_color
, or you can set letter colors at a site-specific level by adding a column to data
that specifies the colors. Here we color letters at site-specific level:
[8]:
data["color"] = ["red", "gray", "gray", "gray", "red", "gray"]
data
[8]:
site | letter | height | color | |
---|---|---|---|---|
0 | 1 | A | 1.0 | red |
1 | 1 | C | 0.7 | gray |
2 | 2 | C | 0.1 | gray |
3 | 2 | D | 1.2 | gray |
4 | 5 | A | 0.4 | red |
5 | 5 | K | 0.4 | gray |
Now plot using color_col
to set the colors:
[9]:
# NBVAL_IGNORE_OUTPUT
fig, ax = dmslogo.draw_logo(
data=data,
x_col="site",
letter_col="letter",
letter_height_col="height",
color_col="color",
)
Labeling x-axis ticks¶
Sometimes we want to label sites with something other than the sequential integer numbers. We can do this by adding a column for the xtick labels to data
:
[10]:
data["site_label"] = ["D1", "D1", "A2", "A2", "F5", "F5"]
data
[10]:
site | letter | height | color | site_label | |
---|---|---|---|---|---|
0 | 1 | A | 1.0 | red | D1 |
1 | 1 | C | 0.7 | gray | D1 |
2 | 2 | C | 0.1 | gray | A2 |
3 | 2 | D | 1.2 | gray | A2 |
4 | 5 | A | 0.4 | red | F5 |
5 | 5 | K | 0.4 | gray | F5 |
Now use xtick_col
to set the xticks:
[11]:
# NBVAL_IGNORE_OUTPUT
fig, ax = dmslogo.draw_logo(
data=data,
x_col="site",
letter_col="letter",
letter_height_col="height",
color_col="color",
xtick_col="site_label",
)
Shading some sites¶
Sometimes we may want to shade certain sites. To do this, we specify the shade color and transparency (alpha) for all sites we want to shade (must be the same for all letters at a site):
[12]:
data["shade_color"] = ["yellow", "yellow", None, None, "brown", "brown"]
data["shade_alpha"] = [0.15, 0.15, None, None, 0.3, 0.3]
data
[12]:
site | letter | height | color | site_label | shade_color | shade_alpha | |
---|---|---|---|---|---|---|---|
0 | 1 | A | 1.0 | red | D1 | yellow | 0.15 |
1 | 1 | C | 0.7 | gray | D1 | yellow | 0.15 |
2 | 2 | C | 0.1 | gray | A2 | None | NaN |
3 | 2 | D | 1.2 | gray | A2 | None | NaN |
4 | 5 | A | 0.4 | red | F5 | brown | 0.30 |
5 | 5 | K | 0.4 | gray | F5 | brown | 0.30 |
Now use shade_color_col
and shade_alpha_col
to draw the logo with the shaded sites:
[13]:
# NBVAL_IGNORE_OUTPUT
fig, ax = dmslogo.draw_logo(
data=data,
x_col="site",
letter_col="letter",
letter_height_col="height",
color_col="color",
xtick_col="site_label",
shade_color_col="shade_color",
shade_alpha_col="shade_alpha",
)
Adjusting size, axis labels, axes¶
We can do additional formatting by scaling the width (widthscale
), the height (heightscale
), the axis font (axisfontscale
), the x-axis (xlabel
) and y-axis (ylabel
) labels, and removing the axes altogether (hide_axis
).
First, we make a plot where we adjust the size, change the y-axis label, and get rid of the x-axis label:
[14]:
# NBVAL_IGNORE_OUTPUT
fig, ax = dmslogo.draw_logo(
data=data,
x_col="site",
letter_col="letter",
letter_height_col="height",
color_col="color",
xtick_col="site_label",
xlabel="",
ylabel="immune selection",
heightscale=2,
axisfontscale=1.5,
)
Now we make a plot where we hide the axes and their labels altogether:
[15]:
# NBVAL_IGNORE_OUTPUT
fig, ax = dmslogo.draw_logo(
data=data,
x_col="site",
letter_col="letter",
letter_height_col="height",
color_col="color",
xtick_col="site_label",
hide_axis=True,
)
Multiple logos in one figure¶
So far we have made individual plots on newly generate figures created by dmslogo.logo.draw_logo.
But we can also create a multi-axis figure, and then draw several logos onto that. The easiest way to do this is with the dmslogo.facet.facet_plot command described below. But we can also do it using matplotlib
subplots as here:
[16]:
# NBVAL_IGNORE_OUTPUT
# make figure with two subplots: two rows, one column
fig, axes = plt.subplots(2, 1)
fig.subplots_adjust(hspace=0.3) # add more vertical space for axis titles
fig.set_size_inches(4, 5)
# draw top plot, no x-axis ticks or label, default coloring
_ = dmslogo.draw_logo(
data.assign(no_ticks=""),
x_col="site",
letter_col="letter",
letter_height_col="height",
ax=axes[0],
xlabel="",
ylabel="",
xtick_col="no_ticks",
title="colored by amino acid",
)
# draw bottom plot, color as specified in `data`
_ = dmslogo.draw_logo(
data,
x_col="site",
letter_col="letter",
letter_height_col="height",
color_col="color",
ax=axes[1],
ylabel="",
title="user-specified colors",
)
Real HIV data from Dingens et al¶
In An Antigenic Atlas of HIV-1 Escape from Broadly Neutralizing Antibodies Distinguishes Functional and Structural Epitopes (Dingens et al, 2019), there are plots of immune selection on HIV envelope (Env) from anti-HIV antibodies at just a subset of “strongly selected” sites for each antibody.
Here we use dmslogo
to re-create one of those plots (the one in Figure 3D,E) showing antibodies PG9 and PGT145.
Get data to plot¶
We have already downloaded CSV files with the data from the paper’s GitHub repo giving the immune selection (as fraction surviving above average) for these two antibodies We read the data into a pandas data frame:
[17]:
antibodies = ["PG9", "PGT145"]
data_hiv = []
for antibody in antibodies:
fname = f"input_files/summary_{antibody}-medianmutfracsurvive.csv"
print(f"Reading data for {antibody} from {fname}")
data_hiv.append(pd.read_csv(fname).assign(antibody=antibody))
data_hiv = pd.concat(data_hiv)
Reading data for PG9 from input_files/summary_PG9-medianmutfracsurvive.csv
Reading data for PGT145 from input_files/summary_PGT145-medianmutfracsurvive.csv
Here are the first few lines of the data frame. For each mutation it gives the immune selection (mutfracsurvive):
[18]:
data_hiv.head(n=5)
[18]:
site | wildtype | mutation | mutfracsurvive | antibody | |
---|---|---|---|---|---|
0 | 160 | N | I | 0.256342 | PG9 |
1 | 160 | N | L | 0.207440 | PG9 |
2 | 160 | N | R | 0.184067 | PG9 |
3 | 171 | K | E | 0.176118 | PG9 |
4 | 428 | Q | Y | 0.150981 | PG9 |
The sites in this data frame are in the HXB2 numbering scheme, which is not the same as sequential integer numbering of the actual BG505 Env for which the immune selection was measured. So for our plotting, we also need to create a column (which we will call isite
) that numbers the sites a sequential numbering. A file that converts between HXB2 and and BG505 numbering is part of the paper’s GitHub
repo; here we use a downloaded copy of that file to make the isite
column:
[19]:
numberfile = "input_files/BG505_to_HXB2.csv"
data_hiv = (
pd.read_csv(numberfile)
.rename(columns={"original": "isite", "new": "site"})[["site", "isite"]]
.merge(data_hiv, on="site", validate="one_to_many")
)
Now see how this data frame also has the isite
column which has sequential integer numbering of the sequence:
[20]:
data_hiv.head(n=5)
[20]:
site | isite | wildtype | mutation | mutfracsurvive | antibody | |
---|---|---|---|---|---|---|
0 | 31 | 30 | A | Y | 0.030824 | PG9 |
1 | 31 | 30 | A | K | 0.006860 | PG9 |
2 | 31 | 30 | A | D | 0.006774 | PG9 |
3 | 31 | 30 | A | S | 0.004407 | PG9 |
4 | 31 | 30 | A | R | 0.003501 | PG9 |
We add a column (site_label
) that gives the site labeled with the wildtype identity that we can use for axis ticks. We also indicate which sites to show (column show_site
) in our logoplot snippet (these are just the same ones in Figure 3 of the Dingens et al (2019) paper):
[21]:
# same sites in Figure 3D,E of Dingens et al (2019)
sites_to_show = map(
str,
list(range(119, 125))
+ [127]
+ list(range(156, 174))
+ list(range(199, 205))
+ list(range(312, 316)),
)
data_hiv = data_hiv.assign(
site_label=lambda x: x["wildtype"] + x["site"],
show_site=lambda x: x["site"].isin(sites_to_show),
)
See how the data frame now has the site_label
and show_site
columns:
[22]:
data_hiv.head(n=5)
[22]:
site | isite | wildtype | mutation | mutfracsurvive | antibody | site_label | show_site | |
---|---|---|---|---|---|---|---|---|
0 | 31 | 30 | A | Y | 0.030824 | PG9 | A31 | False |
1 | 31 | 30 | A | K | 0.006860 | PG9 | A31 | False |
2 | 31 | 30 | A | D | 0.006774 | PG9 | A31 | False |
3 | 31 | 30 | A | S | 0.004407 | PG9 | A31 | False |
4 | 31 | 30 | A | R | 0.003501 | PG9 | A31 | False |
Draw a logo plot¶
Now we make logo plots of the sites that we have selected to show, here just for the PG9 antibody. Note how we query data_hiv
for only the antibody (PG9) and the sites of interest (show_site
is True
):
[23]:
# NBVAL_IGNORE_OUTPUT
fig, ax = dmslogo.draw_logo(
data_hiv.query('antibody == "PG9"').query("show_site"),
x_col="isite",
letter_col="mutation",
letter_height_col="mutfracsurvive",
xtick_col="site_label",
title="PG9",
)
Draw site-level line plots¶
The logo plot above shows selection at a subset of sites. But we might also want to summarize the selection across all sites (as is done in Figure 2 of Dingens et al (2019)).
An easy way to do this is to create a summary statistic at each site. Here we compute the total fracsurvive at each site across all mutations, and add that to our data frame:
[24]:
data_hiv = data_hiv.query(
"mutation != wildtype"
).assign( # only care about mutations; get rid of wildtype values
totfracsurvive=lambda x: x.groupby(["antibody", "site"])[
"mutfracsurvive"
].transform("sum")
)
Now the data frame has a column (totfracsurvive
) giving the average fraction surviving at each site:
[25]:
data_hiv.head(n=5)
[25]:
site | isite | wildtype | mutation | mutfracsurvive | antibody | site_label | show_site | totfracsurvive | |
---|---|---|---|---|---|---|---|---|---|
0 | 31 | 30 | A | Y | 0.030824 | PG9 | A31 | False | 0.062502 |
1 | 31 | 30 | A | K | 0.006860 | PG9 | A31 | False | 0.062502 |
2 | 31 | 30 | A | D | 0.006774 | PG9 | A31 | False | 0.062502 |
3 | 31 | 30 | A | S | 0.004407 | PG9 | A31 | False | 0.062502 |
4 | 31 | 30 | A | R | 0.003501 | PG9 | A31 | False | 0.062502 |
Now we use the dmslogo.line.draw_line function to draw the line plot for antibody PG9. Note how we provide our new totfracsurvive
column as height_col
. We also provide our previously defined show_site
column (which indicates which sites were shown in the logo plot) as the show_col
, so that the line plot has the sites shown in the above logo plot underlined in orange:
[26]:
# NBVAL_IGNORE_OUTPUT
fig, ax = dmslogo.draw_line(
data_hiv.query('antibody == "PG9"'),
x_col="isite",
height_col="totfracsurvive",
xtick_col="site",
show_col="show_site",
title="PG9",
widthscale=2,
)
Combining site-level line and mutation-level logo plots¶
Of course, a line plot isn’t that hard to make, but the advantage of doing this using the approach above is that we can combine dmslogo.line.draw_line and dmslogo.logo.draw_logo to create a single figure that shows the site-selection in a line plot and the selected sites as logo plots.
The easiest way to do this using the dmslogo.facet.facet_plot command described below. But first here we do it using matplotlib
subplots. Note how the resulting plot combines the line and logo plots, with the line plot using the orange underline to indicate which sites are zoomed in the logo plot:
[27]:
# NBVAL_IGNORE_OUTPUT
fig, axes = plt.subplots(1, 2, gridspec_kw={"width_ratios": [1, 1.5]})
fig.subplots_adjust(wspace=0.12)
fig.set_size_inches(24, 3)
_ = dmslogo.draw_line(
data_hiv.query('antibody == "PG9"'),
x_col="isite",
height_col="totfracsurvive",
xtick_col="site",
show_col="show_site",
ax=axes[0],
)
_ = dmslogo.draw_logo(
data_hiv.query('antibody == "PG9"').query("show_site"),
x_col="isite",
letter_col="mutation",
letter_height_col="mutfracsurvive",
ax=axes[1],
xtick_col="site_label",
)
Faceting line and logo plots together¶
The easiest way to facet line and logo plots together is using dmslogo.facet.facet_plot.
The cell below shows how this is done. You pass the data to this function, as well any columns and rows we would like to facet, the x_col
and show_col
arguments shared between the line and logo plots, and additional keyword arguments for dmslogo.logo.draw_logo and dmslogo.line.draw_line:
[28]:
# NBVAL_IGNORE_OUTPUT
fig, axes = dmslogo.facet_plot(
data_hiv,
gridrow_col="antibody",
x_col="isite",
show_col="show_site",
draw_line_kwargs=dict(height_col="totfracsurvive", xtick_col="site"),
draw_logo_kwargs=dict(
letter_col="mutation",
letter_height_col="mutfracsurvive",
xtick_col="site_label",
xlabel="site",
),
line_titlesuffix="site-level selection",
logo_titlesuffix="mutation-level selection",
)
There are various options to tweak the formatting of the faceted plot. Here we demonstrate a few of them:
We assign a more generic ylabel (“immune selection”) to each plot via the appropriate
*_kwargs
option.We use the
share_ylim_across_rows=False
option to allow each row to have its own y-axis limits.We use the
share_xlabel
andshare_ylabel
options to share the x- and y-labels across the line and logo plots.
[29]:
# NBVAL_IGNORE_OUTPUT
fig, axes = dmslogo.facet_plot(
data_hiv,
gridrow_col="antibody",
x_col="isite",
show_col="show_site",
draw_line_kwargs=dict(
height_col="totfracsurvive", xtick_col="site", ylabel="immune selection"
),
draw_logo_kwargs=dict(
letter_col="mutation",
letter_height_col="mutfracsurvive",
xtick_col="site_label",
xlabel="site",
ylabel="immune selection",
),
line_titlesuffix="site-level selection",
logo_titlesuffix="mutation-level selection",
share_ylim_across_rows=False,
share_xlabel=True,
share_ylabel=True,
)
Finally, here is the same plot where we have shaded the N-linked glycan at 160:
[30]:
# NBVAL_IGNORE_OUTPUT
data_hiv["shade_color"] = numpy.where(
data_hiv["site"].isin(["160", "161", "162"]), "gray", None
)
data_hiv["shade_alpha"] = numpy.where(
data_hiv["site"].isin(["160", "161", "162"]), 0.15, None
)
fig, axes = dmslogo.facet_plot(
data_hiv,
gridrow_col="antibody",
x_col="isite",
show_col="show_site",
draw_line_kwargs=dict(
height_col="totfracsurvive", xtick_col="site", ylabel="immune selection"
),
draw_logo_kwargs=dict(
letter_col="mutation",
letter_height_col="mutfracsurvive",
xtick_col="site_label",
xlabel="site",
ylabel="immune selection",
shade_color_col="shade_color",
shade_alpha_col="shade_alpha",
),
line_titlesuffix="site-level selection",
logo_titlesuffix="mutation-level selection",
share_ylim_across_rows=False,
share_xlabel=True,
share_ylabel=True,
)
Some details about fonts¶
Write DMSLOGO in Comic Sans font¶
Generate data to plot by creating the pandas DataFrame word_data
. In this data frame, we choose large heights and bright colors for the letters in our word (DMSLOGO), and smaller letters and gray for other letters.
[31]:
word = "DMSLOGO"
lettercolors = [CBPALETTE[1]] * len("dms") + [CBPALETTE[2]] * len("logo")
# make data frame with data to plot
random.seed(0)
word_data = {"x": [], "letter": [], "height": [], "color": []}
for x, (letter, color) in enumerate(zip(word, lettercolors)):
word_data["x"].append(x)
word_data["letter"].append(letter)
word_data["color"].append(color)
word_data["height"].append(random.uniform(1, 1.5))
for otherletter in random.sample(sorted(set("ACTG") - {letter}), 3):
word_data["x"].append(x)
word_data["letter"].append(otherletter)
word_data["color"].append(CBPALETTE[0])
word_data["height"].append(random.uniform(0.1, 0.5))
word_data = pd.DataFrame(word_data)
word_data.head(n=6)
[31]:
x | letter | height | color | |
---|---|---|---|---|
0 | 0 | D | 1.422211 | #E69F00 |
1 | 0 | T | 0.486186 | #999999 |
2 | 0 | A | 0.294371 | #999999 |
3 | 0 | C | 0.467294 | #999999 |
4 | 1 | M | 1.414926 | #E69F00 |
5 | 1 | T | 0.301875 | #999999 |
Now draw the logo. We use the fontfamily
argument to set a Comic Sans font. This also requires us to increase fontaspect
since this font is wider, and increase letterpad
as the font height sometimes sticks out beyond its baseline:
[32]:
# NBVAL_IGNORE_OUTPUT
fig, ax = dmslogo.draw_logo(
data=word_data,
letter_height_col="height",
x_col="x",
letter_col="letter",
color_col="color",
fontfamily="Comic Sans MS",
hide_axis=True,
fontaspect=0.85,
letterpad=0.05,
)
Subtleties of non-default fonts¶
Note however that you in general may have difficulty using most fonts (other than the dmslogo
default) for good-looking logos. The reason is that for a clean and accurate letter-height logo plots, the font must:
be mono-spaced
not have descenders
have all letters go exactly from the baseline to the top
You can manually edit a font to do this as has been done for the current dmslogo
default font; to see more information on this look here for details.