Basic example

Import packages / modules

Import dmslogo along with the other Python packages used in these examples:

[1]:
# NBVAL_IGNORE_OUTPUT

import os
import random

from IPython.display import display, Image

import matplotlib.pyplot as plt

import numpy

import pandas as pd

import dmslogo
from dmslogo.colorschemes import CBPALETTE

Set options to display pandas DataFrames:

[2]:
pd.set_option("display.max_columns", 20)
pd.set_option("display.width", 500)

Simple example on toy data
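
The cells that define the toy data are not reproduced in this section, so the following is a minimal sketch that rebuilds a data frame matching the site, letter, and height values shown in the tables further down. The helper name draw_default_logo is hypothetical; it only wraps the same dmslogo.draw_logo call used in the rest of this page, leaving addbreaks at its default.

```python
import pandas as pd

# toy data reconstructed from the tables shown later on this page
data = pd.DataFrame(
    {
        "site": [1, 1, 2, 2, 5, 5],
        "letter": ["A", "C", "C", "D", "A", "K"],
        "height": [1.0, 0.7, 0.1, 1.2, 0.4, 0.4],
    }
)


def draw_default_logo(df):
    """Draw a logo with default options (hypothetical wrapper)."""
    import dmslogo  # imported here so the data frame can be built without it

    return dmslogo.draw_logo(
        data=df,
        x_col="site",
        letter_col="letter",
        letter_height_col="height",
    )
```

Calling draw_default_logo(data) returns the figure and axes, just like the draw_logo calls in the cells below.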

Drawing without breaks

Note how the above plot has a “break” (gap and dashed line) to indicate a break in the sequential numbering in x_col between 2 and 5. This is useful when drawing just snippets of a protein, because it shows where the sequence is interrupted. If you do not want to indicate breaks in this way, turn off the addbreaks option. Now the logo just goes directly from 2 to 5 without indicating a break:

[7]:
# NBVAL_IGNORE_OUTPUT

fig, ax = dmslogo.draw_logo(
    data=data,
    x_col="site",
    letter_col="letter",
    letter_height_col="height",
    addbreaks=False,
)
_images/basic_example_13_0.png
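
To make the idea of a break concrete, here is a small sketch that finds the gaps in a list of site numbers. The function find_breaks is hypothetical and only illustrates the concept; it is not how dmslogo detects breaks internally.

```python
def find_breaks(sites):
    """Return (before, after) pairs where sorted unique sites skip numbers."""
    unique_sites = sorted(set(sites))
    return [
        (prev, nxt)
        for prev, nxt in zip(unique_sites, unique_sites[1:])
        if nxt - prev > 1
    ]


# the toy data has sites 1, 2, and 5, so there is one gap
print(find_breaks([1, 1, 2, 2, 5, 5]))  # prints [(2, 5)]
```

A check like this can help decide whether addbreaks matters for a given snippet: if no gaps are found, the numbering is contiguous and there is nothing to mark.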

Setting letter colors

The above plot colored letters using a default amino-acid coloring scheme. You can set a different coloring scheme using colorscheme and missing_color, or you can set letter colors at a site-specific level by adding a column to data that specifies the colors. Here we color letters at site-specific level:

[8]:
data["color"] = ["red", "gray", "gray", "gray", "red", "gray"]
data
[8]:
site letter height color
0 1 A 1.0 red
1 1 C 0.7 gray
2 2 C 0.1 gray
3 2 D 1.2 gray
4 5 A 0.4 red
5 5 K 0.4 gray
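
The color list above was typed by hand, but the same column can be derived from a rule. This sketch assumes the intent is to highlight every A in red, which matches the table above; the variable names are illustrative only.

```python
letters = ["A", "C", "C", "D", "A", "K"]

# highlight one letter in red and draw everything else in gray
highlight = "A"
colors = ["red" if letter == highlight else "gray" for letter in letters]
print(colors)
```

With the data frame from this page, the result could be assigned with data["color"] = colors.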

Now plot using color_col to set the colors:

[9]:
# NBVAL_IGNORE_OUTPUT

fig, ax = dmslogo.draw_logo(
    data=data,
    x_col="site",
    letter_col="letter",
    letter_height_col="height",
    color_col="color",
)
_images/basic_example_17_0.png

Labeling x-axis ticks

Sometimes we want to label sites with something other than the sequential integer numbers. We can do this by adding a column for the xtick labels to data:

[10]:
data["site_label"] = ["D1", "D1", "A2", "A2", "F5", "F5"]
data
[10]:
site letter height color site_label
0 1 A 1.0 red D1
1 1 C 0.7 gray D1
2 2 C 0.1 gray A2
3 2 D 1.2 gray A2
4 5 A 0.4 red F5
5 5 K 0.4 gray F5
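
Labels like these are often a wildtype residue followed by the site number. As a minimal sketch, assuming a hypothetical wildtype mapping chosen to reproduce the labels in the table above, they can be built rather than typed:

```python
sites = [1, 1, 2, 2, 5, 5]

# hypothetical wildtype residue at each site, chosen to match the labels above
wildtype = {1: "D", 2: "A", 5: "F"}

site_labels = [f"{wildtype[site]}{site}" for site in sites]
print(site_labels)
```

Every row of a given site gets the same label this way, which avoids typing mistakes in a hand-written list.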

Now use xtick_col to set the xticks:

[11]:
# NBVAL_IGNORE_OUTPUT

fig, ax = dmslogo.draw_logo(
    data=data,
    x_col="site",
    letter_col="letter",
    letter_height_col="height",
    color_col="color",
    xtick_col="site_label",
)
_images/basic_example_21_0.png

Shading some sites

Sometimes we may want to shade certain sites. To do this, we specify the shade color and transparency (alpha) for all sites we want to shade (must be the same for all letters at a site):

[12]:
data["shade_color"] = ["yellow", "yellow", None, None, "brown", "brown"]
data["shade_alpha"] = [0.15, 0.15, None, None, 0.3, 0.3]
data
[12]:
site letter height color site_label shade_color shade_alpha
0 1 A 1.0 red D1 yellow 0.15
1 1 C 0.7 gray D1 yellow 0.15
2 2 C 0.1 gray A2 None NaN
3 2 D 1.2 gray A2 None NaN
4 5 A 0.4 red F5 brown 0.30
5 5 K 0.4 gray F5 brown 0.30
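
Because the shade color and alpha must agree for all letters at a site, it can be worth checking this before plotting. The helper below is a hypothetical sketch, not part of dmslogo, that reports sites whose rows disagree.

```python
from collections import defaultdict


def inconsistent_shading(sites, shade_colors, shade_alphas):
    """Return sorted sites whose rows disagree on shade color or alpha."""
    seen = defaultdict(set)
    for site, color, alpha in zip(sites, shade_colors, shade_alphas):
        seen[site].add((color, alpha))
    return sorted(site for site, values in seen.items() if len(values) > 1)


sites = [1, 1, 2, 2, 5, 5]
colors = ["yellow", "yellow", None, None, "brown", "brown"]
alphas = [0.15, 0.15, None, None, 0.3, 0.3]
print(inconsistent_shading(sites, colors, alphas))  # prints []
```

The sketch works on plain Python lists. Missing alphas stored as NaN in a data frame would need converting to None first, since NaN does not compare equal to itself.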

Now use shade_color_col and shade_alpha_col to draw the logo with the shaded sites:

[13]:
# NBVAL_IGNORE_OUTPUT

fig, ax = dmslogo.draw_logo(
    data=data,
    x_col="site",
    letter_col="letter",
    letter_height_col="height",
    color_col="color",
    xtick_col="site_label",
    shade_color_col="shade_color",
    shade_alpha_col="shade_alpha",
)
_images/basic_example_25_0.png

Adjusting size, axis labels, axes

We can do additional formatting by scaling the width (widthscale), height (heightscale), and axis font (axisfontscale); setting the x-axis (xlabel) and y-axis (ylabel) labels; and removing the axes altogether (hide_axis).

First, we make a plot where we adjust the size, change the y-axis label, and get rid of the x-axis label:

[14]:
# NBVAL_IGNORE_OUTPUT

fig, ax = dmslogo.draw_logo(
    data=data,
    x_col="site",
    letter_col="letter",
    letter_height_col="height",
    color_col="color",
    xtick_col="site_label",
    xlabel="",
    ylabel="immune selection",
    heightscale=2,
    axisfontscale=1.5,
)
_images/basic_example_27_0.png

Now we make a plot where we hide the axes and their labels altogether:

[15]:
# NBVAL_IGNORE_OUTPUT

fig, ax = dmslogo.draw_logo(
    data=data,
    x_col="site",
    letter_col="letter",
    letter_height_col="height",
    color_col="color",
    xtick_col="site_label",
    hide_axis=True,
)
_images/basic_example_29_0.png

Multiple logos in one figure

So far we have made individual plots on newly generated figures created by dmslogo.logo.draw_logo.

But we can also create a multi-axis figure, and then draw several logos onto that. The easiest way to do this is with the dmslogo.facet.facet_plot command described below. But we can also do it using matplotlib subplots as here:

[16]:
# NBVAL_IGNORE_OUTPUT

# make figure with two subplots: two rows, one column
fig, axes = plt.subplots(2, 1)
fig.subplots_adjust(hspace=0.3)  # add more vertical space for axis titles
fig.set_size_inches(4, 5)

# draw top plot, no x-axis ticks or label, default coloring
_ = dmslogo.draw_logo(
    data.assign(no_ticks=""),
    x_col="site",
    letter_col="letter",
    letter_height_col="height",
    ax=axes[0],
    xlabel="",
    ylabel="",
    xtick_col="no_ticks",
    title="colored by amino acid",
)

# draw bottom plot, color as specified in `data`
_ = dmslogo.draw_logo(
    data,
    x_col="site",
    letter_col="letter",
    letter_height_col="height",
    color_col="color",
    ax=axes[1],
    ylabel="",
    title="user-specified colors",
)
_images/basic_example_31_0.png

Real HIV data from Dingens et al

In An Antigenic Atlas of HIV-1 Escape from Broadly Neutralizing Antibodies Distinguishes Functional and Structural Epitopes (Dingens et al, 2019), there are plots of immune selection on HIV envelope (Env) from anti-HIV antibodies at just a subset of “strongly selected” sites for each antibody.

Here we use dmslogo to re-create one of those plots (the one in Figure 3D,E) showing antibodies PG9 and PGT145.

Get data to plot

We have already downloaded CSV files with the data from the paper’s GitHub repo giving the immune selection (as fraction surviving above average) for these two antibodies. We read the data into a pandas data frame:

[17]:
antibodies = ["PG9", "PGT145"]

data_hiv = []
for antibody in antibodies:
    fname = f"input_files/summary_{antibody}-medianmutfracsurvive.csv"
    print(f"Reading data for {antibody} from {fname}")
    data_hiv.append(pd.read_csv(fname).assign(antibody=antibody))

data_hiv = pd.concat(data_hiv)
Reading data for PG9 from input_files/summary_PG9-medianmutfracsurvive.csv
Reading data for PGT145 from input_files/summary_PGT145-medianmutfracsurvive.csv

Here are the first few lines of the data frame. For each mutation it gives the immune selection (mutfracsurvive):

[18]:
data_hiv.head(n=5)
[18]:
site wildtype mutation mutfracsurvive antibody
0 160 N I 0.256342 PG9
1 160 N L 0.207440 PG9
2 160 N R 0.184067 PG9
3 171 K E 0.176118 PG9
4 428 Q Y 0.150981 PG9

The sites in this data frame are in the HXB2 numbering scheme, which is not the same as sequential integer numbering of the actual BG505 Env for which the immune selection was measured. So for our plotting, we also need to create a column (which we will call isite) that numbers the sites sequentially. A file that converts between HXB2 and BG505 numbering is part of the paper’s GitHub repo; here we use a downloaded copy of that file to make the isite column:

[19]:
numberfile = "input_files/BG505_to_HXB2.csv"

data_hiv = (
    pd.read_csv(numberfile)
    .rename(columns={"original": "isite", "new": "site"})[["site", "isite"]]
    .merge(data_hiv, on="site", validate="one_to_many")
)

Now see how this data frame also has the isite column which has sequential integer numbering of the sequence:

[20]:
data_hiv.head(n=5)
[20]:
site isite wildtype mutation mutfracsurvive antibody
0 31 30 A Y 0.030824 PG9
1 31 30 A K 0.006860 PG9
2 31 30 A D 0.006774 PG9
3 31 30 A S 0.004407 PG9
4 31 30 A R 0.003501 PG9
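
The renumbering step can be seen in isolation on small tables. This is a minimal sketch with made-up values; the two toy frames are hypothetical stand-ins for the numbering file and the selection data. The validate="one_to_many" argument asks pandas to raise an error if a site appears more than once in the numbering table.

```python
import pandas as pd

# hypothetical numbering table: "original" is sequential, "new" is HXB2
numbering = pd.DataFrame({"original": [1, 2, 3], "new": ["31", "32", "33"]})

# hypothetical selection values, several mutations per site
selection = pd.DataFrame(
    {
        "site": ["31", "31", "33"],
        "mutation": ["Y", "K", "D"],
        "mutfracsurvive": [0.5, 0.25, 0.125],
    }
)

merged = (
    numbering.rename(columns={"original": "isite", "new": "site"})[["site", "isite"]]
    .merge(selection, on="site", validate="one_to_many")
)
print(merged)
```

Because merge defaults to an inner join, site 32 (which has no selection values in this toy example) is absent from the result.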

We add a column (site_label) that gives the site labeled with the wildtype identity that we can use for axis ticks. We also indicate which sites to show (column show_site) in our logoplot snippet (these are just the same ones in Figure 3 of the Dingens et al (2019) paper):

[21]:
# same sites in Figure 3D,E of Dingens et al (2019)
sites_to_show = map(
    str,
    list(range(119, 125))
    + [127]
    + list(range(156, 174))
    + list(range(199, 205))
    + list(range(312, 316)),
)

data_hiv = data_hiv.assign(
    site_label=lambda x: x["wildtype"] + x["site"],
    show_site=lambda x: x["site"].isin(sites_to_show),
)

See how the data frame now has the site_label and show_site columns:

[22]:
data_hiv.head(n=5)
[22]:
site isite wildtype mutation mutfracsurvive antibody site_label show_site
0 31 30 A Y 0.030824 PG9 A31 False
1 31 30 A K 0.006860 PG9 A31 False
2 31 30 A D 0.006774 PG9 A31 False
3 31 30 A S 0.004407 PG9 A31 False
4 31 30 A R 0.003501 PG9 A31 False
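
The site column appears to hold strings (which is why it can be concatenated with wildtype), so the sites to show must be strings too. This sketch builds the same selection as a list rather than a map object, a variation not used in the cell above, so that it can be inspected and reused more than once:

```python
# same site ranges as above, materialized as a list of strings
sites_to_show = [
    str(site)
    for site in (
        list(range(119, 125))
        + [127]
        + list(range(156, 174))
        + list(range(199, 205))
        + list(range(312, 316))
    )
]

print(len(sites_to_show))  # 6 + 1 + 18 + 6 + 4 = 35
print("160" in sites_to_show)  # True: sites are compared as strings
print(160 in sites_to_show)  # False: an integer never matches a string
```

A map object is consumed the first time it is iterated, so a list is the safer choice if the selection is needed again later.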

Draw a logo plot

Now we make logo plots of the sites that we have selected to show, here just for the PG9 antibody. Note how we query data_hiv for only the antibody (PG9) and the sites of interest (show_site is True):

[23]:
# NBVAL_IGNORE_OUTPUT

fig, ax = dmslogo.draw_logo(
    data_hiv.query('antibody == "PG9"').query("show_site"),
    x_col="isite",
    letter_col="mutation",
    letter_height_col="mutfracsurvive",
    xtick_col="site_label",
    title="PG9",
)
_images/basic_example_45_0.png

Draw site-level line plots

The logo plot above shows selection at a subset of sites. But we might also want to summarize the selection across all sites (as is done in Figure 2 of Dingens et al (2019)).

An easy way to do this is to create a summary statistic at each site. Here we compute the total fracsurvive at each site across all mutations, and add that to our data frame:

[24]:
# only care about mutations; get rid of wildtype values
data_hiv = data_hiv.query("mutation != wildtype").assign(
    totfracsurvive=lambda x: x.groupby(["antibody", "site"])[
        "mutfracsurvive"
    ].transform("sum")
)

Now the data frame has a column (totfracsurvive) giving the total fraction surviving at each site, summed across all mutations:

[25]:
data_hiv.head(n=5)
[25]:
site isite wildtype mutation mutfracsurvive antibody site_label show_site totfracsurvive
0 31 30 A Y 0.030824 PG9 A31 False 0.062502
1 31 30 A K 0.006860 PG9 A31 False 0.062502
2 31 30 A D 0.006774 PG9 A31 False 0.062502
3 31 30 A S 0.004407 PG9 A31 False 0.062502
4 31 30 A R 0.003501 PG9 A31 False 0.062502
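
The same site total is repeated on every row of a site because transform returns one value per row rather than one per group. A minimal sketch on made-up numbers (the toy frame is hypothetical):

```python
import pandas as pd

toy = pd.DataFrame(
    {
        "antibody": ["PG9", "PG9", "PG9", "PGT145"],
        "site": ["160", "160", "171", "160"],
        "mutfracsurvive": [0.5, 0.25, 0.125, 1.0],
    }
)

# transform("sum") returns one value per row, not one per group
toy["totfracsurvive"] = toy.groupby(["antibody", "site"])[
    "mutfracsurvive"
].transform("sum")
print(toy["totfracsurvive"].tolist())  # [0.75, 0.75, 0.125, 1.0]
```

Grouping by both antibody and site keeps the totals for the two antibodies separate even though they share site numbers.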

Now we use the dmslogo.line.draw_line function to draw the line plot for antibody PG9. Note how we provide our new totfracsurvive column as height_col. We also provide our previously defined show_site column (which indicates which sites were shown in the logo plot) as the show_col, so that the line plot has the sites shown in the above logo plot underlined in orange:

[26]:
# NBVAL_IGNORE_OUTPUT

fig, ax = dmslogo.draw_line(
    data_hiv.query('antibody == "PG9"'),
    x_col="isite",
    height_col="totfracsurvive",
    xtick_col="site",
    show_col="show_site",
    title="PG9",
    widthscale=2,
)
_images/basic_example_51_0.png

Combining site-level line and mutation-level logo plots

Of course, a line plot isn’t that hard to make, but the advantage of doing this using the approach above is that we can combine dmslogo.line.draw_line and dmslogo.logo.draw_logo to create a single figure that shows the site-selection in a line plot and the selected sites as logo plots.

The easiest way to do this is to use the dmslogo.facet.facet_plot command described below. But first here we do it using matplotlib subplots. Note how the resulting plot combines the line and logo plots, with the line plot using the orange underline to indicate which sites are zoomed in the logo plot:

[27]:
# NBVAL_IGNORE_OUTPUT

fig, axes = plt.subplots(1, 2, gridspec_kw={"width_ratios": [1, 1.5]})
fig.subplots_adjust(wspace=0.12)
fig.set_size_inches(24, 3)

_ = dmslogo.draw_line(
    data_hiv.query('antibody == "PG9"'),
    x_col="isite",
    height_col="totfracsurvive",
    xtick_col="site",
    show_col="show_site",
    ax=axes[0],
)

_ = dmslogo.draw_logo(
    data_hiv.query('antibody == "PG9"').query("show_site"),
    x_col="isite",
    letter_col="mutation",
    letter_height_col="mutfracsurvive",
    ax=axes[1],
    xtick_col="site_label",
)
_images/basic_example_53_0.png

Faceting line and logo plots together

The easiest way to facet line and logo plots together is using dmslogo.facet.facet_plot.

The cell below shows how this is done. You pass the data to this function, as well as any columns and rows you would like to facet, the x_col and show_col arguments shared between the line and logo plots, and additional keyword arguments for dmslogo.logo.draw_logo and dmslogo.line.draw_line:

[28]:
# NBVAL_IGNORE_OUTPUT

fig, axes = dmslogo.facet_plot(
    data_hiv,
    gridrow_col="antibody",
    x_col="isite",
    show_col="show_site",
    draw_line_kwargs=dict(height_col="totfracsurvive", xtick_col="site"),
    draw_logo_kwargs=dict(
        letter_col="mutation",
        letter_height_col="mutfracsurvive",
        xtick_col="site_label",
        xlabel="site",
    ),
    line_titlesuffix="site-level selection",
    logo_titlesuffix="mutation-level selection",
)
_images/basic_example_55_0.png

There are various options to tweak the formatting of the faceted plot. Here we demonstrate a few of them:

  • We assign a more generic ylabel (“immune selection”) to each plot via the appropriate *_kwargs option.

  • We use the share_ylim_across_rows=False option to allow each row to have its own y-axis limits.

  • We use the share_xlabel and share_ylabel options to share the x- and y-labels across the line and logo plots.

[29]:
# NBVAL_IGNORE_OUTPUT

fig, axes = dmslogo.facet_plot(
    data_hiv,
    gridrow_col="antibody",
    x_col="isite",
    show_col="show_site",
    draw_line_kwargs=dict(
        height_col="totfracsurvive", xtick_col="site", ylabel="immune selection"
    ),
    draw_logo_kwargs=dict(
        letter_col="mutation",
        letter_height_col="mutfracsurvive",
        xtick_col="site_label",
        xlabel="site",
        ylabel="immune selection",
    ),
    line_titlesuffix="site-level selection",
    logo_titlesuffix="mutation-level selection",
    share_ylim_across_rows=False,
    share_xlabel=True,
    share_ylabel=True,
)
_images/basic_example_57_0.png

Finally, here is the same plot where we have shaded the N-linked glycosylation motif for the glycan at site 160 (sites 160 to 162):

[30]:
# NBVAL_IGNORE_OUTPUT

data_hiv["shade_color"] = numpy.where(
    data_hiv["site"].isin(["160", "161", "162"]), "gray", None
)
data_hiv["shade_alpha"] = numpy.where(
    data_hiv["site"].isin(["160", "161", "162"]), 0.15, None
)

fig, axes = dmslogo.facet_plot(
    data_hiv,
    gridrow_col="antibody",
    x_col="isite",
    show_col="show_site",
    draw_line_kwargs=dict(
        height_col="totfracsurvive", xtick_col="site", ylabel="immune selection"
    ),
    draw_logo_kwargs=dict(
        letter_col="mutation",
        letter_height_col="mutfracsurvive",
        xtick_col="site_label",
        xlabel="site",
        ylabel="immune selection",
        shade_color_col="shade_color",
        shade_alpha_col="shade_alpha",
    ),
    line_titlesuffix="site-level selection",
    logo_titlesuffix="mutation-level selection",
    share_ylim_across_rows=False,
    share_xlabel=True,
    share_ylabel=True,
)
_images/basic_example_59_0.png

Some details about fonts

Write DMSLOGO in Comic Sans font

Generate data to plot by creating the pandas DataFrame word_data. In this data frame, we choose large heights and bright colors for the letters in our word (DMSLOGO), and smaller heights and gray for the other letters.

[31]:
word = "DMSLOGO"
lettercolors = [CBPALETTE[1]] * len("dms") + [CBPALETTE[2]] * len("logo")

# make data frame with data to plot
random.seed(0)
word_data = {"x": [], "letter": [], "height": [], "color": []}
for x, (letter, color) in enumerate(zip(word, lettercolors)):
    word_data["x"].append(x)
    word_data["letter"].append(letter)
    word_data["color"].append(color)
    word_data["height"].append(random.uniform(1, 1.5))
    for otherletter in random.sample(sorted(set("ACTG") - {letter}), 3):
        word_data["x"].append(x)
        word_data["letter"].append(otherletter)
        word_data["color"].append(CBPALETTE[0])
        word_data["height"].append(random.uniform(0.1, 0.5))
word_data = pd.DataFrame(word_data)
word_data.head(n=6)
[31]:
x letter height color
0 0 D 1.422211 #E69F00
1 0 T 0.486186 #999999
2 0 A 0.294371 #999999
3 0 C 0.467294 #999999
4 1 M 1.414926 #E69F00
5 1 T 0.301875 #999999

Now draw the logo. We use the fontfamily argument to set a Comic Sans font. This also requires us to increase fontaspect since this font is wider, and increase letterpad as the font height sometimes sticks out beyond its baseline:

[32]:
# NBVAL_IGNORE_OUTPUT

fig, ax = dmslogo.draw_logo(
    data=word_data,
    letter_height_col="height",
    x_col="x",
    letter_col="letter",
    color_col="color",
    fontfamily="Comic Sans MS",
    hide_axis=True,
    fontaspect=0.85,
    letterpad=0.05,
)
_images/basic_example_63_0.png

Subtleties of non-default fonts

Note, however, that you may in general have difficulty getting good-looking logos with most fonts other than the dmslogo default. The reason is that for clean and accurate letter-height logo plots, the font must:

  • be mono-spaced

  • not have descenders

  • have all letters go exactly from the baseline to the top

You can manually edit a font to meet these requirements, as has been done for the current dmslogo default font; look here for details.