pdb_utils

Functions to manipulate PDB files.

polyclonal.pdb_utils.extract_atom_locations(input_pdbfile, target_chains, target_atom='CA')[source]

Extract atom locations from target chains of a PDB file.

By default the locations of alpha carbons are extracted, but any atom can be specified. If a residue does not have the specified atom, it is not included in the output file.

Parameters:
  • input_pdbfile (str) – Path to input PDB file.

  • target_chains (list) – List of target chains to extract atom locations from. Chains must be in the PDB and match the chain ids.

  • target_atom (str or None) – Which type of atom to extract locations for. Default is alpha carbon, or ‘CA’. Use None to get all atoms for a residue. If the specified type of atom is present multiple times for a residue, that residue will end up having multiple entries in the output.

Returns:

Has columns ‘chain’, ‘site’, ‘atom’, ‘x’, ‘y’, and ‘z’.

Return type:

pandas.DataFrame

Example

>>> pdb_url = 'https://files.rcsb.org/download/6M0J.pdb'
>>> r = requests.get(pdb_url)
>>> with tempfile.TemporaryDirectory() as tmpdir:
...    pdbfile = os.path.join(tmpdir, '6M0J.pdb')
...    with open(pdbfile, 'wb') as f:
...        _ = f.write(r.content)
...    output = extract_atom_locations(pdbfile, ['A'])
...    output_all_atoms = extract_atom_locations(pdbfile, ['A'], target_atom=None)

Check the first ten lines of the ouput to make sure we got the expected atom locations:

>>> output.head(n=10)
  chain  site atom          x          y      z
0     A    19   CA -31.358999  50.852001  2.040
1     A    20   CA -29.424000  50.561001 -1.234
2     A    21   CA -30.722000  48.633999 -4.234
3     A    22   CA -28.080999  45.924999 -3.794
4     A    23   CA -28.982000  45.372002 -0.131
5     A    24   CA -32.637001  44.912998 -1.106
6     A    25   CA -31.709999  42.499001 -3.889
7     A    26   CA -29.688999  40.509998 -1.334
8     A    27   CA -32.740002  40.337002  0.917
9     A    28   CA -34.958000  39.424000 -2.028
>>> output_all_atoms.head(n=10)
  chain  site atom          x          y      z
0     A    19    N -31.455000  49.473999  2.505
1     A    19   CA -31.358999  50.852001  2.040
2     A    19    C -31.051001  50.891998  0.548
3     A    19    O -31.921000  51.243999 -0.251
4     A    19   CB -30.297001  51.626999  2.826
5     A    19   OG -30.882000  52.734001  3.490
6     A    20    N -29.822001  50.528000  0.169
7     A    20   CA -29.424000  50.561001 -1.234
8     A    20    C -30.215000  49.535000 -2.042
9     A    20    O -30.926001  48.687000 -1.500
polyclonal.pdb_utils.inter_residue_distances(input_pdbfile, target_chains, target_atom=None)[source]

Get inter-residue distances from a PDB file.

If a residue number is present in multiple chains, gets the closest distance for each partner from all residues with that number. This is useful for homo-oligomers.

Parameters:
  • input_pdbfile (str) – Path to input PDB file.

  • target_chains (list) – List of target chains for which we get residues.

  • target_atom (str or None) – Which type of atoms to consider when getting distances. None means all atoms; you could also want to use ‘CA’ for alpha carbons.

Returns:

Columns are “site_1”, “site_2”, “distance”, “chain_1”, and “chain_2”. The distance is the Euclidean distance between the sites, and the chain columns indicate the chain for which the closest residue is drawn for that pair. Only returns the unique combinations of site_1 and site_2. Eg, has entries for sites 1 and 2, but not 1 and 1 or 2 and 1. The distances are in angstroms.

Return type:

pandas.DataFrame

Example

Get distances from one and multiple chains from spike trimer PDB file:

>>> pdb_url = 'https://files.rcsb.org/download/6XM4.pdb'
>>> r = requests.get(pdb_url)
>>> with tempfile.NamedTemporaryFile() as tmpf:
...    _ = tmpf.write(r.content)
...    tmpf.flush()
...    dist_chain_a = inter_residue_distances(tmpf.name, ["A"])
...    dist_chain_a_b = inter_residue_distances(tmpf.name, ["A", "B"])
>>> dist_chain_a
        site_1  site_2    distance chain_1 chain_2
0           27      28    1.332629       A       A
1           27      29    4.612508       A       A
2           27      30    8.219518       A       A
3           27      31   11.016782       A       A
4           27      32   13.087037       A       A
...        ...     ...         ...     ...     ...
548623    1308    1310   30.826773       A       A
548624    1308    1311   75.350853       A       A
548625    1309    1310   12.374796       A       A
548626    1309    1311  115.681534       A       A
548627    1310    1311  106.112328       A       A

[548628 rows x 5 columns]
>>> dist_chain_a_b
        site_1  site_2   distance chain_1 chain_2
0           27      28   1.330841       B       B
1           27      29   4.502822       B       B
2           27      30   8.128284       B       B
3           27      31  10.591589       B       B
4           27      32  13.087037       A       A
...        ...     ...        ...     ...     ...
572980     845     847   4.439680       B       B
572981     845     848   6.885335       B       B
572982     846     847   1.333544       B       B
572983     846     848   3.365701       B       B
572984     847     848   1.328880       B       B

[572985 rows x 5 columns]

There are some sites where the closest residues are in different monomers:

>>> dist_chain_a_b.query("chain_1 != chain_2")
        site_1  site_2   distance chain_1 chain_2
252         27     330  46.996426       B       A
254         27     332  47.407013       B       A
255         27     333  47.920368       B       A
256         27     334  49.039017       B       A
257         27     335  51.971394       B       A
...        ...     ...        ...     ...     ...
572727    1311     844  65.875092       A       B
572728    1311     845  64.882248       A       B
572729    1311     846  62.240368       A       B
572730    1311     847  60.743019       A       B
572731    1311     848  56.456402       A       B

[282044 rows x 5 columns]
polyclonal.pdb_utils.reassign_b_factor(input_pdbfile, output_pdbfile, df, metric_col, *, site_col='site', chain_col='chain', missing_metric=0, model_index=0)[source]

Reassign B factors in PDB file to some other metric.

B-factor re-assignment is useful because PDB images can be colored by B factor using programs such as pymol using commands like:

show surface, RBD; spectrum b, white red, RBD, minimum=0, maximum=1
Parameters:
  • input_pdbfile (str) – Path to input PDB file.

  • output_pdbfile (str) – Name of created output PDB file with re-assigned B factors.

  • df (pandas.DataFrame) – Data frame with metric used to re-assign B factor.

  • metric_col (str) – Name of column in df that has the numerical metric that the B factor is re-assigned to.

  • site_col (str) – Name of column in df with site numbers, which should map numbers used in PDB.

  • chain_col (str) – Name of column in df with chain labels.

  • missing_metric (float or dict) – How do we handl sites that are missing in df? If a float, reassign B factors for all missing sites to this value. If a dict, should be keyed by chain and assign all missing sites in each chain to indicated value.

  • model_index (int) – Which model in the PDB to use. If a X-ray structure, there is probably just one model so you can use default of 0.

Return type:

None

Example

Create data frame df that assigns metric to two sites in chain E:

>>> df = pd.DataFrame({'chain': ['E', 'E'],
...                    'site': [333, 334],
...                    'metric': [0.5, 1.2]})

Create dict missing_metric that assigns -1 to sites with missing metrics in chain E, and 0 to sites in other chains:

>>> missing_metric = collections.defaultdict(lambda: 0)
>>> missing_metric['E'] = -1

Download PDB, do the re-assignment of B factors, read the lines from the resulting re-assigned PDB:

>>> pdb_url = 'https://files.rcsb.org/download/6M0J.pdb'
>>> r = requests.get(pdb_url)
>>> with tempfile.TemporaryDirectory() as tmpdir:
...    original_pdbfile = os.path.join(tmpdir, 'original.pdb')
...    with open(original_pdbfile, 'wb') as f:
...        _ = f.write(r.content)
...    reassigned_pdbfile = os.path.join(tmpdir, 'reassigned.pdb')
...    reassign_b_factor(input_pdbfile=original_pdbfile,
...                      output_pdbfile=reassigned_pdbfile,
...                      df=df,
...                      metric_col='metric',
...                      missing_metric=missing_metric)
...    pdb_text = open(reassigned_pdbfile).readlines()

Now spot check some key lines in the output PDB. Chain A has all sites with B factors (last entry) re-assigned to 0:

>>> print(pdb_text[0].strip())
ATOM      1  N   SER A  19     -31.455  49.474   2.505  1.00  0.00           N

Chain E has sites 333 and 334 with B-factors assigned to values in df, and other sites (such as 335) assigned to -1:

>>> print('\n'.join(line.strip() for line in pdb_text[5010: 5025]))
ATOM   5010  O   THR E 333     -34.954  13.568  46.370  1.00  0.50           O
ATOM   5011  CB  THR E 333     -33.695  14.409  48.627  1.00  0.50           C
ATOM   5012  OG1 THR E 333     -34.797  14.149  49.507  1.00  0.50           O
ATOM   5013  CG2 THR E 333     -32.495  14.879  49.438  1.00  0.50           C
ATOM   5014  N   ASN E 334     -35.532  15.604  45.605  1.00  1.20           N
ATOM   5015  CA  ASN E 334     -36.287  15.087  44.474  1.00  1.20           C
ATOM   5016  C   ASN E 334     -35.475  15.204  43.182  1.00  1.20           C
ATOM   5017  O   ASN E 334     -34.533  15.994  43.076  1.00  1.20           O
ATOM   5018  CB  ASN E 334     -37.622  15.823  44.337  1.00  1.20           C
ATOM   5019  CG  ASN E 334     -38.660  15.006  43.586  1.00  1.20           C
ATOM   5020  OD1 ASN E 334     -38.568  13.776  43.514  1.00  1.20           O
ATOM   5021  ND2 ASN E 334     -39.649  15.686  43.016  1.00  1.20           N
ATOM   5022  N   LEU E 335     -35.849  14.391  42.194  1.00 -1.00           N
ATOM   5023  CA  LEU E 335     -35.084  14.305  40.955  1.00 -1.00           C
ATOM   5024  C   LEU E 335     -35.466  15.426  39.992  1.00 -1.00           C