Analyses – pbxplore.analysis

Build occurrence matrix

pbxplore.analysis.count_matrix(pb_seq)[source]

Count the occurrences of each block at each position.

The occurrence matrix has one row per position in the sequences, and one column per block. The columns are ordered as in pbxplore.PB.NAMES.

Parameters: pb_seq – a list of PB sequences.
Returns: pb_count – the occurrence matrix.
Return type: numpy array
Raises: pbxplore.PB.InvalidBlockError – encountered an unexpected PB
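As an illustration of the counting described above, here is a minimal pure-Python sketch. It is not the library's implementation. It assumes the block names are the letters a to p (standing in for pbxplore.PB.NAMES) and that the dummy block Z is skipped, and it raises a plain ValueError where the library raises pbxplore.PB.InvalidBlockError. The names count_matrix_sketch and PB_NAMES are hypothetical, and nested lists stand in for the numpy array.

```python
# Assumption: the 16 protein blocks are named 'a' to 'p', in this order.
PB_NAMES = "abcdefghijklmnop"


def count_matrix_sketch(pb_seq):
    """Count how often each block occurs at each position."""
    lengths = {len(seq) for seq in pb_seq}
    if len(lengths) != 1:
        raise ValueError("All sequences must have the same length.")
    n_positions = lengths.pop()
    # One row per position, one column per block.
    counts = [[0] * len(PB_NAMES) for _ in range(n_positions)]
    for seq in pb_seq:
        for position, block in enumerate(seq):
            if block in PB_NAMES:
                counts[position][PB_NAMES.index(block)] += 1
            elif block not in ("Z", "z"):
                # Stand-in for pbxplore.PB.InvalidBlockError.
                raise ValueError("Unexpected PB: {!r}".format(block))
            # Assumption: dummy Z blocks are simply not counted.
    return counts


counts = count_matrix_sketch(["ZZdd", "ZZdf", "ZZmf"])
print(counts[2][PB_NAMES.index("d")])  # 2
```

In this toy input, block d occurs twice at the third position, and the first two positions (all Z) keep empty rows.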
pbxplore.analysis.read_occurence_file(name)[source]

Read an occurrence matrix from a file. Returns the matrix as a numpy array, together with the residue indexes.

Parameters: name (str) – Name of the file.
Returns:
  • count_mat (numpy array) – the occurrence matrix without the residue number
  • residues (list) – the list of residue indexes
Raises: ValueError – when something is wrong with the file
pbxplore.analysis.plot_map(fname, count_mat, idx_first_residue=1, residue_min=1, residue_max=None)[source]

Generate a map of the distribution of PBs along the protein sequence from an occurrence matrix.

Parameters:
  • fname (str) – The path to the file to write in
  • count_mat (numpy array) – an occurrence matrix returned by count_matrix.
  • idx_first_residue (int) – the index of the first residue in the matrix
  • residue_min (int) – the lower bound of the protein sequence
  • residue_max (int) – the upper bound of the protein sequence

Compare protein block sequences

pbxplore.analysis.compare(header_lst, seq_lst, substitution_mat, fname)[source]

Command line wrapper for the comparison of all sequences with the first one

When the --compare option is given on the command line, the program compares all the sequences to the first one and writes these comparisons as sequences of digits. Each digit represents the distance between the PB in the target and the PB in the reference at the same position. The digits are normalized to the [0; 9] range.

This function runs the comparison, writes the result to a FASTA file, and displays information about the process on screen.

Parameters:
  • header_lst (list of strings) – The list of sequence headers ordered as the sequences
  • seq_lst (list of strings) – The list of sequences ordered as the headers
  • substitution_mat (numpy.array) – A substitution matrix expressed as similarity scores
  • fname (str) – The output file name
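The text above does not say how similarity scores become distance digits. The sketch below shows one plausible scheme, a linear min-max rescaling, purely as an assumption and not as the library's method. The function similarity_to_digit and the score range of -5 to 5 are hypothetical.

```python
def similarity_to_digit(similarity, sim_min, sim_max):
    """Map a similarity score onto a distance digit between 0 and 9.

    Assumption: the highest similarity maps to distance 0 and the lowest
    similarity maps to distance 9, with linear scaling in between.
    """
    if sim_max == sim_min:
        return 0
    distance = (sim_max - similarity) / (sim_max - sim_min)
    return int(round(9 * distance))


# Hypothetical per-position similarities between a target and the reference.
similarities = [5, 3, -5]
digits = [similarity_to_digit(s, -5, 5) for s in similarities]
print("".join(str(d) for d in digits))  # 029
```

Reading the result, 0 marks a position where the target matches the reference as closely as the matrix allows, and 9 marks the most distant substitution.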

Visualize deformability

pbxplore.analysis.compute_neq(count_mat)[source]

Compute the Neq for each residue from an occurrence matrix.

Parameters: count_mat (numpy array) – an occurrence matrix returned by count_matrix.
Returns: neq_array – a 1D array containing the Neq values
Return type: numpy array
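For readers unfamiliar with Neq: in the protein blocks literature it is usually defined as the exponential of the Shannon entropy of the block frequencies at a position, Neq = exp(-Σ f ln f). The sketch below applies that definition to each row of an occurrence matrix. It is a pure-Python illustration under that assumption, not the library's code, and neq_sketch is a hypothetical name.

```python
import math


def neq_sketch(count_mat):
    """Return one Neq value per row (position) of an occurrence matrix."""
    neq_values = []
    for row in count_mat:
        total = sum(row)
        entropy = 0.0
        for count in row:
            # Blocks that never occur contribute nothing to the entropy.
            # A row with no counts at all therefore yields exp(0) = 1.
            if count > 0:
                freq = count / total
                entropy -= freq * math.log(freq)
        neq_values.append(math.exp(entropy))
    return neq_values


# One position with a single block, one with two equally frequent blocks.
print(neq_sketch([[4, 0, 0], [2, 2, 0]]))
```

Under this definition, a Neq of 1 means a single block is observed at the position, and the value grows toward 16 as all blocks become equally frequent.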
pbxplore.analysis.plot_neq(fname, neq_array, idx_first_residue=1, residue_min=1, residue_max=None)[source]

Generate the Neq plot along the protein sequence

Parameters:
  • fname (str) – The path to the file to write in
  • neq_array (numpy array) – an array containing the neq value associated to the residue number
  • idx_first_residue (int) – the index of the first residue in the array
  • residue_min (int) – the lower bound of the protein sequence
  • residue_max (int) – the upper bound of the protein sequence

Generate a logo representation of PB frequencies along the protein sequence using the weblogo library.

The weblogo reference: G. E. Crooks, G. Hon, J.-M. Chandonia, and S. E. Brenner. ‘WebLogo: A Sequence Logo Generator.’ Genome Research 14:1188–90 (2004) doi:10.1101/gr.849004. http://weblogo.threeplusone.com/

Parameters:
  • fname (str) – The path to the file to write in
  • count_mat (numpy array) – an occurrence matrix returned by count_matrix.
  • idx_first_residue (int) – the index of the first residue in the matrix
  • residue_min (int) – the lower bound of residue frame
  • residue_max (int) – the upper bound of residue frame
  • title (str) – the title of the weblogo. Default is empty.

Utils

pbxplore.analysis.substitution_score(substitution_matrix, seqA, seqB)[source]

Compute the substitution score to go from seqA to seqB

Both sequences must have the same length.

The score is either expressed as a similarity or a distance depending on the substitution matrix.

pbxplore.analysis.compute_freq_matrix(count_mat)[source]

Compute a PB frequency matrix from an occurrence matrix.

The frequency matrix has one row per position in the sequences, and one column per block. The columns are ordered as in pbxplore.PB.NAMES.

Parameters: count_mat (numpy array) – an occurrence matrix returned by count_matrix.
Returns: freq – the frequency matrix
Return type: numpy array
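A minimal sketch of turning counts into frequencies follows. It assumes that the number of sequences can be read off as the largest row total, since rows of dummy blocks may have lower totals. That inference and the name freq_matrix_sketch are assumptions made for illustration, and the library may derive the sequence count differently.

```python
def freq_matrix_sketch(count_mat):
    """Divide every count by the number of sequences."""
    # Assumption: the largest row total equals the number of sequences.
    n_sequences = max(sum(row) for row in count_mat)
    if n_sequences == 0:
        raise ValueError("The occurrence matrix contains no counts.")
    return [[count / n_sequences for count in row] for row in count_mat]


freq = freq_matrix_sketch([[0, 0, 0], [2, 1, 0], [0, 0, 3]])
print(freq[2])  # [0.0, 0.0, 1.0]
```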
pbxplore.analysis.compute_score_by_position(score_mat, seq1, seq2)[source]

Compute the substitution score between two sequences, position by position.

The substitution score can represent a similarity or a distance depending on the score matrix provided. The score matrix should be provided as a 2D numpy array with score[i, j] the score to switch the PB at the i-th position in pbxplore.PB.NAMES to the PB at the j-th position in pbxplore.PB.NAMES.

The function returns the result as a list of substitution scores to go from seq1 to seq2 for each position. Both sequences must have the same length.

Note

The score to move from or to a Z block (dummy block) is always 0.

Raises: pbxplore.PB.InvalidBlockError – encountered an unexpected PB
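The lookup described above, including the rule that a Z block always scores 0, can be sketched in pure Python as follows. This is an illustration and not the library's code. It assumes the block names are the letters a to p, uses nested lists (score_mat[i][j]) in place of a numpy array, and raises ValueError as a stand-in for pbxplore.PB.InvalidBlockError. The toy matrix is invented for the example.

```python
# Assumption: the 16 protein blocks are named 'a' to 'p', in this order.
PB_NAMES = "abcdefghijklmnop"


def score_by_position_sketch(score_mat, seq1, seq2):
    """Return the score to go from seq1 to seq2 at each position."""
    if len(seq1) != len(seq2):
        raise ValueError("Both sequences must have the same length.")
    scores = []
    for pb1, pb2 in zip(seq1, seq2):
        if pb1 in ("Z", "z") or pb2 in ("Z", "z"):
            # Moving from or to the dummy block always scores 0.
            scores.append(0)
        elif pb1 in PB_NAMES and pb2 in PB_NAMES:
            row = PB_NAMES.index(pb1)
            col = PB_NAMES.index(pb2)
            scores.append(score_mat[row][col])
        else:
            # Stand-in for pbxplore.PB.InvalidBlockError.
            raise ValueError("Unexpected PB: {!r} or {!r}".format(pb1, pb2))
    return scores


# Toy distance matrix: 0 for identical blocks, 1 for any substitution.
toy_mat = [[0 if i == j else 1 for j in range(16)] for i in range(16)]
print(score_by_position_sketch(toy_mat, "ZZabc", "ZZabd"))  # [0, 0, 0, 0, 1]
```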