Analyses – `pbxplore.analysis`

Build occurrence matrix

pbxplore.analysis.count_matrix(pb_seq)

Count the occurrences of each block at each position.

The occurrence matrix has one row per position in the sequences and one column per block. The columns are ordered as in pbxplore.PB.NAMES.

Parameters:	pb_seq – a list of PB sequences.
Returns:	pb_count – The occurrence matrix.
Return type:	numpy array
Raises:	`pbxplore.PB.InvalidBlockError` – encountered an unexpected PB
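
As an illustration of the counting described above, here is a minimal sketch in plain Python. It is not the library's implementation: `count_matrix_sketch` and `PB_NAMES` are hypothetical names, the sketch returns nested lists rather than a numpy array, it assumes the blocks are named 'a' to 'p' in that order, it assumes the dummy block Z is skipped, and it raises `ValueError` where the library raises `pbxplore.PB.InvalidBlockError`.

```python
# Assumption: the 16 protein blocks are named 'a' to 'p', in that order,
# which is the role pbxplore.PB.NAMES plays in the library.
PB_NAMES = "abcdefghijklmnop"


def count_matrix_sketch(pb_seq):
    """Return one row per position, each row holding one count per block."""
    if not pb_seq:
        return []
    length = len(pb_seq[0])
    counts = [[0] * len(PB_NAMES) for _ in range(length)]
    for seq in pb_seq:
        if len(seq) != length:
            raise ValueError("all PB sequences must have the same length")
        for position, block in enumerate(seq):
            if block in ("Z", "z"):
                # Assumption: the dummy block Z is skipped, not counted.
                continue
            column = PB_NAMES.find(block)
            if column < 0:
                # Stand-in for pbxplore.PB.InvalidBlockError.
                raise ValueError(f"unexpected PB: {block!r}")
            counts[position][column] += 1
    return counts


counts = count_matrix_sketch(["ZZdfk", "ZZdfb", "ZZcfk"])
# Occurrences of block 'd' at the third position.
print(counts[2][PB_NAMES.index("d")])  # prints 2
```

With the real function, the same input would be passed as `pb_seq` and the result would be a numpy array with the same layout.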

pbxplore.analysis.read_occurence_file(name)

Read an occurrence matrix from a file. The function returns the matrix as a numpy array, together with the residue indexes.

Parameters:	name (str) – Name of the file.
Returns:

count_mat (numpy array) – the occurrence matrix, without the residue numbers
residues (list) – the list of residue indexes

Raises:	`ValueError` – when something is wrong about the file

pbxplore.analysis.plot_map(fname, count_mat, idx_first_residue=1, residue_min=1, residue_max=None)

Generate a map of the distribution of PBs along the protein sequence from an occurrence matrix.

Parameters:

fname (str) – The path to the file to write in
count_mat (numpy array) – an occurrence matrix returned by count_matrix.
idx_first_residue (int) – the index of the first residue in the matrix
residue_min (int) – the lower bound of the protein sequence
residue_max (int) – the upper bound of the protein sequence

Compare protein block sequences

pbxplore.analysis.compare(header_lst, seq_lst, substitution_mat, fname)

Command line wrapper for the comparison of all sequences with the first one

When the --compare option is given on the command line, the program compares all the sequences to the first one and writes these comparisons as sequences of digits. Each digit represents the distance between the PB in the target and the PB in the reference at the same position. The digits are normalized to the [0; 9] range.

This function runs the comparison, writes the result to a FASTA file, and displays information about the process on screen.

Parameters:

header_lst (list of strings) – The list of sequence headers ordered as the sequences
seq_lst (list of strings) – The list of sequences ordered as the headers
substitution_mat (numpy.array) – A substitution matrix expressed as similarity scores
fname (str) – The output file name

Visualize deformability

pbxplore.analysis.compute_neq(count_mat)

Compute the Neq for each residue from an occurrence matrix.

Parameters:	count_mat (numpy array) – an occurrence matrix returned by count_matrix.
Returns:	neq_array – a 1D array containing the neq values
Return type:	numpy array
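
Neq, the equivalent number of PBs, is commonly defined as the exponential of the Shannon entropy of the PB frequencies at a position: Neq = exp(-Σ f ln f), where f runs over the frequencies of the blocks observed at that position. The sketch below applies that definition row by row to an occurrence matrix given as nested lists. The name `neq_sketch` is hypothetical, and the code is an illustration of the formula, not necessarily how `compute_neq` is written.

```python
import math


def neq_sketch(count_mat):
    """Return one Neq value per row (residue) of an occurrence matrix."""
    neq_values = []
    for row in count_mat:
        total = sum(row)
        entropy = 0.0
        for count in row:
            # Blocks that never occur contribute nothing to the entropy.
            if count > 0:
                freq = count / total
                entropy -= freq * math.log(freq)
        neq_values.append(math.exp(entropy))
    return neq_values


# First residue: always the same block. Second residue: two blocks, equally likely.
print(neq_sketch([[4, 0, 0], [2, 2, 0]]))
```

A residue that always adopts the same block gets a Neq of 1, and a residue that visits k blocks with equal frequency gets a Neq of k (up to floating-point rounding), which is why Neq is read as a measure of local deformability.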

pbxplore.analysis.plot_neq(fname, neq_array, idx_first_residue=1, residue_min=1, residue_max=None)

Generate the Neq plot along the protein sequence

Parameters:

fname (str) – The path to the file to write in
neq_array (numpy array) – an array containing the neq value associated to the residue number
idx_first_residue (int) – the index of the first residue in the array
residue_min (int) – the lower bound of the protein sequence
residue_max (int) – the upper bound of the protein sequence

pbxplore.analysis.generate_weblogo(fname, count_mat, idx_first_residue=1, residue_min=1, residue_max=None, title='')

Generate a logo representation of PB frequencies along the protein sequence using the weblogo library.

The weblogo reference: G. E. Crooks, G. Hon, J.-M. Chandonia, and S. E. Brenner. ‘WebLogo: A Sequence Logo Generator.’ Genome Research 14:1188–90 (2004) doi:10.1101/gr.849004. http://weblogo.threeplusone.com/

Parameters:

fname (str) – The path to the file to write in
count_mat (numpy array) – an occurrence matrix returned by count_matrix.
idx_first_residue (int) – the index of the first residue in the matrix
residue_min (int) – the lower bound of residue frame
residue_max (int) – the upper bound of residue frame
title (str) – the title of the weblogo. Default is empty.

Utils

pbxplore.analysis.substitution_score(substitution_matrix, seqA, seqB)

Compute the substitution score to go from seqA to seqB

Both sequences must have the same length.

The score is either expressed as a similarity or a distance depending on the substitution matrix.

pbxplore.analysis.compute_freq_matrix(count_mat)

Compute a PB frequency matrix from an occurrence matrix.

The frequency matrix has one row per position in the sequences and one column per block. The columns are ordered as in pbxplore.PB.NAMES.

Parameters:	count_mat (numpy array) – an occurrence matrix returned by `count_matrix`.
Returns:	freq – The frequency matrix
Return type:	numpy array
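
To make the relation between counts and frequencies concrete, the following sketch divides each row of an occurrence matrix by the number of blocks counted at that position. This normalization is an assumption made for illustration; the text above does not say how `compute_freq_matrix` normalizes, and `freq_matrix_sketch` is a hypothetical name working on nested lists rather than numpy arrays.

```python
def freq_matrix_sketch(count_mat):
    """Turn per-position block counts into per-position block frequencies."""
    freq = []
    for row in count_mat:
        total = sum(row)
        if total == 0:
            # Assumption: a position with no counted block gets all-zero
            # frequencies instead of triggering a division by zero.
            freq.append([0.0] * len(row))
        else:
            freq.append([count / total for count in row])
    return freq


# Two positions, four blocks (a real matrix would have 16 columns).
print(freq_matrix_sketch([[1, 3, 0, 0], [0, 0, 0, 0]]))
```

Under this assumption, every row with at least one counted block sums to 1.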

pbxplore.analysis.compute_score_by_position(score_mat, seq1, seq2)

Compute the substitution score between two sequences, position by position.

The substitution score can represent a similarity or a distance depending on the score matrix provided. The score matrix should be provided as a 2D numpy array with score[i, j] the score to switch from the PB at the i-th position in pbxplore.PB.NAMES to the PB at the j-th position in pbxplore.PB.NAMES.

The function returns the result as a list of substitution scores to go from seq1 to seq2 for each position. Both sequences must have the same length.

Note

The score to move from or to a Z block (dummy block) is always 0.

Raises:	`pbxplore.PB.InvalidBlockError` – encountered an unexpected PB
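
The lookup described above can be sketched in plain Python as follows. This is an illustration under stated assumptions, not the library's code: `score_by_position_sketch`, `PB_NAMES`, and `toy_matrix` are hypothetical names, the blocks are assumed to be named 'a' to 'p' in that order, the score matrix is a nested list instead of a numpy array, and `ValueError` stands in for `pbxplore.PB.InvalidBlockError`. The toy matrix is made up for the example and is not a real PB substitution matrix.

```python
# Assumption: the 16 protein blocks are named 'a' to 'p', in that order.
PB_NAMES = "abcdefghijklmnop"


def score_by_position_sketch(score_mat, seq1, seq2):
    """Return the score to go from seq1 to seq2 at each position."""
    if len(seq1) != len(seq2):
        raise ValueError("both sequences must have the same length")
    scores = []
    for pb1, pb2 in zip(seq1, seq2):
        if pb1 in ("Z", "z") or pb2 in ("Z", "z"):
            # Moving from or to the dummy block Z always scores 0.
            scores.append(0)
            continue
        i = PB_NAMES.find(pb1)
        j = PB_NAMES.find(pb2)
        if i < 0 or j < 0:
            # Stand-in for pbxplore.PB.InvalidBlockError.
            raise ValueError(f"unexpected PB in pair {pb1!r}, {pb2!r}")
        scores.append(score_mat[i][j])
    return scores


# Hypothetical distance-like matrix: the score is the gap between block indexes.
toy_matrix = [[abs(i - j) for j in range(16)] for i in range(16)]
print(score_by_position_sketch(toy_matrix, "Zadk", "Zbdm"))  # prints [0, 1, 0, 2]
```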
