Biopython

**Biopython**
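Original author(s)	Chapman B, Chang J^[1]
Initial release	December 17, 2002
Stable release	1.81 (February 12, 2023)^[2]; 1.83 (January 10, 2024)^[3]
Repository	https://github.com/biopython/biopython
Written in	Python and C
Platform	Cross-platform
Type	Bioinformatics
License	Biopython License
Website	biopython.org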

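The Biopython project is an open-source collection of non-commercial Python tools for computational biology and bioinformatics, created by an international association of developers.^[1]^[4]^[5] It contains classes that represent biological sequences and sequence annotations, and it can read and write a variety of file formats. It also provides programmatic access to online biological databases, such as those of the National Center for Biotechnology Information (NCBI). Separate modules extend Biopython's capabilities to sequence alignment, protein structure, population genetics, phylogenetics, sequence motifs, and machine learning. The project aims to reduce code duplication in computational biology and, like similar projects, carries the Bio prefix in its name.^[6]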

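History

Development of Biopython began in 1999, and it was first released in July 2000.^[7] Similar projects developed during the same period include BioPerl, BioRuby, and BioJava, each named after the programming language it is written in. Early developers of the project included Jeff Chang, Andrew Dalke, and Brad Chapman, and more than 100 people have contributed to date.^[8] A similar Python project, PyCogent, was established in 2007.^[9]

Biopython's initial scope was accessing, indexing, and processing biological sequence files, and this remains its main focus. In the following years, added modules extended its functionality to other areas of biology (see Key features and examples).

Starting with version 1.77, Biopython no longer supports Python 2.^[10]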

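Design

Wherever possible, Biopython follows the conventions of the Python language so that Python users can pick it up easily. For example, Seq and SeqRecord objects can be manipulated by slicing, much like Python strings and lists. It is also designed to be functionally similar to other Bio* projects, such as BioPerl.^[7]

Biopython can read and write the most common file formats in each of its functional areas. Its license is permissive and compatible with most other software licenses, which allows Biopython to be used in a wide range of software projects.^[5]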

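Key features and examples

Sequences

A core concept in Biopython is the biological sequence, represented by the Seq class.^[11] A Seq object is similar to a Python string in many respects: it supports Python slice notation, can be concatenated with other sequences, and is immutable. In addition, it provides sequence-specific methods. Releases before 1.78 also let a Seq specify the biological alphabet in use; the Bio.Alphabet module was removed in version 1.78.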

>>> # This script creates a DNA sequence and performs some typical operations.
>>> # Written for Biopython 1.78 or later, where Seq no longer takes an alphabet argument.
>>> from Bio.Seq import Seq
>>> dna_sequence = Seq("AGGCTTCTCGTA")
>>> dna_sequence
Seq('AGGCTTCTCGTA')
>>> dna_sequence[2:7]
Seq('GCTTC')
>>> dna_sequence.reverse_complement()
Seq('TACGAGAAGCCT')
>>> rna_sequence = dna_sequence.transcribe()
>>> rna_sequence
Seq('AGGCUUCUCGUA')
>>> rna_sequence.translate()
Seq('RLLV')
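
To make clear what these sequence-specific methods compute, here is a plain-Python sketch that reproduces the results above without Biopython. It is not how Biopython implements Seq. The helper names and the four-entry codon table (just enough for this sequence) are assumptions made for illustration.

```python
# Plain-Python illustration of reverse complement, transcription and translation.
# Hypothetical helpers; Biopython's Seq methods handle many more cases.

COMPLEMENT = str.maketrans("ACGT", "TGCA")

# Only the codons needed for the example sequence (standard genetic code).
CODON_TABLE = {"AGG": "R", "CUU": "L", "CUC": "L", "GUA": "V"}


def reverse_complement(dna: str) -> str:
    """Complement every base, then reverse the string."""
    return dna.translate(COMPLEMENT)[::-1]


def transcribe(dna: str) -> str:
    """Coding-strand transcription: replace T with U."""
    return dna.replace("T", "U")


def translate(rna: str) -> str:
    """Translate complete codons using the partial table above."""
    usable = len(rna) - len(rna) % 3  # ignore a trailing partial codon
    codons = (rna[i:i + 3] for i in range(0, usable, 3))
    return "".join(CODON_TABLE[codon] for codon in codons)


dna = "AGGCTTCTCGTA"
print(dna[2:7])                     # GCTTC
print(reverse_complement(dna))      # TACGAGAAGCCT
print(translate(transcribe(dna)))   # RLLV
```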

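Sequence annotation

The SeqRecord class describes a sequence together with information such as its name, description, and features, the last of these in the form of SeqFeature objects. Each SeqFeature object specifies the type of the feature and its location. Feature types can be "gene", "CDS" (coding sequence), "repeat_region", "mobile_element", or others, and the position of a feature in the sequence can be exact or approximate.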

>>> # This script loads an annotated sequence from a file and views some of its contents.
>>> # The reprs below come from an older Biopython release; newer releases omit the alphabet.
>>> from Bio import SeqIO
>>> seq_record = SeqIO.read("pTC2.gb", "genbank")
>>> seq_record.name
'NC_019375'
>>> seq_record.description
'Providencia stuartii plasmid pTC2, complete sequence.'
>>> seq_record.features[14]
SeqFeature(FeatureLocation(ExactPosition(4516), ExactPosition(5336), strand=1), type='mobile_element')
>>> seq_record.seq
Seq("GGATTGAATATAACCGACGTGACTGTTACATTTAGGTGGCTAAACCCGTCAAGC...GCC", IUPACAmbiguousDNA())
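
The idea of a feature as "a type plus a location on the parent sequence" can be sketched in plain Python. The `Feature` class below is a hypothetical stand-in, not Biopython's SeqFeature. It assumes Python-style coordinates (0-based start, exclusive end) and a strand of +1 or -1, and the parent sequence and coordinates are made up for the example.

```python
from dataclasses import dataclass

_COMPLEMENT = str.maketrans("ACGT", "TGCA")


@dataclass(frozen=True)
class Feature:
    """Hypothetical stand-in for a feature: a type plus a location on the parent."""
    type: str
    start: int   # 0-based, inclusive
    end: int     # 0-based, exclusive (Python slice convention)
    strand: int  # +1 for forward, -1 for reverse

    def extract(self, parent: str) -> str:
        """Return the feature's sequence, reverse-complemented on the minus strand."""
        piece = parent[self.start:self.end]
        if self.strand == -1:
            piece = piece.translate(_COMPLEMENT)[::-1]
        return piece


# Made-up parent sequence and feature coordinates, for illustration only.
parent = "GGATTGAATATAACCGACGTGACTG"
features = [
    Feature("gene", 2, 8, +1),
    Feature("repeat_region", 10, 16, -1),
]
for feature in features:
    print(feature.type, feature.extract(parent))
```

Keeping the location separate from the sequence means the same feature description can be applied to any parent string of sufficient length.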

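Input and output

Biopython can read and write a number of common sequence formats, including FASTA, FASTQ, GenBank, Clustal, PHYLIP, and NEXUS. When a file is read, its descriptive information is used to populate the members of Biopython classes such as SeqRecord, which allows records in one file format to be converted into another.

Very large sequence files can exceed a computer's memory resources, so Biopython provides several options for accessing records in large files. A file can be loaded entirely into memory as a Python data structure such as a list or dictionary, giving fast access at the cost of memory usage. Alternatively, records can be read from disk on demand, which is slower but uses less memory.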

>>> # This script loads a file containing multiple sequences and saves each one in a different format.
>>> from Bio import SeqIO
>>> genomes = SeqIO.parse("salmonella.gb", "genbank")
>>> for genome in genomes:
...     SeqIO.write(genome, genome.id + ".fasta", "fasta")
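
The memory trade-off described in this section can be illustrated with a standard-library sketch for FASTA files. Biopython offers its own functions for the two approaches (for example `SeqIO.to_dict` and `SeqIO.index`). The helpers below (`iter_fasta`, `index_fasta`, `fetch`) are hypothetical and only show the underlying idea: hold every sequence in a dictionary, or hold only byte offsets and read each record from disk when it is requested.

```python
import os
import tempfile
from typing import Dict, Iterator, Tuple


def iter_fasta(path: str) -> Iterator[Tuple[str, str]]:
    """Stream (id, sequence) pairs one record at a time."""
    name, chunks = None, []
    with open(path) as handle:
        for line in handle:
            line = line.strip()
            if line.startswith(">"):
                if name is not None:
                    yield name, "".join(chunks)
                name, chunks = line[1:].split()[0], []
            elif line:
                chunks.append(line)
    if name is not None:
        yield name, "".join(chunks)


def index_fasta(path: str) -> Dict[str, int]:
    """Map each record id to the byte offset of its header line."""
    offsets = {}
    with open(path, "rb") as handle:
        while True:
            position = handle.tell()
            line = handle.readline()
            if not line:
                break
            if line.startswith(b">"):
                offsets[line[1:].split()[0].decode()] = position
    return offsets


def fetch(path: str, offsets: Dict[str, int], record_id: str) -> str:
    """Read a single record from disk using the offset index."""
    chunks = []
    with open(path, "rb") as handle:
        handle.seek(offsets[record_id])
        handle.readline()  # skip the header line
        for line in handle:
            if line.startswith(b">"):
                break
            chunks.append(line.strip().decode())
    return "".join(chunks)


# Demo on a small temporary file.
with tempfile.NamedTemporaryFile("w", suffix=".fasta", delete=False) as tmp:
    tmp.write(">seq1 first\nACGT\nACGT\n>seq2 second\nGGCC\n")
    fasta_path = tmp.name

in_memory = dict(iter_fasta(fasta_path))         # eager: every sequence held in RAM
on_disk = index_fasta(fasta_path)                # lazy: only ids and offsets in RAM
lazy_seq2 = fetch(fasta_path, on_disk, "seq2")   # one disk read per lookup
os.remove(fasta_path)
print(in_memory["seq1"], lazy_seq2)
```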

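Accessing online databases

Through the Bio.Entrez module, Biopython users can download biological data from NCBI databases. Each function provided by the Entrez search engine is available through a function in this module, including searching for and downloading records.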

>>> # This script downloads genomes from the NCBI Nucleotide database and saves them in a FASTA file.
>>> from Bio import Entrez
>>> from Bio import SeqIO
>>> Entrez.email = "my_email@example.com"  # contact address sent to NCBI with each request
>>> records_to_download = ["FO834906.1", "FO203501.1"]
>>> # The with-block closes the output file even if a download fails.
>>> with open("all_records.fasta", "w") as output_file:
...     for record_id in records_to_download:
...         handle = Entrez.efetch(db="nucleotide", id=record_id, rettype="gb", retmode="text")
...         seq_record = SeqIO.read(handle, format="gb")
...         handle.close()
...         output_file.write(seq_record.format("fasta"))

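Figure 1: A rooted phylogenetic tree created by Bio.Phylo, showing the relationships among Apaf-1 homologs in different organisms.^[12]

Figure 2: The same tree as above, drawn unrooted using Graphviz via Bio.Phylo.

The Bio.Phylo module provides tools for working with and visualizing phylogenetic trees, and supports reading and writing several file formats, including Newick, NEXUS, and phyloXML. Common tree manipulations and traversals are supported through the Tree and Clade objects. Examples include converting and collating tree files, extracting subsets from a tree, changing a tree's root, and analyzing branch features such as length or score.^[13]

Rooted trees can be drawn in ASCII or with matplotlib (see Figure 1), and the Graphviz library can be used to create unrooted layouts (see Figure 2).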
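
As a rough illustration of tree traversal and text drawing, the following standard-library sketch models a tree as nested clades. The `Clade` dataclass and `draw_ascii` function here are hypothetical and much simpler than Bio.Phylo's own classes. The small example tree is made up and built by hand rather than parsed from a file.

```python
from dataclasses import dataclass, field
from typing import Iterator, List, Optional


@dataclass
class Clade:
    """Hypothetical tree node: an optional name, a branch length and child clades."""
    name: Optional[str] = None
    branch_length: float = 0.0
    clades: List["Clade"] = field(default_factory=list)

    def is_terminal(self) -> bool:
        return not self.clades

    def traverse(self) -> Iterator["Clade"]:
        """Depth-first, preorder traversal of this clade and its descendants."""
        yield self
        for child in self.clades:
            yield from child.traverse()


def draw_ascii(clade: Clade, indent: int = 0) -> List[str]:
    """Return one text line per clade, indented by depth."""
    label = clade.name or "*"
    lines = [" " * indent + f"{label} ({clade.branch_length})"]
    for child in clade.clades:
        lines.extend(draw_ascii(child, indent + 2))
    return lines


# The Newick tree ((A:1.0,B:2.0):0.5,C:3.0); built by hand.
root = Clade(clades=[
    Clade(branch_length=0.5, clades=[Clade("A", 1.0), Clade("B", 2.0)]),
    Clade("C", 3.0),
])

terminals = [c.name for c in root.traverse() if c.is_terminal()]
total_length = sum(c.branch_length for c in root.traverse())
print(terminals, total_length)
print("\n".join(draw_ascii(root)))
```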

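Genome diagrams

Figure 3: A diagram of the genes on the pKPS77 plasmid,^[14] visualized using the GenomeDiagram module in Biopython.

The GenomeDiagram module provides methods for visualizing sequences within Biopython.^[15] Sequences can be drawn in linear or circular form (see Figure 3), and many output formats are supported, including PDF and PNG. A diagram is built by creating tracks and then adding sequence features to those tracks. By looping over a sequence's features and using their attributes to decide whether and how each one is added to the diagram's tracks, the user can exercise fine control over the appearance of the final diagram. Cross-links can be drawn between tracks, allowing multiple sequences to be compared in a single diagram.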

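Macromolecular structure

The Bio.PDB module, added to Biopython in 2003,^[16] can load molecular structures from PDB and mmCIF files. The Structure object is central to the module and organizes macromolecular structure hierarchically: Structure objects contain Model objects, which contain Chain objects, which contain Residue objects, which contain Atom objects. Disordered residues and atoms have their own classes, DisorderedResidue and DisorderedAtom, which describe their uncertain positions.

Bio.PDB makes it possible to navigate the individual components of a macromolecular structure file, for example to examine each atom in a protein. Common analyses can be carried out, such as measuring distances or angles, comparing residues, and calculating residue depth.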
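
The Structure, Model, Chain, Residue, Atom hierarchy and a simple distance measurement can be sketched with plain dataclasses. The classes below borrow the level names from the text but are hypothetical stand-ins, not Bio.PDB's API. The two-residue structure and its coordinates are invented for the example.

```python
import math
from dataclasses import dataclass, field
from typing import Dict, Iterator, Tuple


@dataclass
class Atom:
    name: str
    coord: Tuple[float, float, float]


@dataclass
class Residue:
    resname: str
    atoms: Dict[str, Atom] = field(default_factory=dict)


@dataclass
class Chain:
    residues: Dict[int, Residue] = field(default_factory=dict)


@dataclass
class Model:
    chains: Dict[str, Chain] = field(default_factory=dict)


@dataclass
class Structure:
    models: Dict[int, Model] = field(default_factory=dict)

    def iter_atoms(self) -> Iterator[Atom]:
        """Walk Structure -> Model -> Chain -> Residue -> Atom."""
        for model in self.models.values():
            for chain in model.chains.values():
                for residue in chain.residues.values():
                    yield from residue.atoms.values()


def distance(a: Atom, b: Atom) -> float:
    """Euclidean distance between two atoms, in the units of the coordinates."""
    return math.dist(a.coord, b.coord)


# Invented two-residue structure: one model, one chain "A".
gly = Residue("GLY", {"CA": Atom("CA", (0.0, 0.0, 0.0))})
ala = Residue("ALA", {"CA": Atom("CA", (3.0, 4.0, 0.0))})
structure = Structure({0: Model({"A": Chain({1: gly, 2: ala})})})

atom_count = len(list(structure.iter_atoms()))
ca_distance = distance(gly.atoms["CA"], ala.atoms["CA"])
print(atom_count, ca_distance)
```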

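Population genetics

The Bio.PopGen module adds support in Biopython for Genepop, a software package for the statistical analysis of population genetics.^[17] This allows analysis of Hardy-Weinberg equilibrium, linkage disequilibrium, and other features of a population's allele frequencies.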
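
For readers unfamiliar with Hardy-Weinberg equilibrium, the following worked sketch computes the expected genotype counts for one locus with two alleles and a simple chi-square statistic. It is a textbook illustration only and does not reproduce what Genepop or Bio.PopGen do internally. The function name and the genotype counts are made up, and both alleles are assumed to be present in the sample.

```python
from typing import Tuple


def hardy_weinberg(n_AA: int, n_Aa: int, n_aa: int) -> Tuple[float, Tuple[float, float, float], float]:
    """Return (frequency of allele A, expected genotype counts, chi-square statistic).

    Assumes both alleles occur in the sample, so no expected count is zero.
    """
    n = n_AA + n_Aa + n_aa
    p = (2 * n_AA + n_Aa) / (2 * n)   # each AA carries two A alleles, each Aa one
    q = 1.0 - p
    expected = (p * p * n, 2 * p * q * n, q * q * n)
    observed = (n_AA, n_Aa, n_aa)
    chi2 = sum((o - e) ** 2 / e for o, e in zip(observed, expected))
    return p, expected, chi2


# Made-up sample: 30 AA, 40 Aa and 30 aa individuals.
p, expected, chi2 = hardy_weinberg(30, 40, 30)
print(p, expected, chi2)
```

With these counts the allele frequency is 0.5, so equilibrium predicts 25, 50 and 25 individuals; the observed shortage of heterozygotes gives a chi-square statistic of 4.0.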

The module can also carry out population genetic simulations based on coalescent theory using the fastsimcoal2 program.^[18]

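Wrappers for command-line tools

Many of Biopython's modules contain command-line wrappers for commonly used tools, allowing these tools to be used from within Biopython. The wrapped tools include BLAST, Clustal, PhyML, EMBOSS, and SAMtools. Users can subclass a generic wrapper class to add support for any other command-line tool.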
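
The subclassing pattern mentioned above can be sketched with the standard library alone. The `CommandLine` base class below is hypothetical and is not Biopython's wrapper API. It turns keyword options into an argument list and runs the tool with `subprocess`. The Python interpreter stands in for an external tool so that the example runs anywhere.

```python
import subprocess
import sys
from typing import Dict, List


class CommandLine:
    """Hypothetical generic wrapper: turns keyword options into an argument list."""

    def __init__(self, executable: str, **options: str) -> None:
        self.executable = executable
        self.options: Dict[str, str] = dict(options)

    def build_args(self) -> List[str]:
        """Build the argument list: executable, then '-name value' per option."""
        args = [self.executable]
        for name, value in self.options.items():
            args.extend([f"-{name}", str(value)])
        return args

    def run(self) -> str:
        """Run the tool and return its standard output as text."""
        result = subprocess.run(
            self.build_args(), capture_output=True, text=True, check=True
        )
        return result.stdout


class PythonEval(CommandLine):
    """Subclass for one specific tool; the Python interpreter stands in for it."""

    def __init__(self, code: str) -> None:
        super().__init__(sys.executable, c=code)


tool = PythonEval("print('GATTACA'[::-1])")
print(tool.build_args()[1:])
print(tool.run().strip())
```

Passing the arguments as a list rather than a single shell string avoids quoting problems when option values contain spaces.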

See also

Open Bioinformatics Foundation
BioPerl
BioRuby
BioJS
BioJava

References

[1] Chapman, Brad; Chang, Jeff. Biopython: Python tools for computational biology. ACM SIGBIO Newsletter. August 2000, 20 (2): 15–19. S2CID 9417766. doi:10.1145/360262.360268.
[2] Release biopython-181: Commit Release 1.81 (#4233). [Retrieved 22 April 2023].
[3] Release 183. 10 January 2024 [Retrieved 19 January 2024].
[4] Cock, Peter JA; Antao, Tiago; Chang, Jeffery T; Chapman, Brad A; Cox, Cymon J; Dalke, Andrew; Friedberg, Iddo; Hamelryck, Thomas; Kauff, Frank; Wilczynski, Bartek; de Hoon, Michiel JL. Biopython: freely available Python tools for computational molecular biology and bioinformatics. Bioinformatics. 20 March 2009, 25 (11): 1422–3. PMC 2682512. PMID 19304878. doi:10.1093/bioinformatics/btp163.
[5] Refer to the Biopython website for other papers describing Biopython, and a list of over one hundred publications using/citing Biopython.
[6] Mangalam, Harry. The Bio* toolkits—a brief overview. Briefings in Bioinformatics. September 2002, 3 (3): 296–302. PMID 12230038. doi:10.1093/bib/3.3.296.
[7] Chapman, Brad. The Biopython Project: Philosophy, functionality and facts (PDF). 11 March 2004 [Retrieved 11 September 2014]. Archived (PDF) from the original on 2023-06-03.
[8] List of Biopython contributors. [Retrieved 11 September 2014]. Archived from the original on 11 September 2014.
[9] Knight, R; Maxwell, P; Birmingham, A; Carnes, J; Caporaso, J. G.; Easton, B. C.; Eaton, M; Hamady, M; Lindsay, H; Liu, Z; Lozupone, C. PyCogent: A toolkit for making sense from sequence. Genome Biology. 2007, 8 (8): R171. PMC 2375001. PMID 17708774. doi:10.1186/gb-2007-8-8-r171.
[10] Daley, Chris. Biopython 1.77 released. [Retrieved 6 October 2021]. Archived from the original on 2023-10-29.
[11] Chang, Jeff; Chapman, Brad; Friedberg, Iddo; Hamelryck, Thomas; de Hoon, Michiel; Cock, Peter; Antao, Tiago; Talevich, Eric; Wilczynski, Bartek. Biopython Tutorial and Cookbook. 29 May 2014 [Retrieved 28 August 2014]. Archived from the original on 2015-01-01.
[12] Zmasek, Christian M; Zhang, Qing; Ye, Yuzhen; Godzik, Adam. Surprising complexity of the ancestral apoptosis network. Genome Biology. 24 October 2007, 8 (10): R226. PMC 2246300. PMID 17958905. doi:10.1186/gb-2007-8-10-r226.
[13] Talevich, Eric; Invergo, Brandon M; Cock, Peter JA; Chapman, Brad A. Bio.Phylo: A unified toolkit for processing, analyzing and visualizing phylogenetic trees in Biopython. BMC Bioinformatics. 21 August 2012, 13 (209): 209. PMC 3468381. PMID 22909249. doi:10.1186/1471-2105-13-209.
[14] Klebsiella pneumoniae strain KPS77 plasmid pKPS77, complete sequence. NCBI. [Retrieved 10 September 2014].
[15] Pritchard, Leighton; White, Jennifer A; Birch, Paul RJ; Toth, Ian K. GenomeDiagram: a python package for the visualization of large-scale genomic data. Bioinformatics. March 2006, 22 (5): 616–617. PMID 16377612. doi:10.1093/bioinformatics/btk021.
[16] Hamelryck, Thomas; Manderick, Bernard. PDB file parser and structure class implemented in Python. Bioinformatics. 10 May 2003, 19 (17): 2308–2310. PMID 14630660. doi:10.1093/bioinformatics/btg299.
[17] Rousset, François. GENEPOP'007: a complete re-implementation of the GENEPOP software for Windows and Linux. Molecular Ecology Resources. January 2008, 8 (1): 103–106. PMID 21585727. S2CID 25776992. doi:10.1111/j.1471-8286.2007.01931.x.
[18] Excoffier, Laurent; Foll, Matthieu. fastsimcoal: a continuous-time coalescent simulator of genomic diversity under arbitrarily complex evolutionary scenarios. Bioinformatics. 1 March 2011, 27 (9): 1332–1334. PMID 21398675. doi:10.1093/bioinformatics/btr124.

External links

Official website
Biopython Tutorial (PDF)
Biopython source code on GitHub