PHYLUCE is a software package for the analysis of conserved genomic loci

Authors: Brant C. Faircloth

Abstract

Summary: Targeted enrichment of conserved and ultraconserved genomic elements allows universal collection of phylogenomic data from hundreds of species at multiple time scales (< 5 Ma to > 300 Ma). Prior to downstream inference, data from these types of targeted enrichment studies must undergo pre-processing to assemble contigs from sequence data; identify targeted, enriched loci from the off-target background data; align enriched contigs representing conserved loci to one another; and prepare and manipulate these alignments for subsequent phylogenomic inference. PHYLUCE is an efficient and easy-to-install software package that accomplishes these tasks across hundreds of taxa and thousands of enriched loci.

Availability and Implementation: PHYLUCE is written for Python 2.7. PHYLUCE is supported on OSX and Linux (RedHat/CentOS) operating systems. PHYLUCE source code is distributed under a BSD-style license from https://www.github.com/faircloth-lab/phyluce/. PHYLUCE is also available as a package (https://binstar.org/faircloth-lab/phyluce) for the Anaconda Python distribution that installs all dependencies, and users can request a PHYLUCE instance on iPlant Atmosphere (tag: phyluce). The software manual and a tutorial are available from http://phyluce.readthedocs.org/en/latest/ and test data are available from doi: 10.6084/m9.figshare.1284521.

Contact: brant@faircloth-lab.org

Supplementary information: Supplementary Figure 1.
Brant C. Faircloth1,*

1Department of Biological Sciences and Museum of Natural Science, Louisiana State University, Baton Rouge, LA 70803, USA

*To whom correspondence should be addressed.

This article has been accepted for publication in Bioinformatics © 2015 Oxford University Press. Published by Oxford University Press. All rights reserved. http://doi.org/10.1093/bioinformatics/btv646

bioRxiv preprint doi: https://doi.org/10.1101/027904; this version posted October 30, 2015. The copyright holder for this preprint (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under a CC-BY 4.0 International license.

1 Introduction

Target enrichment of conserved and ultraconserved elements (hereafter “conserved loci”) allows universal phylogenomic analyses of non-model organisms (Faircloth et al. 2012; Faircloth et al. 2013; Faircloth et al. 2015) at multiple time scales (Faircloth et al. 2012; Smith et al. 2014). The strength of the approach derives from its ability to collect sequence data from thousands of loci across hundreds of species, permitting phylogenetic comparisons across deep phylogenetic breaks such as organismal Classes (> 200-300 Ma) and shallower evolutionary divergences such as populations (< 0.5 – 5 Ma). When the goal of data collection is to infer the evolutionary history of species, the subsequent analytical tasks are generally to: (1) assemble the sequencing reads, which may span tens to hundreds of individuals; (2) identify putative orthologs among the assembled contigs on a sample-by-sample basis while removing putative paralogs; (3) easily generate datasets that contain different individuals, individuals included from other experiments, or individual genome sequences; (4) identify and export sequence data from orthologs across all individuals in the set; (5) align the data and optionally trim resulting alignments in preparation for phylogenetic inference; (6) compute summary statistics on the aligned data; and (7) perform utility functions on the sequence or alignment data to prepare them for downstream analyses using a variety of phylogenetic inference programs. PHYLUCE (pronounced “phy-loo-chee”) is the first open-source, easy-to-install software package to perform these tasks for target enriched, conserved loci in a computationally efficient manner.

2 Workflow and features

The PHYLUCE workflow (Supplementary Figure 1) for inferring phylogeny begins with external preparation of sequence reads from target-enriched libraries by trimming adapter contamination and low-quality bases using a program like Trimmomatic (Bolger et al. 2014) or a batch processing script similar to illumiprocessor (https://github.com/faircloth-lab/illumiprocessor). PHYLUCE then offers several programs to batch-assemble the resulting “clean” reads into contigs using different assembly programs (Zerbino and Birney 2008; Simpson et al. 2009; Grabherr et al. 2011) with parallelization approaches tailored to each program. The next step in the PHYLUCE workflow is to identify orthologous conserved loci shared among individuals. The match_contigs_to_probes program performs the steps of ortholog identification and paralog removal by aligning the assembled contigs to a FASTA file of target enrichment baits using lastz (Harris 2007). Although this program is designed to work with standardized bait sets developed for the targeted enrichment of UCE loci (e.g. http://ultraconserved.org), users can input custom bait sets with different naming conventions targeting different classes of loci by adjusting several parameters (e.g., Mandel et al. 2014).
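
As a rough illustration of how a naming convention can be made adjustable, the sketch below maps bait names to locus names with a regular expression. The header formats (`uce-7_p1`, `locus42-bait3`), both patterns, and the function name `locus_from_bait` are assumptions made for this example; they are not taken from the PHYLUCE source or its documented parameters.

```python
import re

# Assumed convention: several baits tile one locus and are named "<locus>_p<N>".
UCE_PATTERN = re.compile(r"^(uce-\d+)(?:_p\d+)?")

# Hypothetical custom bait set that uses a different naming convention.
CUSTOM_PATTERN = re.compile(r"^(locus\d+)(?:-bait\d+)?")


def locus_from_bait(bait_name, pattern=UCE_PATTERN):
    """Return the locus a bait belongs to, or None if the name does not match."""
    match = pattern.search(bait_name)
    return match.group(1) if match else None


print(locus_from_bait("uce-7_p1"))                       # uce-7
print(locus_from_bait("locus42-bait3", CUSTOM_PATTERN))  # locus42
```

Under this design, supporting a different bait set only requires supplying a different pattern; the matching logic itself stays unchanged.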
Following the alignment step, match_contigs_to_probes screens the lastz output to identify (1) assembled contigs hit by probes targeting different loci, and (2) different contigs that are hit by probes targeting the same locus. The program assumes that these reciprocally duplicate loci are potentially paralogous and removes them from downstream analytical steps. The program then builds a relational database containing a table of detections and non-detections at each locus across all input assemblies as well as a table associating the name of each targeted locus (from the FASTA file representing the bait set) with the name of the assembled contig to which it matches. Next, users of PHYLUCE create a “taxon-set” configuration file that specifies the individual assemblies that will be used in downstream phylogenetic analyses. By inputting this configuration file to the get_match_counts program, users can flexibly create different data sets, integrate data from separate studies targeting the same loci, or include identical loci harvested from published genome sequences (e.g. http://github.com/faircloth-lab/uce-probe-sets). After identifying those individuals and loci in the desired taxon set, users extract the contigs corresponding to non-duplicate conserved loci into a monolithic (all loci for all taxa) FASTA-formatted file using the get_fastas_from_match_counts program. This program renames each contig for each species within the taxon set such that the FASTA header for each contig contains information denoting the species in which the conserved locus was detected and the specific conserved locus to which it matched. After creating the monolithic FASTA, users can align the targeted loci with the seqcap_align program, which parallelizes MAFFT (Katoh and Standley 2013) or MUSCLE (Edgar 2004) alignments across all targeted loci on computers with multiple CPUs. The seqcap_align program also offers the option to trim the resulting alignments for edges that are poorly aligned, which is a suitable choice when the species within the taxon set are closely related (e.g., roughly Order-level or lower taxonomic ranks, < 50 Ma). To apply more aggressive alignment trimming when relationships are older (> 50 Ma), PHYLUCE provides a similar program that implements parallelized, internal trimming using Gblocks (Castresana 2000; Talavera and Castresana 2007).

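
The two-way duplicate screen described above can be sketched in a few lines. This is a minimal illustration, assuming the lastz output has already been reduced to (contig, locus) pairs for a single assembly; the function name `screen_hits` and the input layout are hypothetical and do not reproduce the code in match_contigs_to_probes.

```python
from collections import defaultdict


def screen_hits(hits):
    """Drop potentially paralogous loci from one assembly's bait hits.

    hits: iterable of (contig_name, locus_name) pairs.
    Returns a dict {locus: contig} for loci that pass both duplicate checks.
    """
    contig_to_loci = defaultdict(set)
    locus_to_contigs = defaultdict(set)
    for contig, locus in hits:
        contig_to_loci[contig].add(locus)
        locus_to_contigs[locus].add(contig)

    kept = {}
    for locus, contigs in locus_to_contigs.items():
        if len(contigs) != 1:
            continue  # case 2: different contigs hit by baits for the same locus
        (contig,) = contigs
        if len(contig_to_loci[contig]) != 1:
            continue  # case 1: one contig hit by baits targeting different loci
        kept[locus] = contig
    return kept


hits = [
    ("contig1", "uce-1"), ("contig1", "uce-1"),  # two baits, same locus: kept
    ("contig2", "uce-2"), ("contig2", "uce-3"),  # contig hit by two loci: dropped
    ("contig3", "uce-4"), ("contig4", "uce-4"),  # locus on two contigs: dropped
    ("contig5", "uce-5"),
]
print(screen_hits(hits))  # {'uce-1': 'contig1', 'uce-5': 'contig5'}
```

The returned mapping corresponds to the locus-to-contig association that the text says is stored in the relational database; loci absent from it would be recorded as non-detections for that assembly.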
PHYLUCE includes several parallelized programs to manipulate the resulting alignments, including the ability to rapidly generate summary statistics across thousands of alignments, explode alignments into their corresponding FASTA sequences, extract taxa from alignments, compute parsimony informative sites within alignments, and convert alignments between common formats, and these programs can also be used with alignments from other data types.

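
To make one of these utilities concrete, the sketch below counts parsimony informative sites using the standard definition: a column is informative when at least two character states each occur in at least two taxa. Treating gaps and the ambiguity codes `?`, `N`, and `X` as missing is an assumption of this example, and the function name `count_informative_sites` is hypothetical rather than a PHYLUCE program name.

```python
from collections import Counter

IGNORED = set("-?NX")  # assumption: gaps and ambiguous bases carry no signal


def count_informative_sites(alignment):
    """alignment: dict {taxon: aligned sequence}; sequences must share one length."""
    seqs = [seq.upper() for seq in alignment.values()]
    if len({len(seq) for seq in seqs}) > 1:
        raise ValueError("sequences must be aligned to the same length")
    informative = 0
    for column in zip(*seqs):
        counts = Counter(base for base in column if base not in IGNORED)
        # Informative: at least two states, each observed in at least two taxa.
        if sum(1 for n in counts.values() if n >= 2) >= 2:
            informative += 1
    return informative


alignment = {
    "taxon_a": "ACGTA-",
    "taxon_b": "ACGTAC",
    "taxon_c": "ATGCAC",
    "taxon_d": "ATGAAG",
}
print(count_informative_sites(alignment))  # 1 (only the second column qualifies)
```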
After alignment, PHYLUCE users can generate data matrices having varying levels of completeness using the get_only_loci_with_min_taxa program. This program screens each locus for taxonomic completeness and filters out loci containing fewer taxa than desired. In this way, users can create 100% complete (all taxa have data for all loci) or incomplete data matrices (some loci have data for a certain percentage of taxa). After filtering loci for taxonomic completeness, PHYLUCE offers several programs to format resulting alignments for analyses in PartitionFinder (Lanfear et al. 2012), RAxML (Stamatakis 2014), ExaBayes (Aberer et al. 2014), GARLI (Zwickl 2006), or MrBayes (Ronquist and Huelsenbeck 2003). Programs are also available to assist users with preparing data for and running gene-tree-based species tree analyses.
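
The completeness filter described above amounts to a per-locus threshold on the number of taxa present. The sketch below assumes each alignment has been summarized as the set of taxa it contains; the function name `filter_by_completeness` and its arguments are illustrative and are not the interface of get_only_loci_with_min_taxa.

```python
def filter_by_completeness(alignments, total_taxa, min_fraction):
    """Keep loci whose alignments contain at least min_fraction of all taxa.

    alignments: dict {locus: set of taxon names present in that alignment}.
    """
    return {
        locus: taxa
        for locus, taxa in alignments.items()
        if len(taxa) / total_taxa >= min_fraction
    }


alignments = {
    "uce-1": {"a", "b", "c", "d"},
    "uce-2": {"a", "b", "c"},
    "uce-3": {"a"},
}
print(sorted(filter_by_completeness(alignments, 4, 1.0)))   # ['uce-1']
print(sorted(filter_by_completeness(alignments, 4, 0.75)))  # ['uce-1', 'uce-2']
```

A threshold of 1.0 yields the 100% complete matrix, while lower thresholds admit more loci at the cost of more missing data.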
Acknowledgements

I thank Carl Oliveros, Nick Crawford, and Mike Harvey for contributing to the source code and Travis Glenn, John McCormack, Michael Alfaro, Robb Brumfield, Brian Smith, and Kevin Winker for contributing to early UCE studies. Comments from David Posada and three anonymous reviewers improved this manuscript.

Funding

This work was supported by the National Science Foundation Division of Environmental Biology (grant numbers DEB-1242260, DEB-0956069, DEB-0841729, DEB-1354739) and start-up funds provided by Louisiana State University.

Conflict of Interest: none declared.

References

Aberer,A.J. et al. (2014) ExaBayes: massively parallel bayesian tree inference for the whole-genome era. Mol. Biol. Evol., 31, 2553–2556.

Bolger,A.M. et al. (2014) Trimmomatic: a flexible trimmer for Illumina sequence data. Bioinformatics, 30, 2114–2120.

Castresana,J. (2000) Selection of conserved blocks from multiple alignments for their use in phylogenetic analysis. Mol. Biol. Evol., 17, 540–552.

Edgar,R.C. (2004) MUSCLE: a multiple sequence alignment method with reduced time and space complexity. BMC Bioinformatics, 5, 113–119.

Faircloth,B.C. et al. (2013) A phylogenomic perspective on the radiation of ray-finned fishes based upon targeted sequencing of ultraconserved elements (UCEs). PLoS One, 8, e65923.

Faircloth,B.C. et al. (2015) Target enrichment of ultraconserved elements from arthropods provides a genomic perspective on relationships among Hymenoptera. Mol. Ecol. Resour., 15, 489–501.

Faircloth,B.C. et al. (2012) Ultraconserved Elements Anchor Thousands of Genetic Markers Spanning Multiple Evolutionary Timescales. Syst. Biol., 61, 717–726.

Grabherr,M.G. et al. (2011) Full-length transcriptome assembly from RNA-Seq data without a reference genome. Nat. Biotechnol., 29, 644–U130.

Harris,R.S. (2007) Improved pairwise alignment of genomic DNA.

Katoh,K. and Standley,D.M. (2013) MAFFT multiple sequence alignment software version 7: improvements in performance and usability. Mol. Biol. Evol., 30, 772–780.

Lanfear,R. et al. (2012) Partitionfinder: combined selection of partitioning schemes and substitution models for phylogenetic analyses. Mol. Biol. Evol., 29, 1695–1701.

Mandel,J.R. et al. (2014) A target enrichment method for gathering phylogenetic information from hundreds of loci: An example from the Compositae. Applications in Plant Sciences, 2.

Ronquist,F. and Huelsenbeck,J.P. (2003) MrBayes 3: Bayesian phylogenetic inference under mixed models. Bioinformatics, 19, 1572–1574.

Simpson,J. et al. (2009) ABySS: a parallel assembler for short read sequence data. Genome Res., 19, 1117–1123.

Smith,B.T. et al. (2014) Target Capture and Massively Parallel Sequencing of Ultraconserved Elements (UCEs) for Comparative Studies at Shallow Evolutionary Time Scales. Syst. Biol., 63, 83–95.

Stamatakis,A. (2014) RAxML version 8: a tool for phylogenetic analysis and post-analysis of large phylogenies. Bioinformatics, 30, 1312–1313.

Talavera,G. and Castresana,J. (2007) Improvement of phylogenies after removing divergent and ambiguously aligned blocks from protein sequence alignments. Syst. Biol., 56, 564–577.

Zerbino,D.R. and Birney,E. (2008) Velvet: Algorithms for de novo short read assembly using de Bruijn graphs. Genome Res., 18, 821–829.

Zwickl,D.J. (2006) Genetic algorithm approaches for the phylogenetic analysis of large biological sequence datasets under the maximum likelihood criterion.

[Figure: flowchart of the PHYLUCE workflow. Data states: Cleaned Reads, Assembled Contigs, UCE Contigs, UCE Data Set, Aligned UCE Data Set. Steps and programs: Raw Read Preparation (Trimmomatic, illumiprocessor); Assembly (Velvet: assemblo_velvet; Trinity: assemblo_trinity; ABySS: assemblo_abyss); Ortholog Detection and Paralog Removal (lastz: match_contigs_to_probes); Data Set Creation (taxon set: get_match_counts, get_fastas_from_match_counts); Alignment (MUSCLE or MAFFT: seqcap_align); Edge Trimming (seqcap_align) or Internal Trimming (Gblocks: get_gblocks_trimmed_alignments_from_untrimmed); Alignment Cleaning (remove_locus_name_from_nexus_lines); Alignment Summary (get_align_summary_data); Data Matrix Selection (get_only_loci_with_min_taxa); Concatenated Data Preparation (format_nexus_files_for_mrbayes, format_nexus_files_for_raxml); Concatenated Data Analysis (MrBayes, RAxML, GARLI, ExaBayes, ExaML) or Gene Tree - Species Tree Analysis.]
Supplementary Figure 1. PHYLUCE workflow for phylogenomic analyses of data collected from conserved genomic loci using targeted enrichment.