Cotton (Gossypium hirsutum L.) is the key renewable fibre crop worldwide, yet its yield and fibre quality show high variability due to genotype-specific traits and complex interactions among cultivars, management practices and environmental factors. Modern breeding practices may limit future yield gains due to a narrow founding gene pool. Precision breeding and biotechnological approaches offer potential solutions, contingent on accurate cultivar-specific data. Here we address this need by generating high-quality reference genomes for three modern cotton cultivars (‘UGA230’, ‘UA48’ and ‘CSX8308’) and updating the ‘TM-1’ cotton genetic standard reference. Despite hypothesized genetic uniformity, considerable sequence and structural variation was observed among the four genomes, and this variation overlaps with ancient and ongoing genomic introgressions from ‘Pima’ cotton, gene regulatory mechanisms and phenotypic trait divergence. Differentially expressed genes across fibre development correlate with fibre production, potentially contributing to the distinctive fibre quality traits observed in modern cotton cultivars. These genomes and comparative analyses provide a valuable foundation for future genetic endeavours to enhance global cotton yield and sustainability.
Fig. 1: Structure and contiguity of the TM-1 cotton reference genome.
The v2 and v3 reference genome sequences were subjected to contig position mapping by GENESPACE. a, The contigs in each genome (v2, left; v3, right) as a continuous block of a single colour. Given the substantial differences in contiguity, a continuous yellow–blue palette with ten colours was selected for v2, while a discrete three-colour sequence (pink, purple, blue) was used for v3. b, The difference in genome architecture between the A (top) and D (bottom) subgenomes of the tetraploid TM-1 v3 cotton. Repeat and gene density were hierarchically inferred, classifying the genomes into exons, Ty3 repeats, other repeats (from RepeatMasker), introns and other (white). Sliding windows (5 Mb width, 1 Mb steps) are plotted. Decomposed blocks of alignments from minimap2 are shown between the two subgenomes.
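The sliding-window scheme named in the caption (5 Mb width, 1 Mb steps) can be illustrated with a minimal Python sketch. The function names, the half-open coordinate convention and the clipping of windows at the chromosome end are assumptions made for illustration; they are not taken from the authors' pipeline.

```python
from typing import Iterable, Iterator, Tuple

Interval = Tuple[int, int]  # half-open (start, end) in base pairs


def sliding_windows(chrom_length: int,
                    width: int = 5_000_000,
                    step: int = 1_000_000) -> Iterator[Interval]:
    """Yield overlapping windows along a chromosome.

    Assumption: windows that run past the chromosome end are clipped
    rather than dropped.
    """
    if width <= 0 or step <= 0:
        raise ValueError("width and step must be positive")
    start = 0
    while start < chrom_length:
        yield start, min(start + width, chrom_length)
        start += step


def window_density(features: Iterable[Interval], start: int, end: int) -> float:
    """Fraction of the window covered by features.

    Assumption: feature intervals do not overlap one another.
    """
    covered = sum(max(0, min(e, end) - max(s, start)) for s, e in features)
    return covered / (end - start)


# Hypothetical 12 Mb chromosome with one 1 Mb feature block.
windows = list(sliding_windows(12_000_000))
toy_features = [(1_000_000, 2_000_000)]
densities = [window_density(toy_features, s, e) for s, e in windows]
```

Because each window is five times wider than the step, every position away from the chromosome ends contributes to five consecutive windows, which smooths the plotted density tracks.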
nature.com -
Fig. 2: Molecular and phenotypic variation between TM-1 and the three modern cultivars.
a,b, Principal component analysis (PCA) of 7.3 million SNPs across 218 landrace, 228 improved/modern and TM-1 genotypes, shown for the full set of materials (a) and for TM-1 and the improved/modern lines only (b). c, The same set of polymorphic SNPs was used to calculate genetic distances among the polishing libraries of the four reference genomes; these distances are plotted with x–y positions given by multidimensional scaling (MDS) coordinates. d, Scanning electron microscopy images and data from representative fibres that had a circumference close to the mean of each genotype (n = 60). The numerical value within each image refers to the exact circumference of the fibre.
Fig. 3: Synteny and PAV across four cotton genomes.
a, Completely collinear (grey), inverted (red) and PAV (white wedges) sequences are plotted on a common coordinate system across the genomes. b, Zoomed-in contact maps of both TM-1 (left) and CSX8308 (right) Hi-C libraries mapped to the TM-1 reference are shown to highlight the chromosome A06 inversion found only in CSX8308. The off-diagonal ‘hourglass’ contacts in CSX8308 clearly confirm the presence of this inversion relative to TM-1. c, Gene family PAV within genomes is presented. Gene families private to TM-1 (yellow) and the modern cultivars (pink) are highlighted. d, Gene family PAV for ‘liftover’ gene model projection from the UA48 annotation onto the other three genomes demonstrates that hundreds of gene sequences are completely missing across the genomes.
Conclusion
A more complete reference genome for cultivated cotton
The cotton breeding and genetics community currently relies on the v2 reference sequence of TM-1 as the foundation for sequence and marker discovery. While serviceable, the TM-1 v2 reference sequence suffers from two major limitations. First, the previous assembly was unable to accurately distinguish sequences in the substantial and highly repetitive pericentromeres of the cotton genome, which produced a fragmented assembly with 5,723 contigs (Fig. 1a). To provide a foundation for further cotton comparative genomics and reference-based approaches, we reconstructed the TM-1 reference genome using deep (116.7×) PacBio continuous long read (CLR) sequencing, 55.0× Illumina sequence polishing (Methods) and Hi-C scaffolding (172×). Heterozygosity tends to be very low in inbred tetraploid cotton cultivars, and TM-1 is no exception, with 12,173 heterozygous sites (single-nucleotide polymorphisms (SNPs) or insertions and deletions (indels)) across the 2,154 million callable bases (5.6 heterozygous sites per megabase). This low heterozygosity also justifies a haploid genome assembly representation and the use of CLR sequencing technology.
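The per-megabase heterozygosity quoted above follows directly from the two counts in the text. The short Python sketch below reproduces the arithmetic; the function name is hypothetical, and the callable-base count is taken as exactly 2,154 million, which is an assumption since the text reports it only to that precision.

```python
def sites_per_megabase(n_sites: int, callable_bases: int) -> float:
    """Return the number of variant sites per one million callable bases."""
    if callable_bases <= 0:
        raise ValueError("callable_bases must be positive")
    return n_sites / (callable_bases / 1_000_000)


# Counts reported in the text for TM-1.
het_sites = 12_173
callable_bases = 2_154_000_000  # assumed exact; reported as "2,154 million"

density = sites_per_megabase(het_sites, callable_bases)
print(f"{density:.2f} heterozygous sites per Mb")
```

With these inputs the result is about 5.65 sites per megabase, in line with the reported value of 5.6; the small difference presumably reflects rounding of the inputs.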