Mass-spectrometry-based draft of the Arabidopsis proteome

Data reporting

No statistical methods were used to predetermine sample size. Samples for mass spectrometry and RNA-seq measurements were prepared and measured in a random order. The investigators were not blinded to allocation during experiments and outcome assessment.

Sample preparation

Inflorescence and seed samples

A. thaliana wild-type Columbia-0 (Col-0) plants were grown under continuous white light conditions at 22 °C. Samples for flower parts and siliques were collected from mature plants. Stage 15 flowers⁴⁴ were dissected into petal, sepal, stamen and carpel. Fully grown green siliques were separated into septum, valves and green seeds (stage 10). Surface-sterilized mature dry seeds were stored for 2 days at 4 °C and subsequently imbibed for 24 h or kept dry. For pollen collection, plants were grown on soil under a long photoperiod (16 h of light, 21 °C, 65% humidity). Mature pollen was bulk-harvested from open flowers at developmental stage 13⁴⁴. A vacuum cleaner was modified with three subsequent nylon meshes (80 μm, 35 μm, 10 μm mesh) for large-scale pollen isolation as previously described⁴⁵. Pollen was collected in a 1.5-ml reaction tube, snap-frozen in liquid nitrogen and stored at −80 °C until further use.

Cell culture and callus samples

Cell culture samples (root cells, Col-0) were grown in medium composed of 4.3 g l⁻¹ Murashige and Skoog basal medium (MS), 30 g l⁻¹ sucrose and 0.33 mg ml⁻¹ KH₂PO₄, supplemented with 2,4-dichlorophenoxyacetic acid (2,4-D) (final concentration 1 mg l⁻¹; pH adjusted to 5.8 with KOH) at 22 °C under continuous light and obtained either 3 or 10 days after sub-culturing. To generate callus-inducing medium (CIM) callus, root explants were collected from two-week-old seedlings (Col-0) grown in sterile culture on MS plates. Then, 5–10 mm long root segments were cultured on CIM medium composed of 1× Gamborg’s B-5 salts, 20 g l⁻¹ glucose, 0.5 g l⁻¹ MES, 1× Gamborg’s vitamin solution and 1% Phytoagar supplemented with 2,4-D (500 μg l⁻¹) and kinetin (50 μg l₋₁). CIM calli appeared after 7–10 days and were propagated in 2-week subculture intervals on CIM medium. To generate the callus line with an egg-cell-like expression profile, the coding sequence of RKD2 (AT1G74480) was amplified from pistil cDNA using the primer pair RKD2fw and RKD2rev (Supplementary Data 11). Pistils from flowers at developmental stage 12⁴⁵ were collected to purify mRNA and generate cDNA as previously described⁴⁶. The PCR fragment was cloned into pENTR/D-TOPO (Invitrogen) and subsequently transferred into the GATEWAY-compatible destination vector pH7FWG2.0⁴⁷ with LR clonase. The resulting expression vector 35Sp:RKD2-GFP was used for floral dip transformation of Arabidopsis⁴⁸. Seeds of transformed plants were surface-sterilized and grown on half-strength MS medium with 2% (w/v) sucrose, 1% phytoagar and hygromycin (30 μg ml⁻¹). RKD2-induced calli had formed after 20–30 days and were propagated in two-week subculture intervals on half-strength MS, 2% (w/v) sucrose, 1% phytoagar and hygromycin (30 μg ml⁻¹). Calli were collected with a sterile scalpel blade and immediately frozen in liquid nitrogen.

Leaf and root samples

Arabidopsis plants (Col-0) were grown under continuous light conditions at 22 °C. Senescent rosette leaves were collected from 35-day-old plants. Samples for stem, first node, second internode and first cauline leaf were collected from 30-day-old plants. Rosette leaf sections (Leaf7: distal; proximal; petiole) were collected from 22-day-old plants before bolting. Seedlings were grown on half-strength MS plates under continuous light for 7 days and separated into cotyledons, hypocotyl, root tip, root maturation zone or seed apical meristem including cotyledons and first leaves. Whole roots were obtained from 22-day-old plants grown under continuous light on half-strength MS plates.

Classification of growth stage and plant section was done as previously described^44,49,50,51. Obtained material from at least three individual plants was combined for each sample, frozen in liquid nitrogen and stored at −80 °C until further use.

Protein lysis and digest

Frozen plant material was homogenized with a tissue lyzer (Qiagen) or with mortar and pestle in liquid nitrogen. Proteins were precipitated overnight with 10% tricholoroacetic acid in acetone at −20 °C and subsequently washed twice with ice-cold acetone. Dry samples were incubated with urea digestion buffer (8 M urea, 50 mM Tris-HCl pH 7.5, 1 mM DTT, cOmplete EDTA-free protease inhibitor cocktail (PIC) (Roche), phosphatase inhibitor (PI-III; in-house, composition resembling phosphatase inhibitor cocktail 1, 2 and 3 from Sigma-Aldrich)) for 1 h. Protein concentration was determined with a Bradford assay⁵². Then, 1 mg of protein was reduced (10 mM dithiothreitol (DTT), 1 h, room temperature), alkylated (55 mM chloroacetamide, 30 min, room temperature) and subsequently diluted 1:8 with digestion buffer (50 mM Tris-HCl pH 8.0, 1 mM CaCl₂). In-solution pre-digestion with trypsin (Roche) was performed for 4 h at 37 °C (1:100 protease:protein ratio), followed by overnight digestion with trypsin (1:100 protease:protein ratio). Samples were acidified to pH 3 using trifluoroacetic acid (TFA) and centrifuged at 14,000g for 15 min at 4 °C. The supernatant was desalted on 50 mg SepPAC SPE catridges (Waters). Peptides were eluted with 0.1% TFA in 50% acetonitrile (ACN) and vacuum-dried in a Thermo Savant SPD SpeedVac (Thermo Fisher Scientific).

Peptide enrichment and off-line fractionation

Fe³⁺-IMAC was performed as described previously with some adjustments⁵³. In brief, desalted peptide samples were re-suspended in ice-cold IMAC loading buffer (0.1% TFA, 40% ACN). For quality control, 1.5 nmol of a synthetic library of phosphopeptides and their corresponding non-phosphorylated counterpart sequence (B2 and F1)⁵⁴ were spiked into each sample before loading onto a Fe³⁺-IMAC column (Propac IMAC-10 4 × 50 mm, Thermo Fisher Scientific). The enrichment was performed with buffer A (0.07% TFA, 30% ACN) as wash buffer and buffer B (0.315% NH₄OH) as elution buffer. Collected full proteome and phosphopeptide fractions were vacuum-dried and stored at −80 °C until further use.

For the full proteome fraction, hydrophilic strong anion-exchange chromatography peptide separation was performed as described previously¹⁵. In brief, an equivalent of 300 μg protein digest was reconstituted in hydrophilic strong anion-exchange solvent A (5 mM Tris-HCl, pH 8.5) and separated using a Dionex Ultimate 3000 HPLC system (Dionex) equipped with an IonPac AG24 guard column (2 × 50 mm) and an IonPac AS24 stong anion exchange column (2 × 250 mm, Thermo Fisher Scientific). Fractions were collected in 96-well format and subsequently pooled to 24. Individual fractions were acidified with formic acid, desalted on self-packed StageTips (five disks, diameter 1.5 mm C18 material, 3 M Empore, elution solvent 0.1% formic acid in 50% ACN) and dried down before liquid chromatography–tandem mass spectrometry (LC–MS/MS) analysis. Phosphopeptide fractions were separated into four fractions using a StageTip (five disks, diameter 1.5 mm C18 material, 3 M Empore) based high pH reversed-phase protocol as previously described⁵⁵. Phophosphopeptides were eluted with 2.5%, 7.5%, 12.5% and 50% ACN in 25 mM NH₄FA, pH 10. The flow through and 50% ACN fraction were combined and all fractions were dried down before LC–MS/MS analysis.

For the CHX and MG132 chase experiments, 300 μg peptides for each sample were reconstituted in high pH reversed-phase loading buffer (2.5 mM NH₄HCO₃, pH 8) and fractionated using a Dionex Ultimate 3000 HPLC system (Dionex) and a Waters XBridge column (BEH130 C18, 3.5 μm, 2.1 × 150 mm; Waters). Peptides were separated by a linear gradient from 4% buffer D to 32% D in 45 min, followed by a linear gradient from 32% D to 80% D in 6 min. The proportion of buffer A was kept at 10% during fractionation (buffer A: 25 mM NH₄HCO₃, pH 8; buffer C: 100% ACN; buffer D: 100% ultrapure water). Fractions were collected in 96-well format, subsequently pooled to 48, acidified and dried in a SpeedVac.

Seed treatment with CHX and MG132

Aliquots of surface-sterilized mature dry seeds (10 mg, Col-0) were stored for two days at 4 °C in the dark and subsequently imbibed with 2 ml liquid half-strength MS at 22 °C under constant light. After 4 h incubation, CHX or MG132 (N-(benzyloxycarbonyl)-Leu-Leu-Leu-al) was added to a concentration of 100 μM. Because both CHX and MG132 were dissolved at the same stock concentration in 100% DMSO, the respective volume of 100% DMSO was added to the control samples. As a baseline sample, one seed aliquot was dried and immediately frozen in liquid nitrogen after the 4 h incubation in half-strength MS medium. Seeds for CHX, MG132 and DMSO control were removed after 8 h, 16 h and 24 h treatment, dried and frozen in liquid nitrogen. Germination of treated seeds was visually checked after four days of incubation. Both compounds CHX and MG132 were active and led to complete or partial inhibition of seed germination as compared to a DMSO treated control sample (Extended Data Fig. 4i).

Protein extraction and digest was performed as described above. Peptide quantification before high pH reversed phase fractionation was done using the Pierce BCA protein assay kit (Thermo Fisher Scientific)⁵⁶.

SEC

Homogenized flower (stage 15), leaf (rosette leaf 7) and root (whole root) samples were lysed in ice-cold 50 mM Tris-HCl pH 7.4, 100 mM NaCl, 10% glycerin, PIC, PI-III and 0.1% Triton-X100. After filtering (diameter 0.2 μm), 0.25 ml lysate containing 1 mg of protein was injected on a Superose 6 10/30 GL column (GE Healthcare) and separated at a flow rate of 250 μl min⁻¹ on an Äkta pure 25 (GE Healthcare). Molecular mass calibration of the column was performed with the high molecular mass gel filtration calibration kit (GE Healthcare). After the void volume, 0 fractions of 125 μl each were collected, vacuum-dried and re-solubilized in urea digestion buffer. In-solution digestion with trypsin and sample desalting on self-packed StageTips was performed as described above. For quality control, PROCAL peptide standard⁵⁷ was spiked into each SEC fraction before LC–MS/MS analysis.

LC–MS/MS analysis

Nanoflow LC-MS/MS was performed by coupling a Dionex 3000 (Thermo Fisher Scientific) to a QExactive Orbitrap HF (Thermo Fisher Scientific). Samples for the proteome or phosphoproteome analysis were re-suspended in loading buffer containing 0.1% formic acid or 50 mM citrate and 1% formic acid, respectively. Peptide loading and washing were done on a trap column (100 μm ID × 2 cm, packed in-house with Reprosil-Pur C18-GOLD, 5 μm resin, Dr. Maisch) at a flow rate of 5 μl min⁻¹ in 100% loading buffer (0.1% formic acid) for 10 min. Peptide separation was performed on an analytical column (75 μm ID × 40 cm packed in-house with Reprosil-Pur C18, 3 μm resin, Dr. Maisch) at a flow rate of 300 nl min⁻¹ using a 110 min gradient from 4% to 32% solvent B (solvent A: 0.1% formic acid, 5% DMSO in HPLC grade water; solvent B: 0.1% formic acid, 5% DMSO in acetonitrile) for the tissue map full proteome analysis and a two-step 110 min gradient from 2% to 27% solvent B for the phosphoproteome analysis⁵⁸. Peptides were ionized using a spray voltage of 2.2 kV and a capillary temperature of 275 °C. The instrument was operated in data-dependent mode, automatically switching between MS and MS2 scans. For the full proteome samples of the tissue atlas experiment, full scan MS spectra (m/z 360–1,300) were acquired with a maximum injection time of 10 ms at 60,000 resolution and an automatic gain control (AGC) target value of 3 × 10⁶ charges. For the top 12 precursor ions, high-resolution MS2 spectra were acquired in the orbitrap with a maximum injection time of 50 ms at 15,000 resolution (isolation window 1.7 m/z), an AGC target value of 2 × 10⁵ and normalized collision energy of 25%. The underfill ratio was set to 1% with a dynamic exclusion of 30 s. Only precursors with charge states between 2 and 6 were selected for fragmentation. For the phosphoproteome analysis, the MS2 spectra were acquired with a resolution of 30,000 (isolation window 1.7 m/z) and a maximum injection time of 120 ms. Dynamic exclusion was set to 35 s.

SEC samples were measured using a 60 min 4% to 32% gradient. MS2 spectra were generated for the top 20 precursors, with a maximum injection time of 50 ms at 30,000 resolution (isolation window 1.7 m/z), a normalized collision energy of 25% and a dynamic exclusion of 20 s.

The seed samples treated with CHX or MG132 were analysed by a micro-flow LC system coupled online to an Orbitrap Fusion Lumos mass spectrometer (Thermo Fisher Scientific) as previously described⁵⁹. The micro-flow LC system was built by combining a modified Vanquish pump with the auto-sampler of the Dionex UltiMate 3000 RSLCnano System. The sample was directly loaded onto a Thermo Fisher Scientific Acclaim PepMap 100 C18 LC column (2 μm particle size, 1 mm ID × 150 mm), the flow rate was 50 μl min⁻¹ and column temperature was maintained at 55 °C. A 15-min linear gradient of 7–32% solvent B (solvent A: 0.1% formic acid, 3% DMSO in HPLC grade water; solvent B: 0.1% formic acid, 3% DMSO in ACN) was used to separate all the samples, followed by 1 min 95% solvent B washing and 1 min 0.5% solvent B equilibrium time at a flow rate of 100 μl min⁻¹. The Orbitrap Fusion Lumos mass spectrometer was operated with the Ion Max API Source installed with a HESI-II probe (50 μm ID). The detailed acquisition parameters are: Positive polarity; spray voltage 3.5 kV, funnel RF lens value at 40, capillary temperature of 325 °C, auxiliary gas heater temperature of 300 °C. The flow rates for sheath gas, aux gas and sweep gas were set to 32, 5 and 0, respectively. Full scan MS spectra (m/z 360–1,300) were acquired in the orbitrap with a maximum injection time of 10 ms at 120,000 and an AGC target of 4 × 10⁵. MS2 spectra were acquired in the linear ion trap (rapid scan mode) after collision-induced dissociation fragmentation with collision energy set at 35%, an AGC target of 1 × 10⁴ and maximum IT of 10 ms. The isolation window was set to 0.4 m/z and scans were recorded with a maximum duty cycle of 0.6 s and the option ‘inject ions for all parallelizable time’ enabled. Only precursors with charge states between 2 and 6 were selected for fragmentation and the lowest scan range of MS2 was fixed at 100 m/z. The intensity threshold was set to 5 × 10³ and the dynamic exclusion to 12 s.

Peptide and protein identification and quantification

Raw data files were searched with MaxQuant software (v.1.5.8.3) using standard settings unless otherwise described⁶⁰. MS/MS spectra were searched against Araport11⁵ protein coding genes (Araport11_genes.201606.pep.fasta; downloaded June 2016), with trypsin as protease and up to two allowed missed cleavages. Carbamidomethylation of cysteines was set as fixed modification and oxidation of methionines and N-terminal acetylation as variable modifications. The ‘match-between-runs’ function was enabled for corresponding fractions within one parameter set, where applicable.

For the tissue map analysis, full proteome and phosphoproteome samples were processed together as two separate parameter groups, with phosphorylation of serine, threonine or tyrosine defined as variable modification only for the phosphoproteome group. Here we also added the spike-in phosphopeptide library sequences⁵⁴ to the database search space.

For sORF identification, two sORF databases (ATSO and ARA-PEP) were added to the search space in addition to Araport11. ARA-PEP is a previously described repository of putative sORF-encoded peptides in Arabidopsis¹⁴. ATSO (A. thaliana sORFs) was generated by using sORFfinder⁶¹ and our RNA-seq analysis data to identify putative new sORFs in non-coding intergenic regions (ATSO.non_coding.pep.nr.cd-hit_aS1_aL0.3_c1.out.nr.fasta). Three targets were used as input for sORFfinder. Sequences of intergenic and non-coding regions relative to the Araport11 annotation, and sequences resulting from a Trinity⁶² de novo transcriptome assembly that are non-overlapping with the Araport11 annotations. Afterwards, the sORF nucleotide sequences were translated to amino acid sequences. CD-HIT⁶³ was used with the parameters –aS 1 –aL 0.3 –c 1 to reduce sequence redundancy.

SEC experiment raw files for flower, leaf and root were all searched together, with each SEC fraction designated as individual experiment and the ‘match-between-runs’ function enabled.

The seed experiment raw files were searched using standard settings as described above.

Peak list files generated previously⁷ were downloaded from Pride (PRD000044) and searched with the Mascot⁶⁴ search engine (v.2.4.1, Matrix science) against the Araport11 database. The target-decoy option of Mascot was enabled and search parameters included a precursor tolerance of 1.3 Da and a fragment ion tolerance 0.45 Da. Enzyme specificity was set to trypsin and up to two missed cleavages were allowed. The Mascot ¹³C option, which accounts for the misassignment of the monoisotopic precursor peak, was set to 1 and oxidation of methionine and carbamidomethylation of cysteines were included as variable and fixed modifications, respectively. The isobarQuant workflow was used to generate protein identification files⁶⁵.

Spectra validation

sORF peptides identified by database searching that could not be mapped to an existing Araport11 gene model were synthethized at JPT Peptide Technologies using Fmoc-based SPOT synthesis on membranes⁵⁷ and measured on the same mass spectrometer that was used for data acquisition (tissue atlas full proteome method; see ‘LC–MS/MS analysis’ section). Experimental and synthetic peptide spectra were extracted from the raw files and used for similarity calculation without prior processing. Normalized spectral contrast angle comparison between spectra of the tissue sample and synthetic peptides was performed using in-house R scripts⁶⁶ factoring in peaks which are either shared between spectra or exclusive to the synthetic peptide spectra. Peaks exclusive to experimental spectra were ignored. Candidates were selected with spectral contrast angle cut-off values larger than 0.7 and BLASTed against the UniProt database for further validation (Supplementary Data 3).

Protein identifications for the uncertain evidence category of UniProt were validated by comparing the experimental spectra to in-silico predicted fragment spectra¹² (Supplementary Table 1). For this, a spectral library was obtained from Prosit (https://www.proteomicsDB.org/prosit)¹² by uploading all sequence-charge combinations of peptides that were identified for uncertain proteins. To obtain the best matching spectra, the collision energy of Prosit was calibrated using a standard quality control run from the same mass spectrometer that was used for data acquisition. The resulting predicted spectral library was visualized and compared to the experimentally acquired data analogue to the sORF candidate peptides. Spectral contrast angle values larger than 0.7 were used as cut-off values.

RNA sequencing

Total RNA was isolated with the NucleoSpin RNA Plant kit (Macherey-Nagel) according to the manufacturer’s instructions. DNA was removed by on-column treatment with rDNase (Macherey-Nagel). For recalcitrant samples (seed, silique, pollen), a LiCl-based protocol was adopted with minor modifications⁶⁷. After LiCl precipitation, the RNA pellet was dissolved in rDNase buffer and treated with rDNase (Macherey-Nagel) at 37 °C for 10 min. The final pellet was re-suspended in 35 μl DEPC-treated water.

RNA was quantified (Nanodrop, Thermo Fisher Scientific) and quality checked with a Bioanalyzer 2100 (Agilent Technologies). RNA integrity number values between 6.4 and 10 were accepted for further analysis. cDNA libraries were prepared using the TruSeq Stranded mRNA Sample Preparation kit (Illumina) according to the instructions. Clusters were generated in two batches and sequenced on a high-throughput flow cell with the HiSeq 2500 platform (Illumina) to a depth of 36 million reads per sample. Quality assessment of raw and trimmed 75-bp paired RNA-seq reads was performed with FastQC. Raw RNA-seq reads were trimmed to remove adaptor contamination and poor-quality base calls using Trimmomatic v.0.35 with parameters (ILLUMINACLIP: Illumina-PE.fasta:2:30:10; LEADING:3; TRAILING:3; SLIDINGWINDOW:4:20; MINLEN:36)⁶⁸. Trimmed RNA-seq reads were mapped to the Araport11⁴ transcriptome with Kallisto v.0.43.1 (default parameters)⁶⁹.

Data processing

MaxQuant output tables were filtered for non-plant contaminants, reversed sequences and proteins that were only identified based on modified peptides. Protein abundance estimation was based on either intensity-based absolute quantification (iBAQ)²⁶, or top-3 quantification⁷⁰, depending on the analysis. MaxQuant ProteinGroups containing several gene loci were filtered out to retain only unambiguously identified gene loci for further analyses. In case multiple protein isoforms were identification in distinct ProteinGroups, only the isoform with the higher number of razor + unique peptides was retained.

mRNA quantities are displayed as TPM and a cut-off value of 1 TPM was used as lower limit of detection across all samples. GO⁷¹ term analysis for the transcript abundance range was performed using the 1D enrichment function from Perseus^72,73 with Benjamini–Hochberg false discovery rate (FDR) threshold set to 0.01. In this function, a two-sided Wilcoxon–Mann–Whitney test is used to test for systematically larger or smaller transcript abundance within a GO category as compared to the global distribution of all values (Supplementary Data 4).

For further qualitative and quantitative analyses, all transcript or protein isoform information was collapsed onto gene level. Unless otherwise stated, displayed abundances for protein, transcript and P-site were log₂-transformed. Protein, peptide and transcript datasets for the tissue map and seed treatment experiments were median-centred to the overall median of the respective dataset. No normalization was performed for the P-site dataset, because variations in the intensities of total P-sites between tissues are also due to biological sample differences. Instead, the spike-in phosphopeptide library was used, to assess reproducible enrichment efficiency and MS measurement quality of phosphoproteome samples⁵⁴.

P-sites that were reported in the MaxQuant output table (Phospho(STY)sites.txt file) with a ‘localization probability’ larger than 0.75 were designated as ‘high confidence localizations’ or class I sites⁷⁴. P-sites were considered exclusive if they were only detected in a single tissue. The number and identity of P-sites and phosphoproteins detected in this study were compared to datasets available through PhosPhAT4.0 and a published meta-analysis^8,9 (Extended Data Fig. 1h).

Data analysis

Genome annotations

Chromosome information contained in the Arabidopsis locus identifiers (AGI codes: AT (A. thaliana); 1, 2, 3, 4, 5, M, C (chromosome number, M for mitochondrial, C for chloroplast); G (gene); 12300 (five-digit code for position on chromosome)), was used to assign genes to their respective chromosomes (Extended Data Fig. 1c). Araport11 gene identifiers (Arabidopsis gene identifiers (AGI)) were mapped to the UniProt A. thaliana reference proteome (taxon identifier 3702; UP000006548; downloaded November 2018) based on protein sequence. Swiss-Prot, TremBL as well as protein existence criteria annotations (level 1: evidence on protein level; level 2: evidence on transcript level; level 3: inferred from homology, level 4: predicted, level 5: uncertain) were subsequently obtained from UniProt (Extended Data Fig. 1d).

N- and C-terminal peptide sequences were extracted from the MaxQuant peptides.txt file and filtered for zero missed cleavages (n = 2,776) (Extended Data Fig. 2a). N-terminal peptides were divided into groups with (n = 1,707) or without (n = 1,069) cleavage of the initiator methionine. The frequency of the 20 genetically encoded amino acids at the position after the start codon was subsequently calculated separately for both groups and displayed as a pie chart. For N-terminal peptides with the same amino acids at the position after the start codon, the percentage of peptides with N-terminal acetylation was also calculated and displayed as bar plot. This analysis was also performed separately for the groups with or without cleavage of the initiator methionine (Extended Data Fig. 2b, c).

Gene isoforms annotated in Araport11 were considered as ‘identified’, if they were detected with a TPM intensity larger than our limit of detection cut-off value (1 TPM; transcript level) or with an isoform-specific peptide (protein level) in at least one tissue sample (Extended Data Fig. 2f, Supplementary Data 3). A selection of isoform-specific peptides was synthesized together with the sORF peptide collection at JPT Peptide Technologies and the synthetic peptides were used for validation of the experimental spectra identifications as described above.

Transcription factor, transcription regulator, kinase and phosphatase family annotations and classifications are based on previous reports^16,17. Proportional coverage of these families within our dataset was calculated by counting how many of them could be identified on transcript, protein or phosphoprotein level, respectively (Extended Data Fig. 2k).

Tissue groups and tissue characteristics

Tissue samples were assigned to tissue groups as follows, based on their origin or common morphology: flower (sepal, petal, stamen, carpel, silique, flower); fruit (silique septum, silique valves); seed (embryo, seed, seed imbibed); pollen (pollen); leaf (cauline leaf, leaf distal, leaf proximal, leaf petiole, senescent leaf, cotyledons, shoot tip); stem (node, internode, flower pedicle, hypocotyl); root (root, root tip, root upper zone); callus (egg-cell like callus, callus); and cell culture (cell culture early, cell culture late) (Fig. 1a, b).

To provide a measure of tissue similarity, Pearson’s correlation coefficient (r) was calculated for the gene expression of all pairwise tissue combinations. Pearson’s correlation coefficient is a measure of the linear correlation between two variables, in this case the expression levels in two tissue samples. Correlations were computed both on protein level and transcript level and displayed as separate heat maps for both omics levels (Extended Data Fig. 1a). For three examples of morphologically highly similar tissue pairs, the gene expression levels were also displayed as scatter plots using protein abundance or transcript abundance measurements, respectively (Extended Data Fig. 1b).

Tissue or tissue-group specificity of genes was calculated based on iBAQ or TPM abundance values, respectively⁷⁵. Genes were assigned to categories in the following order: ‘tissue-specific’, ‘tissue-enhanced’, ‘group-specific’, ‘group-enhanced’, ‘shared’ and ‘mixed’. Genes were considered ‘specific’, if they were only detected in a particular tissue or tissue-group and ‘enhanced’ if their abundance was at least fivefold higher in a particular tissue or tissue-group as compared to the average levels in all other tissues. Genes that were detected in all 30 tissues but did not show enhanced expression in a tissue or tissue group were classified as shared. All remaining genes are contained in the mixed category (Extended Data Fig. 3a).

Genes detected in all tissues on either protein or transcript level were assigned to the core datasets (core_transcript n = 8,405; core_protein n = 7,734; core_intersection n = 5,043). This classification does not consider gene expression levels and should not be confused with the tissue-specificity classification described in the previous paragraph.

Hierarchical clustering analysis for tissues representing flower parts or the whole flower was performed on log₂-transformed, z-scored protein intensities (iBAQ values) in Perseus using Euclidean distance and average linkage (Extended Data Fig. 3b). The flower organ identity model was restricted to the simplified ABC model, thus only displaying homeotic genes for the A, B and C classes⁷⁶, namely the expression of MADS-box transcription factors APETALA1 (AP1) for class A, APETALA3 (AP3) and PISTILLATA (PI) for class B, and AGAMOUS (AG) for class C. Class A expression specifies sepal formation, the combination of class A and B specifies petal formation, the combination of class B and C specifies stamen formation, and class C expression specifies carpel formation in developing flowers.

For the cumulative abundance calculation of five representative tissues (flower; pollen; root tip; leaf proximal; seed), proteins (iBAQ) and transcripts (TPM) were first ranked from highest to lowest individual intensity (Extended Data Fig. 3e). The running total was then plotted against the rank order. The names or identifiers of the five most abundant transcripts or proteins (rank 1 to 5) are listed in descending order for the respective tissue. Shared IDs among the top 100 most abundant transcripts (TPM) and proteins (iBAQ) were calculated for each individual tissue (Extended Data Fig. 3f). For the genes representing the highest abundant protein in at least one tissue, the contribution to the total protein amount in an individual tissue was calculated by dividing the iBAQ intensity (not log-transformed) of the respective protein by the summed total iBAQ intensity (not log-transformed) in this tissue (Extended Data Fig. 3g).

Principal component analysis of proteome and transcriptome data was performed in Perseus for the intersection of the core datasets (n = 5,043) on log₂-transformed and z-scored iBAQ and TPM intensities (Extended Data Fig. 3h). Subcellular localization information was downloaded from SUBA (downloaded November 2016; unambiguous localizations n = 3,506)⁷⁷ (Supplementary Data 2) and used to calculate the intensity contribution of proteins assigned to a specific compartment for the different tissue groups. For this, the percentage of proteins with the same SUBA annotation was calculated by dividing the summed protein intensity (iBAQ, not log-transformed) of each SUBA category by the total summed iBAQ intensity (not log-transformed) within each tissue. To compare tissue groups, the respective protein intensity proportion was then averaged for all tissues within one group. These averaged proportions were subsequently plotted for each subcellular compartment (Extended Data Fig. 3i).

Protein/mRNA relation

The Pearson’s correlation value (r) between median protein (median iBAQ of 30 tissues) and median transcript (median TPM of 30 tissues) abundances was calculated using the set of genes with abundance measurements on both protein and transcript level in more than ten tissues (more than 10 pairwise complete observations; n = 14,069). The scatterplot of median transcript (TPM) and protein abundance (iBAQ) was displayed together with their marginal histograms (Fig. 2a). Tissue-specific Pearson’s correlation values between protein and transcript abundance were calculated using all genes from the core proteome and core transcriptome intersection dataset (n = 5,043) (Extended Data Fig. 4a). The core dataset was used here to base the Pearson’s correlation value comparison between tissues on the same set of genes in all tissues.

The data subset with more than 10 pairwise complete observations on protein and transcript level (n = 14,069) was also used for all further calculations involving PTR (Fig. 3d, Extended Data Fig. 5). PTR values are calculated by building the ratio between protein abundance (iBAQ) and the corresponding transcript abundance (TPM) for each gene and were calculated separately for each tissue. The tissue PTR values were then used to calculate the median and median absolute deviation (MAD) of PTR values across all 30 tissues (Extended Data Fig. 5). The variation of PTR values across tissues indicates, whether protein and transcript levels of a given gene are regulated in a similar (small PTR variation) or different way (high PTR variation) between individual tissues. To estimate the proportion of genes with rather stable or variable PTR values, the PTR MAD values were distributed into five equal parts, so that 20% of all genes from this analysis are contained in one part (5 MAD quantiles: Q1–Q5). For each gene, the median PTR values were plotted against their MAD across tissues and the MAD quantiles indicated in the plot (Extended Data Fig. 5a). GO term analysis for the median PTR distribution (median PTR across all tissues) was performed using the 1D enrichment function from Perseus⁷² with Benjamini–Hochberg FDR threshold set to 0.01 (Supplementary Data 4). In this function, a two-sided Wilcoxon–Mann–Whitney test is used to test for systematically larger or smaller PTR values within a GO category as compared to the global distribution of all values.

In a similar way, the variation of phospho ratios across tissues indicates, whether P-site levels are driven by protein abundance (small phospho ratio variation) or P-site stoichiometry (high phospho ratio variation) changes, respectively. Phospho ratios were calculated by building the ratio between the abundance of a P-site (intensity) and the abundance of its corresponding protein (iBAQ). For this, we used the set of P-sites, with abundance measurements on both P-site and protein level in more than 10 tissues (more than 10 pairwise complete observations; n = 13,793). Phospho ratios were calculated separately for each tissue and subsequently combined to calculate the median phospho ratio and phospho ratio MAD across all 30 tissues. To estimate the proportion of P-sites that closely resemble the protein abundance profile, the phospho ratio MAD values were again distributed into five equal parts (five MAD quantiles: Q1–Q5), and the median phospho ratio values plotted against their MAD across tissues each P-site. MAD quantiles were indicated in the plot (Extended Data Fig. 5c).

To provide further information about inter-tissue variability, MAD quantiles were also calculated and displayed for the expression levels on protein (iBAQ; n = 14,069), transcript (TPM; n = 14,069) and P-site (peptide intensity; n = 13,793) levels, respectively (Extended Data Fig. 5b, d).

Features for protein level prediction models

As a substantial proportion of variation in protein levels remained unexplained when using transcript level information alone (Pearson’s correlation between protein and transcript abundance in tissues r = 0.28–0.79), additional molecular features, that could explain protein abundance variations, were tested for their influence on protein abundance in a model selection approach. Predictors selected for this analysis were: mRNA levels, codon usage, non-synonymous-to-synonymous substitution (d_N/d_S) ratios, which are a measure of evolutionary conservation, gene/coding sequence (CDS) length, exon number, gene position on the chromosome, cytosine methylation, the number of putative protein interactions and mRNA sequence motifs (k-mers of size 3 to 7 nucleotides):

Codon usage statistics for the A. thaliana genome were obtained from Kazusa (www.kazusa.or.jp/codon/current/species/3702) and parsed to extract NCBI gene identifiers. These identifiers were mapped to their corresponding UniProt entries and Ensembl/TAIR10 annotations using the UniProt Retrieve/ID mapping tool. The extracted TAIR10 annotation was merged with the Kazusa codon usage dataset. Codon frequencies were calculated for each gene by dividing the count (×3) of a given codon by the full length of the coding sequence.

The d_N/d_S substitution rates were calculated from CDS pairs between A. thaliana and its closest relative Arabidopsis lyrata. Reciprocal best BLAST with an E-value cut-off of ≤ 1 × 10⁻⁸ was used to identify orthologue sequences. Individual CDS pairs were aligned using PRANK⁷⁸ and Gblocks⁷⁹ was applied to eliminate poorly aligned positions in an alignment with a cut-off value of eight contiguous non-conserved positions and without allowing for gap positions. The yn00 package in the program PAML⁸⁰ for pairwise sequence comparison was used to estimate substitution rates, d_N and d_S, respectively.

The total number of exons and the total gene lengths were obtained from Araport11. The distance of each gene from the chromosomal centromeres was calculated to capture potential position-specific effects.

A. thaliana (Col-0) whole-genome bisulfite sequencing data were obtained from a previous publication⁸¹. For each gene, methylation levels were calculated for contexts CG, CHG and CHG (in which H denotes adenine, cytosine or thymine) separately. Per gene methylation levels were defined as:

$${g}_{i}=frac{1}{{rm{max }}(j)}xsum frac{N{m}_{j}}{{N}_{j}}$$

where max(j) is the total number of cytosines, Nm_j is the number of methylated reads, and N_j the total number of reads at the jth cytosine.

Arabidopsis protein–protein interactions were downloaded from STRING³⁰ and the number of protein interaction partners extracted for each gene.

All of these features have previously been associated with transcriptional activity and/or protein abundance levels in Arabidopsis and/or other species^{82,83,84,85,86,87}. For a detailed feature set description, see Supplementary Data 5.

De novo motif identification

Motifs in the mRNA CDS, 3′ or 5′UTR region were identified as previously described²³. In brief, protein abundance levels were log₁₀-transformed and median centred. For genes with two or more transcript isoforms, the transcript isoform reported to have the largest summed iBAQ values across all tissues was defined as the major transcript isoform per gene and used to compute all sequence feature and mRNA levels. Raw RNA-seq reads were trimmed to remove adaptor contaminations and poor-quality base calls using Trimmomatic 0.39 with parameters (ILLUMINACLIP:Illumina-PE.fasta:2:30:10; LEADING:3; TRAILING:3; SLIDINGWINDOW:4:20; MINLEN:36). Trimmed RNA-seq reads were mapped to the Araport11 transcriptome annotation with Kallisto v.0.46.0 using default parameters. To estimate the levels of mature mRNA, the number of reads mapping to exonic and intronic regions of the transcript were counted separately and then normalized by the total exonic and intronic region lengths, respectively. Normalized intronic counts were subtracted from the normalized exonic counts to obtain the mature mRNA counts. The resulting normalized exonic counts per sample were corrected by the DESeq2⁸⁸ library size factor and log₁₀-transformed. Transcripts with 10 reads per 1 kb were treated as transcribed. Tissue-specific PTRs were computed using the normalized protein and transcript levels. GEMMA software⁸⁹ was used to identify de novo motifs in 5′ UTR, CDS and 3′ UTR regions using the tissue-specific PTRs as response variables. GEMMA uses a linear mixed model, in which the effect of each individual k-mer on the median PTR across tissues is assessed while controlling for the effect of other k-mers (random effects), region length and GC percentage (fixed effects). The motif search was performed for k-mers ranging from 3 to 7 nucleotides. Obtained P values were adjusted for multiple testing with Benjamini–Hochberg’s FDR and jointly computed across the P values of all tissues. Gemma was run using the median PTR with FDR < 0.1 and covariates set to ‘false’. In total, 82 significant putative motifs were obtained based on their sequence (5′ UTR n = 32; 3′ UTR n = 38; CDS n = 12) and sub-sequence (initial, all, end) region. Then, 39 motifs that lie in the ‘initial’ and ‘end’ sub-sequences were further selected. The presence or absence of each enriched motif with respect to each gene was extracted in form of a binary matrix and used for downstream multivariate feature selection analysis.

Model-based feature selection

Tissue-specific protein abundance data were merged to the feature matrix. Preliminary pairwise correlation analysis showed only weak to moderate correlation between individual features. In addition, variance inflation factors were calculated for each feature using the fmsb R package⁹⁰. The result showed low variance inflation factors, suggesting that multicollinearity was not an issue for downstream analyses.

Two model selection approaches were implemented—stepwise regression and Lasso regression^91,92. To select the most predictive features for protein abundance in each tissue, we used a forward–backward model selection approach in a multiple regression framework. The method was implemented using stepwiseAIC() function in R⁹³, which compares the fit of nested models. To ensure that the comparison of model AICs was not affected by unequal sample size, missing data were removed before the analysis.

Because stepwise regression can occasionally lead to over-fitting⁹⁴, a Lasso regression was implemented as an alternative model selection procedure. Lasso regression performs L1 regularization, which adds a penalty equivalent to the absolute of the magnitude of regression coefficients and tries to minimize them. The strength of the penalty was controlled via the tuning parameter λ⁹². Lasso was implemented using the glmnet package⁹⁵ in R. Model training was performed on a random selection of genes (50% of the dataset) and implemented using the cv.glmnet() function. The optimal value for λ was extracted and used to re-build the model using the glmnet() function. Finally, the fitted model was used to obtain predictions in the remaining 50% of the genes.

For each tissue, stepwise and Lasso regression models were compared. Stepwise regression yielded only slightly higher coefficients of determination (R²) values compared to Lasso, suggesting over-fitting was not an issue (Supplementary Data 5). As stepwise regression yielded more parsimonious models in general, we used this approach for further analysis. Selected features here could explain on average 52% of protein abundance variation (37% in pollen, 62% in cell culture early) (Supplementary Data 5). For each tissue, features from the best fitting models were summarized in an incidence matrix along with the effect direction (positive or negative effect on protein levels). To determine the importance of each feature to the overall model fits, R² variance decomposition was performed using the ‘genici’ metric which is implemented in the relaimpo R package⁹⁶ (Extended Data Fig. 4c). Relative feature contributions were averaged across all tissues (Fig. 3b).

Because many of the detected motifs appear in a tissue-specific manner (predictive only in tissue subsets), we clustered tissues according to the presence or absence status of 5′ UTR motifs in each tissue model (Extended Data Fig. 4d). Seed and pollen tissues form a distinct subcluster, which might be connected to the increased/decreased PTR levels in these tissues compared to the other tissues (Fig. 3c, Extended Data Fig. 4a). In contrast to the motif analysis, d_N/d_S ratios were consistently (and positively) associated with protein abundance in all 30 tissues (Supplementary Data 5). To explore the relationship between d_N/d_S ratios and protein abundance in more detail, groups of genes with low or high (bottom 5% or top 5% of d_N/d_S distribution) were compared to each other with regard to protein levels (Extended Data Fig. 4e).

Seed tissue PTR regulation

The PTR value distribution was plotted for the median PTR (median across all tissues, see ‘Protein/mRNA relation’) and selected tissues (seed, seed imbibed, pollen, cell culture early). Seed and pollen tissues show particularly low Pearson’s correlation values between protein and transcript abundance levels (Extended Data Fig. 4a). Cell culture early in comparison shows the highest Pearson’s correlation among the analysed tissues. Tissue-specific ‘outliers’ with especially high or low PTR as compared to all other tissues were selected if their PTR value was outside two standard deviations of the mean of the median PTR distribution. For seed, this resulted in 469 genes with ‘high’ and 571 genes with ‘low’ PTR. To interpret whether this change in seed PTR values was caused by differential regulation on transcript and/or protein level, the fold change between seed protein and transcript abundance and the median protein and transcript abundance of all tissues was calculated. Genes with at a least fourfold change in either protein or transcript abundance were considered regulated (less/more transcript and/or less/more protein in seed compared to all tissues). The percentage of genes with high or low PTR levels in the seed tissue, which are regulated in the same manner (less/more transcript and/or protein) was calculated and displayed as an arrow plot (Fig. 3e). About one-quarter of each gene set (high PTR 27%; low PTR 24%) showed no regulation based on our fourfold-change cut-off.

For the low PTR gene set, a high proportion of genes with lower protein abundance in the seed tissue was observed. A protein-level chase experiment was performed, to investigate whether this reduction was due to translational inhibition or increased protein degradation in seeds. Protein level (iBAQ, not log-transformed) fold changes in CHX-, MG132- and DMSO-treated samples were calculated relative to the baseline protein abundance at 0 h (control sample). The average protein-level fold change for the subset of genes with either high or low PTR in seeds (see previous paragraph) was subsequently determined for each time point and plotted separately for ‘high’ and ‘low’ PTR genes (Fig. 3f). To test for significant fold change differences between treatment time points, an analysis of variance (ANOVA) and post hoc Tukey honestly significant difference (HSD) test was performed using the stats package in R (Extended Data Fig. 4f). For this, the data were normalized for changes in the proteome caused by seed germination, by calculating the log₂-transformed fold change in protein abundance (iBAQ) between seed samples treated with CHX or MG132 and the time-point-matched DMSO control. Outlier values were removed for the box plot visualization but are included in the ANOVA and post hoc Tukey analysis.

Co-expression analysis

To compare gene co-expression information in our dataset to prior knowledge about various associations between genes pairs, the Pearson’s correlations between all pairwise gene combinations on both transcript and protein level (core dataset intersection; n = 5,043) was calculated. The correlation coefficient value for a given gene pair on transcript level was then plotted against the correlation coefficient value of the same pair on protein level (Extended Data Fig. 6a). Protein–protein association data were downloaded from STRING³⁰ (www.string-db.org; 3702.protein.links.detailed.v10.5.txt.gz; download July 2018), the experimental STRING subscores calculated for each gene pair and the information merged to the correlation value matrix. The scatter plot of transcript level versus protein level correlations was then divided into 50 × 50 bins along the x and y axis and the mean experimental STRING subscore for all gene pairs within a bin was calculated. Mean subscores were log₁₀-transformed and visualized as a heat map together with their marginal histograms (Extended Data Fig. 6a).

For a subset of genes, pairwise co-expression was further investigated. Pairs used in this analysis were genes that are either annotated as duplicated³¹ or as protein interactors (AtPIN³³, downloaded August 2017), respectively. The Pearson’s correlation between the expression levels of gene pairs across tissues was calculated for gene pairs with abundance measurements in more than 10 matching tissues on both protein and transcript level (more than 10 pairwise complete observations). The duplicated gene set was further filtered for three unique peptide identification of each paralogue. Top-3 intensities were used as protein abundance measure for the duplicate gene set as the subsequent ratio comparison was also performed with top-3 intensities. Paralogue genes often have high sequence identity on the amino acid level, which can lead to distorted ratios owing to uneven assignment of shared (razor) peptides to only one of the paralogues. Duplicated genes (n = 3,612) were divided into three subsets: duplications caused by a whole-genome duplication event (n = 2,104), local duplication (local, n = 408) or transposon-mediated duplication (transposed, n = 1,100). For each subset, the distribution of Pearson’s correlation values between duplicates on either transcript or protein levels was plotted (Extended Data Fig. 6b). A set of random gene pairs (n = 27,353) was generated for comparison, filtered like the duplicated gene set (number of pairwise complete observation, three unique peptides) and the Pearson’s correlation value distribution displayed in the same way. To compare relative protein abundance between paralogues, the ratio between the protein abundance (top-3) of one paralogue and the protein abundance (top-3) of the other paralogue was calculated (Extended Data Fig. 6c). As an example for the intensity proportion of paralogue genes in different tissues, the top-3 intensity (not log-transformed) of MAC5A and MAC5B was plotted as proportion of the summed top-3 intensity (not log-transformed) of both paralogues (Extended Data Fig. 6c). For 57 paralogue gene pairs with phenotypic information about asymmetric loss-of-function mutant combinations³², we build the average top-3 intensity proportion of the first paralogue (dupl1/(dupl1 + dupl2)) across 30 tissues and plotted them in descending order together with the phenotype information (Extended Data Fig. 6d).

A set of well-studied, stable protein complexes (26S proteasome, COP9 signalosome, chaperonin, cellulose synthase) was selected to estimate the protein-level (iBAQ) co-expression expected for stable interaction partners. For this, the Pearson’s correlation coefficient (r) values between all pairwise subunit combinations of an individual complex were calculated and displayed in a density plot (Extended Data Fig. 6e). On the basis of this analysis, a Pearson correlation coefficient cut-off value of >0.5 was subsequently used as indication for stable protein interactions. The protein interactor gene set (AtPIN) was used to test the recovery of annotated protein–protein interactions based on co-expression data from this study. Again, the distribution of Pearson’s correlation values (r) between gene pairs (here interactors) on either transcript or protein level (iBAQ) was plotted. This was done for the whole AtPIN dataset (n = 57,152) and the following subsets: interactions identified by yeast-two hybrid experiments (n = 7,621), by affinity-purification mass spectrometry (n = 17,982) or by both yeast two-hybrid and affinity-purification mass spectrometry (n = 829).

SEC and complex analysis

Reproducibility between the three SEC experiments was tested by comparing the elution behaviour of the same proteins. The coefficient of determination between protein peaks in the different experiments was above 0.7 (flower–root R² = 0.82; flower–leaf R² = 0.78; root–leaf R² = 0.7). Protein peak elution between the samples was also consistent across the whole SEC gradient range (Extended Data Fig. 7b).

To assign proteins to potential complexes, peak correlation profiling was performed. For this, the R package CCprofiler v.0.1 was used to analyse the peptides.txt output table from MaxQuant^61,97. Peptide table entries were restricted to the maximum molecular mass per gene name for the quantification of protein groups. This step removed peptides mapping to smaller isoforms of each gene. The restricted proteinGroups.txt served as the trace annotation input table for CCprofiler. The calibration table of the SEC column was based on the high molecular mass gel filtration calibration kit (GE Healthcare). The peptides table was split into the different tissues and the resulting tables converted into the input format for CCprofiler. All subsequent steps were performed separately for each tissue table.

The peptides table was imported, converted into a traces object, where a trace object is the SEC-elution profile of a specific peptide and annotated with the corresponding fractions and additional external information (for example, molecular mass of the respective fraction, molecular mass of the respective protein) based on the aforementioned calibration and restricted proteinGroups tables. Traces not containing at least three consecutive peptide identifications across fractions, as well as traces with a sibling peptide correlation of less than 0.2 between peptides originating from the same protein, were filtered out. On the basis of the remaining traces, protein features (elution peak positional information) consisting of highly correlated peptides (correlation higher than 0.9) were detected using CCprofiler’s sliding window algorithm. Peptides were randomly assigned to proteins to control the false discovery rate of protein features at 5% (random decoy model). Subsequently, protein traces were calculated by summing up the intensities of the top two most abundant peptide traces.

Next, complex hypotheses consisting of target and decoy protein complexes were constructed based on a previously published mapping of Arabidopsis orthologues to mammalian protein complexes from CORUM^98,99. Proteins were only mapped to the same decoy complex if they were not interacting with each other directly (minimum path length of two). These target and decoy complexes were used during the detection of complex features consisting of highly correlated proteins using the same sliding window algorithm as above to control the false discovery rate of complex features at 5% (target-decoy model).

Both protein features (elution peak positional information) and protein traces (quantitative information) were then used to generate an input matrix for weighted gene correlation network analysis (WGCNA)¹⁰⁰, which was used to find clusters of proteins with correlated SEC elution profiles. Protein traces were restricted to the ones, which were part of at least one protein feature. Each protein trace was duplicated for each protein feature it was part of and all intensities outside of the respective protein feature were set to zero. With this, protein traces were effectively split into separate elution peaks corresponding to distinct (sub)complexes. Afterwards, elution peaks, which met either of the following criteria were filtered out: (1) the absolute difference between the molecular mass at the apex of the elution peak and the monomer molecular mass of the corresponding protein was smaller than twice the monomer molecular mass. (2) The absolute difference between the molecular mass at the apex of the elution peak and the monomer molecular weight of the corresponding protein was less than 200 kDa. (3) The apex of the elution peak was in fraction 77 (void fraction). (4) The apex of the elution peak was in fraction 4 (monomer range).

The remaining elution peaks were restricted to the intersection with the core protein dataset across tissues described above (see ‘Tissue groups and tissue characteristics’), generating the SEC WGCNA input dataset. At the same time, an additional WGCNA input dataset was generated based on the core protein dataset across tissues and restricted to the intersection with the SEC WGCNA input dataset. Proteins with several SEC elution peaks were duplicated to match the SEC WGCNA input dataset. WGCNA was carried out separately for each of these datasets. Signed co-expression similarities were calculated between all pairs of proteins with at least seven pairwise-complete observations using the following formula:

$${s}_{ij}=frac{1+{rm{cor}}({x}_{i},{x}_{j})}{2}$$

where s_ij is the signed co-expression similarity between two proteins x_i and x_j (based on their Pearson’s correlation coefficient (r) across tissues). Adjacency matrices were calculated with (A=[{s}_{ij}]) for several values of β ∈ [1,30].

$${a}_{ij}={s}_{ij}^{beta }$$

where a_ij is the adjacency between two proteins x_i and x_j. The adjacency function parameter β was selected to be 30 for the construction of signed protein co-expression networks. The topological overlap matrix (Omega =[{omega }_{ij}]) of the two networks was calculated with:

$${omega }_{ij}=frac{{l}_{ij}+{a}_{ij}}{{rm{min }}{{k}_{i},{k}_{j}}+1-{a}_{ij}}$$

where ({l}_{ij}=sum _{u}{a}_{iu}{a}_{uj}) and ({k}_{i}=sum _{u}{a}_{iu}). Hierarchical clustering was performed on the topological-overlap-based dissimilarity matrices (D=[1-{omega }_{ij}]) using average linkage. Correlation clusters (‘Modules’) were detected using adaptive branch pruning¹⁰¹ using the ‘Dynamic Hybrid’ method set to respect the dendrogram topology during PAM operations and requiring at least 30 members per module. Similar clusters in the SEC or tissue dataset were merged, if the dissimilarity of their module eigengenes was smaller than 1 − MPC or 1 − MTC, respectively. MPC is the median peak correlation of the FDR-filtered complex features from CCprofiler (known complexes), and MTC is the median tissue correlation of the very same complex features (pairwise correlation between proteins mapping to a specific complex feature). The resulting modules for both datasets were then mapped to each other using a combination of Fisher’s exact test and manual curation. Enrichment of functional annotations from Corum⁹⁹, GO⁷², KEGG¹⁰² and Reactome¹⁰³ were calculated in each cluster using Fisher’s exact test. All P values were adjusted for multiple-testing by calculating FDRs¹⁰⁴ (Supplementary Data 8).

To quantify how well complexes can be detected using the tissue atlas or SEC WGCNA output, we calculated a summary statistic termed ‘complex index’. Protein interactor information was downloaded from UniProt⁶ (https://uniprot.org; December 2018) and manually curated for the presence of large (>4 subunits) and small (≤4 subunits) complexes (Extended Data Fig. 7d, Supplementary Data 9). The complex index (C) was calculated using the following formula:

$$C=frac{{S}_{m}}{S}times frac{{S}_{m}}{M}$$

where S_m is the number of subunits detected as present in the same module. S is the total number of subunits identified in the restricted datasets for either SEC or tissue atlas experiments and M is the total number of entries in the respective module. The complex index is a measure of how well complex subunits co-occur (either by co-elution or by co-expression) and thus are detected in the same module and also quantifies the resolution of the detection method (module size). The complex index is 1 when all subunits of a complex are identified in the same module and no other proteins are contained in the module.

Complex stoichiometry analysis

Absolute SEC protein elution traces of selected complexes (chaperonin, 26S proteasome, CSN, CESA) were plotted using top-3 intensities (not log-transformed). Relative subunit proportions or stoichiometry for selected protein complexes within individual tissues of the tissue atlas dataset were calculated using top-3 intensities (based on the three most abundant unique peptides). Top-3 intensity rather than iBAQ intensity was used, to avoid distorted ratios caused by shared (razor) peptide assignment. In the case of paralogue genes with high protein sequence similarity, peptides were not required to be unique for one gene, but rather for one ‘paralogue group’. The three most abundant peptides within a paralogue group were then used for the top-3 calculations.

For the example of the coatomer complex, the relative proportion of paralogue genes in different tissues was calculated by plotting the top-3 intensity (not log-transformed, unique for gene) of each paralogue as proportion of the summed top-3 intensity (not log-transformed) of all paralogues (Extended Data Fig. 7e). For the calculation of subunit stoichiometry ratios of selected complexes (chaperonin, 26S proteasome, CSN, CESA), the top-3 intensity (not log-transformed) of a subunit was divided by the average top-3 intensity (not log-transformed) of all complex subunits. Subunit ratios were first calculated for each tissue and subsequently averaged across all 30 tissues (chaperonin, 26S proteasome, CSN) or only across tissues where the respective subunits are mainly expressed (CESA4/7/8: node, internode, silique septum, silique valves; CESA1/3/6: all other) (Extended Data Fig. 7g).

Subnetwork extraction

De novo network enrichment was performed to identify tissue- or tissue-group-specific subnetworks and to link them to a molecular function. In contrast to gene set overrepresentation or gene set enrichment analysis¹⁰⁵, this approach is suited to identify previously uncharacterized functions from large-scale molecular interaction networks. KeyPathwayMiner¹⁰⁶ was used to extract tissue- and tissue-group-specific subnetworks from transcriptomics and proteomics datasets. To enrich subnetworks with proteins that deviate in their expression from other tissues, we used z-scored expression values. These reflect how many standard deviations the expression in a given tissue is away from the expression range found in all other tissues:

$${z}_{x,g}=frac{x-{mu }_{Xbackslash x}}{{sigma }_{Xbackslash x}}$$

where x is a specific tissue or tissue group of interest, g a gene, μ the mean, and σ the standard deviation. On the basis of the following rule, a one-column indicator matrix I(x,g) was constructed as input for KeyPathwayMiner for each tissue or tissue group x:

$$I(x,g)={begin{array}{l}1,{rm{if}},|{z}_{x,g}|rangle 2\ 0,{rm{else}},end{array}$$

Here each row corresponds to a gene g. STRING³⁰ A. thaliana network (v.10.5, downloaded October 2018) was used as the molecular interaction network, considering only high-confidence interactions with a score >900. Subnetworks extracted via KeyPathwayMiner were made available for individual tissues or tissue groups as part of the ATHENA web application.

Phosphorylation sites and motif analysis

Serine, threonine and tyrosine content and the P-site number of individual proteins was calculated based on the longest sequence in case of ambiguous isoform identifications (Fig. 5a, Extended Data Fig. 8g). Similarly, the schematic of phosphorylation (P-site) localization in LEA and receptor-like kinase proteins was based on the P-site localization in the longest isoform sequence (Extended Data Fig. 8h, i). Functional domain assignment for receptor-like protein kinase sequences was done using SMART (http://smart.embl-heidelberg.de/smart/batch.pl)¹⁰⁷ and P-sites were manually assigned to specific domains based on their localization in the protein (Extended Data Fig. 8i).

For motif analysis within our dataset, sequence windows of 15 amino acids centred on the identified class I P-sites (15 mers) were assigned to a motif class of proline-directed, acidic, basic or other by following a binary decision tree³⁶ (Extended Data Fig. 8d). Assignment to these categories was done sequentially as follows: Pro amino acid at position −1 (proline-directed), 5 or more Asp or Glu at position +1 to +7 (acidic), Arg or Lys at position −3 (basic), Asp or Glu at position +1,+2 or +3 (acidic), 2 or more Arg or Lys at position −6 to −1 (basic) and otherwise (other).

P-sites with high-confident localization scores (class I) were divided into Ser, Thr and Tyr P-site datasets. Motif extraction was performed separately for each motif class category (proline-directed, acidic, basic or other) using the motif-X algorithm, implemented in the rmotifx R package³⁵ (Supplementary Data 10). Cut-off settings were: min-seq. = 30, pval.cutoff = 1 × 10⁻⁶ (Ser and Thr datasets); min-seq. = 10, pval.cutoff = 1 × 10⁻⁵ (Tyr dataset). Motif-X calculates the fold-increase of a respective motif in the ‘foreground’ (P-site dataset) compared to the ‘background’ dataset (non-phosphorylated peptides). To avoid mass spectrometry-based residue biases, all peptides from our dataset that contained an STY amino acid and had only been identified in non-phosphorylated form were used as the background for all motif-X analyses (258,395 sequences). These peptide sequences were centred on STY and extended where necessary along the N- and C-terminal window to generate 15 mers. Position weight matrix sequence logos were drawn using the ‘bits’ method for amino acid sequences in the ggseqlogo R package¹⁰⁸ (Extended Data Fig. 8e, Supplementary Table 2). The motif-X fold increase for S motifs was plotted for motifs with 2, 3 and 4 fixed amino acid positions (Extended Data Fig. 8f).

Published P-site motifs were retrieved from PhosPhAT4.0⁹ and the Human Protein Reference Database¹⁰⁹ (HPRD; www.hprd.org) and divided into groups depending on the number of fixed amino acid positions within the motif sequence. These motifs were subsequently matched to the P-site or background dataset to calculate detection frequency and fold increase (P-site/background dataset) of a reported motif (Supplementary Data 10).

AGCVIII protein abundance and mutant phenotypic characterization

For the relative expression analysis of AGC kinases across tissues, the AGC1 and AGC3 subfamilies of Arabidopsis AGCVIII kinases were selected²¹. The summed total intensity of AGC1 and AGC3 subfamily kinases (top-3, not log-transformed) was calculated for each tissue. Relative protein amounts of individual kinases in different tissues were calculated as the top-3 intensity (not log-transformed, unique for gene) of each kinase in proportion to the summed top-3 intensity (not log-transformed) of all AGC1 and AGC3 kinases (Fig. 2c). For clear visualization only the D6PK family (D6PK, D6PKL1, D6PKL2 and D6PKL3), AGC1.5, AGC1.6 and AGC1.7 were coloured in the plot.

Mutant alleles of the D6PK family kinase genes—d6pk-1, d6pkl1-1, d6pkl2-2 and d6pkl3-2—and their combinations were previously described¹¹⁰. Embryos were prepared omitting fixation with ethanol and acetic acid as previously described¹¹¹. Embryos were analysed with a Zeiss Axio Imager.M2, Axiocam 512 camera and 20×/0.8 Plan Apochromat objective using differential interference contrast.

For promoter:GUS constructs of AGC1.5, AGC1.6 or AGC1.7, 2,916-bp, 1,207-bp or 2,214-bp fragments upstream from the respective start codon were cloned into the SalI and EcoRI sites of pCAMBIA1391Z. Flowers from transgenic plants, obtained by the floral-dip method⁴⁸, were first incubated in 90% cold acetone for 15 min followed by a β-glucuronidase (GUS)-staining solution (50 mM sodium phosphate pH 7.0, 10 mM EDTA, 2 mM potassium ferricyanide, 2 mM potassium ferrocyanide, 0.1% Triton X-100, 0.5 mg ml⁻¹ 5-bromo-4-chloro-3-indolyl-β-d-glucuronide) overnight in the dark. After a wash in 70% ethanol, flowers were incubated in ethanol/acetate (6:1) for removing of chlorophyll and then rehydrated in a graded ethanol series (90%, 70%, 50% and 30%). Samples were mounted in chloral hydrate/glycerol/water (8 g:1 ml: 2 ml) for imaging. The experiment was performed twice and GUS signal was observed for 12/18 AGC1.5p∷GUS, 0/20 AGC1.6p∷GUS and 12/18 AGC1.7p∷GUS independent lines.

ABA response of RCAR phosphomutants

Preparation and analysis of Arabidopsis protoplasts was performed as previously described¹¹². In brief, protoplasts from 3-week-old Col-0 plants (10⁵ protoplasts; 0.1 ml) were transfected with 5 μg of reporter construct (pRD29B::LUC), 3 μg of p35S::GUS plasmid as internal control and 3 μg of effector plasmid. The effector plasmids drive expression of respective RCAR and PP2C cDNAs under control of the 35S promoter¹¹³. RCAR10 phosphomimetic reporter constructs (RCAR10 mutations S32D and S113D) were obtained by site-directed mutagenesis using primer pairs S32D_F/S32D_R and S113D_F/S113D_R, respectively (Supplementary Data 11). The protoplast suspension was incubated at various levels of ABA as indicated and reporter expression was determined after 18 h of incubation at 25 °C. Three biological replicates per data point were performed for each assay. All data points were normalized to the empty vector control. Structural modelling for RCAR10 was performed with SWISS-MODEL¹¹⁴ using RCAR11 as template model (PDB code 3K3K^115,116) and modified with UCSF CHIMERA¹¹⁷.

Phenotypic characterization of QKY phosphomutants

The pCAMBIA2300-based pQKY::mCherry:QKY construct was described previously⁴². The pQKY::mCherry:QKY(S262A) and pQKY::mCherry:QKY(S262E) plasmids were obtained using the Q5 site-directed mutagenesis kit (NEB, E0554S) according to the manufacturer’s recommendation. Primers are Q5SDM S262A_F and Q5SDM S262A_R for pQKY::mCherry:QKY(S262A) and Q5SDM S262E_F and Q5SDM S262E_R for pQKY::mCherry:QKY(S262E) (Supplementary Data 11). All PCR-based constructs were confirmed by sequencing. Floral tissue for quantitative PCR (qPCR) was collected from plants grown under long day conditions. With minor changes, tissue collection, RNA extraction and qPCR analysis were performed as previously described^118,119. For detection of mCherry:QKY expression, qPCR analysis was performed with primers mCherry-qRT-For and mCherry-qRT-Rev (Supplementary Data 11). Average mCherry:QKY expression was calculated relative to the expression of three control genes (AT4G33380, AT2G28390 and AT5G46630) (Supplementary Data 11) and normalized to the wild-type control for visualization. A. thaliana (L) Heynh. var. Landsberg erecta (Ler) was used as wild-type strain. The likely null allele qky-9 (Ler) has previously been described⁴³. qky-9 mutant plants were transformed using Agrobacterium strain GV3101/pMP90¹²⁰ and the floral dip method⁴⁸. Transgenic T₁ plants were selected on kanamycin (50 μg ml⁻¹) and transferred to soil for further inspection. Plants were grown as previously described⁴³. In total, 4 out of 17 independent T₁ transformants showed a wild-type phenotype for the S262A transgenic line, whereas 13 out of 17 T₁ transformants displayed no (7 out of 17) or partial (6 out of 17) rescue, a notable decrease in functionality compared to the wild-type construct (14 out of 17 rescue, 2 out of 17 partial rescue, 1 out of 17 no rescue). For the S262E transgene, 11 out of 13 independent T₁ transformants showed a wild-type phenotype (1 out of 13 partial rescue, 1 out of 13 no rescue).

Floral organs were imaged using a Leica SAPO stereo microscope equipped with a digital MC 170 HD camera (Leica Microsystems GmbH). Images were adjusted for colour and contrast using ImageJ/Fiji software¹²¹. Confocal laser scanning microscopy of qky-9 pQKY::mCHerry:QKY, qky-9 pQKY::mCHerry:QKY(S262A) and qky-9 pQKY::mCHerry:QKY(S262E) six-day-old seedling roots was performed with an Olympus FV1000 setup using an inverted IX81 stand and FluoView software (FV10-ASW v.01.04.00.09) (Olympus Europa) equipped with a water-corrected 40× objective (NA 0.9) at 3× digital zoom. Confocal high sensitivity detection was used involving two gallium arsenide phosphide photomultipliers mounted equidistantly to the probe. The experiment was performed twice independently with 2 and 12 roots for qky-9 pQKY::mCHerry:QKY, 6 and 15 roots for qky-9 pQKY::mCHerry:QKY(S262A) and 15 roots for qky-9 pQKY::mCHerry:QKY(S262E).

Locus identifiers

Gene locus identifiers are listed for all gene names mentioned in the manuscript (Supplementary Data 11).

Reporting summary

Further information on research design is available in the Nature Research Reporting Summary linked to this paper.

Mass-spectrometry-based draft of the Arabidopsis proteome

Edited Transcript of SCCO earnings conference call or presentation 26-Feb-20 3:00pm GMT

Pharmaceutical Packaging: Bormioli Pharma Speeds up on Innovation Together with H-Farm

Pharmaceutical Packaging: Bormioli Pharma Speeds up on Innovation Together with H-Farm

Category

HPIN International Financial Platform Becomes a New Benchmark for India’s Digital Economy

Top 10 Market Research Companies in the world

3 Best Market Research Certifications in High Demand