Genetic diversity of 16S rRNA and mcrA genes from methanogenic archaeas

Methanogenic archaeas are found in aquatic and terrestrial environments and are fundamental in the conversion of organic matter into methane, a gas that has a potential use as renewable source of energy, which is also considered as one of the main agents of the greenhouse effect. The vast majority of microbial genomes can be identified by a conservative molecular marker, the 16S ribosomal gene. However, the mcrA gene have been using in studies of methanogenic archaea diversity as an alternative marker, highly conserved and present only in methanogens. This gene allows the expression of the enzyme Methyl-coenzyme M reductase, the main agent in converting by-products of anaerobic digestion into methane. In this context, we aimed to study the genetic diversity of mcrA and 16S rRNA genes sequences available in databases. The nucleotide sequences were selected from the NCBI. The heterozygosity and molecular diversity indexes were calculated using the Arlequin 3.5 software, with plots generated by package R v3.0. The diversity and heterozygosity indices for both genes may have been influenced by the number and size of the sequences. Descriptive analysis of genetic diversity generated by sequences deposited in databases allowed a detailed study of these molecules. It is known that the organisms in a population are genetically distinct, and that, despite having similarities in their gene composition, the differences are essential for their adaptation to different environments.


Introduction
Archaea is a domain present in the tree of life. This group, considered polymorphic, is directly involved in the anaerobic digestion process and is responsible for performing methanogenesis steps. Hydrogenotrophic methanogenic archaea are capable of converting hydrogen and carbon dioxide into methane, while acetoclastic methanogenic archaea converts acetate to methane (Amaral, Steinmetz, & Kunz, 2019).
Studies of these organisms are carried out mainly through metagenomics, because cultivation in laboratory is difficult. However, recent studies accomplished by Japanese scientists (Imachi et al., 2020) isolated and cultivated, after almost fifteen years, in the laboratory an Asgard archaeon strain extracted from a deep marine sediment.
The current knowledge about archaea is due to the development of a way to study environmental samples without the need to cultivate microorganisms. The first application of metagenomics with sequencing was performed in the 70's decade, from which several techniques were developed and refined, in order to obtain increasingly reliable analyzes. There is currently a wide range of metagenomic data that can be accessed at several databases, such as NCBI -(National Center for Biotechnology Information), and which allows various microbiome studies to be performed.
Within the Archaea group, methanogenic archaeas are found in aquatic and terrestrial environments, which are fundamental in the conversion of organic matter into methane, a gas that has a potential use as renewable source of energy, which is also considered as one of the main agents of the greenhouse effect. Environmental sequence of data suggests that they are common constituents of anaerobic environments across the globe, ranging from hypersaline to freshwater sediments, acidic to alkaline soils, invertebrate to vertebrate gastrointestinal tracts, industrial bioreactors, hydrocarbon-rich deposits, and rice paddies at temperatures ranging from extreme thermal to near freezing conditions ( Figure 1) (Evans et al., 2019).
In this context, genetic and evolutionary studies of various types of environmental microorganisms might be performed through a series of molecular genetics techniques. Metagenomics allowed the acquisition of microbial DNA without the need for culture isolation and became the main technique for the study of microbial community genomes of samples obtained directly from the environment (Handelsman, 2004). Archaeal genome is a circular double-stranded DNA molecule -l. 9 Mbp in length, or -45% the size of the E. coli genome. Methanogen genomic DNA range overall from 26 to 68 mol% G+C, although intergenic regions are frequently more A+T rich than the average value for the other archaeal genomes (Reeve, 1992).
The vast majority of microbial genomes can be identified by a conservative molecular marker, the 16S ribosomal gene. Through cloning and sequencing techniques, the 16S rRNA gene becomes a great ally in studies on microbial diversity (Handelsman, 2004).
However, the mcrA gene have been using in studies of methanogenic archaea diversity as an alternative marker, highly conserved and present only in methanogens. This gene allows the expression of the enzyme Methyl-coenzyme M reductase, the main agent in converting by-products of anaerobic digestion into methane. One of the advantages of mcrA gene is that only one or two copies of mcrA have been found in sequenced methanogens genomes, making it a more precise tool for estimating the number of these archaeas in the environment than the 16S rRNA gene, which can have up to four copies per genome (Lee, Kim, Hwang, O'Flaherty, & Hwang, 2009). Thus, mcrA has been replacing the 16S rRNA gene in the study of methanogenic diversity and phylogenetics (Hallam, Girguis, Preston, Richardson, & DeLong, 2003;Evans et al., 2019).
Molecular studies of these environmental samples allow the understanding microbiomes and their interactions. Bioinformatics analysis will allow a greater functional and structural understanding of the microbiome. In this context, we aimed to study the genetic diversity of mcrA and 16S rRNA genes sequences available in databases.

Material and methods
Sequences retrieval mcrA and 16S rRNA sequences genes were obtained from NCBI -(National Center for Biotechnology Information) https://www.ncbi.nlm.nih.gov/taxonomy, with the keyword "Methanobacteria" selecting the mcrA and 16S sequences. The resulting sequences were classified according to information available from NCBI and were used in FASTA format. Acta Scientiarum. Biological Sciences, v. 42, e49877, 2020 Sequence alignment was carried out using BioEdit v.7.2.6 (http://www.mbio.ncsu.edu/BioEdit/ bioedit.html). Then, the sequences of each gene were grouped relating to the country of collection and its environment (Supplementary data). The nucleotide compositions of the groups were verified on MEGA-X (https://www.megasoftware.net/).

Descriptive statistics of genetic diversity
The analysis of genetic diversity was performed with the software ARLEQUIN v3.5X (Excoffier & Lischer, 2010). The sequences were organized into groups and converted into .arp file using the software DNA sequence polymorphism (DnaSP v6) (Rozas et al., 2017). The .arp format is used as the input format to ARLEQUIN. The data generated were analyzed and shown as plots of Heterozygosity and Molecular diversity index using the statistical ARLEQUIN for R package v3.0.

Results and discussion
We found 59 sequences for the mcrA gene, but after an analysis of them, 56 were selected for presenting complete data of their origin, totalizing 14 different origins. Regarding the 16S gene, of the 27 found, 25 were selected, showing six types of environments.
These sequences were organized into groups, in which from the 56 sequences, 11 groups of mcrA gene were formed (Table 1); whereas from 25 sequences for 16S, five groups of 16S rRNA gene (Table 2) were constituted. After aligning the gene sequences into their groups, their nucleotide composition was conducted using MEGA-X.
In the mcrA groups (Table 1), the largest sequence number (14) belonged to the mesophilic sludge, with an average sequence size of 454.7 base pairs (bp). However, its GC content (49.9%) was not the largest among the groups. Thermal Waters group, with only two sequences in its composition, showed larger CG content (51.6%) in mcrA groups.
The Pool group with the largest average size (760.7 bp showed 46.3% of CG. The groups belonging to Marine Sediment, Terrestrial Sediment and Peninsula (two sequences belonging to the group) showed the lowest percentages of CG, being 38.9%, 39.2% and 42.6%, respectively. In the 16S groups (Table 2), the JP Oil group (only one sequence) had an average of 1452 bp, with 59.2% GC, representing the second largest percentage. The BR Oil group (three sequences) had an average of 258 bp and the largest amount of CG content (59.8%). The Maize Reactor group, despite having the largest number of sequences composing it (13 sequences, with an average of 1293.9 bp), showed the lowest CG content, 55.8%.
According to Yakovchuk, Protozanova, and Frank-Kamenetskii (2006), DNA sequences with low GC content are less stable than with high GC content. This stability is not dependent of the three hydrogen bonds, but is largely due to the base stacking.
It is also known that in PCR experiments, the GC content of the primers is counted by the software to calculate the annealing temperature to the DNA template. An elevated GC content indicates a relatively higher annealing temperature (Dieffenbach, Lowe, & Dveksler, 1993).
The expected heterozygosity (He) is defined as an estimated fraction of all individuals who may be heterozygous at a randomly chosen locus. Groups with a He close to one, indicate that the alleles present are very different within their group. However, when the sequences are more similar to each other in a group, the closest to zero will be the value, indicating a low heterozygosity (Nei, 1978).
The Figure 2 shows the expected heterozygosity (He) of the mcrA gene sequence groups from the haplotypes (each base position in the sequences) present in each cluster.
The Beer group (a) had a He average of 0.669 with 142 polymorphic loci. Thermal Waters (b) showed 284 nucleotides with polymorphism, varying in several positions within the sequence, reaching a high He, with average 1; the same value found in Peninsula (d), which may have been caused by introducing gaps (spaces) during alignment. Heterozygosity averages for groups c, f, i, j and k was probably due to the presence of more similar sequences in alignment. Despite higher nucleotide content, the bases were repeated in the analyzed positions, showing a low divergence (Figure 2).
The molecular diversity index among the mcrA groups is shown in Figure 3. This index corresponds to the expected heterozygosity when having diploid information, that is, the probability that two randomly chosen haplotypes are different in the sample, which are calculations from Theta (θ), a population parameter of genetic differentiation (Zolet, Segatto, Turchetto, Silva, & Freitas, 2013). The blue line (θS) in the Figures 3 and Figure 5 is an estimate among the equilibrium relationship at an infinite locus (Watterson, 1975) and between the number of loci that suffered segregation (S), sample size (n) and theta (θ) for a non-recombinant DNA sample.

Molecular diversity indices
The green line (θπ) is an estimated value from the equilibrium relationship between an infinite locus, the average number of differences between pairs (π) and theta (θ) (Tajima, 1983). The black and red lines do not apply, and they were not estimated for this type of DNA sequence study. The black line (θk) is an estimate from the equilibrium ratio of an infinite allele (Ewens, 1972) among the expected numbers of alleles, while in red (θH) calculates the relationship from homozygosity expected in an equilibrium population and between drift and mutation (Zolet, Segatto, Turchetto, Silva, & Freitas, 2013).
Based on the molecular diversity (Figure 3), the same pattern found in the results of (He) was observed, with Thermal waters and Mesophilic sludge showing higher diversity indexes among their sequences (θS), besides a higher average in differentiation among pairs (θπ).
It is worth noting that in the diversity index, the gaps end up being analyzed, because the concept of molecular diversity is to estimate the possibility of two alleles being different from each other, so a nucleotide will always be different from a gap (Excoffier & Lischer, 2010). The lowest values of θ S and θ π were found in the terrestrial and marine sediment groups, corroborating the low heterozygosity found in Figure 2.
For the data generated from the 16S rRNA gene sequences (Figure 4), only the Seafloor Mud (a), Maize Reactor (b) and BR Oil (c) groups were submitted to the diversity test, as they showed two or more sequences in the group, allowing an analysis of expected heterozygosity intragroup. The groups of Seafloor Mud (a) (composed of seven sequences) and Maize Reactor (b) (13 sequences) showed more divergent loci; however, the expected average heterozygosity was low, with a value of 0.460 and 0.376, respectively.

Expected heterozygosity
The value found in BR Oil (c) (three sequences) was 0.666 representing a higher average heterozygosity compared to the other groups. The observed pattern in (He) of the mcrA gene was also found for the 16S rRNA gene groups, i.e., groups with a higher number of sequences showed more similar positions, with lower heterozygosity.
Yet, the molecular diversity index for the 16S rRNA gene did not show the same pattern obtained in Figure 5. The Maize Reactor group had a high number of sequences in its group and with the largest size, with an average of 1293.9 bp, which significantly interfered with θπ, since it calculates the largest average number of differences among pairs. In the group BR Oil, despite its high He (0.666), the average size of its sequences was 258 bp, with gaps present in the alignment that repeated in several positions, not indicating different haplotypes in the sample ( Figure 5).

Conclusion
Descriptive analysis of genetic diversity generated by sequences deposited in databases allowed a detailed study of these molecules. It is known that the organisms in population are genetically distinct, and that, despite having similarities in their gene composition, the differences are essential for their adaptation to different environments.
Our findings for 16S and mcrA sequences analyzed in this study suggest that groups with greater number of sequences, and longer base pair length were more similar; in the group with smaller base-pair sequences, automatic insertion of gaps by the tool used may have underestimated the showed differences.
Genetic diversity aims to understand the emergence and disappearance of alleles, due to spontaneous mutations or not. From population genomic studies, as in the present work, it is possible to establish a range of information about evolutionary processes, interactions between organisms and the environment, thus strengthening the context of population genetics.
Because archaeas are present in many environments and most often live in extremophilic conditions, understanding and contextualizing the vastness of molecular data that is added to databases (Big Data Era) becomes critical so that this information can be transformed into applicable knowledge.