Big Data: truth, quasi-truth or post-truth?

In this paper we investigate whether sentences presented as the result of the application of statistical models and artificial intelligence to large volumes of data – the so-called 'Big Data' – can be characterized as semantically true, as quasi-true, or only as probably quasi-false and, in a certain way, post-true; that is, whether, in the context of Big Data, the representation of a data domain can be configured as a total structure, as a partial structure provided with a set of sentences assumed to be true, or whether it cannot be configured even as such a partial structure.


Introduction
In the controversial article The end of theory: will the data deluge make the scientific method obsolete? (2008), Anderson makes some quite ambitious claims about the future of knowledge after the advent of Big Data or, in his own words, in the 'Petabyte Age'. His main claims are summarized in his conclusion to the text: The new availability of huge amounts of data, along with the statistical tools to crunch these numbers, offers a whole new way of understanding the world. Correlation supersedes causation, and science can advance even without coherent models, unified theories, or really any mechanistic explanation at all (Anderson, 2008).
Mayer-Schönberger and Cukier, authors of Big Data: a revolution that will transform how we live, work and think (2013), one of the best regarded works on the subject, agree with Anderson's thesis that statistical models based on Big Data may outperform the models of traditional science, based on hypotheses: Big data is about what, not why. We don't always need to know the cause of a phenomenon; rather, we can let data speak for itself. Before big data, our analysis was usually limited to testing a small number of hypotheses that we defined well before we even collected the data. When we let the data speak, we can make connections that we had never thought existed. (Mayer-Schönberger & Cukier, 2013, p. 14).
In this paper, we will investigate in which circumstances models based on Big Data would succeed, as their enthusiasts argue, in producing knowledge as efficiently as or even more efficiently than the models of traditional science, in which circumstances such models would fail to produce knowledge, and which of those circumstances are more commonly found in reality. To do so, we will analyze whether the sentences resulting from the application of hypothetical models based on Big Data can be characterized as true, more specifically as semantically true, in the sense formalized by Tarski (1944); as pragmatically true or quasi-true, in the sense proposed by da Costa and his collaborators (Mikenberg, Da Costa, & Chuaqui, 1986); or whether such sentences can only be characterized as post-true, in the sense popularized by the Oxford Dictionaries (2016).
That is, we will try to understand whether, in the context of Big Data, the representation of a data domain and of the correlations between those data can be configured as a structure, that is, as a (non-empty) set of elements provided with relations between them; as a partial – simple pragmatic – structure, that is, as a (non-empty) set of elements provided with partial relations between them and with primary sentences assumed to be true; or whether such representation cannot be configured as a simple pragmatic structure, so that the sentences resulting from it can be characterized as neither true nor quasi-true and must therefore be relegated to the category of post-truth.
Thus, in each of this paper's sections, we will analyze abstract models based on Big Data whose data are provided by hypothetical digital social networks. In section 1, 'Big Data and truth', we will analyze a model that is more efficient in producing knowledge than traditional scientific models, whose results would constitute true sentences, but whose social network, which provides the data to the model, has features almost impossible to obtain in reality. Next, in section 2, 'Big Data and quasi-truth', we will analyze a model that is as efficient in producing knowledge as traditional scientific models, whose results would constitute quasi-true sentences, and whose social network has features that could be obtained in reality. Finally, in section 3, 'Big Data and post-truth', we will analyze a model that is inefficient in producing knowledge, whose results would constitute, in a certain way, post-true sentences, and whose social network has features similar to those found in reality today, being closer to real examples of digital social networks frequently used to feed statistical models based on Big Data, like Facebook and Twitter.

Big Data and truth
Let us picture a statistical model based on Big Data whose structure is composed of a data collection provided by a digital social network to which every adult citizen of a given nation is obligatorily registered. Every citizen must also provide this social network with many of his or her personal data, such as gender, age, education, job, etc., as well as all of his or her connections to other citizens, characterizing them in categories such as family, friends, work colleagues, etc. In this social network every citizen must also post one picture a day, chosen from a previously created collection provided by the platform itself. Once posted, this picture will be visualized by every citizen connected to the one who posted it, and those may share it, so it will be visualized by every citizen connected to the one who shared it, and those may also share it, and so on.
For the purposes of our investigation, we suppose that the users of this social network, who constitute the totality of the adult population of a given nation, submit all of the data requested and are totally honest when providing information and when expressing their daily feelings by posting and sharing pictures; that is, they neither omit nor falsify any of their personal data, and they seek to express their real opinions and feelings when posting and sharing.
The reality is captured by this 'model' through a structure whose universe consists of the set of individuals that represent the nation's citizens, joined with the set of elements that correspond to the pictures provided by the platform, the set of elements that correspond to the registration data of the individuals (like sex, age, etc.) and the set of elements that correspond to the metadata of the actions of posting and sharing (date, time, etc.); and whose relations correspond to relationships of many kinds: between citizens (relations of affiliation, of fraternity, etc.), between citizens and their data (the relation between an individual and his or her age, education, etc.), between citizens and pictures (relations of posting and of sharing), between posted pictures (the relation of contiguity in time between two pictures), and between posted pictures and their metadata (the relation between a picture and the date and time of its posting); among which some relations are of a functional character (like the function that associates an age to an individual, for example).
Such a structure corresponds to a language capable of generating sentences about it, which, in turn, can correspond to a metalanguage capable of generating sentences about its sentences and that also includes semantic concepts, like those of truth and falsehood. This way, the prerequisites presented by Tarski for the notion of truth to be rigorously defined are fulfilled, namely, by the schema of equivalences of the form (T): "X is true if, and only if, p. […]" (Tarski, 1944, p. 344), in which p is a sentence and X is the name of a sentence. Truth is, therefore, defined only relative to a given structure, by all instances of the equivalence of the form (T). To put it more formally: given a sentence A of the language, one can say that A is true in the structure considered, according to a given interpretation i, if the interpretation of A according to i is verified in the structure; and one can say that A is 'valid' in the structure considered if A is true in the structure according to all interpretations i.
Thus, provided with this structure, the 'model' can generate as results particular sentences about the reality it captures, such as 'The citizen c2 shared the picture p1 posted by the citizen c1', a sentence that is true if and only if the relation of sharing of picture p1 by citizen c2 occurs and the relation of posting of picture p1 by citizen c1 also occurs; as well as general sentences about that same reality, such as 'Picture p1 is posted and shared more frequently by male citizens between 18 and 30 years old', a sentence that is true if and only if all the relations implied are verified to occur.
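The Tarskian evaluation of such a particular sentence can be illustrated in a minimal sketch. The citizens, pictures and relations below are hypothetical toy data, not drawn from any real network; the point is only that, in a total structure, the truth of the sentence reduces to checking whether the relevant pairs belong to the relations.

```python
# Toy total structure: a universe of citizens and pictures, with total
# relations of posting and sharing (every pair's status is determined).
citizens = {"c1", "c2", "c3"}
pictures = {"p1", "p2"}

posted = {("c1", "p1"), ("c3", "p2")}  # (citizen, picture) pairs that hold
shared = {("c2", "p1")}                # (citizen, picture) pairs that hold

# 'The citizen c2 shared the picture p1 posted by the citizen c1' is true
# in this structure if and only if both atomic relations occur.
sentence = ("c2", "p1") in shared and ("c1", "p1") in posted
print(sentence)  # True
```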
From general sentences like the one mentioned above and from the identification of patterns of correlation and trends over time, through the application of statistical tools, the model can also generate predictions about the reality it models. We could thus assume that the model presented in this section would make predictions about the relationships and feelings of this nation's citizens with more precision than traditional scientific models, whose hypotheses have been tested on small samples of the population, even though it does not have the support of a theoretical model aimed at explaining the causes that make the citizens relate and express themselves in certain ways.
This way, in order for a model based on Big Data to obtain results more satisfactory than those of traditional science, it would be necessary, in our understanding, that its data collection represent every possible instance of the modelled phenomena, as in the example we described, as well as every possible relation between those instances.
In fact, at least according to Mayer-Schönberger and Cukier, analyzing the totality of the occurrences of the phenomena under investigation, instead of just samples of those phenomena – what they call 'N=all' – would be the goal of the models based on Big Data (Mayer-Schönberger & Cukier, 2013, p. 26-31). However, such a model, whose data domain reflects with precision the totality of the reality it models, is an idealization hardly found in reality.

Big Data and quasi-truth
Let us picture now a model based on Big Data whose data collection is provided by a digital social network to which a portion of the citizens of a given nation are registered. Those users may or may not provide this social network with their personal data, as well as with their connections to other users. In this social network each user may also post texts and pictures, which once posted may be visualized by the users connected to the one who posted them, and those may evaluate, comment or share those posts, so they will be visualized by the users connected to those who evaluated, commented or shared them, and those may also evaluate, comment or share them, and so on.
Unlike in the model of section one, we suppose that the users of this social network do not always submit all of the data requested and are not totally honest when providing information and when expressing their feelings by posting, evaluating, commenting and sharing; that is, they occasionally omit or falsify their personal data and do not always seek to express their real opinions and feelings when posting and sharing.
The reality captured by this model can also be represented in abstract form by a structure similar to that of the model in the first section. Such a structure clearly reflects the totality of the users of the social network and of the interactions between them within the social network; however, it in no way reflects the totality of the relations, feelings and opinions, whether of the individuals registered to the network or of the citizens of the nation, since, as we supposed, the nation's citizens are not required to register to the network, nor are they required to provide all of their data and, when they do, they frequently provide incomplete or even false information. For such reason, we propose here an analysis based on the concepts of partial structure and quasi-truth.
In a total structure, the relations between the elements are total; that is, given a binary relation, for example, we can determine which pairs of elements satisfy the given relation and which do not. In a partial structure, in turn, the relations between the elements are only partial; that is, given a binary relation, for example, we can determine that some pairs of elements are in the given relation and that some are not, but we cannot determine, for a given portion of the pairs, whether they are in the relation or not. Thus, the introduction of the notions of partial relation and partial structure allows one to formally represent an incomplete domain of knowledge (D'Ottaviano & Hifume, 2007; D'Ottaviano, 2010). However, we must clarify that every partial relation can be extended to become a total relation: The 'partialness' of the relation is of 'epistemic' order and represents the incompleteness of our knowledge of the domain under investigation. Thus, the relation of membership [...] does not have any diffuse character; that is, once known, it presents only two situations: elements either belong or do not belong to R. Thus, membership [...] is classical. (Hifume, 2003, p. 87, our translation, author's emphasis).
To put it more formally: in a total binary relation, given a domain of elements on which the relation applies, the ordered pairs of elements of the domain are divided into two sets, namely, the set of pairs that satisfy the relation and the set of pairs that do not. In a partial binary relation, the ordered pairs of elements of the domain are divided into three sets: the set of pairs that we know satisfy the relation, the set of pairs that we know do not satisfy the relation, and the set of pairs for which we do not know whether they satisfy the relation or not (when this last set is empty, the relation is total). A partial structure, in turn, is a domain provided with one or more partial relations.
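This three-way partition can be sketched directly. The domain and the known pairs below are hypothetical toy data; the sketch only exhibits how the undetermined pairs fall out as the complement of the known-positive and known-negative sets.

```python
from itertools import product

# Toy domain with a binary partial relation R, represented by the pairs
# known to satisfy it and the pairs known not to satisfy it.
domain = {"u1", "u2", "u3"}
r_known_true = {("u1", "u2")}
r_known_false = {("u2", "u1"), ("u3", "u3")}

# The remaining ordered pairs have undetermined status; R is total
# exactly when this third set is empty.
all_pairs = set(product(domain, repeat=2))
r_unknown = all_pairs - r_known_true - r_known_false

print(len(all_pairs))  # 9 ordered pairs in total
print(len(r_unknown))  # 6 pairs remain undetermined
```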
Thus, since the structure is partial, the prerequisites presented by Tarski to define truth in this structure are not fulfilled, for the partialness of the relations does not allow the verification of all instances of the schema of equivalences of the form (T); that is, given a binary relation R, for instance, we can assert the equivalence ''x1 and x2 are in the relation R' is true if and only if x1 R x2' for the pairs whose status is known, but for some pairs of elements it is not possible to assert whether the relation occurs or not. For such reason, when dealing with partial structures, one must make use of the notion of pragmatic truth or quasi-truth, introduced in (Mikenberg, Da Costa, & Chuaqui, 1986) and studied in various works by da Costa and his collaborators. In order to define quasi-truth, however, one must first define the notions of simple pragmatic structure and A-normal structure.
A simple pragmatic structure is a partial structure provided with a set P of primary sentences assumed to be true, as basic truths. Such sentences constitute an empirical or theoretical framework that serves as a ground for the verification of the quasi-truth or quasi-falsehood of the sentences interpreted in the partial structure. This set P may be empty.
An A-normal structure, in turn, is a total structure that extends the simple pragmatic structure, in such a way that the partial relations of the simple pragmatic structure are extended to total relations and its primary sentences remain true. One must notice that: Of course, given a partial structure A, there may be several distinct A-normal structures B that extend A to a total structure. [...] The main idea, however, consists in requiring that the extension be done in such a way that it be consistent with certain accepted sentences P (this actually supplies a constraint for the admissible extensions) (Bueno & Souza, 1996, p. 188-189).
Once those notions are defined, we may at last define the notion of quasi-truth: given a language and a simple pragmatic structure in which the language is interpreted, one can say that a sentence of this language is quasi-true (or pragmatically true) in the simple pragmatic structure, according to an A-normal structure that extends it, if the sentence is true, in the Tarskian sense, in the referred A-normal structure. Otherwise, the sentence is said to be quasi-false (or pragmatically false) in the simple pragmatic structure according to the referred A-normal structure (Da Costa, 1997). That is, a sentence is quasi-true relative to a simple pragmatic structure if it is true, in the Tarskian sense, in some A-normal structure that extends the referred simple pragmatic structure and, consequently, if it is consistent with its primary statements: [...] we may say that in order for ϕ to be pragmatically true in the partial structure, ϕ must satisfy the following condition: all logical consequences of ϕ or of ϕ plus true primary statements cannot be incompatible with the true primary statements. Otherwise, ϕ is pragmatically false. Loosely speaking, if ϕ is pragmatically true it saves appearances. (Mikenberg, Da Costa, & Chuaqui, 1986, p. 204).
For this reason it is said of a sentence that it is quasi-true if it behaves as if it was true relatively to a certain domain: [...] to say that a proposition P is quasi-true in a domain D means to say that things happen in D as if P were true in D in the sense of the correspondence theory. In other words, P 'saves the appearances' in D. (Krause, 2009, p. 114, our translation, author's emphasis).
As we have already made explicit, given a partial structure 'A', it is possible to generate, through different sets of primary sentences, different simple pragmatic structures; and, given a simple pragmatic structure, it is possible to generate, through different extensions of its partial relations, different A-normal structures. In one A-normal structure, a given relation may be satisfied by all elements, thus making a certain universal sentence about that relation true, while in another A-normal structure there may be elements that do not satisfy it, thus making the same sentence false. That is, two contradictory sentences may both be quasi-true relative to the same simple pragmatic structure, although only one of the two is true in each A-normal structure. For this reason, the logic of quasi-truth is a paraconsistent one, in which the principle of 'Ex Falso Sequitur Quodlibet', contemporaneously known as the 'Principle of Explosion', is not valid; that is, the presence of a contradiction in the system does not trivialize it (D'Ottaviano & Hifume, 2007; D'Ottaviano, 2010; Silvestrini, 2016).
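That two contradictory sentences can both be quasi-true can be checked by brute force in a toy case. The domain and the partial unary predicate below are hypothetical, and we take the set P of primary sentences to be empty, so that every total extension counts as an admissible A-normal structure; a sentence is then quasi-true when it is true in at least one such extension.

```python
from itertools import product

# Partial unary predicate P over a toy domain: u1 is known to satisfy P,
# and the status of u2 and u3 is unknown.
domain = ["u1", "u2", "u3"]
known_true = {"u1"}
unknown = ["u2", "u3"]

def extensions():
    """Yield every total extension of the partial predicate: each choice of
    truth values for the undetermined elements gives one total structure."""
    for bits in product([True, False], repeat=len(unknown)):
        yield known_true | {x for x, b in zip(unknown, bits) if b}

def quasi_true(sentence):
    """A sentence is quasi-true (with empty P) if it is true, in the
    Tarskian sense, in at least one total extension."""
    return any(sentence(ext) for ext in extensions())

forall_p = lambda ext: all(x in ext for x in domain)          # 'every x is P'
not_forall_p = lambda ext: any(x not in ext for x in domain)  # its negation

print(quasi_true(forall_p))      # True: the extension where u2 and u3 are P
print(quasi_true(not_forall_p))  # True: any extension where some x is not P
```

Each extension verifies exactly one of the two sentences, yet both are quasi-true relative to the same partial structure; this is the situation that motivates a paraconsistent logic of quasi-truth.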
Therefore, the notion of pragmatic truth has the goal of reflecting the reality of scientific practice, in which different models of the same phenomena – like those of physics – coexist, even though they are mutually incompatible, for each model works as if it were true in a certain context, even if it does not reflect with exactness every aspect of the modelled phenomena. That does not imply that different models cannot at some moment be unified, the "'Complete' or 'absolute' truth" being "[...] the (ideal) terminus of all inquiry" (Da Costa & French, 2003, p. 16, author's emphasis). Thus: [...] to accept a theory is to be committed, not to believing it to be true per se, but to holding it as if it were true, for the purposes of further elaboration, development and investigation. Thus acceptance involves belief that the theory is partially or pragmatically true only and this, we believe, corresponds to the fallibilistic attitude of scientists themselves. (Da Costa, Bueno, & French, 1998, p. 617).
Let us return to the model based on Big Data proposed in this section. Supplied with the partial structure provided by its data collection and with the language to which the structure corresponds, the model can generate true sentences as results if one has the conditions to identify every possible interaction between the users of the social network. However, if one does not have the conditions to identify every interaction (even if the goal were to represent the reality of the interactions between all of the nation's citizens, supposing that all of them were registered to this social network), then, by using the data collected from the social network, analyzed through statistical tools, it is possible to generate only quasi-true sentences as results; and, even so, it will frequently be necessary, as we have seen, to provide the model with basic sentences accepted as true, thus generating a simple pragmatic structure from the partial structure, and it will also be necessary to extend the simple pragmatic structure to one A-normal structure among the many possible ones, by extending the partial relations to total ones. One can thus deduce from our analysis that models based on Big Data, provided with data collections that can be configured only as partial structures, usually cannot do without hypotheses and theories assumed as true, belonging to the set of primary sentences P necessary to the constitution of a simple pragmatic structure. This is not what the Big Data enthusiasts defend, though: This is a world where massive amounts of data and applied mathematics replace every other tool that might be brought to bear. [...] With enough data, the numbers speak for themselves. (Anderson, 2008).

Big Data and post-truth
Let us picture, finally, a model based on Big Data whose data collection is provided by a digital social network similar to the one described in the previous section, in which each user may post texts and pictures, which once posted may be visualized by the users connected to the one who posted them, and those may evaluate, comment or share those posts, so they will be visualized by the users connected to those who evaluated, commented or shared them, and those may also evaluate, comment or share them, and so on.
As with the model of section two, we suppose that the users of this social network do not always submit all of the data requested and are not totally honest when providing information and when expressing their feelings by posting, evaluating, commenting and sharing. In addition, a portion of the profiles in this social network are not personal profiles, but institutional or fictional ones, controlled by groups of individuals, or profiles that are supposedly personal but are controlled by artificial intelligence. Some users control more than one profile. Finally, some users have profiles that are active daily or weekly, while other profiles are seldom accessed by their users, and others are completely inactive.
As in the previous model, this data collection can, at best, be configured as a partial structure. However, we also assume that the model is not grounded in any hypotheses or theories; therefore, the set of primary sentences P is empty. This way, the model can constitute a simple pragmatic structure, which can be extended to an A-normal structure, and it can therefore generate quasi-true sentences; however, since the model does not assume as basic truths the results of any already established scientific theories, once we add such results to the set of primary sentences P of its simple pragmatic structure, we may contradict the results of the model, making them quasi-false. That is, not being equipped with such theories, models like this one cannot guarantee that their results will not be revealed as merely quasi-false when confronted with facts already accepted by the scientific community. It seems to us that models like this one do exist in reality. In fact, many models based on Big Data use data from social networks such as Facebook and Twitter, which present various features similar to those of the model proposed in this section.
There are other quite problematic models, such as those that deal with predictive policing and criminal recidivism, one example being COMPAS (Correctional Offender Management Profiling for Alternative Sanctions), property of the private company Equivant (previously Northpointe). A study has shown that in that model, whose rate of success is merely around sixty percent, African-American defendants are frequently wrongly classified with a higher risk of recidivism, while Caucasian defendants are frequently wrongly classified with a lower risk of recidivism. Even though various criminal theories are cited in the official guide to the use of the platform (Northpointe, 2015, p. 5-6), the questionnaire that feeds the system with data from the defendants contains questions like "How many of your friends/acquaintances are taking drugs illegally?" and "How often did you get in fights while at school?". The questionnaire also asks people to agree or disagree with statements such as "A hungry person has a right to steal", among other questions of a subjective character.
In 2016, the Oxford Dictionaries chose 'post-truth' as word of the year, defined as "Relating to or denoting circumstances in which objective facts are less influential in shaping public opinion than appeals to emotion and personal belief" (Oxford Dictionaries, 2016). Models based on Big Data that are poorly grounded, or not grounded at all, in well-established scientific theories, and that present inaccurate or biased results, which reflect preconceived opinions and are easily falsifiable by facts widely accepted as such by the scientific community, could, in our understanding, have their resulting sentences categorized as post-true: such sentences are quasi-true relative to their simple pragmatic structures, but are probably quasi-false relative to simple pragmatic structures whose primary sentences include scientific results, and, nevertheless, they are presented as if they were true and scientific.

Final considerations
In conclusion, we maintain that statistical models based on Big Data may produce results more precise than those produced by traditional science if their data collections in fact reflect with precision the totality of the instances of the modelled phenomenon, which may occur in controlled environments but hardly does in social networks; that such models may produce results as precise as those produced by traditional science when their data collections are not totally precise or exhaustive, provided that results obtained by traditional science constitute their theoretical framework; and that such models may be considered pseudo-scientific in case they fulfill neither of those prerequisites.