
About the automatic prediction of teenage pregnancies
After studying the methodology of the artificial intelligence system supposedly capable of predicting teenage pregnancies, cited by the Governor of Salta, Juan Manuel Urtubey, we found serious technical and conceptual errors that cast doubt on the reported results and compromise the use of the tool, especially on an issue as sensitive as this one.
On 4/11/2018, in the television program “El Diario de Mariana”, the Governor of Salta, Juan Manuel Urtubey, described an artificial intelligence system supposedly capable of predicting teenage pregnancies:
“Recently we launched a program with the Ministry of Early Childhood […] to prevent adolescent pregnancy using artificial intelligence, with a world-renowned software company with which we are carrying out a pilot plan. Today, with the technology available, you can foresee, five or six years in advance, with first name, last name and address, which girl, a future teenager, is 86% predestined to have a teenage pregnancy.”
Previously, on 3/20/2018, at the “Microsoft Data & AI Experience 2018” event, Urtubey had already mentioned this topic:
“The examples you referred to, the prevention of teenage pregnancy and the issue of school dropouts, are very clear examples of that. We have clearly identified, with first and last name, 397 cases of children who, we know, out of a universe of 3,000, will inexorably drop out of school. We have 490-odd, almost 500, cases of girls who, we know, we have to go look for today.”
Several news outlets linked these statements by Gov. Urtubey to a document available on GitHub, signed by Facundo Davancens, an employee of Microsoft Argentina. That document closes by thanking "the Ministry of Early Childhood of the Provincial Government of Salta" and "Microsoft".
After carefully studying the methodology detailed in that document, we found serious technical and conceptual errors that cast doubt on the results reported by Gov. Urtubey and compromise the use of the resulting tool, on an issue as sensitive as teenage pregnancy.
Below we list, briefly and in plain language, some of the most serious problems we found:
Problem 1: Artificially inflated results
The study details the following procedure:
1. Build a set of statistical rules to try to determine whether a teenager will have a pregnancy in the future.
2. Those rules are built from known data (the “training data”), so the statistical rules are made in the image and likeness of the training data.
3. Once the statistical rules are built, they must be tested on new, unseen data (the “evaluation data”), thereby measuring their “accuracy” (how often the predictions are correct).
The problem here is that the evaluation data (in step 3) include almost identical replicas of many of the training data. The reported results are therefore strongly overstated, leading to the erroneous conclusion that the prediction system works better than it actually does. (The addendum below gives more details on this problem.)
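The effect of such contamination can be illustrated with a minimal sketch (hypothetical data, not the study's; a 1-nearest-neighbour rule stands in for the statistical rules). When near-copies of training rows leak into the evaluation set, a model that has merely memorised the training data appears to predict perfectly, even when the labels carry no signal at all:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy data: 200 samples, 5 features, labels with no real signal at all.
X = rng.normal(size=(200, 5))
y = rng.integers(0, 2, size=200)

def nn_predict(X_train, y_train, X_test):
    """1-nearest-neighbour: a rule that simply memorises the training set."""
    d = ((X_test[:, None, :] - X_train[None, :, :]) ** 2).sum(axis=-1)
    return y_train[d.argmin(axis=1)]

# Correct protocol: evaluate on data never seen during training.
X_tr, y_tr, X_te, y_te = X[:150], y[:150], X[150:], y[150:]
acc_clean = (nn_predict(X_tr, y_tr, X_te) == y_te).mean()

# Contaminated protocol: the evaluation set contains near-replicas
# of training rows (as happens when oversampling precedes the split).
X_leak = X_tr[:50] + rng.normal(scale=1e-3, size=(50, 5))
acc_leak = (nn_predict(X_tr, y_tr, X_leak) == y_tr[:50]).mean()

print(f"held-out accuracy:     {acc_clean:.2f}")  # hovers around chance (~0.5)
print(f"contaminated accuracy: {acc_leak:.2f}")   # memorisation looks like skill
```

The accuracy on contaminated data says nothing about how the rules would perform on genuinely new adolescents, which is the only question that matters here.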
Problem 2: Potentially skewed data
The other problem, which is key and insurmountable, is that we strongly doubt the reliability of the data used in this study.
Data on adolescent pregnancies tend to be biased or incomplete, because the subject is sensitive and confidential, and the data are difficult to access. For example, in many families teenage pregnancies tend to be hidden, and even terminated clandestinely. The data used therefore risk including more adolescent pregnancies from certain sectors of society than from others.
Thus, even if the methodology used to build and evaluate the system were correct, statistical rules built on these data would yield erroneous conclusions, reflecting the distortions in the data.
Problem 3: Inadequate data
The data used were extracted from a survey of adolescents living in the province of Salta, containing personal information (age, ethnicity, country of origin, etc.), information about their environment (number of people they live with, whether they have hot water in the bathroom, etc.), and whether the respondent had completed, or was undergoing at the time of the survey, a pregnancy.
These data are not adequate to answer the question posed: whether an adolescent will have a pregnancy in the future (for example, in 5 or 6 years). For that, it would be necessary to have data collected 5 or 6 years before the pregnancy occurs.
With the current data, at best, the system could determine whether an adolescent has had, or currently has, a pregnancy. It is to be expected that the conditions and characteristics of an adolescent were very different 5 or 6 years earlier.
Conclusion
Both the methodological problems and the unreliable data pose the risk of misleading policymakers.
This case is an example of the dangers of treating the output of a computer as revealed truth. Artificial intelligence techniques are powerful and demand responsibility from those who employ them. In interdisciplinary fields such as this one, it should not be forgotten that they are just one more tool, to be complemented with others, and in no way a replacement for the knowledge and judgment of an expert, especially in fields that bear directly on public health and on vulnerable sectors.
Addendum: More details on Problem 1
The process used to obtain the reported results is technically incorrect: it violates a basic principle of machine learning, namely that the data on which the system is evaluated must be different from the data used to train it. If this principle is violated, that is, if training data contaminate the data on which the system is validated, the results will be invalid.
In the system described on GitHub by the author, the contamination of the evaluation data arises rather subtly. The system uses a method called SMOTE to balance the number of samples of each class. This method generates new "synthetic" samples by replicating each sample of the minority class (at risk of pregnancy, in this case) a number of times with small variations from the original. The problem is that the author performs this replication before splitting the data into training and evaluation sets. Since the split is done at random, it is very likely that a sample ends up in the training set while some of its replicas end up in the evaluation set. Evaluating on these replicated data overstates the accuracy value. Given this problem, it is impossible to know the true accuracy of the system.
This can be understood with an example. Suppose that, instead of the characteristics considered in this work (age, neighborhood, ethnicity, country of origin, etc.), we simply used each adolescent's first and last name. Clearly, a system with only that information as input could not learn to extrapolate and make decisions on new data. But if SMOTE is used as it was used here, the system could easily memorize the training data perfectly and then predict the evaluation data with very high accuracy, since it would contain replicas of those same names. In the case under study, first and last names are not used as input, but the characteristics that are used allow, on closer inspection, the same problem to occur. For example, a system that learns that a 16-year-old adolescent, who lives in the El Milagro neighborhood, Creole, without disabilities, of Argentine origin, with hot water in the bathroom, living with 4 people in a household whose head did not drop out of school, is at risk of teenage pregnancy will have no trouble predicting the class of almost identical replicas of those characteristics when they appear in the evaluation data. Because SMOTE was applied before dividing the data into training and evaluation sets, a high proportion of the minority-class samples seen in the evaluation will already have been seen during training, which results in an inflated accuracy value.
Note: at the time of writing this document, others had found and reported a very similar analysis, published on the same page as the original description of the prediction system. Link: https://github.com/facundod/case-studies/issues/2