Journalism and machine learning: a powerful tool for investigations

First of all, let’s make one thing clear: this post will not be about mystical or magical processes, nor about artificial intelligence. Instead, it will be about algorithms, data analysis, scraping, cleaning, etc. All of this with the aim of helping journalists to investigate.

Machine learning (ML) can be useful to journalists in their investigations, especially when they have to analyze a huge amount of data.
To better understand how, here are four concrete (and inspiring) examples.

Following them, I draw some lessons and recommendations on how journalists can use ML and I add links to other examples (not detailed here) to discover.


In March 2018, the Ukrainian investigative website Texty investigated illegal amber drilling in the northwest of the country using machine learning on satellite images from 2014 to 2016 (the period of high increase in drilling). The goal was to assess the environmental impact of these illegal drillings that destroy hectares of forest and agricultural land.

First step: what does an illegal drilling site look like on a satellite image? To answer this question, the journalists began by studying images and videos of legal amber drilling and by interviewing specialists.

Satellite images on which mining spots appear

Second step: find recent satellite images of the determined area (northwest of the country), with a high level of detail. The ones from Bing Maps were used because of the good API containing a lot of metadata. The search area covered 70,000 km2, which represents several hundred thousand images, impossible to check by hand.

The third step was to find the first images containing drilling traces. This work was done with the help of participants in the Open Data Day in Kiev and identified about 100 illegal mining locations; the model then identified about 100 more, which were verified by humans.

Satellite image divided into superpixels

The fourth step was to create a labelled dataset. This was done in two phases: first, the images of the places identified by humans were divided into superpixels (with SLIC). Then these superpixels were processed (with ResNet50) to extract features, which were then used to train the classifier model (XGBoost).

In the fifth step, the creation of the XGBoost classifier model, the dataset contained about 15,000 smaller images (superpixels) to train the classification algorithm, “of which 80% were for training the model and 20% for testing”, explains Anatoliy Bondarenko, journalist and developer of the team.
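The 80/20 split described above can be sketched in a few lines of plain Python. This is a minimal illustration, not Texty's code: the function name `split_train_test`, the seed, and the stand-in `(id, label)` pairs are all assumptions made for the example.

```python
import random

def split_train_test(items, test_fraction=0.2, seed=42):
    """Shuffle a copy of `items` and split it into (train, test) lists."""
    shuffled = list(items)
    # A fixed seed makes the split reproducible from one run to the next.
    random.Random(seed).shuffle(shuffled)
    n_test = round(len(shuffled) * test_fraction)
    return shuffled[n_test:], shuffled[:n_test]

# Hypothetical stand-in for the ~15,000 labelled superpixels: (id, label) pairs.
labelled = [(i, i % 2) for i in range(15_000)]
train, test = split_train_test(labelled)
print(len(train), len(test))  # 12000 3000
```

One design point worth keeping in mind: when many small tiles come from the same source image, splitting by source image rather than by tile helps avoid near-duplicates ending up on both sides of the split.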

After some adjustments to improve the results of the XGBoost algorithm, step six was to deploy it in order to classify the 450,000 satellite images, either with drilling traces or without drilling traces.

The computation time of the model on the 450,000 satellite images was about 100 hours.
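To put these figures in perspective, here is the back-of-the-envelope arithmetic, using only the two numbers given above (450,000 images, about 100 hours). It assumes the images were processed one after the other at a constant rate, which the source does not state.

```python
images = 450_000
hours = 100

# Average processing time per image, assuming sequential processing.
seconds_per_image = hours * 3600 / images
images_per_hour = images / hours

print(f"{seconds_per_image:.2f} s per image")   # 0.80 s per image
print(f"{images_per_hour:.0f} images per hour")  # 4500 images per hour
```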

In addition to this image analysis, feedback from readers in the region was extremely helpful: they pointed out a number of errors related to deforestation, which generated false positives in the results. These locations were removed from the training data, the model re-trained and the images reprocessed.

As a final step, the journalists created an interactive map of locations with evidence of illegal drilling and published a rich and detailed investigation.

Interactive map of illegal drilling in northwest Ukraine, produced by Texty based on their survey results

The project:
duration: 1 month
team: 5 people (1 journalist, 1 art director, 2 data journalists/developers and 1 person in charge of the model)
amount of data analyzed: 450,000 satellite images
programming language: Python
type of algorithms: unsupervised (SLIC) and supervised (XGBoost)

In 2020, the AI+ Automation Lab team of the German public radio station Bayerischer Rundfunk, together with teams from WDR and NDR, analyzed several million Facebook posts and comments published in 138 right-wing groups (“The hate machine”). In these groups, incitement to murder, Nazi imagery and Holocaust denial are common, in violation of German law.

At the beginning of the investigation, a source provided the journalists with a list of hundreds of far-right Facebook groups in which this illegal content was published. The journalists first had to infiltrate these groups to have access to the content.

The first step in this investigation, once the groups were infiltrated, was to retrieve the posts and comments to create a database for analysis. The scraping, carried out between September and November 2019, was complicated by the need to avoid being blocked by Facebook.

This scraping made it possible to obtain the texts and metadata, and also to generate screenshots of all the recovered comments and posts. The resulting database contains 2.6 million items published between 2010 and November 2019.

In the second step, with a simple keyword search, the journalists identified more than 10,000 cases of violation of German law. They labeled more than a thousand of them by hand.
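A simple keyword search of this kind could look like the sketch below. It is an illustration only: the function names, the placeholder keywords and the sample posts are invented for the example and are not the terms or code used in the investigation.

```python
import re

def build_matcher(keywords):
    """Compile one case-insensitive pattern matching any keyword as a whole word."""
    # Longest keywords first, so that longer phrases win over shorter ones.
    escaped = sorted((re.escape(k) for k in keywords), key=len, reverse=True)
    return re.compile(r"\b(?:" + "|".join(escaped) + r")\b", re.IGNORECASE)

def flag_posts(posts, keywords):
    """Return (post_id, matched_terms) for every post containing a keyword."""
    matcher = build_matcher(keywords)
    flagged = []
    for post_id, text in posts:
        hits = sorted({m.group(0).lower() for m in matcher.finditer(text)})
        if hits:
            flagged.append((post_id, hits))
    return flagged

# Neutral placeholder keywords and posts (hypothetical).
keywords = ["term_a", "term b"]
posts = [
    (1, "Nothing to see here."),
    (2, "This one mentions TERM_A twice: term_a."),
    (3, "A phrase like Term B also counts."),
    (4, "But term_abc is not a whole-word match."),
]
print(flag_posts(posts, keywords))
# [(2, ['term_a']), (3, ['term b'])]
```

A search like this only produces candidates: each hit still has to be read in context before it can be labeled.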

The most violent contents are mostly included in images and do not show up in a keyword search. Therefore, it was necessary to work on the images specifically.

The comments are sometimes more violent than the posts they are published under.

Third step. The journalists deconstructed the screenshots, first by extracting the images with ImageMagick (open source) and converting them into black/white images in order to separate the images from the text.

The analysis of the 400,000 images obtained then required the use of two different algorithms: one for facial recognition, the other for object recognition.

  • via facial recognition: the team again used an open source tool to find images of specific people, such as photos of Angela Merkel (often targeted) or Adolf Hitler
  • via object recognition: this time the team used Detectron2, an open source tool developed by Facebook. The algorithm was trained with about fifty objects and symbols illegal in Germany (swastikas or SS runes for example) from the database previously built. The results of this analysis were validated by hand by the journalists because they needed to know the context of the publication to determine whether it was legal or illegal. More than 100 illegal images were identified in the database created for the investigation.

Fourth step, the “classic” investigation to complete and contextualize this data analysis work. This involved interviews and exchanges with several experts, but also a request for access to certain official documents (FOIA).

According to Robert Schöffel, a journalist who worked on this investigation, journalists carrying out this type of work must have advanced technical skills and be able to communicate effectively with developers.

The project:
duration: several months
team: 11 people (3 developers and 8 journalists)
amount of data analyzed: 2.6 million posts and comments
programming language: Python
type of algorithm: semi-supervised

In 2019 in Peru, the investigative website Ojo Público sought to spot corruption in public procurement by assigning a risk score to companies based on different criteria. This is the Funes project, which is based on two previous investigations dealing with the private financing of political parties and a corruption scandal on the scale of the South American continent.

The Funes project looked at more than 245,000 contracts, most of them entered into without competition, between 2015 and 2018, whether at the national, regional, or local level. That’s a lot of contracts that are impossible to sift through by hand. The score assigned by the Funes algorithm tells journalists what deserves their attention first.

The first step in this work, which lasted more than 15 months, was to define what corruption was, to differentiate between collusion and illegal agreements, bribes, embezzlement, influence peddling, conflicts of interest, financial fraud… The team also had to agree on risk indicators.

The Ojo Público team relied on the work of Mihály Fazekas, who initiated a project that identifies indicators of corruption risk in the awarding of public contracts in Europe, and selected 22 elements, some of them specific to Peru.

The second step in developing Funes was to find data. This took several paths: open data, scraping, direct requests to institutions, and data already retrieved from a previous survey on private financing of political parties. This phase took a lot of time.

The third step was to organize the data. To do this, the team created a spreadsheet to list, qualify and document the data that was available and usable. Next, the data had to be cleaned and normalized because the databases came from different sources. The data was organized differently, sometimes with different spellings of the same name (sometimes written CCatcca, sometimes Ccatca). Not to mention the classic missing, erroneous or inaccurate data.
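To illustrate the kind of normalization this involves, here is a minimal sketch that groups spelling variants of the same name using only the Python standard library. It is not the Funes code (the project used R): the function names, the 0.85 similarity threshold and the sample list are assumptions for the example.

```python
import unicodedata
from difflib import SequenceMatcher

def canonical_key(name):
    """Lowercase, strip accents, and collapse whitespace."""
    decomposed = unicodedata.normalize("NFKD", name)
    no_accents = "".join(c for c in decomposed if not unicodedata.combining(c))
    return " ".join(no_accents.lower().split())

def group_variants(names, threshold=0.85):
    """Greedily group names whose canonical keys are near-identical."""
    groups = []  # list of (representative_key, [original names])
    for name in names:
        key = canonical_key(name)
        for rep_key, members in groups:
            # Compare with the first name seen in each group.
            if SequenceMatcher(None, rep_key, key).ratio() >= threshold:
                members.append(name)
                break
        else:
            groups.append((key, [name]))
    return [members for _, members in groups]

# Illustrative sample, including the two spellings mentioned above.
names = ["CCatcca", "Ccatca", "Apurímac", "APURIMAC", "Lima"]
print(group_variants(names))
# [['CCatcca', 'Ccatca'], ['Apurímac', 'APURIMAC'], ['Lima']]
```

Fuzzy grouping can also merge names that are genuinely different, so the proposed groups are best treated as suggestions to be reviewed by hand.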

Once the data was cleaned, documented and fully understood, the team proceeded with the first analyses with simple filters such as the top 10 companies that win the most government contracts. These first analyses allow us to identify anomalies and remarkable elements. This phase can also take the form of a graphical exploration of the data, to quickly identify outliers, i.e. data that is outside the norms of the corpus analyzed.
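These first analyses need very little code. The sketch below counts contracts per company and flags unusually large amounts with a modified z-score based on the median absolute deviation (a common convention, with a cutoff of 3.5). The company names, amounts and function names are invented for the example and do not come from the Funes data.

```python
from collections import Counter
from statistics import median

# Toy records (hypothetical): (company, contract_amount).
contracts = [
    ("Alfa SAC", 12_000), ("Alfa SAC", 15_000), ("Alfa SAC", 11_000),
    ("Beta SRL", 14_000), ("Beta SRL", 13_000),
    ("Gamma EIRL", 950_000),
    ("Delta SA", 10_000),
]

def top_companies(records, n=10):
    """Return the n companies with the most contracts, as (company, count)."""
    return Counter(company for company, _ in records).most_common(n)

def outliers_by_mad(records, cutoff=3.5):
    """Flag records whose modified z-score (median-based) exceeds the cutoff."""
    amounts = [amount for _, amount in records]
    med = median(amounts)
    mad = median(abs(amount - med) for amount in amounts)
    if mad == 0:
        return []  # no spread in the data, nothing can be flagged
    return [
        (company, amount)
        for company, amount in records
        if 0.6745 * abs(amount - med) / mad > cutoff
    ]

print(top_companies(contracts, n=2))  # [('Alfa SAC', 3), ('Beta SRL', 2)]
print(outliers_by_mad(contracts))     # [('Gamma EIRL', 950000)]
```

An outlier is not evidence of wrongdoing; it is only a record that deserves a closer look.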

Fourth step, the modeling was done from the model created by Fazekas. This involves creating the algorithm that will allow us to assign a risk score to each company, based on the indicators decided at the beginning of the project. This phase using machine learning includes a part of data for the training of the algorithm and a smaller part for the test (in general we divide the dataset in 80% for the training / 20% for the test). In the case of Funes, supervised learning was used.
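To make the idea of a per-company risk score concrete, here is a deliberately simplified sketch: each contract gets the share of red-flag indicators it triggers, and a company's score is the average over its contracts. This is not the Funes model, which relied on supervised learning; the indicator names, the equal weighting and the sample records are all assumptions for illustration.

```python
from collections import defaultdict

# Hypothetical binary indicators (1 = red flag present on the contract).
INDICATORS = ("single_bidder", "no_competition", "repeat_official_link")

contracts = [
    {"company": "Alfa SAC", "single_bidder": 1, "no_competition": 1, "repeat_official_link": 0},
    {"company": "Alfa SAC", "single_bidder": 1, "no_competition": 0, "repeat_official_link": 1},
    {"company": "Beta SRL", "single_bidder": 0, "no_competition": 0, "repeat_official_link": 0},
]

def contract_score(contract):
    """Share of indicators triggered by one contract, between 0 and 1."""
    return sum(contract[name] for name in INDICATORS) / len(INDICATORS)

def company_scores(records):
    """Average contract score per company."""
    per_company = defaultdict(list)
    for record in records:
        per_company[record["company"]].append(contract_score(record))
    return {company: sum(s) / len(s) for company, s in per_company.items()}

scores = company_scores(contracts)
# Highest-risk companies first: this is the order in which to review them.
ranked = sorted(scores.items(), key=lambda item: item[1], reverse=True)
print(ranked[0][0])  # Alfa SAC
```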

The first results are then manually checked to ensure that the algorithm is performing correctly, identify errors and correct them to adjust the code. When the results are deemed satisfactory, the model is validated and deployed on all the data to be analyzed.

Display a search result based on a company name

Did the algorithm encounter any biases? In this V1, the model did indeed over-flag the contracts of small communes as being at risk. These contracts more often than not show family ties or ties to political parties. This can be explained by the geographical proximity of the actors and the small size of the communities.

In the post-project analysis, Ojo Público indicates that they would need more proven cases of corruption to better train and validate their model. That said, the model as finalized already highlights some of the riskier indicators, such as having a single respondent to a public contract, the recurrence and/or exclusivity of a relationship between a public official and a manager, and the recurrence of a relationship between the procurement process and a contractor.

In addition to validating and expanding on leads already known or suspected by the journalists, the algorithm identified new ones. On the other hand, “the model can give you leads, but the story has to be built through investigation,” insists Gianfranco Rossi, a member of the Ojo Público team who worked on Funes.

Indeed, the human work of investigation, helped by the results given by the model, remains essential and unavoidable.

The project:
duration: 15 months
team: 4 people: 1 statistician, 2 journalists, 1 editor
amount of data analyzed: 245,000 contracts (52 GB of data)
programming language: R
type of algorithm: supervised

This investigative work by Jeremy B. Merrill, published on The Markup in 2020, looked at the targeting of job ads on Facebook. The dataset included 578,000 ads across all topics. It is impossible to go through them all one by one, but with the help of machine learning, it is possible to identify problematic ads in a large corpus.

In the United States, as in many other countries, it is illegal to publish a real estate or job offer that specifies age, gender, race… But Facebook does so anyway by restricting the delivery of certain ads to certain audiences based on these illegal criteria. The investigation aimed to determine whether, despite the warnings, Facebook was still delivering ads targeted on these criteria.

To do this, the journalist first retrieved a corpus of 578,000 ads published by Facebook via the platform’s advertising library. Then he searched by keywords, without any conclusive result (too many useless ads or without the desired keywords).

A database was created with about 200 ads labeled “job ad” or “not job ad” and targeted by gender. To this database was added a second set of about 1,000 ads that were not targeted by gender. These were labeled as such by hand.

The next step was to classify the ads and identify those that were job-related, using a deep learning algorithm based on PyTorch (a deep learning library often used for NLP, natural language processing).

In a first step, the model was trained to process English text and recognize common expressions like “on sale now” and less common terms. This was done with about 50,000 unlabeled ads. “Then we trained the classifier with 1,200 examples,” says Jeremy B. Merrill. The tool uses 80% of the data for training by default and 20% for the control test.

The goal was to identify job ads among the 578,000 ads collected. The false positive rate was quite high due to the fact that Facebook has few job ads compared to the overall amount of ads. The reporter then hand-checked the first 200 results of the algorithm.
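The link between rare job ads and a high false positive rate follows from simple arithmetic. The sketch below uses invented figures (1% of ads being job ads, a classifier that catches 90% of them and wrongly flags 5% of the others); none of these numbers come from the investigation.

```python
def precision(prevalence, recall, false_positive_rate):
    """Share of flagged items that are truly positive."""
    true_pos = prevalence * recall
    false_pos = (1 - prevalence) * false_positive_rate
    return true_pos / (true_pos + false_pos)

# Hypothetical figures, for illustration only.
p = precision(prevalence=0.01, recall=0.90, false_positive_rate=0.05)
print(f"{p:.1%}")  # 15.4%
```

With these assumed figures, only about one flagged ad in six would really be a job ad, which is why checking the results by hand remains necessary.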

“The classifier found four ads with mentions of race, gender or age. I hadn’t found them by keyword searching,” the reporter says. “I wrote about only one case because the other ads had factors that would make them more complicated in the context of U.S. law.”

The project:
duration: /
team: 1 data journalist + 1 editor
amount of data analyzed: 578,000 ads
programming language: Python
type of algorithm: supervised


It is important to repeat over and over that machine learning is a tool for journalists: it will not do the job for them (and it will not replace them!). Feeling that there is a subject to investigate, taking into account a context to interpret data, identifying biases, deciding what is worth investigating, and investigating, all this remains the job of journalists.

It is also essential to gather information beforehand, to meet with experts on the subject you are working on, in order to be able to determine which indicators/criteria/elements will be relevant to use when setting up the model you will then use.

We can nevertheless agree that a tool that can “scan” thousands or millions of documents, whether to find specific elements, classify them or group them by type, is a great time saver and makes a certain number of investigations possible.

Among the interesting examples that I have not detailed here, there is the Atlanta Journal-Constitution’s investigation of doctors accused of sexual abuse but still in practice (“Doctors & sex abuse project”). The investigation involved analysis of over 100,000 documents. As in the examples presented in this post, the quantity of documents makes the analysis work humanly impossible for an editorial office, whether small or large. So machine learning opens up the field of possibilities.

The machine learning method used by journalists in an investigation must be transparent: it should be possible to describe and explain it clearly (see 2018 post on algorithm ethics and journalism).

Explain what data is used, how it was retrieved, cleaned, processed, which algorithm was chosen, what it was trained with, if biases were discovered and how they were handled… This is a pillar of trust with readers that should be consolidated whenever possible given the general mistrust.
It is also important to be careful and to treat correctly the personal data that may be present in the data sets.

Finally, as Alex Siegman, former head of the machine learning technical program at Dow Jones, pointed out in 2019, one should not ask where one can use ML, but think about everyday problems and evaluate whether ML can be a solution or not.

To evaluate the feasibility of an ML project, you need to:

  • first ask yourself if there is not a simpler and faster way to work, like with a keyword alert or a statistical analysis (^^)
  • discuss with subject matter experts to understand the scope of the survey, the needs, the potential difficulties
  • then ask yourself if the necessary data are available or can be made available (already in your possession, in open data, on Kaggle…). Without data, there is no analysis
  • plan the time needed to clean and normalize the data, this step can be very time consuming but it is essential for the data to be usable
  • choose an algorithm (supervised, unsupervised, reinforcement learning) according to your objective, the quantity of data you have (KNN or SVM algorithms, for example, are not efficient with a large dataset), and its structure (spreadsheets are structured data that algorithms generally process easily, whereas sound or images require certain types of algorithms). In general, we test several of them to choose the one that gives the best results. Poking around on GitHub and other resources like Papers with Code to see different approaches can help to set up your algorithm.
  • structure your data according to the chosen algorithm and its technical specifications
  • test on a limited quantity and check the results. If they are good, deploy on the whole data set to be analyzed; if not, review the data and their formatting, modify or refine the settings, then re-test and re-evaluate the results until you find the right parameters to deploy.
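The last step of this checklist, testing on a limited sample and checking the results before deploying, can be sketched as follows. The function names, the 0.8 thresholds and the sample data are assumptions for the example; the right thresholds depend on the investigation.

```python
def evaluate(predictions, truth):
    """Return (precision, recall) for boolean predictions vs hand-checked labels."""
    pairs = list(zip(predictions, truth))
    tp = sum(p and t for p, t in pairs)      # correctly flagged
    fp = sum(p and not t for p, t in pairs)  # flagged by mistake
    fn = sum(t and not p for p, t in pairs)  # missed
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall

def ready_to_deploy(predictions, truth, min_precision=0.8, min_recall=0.8):
    """True if the model meets both (hypothetical) thresholds on the sample."""
    precision, recall = evaluate(predictions, truth)
    return precision >= min_precision and recall >= min_recall

# Hypothetical sample: model output vs labels checked by hand.
predictions = [True, True, True, False, False, True, False, False]
truth       = [True, True, False, False, True, True, False, False]

print(evaluate(predictions, truth))         # (0.75, 0.75)
print(ready_to_deploy(predictions, truth))  # False
```

Here the model falls short of the assumed thresholds, so the next move would be to review the data or the settings and test again rather than deploy.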

Other examples of ML used in investigations :

Journalist retraining as a data analyst, former head of digital @ESJPRO. Algorithms, transparency, audio, ❤ #Berlin, #Nantes, #freediving and #lindyhop