[This is a translation of the original article, published in March 2020]
Collecting the data was the first step in the process of content automation (detailed in this post). The second step concerned the journalistic choices behind the criteria used to create or work with databases. The third step, detailed here, is ensuring the quality of the data collected and used.
“Garbage in, garbage out”
Bad information in the data, an indicator that changes along the way, and bang: the automatically produced articles announce the wrong winner or a false match score. In short, a wrong article. To avoid this, the database needs to be checked and corrected, which is often long and tedious work.
This is the first difficulty that journalist and developer Laurence Dierickx noted in her talk at the Computation + Journalism Symposium in February 2019 in Miami (USA). For her Bxl’Air bot project on the air quality index in Brussels, she retrieved data from the Belgian Interregional Environment Agency (CELINE) web pages rather than from an already structured database. This brought difficulties (anomalies in the measurements, an index modified along the way…) and a significant amount of monthly human work to verify and re-check the data.
Ads moderated upstream
To limit this kind of problem as much as possible and to add strong value to its leisure and entertainment listings service, Ouest France has chosen to check the data on the Infolocale.fr portal upstream. The listings are structured and available on an open data portal, where they are geolocated, corrected (spelling and typography) and moderated.
To ensure the quality of this work, “60 local journalists and editorial secretaries fulfill this mission,” says David Moizan, head of the Infolocale.fr platform. Indeed, it is imperative to know the area to remain relevant.
Xavier Antoyé, editor-in-chief of Le Progrès, also emphasizes this point. The Ebra group likewise has its own leisure-information platform open to contributions and is considering a text-automation project. “Our base is open to the associations that we have certified.” The goal is also to keep out the local restaurant that advertises its karaoke night every Friday for free (instead of buying advertising), or listings for psychics, magnetizers, and the like.
Everysport also moderates the comments that the coaches send by SMS after the games.
Cleaning the databases, a necessary step
Databases are rarely filled in correctly and completely; cleaning work always has to be planned before using them: duplicates, empty cells, badly filled fields, not to mention names written in different ways or more or less exotic file formats.
One of the recurring difficulties in processing databases concerns location. It can be indicated by the postal code or by the INSEE code (which is not the same thing, otherwise it would be no fun). The databases then have to be harmonized before they can be compared. On top of that come bad locations and denominations that vary depending on who filled in the database (e.g. “avenue”, “AV”, “AV.”, “Av.”, “av.”, etc.).
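A minimal Python sketch of this kind of harmonization (the street-type variants come from the example above; the lookup table and the sample address are illustrative, not from any newsroom's actual pipeline):

```python
# Illustrative lookup table: each variant maps to one canonical form.
STREET_TYPES = {
    "av": "avenue",
    "av.": "avenue",
    "avenue": "avenue",
    "bd": "boulevard",
    "bd.": "boulevard",
    "boulevard": "boulevard",
}

def normalize_address(raw: str) -> str:
    """Lowercase, trim, and expand an abbreviated street type."""
    tokens = raw.strip().lower().split()
    if tokens and tokens[0] in STREET_TYPES:
        tokens[0] = STREET_TYPES[tokens[0]]
    return " ".join(tokens)

# All the variants cited in the article collapse to one form.
variants = ["avenue", "AV", "AV.", "Av.", "av."]
forms = {normalize_address(v + " de la République") for v in variants}
assert forms == {"avenue de la république"}
```

In practice tools like OpenRefine do this kind of clustering interactively, but the principle is the same: map every observed variant to a single canonical value before comparing databases.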
To carry out this cleaning, journalists use several tools: R, but also OpenRefine or QGIS, an open source geographic information system. Depending on the amount of data and the complexity of the cleaning, this step can take from a few minutes to several days.
Archive the original data before it disappears…
To ensure a qualitative and durable approach, scraping data from a web page is rarely the best option. The page can be deleted or moved, the source file removed or modified… Hence the importance of keeping the collected elements in clean databases of one's own.
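As a sketch, this habit can be as simple as storing each raw page next to its retrieval date in a small local database (the URL and table schema below are illustrative, not the actual sources mentioned in this article):

```python
import sqlite3
from datetime import datetime, timezone

def archive_page(db_path: str, url: str, html: str) -> None:
    """Store the raw source alongside its retrieval date, so the
    original data survives even if the page later disappears."""
    con = sqlite3.connect(db_path)
    con.execute(
        "CREATE TABLE IF NOT EXISTS snapshots "
        "(url TEXT, fetched_at TEXT, html TEXT)"
    )
    con.execute(
        "INSERT INTO snapshots VALUES (?, ?, ?)",
        (url, datetime.now(timezone.utc).isoformat(), html),
    )
    con.commit()
    con.close()

# Example with in-memory content; in practice the HTML would come
# from an HTTP request to the (hypothetical) source page.
archive_page("snapshots.db", "https://example.org/air-quality", "<html>…</html>")
```

The point is not the storage engine but the discipline: archive the original upstream data at collection time, in a structure you control.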
“On the web, things happen, things go away, ministries remove their archives…”, Karen Bastien (WeDoData).
The agency WeDoData produced “A data on politics” for the news outlet Les Jours. The series (eight episodes) is based on public data on the activity of government members, which is then put into story form.
To achieve this, Karen Bastien explains in the French podcast A Parte, the journalists and developers of WeDoData have been harvesting numerous public elements since the first day of Emmanuel Macron’s mandate (tweets from members of the government, deputies, senators, the President and the Prime Minister, agendas, parliamentary activity that is already in open data…).
…and to save time during their (re)use
All this information is centralized, stored and organized in an internal database. The benefit? Considerable time saved in (re)finding information, great reactivity, and a database directly connected to the in-house visualization tools. Moreover, having one's own database sparks ideas for subjects and angles (as already mentioned in this article).
The same concern for reactivity can be found at Sud Ouest, on a completely different subject: road accidents. While the traffic-accident database comes from the open platform data.gouv.fr, it is only updated on the portal once a year and comes in several parts. Each update requires a lot of cleaning and aggregation work before use, not to mention mastering its very technical nomenclature.
To save time, Sud Ouest’s data department retrieves the new database every year, cleans it and aggregates it with the previous ones. “This allows us to quickly produce an accident history for a specific area, a crossroads, an intersection, for example,” says Frédéric Sallet, journalist in charge of the data and graphics department. “If we had to do that for every request our journalists made from the data.gouv.fr files, it would be too time-consuming.”
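The yearly aggregation step can be sketched in Python with pandas (the file contents, column names and commune code below are invented for illustration; the real data.gouv.fr accident files have a far richer nomenclature):

```python
import pandas as pd

# Hypothetical yearly extracts, standing in for the annual
# data.gouv.fr accident files after cleaning.
df_2018 = pd.DataFrame({"year": [2018, 2018], "commune": ["33063", "33063"]})
df_2019 = pd.DataFrame({"year": [2019], "commune": ["33063"]})

# Aggregate the new yearly file with the previous ones, then
# answer a local question: accidents per year in one commune.
history = pd.concat([df_2018, df_2019], ignore_index=True)
per_year = history[history["commune"] == "33063"].groupby("year").size()
```

Done once per year on the whole file, this aggregation means each journalist's request (a specific area, a crossroads, an intersection) becomes a simple filter on the accumulated history rather than a fresh cleaning job.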
This cleaning and aggregation work, renewed each year, brings the editorial staff added value in daily use.
Durability and reuse of data over time
Restructuring a database and keeping it up to date are two essential steps for reusing it over time. This can be useful for a recurring event that can be compared from one year to the next, or for analyzing changes, for example. A database can also be created for an initial use and later reused for another subject in the same theme.
At La Montagne, the group has started to think about this issue and created a database of all the candidates for the 2019 European elections, in the form of a CV (surname, first name, political party, social network accounts, department, town, etc.). This resource, built in 48 hours by the group's 300 journalists who filled out a form, was used to automatically generate a page per candidate. Since then, the database has been used to work on the 2020 municipal elections, even if not for automated article generation.
All this documentation work organized in the form of databases can be re-used at each local election, enriched, completed and updated. It can be used to create new editorial processes (newsgames, graphics, videos, etc.).
The question then becomes: who will manage these internal databases in the media? Is it the documentation department's role, with adequate training, to preserve these databases and keep them viable?
Updates, a real added value
L’Avenir’s map of the municipalities where lawn mowing is allowed on Sundays, published in 2018 and accompanied by three articles, is based on data retrieved by two journalists. This typical service-information article was a hit when it was first published online.
It only took two or three hours for weblab journalist Arnaud Wéry to update the database and derive article angles to suggest to local agencies in 2019. This illustrates the ratio between the initial investment (time spent creating the database) and the gain from updating it.
As for the dataset on primary school canteen fees created by Sud Ouest (see this article on the creation of databases by regional media), even though it has not been updated for two years, it would be a very useful basis for a follow-up next school year, with the new municipal teams in place. It lays the groundwork for medium- and long-term monitoring, making it possible, for example, to see whether the change in municipal teams has an impact on rates and, if so, in what ways.
Database updates can be done manually, as in the examples above, or automated via APIs. This is what allows Le Télégramme, for example, to generate automated articles on fuel prices, with automatic updates whenever a price changes (details here).
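A hedged sketch of what such an API-driven update might look like (the endpoint and JSON field are hypothetical, not Le Télégramme's actual pipeline; only the change-detection logic that triggers regeneration is shown):

```python
import json
import urllib.request

def fetch_price(url: str) -> float:
    """Fetch the latest price from a (hypothetical) JSON API
    exposing a payload such as {"price": 1.85}."""
    with urllib.request.urlopen(url) as resp:
        return float(json.load(resp)["price"])

def should_regenerate(previous, current: float) -> bool:
    """An article is only regenerated when the price actually
    changes (or on the very first observation)."""
    return previous is None or current != previous

# Example: compare the stored value with a freshly fetched one;
# a real pipeline would run this on a schedule.
stored_price = 1.85
new_price = 1.92  # would come from fetch_price(api_url) in practice
if should_regenerate(stored_price, new_price):
    pass  # trigger the article generation step here
```

The design choice is that polling and comparison stay cheap: the expensive step (regenerating and publishing the article) only runs when the source value moves.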
However, some avoid updates altogether. Victor Alexandre, data journalist at Le Parisien, prefers to re-download a whole database (when it is of “reasonable” size) rather than making small updates. “I find that it makes the work more complex,” he says. He also points out how hard it is to maintain a database when several people have access to it, with the risk of multiple errors, gaps or inconsistencies.
How updating is handled ultimately depends on the editorial office's human resources, the number of potential contributors to the databases, and how the newsroom is organized. It is also a question of management priorities.
Take care of the documentation
Finally, one element is very important for the durability of a database and its reuse: its documentation. Who created it, a telephone contact linked to it, its creation date, the details of the nomenclature, the modifications made and by whom, how the database is financed.
“If you don’t have this additional information, you can get the interpretation completely wrong,” warns Julien Vinzent, journalist at MarsActu. “For example, the benefits in kind taken into account in the database of subsidies to cultural associations used to not be valued in cash. Now they are. As a result, if you don’t know that, you get the impression that the subsidy has increased, whereas it is an accounting change.”
For the moment, reflection on long-term work and the durability of databases is not very advanced in regional and local newsrooms. And for good reason: the teams are generally small (sometimes a single person), with their noses buried in daily production, and without always having a clear strategic vision from management. A lack of specific technical knowledge can also limit the perspective of a future use or reuse.
However, this investment pays off when it is made in the newsroom, even with limited resources. Time savings, reactivity, and a diversification of subjects and angles are all good reasons to keep plowing this furrow.
The profiles of data journalists who are experts in databases are not commonplace, but they could become highly sought after as the media seek to make greater use of these tools. There is also an opportunity to transform or adapt certain professions, such as documentalists, who could have a role to play, provided they are trained and supported, and provided management makes this issue a priority in the development of their outlet.