Data and local newsrooms (1/3): 8 ways to collect unpublished data

Maëlle Fouquenet
14 min read · Jul 1, 2021


[Translation of the original article, published in January 2018]

Using data, regional media can come up with story ideas and different angles, automate some publications, create new services or develop surveys based on exclusive information.

To do this, a newsroom has to find existing databases or create its own. This second option, creating permanent databases, is a strategic investment for regional media, both from an editorial point of view (credibility) and from an economic point of view (loyalty, leverage for subscriptions). And to create these databases, the first question is the following:

Feedback from different regional media

[In this post, we will only discuss collection, a dense topic. Data quality and editorial choices are discussed in other posts.]

In local newsrooms, the first subjects cited by most of the publishers contacted about creating editorial databases are recreation news, the 2020 municipal elections and local sports. Indeed, almost all local and regional media have agenda pages, and almost all of them cover local sports competitions, not to mention the election of the next mayors. But data-driven investigation can cover every sector, since more and more data is created all the time. Here are eight examples of data collection:

1/ Thanks to the work of journalists, local secretaries or correspondents

For its survey on school canteen rates in primary and nursery schools in December 2017, for which no data sets exist, Sud Ouest put its journalists and agency secretaries to work.

Although an online form for parents was discussed, the idea was discarded because of the complexity of the rates (with or without family quotient, number of different rates, etc.). “There was too much risk of data entry errors or incomplete data,” explains Frédéric Sallet, head of the data and computer graphics department. As a result, the data was collected by hand.

The journalists and secretaries called town hall staff, sent them emails and searched municipal websites for the rates when they were published. Then they filled in a Google Form so that the data could be centralized and structured in a single document.

From this collection work, which lasted about a month, they produced an exclusive report (no database exists on the subject), wrote fourteen articles and identified strong disparities and oddities (such as a commune located in a “rep+” zone that does not offer a social rate): in short, interesting and unpublished information.

Information from 168 communes (including all those with more than 10,000 inhabitants, and a random sample of 20 communes from each of the seven departments) was collected for this report.
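The selection described above (every large commune, plus a random draw per department) can be sketched in a few lines of Python. This is only an illustration, not Sud Ouest's actual procedure: the function name, the field names and the choice to draw the random sample among the smaller communes only are assumptions.

```python
import random

def select_communes(communes, per_department=20, threshold=10_000, seed=0):
    """Keep every commune above the population threshold, then draw a
    random sample of the remaining communes in each department.
    Field names ("department", "population") are hypothetical."""
    rng = random.Random(seed)  # fixed seed so the draw can be reproduced
    selected = [c for c in communes if c["population"] > threshold]
    by_department = {}
    for c in communes:
        if c["population"] <= threshold:
            by_department.setdefault(c["department"], []).append(c)
    for department in sorted(by_department):
        pool = by_department[department]
        # Never ask for more communes than the department contains.
        selected.extend(rng.sample(pool, min(per_department, len(pool))))
    return selected
```

Fixing the seed makes the sample reproducible, which helps when the same communes have to be contacted again the following year.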

At L’Avenir, in Belgium, data-driven articles are published regularly. Arnaud Wéry, data journalist at the weblab, works in collaboration with local agencies, notably on taxes on terraces or lawn mowing. For the collaboration to work well, it is essential to have, in each locality, a point person and/or department heads who are convinced of the value of the approach.

“Collection takes time but it is an investment that pays off,” says Arnaud Wéry.

This collection makes it possible to produce articles with different angles and formats (graphic, playful, etc.), but also to prepare interviews. The resulting database can then be updated every year (depending on the nature of the subject), for example to track changes over time. “We will have a gold mine that we can exploit,” underlines the journalist. An exclusive gold mine for the media outlet, since it does not exist anywhere else.

2/ With the participation of local associations and cultural actors

Recreation information, which includes weekend outings, cultural events, flea markets, etc., is an important topic locally. Several local and regional media would like to automate event announcements, for example. An exhaustive agenda that can easily be sorted by criteria does attract Internet users looking for outings, because it is a practical service. But in order to automate, and to imagine customizable services for Internet users, the database still has to be built.

For more than ten years, Ouest France has run a collection tool (initially intended to feed the print edition). This database is filled in directly by event organizers (institutions, associations and individuals). Why post your announcement when you organize a ball or a flea market in La Chapelle-sur-Erdre (or elsewhere in the west)? With a single entry, the announcement can appear on several sites of the regional press group’s five regional dailies and 40 weeklies, not to mention the local or thematic newsletters. The announcements are also available on the portals of the 80 local authorities that reuse the information in this database.

The main driver of participation is publication in the paper. “It’s the power of print,” says Fabrice Bazard, director of digital activities at Sipa Ouest-France.

The promise works and Ouest France attracts a critical mass (500,000 recreation ads and daily life announcements posted by 80,000 organizations) to offer a real agenda to the internet user in La Chapelle-sur-Erdre who wants to know what’s going on in his or her area next weekend.

“On recurring events, it works, but people don’t necessarily think of sending their information for one-off events,” adds Claude de Loupy, co-founder of Syllabs (in which Ouest France has taken a stake).

Although correspondents and journalists sometimes collect agenda and recreation data themselves, they cannot collect as much of it, or as quickly.

3/ By institutional actors

Another interesting collection method is that of Fréquence-Sud, a site specializing in cultural and tourist news, which collects beach weather information from municipalities or tourist offices (temperature and water quality, presence or absence of jellyfish).

“We made it really easy for them to do this with small NFC/QR Code cards so that beach attendants can do it with their smartphones from the beach,” explains Jean-Baptiste Fontana, director of the publication.

The information collected is used in “practical information” pages dedicated to each beach and in news pages that mention the beaches in question. It also gives journalists practical material for more substantial articles.

To convince municipalities to participate, Fréquence-Sud allows them to retrieve the information via a widget or an RSS feed. This way, communities avoid creating, managing and hosting a database while using the information on their own sites. The most difficult part of this approach? Convincing municipalities to participate in the collection. This is a worthwhile effort for the media, since it provides “exclusive information, with a certain added value for the reader and tools to facilitate the work of the editorial staff,” says Jean-Baptiste Fontana.

When an element changes in the database, for example a swimming flag that turns red or the presence of jellyfish, the editorial staff receives an automated alert by email. Journalists are then free to decide whether or not to write about it. [This is one of the functions mentioned in my article on the automatic generation of articles in regional media.]
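As a rough illustration of this kind of alert, the sketch below compares two snapshots of the beach database and lists what changed. It is not Fréquence-Sud's system: the function name, field names and beach name are hypothetical, and actually sending the email is left out.

```python
def detect_changes(previous, current, watched=("flag", "jellyfish")):
    """Compare two snapshots keyed by beach name and describe each
    watched field whose value differs (field names are hypothetical)."""
    alerts = []
    for beach, new in current.items():
        old = previous.get(beach, {})
        for field in watched:
            if field in new and old.get(field) != new[field]:
                alerts.append(
                    f"{beach}: {field} changed from {old.get(field)!r} to {new[field]!r}"
                )
    return alerts

previous = {"Plage A": {"flag": "green", "jellyfish": False}}
current = {"Plage A": {"flag": "red", "jellyfish": False}}
for line in detect_changes(previous, current):
    print(line)  # each line could become the body of an email alert
```

The journalist stays in the loop: the code only reports the change, and the decision to write about it remains editorial.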

4/ With sports clubs, on their websites and through social networks

Local sports is another major topic that some local and regional media would like to automate, but have not yet managed to. Each sports federation has its own tools, service providers and processes. Some offer open APIs, as is the case for basketball and rugby; others do not, or only under specific conditions or partnerships (soccer and handball, for example).

Another way to retrieve information is to scrape the websites of federations and districts. But the technique remains fragile: the code written to retrieve and process the data on one sport’s site cannot be reused on another, since each website has its own page structure. Moreover, if the site changes its design, the scraping program no longer brings anything back.

To complete these two methods, there is still the good old-fashioned technique: picking up the phone! Often carried out by students as a small Sunday-evening job, or by correspondents, this stage of the collection allows the media to obtain the results that the other methods missed.
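The fragility is easy to see in a minimal sketch. The parser below, written with Python's standard library, only knows one page layout (scores in `<td class="score">` cells); both HTML snippets and the class names are invented for the example.

```python
from html.parser import HTMLParser

class ScoreParser(HTMLParser):
    """Collect the text of every <td class="score"> cell."""

    def __init__(self):
        super().__init__()
        self.in_score = False
        self.scores = []

    def handle_starttag(self, tag, attrs):
        if tag == "td" and dict(attrs).get("class") == "score":
            self.in_score = True

    def handle_endtag(self, tag):
        if tag == "td":
            self.in_score = False

    def handle_data(self, data):
        if self.in_score and data.strip():
            self.scores.append(data.strip())

def extract_scores(html):
    parser = ScoreParser()
    parser.feed(html)
    parser.close()
    return parser.scores

# Hypothetical markup before and after a site redesign.
old_page = '<table><tr><td class="team">A - B</td><td class="score">2-1</td></tr></table>'
new_page = '<div class="match"><span class="result">2-1</span></div>'

print(extract_scores(old_page))  # ['2-1']
print(extract_scores(new_page))  # [] : same result on the page, nothing retrieved
```

In practice, an empty result is worth treating as a warning sign to check the site by hand rather than as "no matches played".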

In Sweden, the company Everysport (which provides data to United Robots and MittMedia) created its local team sports databases (adult category) in 2000 for the results pages in the print versions of local newspapers. The company collects information after each game, partly from sports federations and partly by phone from clubs and referees, explains Stefan Lundström, data manager at Everysport. They also scan Twitter accounts, federation and club websites.

The work mobilizes six to seven people per evening for 300 to 400 matches and relies on careful curation of reliable accounts that post results first. During a “typical” month (August), this represents about 5,000 phone calls, against an average of 2,500 for a quieter month. These calls are dedicated to the third and lower divisions. The upper divisions are more easily covered with automatic results from the federations, cooperation with other players and monitoring of social networks. The team can grow to fifteen people per evening during the busy evenings of late June, at the end of the season, in a slot from 7 to 11 p.m. The goal: to collect as many results as possible by 10 p.m., the average closing time for local newspapers.

“We started with a few newspapers and gradually increased our coverage,” explains Stefan Lundström. This also involves a lot of work updating the databases of contacts and accounts to follow before the start of each season (May for soccer, September for field hockey), checking whether so-and-so is still in place and will still be the right person to contact on the day.

5/ Via automated questions by SMS to sports coaches

Everysport’s most recent collection experiment is to send an SMS to team coaches after each game. The response, also by SMS, is integrated into the article automatically generated by United Robots to enrich it. Launched in December 2018, the operation currently involves about 400 coaches across all the sports covered.

Screenshot of the system presentation on the United Robots website

First, the local newspapers provide Everysport with a list of the teams that matter most to them (anywhere from 5 or 6 teams to 80). The company contacts the coaches to obtain their agreement (with the argument of visibility) and their mobile number. Then, at the end of each game, it sends an automatic SMS with one question.

The question is chosen algorithmically, according to the team’s latest statistics (three games won in a row, the leader defeated by one of the lowest-ranked teams, etc.), from a list established beforehand by journalists.

The journalists imagined the different possible scenarios and provided twenty or so typical questions from which the algorithm draws. Each sport has its own list of questions.
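A simplified sketch of such rule-based selection might look like this. The rules, the wording of the questions and the function name are hypothetical; the real list written by the journalists is not public in this article.

```python
def pick_question(results, beat_leader=False):
    """Return one question based on the team's recent results.
    `results` lists outcomes from oldest to newest: "W", "D" or "L".
    The questions below are invented examples."""
    last_three = results[-3:]
    if beat_leader:
        return "You just beat the league leader. What made the difference?"
    if len(last_three) == 3 and all(r == "W" for r in last_three):
        return "Three wins in a row: what explains this run of form?"
    if len(last_three) == 3 and all(r == "L" for r in last_three):
        return "That is a third straight defeat. What needs to change?"
    # Fallback when no scenario written by the journalists applies.
    return "What is your assessment of today's game?"

print(pick_question(["L", "W", "W", "W"]))
```

The rules are checked from most to least specific, so a notable event (beating the leader) takes priority over a generic streak.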

The answers are then moderated before being integrated into the articles generated and then updated automatically. “We were afraid that moderation would take a lot of time, but in the end it didn’t,” says Stefan Lundström. This step allows us to correct typing errors and moderate the few ungentlemanly terms when there are some, “but there are not that many”. What takes the most time? Convincing the coaches to participate.

6/ Through automated visual recognition

At Stat Perform (formerly Opta Sport), a paid service specialized in sports that works with L’Equipe, cameras in American stadiums film and record matches. Visual recognition algorithms decipher the images and record the actions in a structured way to produce articles, graphics and other automated content.

In France, to generate the hyper-detailed databases of professional soccer (which represents 90% of Stat Perform’s customer requests in Europe), three (human) analysts scan the video broadcast of each soccer game (one dedicated to each team, plus a supervisor) and log all the actions.

Would it be possible at the regional level? James Chalk, Stat Perform’s French manager, does not say no, but we understand that the production costs are probably too high compared with the potential revenue from the articles it would make possible. Technically, everything is possible, but what is the economic equation?

For Stefan Lundström of Everysport, the economic viability of his company is based on a strong Swedish tradition of sports results pages in the local press. Every small town has its own newspaper. In addition, there is “a strong tradition of local communities involved in sports activities. Many people are involved in local clubs,” he says. Finally, collecting is getting easier every year: identifying the right accounts to follow saves time, an option that did not exist before social networks.

7/ What about user participation?

Regarding the collection of local sports data, Guillaume Desombre, CEO of LabSense (a provider of automated text generation and a competitor of Syllabs), favors “a mix of information from the sports federation and UGC (content provided by Internet users), with validation.”

It is then necessary to determine who validates and how. Is it validation by a mass of Internet users giving the same score? By previously selected and “verified” Internet users (who would then act as sports correspondents)? Or a clever mix of both? These questions apply to all fields, not only to sports.
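One possible way to combine the two approaches is sketched below: a score reported by a verified contributor is accepted directly, otherwise a clear majority among enough reports is required. The thresholds, names and data layout are assumptions for illustration, not a method used by any of the media cited here.

```python
from collections import Counter

def validate_score(reports, trusted=frozenset(), min_reports=3, min_share=0.6):
    """Return the accepted score, or None if a human should review it.
    `reports` is a list of (user, score) pairs; thresholds are arbitrary."""
    trusted_scores = {score for user, score in reports if user in trusted}
    if len(trusted_scores) == 1:
        return trusted_scores.pop()  # verified contributors agree
    if len(trusted_scores) > 1:
        return None  # verified contributors disagree: escalate
    if len(reports) < min_reports:
        return None  # not enough independent reports yet
    score, count = Counter(s for _, s in reports).most_common(1)[0]
    return score if count / len(reports) >= min_share else None

print(validate_score([("ann", "2-1"), ("bob", "2-1"), ("cyd", "1-1")]))
```

Anything that returns `None` would go to a moderator, which keeps a human in the validation chain for the doubtful cases.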

As for the beach weather, Fréquence-Sud plans to ask Internet users to participate starting in the summer of 2020. This would add opinions on the quality and cleanliness of the beach and the water, and ratings from “real people”, to the water temperature, flag color and air temperature. Control would rely on an evaluation of the user’s “reliability”, combined with the other ratings and comments for the same place.

An example of crowdsourcing that is a bit dated but very telling, and regularly shown in data journalism training, is the collection of water prices in France conducted by Owni in 2011 for the Fondation France Libertés and 60 Millions de Consommateurs. In four months, about 5,000 individuals scanned and sent in their water bills. The information was validated by dozens of employees of the NGO France Libertés via a scoring mechanism (as described by Nicolas Kayser Bril).

In recent years, a German team has conducted several local surveys with this method: “Who owns Hamburg?” (a six-month survey to find out who owns housing in Hamburg, in partnership with the Hamburger Abendblatt and about 1,000 participating tenants, which generated several articles), and the unreplaced absences of teachers in Dortmund (a one-month participatory survey, with 520 participants and results showing twice as many unreplaced absences as the ministry’s figures). The team also opened its Crowdnews platform as a paid service to other publishers.

“Who owns Hamburg?”

To cover the March 2020 municipal elections, Sud Ouest and Centre France launched online questionnaires for Internet users. This is also a way of collecting information to feed the journalists, to help them gauge the questions people really ask and the subjects that concern them, and to check that they are in tune with voters and readers.

This questionnaire, entitled “If I were a mayor”, proposes closed questions (easier to process afterwards). It was initiated by the Data+Local collective, which brings together many regional data journalists.

Montage of screenshots of the questionnaire Si j’étais maire, on the Berry Républicain website.

8/ Through connected objects

In 2017, I attended an unusual conference talk presenting a survey dedicated to milk production. The angle? Giving a voice, via sensors, to three dairy cows, each living in a different environment (an organic farm, a classic family farm and a large farm). Code name: Superkühe (super cows).

The herd management system (already present in the farms) was the first collection means used. In addition, sensors (in the cowsheds but also ingested by the cows) operating 24/7 for 30 days indicated the nature and quantity of feed ingested, the duration of meals, the quantity and quality of milk produced, the state of the udders, the humidity level and temperature in the cowsheds, but also the body temperature of the cows, their movements, the start of calving…

Editorially, the data was used on a dedicated website by the German channel WDR (WestDeutscher Rundfunk) and presented in several forms:
- a daily logbook for each of the three cows containing the events of the day, with videos, expert interviews to contextualize
- captioned graphs to explain what the numbers mean
- a live chat via Facebook messenger with each cow (not tested because inactive now)
- more classic articles on each type of farm written in the first person on behalf of each cow
- videos

The idea behind this project is to raise awareness of the different methods of milk production in Germany, but also to question these methods with regard to the cows’ well-being. All this in a playful and interactive way.

Recently, several French media have tested sensors to measure air pollution. Last March, Le Parisien published an article produced with data from a sensor manufactured by a French start-up (Plume Labs).

The test seems to have worked rather well in Paris, even if the journalist specifies that a one or two hour delay sometimes occurs between the reading and the result.

In October, Ouest France also tested a sensor provided by the Maison de la consommation et de l’environnement in Rennes. The journalist rightly pointed out that this is not a scientific operation, given the number of uncertainties linked to the readings (a point discussed in detail by Laurence Dierickx in a forthcoming article on the importance of journalistic choices in the data used for articles).

In the Stuttgarter Zeitung, Germany, a survey on fine particles in the city and neighboring communities was conducted in association with a university laboratory dedicated to sustainable mobility and the Open Knowledge Lab Stuttgart. The regional newspaper succeeded in convincing 500 participants to measure the levels of particulate matter in their neighborhoods with sensors made by the lab.

Why not rely on existing sensors? “Because they only reflect a part of reality,” we read on the newspaper’s site. This remark underlines an important point: how relevant is the data I collect? It is essential that journalists question the methodology and criteria they choose when capturing and processing data.

The ideas for articles based on sensors or connected objects are limited only by the imagination of journalists and by technical feasibility. Whether the data is collected by an object, gathered with the participation of the public, or assembled through the meticulous work of journalists, it has one essential characteristic: it constitutes a unique database that no other media outlet has. It allows its owner to offer exclusive, differentiating information, which can generate subscriptions or build loyalty.

Moreover, the surveys built on this data spark discussion and exchange on important subjects (such as pollution and health) in the communities covered. The media outlet thereby positions itself as an actor in local life.

In addition, the explanations that accompany data-based publications make the working methods of journalists more transparent to the public, and have a positive effect on the degree of public trust in the media. This trust is further strengthened when public participation is solicited, as the public is then involved in the production process.

Finally, and this aspect should not be neglected, these experiences also help journalists get to grips with technology in general. They become better equipped to question a society in which data is growing exponentially and algorithms determine more and more elements of our daily lives.



Maëlle Fouquenet

Journalist retraining as a data analyst, former head of digital at @ESJPRO. Algorithms, transparency, audio, ❤ #Berlin, #Nantes, #freediving and #lindyhop