Replication Package
Gregorio Robles (1), Andrea Capiluppi (2), Jesús M. González-Barahona (1), Jonas Gamalielsson (3), Björn Lundell (3)
(1) Universidad Rey Juan Carlos (Madrid, Spain); (2) University of Groningen (Groningen, The Netherlands); (3) University of Skövde (Skövde, Sweden)
This replication package follows the criteria proposed in "On the reproducibility of empirical software engineering studies based on data retrieved from development repositories" (Open Access - Empirical Software Engineering, Volume 17, Numbers 1-2, 75-89, 2012). The attributes of this study are given in the following table:
Details
Data Source
-
Identification:
- Git repositories (repos.txt)
- On-line survey sent to developers from the six projects under study (OpenStack, Ceph, Linux, MediaWiki, Moodle, WebKit).
-
Description:
- Git repositories
- Description of the survey (the survey questions are included verbatim in the paper).
-
Availability:
- Git repositories: Public.
- On-line survey: Partially public. To preserve privacy, an anonymous version is released.
-
Persistence: Yes.
-
Identification: CVSAnalY (a predecessor of Perceval)
-
Description: Tool to parse versioning systems (in particular, log messages).
-
Availability: Public
-
Persistence: Yes.
-
Flexibility: Yes. Released under the GNU GPL v2.0.
-
Identification: Survey Creator
-
Description: Tool to create an on-line survey.
-
Availability: Public
-
Persistence: Yes.
-
Flexibility: Yes. Released under the GNU GPL v2.0.
Raw Dataset
Extraction Methodology
-
Identification: Set of Python scripts that extract the relevant information for the study from the Data Source: agregate_activity.py (3.7 KB)
-
Description: Script to transform the MySQL dump (see Data Source) into CSV files (see Processed Dataset).
-
Availability: Public.
-
Persistence: Yes.
-
Flexibility: Yes. Released under the GNU GPLv2.
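The kind of transformation described above (per-commit records turned into per-author monthly rows) can be sketched as follows. This is only an illustration and not the code of agregate_activity.py: the input shape (a list of `(author_id, commit_date)` pairs), the function names, and the choice of counting commits as the activity measure are all assumptions made for the example.

```python
import csv
import io
from collections import Counter
from datetime import date


def month_range(start, end):
    """Yield (year, month) pairs from start to end, both inclusive."""
    year, month = start
    while (year, month) <= end:
        yield (year, month)
        month += 1
        if month > 12:
            year, month = year + 1, 1


def aggregate_activity(commits, start, end):
    """Return {author_id: [commit count per month]} for the given period.

    `commits` is a list of (author_id, commit_date) pairs. This input
    shape is hypothetical; the real script reads a MySQL dump.
    """
    months = list(month_range(start, end))
    counts = Counter((a, d.year, d.month) for a, d in commits)
    authors = sorted({a for a, _ in commits})
    # Counter returns 0 for months in which an author has no commits.
    return {a: [counts[(a, y, m)] for y, m in months] for a in authors}


def write_csv(rows):
    """Serialise the rows as CSV text, one line per author, sorted by id."""
    buf = io.StringIO()
    writer = csv.writer(buf, lineterminator="\n")
    for author in sorted(rows):
        writer.writerow(rows[author])
    return buf.getvalue()


# Made-up sample data, for illustration only.
commits = [(1, date(2010, 1, 5)), (1, date(2010, 1, 20)), (2, date(2010, 3, 1))]
rows = aggregate_activity(commits, (2010, 1), (2010, 3))
csv_text = write_csv(rows)
```

Months without commits appear as explicit zeros, so every row has the same number of columns regardless of when an author was active.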
Study Parameters
-
Identification: Date of data retrieval.
-
Description: Date when the OpenStack repositories were retrieved: 12 January 2014. Date when the other repositories were retrieved: early March 2014.
-
Identification: Survey dates.
-
Description: The survey for OpenStack opened on January 30th 2014 and closed on February 2nd 2014. The survey for the other projects opened on March 24th 2014 and closed on April 15th 2014.
-
Identification: Cleaning of the survey.
-
Description: Deletions and modifications made to the survey data, documented in survey_cleaning.txt (3 KB).
Processed Dataset
-
Identification:
-
Description: Relevant information for this study is gathered from the raw dataset and included in several CSV files. Bots have been filtered out (the bot ids are given in the paper). (*) Each row contains the monthly activity of one author, starting in January 2010 (padding months at the beginning of 2010 and the end of 2014 are included). The author_id is not included in the rows, but can be obtained from authors_ids_nobots.zip, which lists the authors in the same order.
-
Availability: Public.
-
Persistence: Yes.
-
Flexibility: Yes. They are all CSV files.
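Because the processed rows carry no author_id and rely on sharing their order with the id list, a small helper that pairs the two may be useful. The sketch below assumes that each CSV row is a headerless, comma-separated list of integers with one column per month, and that the first column corresponds to January 2010; the actual file layout may differ, and the ids and values shown are invented for the example.

```python
import csv
import io


def month_labels(n_months, start_year=2010, start_month=1):
    """Return n_months labels such as '2010-01', '2010-02', ..."""
    labels = []
    year, month = start_year, start_month
    for _ in range(n_months):
        labels.append(f"{year:04d}-{month:02d}")
        month += 1
        if month > 12:
            year, month = year + 1, 1
    return labels


def attach_author_ids(activity_csv, author_ids):
    """Pair each activity row with the author id at the same position.

    Assumes (hypothetically) headerless rows of integers, one per author,
    in the same order as `author_ids`.
    """
    reader = csv.reader(io.StringIO(activity_csv))
    rows = [[int(v) for v in row] for row in reader if row]
    if len(rows) != len(author_ids):
        raise ValueError("row count and author id count differ")
    result = {}
    for author_id, row in zip(author_ids, rows):
        result[author_id] = dict(zip(month_labels(len(row)), row))
    return result


# Made-up sample data, for illustration only.
sample = "3,0,1\n0,2,0\n"
activity = attach_author_ids(sample, ["a17", "a42"])
```

The explicit length check guards against silently misaligning rows and ids, which positional matching would otherwise allow.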
Analysis Methodology
-
Identification: Scripts
-
Description: Set of Python scripts used for the analysis. Includes a README file with detailed information. Set of R and Python scripts to create the graphs included in the paper.
-
Availability: Public
-
Persistence: Yes.
-
Flexibility: Yes. All scripts have been released under the GPL v2.0 or later.
Results Dataset
-
Identification: Set of files as returned by the scripts used for analysis.
-
Description: Structured text files (in general CSV files with the name of the columns given in the first line).
-
Availability: Public.
-
Persistence: Yes.
-
Flexibility: Yes. Provided in easily parseable text files.
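Since the result files generally give the column names in their first line, they can be read with Python's standard csv module. The following is a minimal sketch; the column names and values in the sample are invented, and real files should be checked for their actual delimiter and columns.

```python
import csv
import io


def load_results(text):
    """Return (column names, list of row dicts) from CSV text with a header line."""
    reader = csv.DictReader(io.StringIO(text))
    # fieldnames is taken from the first line of the file.
    return reader.fieldnames, list(reader)


# Made-up sample content, for illustration only.
sample = "project,value\nOpenStack,10\nCeph,7\n"
columns, records = load_results(sample)
```

All values are returned as strings, so numeric columns need an explicit conversion (for example `int(record["value"])`) before analysis.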
Comments and suggestions: Gregorio Robles < grex at gsyc.urjc.es >.
Last modified: Jul 28th 2022.