Replication Package

Development Effort Estimation in Free/Open Source Software from Activity in Version Control Systems

This replication package follows the criteria proposed in "On the reproducibility of empirical software engineering studies based on data retrieved from development repositories" (Open Access, Empirical Software Engineering, Volume 17, Numbers 1-2, pages 75-89, 2012). The attributes of this study are given in the following table:

Element	Assessment	Condensed Assessment
Data source	Usable	U
Retrieval Methodology	Usable, Likely available in future, Flexible	U + *
Raw dataset	Usable	U
Extraction methodology	Usable, Likely available in future, Flexible	U + *
Study parameters	Usable	U
Processed dataset	Usable	U
Analysis methodology	Usable, Likely available in future, Flexible	U + *
Results dataset	Usable	U

Identification:
- Git repositories (repos.txt)
- On-line survey to developers of the six projects under study (OpenStack, Ceph, Linux, MediaWiki, Moodle, WebKit).
Description:
- Git repositories
- Description of the survey (the survey questions are included verbatim in the paper).
Availability:
- Git repositories: Public.
- On-line survey: Partially public. To preserve privacy, an anonymous version is released.
Persistence: Yes.

Identification: CVSAnalY (a previous version of Perceval)
Description: Tool to parse versioning systems (in particular, log messages).
Availability: Public
Persistence: Yes.
Flexibility: Yes. Released under the GNU GPL v2.0.

Identification: Set of Python scripts that extract the relevant information for the study from the Data Source: agregate_activity.py (3,7 KB)
Description: Script to transform the MySQL dump (see Data Source) into CSV files (see Processed Dataset).
Availability: Public.
Persistence: Yes.
Flexibility: Yes. Released under the GNU GPL v2.0.

Identification: Date of data retrieval.
Description: Date when the OpenStack repositories were retrieved: 12 January 2014. Date when the other repositories were retrieved: Early March 2014.

Identification: Survey dates.
Description: The survey for OpenStack opened on January 30th 2014 and closed on February 2nd 2014. The survey for the other projects opened on March 24th 2014 and closed on April 15th 2014.

Identification: Cleaning of the survey.
Description: Deletions and modifications applied to the survey data, documented in survey_cleaning.txt (3 KB).

Identification:
- (Anonymized) Survey data (responses after cleaning): answers_survey.public.csv (40 KB)
- Number of commits per month (*): commits_nobots.zip (114 KB)
- Number of active days per month (*): activity_nobots.zip (93 KB)
- List of identifiers for the same author (merging): authors_ids_nobots.zip (48 KB)
Description: Relevant information for this study is gathered from the raw dataset and included in several CSV files. Bots have been filtered out (the bot ids are referred to in the paper). (*) Each row contains the monthly values for one author, starting in January 2010 (padding months at the beginning of 2010 and the end of 2014 are included). The author_id is not included in these files, but it can be obtained from authors_ids_nobots.zip, whose rows are in the same order.
Availability: Public.
Persistence: Yes.
Flexibility: Yes. They are all CSV files.
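
Because the activity files carry no author_id and rely on row order, the pairing with the author identifiers can be sketched as below. This is only an illustration under stated assumptions: that each row holds comma-separated integer values per month, that the first field of each authors row is the identifier to keep, and that the files have already been extracted from the archives. The function name and the sample data are hypothetical and are not taken from the package.

```python
import csv
import io


def pair_rows_with_authors(activity_file, authors_file):
    """Attach an author identifier to each activity row by row position."""
    activity_rows = list(csv.reader(activity_file))
    author_rows = list(csv.reader(authors_file))

    # Positional pairing is only safe when both files have the same length.
    if len(activity_rows) != len(author_rows):
        raise ValueError("row counts differ; positional pairing is unsafe")

    paired = {}
    for author, monthly_values in zip(author_rows, activity_rows):
        # Assumption: the first field of the authors row is the identifier.
        paired[author[0]] = [int(value) for value in monthly_values]
    return paired


# Made-up sample data standing in for the extracted CSV files.
sample_commits = io.StringIO("0,3,5\n2,0,1\n")
sample_authors = io.StringIO("101,207\n102\n")

result = pair_rows_with_authors(sample_commits, sample_authors)
```

The length check is a deliberate guard: if a row were dropped from one file, a purely positional join would silently assign activity to the wrong author.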

Identification: Scripts
Description: Set of Python scripts used for the analysis. Includes a README file with detailed information. Set of R and Python scripts to create the graphs included in the paper.
Availability: Public
- scripts.tar.gz (4,0 KB)
- graphs.tar.gz (2,1 KB)
Persistence: Yes.
Flexibility: Yes. All scripts have been released under the GPL v2.0 or later.

Identification: Set of files as returned by the scripts used for analysis.
Description: Structured text files (in general CSV files with the name of the columns given in the first line).
Availability: Public.
- results.tar.gz (3,3 KB)
Persistence: Yes.
Flexibility: Yes. Provided in easily parseable text files.
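
Since the result files generally give the column names in the first line, they can be read with a header-aware CSV reader. The following is a minimal sketch, assuming comma-separated files. The column names and values in the sample are invented for illustration and do not come from results.tar.gz.

```python
import csv
import io


def read_results(handle):
    """Read a CSV file whose first line holds the column names."""
    # DictReader uses the first row as keys for every following row.
    return list(csv.DictReader(handle))


# Hypothetical sample; the real column names are defined by the result files.
sample = io.StringIO("project,value\nexample,1.5\n")

rows = read_results(sample)
```

All values are returned as strings, so numeric columns need an explicit conversion before analysis.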

Comments and suggestions: Gregorio Robles < grex at gsyc.urjc.es >.
Last modified: Jul 28th 2022.