Replication Package
Estimating Development Effort in Free/Open Source Software Projects by Mining Software Repositories: A Case Study of OpenStack
Gregorio Robles (1), Jesús M. González-Barahona (1), Carlos Cervigón (1), Andrea Capiluppi (2) and Daniel Izquierdo (3)
(1) Universidad Rey Juan Carlos (Madrid, Spain); (2) Brunel University (London, UK); (3) Bitergia S.L. (Madrid, Spain)
Based on the criteria proposed in "On the reproducibility of empirical software engineering studies based on data retrieved from development repositories" (open access; Empirical Software Engineering, Volume 17, Numbers 1-2, pp. 75-89), the attributes of this study are given in the following table:
Details
Data Source
-
Identification:
-
Description:
- Git repositories
- Description of the survey (the survey questions are included verbatim in the paper).
-
Availability:
- Git repositories: Public.
- On-line survey: Partially public. To preserve privacy, only an anonymized version is released.
-
Persistence: Yes.
-
Identification: CVSAnalY
-
Description: Tool to parse versioning systems.
-
Availability: Public.
-
Persistence: Yes.
-
Flexibility: Yes. Released under the GNU GPL v2.0.
-
Identification: Survey Creator
-
Description: Tool to create a similar survey.
-
Availability: Public.
-
Persistence: Yes.
-
Flexibility: Yes. Released under the GNU GPL v2.0.
Raw Dataset
Extraction Methodology
-
Identification: Set of Python scripts that extract the relevant
information for the study from the Data Source: agregate_activity.py (3.7 KB)
-
Description: Script to transform the MySQL dump (see Data Source) into
CSV files (see Processed Dataset).
-
Availability: Public.
-
Persistence: Yes.
-
Flexibility: Yes. Released under the GNU GPL v2.0.
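To illustrate the kind of transformation this step performs, here is a minimal sketch of aggregating commit records into one row of monthly counts per author. It is not the released agregate_activity.py script. It assumes the commit records have already been read from the database as (author_id, date) pairs; the function names, the input shape and the bot_ids parameter are hypothetical.

```python
from collections import defaultdict
from datetime import date


def monthly_activity(commits, start, end, bot_ids=frozenset()):
    """Count commits per author and month between start and end (inclusive).

    commits: iterable of (author_id, datetime.date) pairs (assumed shape).
    start, end: (year, month) tuples delimiting the observed period.
    Returns {author_id: [count for first month, count for second month, ...]}.
    """
    n_months = (end[0] - start[0]) * 12 + (end[1] - start[1]) + 1
    counts = defaultdict(lambda: [0] * n_months)
    for author_id, day in commits:
        if author_id in bot_ids:
            continue  # bots are left out of the result
        idx = (day.year - start[0]) * 12 + (day.month - start[1])
        if 0 <= idx < n_months:  # ignore commits outside the period
            counts[author_id][idx] += 1
    return dict(counts)


def to_csv_rows(counts):
    """Return (ids, rows): author ids and their CSV rows, in the same order."""
    ids = sorted(counts)
    rows = [",".join(str(c) for c in counts[a]) for a in ids]
    return ids, rows
```

Keeping the ids and the rows as two parallel lists mirrors the layout described for the Processed Dataset, where the author ids live in a separate file in the same order as the activity rows.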
Study Parameters
-
Identification: Date of data retrieval.
-
Description: Date when the repositories were retrieved: 12 January 2014.
-
Identification: Survey dates.
-
Description: The survey opened on January 30th 2014 and closed on February 2nd 2014.
-
Identification: Cleaning of the survey.
-
Description: Deletions and modifications made to the survey data, documented in survey_cleaning.txt (1 KB).
Processed Dataset
-
Identification:
-
Description: Relevant information for this study is gathered from the raw
dataset and included in several CSV files. Bots have been filtered out (the bot ids are listed in the paper). (*) Each row contains the monthly values for one author, starting in January 2010 (padding months at the beginning of 2010 and at the end of 2014 are included). The author_id is not included, but it can be obtained from output_authors_ids_nobots.csv, whose rows are in the same order.
-
Availability: Public.
-
Persistence: Yes.
-
Flexibility: Yes. They are all CSV files.
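Because the author ids are stored separately and matched to the activity rows only by position, a reader may want to join the two files explicitly. The following is a minimal sketch of that join. It assumes that output_authors_ids_nobots.csv holds one id per row in its first column, and that neither file has a header line; the function name and the activity file name in the usage note are hypothetical.

```python
import csv


def rows_by_author(activity_path, ids_path):
    """Map each author id to its row of monthly values, relying on row order."""
    with open(ids_path, newline="") as f:
        ids = [row[0] for row in csv.reader(f) if row]
    with open(activity_path, newline="") as f:
        rows = [row for row in csv.reader(f) if row]
    if len(ids) != len(rows):
        # A length mismatch means the positional join cannot be trusted.
        raise ValueError("id file and activity file have different lengths")
    return dict(zip(ids, rows))
```

A call would look like rows_by_author("activity.csv", "output_authors_ids_nobots.csv"), where activity.csv stands for whichever of the processed CSV files is being read. The length check is there because a positional join fails silently if either file has been edited.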
Analysis Methodology
-
Identification: Scripts
-
Description: Set of Python scripts used for the analysis, including a README file with detailed information, plus a set of R and Python scripts to create the graphs included in the paper.
-
Availability: Public.
-
Persistence: Yes.
-
Flexibility: Yes. All scripts have been released under the GPL v2.0 or later.
Results Dataset
-
Identification: Set of files as returned by the scripts used for
analysis.
-
Description: Structured text files (in general CSV files with the name of the columns given in the first line).
-
Availability: Public.
-
Persistence: Yes.
-
Flexibility: Yes. Provided as easily parseable text files.
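Since the results files generally carry their column names in the first line, they can be loaded with Python's standard csv module. The following is a minimal sketch, assuming a comma-separated file; the function name is hypothetical, and files that deviate from the general CSV layout would need their own handling.

```python
import csv


def read_results(path):
    """Read a results file whose first line holds the column names."""
    with open(path, newline="") as f:
        reader = csv.DictReader(f)
        # fieldnames is taken from the first line; each later line becomes a dict.
        return reader.fieldnames, list(reader)
```

All values are returned as strings, so numeric columns need an explicit conversion before analysis.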
Comments and suggestions: Gregorio Robles < grex at gsyc.urjc.es >.
Last modified: Feb 9th 2014.