Replication Package

Estimating Development Effort in Free/Open Source Software Projects by Mining Software Repositories: A Case Study of OpenStack

Based on the criteria proposed in "On the reproducibility of empirical software engineering studies based on data retrieved from development repositories" (Open Access - Empirical Software Engineering, Volume 17, Numbers 1-2, 75-89), the attributes of this study are given in the following table:

Element	Assessment	Condensed Assessment
Data source	Usable	U
Retrieval Methodology	Usable; likely available in future; flexible	U + *
Raw dataset	Usable	U
Extraction methodology	Usable; likely available in future; flexible	U + *
Study parameters	Usable	U
Processed dataset	Usable	U
Analysis methodology	Usable; likely available in future; flexible	U + *
Results dataset	Usable	U

Identification:
- 61 OpenStack repositories from GitHub (openstack_repos.txt)
- On-line survey sent to OpenStack developers
Description:
- Git repositories
- Description of the survey (the survey questions are included verbatim in the paper).
Availability:
- Git repositories: Public.
- On-line survey: Partially public. To preserve privacy, an anonymous version is released.
Persistence: Yes.

Identification: Set of Python scripts that extract the relevant information for the study from the Data Source: agregate_activity.py (3,7 KB)
Description: Script to transform the MySQL dump (see Data Source) into CSV files (see Processed Dataset).
Availability: Public.
Persistence: Yes.
Flexibility: Yes. Released under the GNU GPLv2.

Identification: Survey dates.
Description: The survey opened on January 30th, 2014 and closed on February 2nd, 2014.

Identification: Cleaning of the survey.
Description: Deletions and modifications made to the survey data. Details are given in survey_cleaning.txt (1 KB).

Identification:
- (Anonymized) Survey data (all responses): answers_openstack.all.public.csv (3,7 KB)
- (Anonymized) Survey data (responses after cleaning): answers_openstack.public.csv (3,5 KB)
- Number of commits per month (*): output_commits_nobots.csv (195 KB)
- Number of active days per month (*): output_activity_nobots.csv (194 KB)
- List of identifiers for the same author (merging): output_authors_ids_nobots.csv (40 KB)
Description: The information relevant to this study is extracted from the raw dataset and stored in several CSV files. Bots have been filtered out (the bot ids are given in the paper). (*) Each row contains the monthly values for one author, starting in January 2010 (padding months at the beginning of 2010 and at the end of 2014 are included). The author_id is not included in these rows, but it can be obtained from output_authors_ids_nobots.csv, because the rows of both files are in the same order.
Availability: Public.
Persistence: Yes.
Flexibility: Yes. They are all CSV files.
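
Because the monthly files carry no author_id and are matched to output_authors_ids_nobots.csv only by row order, the pairing can be sketched as below. This is a minimal illustration, not one of the released scripts: the function name pair_rows_with_authors is hypothetical, and the sketch assumes comma-separated files without a header row whose monthly values are integers, none of which is confirmed on this page.

```python
import csv
from itertools import zip_longest


def pair_rows_with_authors(monthly_file, authors_file, delimiter=","):
    """Pair the i-th row of a monthly file with the i-th row of the authors file.

    Assumptions (not confirmed by the package description): no header row,
    comma-separated fields, integer monthly values.
    """
    monthly_rows = csv.reader(monthly_file, delimiter=delimiter)
    author_rows = csv.reader(authors_file, delimiter=delimiter)
    missing = object()  # marks the shorter file running out of rows
    paired = []
    for ids, months in zip_longest(author_rows, monthly_rows, fillvalue=missing):
        if ids is missing or months is missing:
            # Row order is the only link between the files, so a length
            # mismatch means the pairing cannot be trusted.
            raise ValueError("files have a different number of rows")
        paired.append((ids, [int(value) for value in months]))
    return paired
```

With the real files, the two arguments would be handles opened with open(..., newline="") on output_commits_nobots.csv (or output_activity_nobots.csv) and output_authors_ids_nobots.csv. Failing on a row-count mismatch is a deliberate choice here, since a silent truncation would shift every author by one row.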

Identification: Scripts
Description: Set of Python scripts used for the analysis. Includes a README file with detailed information. Set of R and Python scripts to create the graphs included in the paper.
Availability: Public
- scripts.tar.gz (4,0 KB)
- graphs.tar.gz (2,1 KB)
Persistence: Yes.
Flexibility: Yes. All scripts have been released under the GPL v2.0 or later.

Identification: Set of files as returned by the scripts used for analysis.
Description: Structured text files (in general CSV files with the name of the columns given in the first line).
Availability: Public.
- results.tar.gz (3,3 KB)
Persistence: Yes.
Flexibility: Yes. Provided as easily parseable text files.
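
Since the result files are, in general, CSV files with the column names on the first line, they can be read with the Python standard library alone. The sketch below is only an illustration: the function name load_table is hypothetical, a comma delimiter is assumed, and no actual column names from results.tar.gz are implied.

```python
import csv


def load_table(handle, delimiter=","):
    """Read a CSV table whose first line holds the column names.

    Returns the list of column names and one dict per data row.
    The comma delimiter is an assumption; values are kept as strings.
    """
    reader = csv.DictReader(handle, delimiter=delimiter)
    columns = list(reader.fieldnames or [])  # reads the header line
    rows = [dict(row) for row in reader]
    return columns, rows
```

A file extracted from results.tar.gz would be passed as a handle opened with open(path, newline=""). Values are left as strings so that any conversion to numbers stays an explicit decision of the person reusing the data.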

Comments and suggestions: Gregorio Robles < grex at gsyc.urjc.es >.
Last modified: Feb 9th 2014.