Licensing is Software Too:
Achievements and Challenges
(and how this relates to code provenance)
Massimiliano Di Penta
University of Sannio, Italy
Acknowledgements
Daniel M. Germán, Univ. Victoria, Canada
Julius Davies, Univ. Victoria, Canada
Giuliano Antoniol, Ecole Polyt. Montréal, Canada
Yann-Gaël Guéhéneuc, Ecole Polyt. Montréal, Canada
Reusing Open Source Software
When developing a software system, we try (if possible) not to reinvent the wheel
Components, libraries, and source code snippets are out there, ready to be reused
Code search engines are becoming popular
Open source code modification and redistribution are governed by
Software licenses
Copyright statements
Everything contained in a licensing
What does a licensing statement contain?
/* -*- Mode: C++; tab-width: 2; indent-tabs-mode: nil; c-basic-offset: 2 -*- */
/* ***** BEGIN LICENSE BLOCK *****
 * Version: MPL 1.1/GPL 2.0/LGPL 2.1
 *
 * The contents of this file are subject to the Mozilla Public License Version
 * 1.1 (the "License"); you may not use this file except in compliance with
 * the License. You may obtain a copy of the License at
 * http://www.mozilla.org/MPL/
….
 * Portions created by the Initial Developer are Copyright (C) 2002
 * the Initial Developer. All Rights Reserved.
 *
 * Contributor(s):
 * Brian Ryner <[email protected]>
….
 * decision by deleting the provisions above and replace them with the notice
 * and other provisions required by the GPL or the LGPL. If you do not delete
 * the provisions above, a recipient may use your version of this file under
 * the terms of any one of the MPL, the GPL or the LGPL.
 *
 * ***** END LICENSE BLOCK ***** */
#include "nsXULAppAPI.h"
#ifdef XP_WIN
#include <windows.h>
License (MPL+GPL+LGPL)
Copyright statement
Copyright year
Restrictive vs. permissive licenses
Restrictive (aka copyleft or reciprocal)
Changed software must be made available under terms similar to those of the original
Example: GPL
Permissive
Modifications/enhancements may remain proprietary
Distribution of source code or binary permitted
– Provided copyright notice and/or liability disclaimers
– Contributor names do not imply endorsement
Examples: Berkeley Software Distribution (BSD),
FOSS development teams care!
(source: Debian)
I am in the process of trying to prepare 0.8.0 for Debian
GNU/Linux I have started going over the copyright/license
headers. In src/celeste many files are missing copyright
information. Most of these are files imported with minimal
changes from Gabor API http://www.kung-foo.tv/gaborapi.php
or libsvm http://www.csie.ntu.edu.tw/~cjlin/libsvm/.
The attached patch adds copyright and license statements
to these files.[1]
Please apply and update the headers (adding copyright
holders) if you make substantial changes.
thanks, cu andreas
[1] I have doublechecked with Gabor API's upstream author
Adriaan Tijsseling that files like ContrastFilter.cpp are
Copyright (c) Adriaan Tijsseling and licensed under
GPLv2+, although the original headers just say:
Original Author: Yasunobu Honma
Conjectures
Since licenses determine the way software can be composed and re-distributed
They may change/evolve as any other part of the software
They might be subject to bugs too
– See our ICPC 2010 paper about how to identify licensing incompatibilities
They might determine the success/failure of a software project
Code provenance and licenses:
Licenses constrain source code migration between projects
Code provenance might be useful to determine
Licenses influence the software lifetime
OpenBSD founder and project leader Theo de Raadt
removed a security software package called IP-Filter
[written by Darren Reed] after its author changed its
license.
Stephen Shankland, CNET News, 2001/05/30.
Licenses evolve as software does
Failing to account for that would cause copyright infringements
Decisions on license changes have an impact, like other decisions on software evolution
Little attention so far from the scientific community
Example: Java
Until November 2006, the license of Java JDK v1.2 said:
“Except as specifically authorized in any Supplemental
License Terms, you may not make copies of Software,
other than a single copy of Software for archival
purposes”
This disallowed the inclusion of Java in Linux distributions
Java 5.0 released under the GPL v2 with the
CLASSPATH exception:
Java could be modified/updated under the GPL v2
Java programs could be released under any license as long as
they satisfy the conditions stated in the CLASSPATH exception
Changing the license of a system can promote
and ease the distribution and reuse of a
Example: QT
First released under a non-open source but free license, called the FreeQT License, and a commercial license
QT became the basis for KDE
QT v2.0 was released under a new license, the Q Public License, incompatible with the GPL
GNOME project started as a QT-free alternative to KDE
Harmony project started as a GPL replacement of QT
Trolltech changed the license of QT v3 to the GPL v2
The Harmony project was abandoned
Changing the license of FOSS system
Empirical Study
Goal: analyze licensing evolution
Purpose: investigating how developers change licensing statements
Context: CVS/SVN repositories of ArgoUML, Eclipse-JDT, the FreeBSD and
Research Questions
RQ1: To what extent are files changing their licenses?
RQ2: How are copyright years changed in licensing statements?
RQ3: Who are the contributors of a
Licensing Analysis Method – Extracting Licensing statements
/* -*- Mode: C++; tab-width: 2; indent-tabs-mode: nil; c-basic-offset: 2 -*- */
/* ***** BEGIN LICENSE BLOCK *****
 * Version: MPL 1.1/GPL 2.0/LGPL 2.1
 *
 * The contents of this file are subject to the Mozilla Public License Version
 * 1.1 (the "License"); you may not use this file except in compliance with
 * the License. You may obtain a copy of the License at
 * http://www.mozilla.org/MPL/
….
 * Portions created by the Initial Developer are Copyright (C) 2002
 * the Initial Developer. All Rights Reserved.
 *
 * Contributor(s):
 * Brian Ryner <[email protected]>
….
 * decision by deleting the provisions above and replace them with the notice
 * and other provisions required by the GPL or the LGPL. If you do not delete
 * the provisions above, a recipient may use your version of this file under
 * the terms of any one of the MPL, the GPL or the LGPL.
 *
 * ***** END LICENSE BLOCK ***** */
#include "nsXULAppAPI.h"
#ifdef XP_WIN
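The extraction step can be illustrated with a minimal sketch, assuming the licensing statement is simply every comment that appears before the first line of code in a C/C++-style file. The function name `extract_licensing_statement` and the regular expression are hypothetical and are not taken from the study's tooling.

```python
import re

# A leading comment: optional whitespace, then /* ... */ or a // line comment.
# Assumption: the licensing statement is made of the comments before any code.
_LEADING_COMMENT = re.compile(r"\s*(/\*.*?\*/|//[^\n]*)", re.DOTALL)


def extract_licensing_statement(source):
    """Return the comments that precede the first line of code, joined by newlines."""
    position = 0
    comments = []
    while True:
        match = _LEADING_COMMENT.match(source, position)
        if match is None:
            break  # first non-comment token reached, e.g. an #include
        comments.append(match.group(1))
        position = match.end()
    return "\n".join(comments)
```

Comments that appear after the first line of code are ignored, so inline documentation is not mistaken for licensing text.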
Licensing Analysis Method – Classifying licenses
FoSSology [Gobeille, MSR 2008]: detects licenses using the Binary Symbolic Alignment Matrix (bSAM)
Ninka [German et al., ASE 2010]: uses a sentence-based pattern-matching approach
[Figure: two copies of the Mozilla MPL 1.1/GPL 2.0/LGPL 2.1 license block shown above]
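As a rough illustration of classification by pattern matching, the sketch below normalizes a licensing statement and looks for a few known phrases. The patterns and the names `LICENSE_PATTERNS`, `normalize`, and `classify` are assumptions made for this example; they are not the rule set of Ninka or FoSSology.

```python
import re

# Illustrative patterns only; real tools use far larger, curated rule sets.
LICENSE_PATTERNS = [
    ("MPL 1.1", re.compile(r"Mozilla Public License Version\s+1\.1", re.IGNORECASE)),
    ("GPL", re.compile(r"\bGPL\b|GNU General Public License", re.IGNORECASE)),
    ("LGPL", re.compile(r"\bLGPL\b|GNU Lesser General Public License", re.IGNORECASE)),
]


def normalize(statement):
    """Drop comment decoration and collapse whitespace so wrapped sentences match."""
    text = re.sub(r"[/*#]+", " ", statement)
    return re.sub(r"\s+", " ", text).strip()


def classify(statement):
    """Return the matched license names joined by '+', or 'None' if nothing matches."""
    text = normalize(statement)
    found = [name for name, pattern in LICENSE_PATTERNS if pattern.search(text)]
    return "+".join(found) if found else "None"
```

Normalizing first matters because a license sentence is often wrapped across several comment lines, as in the Mozilla block above.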
Licensing Analysis Method – Identifying changes in copyright years
Mining references to years in licensing…
[Figure: two copies of the Mozilla MPL 1.1/GPL 2.0/LGPL 2.1 license block shown above]
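Mining year references can be sketched as follows, assuming years are only read from lines that mention a copyright and that a range such as 2002-2004 covers every year in between. The helper names `copyright_years` and `year_changes` are hypothetical.

```python
import re

# A four-digit year, optionally followed by "-<year>" to form a range.
_YEARS = re.compile(r"\b((?:19|20)\d{2})(?:\s*-\s*((?:19|20)\d{2}))?\b")
_COPYRIGHT_LINE = re.compile(r"copyright|\(c\)", re.IGNORECASE)


def copyright_years(statement):
    """Collect the years mentioned on copyright lines of a licensing statement."""
    years = set()
    for line in statement.splitlines():
        if not _COPYRIGHT_LINE.search(line):
            continue  # ignore version numbers and other digits elsewhere
        for match in _YEARS.finditer(line):
            start = int(match.group(1))
            end = int(match.group(2) or match.group(1))
            years.update(range(start, max(start, end) + 1))
    return years


def year_changes(old_statement, new_statement):
    """Return (added years, removed years) between two revisions of a statement."""
    before = copyright_years(old_statement)
    after = copyright_years(new_statement)
    return sorted(after - before), sorted(before - after)
```

Comparing the two sets for consecutive revisions of a file gives the revisions in which copyright years were touched.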
Licensing Analysis Method – Identifying contributor names
Mining emails, plus various patterns
Copyright … year name
Contributor(s) …
Names are mapped to committers, whenever possible
[Figure: two copies of the Mozilla MPL 1.1/GPL 2.0/LGPL 2.1 license block shown above]
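The two patterns and the email mining can be sketched as below. This is a simplified illustration under stated assumptions: the regular expressions and the function `mine_contributors` are hypothetical, a contributor list is assumed to end at the first empty comment line, and the mapping of names to committers is not covered.

```python
import re

EMAIL = re.compile(r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+")
# "Copyright (c) <year>[-<year>] [by] <name>"
COPYRIGHT = re.compile(
    r"Copyright\s*(?:\(c\))?\s*(?:19|20)\d{2}(?:\s*[-,]\s*(?:19|20)\d{2})*\s+(?:by\s+)?(.+)",
    re.IGNORECASE,
)
TRAILER = re.compile(r"[.,]?\s*All rights reserved\.?\s*$", re.IGNORECASE)


def mine_contributors(statement):
    """Return (names, emails) found in a licensing statement."""
    names, emails = [], []
    in_list = False
    for raw in statement.splitlines():
        line = raw.strip().strip("*/#").strip()  # drop comment decoration
        emails.extend(EMAIL.findall(line))
        if re.match(r"Contributor\(s\)\s*:", line, re.IGNORECASE):
            in_list = True
            continue
        if in_list:
            name = re.sub(r"<[^>]*>", "", line).strip()  # drop "<email>"
            if name:
                names.append(name)
            else:
                in_list = False  # an empty line closes the contributor list
            continue
        match = COPYRIGHT.search(line)
        if match:
            names.append(TRAILER.sub("", match.group(1)).strip())
    return names, emails
```

The names and addresses used to try it out should be made up; the sketch does not resolve multi-line copyright holders such as the one in the Mozilla block.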
RQ1: Most relevant license changes
Eclipse-JDT
Common Public License v1.0 → Eclipse Public License v1.0, CHANGE, 2394
Common Public License v0.5 → Common Public License v1.0, UPDATE, 808
Mozilla
NPL → 'NPL v1.1'-style+GPL v2+LGPL v2.1, DUAL, 2914
NPL → 'Dual MPL GPL'-style+MPL, DUAL, 1274
'Dual MPL GPL'-style+MPL → NPL, BUG, 1194
Licensing updated as new licenses were developed
Eclipse JDT: CPL 0.5 → CPL 1.0 → EPL 1.0
IBM has relinquished control of licenses to the Eclipse Foundation
Mozilla: NPL → MPL + GPL (+ LGPL)
The NPL allowed Netscape 6 to be released as a proprietary system
The MPL only allows the source code to be redistributed under the MPL
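Counts like the ones above can be obtained, in principle, by comparing the license label of each file revision with that of the previous revision. The sketch below assumes the labels have already been produced by a classifier; the function `license_transitions` and the input layout are hypothetical and may differ from how the study aggregated its data.

```python
from collections import Counter


def license_transitions(history):
    """Count (old license, new license) pairs over consecutive file revisions.

    history maps a file path to its license labels, oldest revision first.
    """
    counts = Counter()
    for labels in history.values():
        for old, new in zip(labels, labels[1:]):
            if old != new:  # only revisions where the label actually changed
                counts[(old, new)] += 1
    return counts
```

Sorting the resulting counter by value gives the most frequent transitions per project.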
RQ1: Most relevant license changes
FreeBSD
BSD UCRegents (4-cl BSD) → 'BSD UCRegents'-style (4-cl BSD), UPDATE, 491
'BSD UCRegents'-style (4-cl BSD) → 'INRIA-OSL'-style (3-cl BSD), UPDATE, 300
OpenBSD
'BSD UCRegents'-style (4-cl BSD) → 'INRIA-OSL'-style (3-cl BSD), UPDATE, 964
BSD UCRegents (4-cl BSD) → 'BSD UCRegents'-style (4-cl BSD), UPDATE, 414
FreeBSD and OpenBSD are more eclectic than other projects
Moving from BSD-4 clauses to the more
RQ1: Most relevant license changes
ArgoUML
None → 'Free with copyright clause'-style + 'UC Regents free with copyright clause'-style, ADD, 127
Samba
None → GPL v2, ADD, 15
ArgoUML and Samba kept the same licenses over the analyzed time span
Change is from None to a simple license
Authors realized the importance of including a
RQ2: How and why were copyright years changed?
Files for which the copyright years were updated underwent a significantly higher number of changes than others
When developers perform substantial changes to a file, they also update copyright years
Required by copyright regulations
Lack of updates with substantial changes would allow an infringer to claim “innocent infringement”
Commits explicitly targeted to copyright years
“Updated copyrights”
RQ3: When do contributors change?
Changes where contributor names are added are significantly bigger than other changes
Contributors are often added when they make substantial changes
Contributor names are important assets in source code
Like the signature on a picture
However…
contributors can change over time
there is no standard way of reporting them
there is no clear rule on when one should become a contributor
Free (software) as a bird…
As birds migrate differently during different seasons….
Code might have a preferential migration direction
Given two systems, e.g. FreeBSD and Linux
We find the same code in both systems
Three scenarios:
Migration FreeBSD → Linux
Migration Linux → FreeBSD
Migration third-party
Sibling(s) Origin
Identify siblings between systems using clone detection
CCFinderX, with >100 tokens as threshold, plus other heuristics
Trace siblings back into the past, following their code fragments in the same files
Again using clone detection, matching the sibling fragment against previous file revisions
When they disappear, then we have their origins
Take the oldest of the two as the true origin
[Figure: cloned fragments in Sys 1 – File i and Sys 2 – File j are siblings]
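The trace-back step can be sketched as follows. This is a simplified illustration, not the study's pipeline: exact token-sequence containment stands in for CCFinderX, the revision lists are assumed to be ordered newest first, and the names `first_appearance` and `true_origin` are hypothetical.

```python
import re
from datetime import date


def tokens(code):
    """Split code into identifiers/numbers and single punctuation characters."""
    return re.findall(r"\w+|[^\w\s]", code)


def contains(haystack, needle):
    """True if the token list needle occurs contiguously in haystack."""
    size = len(needle)
    return any(haystack[i:i + size] == needle for i in range(len(haystack) - size + 1))


def first_appearance(revisions, fragment):
    """Walk back through (date, text) revisions, newest first, until the
    fragment disappears; return the date of the oldest revision containing it."""
    wanted = tokens(fragment)
    origin = None
    for when, text in revisions:
        if not contains(tokens(text), wanted):
            break  # the fragment disappeared: the previous hit is its origin
        origin = when
    return origin


def true_origin(revisions_1, revisions_2, fragment):
    """Return 'Sys 1' or 'Sys 2', whichever held the fragment first."""
    dated = [
        (first_appearance(revisions_1, fragment), "Sys 1"),
        (first_appearance(revisions_2, fragment), "Sys 2"),
    ]
    dated = [(when, name) for when, name in dated if when is not None]
    return min(dated)[1] if dated else None
```

Working on tokens rather than raw text makes the comparison insensitive to whitespace changes between revisions, which is one reason clone detectors operate at the token level.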
Code Migration and Licenses
FreeBSD      Linux         Files
BSD          GPL           8
BSD          MIT           2
BSD          None          2
Corporate    BSD+GPL       89
GPL          None          1
Phrase       BSD+GPL       1
X.Net+BSD    MIT           1

Linux        FreeBSD       Files
BSD+GPL      Corporate     8
GPL          BSD           17
GPL          BSD+GPL       1
GPL          CPL+BSD+GPL   1
MIT          BSD           1
MIT+GPL      None          2
None         BSD           1
Phrase+GPL   MIT           2

OpenBSD      Linux         Files
BSD          BSD+GPL       1
BSD          MIT           2
BSD          Unknown       1
BSD+GPL      GPL           1
BSD+Phrase   Phrase+GPL    1
MIT          GPL           23
After Jan 1, 2002
Nothing before
Before
Jan 1, 2002
Discussion
Siblings have a preferential flow
Initially from BSD(s) to Linux – frequent
Today from Linux to FreeBSD – less frequent
Thus, due to licenses but also to the system level of development
Companies directly contribute to code in different kernels – see Intel drivers with dual licenses
In this case, code migrates from a third party
Motivations
Very often, Java open source software is distributed in jar archives
See http://mvnrepository.com/
Problem: the jar might not contain licensing info
Under what conditions can we integrate the component?
The jar might not be legally used
Even if it’s from open source code, we
Search-driven approach
Extracting info from the class bytecode
Class and package names.. or a fingerprint..
We use the ASM library (http://asm.ow2.org/)
Querying Google Code Search
Using the fully qualified class name
Using the package only
Query performed using the Google Code API (http://code.google.com/apis/gdata/)
If the same class is not found, its license is
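The first half of the approach, deriving query terms from a jar, can be sketched as follows. Note the substitution: instead of reading the bytecode with ASM, this sketch takes class and package names from the entry paths inside the archive, which is an assumption that holds only when entries follow the usual package layout. The functions `class_names`, `search_terms`, and `demo_jar` are hypothetical, the jar entries are made up, and no search request is issued.

```python
import io
import zipfile


def class_names(jar):
    """Return fully qualified names of top-level classes in a jar (path or file object)."""
    names = []
    with zipfile.ZipFile(jar) as archive:
        for entry in archive.namelist():
            # Skip resources and nested/anonymous classes such as Outer$1.class
            if entry.endswith(".class") and "$" not in entry:
                names.append(entry[: -len(".class")].replace("/", "."))
    return sorted(names)


def search_terms(class_name):
    """Return the two query strings: fully qualified class name, then package only."""
    package, _, _ = class_name.rpartition(".")
    return class_name, package


def demo_jar():
    """Build a tiny in-memory jar with made-up entries, for illustration only."""
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w") as archive:
        archive.writestr("org/example/codec/Base64.class", b"")
        archive.writestr("org/example/codec/Base64$1.class", b"")
        archive.writestr("META-INF/MANIFEST.MF", "Manifest-Version: 1.0\n")
    buffer.seek(0)
    return buffer
```

Each class yields one precise query (the full name) and one broader fallback (the package), mirroring the two query styles listed above.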
% of correct classifications
Found license: Min. 29% (commons.codec), Avg. 82%, median: 89.5%
Inferred licenses: Min. 62% (JLayer 1.0), Avg. 95%, median 100%
The inferring heuristic is significantly better both in terms of
Incorrect classifications
Most of them are between LGPL and GPL and between BSD and Apache.
commons-codec: mismatching between Apache and BSD
files licensed under the Apache v 1.1 derived from the BSD
JLayer: mismatching between GPL and LGPL
same inferred licenses in both releases (0.4 and 1.0)
however, JLayer moved from GPL to
Conclusions
We proposed a code analysis method as support for lawyers as well as for software engineers
We studied how licensing statements are used and evolve
License type, copyright year, contributors
Main findings:
Licenses influence project outcomes
Licenses influence code migration
Moving towards more permissive licenses
Copyright years and contributor names updated
Licensing and code provenance
Licensing influences the direction in which code flows from one system towards another
Often code flows in the direction of more permissive licenses…
..but there are many other factors influencing how code flows
Search-driven approaches can be adopted to determine what code a closed component comes from
And thus its licensing…
Issues related to the capabilities of the code
References
Daniel M. Germán, Jens H. Weber-Jahnke, Massimiliano Di Penta: Lawful Software Engineering. Proceedings of FoSER: Working Conference on the Future of Software Engineering Research, November 2010, Santa Fe, USA, ACM
Daniel M. Germán, Massimiliano Di Penta, Julius Davies: Understanding and Auditing the Licensing of Open Source Software Distributions. ICPC 2010: 84-93
Massimiliano Di Penta, Daniel M. Germán, Yann-Gaël Guéhéneuc, Giuliano Antoniol: An exploratory study of the evolution of software licensing. ICSE 2010: 145-154
Massimiliano Di Penta, Daniel M. Germán, Giuliano Antoniol: Identifying licensing of jar archives using a code-search approach. MSR 2010: 151-160
Massimiliano Di Penta, Daniel M. Germán: Who are Source Code Contributors and How do they Change? WCRE 2009: 11-20
Daniel M. Germán, Massimiliano Di Penta, Yann-Gaël Guéhéneuc, Giuliano Antoniol: Code siblings: Technical and legal implications of copying code between applications. MSR 2009: 81-90
Daniel M. Germán, Yuki Manabe, Katsuro Inoue: A sentence-matching method for automatic license identification of source code files. ASE 2010: 437-446
Daniel M. Germán, Ahmed E. Hassan: License integration patterns: Addressing license mismatches in component-based development. ICSE 2009: 188-198