Post on 01-Apr-2015

Software Engineering Laboratory, Department of Computer Science, Graduate School of Information Science and Technology, Osaka University

A Preliminary Study on Impact of Software Licenses on Copy-and-Paste Reuse

Yu Kashima†, Yasuhiro Hayase††, Norihiro Yoshida†††, Yuki Manabe†, Katsuro Inoue†

†: Osaka University  ††: Toyo University  †††: Nara Institute of Science and Technology


Software Reuse

• Purpose of software reuse
  – Development of reliable software
  – Increasing software productivity
• We focus on Copy-and-Paste (CnP)
  – A basic method of software reuse


Open Source Software and Licenses

• Open Source Software (OSS)
  – Derivative works of OSS products may be distributed
  – The amount of reusable source code is growing as the number of OSS products grows
• OSS licenses
  – Many kinds of licenses have been designed to reflect the various intents of developers
  – Each OSS license has different conditions
  – Reuse is also restricted by the licenses


Representative OSS Licenses

• 3-clause BSD License (BSD3)
  – A derivative work must retain copyright notices, the list of conditions, and the disclaimer of warranties
• Apache License Version 2 (Apachev2)
  – A derivative work must retain copyright, patent, trademark, and attribution notices
• GNU General Public License Version 2 (GPLv2)
  – A derivative work must be distributed under GPLv2
• LicenseName code ≡ source code distributed under LicenseName
  Ex. BSD3 code ≡ source code distributed under BSD3


CnP between Files under Different Licenses

• If a developer reuses source code:
  – Both the license of the reused code and the license of the code under development must be satisfied simultaneously
  – Distribution of the code under development is prohibited when the two licenses cannot both be satisfied

[Figure: CnP between BSD3 code and GPLv2 code, and CnP between Apachev2 code and GPLv2 code]


Impact of License on CnP

• Hypothesis
  – The characteristics of source code reuse depend on the license of the code
    • Frequency of CnP
    • Kinds of licenses used by source code developed by CnP
• To our knowledge, there are no quantitative studies on CnP reuse from the aspect of software licenses
• We investigate actual OSS to test this hypothesis


Experiment

• A quantitative experiment was performed on a small set
• Purpose
  – Confirming our hypothesis
  – Investigating the scalability of our method
• Overview
  – Investigating the number of CnP for each license
  – Code clone detection is used for CnP detection
    • A code clone is a code fragment similar to another code fragment
    • Code clones are typically generated by CnP


Method of Experiment

[Figure: overview of the method. Step1. License detection labels the source files of Application X and Application Y as License A, License B, or Unknown. Step2. Code clone detection finds code clones between the applications. Step3. Counting code clones counts the code fragments grouped by their license.]

License      #Code Fragments
License A    10
License B    3
…            …


Step1. License Detection

• Ninka [1] is used for detecting the licenses of source files
  – It analyzes the license description in each source file
  – The licenses it detects have high precision
• Files whose license Ninka fails to detect are excluded
  – Files that contain no license description or an unknown license description

[1] D. M. German, Y. Manabe and K. Inoue: “A sentence-matching method for automatic license identification of source code files”, ASE 2010, pp. 437–446 (2010)
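The exclusion rule in Step1 can be sketched as a simple filter. This is a minimal illustration, not Ninka's interface: it assumes the detection results are already available as a mapping from file path to license label, and the labels `UNKNOWN` and `NONE` as well as the file names are hypothetical.

```python
from typing import Dict

# Hypothetical labels for files without a usable license;
# the labels produced by a real license detector may differ.
EXCLUDED_LABELS = {"UNKNOWN", "NONE"}


def keep_licensed_files(detected: Dict[str, str]) -> Dict[str, str]:
    """Return only the files whose license was identified."""
    return {
        path: license_name
        for path, license_name in detected.items()
        if license_name not in EXCLUDED_LABELS
    }


# Illustrative detection results (made up for this sketch).
detected = {
    "x/Foo.java": "BSD3",
    "x/Bar.java": "UNKNOWN",
    "y/Baz.java": "GPLv2+",
    "y/Qux.java": "NONE",
}
licensed = keep_licensed_files(detected)
print(licensed)
```

Only the files that survive this filter would be passed on to the later steps.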


Step2. Code Clone Detection

• CCFinder [2] is used for extracting code clones across different applications
  – We assume that CnP within an application will not cause license problems
• Filtering
  – Code clones generated by something other than CnP are excluded
    Ex. getter/setter, variable declarations
• The direction of each CnP is not determined

[Figure: Applications X, Y, and Z under Licenses A, B, and C. Clones between applications are labeled CnP; other clones are labeled Getter/Setter and Variable Declarations.]

[2] T. Kamiya, S. Kusumoto and K. Inoue: “CCFinder: A multilinguistic token-based code clone detection system for large scale source code”, IEEE Transactions on Software Engineering, 28, pp. 654–670 (2002)
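The restriction to clones across different applications can be sketched as follows. This is an illustration only and does not use CCFinder's output format: the `Fragment` type, its fields, and the sample data are assumptions made for this sketch.

```python
from typing import List, NamedTuple, Tuple


class Fragment(NamedTuple):
    """A code fragment reported by a clone detector (hypothetical shape)."""
    application: str
    path: str
    start_line: int
    end_line: int


ClonePair = Tuple[Fragment, Fragment]


def cross_application_pairs(pairs: List[ClonePair]) -> List[ClonePair]:
    """Keep only the clone pairs whose fragments belong to different applications."""
    return [(a, b) for a, b in pairs if a.application != b.application]


# Illustrative clone pairs (made up for this sketch).
pairs = [
    (Fragment("X", "x/Foo.java", 10, 40), Fragment("Y", "y/Baz.java", 5, 35)),
    (Fragment("X", "x/Foo.java", 50, 70), Fragment("X", "x/Bar.java", 12, 32)),
]
kept = cross_application_pairs(pairs)
print(len(kept))
```

The second pair stays inside Application X, so it is dropped under the assumption that CnP within an application does not cause license problems.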


Step3. Counting Code Clones(1/2)

• The following steps are repeated for each target license
  1. Select a license as the analysis target
  2. Extract the clone sets that include code under that license
     • A clone set is a set of code clones similar to each other
  3. Count the code fragments in the extracted clone sets, grouped by their license

[Figure: Applications X, Y, and Z under Licenses A, B, and C, showing the fragments having CnP relations to License A code]

License      #Code Fragments
License A    2
License B    1
License C    2
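The three counting steps can be sketched with a counter over clone sets. This is a minimal sketch under assumptions: each clone set is represented as a list of (application, license) pairs, and the sample clone sets are invented so that the counts match the example table for License A.

```python
from collections import Counter
from typing import List, Tuple

# Each fragment is (application, license); a clone set is a list of fragments.
CloneSet = List[Tuple[str, str]]


def count_fragments_by_license(clone_sets: List[CloneSet], target: str) -> Counter:
    """Count fragments, grouped by license, in clone sets that contain the target license."""
    counts: Counter = Counter()
    for clone_set in clone_sets:
        licenses = [license_name for _, license_name in clone_set]
        if target in licenses:  # step 2: the clone set includes target-license code
            counts.update(licenses)  # step 3: count every fragment by its license
    return counts


# Illustrative clone sets (made up for this sketch).
clone_sets = [
    [("X", "License A"), ("Y", "License B")],
    [("X", "License A"), ("Z", "License C"), ("Z", "License C")],
    [("Y", "License B"), ("Z", "License C")],  # no License A code, so it is skipped
]
counts = count_fragments_by_license(clone_sets, "License A")
print(dict(counts))
```

For this input the counts are 2 for License A, 1 for License B, and 2 for License C. Repeating the call with a different `target` covers step 1 for the other licenses.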


Step3. Counting Code Clones(2/2)

• A clone set includes both the original code fragments and the code fragments generated by CnP
  → Counting the code fragments in clone sets approximates counting the number of CnP
• We count the number of CnP to/from code fragments under the target license
• Although this table also includes CnP in the opposite direction, it is sufficient for grasping the overall picture

[Figure: Applications X, Y, and Z under Licenses A, B, and C, showing the fragments having CnP relations to License A code]

License      #Code Fragments
License A    2
License B    1
License C    2


Analyzed Code

• Java files (.java) in the main section of Debian GNU/Linux 5.0.2
• Reasons for selecting this target
  – It consists of code under various licenses
  – It can be analyzed by both Ninka and CCFinder
  – It is a feasible scale for this experiment

#Packages    452
#Files       77,452
LOC          8,530,896


License Distribution in Analyzed Code

[Figure: bar chart of #Files (axis from 0 to 20,000) for each license: Apachev2; GPLv2+; LesserGPLv2.1+; “GPLnoVersion,GPLv2+,LinkException”; GPLv2; BSD3; “GPLv2,ClassPathException”; other; No Notification; Unknown license]


Result ( BSD3 )


License #Fragments Percentage

BSD3 613 92%

GPLv2+ 20 3.0%

Apachev2 16 2.4%

LesserGPL2+ 14 2.1%

GPLv2,ClassPathException 1 0.15%

LesserGPL2.1+ 1 0.15%

• Result of counting the code fragments in clone sets that include BSD3 fragments, grouped by their license
  • The frequency of the licenses used by code fragments having a CnP relationship to BSD3 fragments
• BSD3 code is mostly reused by BSD3 code
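The Percentage column is each license's share of all counted fragments. A short sketch that recomputes the shares from the #Fragments values in the BSD3 table; the function name `percentages` is a choice made for this sketch.

```python
from typing import Dict


def percentages(counts: Dict[str, int]) -> Dict[str, float]:
    """Return each license's share of the total fragment count, in percent."""
    total = sum(counts.values())
    return {license_name: 100.0 * n / total for license_name, n in counts.items()}


# #Fragments values from the BSD3 result table.
bsd3_counts = {
    "BSD3": 613,
    "GPLv2+": 20,
    "Apachev2": 16,
    "LesserGPL2+": 14,
    "GPLv2,ClassPathException": 1,
    "LesserGPL2.1+": 1,
}
total = sum(bsd3_counts.values())
shares = percentages(bsd3_counts)
for license_name, share in shares.items():
    print(f"{license_name}: {share:.2f}%")
```

The counts sum to 665 fragments, and the BSD3 share rounds to 92%.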


Result ( Apachev2 )

License #Fragments Percentage

Apachev2 1533 77%

Apachev1.1 316 16%

LesserGPL2.1+ 42 2.1%

MPLv1.1 33 1.6%

BSD3 29 1.5%

MX4JLicensev1 16 0.80%

GPLv2+ 4 0.20%

LibraryGPL2+ 3 0.15%

MPLv1.0 2 0.10%

MITX11noNotice 2 0.10%

Public Domain 1 0.050%

Subversion+ 1 0.050%

EPLv1 1 0.050%


• A large percentage of CnP occurs between Apachev2 code fragments

• Apachev1.1 code has since been relicensed under Apachev2


Result ( GPLv2+ )


License #Fragments Percentage

GPLv2+ 268 49%

GPLnoVersion,GPLv2+,LinkException 225 41%

BSD3 28 5.1%

LibraryGPLv2+ 20 3.6%

Apachev2 4 0.73%

LesserGPLv2.1+ 4 0.73%

• CnP within GPLv2+ code accounts for the highest percentage
• “GPLnoVersion, GPLv2+, LinkException” also has a high percentage
• “GPLnoVersion, GPLv2+, LinkException” code is reused by GPLv2+ code

[Figure: CnP between “GPLnoVersion, GPLv2+, LinkException” code and GPLv2+ code]


#Files and #Fragments under Each License


#Fragments #Files #Fragments / #Files

BSD3 665 2181 0.305

Apachev2 1983 16350 0.121

GPLv2+ 549 8160 0.0673

• The frequency of CnP per file: BSD3 > Apachev2 > GPLv2+

• A large “#Fragments / #Files” value for a license indicates that code under that license is copy-and-pasted frequently
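The per-file frequency is simply #Fragments divided by #Files. A short sketch that recomputes the ratios from the table values and ranks the licenses; the variable names are chosen for this sketch.

```python
# (#Fragments, #Files) for each license, taken from the table.
table = {
    "BSD3": (665, 2181),
    "Apachev2": (1983, 16350),
    "GPLv2+": (549, 8160),
}

# Fragments per file for each license.
ratios = {
    license_name: fragments / files
    for license_name, (fragments, files) in table.items()
}

# Licenses ordered from the highest to the lowest ratio.
ranking = sorted(ratios, key=ratios.get, reverse=True)

for license_name in ranking:
    print(f"{license_name}: {ratios[license_name]:.4f}")
```

The resulting order is BSD3, Apachev2, GPLv2+, which agrees with the ordering stated on the slide.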


Summary of the Results

• Common characteristic of all licenses
  – CnP within code distributed under the same license, or under licenses designed by the same organization, accounts for the majority
    • CnP might happen mostly within an organization
• Apachev2 has CnP relations to various licenses
  – Apachev2 has the largest number of files
  – The conditions of Apachev2 are more relaxed than those of GPLv2+
• The frequency of CnP per file
  BSD3 > Apachev2 > GPLv2+


Threat to Validity

• These results are not sufficient to generalize to OSS as a whole
  – The analysis target is small
    → We plan a large-scale analysis
  – Only Java files were analyzed
    • The history of Java is short, hence Java files are less copy-and-pasted than files in other languages
    → We plan an analysis of C/C++ files
• Overlapping code fragments may be counted separately
  – The number of overlapping code fragments might be small

[Figure: Fragment A and Fragment B, illustrating overlapping code fragments]
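Whether two fragments overlap can be checked from their line ranges. The following is a hedged sketch and not part of the method described in the slides: the `Fragment` type, its fields, and the sample fragments are assumptions made for illustration.

```python
from typing import NamedTuple


class Fragment(NamedTuple):
    """A code fragment identified by file and inclusive line range (hypothetical shape)."""
    path: str
    start_line: int
    end_line: int


def overlaps(a: Fragment, b: Fragment) -> bool:
    """Return True if both fragments are in the same file and share at least one line."""
    return (
        a.path == b.path
        and a.start_line <= b.end_line
        and b.start_line <= a.end_line
    )


# Illustrative fragments (made up for this sketch).
fragment_a = Fragment("x/Foo.java", 10, 40)
fragment_b = Fragment("x/Foo.java", 30, 60)
fragment_c = Fragment("x/Foo.java", 61, 80)

print(overlaps(fragment_a, fragment_b))
print(overlaps(fragment_a, fragment_c))
```

A check like this could be used to estimate how many fragments are counted more than once.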


Scalability of the Investigation Method

• This method can be applied to a large target, because each step can scale
  – License detection
    • Ninka can analyze files in linear time
  – Code clone detection
    • There are more scalable tools than CCFinder, such as CCFinderX and D-CCFinder
  – Counting code clones
    • This process did not take a long time


Conclusion

• A preliminary study of the impact of licenses on CnP was performed
  – Java files in the main section of Debian GNU/Linux 5.0.2 were analyzed
• CnP happens mostly within code distributed under the same license or under licenses designed by the same organization
• The frequency of CnP per file
  – BSD3 > Apachev2 > GPLv2+
• Our method can be applied to a large target


Future Work

• Large-scale experiment
• Investigating whether code fragments are copy-and-pasted mostly within an organization
• Detecting the direction of CnP
