From wkt at tuhs.org  Wed Sep 10 08:02:30 2003
From: wkt at tuhs.org (Warren Toomey)
Date: Wed Sep 10 08:02:36 2003
Subject: [TUHS] Fwd: Helping in the battle against SCO
Message-ID: <20030909220230.GA70691@minnie.tuhs.org>

All,
	This e-mail below was prompted by an interview I gave about
the SCO thing for an Australian paper:
http://www.theage.com.au/articles/2003/09/09/1062902037394.html

----- Forwarded message from Ulrik Petersen <emdros@yahoo.dk> -----
Date: Tue, 9 Sep 2003 19:44:51 +0200 (CEST)
From: Ulrik Petersen <emdros@yahoo.dk>
Subject: Helping in the battle against SCO

I saw a recent article in the Sydney Morning Herald in
which a Dr. Warren Toomey (presumably you?) was quoted
as saying that the TUHS has several members who have
access to old copies of UNIX source code.  

Please ask these people to try out one of the three
"shredders" which can compare sourcecode from Linux
with other sourcecode, and, if possible, analyze and
publish the results.

One of these shredders is written by Eric S. Raymond. 
Here is a link to an article in which he calls for
action by people with access to UNIX sourcecode:

http://www.eweek.com/article2/0,4149,1257617,00.asp

The program itself can be found here:

http://www.catb.org/~esr/comparator/

Regards,
Ulrik Petersen, Denmark
----- End forwarded message -----

Anyway, I think it's a good idea, so I'd like to hear from people
who have access to recent AT&T code. My GPG and PGP keys are at
http://minnie.tuhs.org/warren.html and on most keyservers if you
so wish to use them.

Thanks,
	Warren

From matt at aclaro.com  Wed Sep 10 14:18:15 2003
From: matt at aclaro.com (Matthew Mastracci)
Date: Thu Sep 11 09:08:47 2003
Subject: [TUHS] Fwd: Helping in the battle against SCO
Message-ID: <3F5F8707.7070504@aclaro.com>

What about comparing SVR1/2 to 2.4.x?  SCO seems to be picking on the 
early 2.4.x codebase.  This should also pick up the SGI code comments in 
the malloc() function that were recently publicized, though I'm not sure 
which version Linus removed the code from.

Matt.

From wkt at tuhs.org  Thu Sep 11 09:17:40 2003
From: wkt at tuhs.org (Warren Toomey)
Date: Thu Sep 11 09:17:55 2003
Subject: [TUHS] Fwd: Helping in the battle against SCO
In-Reply-To: <3F5F8707.7070504@aclaro.com>
References: <3F5F8707.7070504@aclaro.com>
Message-ID: <20030910231740.GA82319@minnie.tuhs.org>

On Wed, Sep 10, 2003 at 02:18:15PM -0600, Matthew Mastracci wrote:
> What about comparing SVR1/2 to 2.4.x?  SCO seems to be picking on the 
> early 2.4.x codebase.  This should also pick up the SGI code comments in 
> the malloc() function that were recently publicized, though I'm not sure 
> which version Linus removed the code from.
> Matt.

Yes we can do this. But I'm suspecting that SCO has found lots of BSD
code in both Linux and their codebase. SysVR2 didn't have any networking,
so we probably won't get much similarity. Anyway, we can try!

	Warren

From norman at nose.cs.utoronto.ca Wed Sep 10 19:59:07 2003
From: norman at nose.cs.utoronto.ca (Norman Wilson)
Date: Thu Sep 11 10:03:14 2003
Subject: [TUHS] Fwd: Helping in the battle against SCO
Message-ID: <20030911000259.F34D91E83@minnie.tuhs.org>

I don't see how any diffing we do will make any difference
`in the battle against SCO.' If we find cases in which Linux
has incorporated System V licensed code, that will certainly
be meaningful; but if, as seems likely, we don't, SCO can
just say their tools are better than ours. And besides, it
is SCO who have brought the complaint, so both legally and
ethically it's up to SCO to prove the case, not up to others
to disprove it, no matter what fearsome roars SCO emit.

Comparisons done by others are certainly interesting, and I
don't want to discourage anyone from doing them; just don't
expect it to make any difference to the lawyers. (Not that
I'm one, of course.)

Norman Wilson
Toronto ON

From luvisi at andru.sonoma.edu  Wed Sep 10 17:41:49 2003
From: luvisi at andru.sonoma.edu (Andru Luvisi)
Date: Thu Sep 11 10:28:57 2003
Subject: [TUHS] Fwd: Helping in the battle against SCO
In-Reply-To: <20030911000259.F34D91E83@minnie.tuhs.org>
Message-ID: <Pine.LNX.4.44.0309101738100.2092-100000@gladen>

On Wed, 10 Sep 2003, Norman Wilson wrote:
> I don't see how any diffing we do will make any difference
> `in the battle against SCO.'
[snip]

Some ways that I can see it being a good thing to do:

  If SCO holds up a piece of common code and the good guys have no
  response, that is bad.

  If SCO holds up a piece of common code and the good guys already know
  that it actually came from BSD, and are prepared to demonstrate such,
  that is good.

  If SCO holds up a piece of common code and the good guys already know
  that it was contributed to Linux by SCO/Caldera themselves, and are
  prepared to demonstrate such, that is good.

  If there is infringing code, it should be taken out of Linux as quickly
  as possible.

Andru
-- 
Andru Luvisi

Quote Of The Moment:
  Heisenberg may have been here.

From grog at lemis.com Thu Sep 11 10:25:46 2003
From: grog at lemis.com (Greg Lehey)
Date: Fri Sep 12 03:29:13 2003
Subject: [TUHS] Fwd: Helping in the battle against SCO
In-Reply-To: <20030911000259.F34D91E83@minnie.tuhs.org>
References: <20030911000259.F34D91E83@minnie.tuhs.org>
Message-ID: <20030911172545.GC946@adelaide.lemis.com>

On Wednesday, 10 September 2003 at 19:59:07 -0400, Norman Wilson wrote:
> I don't see how any diffing we do will make any difference `in the
> battle against SCO.'

It could. There's a lot of confusion out there. The people on this
list have a much better understanding of the technical issues than
just about any other group of people I can think of.

> If we find cases in which Linux has incorporated System V licensed
> code, that will certainly be meaningful; but if, as seems likely, we
> don't, SCO can just say their tools are better than ours.

FWIW, the first example that SCO showed in Las Vegas on 18 August does
appear to be derived from System V.3 malloc(). See
http://www.lemis.com/grog/SCO/code-comparison.html for the details.
Also, if anybody else can confirm or deny my analysis based on code
inspection, I'd be *very* grateful.

Summary: the first example showed a slightly modified version of Third
Edition malloc() being used for a slightly different purpose in the
SGI ia64 port only. The slight modifications tracked those in System
V.3, suggesting that SGI derived their code from System V, and not
from an earlier version. On the other hand, the differences in System
V.3 were removed again, and in fact the Linux community had already
removed the entire code before SCO "revealed" it.

> And besides, it is SCO who have brought the complaint, so both
> legally and ethically it's up to SCO to prove the case, not up to
> others to disprove it, no matter what fearsome roars SCO emit.

No question.

Greg
--
Finger grog@lemis.com for PGP public key
See complete headers for address and phone numbers

From grog at lemis.com Thu Sep 11 10:32:32 2003
From: grog at lemis.com (Greg Lehey)
Date: Fri Sep 12 03:35:55 2003
Subject: [TUHS] Fwd: Helping in the battle against SCO
In-Reply-To: <Pine.LNX.4.44.0309101738100.2092-100000@gladen>
References: <20030911000259.F34D91E83@minnie.tuhs.org>
<Pine.LNX.4.44.0309101738100.2092-100000@gladen>
Message-ID: <20030911173232.GD946@adelaide.lemis.com>

On Wednesday, 10 September 2003 at 17:41:49 -0700, Andru Luvisi wrote:
> On Wed, 10 Sep 2003, Norman Wilson wrote:
>> I don't see how any diffing we do will make any difference
>> `in the battle against SCO.'
> [snip]
>
> Some ways that I can see it being a good thing to do:
>
> If SCO holds up a piece of common code and the good guys have no
> response, that is bad.

Agreed. That doesn't apply to either piece of code they've shown so
far. This is http://www.lemis.com/grog/SCO/code-comparison.html
again.

> If SCO holds up a piece of common code and the good guys already
> know that it actually came from BSD, and are prepared to
> demonstrate such, that is good.

That's the second example :-) The question I've asked SCO is: how
could you have missed the Berkeley license agreement at the beginning
of this file? SCO have backed off claiming that this is System V
code, and claim it's just an example of their code comparison
techniques. But on slide 15 of their presentation
(http://www.vangennip.nl/perens/SCOsource_Briefing_II.2.pdf), they
clearly claim that it's System V code. This suggests that SCO have
recognized their error, though they haven't yet had the decency to
apologize to the BSD community.

Greg
--
Finger grog@lemis.com for PGP public key
See complete headers for address and phone numbers

From norman at nose.cs.utoronto.ca  Mon Sep 15 16:39:31 2003
From: norman at nose.cs.utoronto.ca (Norman Wilson)
Date: Tue Sep 16 06:44:06 2003
Subject: [TUHS] Fwd: Helping in the battle against SCO
Message-ID: <20030915204355.EB9311E5D@minnie.tuhs.org>

Andru Luvisi:

  If SCO holds up a piece of common code and the good guys have no
  response, that is bad.

  If SCO holds up a piece of common code and the good guys already know
  that it actually came from BSD, and are prepared to demonstrate such,
  that is good.

  If SCO holds up a piece of common code and the good guys already know
  that it was contributed to Linux by SCO/Caldera themselves, and are
  prepared to demonstrate such, that is good.

  If there is infringing code, it should be taken out of Linux as quickly
  as possible.

======

I'll grant all those points, but if the idea is to defang SCO, the
effort still seems fruitless to me.

System V and Linux both contain appallingly large volumes of code.
(On a list that discusses the UNIX of the 1970s, perhaps I can say
that without creating undue ruckus.)  The odds are that quite a
lot of the code is similar.  Should we really spend months and months
tracking it all down and trying to declare where each line came from,
or should we wait until SCO declares a specific set of cases that matter
(as they must do sooner or later or abandon the court battle)?

When one is faced with an enormous set of possible computations, of
which only a handful are likely to be needed in the end, lazy evaluation
is usually the better choice.

It does seem sensible to me for the Linux community to do its best to
hunt down any infringing code, and to try to assess whether there's a
serious problem lurking that nobody had noticed.  But that ought to be
a matter of basic ethics, having nothing to do with SCO.  I doubt it
is likely to make much difference to the court battle anyway: SCO's
claim is that the infringing code is there now, that it was put there
deliberately at IBM's instigation to do harm to them, and that the harm
already exists; removing it now won't change any of that.  I think it's
a good idea to remove any infringements that are there now, even if they
are trivial ones; but let's not fool ourselves that it will pull SCO's
fangs to do so.

Norman Wilson
Toronto ON

From wkt at tuhs.org  Tue Sep 16 08:48:53 2003
From: wkt at tuhs.org (Warren Toomey)
Date: Tue Sep 16 08:49:12 2003
Subject: [TUHS] Lexical comparator, was Re: the battle against SCO
In-Reply-To: <20030915204355.EB9311E5D@minnie.tuhs.org>
References: <20030915204355.EB9311E5D@minnie.tuhs.org>
Message-ID: <20030915224853.GA27957@minnie.tuhs.org>

On Mon, Sep 15, 2003 at 04:39:31PM -0400, Norman Wilson wrote:
> It does seem sensible to me for the Linux community to do its best to
> hunt down any infringing code... But that ought to be a matter of basic 
> ethics, having nothing to do with SCO.  I doubt it is likely to make 
> much difference to the court battle anyway... I think it's
> a good idea to remove any infringements that are there now, even if they
> are trivial ones; but let's not fool ourselves that it will pull SCO's
> fangs to do so.
 
For me it's not just a matter of defeating SCO, it's also one of sheer
indignation in the face of Saganesque FUD ("billions and billions of
lines of code"). I seriously want to know if there's even the tiniest
possibility that SCO is right, or if they're just Smoking Crack Often.
 
While we're on the topic, I saw esr's code shredder/comparator that works
on lines of code. This isn't going to work if variables get renamed etc.
I'm writing a code comparator that works on a lexical basis, comparing
C tokens. It's only going to be proof of concept (i.e. slow), but I
should have it done by week's end and I'll pop a notice in here when it's
ready.
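
To illustrate the difference between the two approaches, here is a
minimal Python sketch. It is neither esr's comparator nor the tool
described above; the helper names (line_set, token_stream), the regular
expression, and the abbreviated keyword list are all assumptions made
for the example. It compares two fragments that differ only in their
identifier names, first line by line and then as token streams with
every identifier collapsed to a placeholder.

```python
import re

# Abbreviated C keyword list; an assumption for this sketch, not a full lexer table.
KEYWORDS = {"if", "else", "for", "while", "return", "int", "char",
            "struct", "register", "void"}

# Identifiers/keywords, integer literals, or any other single non-space character.
TOKEN_RE = re.compile(r"[A-Za-z_]\w*|\d+|\S")

def line_set(source):
    """Return the set of non-empty lines with all whitespace removed."""
    return {"".join(line.split()) for line in source.splitlines() if line.strip()}

def token_stream(source):
    """Return the C-like tokens of source, with identifiers collapsed to 'ID'."""
    tokens = []
    for tok in TOKEN_RE.findall(source):
        if re.match(r"[A-Za-z_]", tok) and tok not in KEYWORDS:
            tokens.append("ID")
        else:
            tokens.append(tok)
    return tokens

original = """
int sum(int n) {
    int total = 0;
    while (n > 0) { total = total + n; n = n - 1; }
    return total;
}
"""

renamed = """
int add_up(int count) {
    int acc = 0;
    while (count > 0) { acc = acc + count; count = count - 1; }
    return acc;
}
"""

# Line-based view: which whole lines do the two fragments share?
shared_lines = line_set(original) & line_set(renamed)
# Token-based view: are the two token streams identical?
same_tokens = token_stream(original) == token_stream(renamed)

print(sorted(shared_lines))
print(same_tokens)
```

After the renaming, the only line the two fragments still share is the
closing brace, while the collapsed token streams are identical.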
 
Cheers,
        Warren

From norman at nose.cs.utoronto.ca  Mon Sep 15 20:02:52 2003
From: norman at nose.cs.utoronto.ca (Norman Wilson)
Date: Tue Sep 16 10:07:13 2003
Subject: [TUHS] Lexical comparator, was Re: the battle against SCO
Message-ID: <20030916000703.635F31E5D@minnie.tuhs.org>

Warren Toomey:

  For me it's not just a matter of defeating SCO, it's also one of sheer
  indignation in the face of Saganesque FUD ("billions and billions of
  lines of code"). I seriously want to know if there's even the tiniest
  possibility that SCO is right, or if they're just Smoking Crack Often.

That's fair enough.  Just remember that no matter how much you scan
the code, you can't beat the FUD campaign by doing so.  SCO can just
claim their tools are better than yours, and continue to stonewall
about showing their evidence.  

And as I said last week, both legally and morally the onus is on
SCO to provide proof of their claims: the infringement, that it
was done maliciously, that it has caused them harm.  The `evidence'
they have shown so far makes me doubt very much that they can
prove all three of those things, or possibly any but the least-
significant case of the first.

As I also said last week, I don't mean to discourage anyone from
doing code comparisons.  Intellectually it's an interesting exercise.
Ethically it's the right thing to do if the Linux community thinks it's
possible that licensed code got into the system.  Even legally it
might make some difference to have shown due diligence, though not
in the matter presently before the courts.  If it makes someone feel
less frustrated, that's fine too.

But scanning the Linux code won't provide hard proof of anything,
any more than you can claim to prove there are no leaks in your roof
solely by inspection.  If proof is possible, it will work the other way.

Norman Wilson
Toronto ON

From rweather at zip.com.au  Tue Sep 16 10:46:49 2003
From: rweather at zip.com.au (Rhys Weatherley)
Date: Tue Sep 16 11:01:02 2003
Subject: [TUHS] Lexical comparator, was Re: the battle against SCO
In-Reply-To: <20030915224853.GA27957@minnie.tuhs.org>
References: <20030915204355.EB9311E5D@minnie.tuhs.org>
 <20030915224853.GA27957@minnie.tuhs.org>
Message-ID: <200309161046.49958.rweather@zip.com.au>

On Tuesday 16 September 2003 08:48 am, Warren Toomey wrote:

> While we're on the topic, I saw esr's code shredder/comparator that works
> on lines of code. This isn't going to work if variables get renamed etc.

I'd like to point out that the more steps that are taken to factor out 
identifier names, whitespace conventions, etc, the closer you approach a 
situation where the tool says "both programs are written in the same 
programming language" or "both programs use binary searching somewhere in 
their code".  Which, while true, isn't terribly useful to know.  A human 
being still needs to wade through the results and inspect them manually.
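
The trade-off can be made concrete with a small Python sketch. All names
in it are hypothetical, and it does not reflect how any tool mentioned in
this thread works internally. Once identifiers are collapsed, two
unrelated fragments still share a long run of tokens simply because both
contain an ordinary counting loop.

```python
import re

KEYWORDS = {"for", "int", "return", "if", "while"}   # abbreviated, assumed list
TOKEN_RE = re.compile(r"[A-Za-z_]\w*|\d+|\S")

def collapse(source):
    """Tokenise C-like text, replacing every non-keyword identifier with 'ID'."""
    return ["ID" if re.match(r"[A-Za-z_]", t) and t not in KEYWORDS else t
            for t in TOKEN_RE.findall(source)]

def longest_common_run(a, b):
    """Length of the longest contiguous run of tokens present in both lists."""
    best = 0
    prev = [0] * (len(b) + 1)
    for i in range(1, len(a) + 1):
        cur = [0] * (len(b) + 1)
        for j in range(1, len(b) + 1):
            if a[i - 1] == b[j - 1]:
                # Extend the run that ended at the previous token of both lists.
                cur[j] = prev[j - 1] + 1
                best = max(best, cur[j])
        prev = cur
    return best

# Two fragments with nothing in common except an everyday loop idiom.
clear_buffer = "for (i = 0; i < n; i++) buf[i] = 0;"
count_spaces = "for (k = 0; k < len; k++) if (s[k] == 32) spaces++;"

run = longest_common_run(collapse(clear_buffer), collapse(count_spaces))
print(run)
```

The shared run here covers the whole for (...) header even though the
two fragments do unrelated things. A tool can discard runs below some
minimum length, but whatever remains still needs a human to judge it.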

Cheers,

Rhys Weatherley.

From robert at timetraveller.org  Mon Sep 15 21:35:12 2003
From: robert at timetraveller.org (Robert Brockway)
Date: Tue Sep 16 11:39:56 2003
Subject: [TUHS] Lexical comparator, was Re: the battle against SCO
In-Reply-To: <20030916000703.635F31E5D@minnie.tuhs.org>
References: <20030916000703.635F31E5D@minnie.tuhs.org>
Message-ID: <Pine.LNX.4.56.0309152131290.24213@zen.canint.timetraveller.org>

On Mon, 15 Sep 2003, Norman Wilson wrote:

> possible that licensed code got into the system.  Even legally it

Hi.  Don't want to nitpick here but many of us think it is important to
get this point straight whenever we are talking about GPLed code.  The
kernel is licenced (as I'm sure you know).  What we are of course
concerned about is:

a) Code which is licenced in a manner incompatible with the GPL

b) Code that the copyright holder did not authorise going into the kernel.

I'm sure you were just speaking in shorthand but it is a subtle point that
many misinterpret.  Many people outside the OSS community think that "all
that free code" is in the public domain, which it is most definitely not.

> Norman Wilson
> Toronto ON

A fellow Torontonian, perhaps we may meet at TLUG sometime.  I'm giving
the next talk.

Cheers,
	Rob

-- 
Robert Brockway B.Sc. email: robert@timetraveller.org, zzbrock@uqconnect.net
Linux counter project ID #16440 (http://counter.li.org)
"The earth is but one country and mankind its citizens" -Baha'u'llah

From norman at nose.cs.utoronto.ca  Mon Sep 15 22:11:41 2003
From: norman at nose.cs.utoronto.ca (Norman Wilson)
Date: Tue Sep 16 12:16:17 2003
Subject: [TUHS] Lexical comparator, was Re: the battle against SCO
Message-ID: <20030916021555.EFF121EB2@minnie.tuhs.org>

Robert Brockway:

  Hi.  Don't want to nitpick here but many of us think it is important to
  get this point straight whenever we are talking about GPLed code.  The
  kernel is licenced (as I'm sure you know).  What we are of course
  concerned about is:

  a) Code which is licenced in a manner incompatible with the GPL

  b) Code that the copyright holder did not authorise going into the kernel.

  I'm sure you were just speaking in shorthand but it is a subtle point that
  many misinterpret.  Many people outside the OSS community think that "all
  that free code" is in the public domain, which it is most definitely not.

====

Quite right.  I wasn't speaking in shorthand, I was speaking in
clumsy; what I should have written is `possible that code restricted
by the System V license got into the system.'

Licenses come in all flavours, and whether there is any license
at all is not the issue here.  I certainly didn't mean, for
example, to imply that all licenses are evil, reptilian kitten-
eaters from another planet.

Norman Wilson
Toronto ON

From imp at bsdimp.com  Mon Sep 15 21:01:26 2003
From: imp at bsdimp.com (M. Warner Losh)
Date: Tue Sep 16 16:02:07 2003
Subject: [TUHS] Fwd: Helping in the battle against SCO
In-Reply-To: <20030915204355.EB9311E5D@minnie.tuhs.org>
References: <20030915204355.EB9311E5D@minnie.tuhs.org>
Message-ID: <20030915.210126.54187719.imp@bsdimp.com>

In message: <20030915204355.EB9311E5D@minnie.tuhs.org>
            norman@nose.cs.utoronto.ca (Norman Wilson) writes:
: tracking it all down and trying to declare where each line came from,

In BSD land, we can do that.  We have cvs annotate.  Looks like the
stubborn refusal to use source code control, and to have only a few
people putting things together makes it a lot harder to track things
down after the fact.  Good call that.

Warner

From imp at bsdimp.com  Mon Sep 15 21:10:11 2003
From: imp at bsdimp.com (M. Warner Losh)
Date: Tue Sep 16 16:02:08 2003
Subject: [TUHS] Lexical comparator, was Re: the battle against SCO
In-Reply-To: <Pine.LNX.4.56.0309152131290.24213@zen.canint.timetraveller.org>
References: <20030916000703.635F31E5D@minnie.tuhs.org>
	<Pine.LNX.4.56.0309152131290.24213@zen.canint.timetraveller.org>
Message-ID: <20030915.211011.51703000.imp@bsdimp.com>

In message: <Pine.LNX.4.56.0309152131290.24213@zen.canint.timetraveller.org>
            Robert Brockway <robert@timetraveller.org> writes:
: a) Code which is licenced in a manner incompatible with the GPL
: b) Code that the copyright holder did not authorise going into the kernel.

There's a lot of code that originated in the BSD world that had its
copyrights shorn off, a GPL splatted on and the mass hacking began.
Many of these are no longer recognizable from their original form, and
aren't a problem.  Some have much more in common with the original.
Linux is vulnerable to the original author having a shit fit if they
ever find out.  Most of the open source authors are amused when this
happens, so the odds are low a big deal would be made of it.  This
practice was wide-spread in the early 1990s, although things have
improved a lot.

However, without something like CVS and the legal assignment of
copyright (or formal acknowledgement of licensing under the GPL, which
is harder to defend), this will always be a problem with Linux.

The BSD projects are a little tighter about this, but still would be
vulnerable.

Warner

From wkt at tuhs.org Thu Sep 18 12:56:32 2003
From: wkt at tuhs.org (Warren Toomey)
Date: Thu Sep 18 12:56:39 2003
Subject: [TUHS] Lexical comparator
In-Reply-To: <20030915224853.GA27957@minnie.tuhs.org>
References: <20030915204355.EB9311E5D@minnie.tuhs.org>
<20030915224853.GA27957@minnie.tuhs.org>
Message-ID: <20030918025632.GA50614@minnie.tuhs.org>

On Tue, Sep 16, 2003 at 08:48:53AM +1000, Warren Toomey wrote:
> While we're on the topic, I saw esr's code shredder/comparator that works
> on lines of code. This isn't going to work if variables get renamed etc.
> I'm writing a code comparator that works on a lexical basis, comparing
> C tokens. It's only going to be proof of concept (i.e. slow), but I
> should have it done by week's end and I'll pop a notice in here when it's
> ready.

Well, it's done. The software is now available at
http://minnie.tuhs.org/Programs/Ctcompare. I have also made available
some tokenised source trees so you can do some comparisons straight away.

If anybody has Unix kernel trees which they cannot divulge due to licensing
restrictions, I'd appreciate you creating tokenised files of the kernel
source and e-mailing them to me.

Thanks!
Warren

From wkt at tuhs.org  Thu Sep 18 21:45:26 2003
From: wkt at tuhs.org (Warren Toomey)
Date: Thu Sep 18 21:45:32 2003
Subject: [TUHS] Lexical comparator
In-Reply-To: <200309181041.h8IAfAWe000686@skeeve.com>
References: <200309181041.h8IAfAWe000686@skeeve.com>
Message-ID: <20030918114526.GA54312@minnie.tuhs.org>

On Thu, Sep 18, 2003 at 01:41:10PM +0300, Aharon Robbins wrote:
> > If anybody has Unix kernel trees which they cannot divulge due to licensing
> > restrictions, I'd appreciate you creating tokenised files of the kernel
> > source and e-mailing them to me.
> 
> Hmmm.  Just between us chickens, given tokenized versions of an entire tree,
> how hard would it be to recreate a functional kernel?

Pretty damn hard. All identifiers (variable names) are reduced to
a single token. Actually, that's not true. The meaning of the names
is removed and replaced with numeric identifiers that are unique to
each file. Here's a tokenised portion of 32V (bio.c):

   56:   struct id10 * 
   57:   id13 ( id14 , id15 ) 
   58:   id16 id14 ; 
   59:   id17 id15 ; 
   60:   { 
   61:   register struct id10 * id18 ; 
   62:   
   63:   id18 = id19 ( id14 , id15 ) ; 
   64:   if ( id18 ->id20 & id21 ) { 
   65:   #ifdef id1 
   66:   id9 . id5 ++ ; 
   67:   #endif 
   68:   return( id18 ) ; 
   69:   } 
   70:   id18 ->id20 |= id22 ; 
   71:   id18 ->id23 = id24 ; 
   72:   ( * id25 [ id26 ( id14 ) ] . id27 ) ( id18 ) ; 
   73:   #ifdef id1 
   74:   id9 . id3 ++ ; 
   75:   #endif 
   76:   id28 ( id18 ) ; 
   77:   return( id18 ) ; 
   78:   } 

Now go and check the actual source and work out which function it is!
[ see http://minnie.tuhs.org/UnixTree/32VKern/usr/src/sys/sys/bio.c.html ]
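
For readers who want to see the idea in miniature, here is a rough
Python sketch of per-file identifier numbering. It is an illustration
built on assumptions, not the Ctcompare implementation: the function name
tokenise, the regular expression, the short keyword list, and the choice
to number identifiers from 1 in order of first use are all invented for
the example, and the real tool's numbering and file format may differ.

```python
import re

# Short, assumed keyword list; a real C lexer would carry the full set.
KEYWORDS = {"struct", "register", "if", "return"}

# Identifiers, a few two-character operators, numbers, or any other single character.
TOKEN_RE = re.compile(r"[A-Za-z_]\w*|->|\+\+|--|\|=|\d+|\S")

def tokenise(source):
    """Replace each distinct identifier with idN, numbered in order of first use."""
    ids = {}      # identifier name -> its idN replacement, private to this call
    out = []
    for tok in TOKEN_RE.findall(source):
        if re.match(r"[A-Za-z_]", tok) and tok not in KEYWORDS:
            if tok not in ids:
                ids[tok] = "id%d" % (len(ids) + 1)
            out.append(ids[tok])
        else:
            out.append(tok)
    return " ".join(out)

# A made-up line, not taken from any Unix source tree.
line = "bp = lookup(dev, blkno); if (bp->flags & DONE) return(bp);"
print(tokenise(line))
```

Because the name-to-number table is rebuilt for every file, the same
name in two files need not get the same number, and the original names
cannot be read back out of the tokenised form.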

	Warren