v 0.1.2

This is ybsnarfz, a rather simplistic package to snarf the Yahoo financial boards
for any given stock.

v 0.1.2
    Please run FixTime.pl to fix times between midnight and 1 am.


License:

All files, scripts and programs in this package are available under terms of the GPL 
v2 from the FSF.  A copy of that license is in this package.

Dependencies:

	perl, obviously.  I run 5.8.5, but the programs should work on somewhat earlier
	versions as well.

	Getopt::Long package for perl

	MySQL.  Nothing exotic, except that it uses "replace into", which may require
	v4 of MySQL.

	DBI and DBD::mysql for perl.

	wget. This is how data is pulled from yahoo.

	linux/posix/unix.  Internally there are some system calls to standard posix/unix 
	commands, cat, ls, and mv.

Package Contents:

	for v 0.1.0
		
		CHANGELOG	    - List of changes.
		detag.pl	    - The html scanner program.
		FindBadDates.pl     - Refetch posts where the date is bad.
		FindMissing.pl      - Fetch any posts missing from the db (not deleted).
		FixTime.pl          - Fix bad times between midnight and 1 am.
		GPLv2.txt	    - Text copy of the GPL v2 license.
		htdocs/		    - directory for php scripts etc
		htdocs/ybsnarfz.php - A PHP to display data.
		htdocs/scox-properties.php - A sample properties file.
		README		    - This file.
		sample-output/      - some sample output from ybsnarfz.php
		scox.properties     - Sample properties file.
		showMessages.pl	    - Display (not ready for prime time)
		TODO		    - Things which might get done.
		UpdateRecs.pl       - Go back and update number of recs.
		work/		    - working directory to accumulate messages.
		yahooGetLast.pl     - Scan to find last message number.
		yahooMsgScan.pl	    - The message parser program.
		yahooRecsParse.pl   - Parse the message list for recs.
		yboard.tabdef	    - The input to mysql to create the messages table.
		ybScan.pl	    - The program which saves messages to the db.
		YbsOptions.pm	    - Header and common code used by other programs.
		ypull.pl	    - The program to pull messages from the yahoo board.


	
Installation and Operation:

	Unpack the tarball:  tar -xvzf ybsnarfz-X.X.X.tar.gz.  This should create the
	directory ybsnarfz-X.X.X (the X's are the version number, very low!).  You may
	leave the files where they are or move them as desired.  The programs expect to
	find each other, the properties files and the work directory relative to the
	directory they are run from; they also write some transient files there (this
	should be changed to use a temp directory).

	Create a database and a user to access it.  Currently the package builds only one
	table.  The example uses "yboard" for the database name, user and password; if the
	name yboard is not otherwise in use you can follow the example exactly.  Otherwise,
	or if paranoid, change as desired.

		mysqladmin -u<your master> -p<your masterpw> create yboard
		mysql -u<your master> -p<your masterpw> mysql
		grant all privileges on yboard.* to yboard@localhost identified by 'yboard';
		exit;

	Create the table.

		cd into the install directory

		mysql -uyboard -pyboard yboard <yboard.tabdef


	Fix the configuration file:  For SCOX the scox.properties file may look like this:

	domain    = Yahoo Finance Board
	locus     = SCOX
	boardid   = 1600684464
	boardname = cald
	workdir   = work
	dbhost    = localhost
	dbuser    = yboard
	dbpass    = yboard
	dbset     = yboard

	domain is the primary qualifier.  locus is not really used but is there just in
	case.  boardid and boardname are both required so that wget can build the yahoo
	url.  Note that domain and boardid, along with the message number, form the primary
	key of the messages table.  workdir is the name of the work directory where pulled
	posts are stored until ybScan.pl processes them.  The dbxxxx settings are for access
	to your database.
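For quick checks from the shell, a value can be read out of a properties file with
a small helper.  This is only a sketch and not part of the package: prop_get is a
made-up name, and it assumes every setting is a single "key = value" line as in the
sample above.  The perl programs do their own parsing in YbsOptions.pm, which may
differ.

```shell
# prop_get FILE KEY
# Print the value of KEY from a "key = value" properties file.
# Hypothetical helper, not shipped with ybsnarfz.
prop_get() {
    awk -v key="$2" '
        {
            eq = index($0, "=")          # position of the first "="
            if (eq == 0) next            # skip lines that are not settings
            k = substr($0, 1, eq - 1)
            v = substr($0, eq + 1)
            gsub(/^[ \t]+|[ \t]+$/, "", k)   # trim blanks around the key
            gsub(/^[ \t]+|[ \t]+$/, "", v)   # trim blanks around the value
            if (k == key) { print v; exit }
        }
    ' "$1"
}

# Example (run from the install directory):
#   prop_get scox.properties boardid
```

Only the first "=" on a line is treated as the separator, so values containing
spaces (such as the domain) come back whole.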

	All options:

		boardid		- The numeric id used by yahoo to uniquely identify a 
				  message board.  Can be seen in the url.  Required.

		boardname	- The canonical name of the board in the yahoo url.
				  Required.

		dbhost		- The hostname to connect to the database.  Defaults to
				  'localhost'.

		dbpass		- The password to connect to the database. Default is 
				  'yboard'.

		dbset		- The database name.  Default is 'yboard'.

		dbuser		- The database connection id.  Default 'yboard'.

		detag		- The html scanner program.  Defaults to "detag.pl" (part
				  of this package) under the directory specified in 
				  execdir.
		
		domain		- The domain of messages.  This is text and can be any 
				  value and will be the same for all rows as part of 
				  the primary key but is required.

		execdir		- Defaults to ".", and tells the ybScan program where to
				  find the html scanner and message parser.  This is
				  included so you can move the executables elsewhere.

		getlast		- Program to find the last post number (run on the
				  concatenated output of detag).  Defaults to
				  "yahooGetLast.pl" (in package) under execdir.

		locus		- Just a name tag in the database for these scans.  A
				  descriptive name is suggested.

		parser		- The message parser program.  Defaults to 
				  "yahooMsgScan.pl" (part of this package) under the 
				  directory specified in execdir.

		puller		- The puller program.  ypull.pl under execdir.

		scanner		- The message parser program. yahooMsgScan.pl under
				  execdir.

		tempdir		- Location where ypull writes some temporary files.

		workdir		- Location where ypull writes the message files of messages
				  retrieved from yahoo.  ybScan subsequently deletes 
				  them from here after writing them to the DB.

There are three basic programs, which are run from the directory where they reside:
ypull.pl, ybScan.pl and UpdateRecs.pl.  All of these require an argument giving the prefix
of the properties file (viz "./ypull.pl scox" for scox.properties).  All can take a -d <number>
option for debug level.  -d 1 will give more info per process; -d 2 will give a lot
more in ybScan... you probably don't want to do that unless you are fixing a code problem.

ypull.pl  optionally takes additional arguments of the first and last message numbers to pull.  
A zero (or not specified) for the first number tells it to look in  the messages table for the 
maximum message number for that domain and board id (the last will be rescanned).  ypull doesn't 
update the table; it writes the data in the work directory (with a magic tag on the top saying 
what message number) as <name>-<msgnum>.post.

ybScan.pl invokes detag and yahooMsgScan for all files in the work directory that have the
properties file name as a prefix and a ".post" suffix.  After processing a file successfully,
ybScan deletes it from the work directory.
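To see which posts are still waiting for ybScan.pl, the file naming described above
(<name>-<msgnum>.post in the work directory) can be inspected from the shell.  The
following is a rough sketch, not part of the package; post_msgnum and pending_posts
are made-up names, and the sketch assumes the message number is everything between
the last "-" and the ".post" suffix.

```shell
# post_msgnum PATH
# Print the message number encoded in a <name>-<msgnum>.post file name.
post_msgnum() {
    base=${1##*/}                 # strip the directory part
    base=${base%.post}            # strip the ".post" suffix
    printf '%s\n' "${base##*-}"   # keep what follows the last "-"
}

# pending_posts WORKDIR NAME
# List, in numeric order, the message numbers still waiting in WORKDIR
# for the properties prefix NAME.
pending_posts() {
    for f in "$1"/"$2"-*.post; do
        [ -e "$f" ] || continue   # the glob matched nothing
        post_msgnum "$f"
    done | sort -n
}

# Example (run from the install directory):
#   pending_posts work scox
```

An empty listing after a ybScan.pl run suggests every pulled post was processed;
leftover numbers point at posts that ybScan.pl did not delete.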

UpdateRecs.pl will by default go back 2000 records from the most recent and update the number 
of recs on posts. A starting message number may be supplied as an optional parameter.

	You may wish to set up a simple cron script which looks like this (adjust the
	directory for your version):

	cd ~/ybsnarfz-0.1.0
	./ypull.pl scox
	./ybScan.pl scox

	... and have it run every day, hour or however often you feel like.  Similarly you may 
	wish to run UpdateRecs.pl somewhat less often.
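A slightly more defensive variant of such a cron script might look like the sketch
below.  It is not part of the package: run_snarf is a made-up name, the crontab line
and script path in the comment are only placeholders, and it assumes that ypull.pl
exits with a non-zero status when it fails, which this README does not confirm.

```shell
# run_snarf INSTALL_DIR PROP
# Pull new posts, then scan them into the database.  The scan is skipped
# if the pull reports failure.  Runs in a subshell so the caller's
# working directory is left alone.
run_snarf() {
    (
        cd "$1" || exit 1
        ./ypull.pl "$2" && ./ybScan.pl "$2"
    )
}

# Example wrapper body:
#   run_snarf "$HOME/ybsnarfz-0.1.0" scox
#
# Example crontab entry to run a wrapper script hourly (path is hypothetical):
#   0 * * * * $HOME/bin/ybsnarfz-cron.sh
```

The cd matters because the programs look for each other, the properties file and
the work directory relative to where they are run.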


The Messages Table:
	
	The full text of this definition is in yboard.tabdef.

  domain varchar(64) 	# A descriptive name for the domain. 
  locus varchar(64) 	# A descriptive name for the particular part of domain.
  boardid varchar(64) 	# A unique identifier in the domain
  msgn bigint(20) 		# The message number, unique in domain and boardid.
  poster varchar(255) 	# Who sent the message
  posttime datetime 	# Date and time of posting
  recs int(11) 			# How well regarded was this message.
  title varchar(255)	# Subject line of the message
  refmsg bigint(20)		# Parent message number.
  refby varchar(255) 	# Parent message poster.
  msg blob 				# Text of the message.
  PRIMARY KEY  (domain,boardid,msgn),
  KEY msgn (msgn),
  KEY poster (poster),
  KEY refmsg (refmsg),
  KEY postTime (postTime)


Usage Notes:


sh-3.00$ 
sh-3.00$ ./ypull.pl -h

 usage is [path/]ypull.pl [-h][-d n] <prop> [start] [end]

This program pulls messages from the Yahoo Finance board and
   saves them for further processing.

A -h (or no arguments) will print this message.

A -d n will turn on debugging if n is not zero.

<prop> is the prefix name of a .properties file.

"start" is the first message number; if zero or not specified
   ypull will find the last message in the database and restart
   from there.

"end" is the last message number to pull.  If unspecified or
   zero ypull will find the last message number currently 
   available on the board and use that.

sh-3.00$ 
sh-3.00$ 
sh-3.00$ ./ybScan.pl -h

usage is [path/]ybScan.pl [-h][-d n] <prop>

This program takes Yahoo Finance Board posts as captured by 
   ypull.pl, parses out the information and saves it in a 
   database table, removing the posts from the archive created
   by ypull.

-h (or no arguments) causes this message to print.

-d n will display some or lots of debug information if n is 1 or
   higher.

<prop> is the prefix of a .properties files specifying how to
   handle this information.

sh-3.00$ 
sh-3.00$ 
sh-3.00$ ./UpdateRecs.pl 

 usage is [path/]UpdateRecs.pl [-h][-d n] <prop> [start] 

This program updates the number of recs in the messages table.
   saves them for further processing.

A -h (or no arguments) will print this message.

A -d n will turn on debugging if n is not zero.

<prop> is the prefix name of a .properties file.

"start" is the first message number; if zero or not specified
   this will find the last message in the database and start
   from 2000 back.

sh-3.00$ 
sh-3.00$ 

ybsnarfz.php

	This is a php script to display the data in useful ways.  It depends on a configuration
	file (<name>-properties.php) in the same directory as the php script, readable by
	apache (or whatever web server you use).  This configuration is identical to the
	.properties file used by the perl programs, save for the syntax changes needed for PHP.
	The script displays in both list and threaded modes.

xybsnarfz.php
   
	Very similar to ybsnarfz.php, but it truncates the display of each message at 600
	characters and includes an iframe back to the yahoo board.

Gotchas:
	
	The primary gotcha in this system is that it does not preserve runs of blank lines.  More
	than one blank line in the input will always be reduced to a single blank line.  That could
	be fixed... do I want to?
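The effect on a message body can be reproduced from the shell.  The snippet below is
only an illustration of the behaviour described here, not the code the package uses;
squeeze_blank is a made-up name.

```shell
# squeeze_blank
# Copy stdin to stdout, reducing every run of blank lines to one blank line.
squeeze_blank() {
    awk '
        NF       { blank = 0; print; next }   # non-blank line: print, reset the run
        !blank++ { print "" }                 # first blank line of a run only
    '
}

# Example:
#   printf 'first\n\n\n\nsecond\n' | squeeze_blank
```

A post that used three blank lines to separate sections will therefore show up in
the database with a single blank line between them.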

-- TWZ