Hi,
Occasionally I go to Project Gutenberg ( http://www.gutenberg.org ) for
public domain books that have been digitized and made available for
downloading. I thought it might be interesting to examine them using text
analysis software that would give me an idea of the difficulty level and
frequency count of words and phrases. I would also like to have the word
count and frequency results graphed or made available for export to Excel.
Is anyone familiar with a software package, freeware or otherwise, that can
handle this level of analysis and charting?
I did find the web site textalyser (
http://www.lexicool.com/text_analyzer.asp ) that has some of these
capabilities but no way to specify varying phrase lengths or export options
for charting nor does it handle book size jobs.
Thanks for the help, Steve.
credoquaabsurdum - 13 May 2005 14:37 GMT
> Is anyone familiar with a software package, freeware or otherwise, that can
> handle this level of analysis and charting?
I think the Oxford WordSmith Tools at oup.com are what you're looking
for, but you should carefully check them out before you sign up. I
might not have a real handle on what you really need.
You might want to hit postgraduate student discussion groups at major
applied linguistics places like U. Edinburgh and Nottingham.
Good luck.
John Ings - 13 May 2005 15:41 GMT
>Hi,
>
[quoted text clipped - 12 lines]
>capabilities but no way to specify varying phrase lengths or export options
>for charting nor does it handle book size jobs.
Is there any possibility that you might have the stones to try a
little programming? There are a variety of computer languages that do
this sort of thing. You might be on a learning curve for a while, but
you would end up having complete control of your results. Compilers,
interpreters and user manuals for these languages are available on the
net for free, so all you need invest is your time. PERL used to be the
language of choice: http://www.perl.com/ but it has been surpassed in
my opinion by PYTHON http://www.python.org
Chances are any software package that offers what you want will have
been written in one of those two, since such parsing and analysis is
their forte.
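To give a feel for how little programming a basic word count takes, here is a minimal Python sketch. It is only an illustration, not a finished tool: the function name word_frequencies and the rule for what counts as a word (runs of letters, with internal apostrophes allowed) are assumptions made for this example.

```python
import re
from collections import Counter


def word_frequencies(text):
    """Return a Counter mapping each lowercased word to its count."""
    # Assumption: a "word" is a run of letters, optionally with
    # internal apostrophes (so "didn't" stays one word).
    words = re.findall(r"[a-z]+(?:'[a-z]+)*", text.lower())
    return Counter(words)


if __name__ == "__main__":
    sample = "The cat sat on the mat. The mat didn't mind."
    # Print the three most frequent words with their counts.
    for word, count in word_frequencies(sample).most_common(3):
        print(word, count)
```

For a whole book, read the downloaded text file into a string and pass it to the same function.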
Lee Sau Dan - 13 May 2005 16:31 GMT
>>>>> "Steve" == Steve <abc@123.com> writes:
Steve> Hi, Occasionally I go to Project Gutenberg (
Steve> http://www.gutenberg.org ) for public domain books that
Steve> have been digitized and made available for downloading.
Digitized? An *image* (e.g. in JPEG, GIF, PNG formats) of scanned
book pages and documents is also in digitized form. I hope you don't
mean that. If you're talking about text format, then you're in
better luck.
Steve> I thought it might be interesting to examine them using
Steve> text analysis software that would give me an idea of the
Steve> difficulty level
No. Unless you define "difficulty level". Computers can't read your
mind. They won't know what you mean by "difficult". If you don't
give them a clear and unambiguous mathematical formula for "difficulty
level", they can't do it.
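One published formula that could serve as such a definition is the Flesch Reading Ease score: 206.835 - 1.015 * (words per sentence) - 84.6 * (syllables per word), where higher scores mean easier text. The sketch below is an assumption-laden illustration rather than a reference implementation: the syllable counter is a crude vowel-group heuristic, sentences are split on ".", "!" and "?" only, and the function names are made up for this example.

```python
import re


def count_syllables(word):
    """Rough syllable estimate: count vowel groups, drop a silent final 'e'."""
    word = word.lower()
    count = len(re.findall(r"[aeiouy]+", word))
    # Heuristic: a trailing 'e' is usually silent ("made"), except in "-le" ("table").
    if word.endswith("e") and not word.endswith("le") and count > 1:
        count -= 1
    return max(count, 1)


def flesch_reading_ease(text):
    """Flesch Reading Ease score; higher means easier to read."""
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    words = re.findall(r"[A-Za-z]+", text)
    if not sentences or not words:
        raise ValueError("text needs at least one sentence and one word")
    syllables = sum(count_syllables(w) for w in words)
    words_per_sentence = len(words) / len(sentences)
    syllables_per_word = syllables / len(words)
    return 206.835 - 1.015 * words_per_sentence - 84.6 * syllables_per_word
```

Because the syllable count is approximate, scores are best used to compare texts with each other rather than read as exact values.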
Steve> and frequency count of words and phrases.
Frequency count of words is trivial.
Phrases... that's more difficult, esp. with ambiguous sentences.
("Time flies like an arrow; fruit flies like a banana." How do you
break them up into phrases?) Computers are pretty incapable of
dealing with ambiguities.
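If "phrase" is taken to mean simply a run of n consecutive words (an n-gram), which sidesteps the grammatical ambiguity rather than solving it, counting phrases of varying length becomes mechanical. A minimal sketch under that assumption, with the function name ngram_frequencies chosen for this example:

```python
import re
from collections import Counter


def ngram_frequencies(text, n):
    """Count every run of n consecutive words in the text."""
    words = re.findall(r"[a-z]+(?:'[a-z]+)*", text.lower())
    # Zip the word list against itself shifted by 1..n-1 positions,
    # which yields each window of n adjacent words.
    windows = zip(*(words[i:] for i in range(n)))
    return Counter(" ".join(window) for window in windows)


if __name__ == "__main__":
    text = "Time flies like an arrow; fruit flies like a banana."
    for phrase, count in ngram_frequencies(text, 2).most_common(3):
        print(count, phrase)
```

Note that this version ignores punctuation, so a window can straddle a clause or sentence boundary ("arrow fruit" is counted as a two-word phrase here). Splitting the text into sentences first would avoid that.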
Steve> I would also like to have the word count and frequency
Steve> results graphed or made available for export to Excel.
Steve> Is anyone familiar with a software package, freeware or
Steve> otherwise, that can handle this level of analysis and
Steve> charting?
Word frequency is trivial. A simple unix command does it:
cat *.txt | tr -cs 'A-Za-z' '\n' | sort | uniq -c | sort -nr
would give you the frequency of words, listed in descending order of
frequency, appearing in the files matching "*.txt". Pipe the results
into charting software (e.g. gnuplot) and you're done.
Steve> I did find the web site textalyser (
Steve> http://www.lexicool.com/text_analyzer.asp ) that has some
Steve> of these capabilities but no way to specify varying phrase
Steve> lengths or export options for charting nor does it handle
Steve> book size jobs.
And I guess it won't do "difficulty level". Right?

Signature
Lee Sau Dan
E-mail: danlee@informatik.uni-freiburg.de
Home page: http://www.informatik.uni-freiburg.de/~danlee
Jan - 14 May 2005 09:34 GMT
You can try
http://taporware.mcmaster.ca/~taporware/
for one-off jobs over the Internet, but I don't think it does
book-length texts.
For 'charting', if you're not into programming, find a program that
spits an index or word frequency count out in a text file and import it
into a spreadsheet (e.g. Excel).
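If no ready-made program turns up, the "frequency count in a text file" step can be sketched in a few lines of Python. This is a hedged example, not a recommendation of a particular tool: the function name export_word_counts, the file names, and the word-matching rule are all assumptions for illustration. It writes a CSV file, which Excel can open directly.

```python
import csv
import re
from collections import Counter


def export_word_counts(text_path, csv_path, encoding="utf-8"):
    """Write word,count rows (most frequent first) to a CSV file.

    Returns the number of distinct words found.
    """
    # A book-length plain-text file is small enough to read in one go.
    with open(text_path, encoding=encoding) as source:
        words = re.findall(r"[a-z]+(?:'[a-z]+)*", source.read().lower())
    counts = Counter(words)
    # newline="" lets the csv module control line endings itself.
    with open(csv_path, "w", newline="", encoding="utf-8") as target:
        writer = csv.writer(target)
        writer.writerow(["word", "count"])
        writer.writerows(counts.most_common())
    return len(counts)


if __name__ == "__main__":
    # Hypothetical file names; substitute the book you downloaded.
    print(export_word_counts("book.txt", "word_counts.csv"))
```

Once the file is open in the spreadsheet, the two columns can be charted with its ordinary graphing tools.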
Do you have a Mac? I have a free program I downloaded that does word
counts, indexing, concordancing etc. and has managed reasonable-length
English teaching coursebooks, though it can't do right/left sorting. I
can never remember the name, but when I get to my Mac on Monday I can
let you have it if it'll help.
And finally, if you have a bit of money, you can find programs quite
easily by searching Google.
Jan
Jan - 14 May 2005 09:38 GMT
PS You mentioned 'difficulty level'. Have a look at TextLadder:
http://www.readingenglish.net/software/
I have never had the time to sit down and plod through what it does,
but it sounds interesting enough. It does some kind of ordering of
texts to even out the number of new words in each text a student
encounters over a reading program.
Jan
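The general idea of ordering texts by how many new words each one introduces can be sketched as follows. This is not TextLadder's actual method, which is not described in this thread; it is a simple greedy approach assumed purely for illustration, and the function names are made up.

```python
import re


def vocabulary(text):
    """Return the set of distinct lowercased words in a text."""
    return set(re.findall(r"[a-z]+(?:'[a-z]+)*", text.lower()))


def order_by_new_words(texts):
    """Greedily order texts so each next one adds the fewest unseen words.

    texts: dict mapping a title to its full text.
    Returns a list of (title, new_word_count) in reading order.
    """
    vocab = {title: vocabulary(body) for title, body in texts.items()}
    known = set()           # words met in texts already scheduled
    remaining = set(vocab)  # titles not yet scheduled
    order = []
    while remaining:
        # Fewest new words first; ties broken alphabetically by title.
        title = min(remaining, key=lambda t: (len(vocab[t] - known), t))
        order.append((title, len(vocab[title] - known)))
        known |= vocab[title]
        remaining.remove(title)
    return order
```

A greedy pass like this keeps each step as small as possible but does not guarantee the most even spread overall.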
steverossiter@sbcglobal.net - 16 May 2005 09:08 GMT
Thank you everyone for your help, Steve.