Blog Archives

How do you choose which books to read?

Hello all, I’ve been volunteering at my local library, trying to help them improve their sci-fi collection (their acquisitions have been pretty random in recent years). As one aspect of the project we came up with a short survey, which I hope you will take, about what factors are important when people check out speculative fiction books.

The survey can be found here: https://goo.gl/forms/dFl7IiSmWgyLstQL2

It only takes about 5 minutes and all questions are optional. The survey will stay open until at least the end of May 2018.

P.S. Please note that this is a non-profit project in cooperation with a public library. I’ll release the results in a few weeks and post a link here.

 


Word Frequency Analysis – The Most Common Words

There are any number of reasons why you might need a list of the most common words in the language. In my case, I was working on a piece of software to speed the process of building indexes for my print books. My program reads the book and suggests a list of words that the author might want to include in the index. It needed a list of the most common words so it would know not to bother suggesting them. I’ll post that script in a couple of days. For now, though, I thought I would give you a very simple piece of Python code that reads a directory full of text files, counts how many times each word occurs, and prints a list of those which show up most often. I set it to give me the most common 1000 words. You could generate a list of any length, though, just by changing one number in the code.

If you don’t care to look behind the curtain and just want to cut and paste my word list, feel free to scroll down to the bottom of the post.

For raw data, I used a sample of 37,358 Project Gutenberg texts. PG is kind enough to offer an interface for researchers like me to harvest books. Note that this would work nearly as well with a much smaller sample, but I had already downloaded the books for another project, so I figured I might as well use them. If you use a PG harvest for your data set, make sure to remove the Human Genome Project gene sequence files (a full dump contains at least three copies of the full human genome). Otherwise, this script will come to grief trying to count each gene as a word.
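If you want to automate that cleanup, here is one possible pre-filter, sketched in a few lines of Python. It just skips any file over an arbitrary size cutoff (the genome dumps are vastly larger than any normal book); the 20 MB limit and the function name are my own guesses, so adjust to taste.

import os

SIZE_LIMIT = 20 * 1024 * 1024   #arbitrary cutoff; ordinary books are far smaller

def looks_like_a_book(path):
    '''Return True if the file is a plausible book-sized text file,
       False if it is suspiciously huge (e.g. a genome dump).'''
    return path.endswith('.txt') and os.path.getsize(path) < SIZE_LIMIT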

Note that, as currently written, this script requires GNU Aspell and a system that works correctly with pipes. This means it should run fine on nearly any Unix-like system, but you Windoze people are on your own.

The first part of the script loads a few standard modules. Then it gets a listing of the current directory and starts looping through each text file in it. With each iteration it prints a status message with the file name and percent completion. With scripts like this that take a day or two to run, I like to be able to see at a glance how far along I am. As an aside, if you access your computer through a terminal like I do, you will probably want to use GNU Screen or a similar utility to protect yourself from accidental disconnects while the script is still running.

#! /usr/bin/env python

'''Word frequency analyzer'''

import os, stat, string, subprocess

wordcounts={}

filelist = os.listdir('.')

counter = 0
for f in filelist:
    
    counter += 1
    
    if os.path.splitext(f)[1] == '.txt':
        print f+'\t', str(counter)+' of '+str(len(filelist))+'\t',
        print str((float(counter)/float(len(filelist)))*100)[:6]+'%'

The next portion opens each book file and reads it in. Next, because I’m using PG books as a data set, I need to strip off all of the boilerplate license text which occurs at the beginning and end of the files. Otherwise, because similar text appears in every file, it will skew the word distributions. Luckily, PG marks the actual text of the book by bracketing it in the words “START OF THIS PROJECT GUTENBERG EBOOK” and “END OF THIS PROJECT GUTENBERG EBOOK”. The front part is easy: we just do a string find to get the location of the first line-feed character after the start text appears. The end part is a little trickier; the easiest way to get it is to reverse the whole book. This means, however, that we also need to flip the search text. Pretty neato, huh?

        with open(f, "rb") as infile:  book=infile.read()

        #try to determine if this is a Gutenberg ebook.  If so, attempt
        #to strip off the PG boilerplate
        if "PROJECT GUTENBERG EBOOK" in book:

            a = book.find("START OF THIS PROJECT GUTENBERG EBOOK")
            if a != -1:
                b = book.find('\n', a)
            c = list(book); c.reverse()
            book = string.join(c, '')

            d = book.find('KOOBE GREBNETUG TCEJORP SIHT FO DNE')
            if d != -1:
                e = book.find('\n', d)
            c = list(book); c.reverse()
            book = string.join(c, '')
            book = book[b:len(book)-e]

The next step is to check the book text for words that aren’t in the dictionary, simply because there is no reason to count words that aren’t part of Standard English. The easiest way to do this on a Linux system like mine is to run the system’s spellcheck, Aspell, on the file. We also want to eliminate duplicate words from this list, since it will save iterations later.

        #see which words aren't in the dictionary
        oddwords = subprocess.check_output(
                    "cat "+f+" | aspell list", shell=True).split()

        #find unique words
        u_oddwords = []
        for w in oddwords:
            if w not in u_oddwords: u_oddwords.append(w)

Next, we go through the book text and strip out most of the punctuation. The string containing the punctuation to be removed looks a lot like the string you get by calling string.punctuation. Note, though, that I left in the “‘” and “-” characters because they are actually a part of contractions and compound words, respectively. I also split the book text, which at this point is one big string, into a list of words and capitalize them.

        #strip out most of the punctuation
        book2=''
        for i in range(len(book)):
            if book[i] not in '!"#$%&()*+,./:;<=>?@[\]^_`{|}~':  
                book2=book2+str(book[i])
                
        book=str(book2).capitalize().split()

In the final segment of the script we count how many times the words occur and update the counters, which are kept as a dictionary object. Then we convert the dictionary to a list, sort it, and print the 1000 most common words to a CSV data file. If you need a different number of words, just change the 1000 to another value.

        for w in book:  
            if w not in u_oddwords:
                if w not in wordcounts:
                    wordcounts[w] = 1
                else:
                    wordcounts[w] += 1
                    

final_list = []
for w in wordcounts:
    final_list.append([wordcounts[w], w])

final_list.sort()
final_list.reverse()

                    
with open('wordcounts_pg', 'w') as wc_output:
    
    for i in range(min(1000, len(final_list)-1)):
        wc_output.write(final_list[i][1]+', '+str(final_list[i][0])+'\n')
        

That’s all there is to it. Pretty easy, huh? Now set it to run, detach the terminal, and ignore it until this time tomorrow. My machine can count words in about 1500 books per hour, so it takes about 25 hours to make it through the full sample.

And now, finally, here is the list of words. Feel free to cut and paste it to use for your own projects:

Word Occurrences
the 149164503
of 81154540
and 73797877
to 60771291
a 47925287
in 41773446
that 26590286
was 24584688
he 24462836
i 24025629
it 22795878
his 20173668
is 18378165
with 18081192
as 17645451
for 17473870
had 14408612
you 13939609
be 13252982
on 13207285
not 13181744
at 13015022
but 12718486
by 12438046
her 11878371
which 10826405
this 10263128
have 10196168
from 10088968
she 9778689
they 9715080
all 8819085
him 8771048
were 8314601
or 8143254
are 7787136
my 7572900
we 7412199
one 7373621
so 7203582
their 7018823
an 6518028
me 6419080
there 6267776
no 6185033
said 5938853
when 5899530
who 5878132
them 5808758
been 5787319
would 5689624
if 5655080
will 5166315
what 4895509
out 4556168
more 4440752
up 4416055
then 4222409
into 4129481
has 4000893
some 3929663
do 3914008
could 3749041
now 3747314
very 3630489
time 3571298
man 3559452
its 3544086
your 3522411
our 3517346
than 3494543
about 3349698
upon 3337366
other 3316391
only 3285019
any 3236410
little 3183383
like 2993385
these 2979508
two 2943507
may 2934056
did 2915540
after 2853393
see 2852408
made 2842273
great 2839852
before 2774768
can 2746279
such 2734113
should 2708032
over 2672597
us 2651042
first 2553483
well 2517899
must 2484839
mr 2465607
down 2433044
much 2428947
good 2376889
know 2372135
where 2353232
old 2291164
men 2286995
how 2261780
come 2217201
most 2188746
never 2160804
those 2135489
here 2122731
day 2071427
came 2061124
way 2042813
own 2037103
go 2009804
life 2007769
long 1992150
through 1989883
many 1982797
being 1976737
himself 1941387
even 1915129
shall 1890432
back 1865988
make 1852069
again 1848115
every 1845835
say 1817170
too 1810172
might 1807261
without 1781441
while 1759890
same 1701541
am 1696903
new 1687809
think 1665563
just 1660367
under 1649489
still 1643537
last 1616539
take 1614771
went 1595714
people 1593685
away 1582685
found 1574065
yet 1563963
thought 1556184
place 1543300
hand 1500131
though 1481938
small 1478723
eyes 1469270
also 1467931
house 1438223
years 1435529
1433313
another 1415606
don’t 1381480
young 1379348
three 1378462
once 1377940
off 1376942
work 1375035
right 1360201
get 1345597
nothing 1344419
against 1325938
left 1289397
ever 1269433
part 1261573
let 1260289
each 1258840
give 1258179
head 1254870
face 1253762
god 1249406
0 1239969
between 1225531
world 1219519
few 1213621
put 1200519
saw 1190392
things 1188437
took 1172602
letter 1167755
tell 1160034
because 1155609
far 1154860
always 1152942
night 1152416
mrs 1137055
love 1121812
both 1111644
sir 1100855
why 1097538
look 1095059
having 1069812
mind 1067461
father 1062643
called 1062190
side 1053255
looked 1051044
home 1036554
find 1036485
going 1034663
whole 1033731
seemed 1031466
however 1027701
country 1026854
got 1024945
thing 1022424
name 1020634
among 1019175
seen 1012779
heart 1011155
told 1004061
done 1000189
king 995498
water 994392
asked 993082
heard 983747
soon 982546
whom 979785
better 978434
something 957812
knew 956448
lord 956398
course 953585
end 947889
days 929530
moment 926478
enough 925144
almost 916006
general 903316
quite 902582
until 902333
thus 900738
hands 899106
nor 876106
light 869941
room 869532
since 864596
woman 864072
words 858824
gave 857475
b 853639
mother 852308
set 851757
white 850183
taken 848343
given 838078
large 835292
best 833941
brought 833270
does 826725
next 823345
whose 821731
state 820812
yes 817047
oh 815302
door 804702
turned 804433
others 800845
poor 800544
power 797133
present 792424
want 791194
perhaps 789201
death 788617
morning 786748
la 783512
rather 775384
word 774340
miss 771733
less 770410
during 763957
began 762442
themselves 762418
felt 757580
half 752587
lady 742708
full 742062
voice 740567
cannot 738450
feet 737299
order 736997
near 736832
true 735006
1 730887
it’s 727886
matter 726818
stood 725802
together 725703
year 723517
used 723293
war 720950
till 720824
use 719314
thou 714663
son 714275
high 713720
round 710093
above 709745
certain 703716
often 698006
kind 696975
indeed 696469
i’m 690646
along 688169
case 688098
fact 687334
myself 684387
children 683334
anything 682888
four 677704
dear 676320
keep 675722
nature 674055
known 671288
point 668710
p 668356
friend 666493
says 666011
passed 665792
within 665633
land 663605
sent 662540
church 659035
believe 656459
girl 652783
city 650397
times 649022
form 647388
herself 646989
therefore 644835
hundred 640059
john 639007
wife 636379
fire 632762
several 632704
body 630129
sure 629252
money 629251
means 627640
air 626921
open 626306
held 625660
second 622526
gone 614808
already 613870
least 609236
alone 606078
hope 602206
thy 599253
chapter 597339
whether 596307
boy 596048
english 594784
itself 591838
2 591413
women 589579
hear 587189
cried 586705
leave 586112
either 581618
number 576685
rest 575648
child 574531
behind 572007
read 571445
lay 571286
black 569530
government 567320
friends 567282
became 564384
around 559161
river 556286
sea 552753
ground 550622
help 549284
c 548349
i’ll 546929
short 546465
question 545629
reason 545464
become 544896
call 544576
replied 544286
town 543694
family 542309
england 542109
lost 537241
speak 537188
answered 536154
five 535088
coming 534713
possible 534639
making 530530
hour 530471
dead 529575
really 528631
looking 528622
law 528248
captain 525928
different 522269
manner 519256
business 516115
states 511757
earth 511042
st 510820
human 510666
early 508769
sometimes 507383
spirit 506297
care 505984
sat 505109
public 504862
close 503948
towards 503262
kept 502051
french 501813
party 500749
truth 500365
line 498822
strong 498492
book 496520
able 494330
later 494101
return 492237
hard 490701
mean 489853
feel 487798
story 486538
m 485841
received 483744
following 481558
fell 480591
wish 480562
person 480508
beautiful 479656
seems 477423
dark 476293
history 475744
followed 474307
subject 473058
thousand 470929
ten 469675
returned 469387
thee 467513
age 466838
turn 466674
fine 466630
across 466545
show 465685
arms 465504
character 464946
live 464642
soul 463939
met 463300
evening 463176
die 462851
common 459553
ready 457764
suddenly 456627
doubt 455415
bring 453346
ii 453190
red 450793
free 447675
that’s 445572
account 444530
cause 444403
necessary 444147
can’t 443812
need 443326
answer 442440
miles 441924
carried 438793
although 438423
fear 437796
hold 437493
interest 437382
force 436993
illustration 436577
sight 435854
act 435269
master 433105
ask 432510
idea 432424
ye 432036
sense 430693
an’ 430321
art 430226
position 429722
rose 428624
3 427441
company 427142
road 425669
further 425131
nearly 424118
table 424064
everything 423740
brother 423088
sort 422809
south 421800
reached 420190
london 418755
six 418131
didn’t 416216
cut 412716
taking 412571
continued 411607
understand 411326
appeared 409564
sun 407584
none 407168
else 406851
big 406799
o 406388
longer 406382
deep 406170
army 405897
beyond 405580
view 404378
strange 400814
natural 400483
talk 399814
north 398556
suppose 396693
court 396267
service 393925
bed 393878
past 393609
ought 393331
street 392970
cold 391836
hours 391460
toward 390231
added 389818
spoke 389420
seem 388757
neither 388355
late 388105
probably 387568
real 386926
clear 385649
chief 385350
run 385269
certainly 385179
est 384982
united 384930
stand 384385
forward 384028
front 383866
purpose 382457
sound 382443
feeling 382032
eye 380164
happy 378251
i’ve 377633
except 374853
knowledge 374155
blood 373563
low 373268
remember 373173
pretty 372548
change 372221
living 371264
american 369773
bad 369425
horse 369396
peace 369168
meet 366864
effect 365907
boys 364460
en 364172
school 362681
comes 362575
france 360771
fair 359826
forth 359249
died 359161
fall 358176
placed 357047
note 354944
led 354740
saying 354703
length 354502
pass 353234
gold 350268
entered 349397
doing 348304
latter 347844
written 347699
laid 346808
4 344382
according 343990
daughter 343682
opened 343526
dr 340867
trees 339826
distance 339817
office 339771
attention 339722
hair 337835
n 337111
prince 335635
wild 335514
wanted 335167
society 335139
husband 332251
play 331807
wind 330079
green 329633
greater 329453
tried 328784
west 328702
important 327851
ago 327793
bear 325469
various 325246
especially 324511
mine 321967
paper 320046
island 320002
glad 319989
makes 319717
instead 319188
faith 318882
lived 318731
pay 318090
heaven 316878
ran 315958
s 315761
blue 315697
minutes 315172
duty 315065
foot 314708
ship 314700
fellow 314523
letters 313624
persons 311105
action 310840
below 309831
heavy 309808
york 309749
strength 308836
pleasure 307965
immediately 307823
remained 307750
save 306991
standing 306911
whatever 306070
won’t 305381
trouble 305338
e 305293
window 305257
object 305202
try 304928
parts 304007
period 303992
desire 303985
beauty 303513
opinion 303459
arm 303347
system 302641
third 302389
chance 301890
books 301331
george 300975
doctor 300779
british 300353
silence 300238
he’s 300053
enemy 298899
hardly 298533
5 296045
greek 295622
exclaimed 294602
send 293592
food 293239
happened 293092
lips 292334
sleep 291632
influence 290698
slowly 290590
works 289252
months 288930
generally 288629
gentleman 287966
beginning 287473
tree 287341
boat 286781
mouth 285685
there’s 285569
sweet 285425
drew 284944
deal 284389
v 284339
future 284186
queen 284002
yourself 283364
condition 283335
figure 283153
single 283016
smile 282793
places 282793
besides 281838
girls 281703
rich 281130
afterwards 281017
battle 280676
thinking 280651
footnote 280245
presence 279893
stone 279829
appearance 279691
follow 279498
iii 279239
started 278072
caught 277993
ancient 277595
filled 277238
walked 276882
impossible 276720
broken 276365
former 276016
century 275990
march 275880
274800
field 274479
horses 274255
stay 274139
twenty 273187
sister 272290
getting 271641
william 270478
knows 269506
afraid 269150
result 268749
seeing 268724
you’re 268500
hall 267020
carry 266780
arrived 266706
easy 266309
lines 265956
wrote 265929
east 265852
top 265242
wall 264942
merely 264898
giving 264484
raised 264154
appear 264015
simple 263923
thoughts 263760
struck 263694
moved 263492
mary 263463
direction 263444
christ 263262
wood 263260
born 263084
quickly 262966
paris 262393
man’s 262105
visit 261882
outside 260418
holy 260348
entirely 259045
somewhat 259020
week 258960
laughed 258562
secret 258198
village 257758
henry 257557
christian 257504
danger 257486
wait 257012
wonder 256770
learned 256420
stopped 256191
tom 256117
covered 256117
6 255876
bright 255349
walk 255090
leaving 254851
experience 254763
unto 254610
particular 254564
loved 254479
usual 254307
plain 253867
to-day 253804
seven 253567
wrong 253172
easily 252954
occasion 252780
formed 252707
ah 252144
uncle 252120
quiet 252035
write 251743
scene 251380
evil 250993
married 250965
please 250781
fresh 250507
camp 249947
german 248539
beside 248522
mere 248276
fight 247957
showed 247904
grew 247866
expression 247804
scarcely 247641
board 247578
command 247398
language 247302
considered 247260
regard 247101
hill 246854
finally 246533
national 246452
paid 246364
joy 246060
worth 245352
piece 244733
religion 244677
perfect 244671
royal 244615
tears 244448
president 244135
value 244084
dinner 243572
spring 242721
produced 242576
middle 242282
charles 242134
brown 241885
expected 241668
lower 241299
circumstances 241150
remain 241102
wide 240773
political 240686
charge 240464
success 240254
per 240083
officers 239806
hath 239618
indian 239572
observed 239548
lives 239448
respect 238787
greatest 238784
w 238776
cases 238527
tone 238005
america 237215
youth 236992
summer 236698
garden 236552
music 236354
waiting 236223
due 236178
modern 235763
jack 235557
unless 235428
study 235093
allowed 234852
leaves 234652
bit 233774
race 233156
military 232907
news 232435
meant 232274
afternoon 232063
winter 231867
picture 231735
houses 231575
goes 231281
sudden 230675
proper 230476
justice 230410
difficult 229784
changed 229658
grace 229281
chair 228931
10 228875
private 228392
eight 228222
hot 227873
reach 226608
silent 226552
‘i 226540
flowers 226379
laws 226197
noble 225931
watch 225328
floor 225326
killed 225020
built 224484
declared 224477
judge 224393
colonel 224303
members 224213
broke 224166
fast 223897
duke 223481
o’ 223293
shot 223105
sit 222222
usually 222162
step 222119
speaking 222101
attempt 221687
marriage 221054
walls 220575
stop 220466
special 220316
religious 220300
discovered 220260
beneath 219894
supposed 219260
james 219013
gives 218988
forms 218743
turning 218692
authority 218686
original 218519
straight 218414
property 218393
page 218233
plan 218185
drawn 217873
personal 217458
l 217130
cry 217022
passing 216926
class 216527
likely 216216
sitting 215841
cross 215821
spot 215719
soldiers 215683
escape 215311
complete 215288
eat 215120
bound 214985
conversation 214895
trying 214332
meeting 213898
determined 213756
simply 213506
shown 213457
bank 213261
shore 212917
running 212509
corner 212507
soft 212163
journey 212007
isn’t 211316
i’d 211132
reply 210852
author 210827
believed 210653
rate 210607
prepared 210558
lead 210548
existence 210220
enter 209851
indians 209589
troops 209398
wished 209068
glass 208986
notice 208859
higher 208770
social 208685
iron 208019
rule 207943
orders 207856
building 207813
madame 207780
mountains 207700
minute 207575
receive 207440
offered 207306
h 206821
names 206725
learn 206618
similar 206437
closed 206419
considerable 206102
lake 206017
wouldn’t 206012
8 205864
pleasant 205487

And here is the complete script:

#! /usr/bin/env python

'''Word frequency analyzer'''

import os, stat, string, subprocess

wordcounts={}

filelist = os.listdir('.')

counter = 0
for f in filelist:
    
    counter += 1
    
    if os.path.splitext(f)[1] == '.txt':
        print f+'\t', str(counter)+' of '+str(len(filelist))+'\t',
        print str((float(counter)/float(len(filelist)))*100)[:6]+'%'
    
        with open(f, "rb") as infile:  book=infile.read()
    
        #try to determine if this is a Gutenberg ebook.  If so, attempt
        #to strip off the PG boilerplate 
        if "PROJECT GUTENBERG EBOOK" in book:
    
            a = book.find("START OF THIS PROJECT GUTENBERG EBOOK")
            if a != -1:
                b = book.find('\n', a)
            c = list(book); c.reverse()
            book = string.join(c, '')
            
            d = book.find('KOOBE GREBNETUG TCEJORP SIHT FO DNE')
            if d != -1:
                e = book.find('\n', d)
            c = list(book); c.reverse()
            book=string.join(c, '')
            book = book[b:len(book)-e]
                
        
        #see which words aren't in the dictionary
        oddwords = subprocess.check_output(
                    "cat "+f+" | aspell list", shell=True).split()

        #find unique words
        u_oddwords = []
        for w in oddwords:
            if w not in u_oddwords: u_oddwords.append(w)
            
        
        #strip out most of the punctuation
        book2=''
        for i in range(len(book)):
            if book[i] not in '!"#$%&()*+,./:;<=>?@[\]^_`{|}~':  
                book2=book2+str(book[i])
                
        book=str(book2).capitalize().split()
        
        for w in book:  
            if w not in u_oddwords:
                if w not in wordcounts:
                    wordcounts[w] = 1
                else:
                    wordcounts[w] += 1
                    

final_list = []
for w in wordcounts:
    final_list.append([wordcounts[w], w])

final_list.sort()
final_list.reverse()

                    
with open('wordcounts_pg', 'w') as wc_output:
    
    for i in range(min(1000, len(final_list)-1)):
        wc_output.write(final_list[i][1]+', '+str(final_list[i][0])+'\n')
        
 

First Preview of my Upcoming Book

Last week I placed a new academic working paper on Academia.edu that roughly parallels Chapter 11 of my upcoming book.  The version in the book will be written at a different reading level and without the math equations, but this is still a pretty good taste of what is coming.

Screenshot of paper from academia.edu

Scholars like to post these preliminary drafts for several reasons.  The most important one for an independent researcher like myself is to receive feedback and suggestions prior to submission.  Another reason is to make findings available to the community sooner.  The average turn-around time to publish a journal article is two or three years and the field may have moved on by the time the paper hits the presses.

I probably don’t need to worry about obsolescence with this particular article, since the events with which it deals happened back in the 1950s and 1960s.  My book will be a study of the role of outside scholars in our society and, in particular, their ability to shape public policy.  Outside Scholars, in my usage, are people who engage in research and knowledge creation without being formally affiliated with the dominant academic community.  This particular article/chapter deals with an outside scholar named Victor Sharrow who devoted his life to arguing for what he saw as the “correct” interpretation of the Fourteenth Amendment.  He was ultimately unsuccessful, but I feel his career provides several intriguing insights as a characteristic outside scholar narrative.

Sharrow saw the Fourteenth Amendment as the key to dismantling the Jim Crow system in the South.  In the months prior to the 1958 election he mounted an intense one-man lobbying campaign to sway Dwight Eisenhower and other politicians to his views.  In my article I examine several of his arguments from a standpoint of modern data science.

Those of you who read my posts on data science and Python programming might be interested in the simulation models I describe in the paper.  I would be happy to send my spreadsheet and code to anyone who is interested.  Just e-mail me or message my Facebook page.

If all goes well, the book should be released in late 2016 or early 2017.

 

Degree of Voting Restriction by State in 1956, as Calculated by the Model Described in my Paper


Book Review: At The End of An Age

Lukacs, At The End of An Age, cover picture

At the End of an Age is a small book, and John Lukacs’ elegant yet simple prose could easily lull you into thinking it is an easy read.  It doesn’t take many pages, though, to realize that every paragraph in this book (or rather, book-length essay) is laden with complex ideas and meaning.  I found myself rereading whole pages to make sure I understood, and I suspect that I would need to read the whole book two or three times to pick up on all of his points.  That being said, the book is worth it.

As I mentioned in my previous post, the ostensible thesis of the book is that the modern age, which Lukacs calls the “bourgeois age,” is nearing its end.  He offers cogent arguments and examples in support and, in general, makes a strong case.  As it happens, I agree with him; I wrote something very similar on this blog a couple weeks ago, before I had ever read Lukacs.  I think that anyone with some level of historical awareness can see that our civilization is gearing up for a drastic change.  Other historians I have read would have spent the entire book (or 12, in the case of Toynbee) expanding on their particular theory.  Lukacs, having laid out his arguments, then moves up to a higher, more meta-historical level.  He is interested not just in how history works, but in the epistemology and metaphysics of history and its relationship to the other sciences.  These are deep waters indeed.  Only Lukacs’s strong voice and skill as a writer keep the reader from sinking.  Since I lack his mastery, I will not attempt to explain his points here, but will merely mention a couple of his main themes.

Lukacs believes that in history, as in quantum physics, the phenomenon is ultimately inseparable from the observer.  The historian does not just record history but, in the act of writing it, actually influences and creates it.  This means that true objectivity is impossible for the historian, and that a purely deterministic conception of history is as obsolete as deterministic physics was after Heisenberg.  This matches up with comments I have occasionally made about history as a narrative.  History is based on fact but, ultimately, is a literary discipline.  The historian doesn’t just tell the story; he creates it.

Another major theme in the book is the role of the human mind in creating history.  Lukacs asserts that “the inclinations of men’s minds” and their beliefs are more important than their competence or any material factor.  “Mind” in this sense means consciousness or soul, separate from brain and body.  Lukacs believes in the power of the mind to influence reality and manifest different potentialities.  Comparative metaphysics is far from my specialty.  However, this sounds very similar to the writings of various New Thought philosophers, particularly Ernest Holmes and his Science of Mind disciples.  I wonder to what extent the young John Lukacs was influenced by these metaphysical systems.  Regardless, the takeaway is that if a historian wants to understand a person or group he needs to go beyond studying their situation and strive to understand their minds.

Overall, I found many ideas in this book which I could agree with, or at least try on for size.  There were a few arguments, however, with which I did take minor issue.  In an early section of the book, as part of an overview of various ways the social structures of the current age are breaking down, he discusses the trend towards women’s equality in the workplace and announces that,

Women thought (or, rather, convinced themselves) that they were reacting against the age-old and often senseless categories and assertions  of male authority; yet their dissatisfaction often arose not because of the oppressive strength but because of the weakness of males.  The rising tide of divorces and abortions, the acceptance of sexual liberties, including pre-marital (and sometimes post-marital) habits of frequent copulation and other forms of cohabitation, the increasing numbers of unmarried women and single mothers, the dropping birth rate–thus the decline of the so-called “nuclear” family–were, especially after 1955, grave symptoms suggesting vast social changes.  They included the perhaps seldom wholly conscious, but more and more evident, tendency of many young women to desire any kind of male companionship, even of a strong and brutal kind, if need be at the cost of their self-respect. (pp. 23-24)

He offers no support for this complex, arguable, and potentially inflammatory claim.  This is not the sort of paragraph you just casually slip into a book without offering evidence to back it up.  This is the sort of thing which would have caused me, when I was still a teaching assistant grading papers, to circle the whole paragraph with red pen and write “BURDEN OF PROOF” in the margin.

Lukacs is also universally deprecatory of post-modernism in all of its forms, seeing it as a basically vague and degenerate direction for scholarship and culture.  That is a legitimate, if somewhat reactionary stance.  However, Lukacs, who escaped communist Hungary as a young man, is also blatantly anti-Marxist.  Since, as a historian, Lukacs could not help but be aware of the many contributions that Marxism has made to post-modern analysis and art, I have to question whether he might not be biased on the whole subject of post-modernism.

Finally, Lukacs is dismissive of any value in mathematics for the study of history.  As a “quant”, I feel compelled to respond.  As evidence, he cites his own non-deterministic, non-objectivist view of history as well as Gödel’s incompleteness theorems, which say that 1) any non-trivial mathematical system contains some postulates which cannot be proven without going beyond the system, and 2) no mathematical system is capable of proving its own consistency.  Personally, I have been fascinated by Gödel’s theorems since I first studied them in an Abstract Algebra class that I took as a college junior.  As an illustration of what they mean, consider Euclid’s geometrical system, as set down in the Elements.  Euclid begins “A point is that which has position but no dimension.”  The entire system doesn’t work without this axiom, yet there is no way to prove that a point has no dimension using only Euclidean geometry.  You would need to introduce propositions from topology and/or calculus–which are themselves systems which contain propositions which cannot be proven without introducing even more complex systems of mathematics.

Kurt Gödel in 1925 [public domain via Wikimedia]


And yet, geometry works quite well enough for most purposes, as do topology and calculus.  Granted, the incompleteness theorems seem to imply that a grand-unified theory of history, in the sense of a closed-form solution (plug all the variables into the equation, predict what will happen next), is impossible.  But applied math and statistics are about approximations, empirical formulas, noisy data, and models that work “well enough”, with a quantifiable margin of error.  The incredible advances over the past fifty years in fields like data mining, complexity theory, machine learning, and signal processing have paved the way for a useful discipline of mathematical history, probably within our own lifetimes.  Such a system will only be one more tool for the historian to use, and the results must not be allowed to dominate the historical narrative itself.  But to dismiss all mathematical history out of hand because it will not be an internally provable system seems like a major error.  Even in a non-deterministic universe, mathematical modeling can still provide startling and useful insights.

Despite these minor qualms, I truly enjoyed this book and would recommend it.  Overall, in fact, it is the kind of book I would like to write myself some day.  I will absolutely be reading (and probably reviewing) more of Lukacs’ works in the future.

Why Reading Level Matters for a Writer

This morning I read a fascinating blog article by Shane Snow in which he used two measures of reading level to rank a large number of books, both fiction and nonfiction.  His main contention was that many of the most successful books, at least in modern times, are comparatively easy to read.  This makes sense; not many people are going to slog through a novel if the reading level is too challenging for them.  He also drew the inference that blog articles with a lower reading level are much more likely to be shared on social media.  Obviously, these insights are of great interest to me as a writer.  Because the article piqued my interest, and because I’m at the point in writing my own book where I am happy to jump at any distraction, I decided to extend his analysis a bit on my own.

It only took a minute or two to find an open source Java app that calculates the Flesch-Kincaid Grade Level and Flesch Reading Ease score of any text or PDF file.  The former gives the number of years of education required to comprehend the writing.  The latter is a similar measure, in which a higher score indicates that the work is easier to read.
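Both measures, incidentally, reduce to simple formulas over word, sentence, and syllable counts.  Here is a minimal Python sketch of the standard formulas; getting accurate counts (especially syllable counts) is the hard part, and I’m assuming those come from somewhere else, which is exactly where apps like the one I used earn their keep.

def flesch_reading_ease(words, sentences, syllables):
    '''Flesch Reading Ease: higher scores mean easier reading.'''
    return 206.835 - 1.015*(float(words)/sentences) - 84.6*(float(syllables)/words)

def flesch_kincaid_grade(words, sentences, syllables):
    '''Flesch-Kincaid Grade Level: rough years of schooling needed.'''
    return 0.39*(float(words)/sentences) + 11.8*(float(syllables)/words) - 15.59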

The first thing I did was to run it on several manuscripts which I have on my laptop.  These included my recently published monograph, the current draft of the nonfiction book I’m writing, and a novel manuscript and three short stories which I am currently trying to sell.  I also ran it on all four of my blogs.

My Own Writing
Nonfiction Flesch-Kincaid Grade Level Flesch Reading Ease Level
Books
Current Book Project(1) 12.91 43.67
Monograph (2) 19.10 1.87
Book Average 16.01 22.77
Blogs
This Blog 10.18 56.95
Handyman Kevin Companion Blog 7.97 70.46
Angry Transportation Rants (Dormant) 7.69 68.43
Old School Essays (Dormant) 8.26 60.16
Blog Average 8.53 64.00
Nonfiction Average 11.02 50.26
Fiction (3)
Novel 4.99 80.60
Short Story 5.03 82.03
Short Story 5.67 75.15
Short Story 5.46 76.62
Fiction Average 5.29 78.60
Overall Average 8.73 61.59
NOTES
(1) First draft, about 6% complete
(2) Body text is nearly identical to my MBA thesis
(3) Unpublished manuscripts from my current “slush pile”

Since raw numbers aren’t that intuitive, I plotted a chart.  Notice how the different pieces of writing cluster quite neatly by type.

Reading level scores of several pieces of my own writing

I was happy to see that both my fiction and my current nonfiction project are in the same zones that Snow found for these types of writing.  This is quite important from a marketability standpoint, since any editor I send them to would be instantly turned off if the reading level were too high or low.

My blogs fall in the middle, which makes sense since they are basically nonfiction, but are written more casually than a nonfiction book.  However, going by Snow’s article, they are probably written at too high a reading level to be shared much.  In fact, I don’t get many shares compared to other bloggers.  I think I can live with that, since I tend to target my blogging towards my fellow writers.  I suspect that you people are comfortable reading at a higher level than the general public.

My monograph, Freight Forwarding Cost Estimation: An Analogy Based Approach, appears to be nearly unreadable to anyone without a graduate degree in operations research.  I suppose that explains why sales haven’t exactly skyrocketed.  It is what it is, though–an adaptation of my master’s thesis.  My committee loved it.

I think there is real benefit to a writer knowing that the reading level of his work is appropriate to the target audience.

Of course, being a Great Books fan, my next move was to run the app on all the Great Books that I have written about so far on this blog, as well as the next few I plan to cover.

Selected Great Books
Flesch-Kincaid Grade Level Flesch Reading Ease Level
Homer
Iliad 4.48 78.58
Odyssey 3.93 80.98
Average 4.21 79.78
Hebrew Bible (1) 7.57 76.51
Aeschylus
House of Atreus 2.23 90.86
Other Plays 3.24 85.18
Average 2.81 87.61
Sophocles 1.86 90.83
Herodotus 11.75 60.37
Euripides
Hippolytus; Bacchae 4.73 83.72
Medea 4.98 81.14
Average 4.81 82.86
Thucydides 13.34 49.67
Aristophanes
Clouds 2.05 86.65
Birds 5.47 74.76
Frogs 2.67 84.57
Average 3.40 81.99
Plato
Apology, Crito & Phaedo 8.03 70.01
Gorgias 9.15 63.57
Meno 8.51 64.49
Phaedrus 10.05 60.85
Protagoras 9.11 64.30
Republic 8.78 65.42
Sophist 9.32 59.56
Symposium 10.36 60.59
Theaetetus 9.54 61.07
Average 8.99 64.53
Aristotle
Ethics 12.21 55.34
Poetics 10.48 54.32
Politics 11.34 56.35
Average 11.34 55.34
Walt Whitman
Leaves of Grass 12.26 58.00
Overall Average 7.49 71.59
NOTES
(1) King James Version

Again, when I plotted the points, they clustered nicely by type.

Reading levels of selected Great Books (as English translations)

These results held a few surprises.  First was the fact that Homer and the Greek dramas are actually written at a very low reading level, at least in terms of sentence and word length.  I believe this is because these works were intended to be recited or performed orally.  Spoken language is always simpler than written language.  Also, these reading level metrics don’t take vocabulary into account.  Epic poetry and Greek drama tend to use a much wider range of words than a novel, for example.  Examining this factor would require some sort of word frequency analysis.  Unfortunately, I didn’t have an “off the shelf” app to conduct a frequency analysis.  I’m sure I could have kludged up a Python script in a couple hours, but that would have been more time than I wanted to spend.
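For what it’s worth, a crude version of that analysis is only a few lines of Python.  This sketch computes the type-token ratio (distinct words divided by total words) of a plain text file as a rough proxy for vocabulary range; note that the ratio is only comparable between texts of similar length, and the file path is whatever you point it at.

import re

def type_token_ratio(path):
    '''Crude vocabulary-richness measure: distinct words / total words.
       Higher values suggest a wider vocabulary.'''
    with open(path) as infile:
        words = re.findall(r"[a-z']+", infile.read().lower())
    return float(len(set(words)))/len(words) if words else 0.0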

Another surprise was that Walt Whitman’s Leaves of Grass, which I would have expected to show up close to Homer’s epics, is actually a much tougher read.  It graphs down closer to the serious Greek philosophical works.  As I’ve stated before, though, Leaves of Grass is a rather unique work.

The biggest surprise, however, is that the Great Books are written at a lower reading level, on average, than my own work.  Granted, these sample sizes are pretty small.  I suspect, however, that I have stumbled upon another of the factors that contribute to a book being Great:  the authors manage to convey complicated ideas in simple, readable language.

So, besides being a good way to check the appropriateness of my manuscripts for the target audience, does any of this have a practical application?  Well, the fact that books cluster by type means that reading level could be a good way to sort them.  It would be quite simple to modify the Java app into a data mining tool to sort a collection of books into categories like fiction, nonfiction, plays, etc.  I can easily see situations where this could be useful for anyone who has a large collection of e-books with incomplete meta-information.  Project Gutenberg and Internet Archive, I’m looking at you.
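As a proof of concept, here is a toy Python version of that sorter, assuming you already have the two readability scores for each book.  The cutoffs are illustrative guesses eyeballed from my charts above, not fitted parameters.

def guess_category(grade_level, reading_ease):
    '''Toy classifier that bins a text by its readability scores.
       The threshold values are rough guesses, not fitted values.'''
    if grade_level < 6.0 and reading_ease > 74.0:
        return 'fiction / drama / epic'
    elif grade_level < 11.0:
        return 'general nonfiction'
    else:
        return 'academic / technical'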

2015 Preview

As the year winds down, I thought I would give you a preview of what’s to come on this blog.

I’ve mentioned in the past how much I enjoy being able to download free books from Project Gutenberg.  Given a choice, though, I do prefer physical books.  This month I received some holiday cash from relatives and I went a bit mad at the used book stores and library friends’ sales.  It’s easy to justify buying classics when you have a blog about the Great Books.

Picture of a pile of used books

In the picture:  A complete set of the Durants’ The Story of Civilization, both volumes of Somervell’s abridgement of Toynbee’s A Study of History, plus Toynbee’s essay collection Civilization on Trial, The Federalist Papers, Herodotus’ The Persian War, Milton’s Paradise Lost and Other Poems, Virgil’s Aeneid, Tolstoy’s Anna Karenina, Thoreau’s Walden and Civil Disobedience, Hawthorne’s The Scarlet Letter, Allan Bloom‘s translation of Plato’s Republic, de Sade’s Justine, Philosophy in the Bedroom and Other Writings, three tragedies of Sophocles, Joyce’s Dubliners, Freud’s Civilization and its Discontents, Fitzgerald’s The Great Gatsby, and Hesse’s Siddhartha.  Not shown is a collection of Aeschylus’ plays that I have momentarily misplaced.

And of course I still have 36 more books of the Hebrew Bible to cover.

I plan to read and blog about all of these over the next two years.  I must offer the caveat, however, that as I get closer to my own book deadline my blogging projects may be pushed to the back burner.

In regards to my next book, I am still in a very early phase of the writing and I don’t want to give many details.  I will say, however, that it deals mainly with modern American history.  I can also confidently promise you it will be a much more enjoyable read than my last book–though I suppose that is faint praise, considering that my last book started life as an operations management thesis.  Speaking of the operations/data science end of things, I’m sure I’ll be writing a few more of those posts too.  I also have some ideas for a couple of posts about the business side of writing which some of you should find interesting, since most of you are writers like me.

I’d like to wish a hearty “thank you” to everyone who has read and subscribed to my blog in 2014.  See you next year!

Why You Can’t Make a Living by Blogging

How much can you actually make from blogging?  The question is of more than casual interest to me and to every other blogger out there.  At some point we’ve all wondered how many articles we need to write to get rich, or at least to pay our internet bill.

A couple of years ago Nate Silver, blogging for the New York Times, did a rather enlightening analysis of the Huffington Post’s blog business.  He concluded that HuffPost makes about $13 per blog article.  They don’t pay their bloggers, but if they did, they would clearly be paying them less than $13.

Go on Fiverr or similar freelancer marketplaces and you will find any number of people offering to “write a 500 word blog article on any subject” for $5. I have no idea how many of these gigs they actually book.

What about the rest of us slightly more casual bloggers?  Many of us write on Blogger or something similar and monetize through Google Adsense.  What sort of revenue can we expect?

I don’t have access to other bloggers’ data.  I do, however, use Blogger for three of my own blogs.  I started the oldest in 2008 and have posted sporadically ever since.  I decided to see what insights I could glean from my own data.

It wasn’t hard to throw the numbers into a spreadsheet and draw a histogram:

Histogram of actual blog hits

This distribution might be a little misleading, though.  After all, some of these articles are eight years old, while others were posted this week.  Since blog articles stay on the web forever, the older ones will tend to have more lifetime hits, and I needed to correct for this.

Blogger’s dashboard doesn’t give week-by-week histories for individual articles, but I was able to model an article’s hits over time by assuming that it gets 50% of its lifetime visits the first year, 50% of its remaining visits the next year, and so on forever.  If you took calculus, you will probably recognize this as an infinite series.  Being a basically lazy person, I avoided doing the math and simply built a spreadsheet to work backwards.  (I won’t go into details; it involves data tables and lookup functions.)  The new distribution, of estimated lifetime hits for all my blog articles, is:
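If you would rather skip the spreadsheet gymnastics, the model collapses to a one-line formula: after n years an article has collected a fraction 1 - 0.5^n of its lifetime hits, so the lifetime total is just the observed count divided by that fraction.  A minimal sketch (the function name is my own):

def estimated_lifetime_hits(observed_hits, age_years):
    '''Under the 50%-per-year model, an article that is age_years old
       (at least one year) has already received a fraction
       (1 - 0.5**age_years) of its lifetime hits.'''
    return observed_hits / (1.0 - 0.5**age_years)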

Histogram of lifetime blog hits

By now you will have noticed one of the sad truths about blogging: for every article that gets a respectable number of hits, you write several that hardly anyone reads.

I wanted to come up with an expected number of hits per article.  Since this was a small sample size with an irregular distribution, the best way to handle it was with a simple simulation (statistics nerds would call it a bootstrap).  Returning to my spreadsheet I sampled my distribution 10,000 times.  This allowed me to estimate the expected number of lifetime hits for an article as 1,271, with a 95% confidence interval from 1,192 to 1,350.
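For anyone who prefers code to spreadsheets, the same bootstrap takes only a few lines of Python.  This is a sketch rather than the spreadsheet I actually used; data would be the list of estimated lifetime hit counts.

import random

def bootstrap_mean(data, n_trials=10000):
    '''Bootstrap estimate of the mean with a 95% confidence interval.
       Resamples the data with replacement n_trials times.'''
    means = []
    for _ in range(n_trials):
        sample = [random.choice(data) for _ in data]
        means.append(sum(sample)/float(len(sample)))
    means.sort()
    lower = means[int(0.025*n_trials)]
    upper = means[int(0.975*n_trials)]
    return sum(means)/len(means), lower, upper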

According to Adsense, my lifetime RPM (revenue per 1,000 impressions) is $0.96.  I’ve talked with other bloggers, and this seems pretty typical.  By simple multiplication, my expected revenue for a blog article is about $1.22.

One hears stories about people who can bang out five articles a day, every day.  I am not one of those people; I doubt many bloggers are.  When I don’t have any other writing projects, I might be able to manage five a week.  If I did this all year long, I would make about $317.31 from selling ads.  If I sold all of my articles on Fiverr, I could rake in $1,300.  Even if I made as much per article as the Huffington Post, that would still only be $3,380.   Better not quit my day job.  Wait, it’s too late for that.

I think that most bloggers out there are more like me than not, which means that none of us are going to be able to support ourselves from blogging alone.

So Why Do it at All?

The blogging itself doesn’t pay, but it can still make economic sense to blog.  One of the main reasons is to build a writing portfolio that will help you get actual, paying freelance work, or maybe even a regular column.  People have managed it.

Then there are the merchandising opportunities:  You could sell swag like t-shirts and stickers.  Your gross revenue on one bumper sticker is probably bigger than on 1,000 advertising hits.  Or you could try crowdfunding.  Your blog followers are the natural people to hit up for a contribution to your next Kickstarter campaign.

A huge reason for nonfiction writers like me to blog is the chance to post and get feedback on material that will later go in a book.  One of my newer blogs was actually designed from the start to be the first draft of a DIY handbook.  As soon as I hit 150,000 words I’m going to download the whole thing and start arranging it into chapters.

The Lesson

You will never make enough from blogging alone to make a financial difference. However, as a writer, blogging might fit into your larger career plan, or help you generate revenue from other sources.

This article was published simultaneously on LinkedIn.

Easy Double Exponential Smoothing in Python

I realized this morning that it has been a while since I posted any Python code. I’ve been a bit busy with Handyman Kevin and haven’t been doing much data science. Still, I decided it was time to carve out a couple hours this morning to practice my skills. The result is the set of functions below, which perform basic double exponential smoothing using the Holt-Winters method. I deliberately avoided using NumPy, SciPy, or any other libraries. It isn’t that I dislike NumPy/SciPy (far from it), but you can’t always get sysadmins to install extra libraries on the machines you’re using, especially if you are a guerrilla data scientist like me.

There are a lot of different time series methods out there, and they all have their points. Holt-Winters is the one that I keep coming back to, though. One of the reasons is simplicity–I can always remember it and bang it into a spreadsheet without needing to Google anything or download libraries. About the 40th time I typed it into a spreadsheet, though, it occurred to me that it would be smart to implement it in Python so I could save some typing.

The first function, MAPE, simply calculates the mean absolute percentage error (MAPE) of a list of estimated values, as compared to a list of actual values.

The next function, holtwinters, uses Holt-Winters to predict the next three values in a time series. You need to supply two smoothing coefficients, alpha and beta, for the level and trend, respectively. Typically, you would have a pretty good idea what these were from doing similar forecasts in the past.
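For reference, the textbook form of the double exponential smoothing updates, with level l, trend b, observation y, and a forecast h steps ahead, is shown below.  My code uses a slightly different variant that smooths the raw observations directly, but alpha and beta play the same roles.

\ell_t = \alpha y_t + (1-\alpha)(\ell_{t-1} + b_{t-1})
b_t = \beta(\ell_t - \ell_{t-1}) + (1-\beta) b_{t-1}
\hat{y}_{t+h} = \ell_t + h b_t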

If you don’t know the coefficients then use the third function, holtwinters_auto, to automatically determine them. This function uses a grid search. Those of you who have read my monograph probably remember that I’m not usually wild about grid searches. In this case it makes sense, though, since you don’t usually need more than a few digits of precision on the coefficients.


def MAPE(actual, estimate):
    '''Given two lists, one of actual values and one of estimated values, 
        computes the Mean Absolute Percentage Error'''
        
    if len(actual) != len(estimate):
        print "ERROR: Lists not the same length."
        return []
        
    pcterrors = []
    
    for i in range(len(estimate)):
        pcterrors.append(abs(estimate[i]-actual[i])/actual[i])
    
    return sum(pcterrors)/len(pcterrors)
def holtwinters(ts, *args):
    '''Uses the Holt-Winters exp. smoothing method to forecast the next
       three points in a time series.  The second two arguments are 
       smoothing coefficients, alpha and beta.  If no coefficients are given,
       both are assumed to be 0.5.
       '''
       
    if len(args) >= 1:
        alpha = args[0]
       
    else:
        alpha = .5
        findcoeff = True
    
    if len(args) >= 2:
        beta = args[1]
    else:
        beta = .5
            
    if len(ts) < 3:
        print "ERROR: At least three points are required for TS forecast."
        return 0
    
    est = []    #estimated value (level)
    trend = []  #estimated trend
    
    '''For first value, assume trend and level are both 0.'''
    est.append(0)
    trend.append(0)
    
    '''For second value, assume trend still 0 and level same as first          
        actual value'''
    est.append(ts[0])
    trend.append(0)
    
    '''Now roll on for the rest of the values'''
    for i in range(len(ts)-2):
        trend.append(beta*(ts[i+1]-ts[i])+(1-beta)*trend[i+1])
        est.append(alpha*ts[i+1]+(1-alpha)*est[i+1]+trend[i+2])
        
    
    '''now back-cast for the first three values that we fudged'''
    est.reverse()
    trend.reverse()
    ts.reverse()
    
    for i in range(len(ts)-3, len(ts)):
        trend[i] = beta*(ts[i-1]-ts[i-2])+(1-beta)*(trend[i-1])
        est[i] = alpha*ts[i-1]+(1-alpha)*est[i-1]+trend[i]
    
       
    est.reverse()
    trend.reverse()
    ts.reverse()
    
    '''and do one last forward pass to smooth everything out'''
    for i in range(2, len(ts)):
        trend[i] = beta*(ts[i-1]-ts[i-2])+(1-beta)*(trend[i-1])
        est[i]= alpha*ts[i-1]+(1-alpha)*est[i-1]+trend[i]
        
    
    '''Holt-Winters method is only good for about 3 periods out'''
    next3 = [alpha*ts[-1]+(1-alpha)*(est[-1])+beta*(ts[-1]-ts[-2])+
             (1-beta)*trend[-1]]
    next3.append(next3[0]+trend[-1])
    next3.append(next3[1]+trend[-1])
    
    return next3, MAPE(ts,est)
def holtwinters_auto(ts, *args):
    '''Calls the holtwinters function, but automatically determines the
    alpha and beta coefficients which minimize the error.
    
    The optional argument is the number of digits of precision you need
    for the coefficients.  The default is 4, which is plenty for most real
    life forecasting applications.
    '''
    
    if len(args) > 0:
        digits = args[0]
    else:
        digits = 4
    
    '''Perform an iterative grid search to find minimum MAPE'''
    
    alpha = .5
    beta = .5
    
    for d in range(1,digits):
        grid = []
        for b in [x * .1**d+beta for x in range(-5,6)]:
            for a in [x * .1**d+alpha for x in range(-5,6)]:
                grid.append(holtwinters(ts, a, b)[-1])
                if grid[-1]==min(grid):
                    alpha = a
                    beta = b
            
    next3, mape = holtwinters(ts, alpha, beta)
        
    return(next3, mape, alpha, beta)
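
Here is a quick usage sketch, with made-up numbers, just to show how the pieces fit together:

if __name__ == '__main__':
    #a made-up monthly series, purely for illustration
    series = [112, 118, 132, 129, 121, 135, 148, 148, 136, 119]

    next3, mape, alpha, beta = holtwinters_auto(series)
    print "Next three periods:", next3
    print "MAPE:", mape, "alpha:", alpha, "beta:", beta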

My Monograph Goes “Live” Today

Hello all,

At the risk of shameless self-promotion, I think I should mention that my first monograph was published today.  It will take a while to work its way through the distribution channels, but it should be available at all major booksellers in a few weeks.  In the words of the press release:

FOR IMMEDIATE RELEASE
CONTACT: Kevin Straight, 562-587-7700
Email: kstra002@ucr.edu

AGSM Alumnus Publishes Monograph

Saarbrücken, Germany, September 3, 2014

AGSM Alumnus Kevin Straight’s début non-fiction book, Freight Forwarding Cost Estimation: an Analogy Based Approach (ISBN 978-3-659-58859-4) was released today by Lambert Academic Press. It should be available from all major booksellers within six weeks.

The book is based on Mr. Straight’s master’s thesis, which was the result of three months of field research in Dublin, Ireland during the summer of 2013. Straight studied thousands of international shipping records and used them to create a cost estimation system using modern machine learning and data mining techniques. Straight demonstrates, via a dynamic computer simulation, that his method is both accurate and more cost effective than traditional estimation techniques.

Mr Straight currently lives in Montrose, California, where he is doing research on how to adapt his estimation methodology to health care applications, such as blood sugar levels in type II diabetes patients.

High resolution images are available at: http://www.kevinastraight.com/freight-forwarding-cost-estimation-images