Dallas R Users Group Message Board › An R programming question


Scott
user 4819328
Frisco, TX
Post #: 6
I have a peculiar question that I think may have a simple programming solution in R.

I have a vector with thousands of observations of budgetary change (the substance does not matter much). Theory suggests that the distribution of this variable should be leptokurtic -- which it is. That being the case, if you present the vector as a density plot or histogram alongside a normal distribution with the same mean and variance, there are four cross-over points.

Short version -- I would like to analytically identify the cross-over points.

A fully analytical solution would require fitting a probability distribution that nests both the normal and a leptokurtic alternative. I thought of a cheat, though, that will be good enough for my purposes.

I would like to bin the empirical distribution as in a histogram. For each bin (say, values equal to .03), I would have the count of observations in that bin (e.g., 2,354 observations with value .03). I could then generate a random normal sample with the same mean, variance, and size. Finally, I could subtract the bin counts of the normal sample from the bin counts of the empirical distribution. This will show me the ranges over which one distribution consistently produces more probability density than the other (though I will have to figure out a way to deal non-arbitrarily with zones that switch back and forth -- but that is a problem for another day).
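That subtraction can be sketched directly in base R. This is a minimal sketch, not the author's actual method: rt() stands in for the empirical leptokurtic vector, and all object names are illustrative.

```r
set.seed(42)
emp <- rt(10000, df = 3)  # stand-in for the empirical (leptokurtic) vector

# Matched normal sample: same mean, variance, and size as the empirical data
sim <- rnorm(length(emp), mean = mean(emp), sd = sd(emp))

# Common breaks spanning both samples, giving 400 bins
breaks <- seq(min(c(emp, sim)), max(c(emp, sim)), length.out = 401)

emp_counts <- hist(emp, breaks = breaks, plot = FALSE)$counts
sim_counts <- hist(sim, breaks = breaks, plot = FALSE)$counts

# Positive where the empirical distribution has more mass than the normal
diffs <- emp_counts - sim_counts

# Rough candidate cross-over points: bins where the sign of the difference flips
crossings <- which(diff(sign(diffs)) != 0)
```

Using the same breaks vector for both samples is what makes the bin-by-bin subtraction meaningful; plot = FALSE suppresses the plots and just returns the counts.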

Any ideas on how to bin values of a variable and then count how many are in that bin for a large number of bins (say, 300-500 bins)?

This strikes me as a problem that may be solvable with some programming tricks like loops - but I have vanishingly little programming experience and thought I would throw the question out to the group.

Thank you for any help you can offer,

Scott

PS If you want to see the problem this is intended to address, the citation below describes the arbitrary method I have used in the past to identify cross-over points:

@article{robinson2007explaining,
title={Explaining policy punctuations: Bureaucratization and budget change},
author={Robinson, S.E. and Caver, F. and Meier, K.J. and O'Toole Jr, L.J.},
journal={American Journal of Political Science},
volume={51},
number={1},
pages={140--150},
year={2007},
publisher={Wiley Online Library}
}
Larry
user 14060621
Group Organizer
Richardson, TX
Post #: 61
To me this sounds like a job for the cut() function. You can set the breaks in cut() to any values you want.
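A minimal sketch of that approach, paired with table() to tally each bin (object names are illustrative):

```r
set.seed(1)
data <- rnorm(1000, mean = 0, sd = 1)

# cut() assigns each value to a bin defined by the breaks
bins <- cut(data, breaks = seq(-5, 5, by = 0.2))

# table() counts the observations in each bin, named by interval
counts <- table(bins)
```

Empty bins still appear in the table (with a count of zero), because cut() creates a factor level for every interval.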
Eldon
user 47387632
Austin, TX
Post #: 2
Alternatively, you could use the hist() function, which is supposed to be more efficient and use less memory according to the R documentation. You can specify your breaks in much the same way you would for cut(). Here is an example:

data <- rnorm(1000, mean = 0, sd = 1)        # generate your data
your_breaks <- seq(-4, 4, by = 0.2)          # create a vector that specifies your breaks
results <- hist(data, breaks = your_breaks)  # use hist() and specify your breaks
results$breaks                               # take a look at your breaks
results$counts                               # see how many you have in each bin

Larry
user 14060621
Group Organizer
Richardson, TX
Post #: 62
I didn't know that hist() was quicker. Here are some CPU time results:

> data <- rnorm(400000, mean = 0, sd = 1)  # generate your data
> your_breaks <- seq(-6, 6, by = 0.2)      # create a vector that specifies your breaks
> system.time(hist(data, breaks = your_breaks))
   user  system elapsed
   0.07    0.03    0.10
> system.time(cut(data, breaks = your_breaks))
   user  system elapsed
   0.17    0.00    0.17

Also, hist() returns an object. cut() only returns a factor (a vector of bin labels), which you'll have to tabulate yourself later, e.g. with table().
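For what it's worth, the two routes give the same counts once cut() is told to include the lowest break, which hist() does by default -- a quick check, with illustrative names:

```r
set.seed(7)
x <- rnorm(5000)
br <- seq(-6, 6, by = 0.2)

# hist() counts per bin, without drawing the plot
h_counts <- hist(x, breaks = br, plot = FALSE)$counts

# include.lowest = TRUE mirrors hist()'s handling of the first interval
c_counts <- as.vector(table(cut(x, breaks = br, include.lowest = TRUE)))

all(h_counts == c_counts)
```

Both use right-closed intervals by default, so once the first interval is handled the same way, the tallies line up bin for bin.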
Scott
user 4819328
Frisco, TX
Post #: 7
Wow. I had not considered the cut() command. I need to confirm that the results of cut() can be used for the subsequent analysis (in terms of variable types) -- but so far so good.