Probabilistic Lookup
Here's a challenge for someone. Well, maybe it's easy, but it's challenging for me. For lack of a better name, I call it probabilistic lookup. Say you have a list of names, ordered by popularity (according to the U.S. Census Bureau):
Mary
Nancy
Pamela
Beatrice
Deborah
Enid
Lolita
Delta
Mayola
Concha
How can you randomly select a name such that names at the top of the list are more likely to be selected than names at the bottom of the list? In other words, if you randomly select 1,000 names from this list and then do a sorted frequency distribution of the selected names, they should (roughly) appear in the same order as the original list.
It needs to be generalizable to handle a much larger list (~10,000) . I'm not seeking statistical perfection, just a reasonably accurate way to give names higher on the list a greater chance of being selected, and names lower on the list a lower chance of being selected.
I'd prefer a formula solution, if possible.
Jack Dahlgren:
I can think of a number of ways.
Here is a pretty simple one.
Reverse the list order.
Add a column
Autonumber 1 through n+1 (assuming n names)
In the next column use a formula
The first cell is 1, the next one down = cell above + the autonumbered cell
Drag all the way down to one beyond the last name.
This is now your lookup table.
Next use random() and multiply by the last value
Use that as the input to your lookup table and return the name associated with that number.
If you had 4 names the first name would have 1 chance in 10, the second 2 chances in ten etc…
1 Bob
12 July 2006, 6:34 pm3 Carol
6 Ted
10 Alice
James Low:
Hi, not sure if this is the best way but:
http://jameslow.com/ProbabilisticLookup.xls
12 July 2006, 8:31 pmShawn Stackhouse:
Hi John,
How about this solution:
Assign your name list a named range of Names.
This formula will return the names at the top of the list more often:
=INDEX(Names,INT(COUNTA(Names)-(SQRT(2*(RAND()*COUNTA(Names)*(COUNTA(Names)+1)/2)+0.25)-0.5))+1)
The weights work like this:
e.g. If there are 3 names in the list
Name 1 is returned around 3/6 of the time
Name 2 is returned around 2/6 of the time
Name 3 is returned around 1/6 of the time
e.g. If there are 4 names in the list
Name 1 is returned around 4/10 of the time
Name 2 is returned around 3/10 of the time
Name 3 is returned around 2/10 of the time
Name 4 is returned around 1/10 of the time
etc.
If you’re interested in the math, then I could explain it!
Thanks,
12 July 2006, 8:40 pmShawn
Bill Simoni:
John - I’ve taken a stab at it. My solution requires some sort of frequency to be specified. Lacking one, I assumed the top item in a list of n items happens n times, the second item happens (n-1) times, and so forth. I suppose it’s possible some sort of megaformula could be made to handle it all in one calc, but I’d prefer not to try and wrap my brain around that…
I’ve posted the file at my site so anybody can download and play with it: http://bill.simonifamily.net/2006/07/12/probabilistic-lookup-in-excel/
12 July 2006, 9:48 pmJason Morin:
Here’s a start. Assuming the names are in A1:A10, I assigned probabilities to each name in B1:B10 and then sorted both columns ascending based on column B. So I have:
Delta 0.005
Mayola 0.005
Concha 0.005
Lolita 0.01
Enid 0.025
Beatrice 0.05
Deborah 0.05
Pamela 0.1
Nancy 0.25
Mary 0.5
I used a helper column (column C) by multiplying each probability by 10000 (cell G1) and then adding the values cumulatively down col. C. So C1 has: =B1*$G$1 and C2: =B2*$G$1+C1. Copy the formula in C2 down to C10. Now I have 3 columns that look like:
Delta 0.005 50
Mayola 0.005 100
Concha 0.005 150
Lolita 0.01 250
Enid 0.025 500
Beatrice 0.05 1000
Deborah 0.05 1500
Pamela 0.1 2500
Nancy 0.25 5000
Mary 0.5 10000
From here I used an INDEX/MATCH combo with RANDBETWEEN selecting a whole number between 0 and 9999:
=INDEX(A1:A10,MATCH(RANDBETWEEN(0,G1-1),C1:C10,1)+1)
Of course the biggest issue is that the numbers 0-49 will product #N/A, and the formula will never return “Delta.” That should be an easy workaround but I can’t find one at the moment. I’ll address later.
I also picked “pretty” probabilities in this example.
12 July 2006, 10:06 pmSam:
Hi
Formula in Colum C =IF(RANDBETWEEN(1,26)>ROW()=TRUE,A2,0)
Formula in Colum D =IF(C2=0,”",IF(AND(C20),”T”,”B”))
This gives more random “T”’s than “B”’s
Regards
Sam
A B C D
12 July 2006, 11:24 pmSl Name TOP/BOTTOM
1 a 1 T Top 10 77%
2 b 0 Bottom 3 23%
3 c 3 T 13
4 d 4 T
5 e 5 T
6 f 6 T
7 g 0
8 h 8 T
9 i 9 T
10 j 0
11 k 11 T
12 l 12 T
13 m 13 T
14 n 14 B
15 o 0
16 p 0
17 q 17 B
18 r 0
19 s 0
20 t 0
21 u 21 B
22 v 0
23 w 0
24 x 0
25 y 0
26 z 0
XLurker:
Hi John,
If the list is in, say, cells A1 to A10, then you could make your selections with a formula like:
=OFFSET($A$1,TRUNC(RAND()*RAND()*COUNTA($A$1:$A$10),0),0)
Cheers,
XLurker
13 July 2006, 12:25 amMike Alexander:
Six Steps:
1. Copy list to A1:A10
2. Sort list by smallest frequency of occurance to largest
3. Enter 0 in Cell B1
4. Enter =INT(AVERAGE(B1+2,(B1+2)*2)) in cell B2
5. Copy formula down to Cell B11 (Yes…one row below the end the of list)
6. Enter this formula anywhere =INDEX($A$1:$A$10,(MATCH(INT((RAND()*MAX($B:$B)+1)),$B$1:$B$11,1)))
Copy formula down to 1000 rows, and you will get a distribution that should support the premise of your original list. The more rows you include, the closer you get to the correct distribution.
13 July 2006, 12:30 amThe formula in step 4 could be tweaked for better results.
Jasser:
Hi,
here’s my try:
=INDEX(INDIRECT(”A1:A”&COUNTA($A:$A)),COUNTA($A:$A)-INT(LOG10(10*(9/10*RAND()+0.1))*COUNTA($A:$A)))
Assuming that your list of names starts at A1.
-Asser
13 July 2006, 12:33 amMpemba:
First is to generate an index table:
“Occurance” is the relative indvidual frequency of the names in ascending order. It does not matter how these are scaled.
“Prob” is the cumulative probability of that name being returned. The top name will always be 1.000
The formula in the Prob column is:= SUM(B2:$B$11)/SUM($B$2:$B$11)
To actually return names the formula is:
=INDEX(Names,MATCH(RAND(),Prob,-1))
Where:
“Names” represents A$2:$A$11
“Prob” represents C$2:$C$11
Example Result for a block of 10000 selections
FWIW formula to complie this was:
=COUNTIF($J$1:$S$1000,A2) - to Give answer for “Mary”
Mary 2337Nancy 2316
Pamela 1325
Beatrice 960
Deborah 922
Enid 783
Lolita 520
Delta 348
Mayola 308
Concha 181
Mpemba
Editor: Mpemba, I’ve been using your comment to play around with the formatting.
13 July 2006, 2:32 amMpemba:
Darn that formatting: it looked fine in the preview window (Courier) !
I should add, if you want a working copy of the spreadsheet I’d gladly mail it.
M
13 July 2006, 4:04 amXLurker:
Hi,
If the list is in, say, cells A1 to A10, then how about making your selections with a formula like this:
=OFFSET($A$1,TRUNC(RAND()^2*COUNTA($A$1:$A$10),0),0)
The COUNTA is for scalability of the list. Set the power of 2 higher to skew the selection more to the top of the list, or between 0 and 1 to skew the selection to the bottom of the list.
Cheers,
XLurker
13 July 2006, 4:47 amMpemba:
Me again:
In hindsight I should add that in post 4 my “Occurance” column does not have to be sorted in to ascending order. The order does not matter a jot using this method.
I must also admit I found it surprising Mary was at the top of the example list. The 2005 data for the USA listed at:
http://www.socialsecurity.gov/OACT/babynames/
Top ten for the 2000s
13 July 2006, 5:32 amhttp://www.socialsecurity.gov/OACT/babynames/decades/names2000s.html
Emily 149,420 1.2525%
Madison 123,729 1.0372%
Hannah 110,081 0.9228%
Emma 106,428 0.8921%
Ashley 91,644 0.7682%
Abigail 89,848 0.7532%
Alexis 89,512 0.7503%
Olivia 88,971 0.7458%
Samantha 88,669 0.7433%
Sarah 85,747 0.7188%
Gareth:
RNDBTWN? thght mst b mssng smthng, hng n tls > dd-ns > nlyss tlpk.
It`s making a lot more sense now.
13 July 2006, 6:16 amJ Walker:
Mpemba:
13 July 2006, 6:45 amYou shouldn’t be surprised Mary was at the top of the list. If you go to the site you linked and check the “Popularity of a Name” on Female for the past 100 years using the name Mary, you’ll find it was a very popular name for a very long time.
John Walkenbach:
Thanks to everyone. I haven’t gone through them all yet, but this is a great help.
BTW, here’s where I got the names from:
http://www.census.gov/genealogy/names/names_files.html
And each name has a frequency.
13 July 2006, 7:48 amShawn Stackhouse:
For a graphical representation of name popularity over the decades, check out the totally fun Baby Name Wizard at http://www.babynamewizard.com/namevoyager/
Poor Mary ain’t what she used to be!
From these charts, you can easily pinpoint when I was born.
13 July 2006, 8:19 amAndrew Reynolds:
{=OFFSET($A$1,MATCH(RAND(),1-EXPONDIST(ROW(1:[n])-1,[tweakable constant],TRUE),-1),0)} … so entered as an array formula over the required output cells.
A1 is one cell above the input list
13 July 2006, 11:49 am[n] is the upper bound for the number of input values + 1
[tweakable constant] controls the decay in probabilities for the exponential function. 0.1 ‘works’ over 10,000 cells, 0.9 can be seen to ‘work’ for a shorter list. 0.1 is very ’slow’ decay in probability.
Cyrus Sujar...:
Statistical Perfection yields the best results. Sorry so late but I was helping someone with homework and came across this unsatisfactory answer. What you need is an “adjusted cumulative frequency” column to the left of the names. Luckily, the data includes the cumulative frequency:
adj.cf name freq cum.freq(%) rank
0 MARY 2.629 2.629 1
.02629 PATRICIA 1.073 3.702 2
.03702 LINDA 1.035 4.736 3
.04736 BARBARA 0.980 5.716 4
.05716 ELIZABETH 0.937 6.653 5
=VLOOKUP(RAND(), $A$2:$B$6, 2)
or if you use the actual data
=VLOOKUP(RAND(), $A$2:$B$4275, 2)
http://www.excelforum.com/showthread.php?t=628930
15 July 2008, 10:44 amshows similar question with Letters of alphabet