#271397 enamdict: add frequency statistic

#271397#5
Date:
2004-09-12 21:58:37 UTC
From:
To:
It would be neat if there was some statistic added so we could tell
which names are more popular.

That way I could e.g., make a better guess as to which of the several
choices a "Mr. Tashiro" really is.

#271397#10
Date:
2004-09-13 00:19:59 UTC
From:
To:
Greetings,

[Dan Jacobson (Bug#271397: enamdict: add frequency statistic) writes:]

Yes, it would be very neat. And it's something I'd be happy to do if
there were to be a reliable way of extracting name frequencies. The
closest I got was NTT's collection of names from their phone book, but
they put an embargo on any publications based on it, even on just using it
to indicate frequency of occurrence. Can't imagine why they did, but
there it is.

If it's any help, given a name like "Tashiro", many Japanese have to ask
how it is spelled.

Cheers

Jim

#271397#20
Date:
2011-08-10 16:05:16 UTC
From:
To:
Hi,

This is about: http://bugs.debian.org/271397

Mr. Tashiro is quite obvious (percentage of the population using the name,
popularity rank):
 田代 (0.061%, #287) - I pick this without a second thought.
 田城 (0.001%, #6981) - the mozc Japanese input method listed this too.

It is not that popular a name, but the names more popular than 田代 together
cover 50% of the Japanese population.

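I got these basic facts using data from the Shirooka Laboratory (城岡研究室),
Comparative Language and Culture Course, Department of Language and Culture,
Faculty of Humanities, Shizuoka University:
http://www.ipc.shizuoka.ac.jp/~jjksiro/shiro.html
(after UTF-8 conversion, using OpenOffice Calc)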

There is a search page:
http://www.ipc.shizuoka.ac.jp/~jjksiro/kensaku.html
(You can read its JavaScript source and identify the list location as
http://www.ipc.shizuoka.ac.jp/~jjksiro/sei.csv)

Since he seems to love using old BSD tools (sed/awk/...), he may agree to
license this data under BSD terms :-)  Just sweet-talk him ... Jim, I think
you have a good chance.

The new Japanese copyright law allows copying in order to analyze data:

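(Reproduction, etc. for information analysis)

Article 47-7: For the purpose of information analysis by computer (meaning
the extraction of information concerning the language, sounds, images, or
other elements constituting such information from a large number of works or
other large volumes of information, and the comparison, classification, or
other statistical analysis thereof; the same applies hereinafter in this
Article), a work may be recorded on a recording medium or adapted (including
the recording of a derivative work created by such adaptation) to the extent
deemed necessary. However, this does not apply to database works created for
use by persons who perform information analysis.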

Old electronic phone books, I guess, did not have obnoxious restrictions like
the current ones, so he could do this.

There is also a top-100 list of popular names published by Meiji Yasuda Life
Insurance (明治安田生命) in 2008:
http://www.meijiyasuda.co.jp/profile/release/2008/pdf/20080924.pdf

Osamu

#271397#28
Date:
2011-08-11 08:00:55 UTC
From:
To:
I would be quite happy to add some sort of frequency metric
to given and family names in the ENAMDICT file. The trouble
is I have no spare time to go digging out the data. If someone
else were prepared to compile it, I'd be glad to add it.

Jim Breen

2011/8/11 Osamu Aoki <osamu@debian.org>:

#271397#36
Date:
2011-08-11 10:50:46 UTC
From:
To:
Hi,

I have found data in CSV format for family names, as shown below.
The raw data has a bit over 100,600 names.
Given names are a bit more difficult.

It looks like....

"sei","rank","number"
"佐藤","1位",481980
"鈴木","2位",426804
"高橋","3位",353911
"田中","4位",334073
"渡辺","5位",276257
"伊藤","6位",270047
"山本","7位",269344
...
"天徳寺","88108位",1
"天寅","88108位",1
"天屯","88108位",1
"天秤","88108位",1
"天彦","88108位",1
"天峯","88108位",1
"天霧","88108位",1
"天野盛","88108位",1
"天雷","88108位",1
"天路","88108位",1
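
A minimal sketch of reading rows in this layout, assuming the text has already
been converted to UTF-8 and that the rank column always ends in "位". The
names SAMPLE and parse_rows are made up for illustration, and the sample holds
only a few of the rows shown above.

```python
import csv
import io

# A few rows copied from the excerpt above; the real file is far longer.
SAMPLE = '''"sei","rank","number"
"佐藤","1位",481980
"鈴木","2位",426804
"田中","4位",334073
"天路","88108位",1
'''

def parse_rows(text):
    """Yield (name, rank, count) tuples from CSV text in the layout shown."""
    reader = csv.DictReader(io.StringIO(text))
    for row in reader:
        # The rank column carries a trailing "位" marker, e.g. "1位".
        rank = int(row["rank"].rstrip("位"))
        yield row["sei"], rank, int(row["number"])

rows = list(parse_rows(SAMPLE))
print(rows[0])   # ('佐藤', 1, 481980)
```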

So the remaining task is to ask the copyright holder and merge this into your
dictionary (I assume the XML one is the one you wish to update).

I assume normalizing "number" into a percentage may be a good idea.  But we
could simply mark the low-count ones as rare.  Alternatively, -10*LOG(ratio)
may provide a better index covering a wider range.  Please think about it.
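
The two normalizations can be sketched as follows, assuming "ratio" means a
name's count divided by the total count over all names and that LOG is the
base-10 logarithm. The function names and the total of 1,000,000 are
hypothetical and are not taken from the data.

```python
import math

def percent(count, total):
    """Share of the population carrying the name, as a percentage."""
    return 100.0 * count / total

def log_index(count, total):
    """-10*log10(count/total): 0 if everyone shares the name, larger when rarer."""
    return -10.0 * math.log10(count / total)

# Hypothetical total, chosen only to make the arithmetic easy to follow.
TOTAL = 1_000_000
for count in (100_000, 1_000, 1):
    print(count, round(percent(count, TOTAL), 4), round(log_index(count, TOTAL), 1))
```

On this scale every factor of ten in rarity adds 10 to the index, so very
common and very rare names fit in one compact range.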

I see there are some manual touch ups needed.  I can help.

I will write to the data producer for the license.

I will mention our intended use and ask him to put his database under
the same terms as yours.

Regards,

Osamu

#271397#41
Date:
2011-08-11 13:12:35 UTC
From:
To:
Good evening,

2011/8/11 Osamu Aoki <osamu@debian.org>:

Yes, but family names are a great start.

I see you have emailed about it. Thank you for doing that.

I was thinking of dividing into 10 ranks: R1 to R10, with R1 being the most
common.

Something like (in Python): 10-int(math.log10(number)/.63)
would turn those numbers into a 1-10 ranking.
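
Wrapped in a function, that expression looks like the sketch below. The name
rank_bucket is made up for illustration, and the checks use counts from the
CSV excerpt earlier in this thread. Note that a count of about two million or
more (log10 >= 6.3) would fall below 1, which does not occur in the counts
shown.

```python
import math

def rank_bucket(number):
    """Map an occurrence count to a rank from 1 (most common) to 10 (rarest)."""
    return 10 - int(math.log10(number) / 0.63)

print(rank_bucket(481980))  # 1  (佐藤, the most common name in the excerpt)
print(rank_bucket(426804))  # 2  (鈴木)
print(rank_bucket(1))       # 10 (names that occur only once)
```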

Thanks for doing this.

Cheers

Jim

#271397#46
Date:
2015-01-29 10:01:55 UTC
From:
To:
