It would be neat if there was some statistic added so we could tell which names are more popular. That way I could e.g., make a better guess as to which of the several choices a "Mr. Tashiro" really is.
Greetings, [Dan Jacobson (Bug#271397: enamdict: add frequency statistic) writes:] Yes, it would be very neat. And it's something I'd be happy to do if there were to be a reliable way of extracting name frequencies. The closest I got was NTT's collection of names from their phone book, but they put an embargo on any publications based on it; even just using it to indicate frequency of occurrence. Can't imagine why they did, but there it is. If it's any help, given a name like "Tashiro", many Japanese have to ask how it is spelled. Cheers Jim
Hi,

This is about: http://bugs.debian.org/271397

Mr. Tashiro is quite obvious. (% of population using the name, popularity rank)

田代 (0.061%, #287) - I pick this without a second thought.
田城 (0.001%, #6981) - the mozc Japanese input method listed this too.

It is not that popular a name, but the names more popular than 田代 cover 50% of the Japanese population.

I got these basic facts using data from 城岡研究室 静岡大学 人文学部 言語文化学科比較言語文化コース http://www.ipc.shizuoka.ac.jp/~jjksiro/shiro.html (with UTF-8 conversion and OpenOffice Calc).

There is a page http://www.ipc.shizuoka.ac.jp/~jjksiro/kensaku.html (you can read the JavaScript source and identify the list location as http://www.ipc.shizuoka.ac.jp/~jjksiro/sei.csv).

Since he seems to love using old BSD tools (sed/awk/...), he may agree to license this data under BSD terms :-) Just sweet-talk him. Jim, I think you have a good chance.

Japanese copyright law now allows copying in order to analyze data:

(Reproduction, etc. for information analysis) Article 47-7: For the purpose of information analysis by computer (meaning the extraction of information on the language, sounds, images, or other elements making up that information from a large number of works or other large volumes of information, and the comparison, classification, or other statistical analysis of it; the same applies hereinafter in this Article), a work may be recorded on a recording medium or adapted (including the recording of a derivative work created by that adaptation) to the extent deemed necessary. However, this does not apply to database works created for the use of persons who perform information analysis.

Old electronic phone books, I guess, did not have obnoxious restrictions like the current ones, so he could do this.

There is also a top-100 list of popular names published by 明治安田生命 in 2008: http://www.meijiyasuda.co.jp/profile/release/2008/pdf/20080924.pdf

Osamu
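The "names more popular than 田代 cover 50%" figure is a cumulative share over a frequency list. A minimal sketch of that calculation follows, assuming a plain name-to-count mapping; the function name `share_ranked_above` and the toy counts are hypothetical and are not taken from sei.csv.

```python
def share_ranked_above(counts, name):
    """Fraction of all people whose family name is strictly more common than `name`."""
    total = sum(counts.values())
    threshold = counts[name]
    above = sum(n for n in counts.values() if n > threshold)
    return above / total


# Made-up counts for illustration only; real input would be the full list.
toy_counts = {"A": 50, "B": 30, "C": 15, "D": 5}

# Names more common than "C" are "A" and "B": (50 + 30) / 100.
print(share_ranked_above(toy_counts, "C"))
```

Run against the full list, the same function would give the share of the population covered by names ranked above any chosen name.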
I would be quite happy to add some sort of frequency metric to given and family names in the ENAMDICT file. The trouble is I have no time spare to go digging out the data. If someone else were prepared to compile it, I'd be glad to add it. Jim Breen 2011/8/11 Osamu Aoki <osamu@debian.org>:
Hi,

I have found data for family names in CSV format, shown below. The raw data has a bit over 100,600 names. Given names are a bit more difficult.

The family-name data looks like this:

"sei","rank","number"
"佐藤","1位",481980
"鈴木","2位",426804
"高橋","3位",353911
"田中","4位",334073
"渡辺","5位",276257
"伊藤","6位",270047
"山本","7位",269344
...
"天徳寺","88108位",1
"天寅","88108位",1
"天屯","88108位",1
"天秤","88108位",1
"天彦","88108位",1
"天峯","88108位",1
"天霧","88108位",1
"天野盛","88108位",1
"天雷","88108位",1
"天路","88108位",1

So the remaining task is to ask the copyright holder and merge this into your dictionary. (I assume the XML one is the one you wish to update.)

I assume normalizing "number" into a percentage may be a good idea, though we could simply mark the low-count names as rare. Alternatively, -10*LOG(ratio) may provide a better index covering a wider range. Please think about it.

I see some manual touch-ups are needed. I can help.

I will write to the data producer about the license. I will mention our intended use and ask him to put his database under the same terms as yours.

Regards,
Osamu
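The two normalizations suggested above can be compared with a short sketch. It assumes that LOG means the base-10 logarithm and that "ratio" is a name's count divided by the total count; neither is stated in the message. The helper names are hypothetical, and the total here is only the sum of the few sample rows, not the real total.

```python
import csv
import io
import math

# A few rows copied from the sample shown in the message.
SAMPLE = '''"sei","rank","number"
"佐藤","1位",481980
"鈴木","2位",426804
"高橋","3位",353911
"天秤","88108位",1
'''


def load_counts(text):
    """Map each family name to its count, given CSV text in the layout shown."""
    reader = csv.DictReader(io.StringIO(text))
    return {row["sei"]: int(row["number"]) for row in reader}


def percent(count, total):
    """Share of the total, as a percentage."""
    return 100.0 * count / total


def log_index(count, total):
    """-10 * log10(ratio): 0 for a name everyone has, larger for rarer names.

    Assumption: base-10 logarithm, ratio = count / total.
    """
    return -10.0 * math.log10(count / total)


counts = load_counts(SAMPLE)
total = sum(counts.values())  # placeholder; the real total needs the full file

for name, count in counts.items():
    print(name, round(percent(count, total), 4), round(log_index(count, total), 1))
```

The percentage column collapses toward zero for rare names, while the logarithmic index keeps them distinguishable, which is the wider range the message refers to.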
Good evening,

2011/8/11 Osamu Aoki <osamu@debian.org>:

Yes, but family names are a great start. I see you have emailed about it. Thank you for doing that.

I was thinking of dividing the names into 10 ranks, R1 to R10, with R1 being the most common. Something like (in Python): 10-int(math.log10(number)/.63) would turn those numbers into a 1-10 ranking.

Thanks for doing this.

Cheers
Jim
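The expression above can be wrapped in a small function to check how it behaves on the sample counts. This is only a sketch: the function name `frequency_rank` is hypothetical, and the guard against counts below 1 is an addition not present in the original expression.

```python
import math


def frequency_rank(number):
    """Map a raw count to a rank using 10 - int(log10(number) / 0.63).

    Rank 1 is the most common, rank 10 the rarest.
    """
    if number < 1:
        # Added guard: log10 is undefined for zero or negative counts.
        raise ValueError("count must be at least 1")
    return 10 - int(math.log10(number) / 0.63)


# Counts taken from the CSV sample earlier in the thread.
for name, number in [("佐藤", 481980), ("鈴木", 426804), ("天秤", 1)]:
    print(name, "R%d" % frequency_rank(number))
```

With the sample data the largest count, 481980, lands in R1 and a count of 1 lands in R10. A count of roughly two million or more would fall below 1, so the divisor would need adjusting for a larger data set.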