Difference between revisions of "Ar Tonelico III"
From Learning Languages Through Video Games
Revision as of 19:40, 16 April 2011
Lots and lots of text and lots of obscure kanji!
Translations
(or should I translate to German??)
Tips
I used the Japanese IME "canna" under Ubuntu, compiled from source, and changed the kana-kanji dictionaries so that kanji+furigana is written upon entering and converting Japanese te
Script
Yeah, the script in Japanese with a complete English translation: [http://www.2shared.com/file/6v-zJ7J4/joined.html] [https://rapidshare.com/files/457743039/joined] (3.2MB) :)
For anyone who's interested, what I found out...
AT3 ebd script files
• they consist of EVENT_MESSAGE_SW[2digit-NUMBER]_[3digit-NUMBER].ebm (called DIAG from now on) and EVENT_SW[2digit-NUMBER]_[3digit-NUMBER].ebm (called CTRL from now on)
• each DIAG corresponds to a CTRL file with the same NUMBERs
• DIAG contains the main dialogue lines, while CTRL is probably system-related
• DIAG files are also usually only a few hundred bytes long
• DIAG has a header of 4 bytes, then comes the main part
• the first 26 bytes of CTRL are as follows (decimal):
  [#1] 000 000 000 000 000 000 000 005 000 000 000 110 097 109 101 000 005 000 000 000 144 224 150 190 000 [#2] [#3] 000 000 [#4] 000 [#5] [#6] [#7] [#8] [#8] 000 [#9] 000
  where the [#n] are
  -- #1 takes many different values, 001 is very frequent (~50%)
  -- #2 takes many different values
  -- #3 mostly 000, a few times 001, 002, 4 times 003, 3 times 004
  -- #4 mostly small bytes <=021; 021 and 00x frequently occur together in adjacent files; takes 044 in two instances
  -- #5 either 000, 016, 049, or 064
  -- #6 always either 113, 116, 117, 119, or 127
  -- #7 bytes <= 025, either 00x or 021x with x<=5 except a handful of times
  -- #8 almost always 000, except in 10 and 7 files respectively
  -- #9 either 000, 001, 002, 003, 004, 005, 017, 019, 021, 025, with the lower bytes much more common
• the last byte of CTRL always seems to be <bh:7f>, the last 26 bytes being only somewhat similar
• in general, CTRL displays a high ratio of <bh:00>
• CTRL contains no UTF8 chars
• the main part of CTRL, apart from the many 0's, contains only ASCII chars, most of which are LATIN characters and punctuation, with a few special chars such as <bh:f4>, <bh:dc> (Ü)
• the main part of DIAG is in the following format; after the 3-byte header comes:
  [SEPARATOR][UTF8-sequence][SEPARATOR][UTF8-sequence] ... [UTF8-sequence][SEPARATOR]
• as the text is Japanese, [UTF8-sequence] is usually a multiple of 3-byte blocks, each block being the multi-byte encoding of one Japanese character; it terminates on a zero byte
• the main text may contain a ※削除※ ("deleted") line; [LEADING] is then <bh:ff>
• [SEPARATOR] always consists of 36 bytes, each byte smaller than <bd:192>, with the only exception that it may also contain <bh:ff>; this does not count the <bh:00> byte that terminates the UTF8 sequence
• [SEPARATOR]: most bytes are constant, except the following meaningful bytes
• the 25th byte: it is a [LEADING] number, counting the dialogue lines
• a [LEADING] byte of <bh:ff> means this line is outside the "normal" dialogue flow, i.e. a system message ("You got item..") or "Party member xyz joined." or "……。" or "…!?" etc.
• the 13th byte: this indicates the [SPEAKER]. [SPEAKER] is <bh:ff> when there is no speaker
• the first byte indicates the [MODE]
  <bh:00> - talk with speech bubbles at the character's 3D models
  <bh:01> - talk with 2D character portraits
  <bh:02> - item get

TO SUMMARIZE
• dialogue in EVENT-MESSAGE file: [3 byte header][36-byte separator][UTF8 byte sequence, terminating on <bh:00>], repeat
• 13th byte of [SEPARATOR] is the speaker, 26th byte of [SEPARATOR] marks "normal" spoken text
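The summarized layout can be turned into a small record parser. What follows is only a sketch under assumptions taken from the notes above: the header length is passed in as a parameter (the notes mention both 3 and 4 bytes), every separator is exactly 36 bytes, and bytes 1, 13 and 25 of the separator are read as [MODE], [SPEAKER] and [LEADING] (the summary says 26th; the sketch follows the 25th from the detailed notes). The name parse_diag and the record field names are made up for illustration, and the sample data is synthetic, not taken from the game.

```lua
-- Sketch: split an in-memory DIAG file into dialogue records.
-- Assumed layout: header, then repeated
-- [36-byte separator][UTF8 text][<bh:00> terminator].
local SEP_LEN = 36

local function parse_diag(data, header_len)
  local records = {}
  local pos = header_len + 1            -- Lua strings are 1-indexed
  -- need a full separator plus at least one more byte
  while pos + SEP_LEN <= #data do
    local sep = data:sub(pos, pos + SEP_LEN - 1)
    local text_start = pos + SEP_LEN
    -- the text runs up to (not including) the next zero byte
    local text_end = data:find("\0", text_start, true)
    if not text_end then break end      -- no terminator: stop
    records[#records + 1] = {
      mode    = sep:byte(1),            -- [MODE]
      speaker = sep:byte(13),           -- [SPEAKER]
      leading = sep:byte(25),           -- [LEADING]
      text    = data:sub(text_start, text_end - 1),
    }
    pos = text_end + 1
  end
  return records
end

-- Synthetic input: 3-byte header, one record with
-- mode 1, speaker 5, leading 7.
local zeros = string.rep("\0", SEP_LEN)
local sep = "\1" .. zeros:sub(2, 12) .. "\5" .. zeros:sub(14, 24)
          .. "\7" .. zeros:sub(26)
local data = "\0\0\0" .. sep .. "こんにちは" .. "\0"

for _, r in ipairs(parse_diag(data, 3)) do
  print(r.mode, r.speaker, r.leading, r.text)
end
```

Skipping the separator by its fixed length, rather than scanning for it, matters because the separator itself is mostly zero bytes and would otherwise be mistaken for text terminators.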
And here is an improved Lua script I wrote that looks for valid UTF8 sequences in a file; it works much better and doesn't need specific information on separators etc.:
--PARSE BYTES FOR VALID UTF8 zero-terminated sequences (except ASCII, i.e. the first bit non-zero)

--specify how what's between the UTF8 sequences should be interpreted, don't forget the newline!
--sep is a table of hex strings, one entry per byte
function process_separator(sep)
  -- return "" --just use this if you only want the UTF8 data
  if #sep > 25 then
    --long run: print bytes 1, 13 and 25 ([MODE], [SPEAKER], [LEADING]) as hex
    return "(" .. sep[1] .. "・" .. sep[13] .. "・" .. sep[25] .. ")"
  else
    --short run: turn the hex strings back into raw bytes
    local tag = ""
    for _, v in ipairs(sep) do
      tag = tag .. string.char(tonumber(v, 16))
    end
    return tag
  end
end

--true is interpreted as 1, nil or false as 0
--byte must point to an initialized table of false or nil values; byte[1] becomes the highest bit
function dec_to_8bit(dec, byte)
  local exp = 128
  for i = 1, 8 do
    if dec >= exp then
      byte[i] = true
      dec = dec - exp
    end
    exp = exp * 0.5
  end
end

function get_utf8(filename, outname)
  local occ = 0 --counts the valid multi-byte chars we found; declared here so it can be returned
  local infile = io.open(filename, "rb")
  if infile then
    print("Searching for valid UTF8 in file: " .. filename .. "...")
    local out = io.open(outname, "a+") --"a+" appends to the end of the file; use "w+" to delete previous data
    out:write("#FILE: " .. filename .. "\n")
    local cur_pos = 0
    local len = 0 --bytes still expected in the current sequence
    local len2 --total length of the current sequence
    local utf8 = ""
    local betw = {} --what's between the utf8 sequences
    local betw2 --lead byte of the current sequence
    local insert_line_break = false
    local insert_sep = false
    local file_len = infile:seek("end")
    infile:seek("set")
    while cur_pos < file_len do --a while loop also copes with empty files
      local dec = string.byte(infile:read(1))
      cur_pos = infile:seek("cur")
      local byte = {}
      dec_to_8bit(dec, byte)
      if len >= 1 then
        if not byte[1] or byte[2] then --UTF8 multibyte chars MUST start with 10 except the first byte!
          utf8 = ""
          insert_sep = true
          table.insert(betw, string.format("%x", betw2))
          --return to the byte right after where we wrongly assumed UTF8 started...
          --(len2-len) continuation bytes plus the current byte were read since then
          cur_pos = cur_pos - (len2 - len) - 1
          infile:seek("set", cur_pos)
          len = 0
        else --valid utf8 found, dumping...
          utf8 = utf8 .. string.char(dec)
          len = len - 1
          if len == 1 then --utf8 sequence end
            len = 0
            occ = occ + 1
            if insert_sep and #betw > 0 then
              out:write(process_separator(betw) .. utf8)
              betw = {}
              insert_sep = false
            else
              out:write(utf8)
            end
            insert_line_break = true
            utf8 = ""
          end
        end
      else
        if byte[1] and byte[2] then --we are not interested in ASCII chars... otherwise allow b2=="0"
          --now determine byte length of glyph: count the leading 1 bits
          len = 2
          repeat
            len = len + 1
          until not byte[len]
          len = len - 1
          if len > 6 then --UTF8 only allows for 6byte chars at most
            table.insert(betw, string.format("%x", dec))
            len = 0
            insert_sep = true
          else
            utf8 = utf8 .. string.char(dec)
            len2 = len
            betw2 = dec
          end
        else
          if (dec == 0) and insert_line_break then --zero terminated :)
            out:write("\n")
          else
            table.insert(betw, string.format("%x", dec))
          end
          insert_sep = true
        end
        insert_line_break = false
      end
    end
    out:write("\n")
    infile:close()
    out:close()
    print("Found " .. occ .. " valid UTF8 chars, except ASCII.\nWritten to " .. outname .. ".\nDone.")
  end
  return occ
end

get_utf8(arg[1], arg[2])
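The script works out the sequence length by expanding the lead byte into a table of bits and counting the leading ones. For comparison only, the same classification can be sketched with plain range checks on the byte value. utf8_seq_len is a hypothetical helper name, not part of the script; like the script, it still accepts the old 5- and 6-byte lead bytes, which current UTF-8 (RFC 3629) no longer allows.

```lua
-- Sketch: length of a UTF-8 sequence judged from its first byte.
-- Returns nil when the byte cannot start a sequence.
local function utf8_seq_len(b)
  if b < 0x80 then return 1          -- 0xxxxxxx: ASCII
  elseif b < 0xC0 then return nil    -- 10xxxxxx: continuation byte
  elseif b < 0xE0 then return 2      -- 110xxxxx
  elseif b < 0xF0 then return 3      -- 1110xxxx (most Japanese text)
  elseif b < 0xF8 then return 4      -- 11110xxx
  elseif b < 0xFC then return 5      -- 111110xx (legacy form)
  elseif b < 0xFE then return 6      -- 1111110x (legacy form)
  else return nil end                -- 0xFE and 0xFF never occur
end

-- "あ" is encoded as E3 81 82, so its lead byte announces 3 bytes.
print(utf8_seq_len(("あ"):byte(1)))  --> 3
```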
Oh, and btw, the AT3 script contains 1026643 characters : )
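How that figure was counted is not stated here. One possible way to count characters in extracted UTF8 text, sketched purely as an assumption, is to count every byte that is not a continuation byte (10xxxxxx), since each character contributes exactly one such byte. count_chars is a hypothetical name, and the count includes line breaks and any other ASCII in the text.

```lua
-- Sketch: count UTF-8 characters by counting non-continuation bytes.
local function count_chars(s)
  -- gsub returns the number of matches as its second result
  local _, n = s:gsub("[^\128-\191]", "")
  return n
end

print(count_chars("謳う丘"))  --> 3
```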