A brief introduction to disassem package.
=========================================

1. How you can break down byte stream of an executable file 
   into instructions?
2. Can we use some kind of parsing tools (compiler compiler)
   such as yacc to transform the structure (grammar) into 
   parsing code itself?
3. So what is preccx anyway?
4. To understand windows executable files what we need to know?
5. Combine two things and we get the basic disassembler.
6. What is the problem anyway?
7. Can we solve the problem?
8. What's next step?
9. Some explanation about each files and directory.


--------------------------------------------------------------------------
1. How you can break down byte stream of an executable file 
   into instructions?
--------------------------------------------------------------------------

I found the following structures for intel pentium instructions.
The structure is not related to any architectural considerations,
it is selected for the decoding purpose only.

   <instruction>       =    <prefixes>* <instruction_body> 
   
 % instruction consists of zero or more prefixes followed by 
   instruction body 

   <instruction_body>  =    <one_byte_instruction>      	
                       |    "0x0F"  
					        <two_byte_instruction>     
 
 % instruction body is either one byte instruction 
   or "0x0F" followed by two byte instruction  
   
   <one_byte_instruction>   
                       =    <opcode0>        
					   |    <opcode1> <byte>
					   |    <opcode2> <word>
					   |    <opcode3> <word> <byte>
					   |    <opcode4> <double_word>
					   |    <opcode5> <pointer_word>
					   |    <opcode6> <mod_reg_or_mem>
					   |    <opcode7> <mod_reg_or_mem> <byte>
					   |    <opcode8> <mod_reg_or_mem> <double_word>
					   |    <opcode9> <opcode_extention>
					   |    <opcode10> <opcode_extention> <byte>
					   |    <opcode11> <opcode_extention> <double_word>
					   |    <opcode12> <opcode_extention_group>
					   |    <case_jump_block>
					   |    <test_group>
					   |    <wait_group>
					   |    <repeat_group>

  % there are 17 sub cases for one byte instruction
    where byte and word and double word are well understood, 
	pointer word consists of 6 bytes,
	the complication comes from mod_reg_or_mem byte (or bytes) 
	which determines addressing mode of the instruction, 
	opcode extention is same byte as mod_.. byte but the role is different,
	case jump block needs special attention because there may be a lot of
	addresses following the instruction itself, 
	test group, wait group, and repeat group is also different 
	from other cases.

   <two_byte_instruction>
                       =    <opcode20>
					   |    <opcode21> <double_word>
					   |    <opcode22> <mod_reg_or_mem>
					   |    <opcode23> <mod_reg_or_mem> <byte>
                       |    <opcode24> <opcode_extention> 
					   |    <opcode25> <opcode_extention> <byte>

 % there are 7 sub case for two byte instruction
   everything is explained above or in the following

   <mod_reg_or_mem>    =    <mod1>
                       |    <mod2> <sib_star> <double_word>
					   |    <mod2> <sib_non_star>
					   |    <mod3> <double_word>
					   |    <mod4> <byte>
					   |    <mod5> <sib> <byte>
					   |    <mod6> <double_word>
					   |    <mod7> <sib> <double_word>
					   |    <mod8>

   <sib_star>          =    <byte>
   <sib_non_star>      =    <byte>
   <sib>               =    <byte>
   <word>              =    <byte> <byte>
   <double_word>       =    <byte> <byte> <byte> <byte>
   <pointer_word>      =    <word> <double_word>
   <mod*>     		   =    <byte>
   <opcode*>           =    <byte>
  
 % mod_reg_or_mem can be either just one byte, or
   it can follow sib (scale index base) byte again 
   it can follow byte or double word depending on the cases.
   
   <case_jump_block>   =    "0xFF" "0x24" <sib> <label_start_position> <label>*

   <label_start_position>
                       =    <label>

   <label>             =    <double_word>

 
   <test_group>        =    "0xF6" <reg/0> <byte>
                       |    "0xF6" <reg> <opcode_extention>
					   |    "0xF7" <reg/0> <double_word>
					   |    "0xF7" <opcode_extention>

   <reg/0>             =    <reg>	         % register part is zero
   <reg>               =    <byte>
   <reg/6>             =    <reg>            % register part is 6
   <reg/7>             =    <reg>            % register part is 7

   <wait_group>        =    "0x9B" "0xD9" <op/6>
                       |    "0x9B" "0xD9" <op/7>
					   |    "0x9B" "0xDB" "0xE2"
					   |    "0x9B" "0xDB" "0xE3"
					   |    "0x9B" "0xDD" <op/6>
					   |    "0x9B" "0xDD" <op/7>
					   |    "0x9B" "0xDF" "0xE0"
					   |    "0x9B"



   <repeat_group>      =    "0xF2" <opcode>		   % for printing purpose
                       |    "0xF3" <opcode>
					   |    "0xF3" <opcode>


Of course for simplicty, not all of structures are presented here.


--------------------------------------------------------------
2. Can we use some kind of parsing tools (compiler compiler)
   such as yacc to transform the structure (grammar) into 
   parsing code itself?
--------------------------------------------------------------

Of course, we have to use parse generator to finish our project 
as early as possible. I choose the parser generator "preccx", 
because i didn't have any experience with yacc, and i needed 
some more and possibly better tool for my disassembler.

Since preccx is top down parser with infinite back tracking ability, 
the script itself is easy to construct and easy to modify, and there 
are a lot of power in your hand, I don't have any experience with yacc 
so i cannot compare two tools.
But believe me this tool seems better. 

To learn to use some tools as complicated as preccx or yacc is painful, 
and it is not easy. But since I provide the preccx script for disassembler, 
you can skip this part. 

Preccx generates C source code which is modular and very compact in size.
It is incremental, which means you can break down scripts and process it 
independently. Well what the heck, most of you don't understand what i am
talking about. But after you get use to this tool, then  ... 

To run generated parser, you need runtime support functions, which is also 
included in preccx package. But I have to modify some part of it, since
the preccx accepts input from the keyboard, and I need to process data from
disk, and I don't have any end of line business, or what ever it matters.


---------------------------------------------------------------------
3. So what is preccx anyway?
---------------------------------------------------------------------

     Preccx is a  compiler  compiler.  It  converts  preccx-style
     context-grammar  definition  scripts  (with  a .y extension)
     into C code scripts (with a .c extension). The  output  code
     compiles  under  ANSI  C  compilers such as the GNU Software
     Foundation's gcc(1).

     There is an easy-to-use hook for lex(1) tokenisers.

     Preccx extends the UNIX yacc(1) utility by allowing:

     [0] Contextual definitions. Each grammar definition  may  be
     parameterized  with  contexts.   For example, some languages
     determine whether a declaration is local (and  to  what)  or
     global  in  scope  by  relative indentation, and this can be
     encoded in preccx using the number of spaces indentation  as
     a parameter, n:

           @ decl(n) = <' '>*n expr <'\n'> decl(n+1)*

     This definition is intended to mean that a  "decl"  indented
     by  n spaces consists of n spaces, an expression, and a new-
     line, optionally followed by one or several "decl"s indented
     still further.

     [1] Infinite lookahead and backtracking in place of the yacc
     1-token  lookahead,  This  means that preccx parsers distin-
     guish correctly between sentences of the form `foo bah  gum'
     and  `foo  bah NAY' on a single pass.  If you cannot imagine
     why one should want to decide between the two,  think  about
     `if ... then ...' and `if ... then ... else ... '.

     [2] Arbitrarily complex expressions. This  means  that  com-
     pound definitions like

          explain {{this | that} {several | no} times}+

     are legal within preccx definition scripts.

     [3] Preccx has the  postfix  operators  `*'  (zero  or  more
     times), `*n' (exactly n times), `+' (one or more times), and
     `!' (execute accumulated actions now) built in,  along  with
     the  `[ ]'  (optionally)  outfix  operator. For example, the
     following means `exactly n spaces':

           @ space(n) = <' '>*n

     The other built-ins are

          `?' (any token)

          `^' (beginning of line)

          `$' (end of line)

          `|' (or, placed between alternate phrases of the  gram-
          mar)

          `{ }' (grouping brackets)

          `< >' (around literals)

          `> <' (to mean `not a particular literal')

          `( )' around the name of a BOOLEAN valued predicate  on
          tokens,  defined  as  an  int 1 or 0 -valued C function
          elsewhere in the script, and

          `) (' (anti-brackets) round a C expression  of  BOOLEAN
          type, meaning a logical test condition.

          `]..[' anti-brackets hide an expression, causing it  to
          be required but ignored.

     `]a[ b' means that input must satisfy both a and b, while `a
     ]b[' means that b is trailing context.

     `$!'  is a shorthand for matching  end-of-line  followed  by
     execution  of  pending  actions  (it  also  causes the input
     buffer to start being written from the beginning again).  It
     is roughly  equivalent  to the  conjunction '! $',  but more
     efficient.

     `a b c' (conjunction) is the  term  denoting  an  expression
     consisting of an `a expression' followed by a `b expression'
     followed by a `c expression'. An example of a preccx  script
     follows in the section USAGE.

     [4] Modular output.  Parts of  a  script  can  be  preccx'ed
     separately,  compiled  separately,  and then linked together
     later, which makes maintenance and version control easy.

     [5] Speed. Preccx is fast,  typically  taking  two  to  five
     seconds  to compile scripts of several hundred lines. And it
     builds fast parsers too.

and some more follows...


----------------------------------------------------------------------
4. To understand windows executable files what we need to know?
----------------------------------------------------------------------

General Layout

   MS-DOS header (64 bytes)
   
   real-mode program stub
   
   PE file header
      signature
	  number of sections
	  time date stamp
	  pointer to symbol table
	  number of symbols
	  size of optional header
	  characteristics
   
   Optional header	(224 bytes)
   ---standard fields  
	  signature
	  linker version
	  code size
	  data size
      entry point RVA
	  base of code
	  base of data
   ---NT specific fields	 
	  image base
	  initial stack size
	  program entry point location
	  preferred base address
	  operation system version
	  section alignment information
	  file alignment
	  os information
	  user information
	  system informatiion
	  reserved
	  image size
	  header size
	  file checksum
	  subsystem
	  dll flags
	  stack size
	  heap size
	  loader flags
	  number of data entries
	     export table
		 import table
		 resource table
		 exception table
		 security table
		 base relocation table
		 debug table
		 copyright 
		 global ptr
		 tls table
		 load config table
		 reserved
   
   Section table(Section Headers)
      --- each section header has following   
	  section name
	  virtual size
	  RVA/Offset
	  size of raw data
	  pointer to raw data
	  pointer to relocs
	  pointer to linenumbers
	  number of relocs
	  number of linenumbers
	  section flags
   
   Section data
      the data for each section is located at the file offset given
	  by the pointer to raw data field, size is also given.

      Data directories exist within the body of 
      their corresponding data section. Typically, 
      data directories are the first structure within 
      the section body, but not always. For that 
      reason, you need to retrieve information from 
      both the section header and optional header to 
      locate a specific data directory.
    
There are a lot more things to know about PE file format, and
how to dig out useful informations, but lets stop here for now.


----------------------------------------------------------------------
5. Combine two things and we get the basic disassembler.
----------------------------------------------------------------------


Now I can figure out what is starting position of the code and
what is the RVA(relative virtual address), etc,...
So with this pedump.c and disassembler we built using preccx we can
report very basic things. Alas there are more things to do.

Well for reporting or printing there is print1.c fuction to look at.
The program is straight forward once you understand what is the parse
structure is ... But since I changed this one to solve the problem
which i willl tell you later, it looks quite complicated.

If there is anyone who want to print out disassembler as AT&T assembler
format you can just modify this print1.c and voila, you will get AT&T
format diassembler.



------------------------------------------------------------------------
6. What is the problem anyway?
------------------------------------------------------------------------
  
  =======
  case 1:
  =======

:00401574 55                      push ebp
:00401575 8BEC                    mov ebp, esp
:00401577 8B4508                  mov eax, dword[ebp+08]
:0040157A 8B550C                  mov edx, dword[ebp+0C]
:0040157D C70038154000            mov dword[eax], 00401538
:00401583 895004                  mov dword[eax+04], edx
:00401586 5D                      pop ebp
:00401587 C3                      ret

:00401588 83 00 00 00 03 00 30 00 04 00 00 00 7F 00 00 00   ......0.........
:00401598 3C 00 4C 00 00 00 00 00 00 00 00 00 00 00 00 00   <.L.............
:004015A8 05 00 00 00 04 00 00 00 F0 15 40 00 01 00 5C 00   ..........@...\.
:004015B8 63 6F 6E 73 74 72 65 61 6D 00 00 00 73 17 40 00   constream...s.@.
:004015C8 00 00 00 00 03 00 00 00 00 00 00 00 C3 17 40 00   ..............@.
:004015D8 00 00 00 00 0D 00 00 00 00 00 00 00 8D 16 40 00   ..............@.
:004015E8 08 00 00 00 00 00 00 00                           ........

  =======
  case 2:
  =======

:00402BCC 55                      push ebp
:00402BCD 8BEC                    mov ebp, esp
:00402BCF 8B4D0C                  mov ecx, dword[ebp+0C]
:00402BD2 8B5508                  mov edx, dword[ebp+08]
:00402BD5 8BC1                    mov eax, ecx
:00402BD7 83E003                  and eax, 00000003
:00402BDA FF2485E12B4000          jmp dword[4*eax+00402BE1]

:00402BE1 1B2C4000                DWORD 00402C1B
:00402BE5 F12B4000                DWORD 00402BF1
:00402BE9 FF2B4000                DWORD 00402BFF
:00402BED 0D2C4000                DWORD 00402C0D

:00402BF1 8A01                    mov al, byte[ecx]
:00402BF3 0AC0                    or al, al
:00402BF5 7461                    je 00402C58
:00402BF7 8802                    mov byte[edx], al
:00402BF9 83C101                  add ecx, 00000001
:00402BFC 83C201                  add edx, 00000001

Well look at the case 1 and case 2 carefully.
There is a big problem in two cases. Suddenly there appears
some kind of data block which have no relations what so ever,
and some addresses which looks like case jump addresses.
There are some more kind of these data blocks appear inside code block.
Some disassembler is not prepared to handle all of these cases.
W32dsm87.exe solves some case of data and code separation, but
it is not good at this point at all.
IDA(Interactive Disassembler) only disassembles code block 
which is referenced somewhere.
If IDA does not know whether it is referenced or not simply it
treats code block as data.
So either way it is unsatisfactory.


-------------------------------------------------------------------
7. Can we solve the problem?
-------------------------------------------------------------------

Well what can i say? Above two cases are taken from actual
disasembly listing of some program, so you can expect most part
of data and code seperation is done by disassem.exe.

What do i use to solve this problem? 
Do you remember one of structures i showed you before?
It is case jump block, and it looks like this.

   <case_jump_block>   =    "0xFF" "0x24" <sib> <label_start_position> <label>*

   <label_start_position>
                       =    <label>

   <label>             =    <double_word>

It takes care of case jump block(at least partially you know).
But how about case 1?
I used several heuristic methods to solve code and data seperation.
I proceed until there breaks some problem and erase the part 
which caused the problem. And restart from some point which is 
somehow known to be an entry point or a label point.
Is it that simple?	Yes, and no. 
Basic idea is simple, but many other things have been put together 
to make the program work right. If you have any questions about how 
it works after studying source itself,
I am happy to help you out.


-----------------------------------------------------------------------
8. What's next step?
-----------------------------------------------------------------------

Well I am working on next version of disassembler.
I don't know exactly what should be done and what should'nt.
I like to make diassembler iterative at least although it is not interactive.
This means if you found some part of data should be code block, then you can
tell disassembler about this and disassembler does dirty job for you.
Or you can tell disassembler some part of code is not code at all and it
should be data instead, then this guy does what you have told him.

There are still something I haven't done because of time constraint.
   More of string business, (this means I like to find more strings 
      and possibly crosss reference string data)
   Data cross referencing.
   Find some hidden(or indirect) call references.

I have to confess that actually I like to build a decompiler.
What is decompiler? Is it possible to make a decompiler?
Decompiler does what disassembler does but one step further.
It produces source programs(C or C++ or whatever) for given exe files.

There was some efforts to build a real life decompiler but
no one succeeded in building usable one.
I have some vague idea how this can be done, but there are some reasons 
to believe this is possible.



--------------------------------------------------------------------------
9. Some explanation about each files and directory.
--------------------------------------------------------------------------

First, preccx.zip contains most part of preccx package.
I deleted libarary part and something else which seems 
not that important, if you wish to obtain whole package
you may contact "http://www.comlab.ox.ac.uk:80/archive/redo/"
You can find the preccx package there.

Second, disasm.zip contains whole source for disassem.exe program. 
it consists of main.c, pedump.c, print1.c, g1.c, ccx.c, ccx.h, cc.bat
which is a batch file to compile sources using DJGPP compiler.

g1.y is the preccx script for disassembler. g1.c is the transformed 
C source file for g1.y.

intro.txt is what you read right now.
