PARK - parallel KRONOS architecture

                    Kuznetsov D., Tarasov E.


Abstract
--------

     This   paper  describes  our  understanding  of  parallel
systems.   We  suggest  the  parallel  computation  model  and
appropriate programming tools. The hardware architeture needed
to  support  such systems is based on distributed memory model
and the high level language proessor's concepts.

Keywords:
---------

     parallel systems,  distributed memory,  computation models,
     programming languages,  Modula-2,  Oberon, Kronos, C, Unix,
     transputer, INMOS.

---
INMOS (tm)  - is a tardemark of INMOS corp.
Unix (tm)   - is a tardemark of ATT
Kronos (tm) - is a tardemark of KRG

Introduction
------------

     This  paper  describes  the  huge-multiprocessor  project
which  is  working  out  in  Kronos  Research Group. The major
requisite for this work was Kronos microprocessor architecture
and software designed for it in Novosibirsk Computer Center at
last  years.  Kronos  is  microprocessor architecture based on
OISC   ideas.   From  one  hand  it  optimized  for  efficient
implementation  of  high  level  languages. From the other for
simplification  of  silicon  implementation  of microprocrssor
chip set. Original operating system Excelsior was designed for
Kronos  architecture.  Excelsior was complex, modern operating
system.  It  allow  Kronos Research Group follow-up some great
ideas  in  parallel  software  design. Optimized compilers for
Modula-2,  C  and Fortran-77 also designed for Kronos. We hope
all  this  will be a good begining (a half of good ending) for
decision   dificult  problem  of  huge  multiprocessor  system
design.   The   experience  of  transputer  computation  model
(invoked  by  INMOS) convinces us to try to find out other way
for flexible and usefull organization of parallel system.


General description of parallel system
--------------------------------------

     The   goal   of   any   parallel  computation  system  is
information  processing.  In  classic  models  such  as Turing
mashine   the   atoms   of   information   processed  strictly
sequential.  But  even  in  early  implementation  of cumputer
systems  bits  of  information  coupeled  in  portions  (named
"words").  All  bits  of word proccesed in parallel by special
device  called  "processor".  The processor may be the complex
device  that  includes  arithmetic logic unit, registers, cash
memory,  float point acselerator and so on. All this technical
details  are  very  interesting  and  may  be  the  subject of
individual  investigation.  Some  phases of data processing in
the  processor may be also executed in parallel. But our great
interest  is  directed  to the huge number processors parallel
system arcitecture.
     We  assume  that  a  lot  of  processors elements (nodes)
executes  their  individual  programs  for data processing and
comunicate  to  the  neighbours  for  syncronization  and data
exchange.  Some  subsets  of processors may realy share common
data (for example program code and constants). We will discuse
problems  of  organization  parallel work in such system. Base
element   of   PARK  system  will  be  KRONOS  microprocessor,
programming  in  Modula-P  language. The multiprocessor system
will be driven by PARIS operating environment discribed below.


Computation Model
-----------------

     PARK  is  the nodes set, linked in a net with appropriate
topology.  The  net topology optimization is not a goal of our
exploration,   and   will   be   disused  later  for  concrete
applications. The attainability (may be transit attainability)
of each node is the signle condition needed for our reasoning.
     Each  node  of  PARK  include processor and local memory.
Local memory connected to processor through the special memory
manager.  Memory  manager  includes  the  table driven virtual
memory  support  and facilities for access to the local memory
of  other  nodes.  The  node's processor may access to any non
local  memory  but  with  the  different  speed. Maximum speed
reached  for  local  memory.  Non  local  memory  access speed
depends of distance from processor to memory determined by net
topology and system overhead.
     Operating  environment  formed  as  a  set of independent
tasks.  Task's  independence  means that tasks never processed
the  shared data. The independence property nevertheless allow
tasks shared use of read-only-data, because such data never be
writen  by  this  tasks and so never be processed. So, PARK is
divided  resource for the task set. System cernel may schedule
the  PARK  resources  (processors, memories and so on) between
tasks.
     Task  itself  is  a  set  of  processes shared the task's
virtual  address  space.  The global virtual address space for
any  task  is  reflected  into the physical memories of system
nodes. The distance between each process of task and each data
determines  by net configuration, process to processor mapping
and  virtual to physical address reflection on processors used
in  the  task.  Last two factors must be optimized to minimize
the  distance between data and processes. This optimization is
individual for each task.
     Processes  formed a task demand the syncronization. Their
requirements  may  by  obtained  by  different ways. Our model
suggest  then  processes  of task may share task's data and so
they  require  the means for exclusevly access to common data.
The "signal" model of syncronization is satisfactory. "Signal"
in essence is queue of waiting processes with send-count (when
queue is empty). Sending signal activate first waiting process
or  increments  signal  send-count.  Waiting signal decrements
send-count  iff  it's  non zero and continue, or delay process
until  the  signal will be sended. This general syncronization
method  used  not  only  for  processes.  In the task a set or
special  signals  represents the processors used by this task.
Signal  from  this  set  is  the  ready-queue  for appropriate
processor.   When  operating  environment  decided  that  this
processor  is  ready for the task it simply take first waiting
process from its ready-queue.
     The execution unit in PARK architeture will be program as
a  collection  of  algorithms  and  data  structures.  Program
determines  task  execution, processes creation and algorithms
for them, resource allocation and usage.
     The  main  assumption about the program properties needed
for  efficient  execution based on locality of processes. This
assumption  means  that  most  significant  data trafic in any
process  of  the  task is trafic between process and its local
data.  Process's  local  data  resides  in  local  RAM  of the
appropriate  processor,  but  the  points  of  syncronization,
accepting  and  sending  data  to  other  processes and shared
access to common data structures are not very rarely. At worst
when  program  don't  satisfied  this  assumption (for example
writen for traditional common memory multiprocessor) it may be
slow but in any case will be executable!


PARIS - parallel instrumental system
------------------------------------

     The  problem  of  parallel  systems control is the corner
stone  of  our  project.  We  will  not  use  words 'operating
system',  because  its  semantic was greatly overload. We will
not  tought  abnormaly  high reliable system problems. We will
not  tought  very high level safety systems. It does not means
that  we ignore this problems, but we think, that main goal of
huge parallel operating environment is to provide instrumental
tools for creating such kind of systems.
     The  perfix instrumental means that tools that we foresee
are   oriented  to  professional  programmers,  future  system
developers.  The  super  high  level  system  for  safety  non
professional  parallel programming may be the goal of later on
project. We hope that such systems may be developed from PARIS
by  regular  way  used  N.Wirth for extracting Oberon language
from Modula-2.
     The  main  instrument for parallel programing in PARIS is
the  Modula-2  language  and  compiler  for it. The program in
Modula-2  designed as a set of modules. Processes runed by the
program  uses  procedures  from  modules  to express their its
execution algorithms and modules variables to access to global
data structures.
     There are three notions in PARIS system: tasks, processes
and   modules.  Modules  from  one  hand  essentialy  are  the
half-finished   products   to   construct  the  algorithms  of
individal  processes,  from  the  other hand they are the data
managers  and  provides the exclusive or shared access to data
structures.
     There  are  two ways for processes comunication. First is
access to guarded common data sctructures, managed by modules.
Second is message passing through sending and waiting signals.
Forthunatly both types of syncronization may be expressed with
Send/Wait  signal  operations and it is nessesary only add new
standard  type SIGNAL and operation with this type to existing
Modula-2 language.
     We  have  no  need to design new language for programming
the  kernel  of  operating  environment.  We  are  sure,  that
Modula-2 is the good choose for this goals.
     For  the  purposes of compatibility we will implement the
system calls similar to Unix-V operating system. It will allow
us   to   run   existing   C   language  compiler  for  Kronos
microprocessor  and  use  portable C-labrary for ordinary Unix
programs.  Of  course,  additional  libraries will present for
newly   designed  C-programs  specific  multiprocessor  system
facilities.

PARK - parallel Kronos architecture
-----------------------------------

                      Network structure.

     Special  network  of  links is used for data pass between
nodes.  Each  link  is  32-bit width bus, connected to several
nodes. Each node can be connected to a few links. If two nodes
are  connected to the same link then each of them can directly
access  memory of another. Access through the link uses global
physical  memory  address  consisted of two parts: node number
and  word  address  in node's local memory. Link controller of
each  node  provides  request  retranslation  from one link to
another  using map table loaded with information about network
topology. In such way node can access memory of any other node
in  the  system.  Reading  memory  request is processed in two
separate  steps: address transmission and data reception. This
is nessesary to prevent network interconnection deadlocks.


                       Node structure.

     Main  part  of  node  is  CPU.  It  is single chip 32-bit
microprocessor   Kronos   with   architecture   optimized  for
efficient   execution   of  programs  written  in  high  level
programming  languages.  All  computations  are  performed  on
hardware  arithmetic  stack.  Addressing  modes  reflect  data
structures  of  high  level programming languages. Instruction
set   includes   commands   for   access   to  global,  local,
intermediate and external variables, arrays, records. A number
of  commands  are  intended  for  recursive  procedure  calls,
processes and signals management.

     CPU  provides virtual memory support. Address translation
table   is  stored  in  node  local  memory.  Special  address
translation  cash decreases memory access time by holding last
translations'  results.  CPU  uses translation table only when
there  is no physical address of accessed memory page in cash.
Physical  address  may  refer  to  the  local RAM or to RAM of
another  node.  Realy  address translation cash is distributed
between   local   memory   controller   and   number  of  link
controllers.  All  of  them  try  to find translation in their
cashes  in  parallel.  If  no  one  found translation then CPU
calculates  needed  translation  and  puts  it into one of the
cashes.   Physical   address   determine   the  cash  in  wich
translation  will  be stored. So, only one controller may have
translation for the appropriate virtual address. Each task has
its   own  translation  table  and  the  kernel  of  operating
environment  must provide syncronous modification of the table
in nodes executing the task.

     Single  chip  memory  controller  manage array of dynamic
memory  chips.  It  contains  error  detection  and correction
circuit, data cash, memory refresh circuit.

     Disk   drives  and  another  peripheral  devices  can  be
connected  with  SCSI  controller  chip.  Several nodes may be
connected  to  single  SCSI  bus.  Controller  provides direct
memory access for input and output data.

Conclusion
----------

     Offered  parallel  architecture  PARK,  based on ideas of
distributed memory, combines advantages of shared memory model
with  high  speed  local  memory of message-passing transputer
model.   So  it  may  be  the  compromise  way  increases  the
application area of parallel architectures.
     The powerfull of Modula-2 language based on modular style
of  programing  permit  us  to hope that Modula-2 with minimal
extensions  is a good tool to design and implementation kernel
of operating envirtonment. Abstract data types incapsulated in
appropriate  set of modules may support very easy and elegance
tools for efficient parallel software design.
     We  are  very  much obliged to proffesor V. Kotov for his
important  influence  at  our  understanding of major parallel
problems.  We also thanks all members of Kronos Research Group
who are realy promoted this work.


                                       Thursday August 24 1989
                                       Kronos Research Group
                                       Computer Center
                                       Novosibirsk