f77 - conversion of hollerith integers to character

Discussion:

Lynn McGuire

2005-09-07 17:01:16 UTC

I have decided to convert all the Hollerith in our f77 code to character
strings. The Hollerith is stored in integers mostly but sometimes in
doubles. We have about 536,000 lines of code including comments
(probably 15 to 20%). We also have about 23,000 lines of c code.

Does anyone know of a tool to automate this ? It looks like a swamp
to me !

Thanks,
Lynn McGuire

glen herrmannsfeldt

2005-09-07 18:45:39 UTC

Post by Lynn McGuire
I have decided to convert all the Hollerith in our f77 code to character
strings. The Hollerith is stored in integers mostly but sometimes in
doubles. We have about 536,000 lines of code including comments
(probably 15 to 20%). We also have about 23,000 lines of c code.

Doubles allow twice as many characters in one variable.

Post by Lynn McGuire
Does anyone know of a tool to automate this ? It looks like a swamp
to me !

I would probably do it in awk, though I am not sure that I would
recommend that to others. It would depend much on what the code
actually looks like.

Are any variables ever used both for holding characters and
numerical data in the same subroutine? That complicates any
automated method for separating them.

I would have a program make a pass through the program and keep
track of which variables are used in each subroutine for character
data. Then more passes through for any assignment statements using
previously identified variables, until I had found them all.
Then a pass to generate CHARACTER declarations and remove those
variables from other declarations. CHARACTER variables can still
be initialized in DATA statements, read/written in A4 or other A
formats, and assigned to variables.

You probably also need to find which ones are used in CALL statements
to change the matching SUBROUTINE. This would be related to finding
assignments using those variables.
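
A minimal sketch of the repeated passes described above, assuming the source has already been parsed into a table of assignments. The variable names and the `lhs`/`rhs` index tables are hypothetical stand-ins, not taken from the real code; a CALL argument matched to a dummy argument could be entered as one more row in the same table.

```fortran
program propagate_char_use
  implicit none
  integer, parameter :: nvar = 5, nassign = 3
  character(len=6) :: name(nvar)
  logical :: is_char(nvar)
  integer :: lhs(nassign), rhs(nassign)   ! assignment k is name(lhs(k)) = name(rhs(k))
  integer :: k
  logical :: changed

  ! Hypothetical variables from one subroutine.
  name = (/ 'KEY   ', 'ITEMP ', 'LINE  ', 'COUNT ', 'JSAVE ' /)

  ! First pass result: only KEY was seen holding Hollerith data.
  is_char = (/ .true., .false., .false., .false., .false. /)

  ! ITEMP = KEY, JSAVE = ITEMP, COUNT = COUNT
  lhs = (/ 2, 5, 4 /)
  rhs = (/ 1, 2, 4 /)

  ! Further passes: repeat until no new character variable turns up.
  changed = .true.
  do while (changed)
    changed = .false.
    do k = 1, nassign
      if (is_char(lhs(k)) .neqv. is_char(rhs(k))) then
        is_char(lhs(k)) = .true.
        is_char(rhs(k)) = .true.
        changed = .true.
      end if
    end do
  end do

  do k = 1, nvar
    if (is_char(k)) print *, 'CHARACTER candidate: ', trim(name(k))
  end do
end program propagate_char_use
```

With this table the loop flags KEY, ITEMP and JSAVE and leaves LINE and COUNT alone. A variable flagged here that is also used numerically is exactly the dual-use case that needs separate handling.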

Note that spaces are not required in many places in Fortran,
which means that simple parsing methods might not work. I would
be surprised to see it in quality code, though. If you find
statements like

INTEGERI,J,K

declaring variables I, J, and K, then you will need a
real Fortran parser, and somewhat more work to get it done.

-- glen

Lynn McGuire

2005-09-07 19:04:03 UTC

Post by glen herrmannsfeldt
Doubles allow twice as many characters in one variable.

Cool ! Not ! <bg> BTW, I converted this program from single
precision to double precision about 4-5 years ago. We do not
store any A8 in any double, just A4, A3, A2 or A1. Of course,
there are many arrays of such.

Post by glen herrmannsfeldt
Are any variables ever used both for holding characters and
numerical data in the same subroutine? That complicates any
automated method for separating them.

Of course ! This is real fortran code. A-H and O-Z are reals,
I - N are integers. Anything else is fair game. Maybe 10% of the
variables are actually declared in our code.

Post by glen herrmannsfeldt
I would have a program make a pass through the program and keep
track of which variables are used in each subroutine for character
data. Then more passes through for any assignment statements using
previously identified variables, until I had found them all.
Then a pass to generate CHARACTER declarations and remove those
variables from other declarations. CHARACTER variables can still
be initialized in DATA statements, read/written in A4 or other A
formats, and assigned to variables.

Tricky. My single to double precision conversion program did this.

Post by glen herrmannsfeldt
You probably also need to find which ones are used in CALL statements
to change the matching SUBROUTINE. This would be related to finding
assignments using those variables.

Yup, this is the trickiest part. Making the arguments correspond on a
50 argument subroutine can be a real pain in the posterior.

Post by glen herrmannsfeldt
Note that spaces are not required in many places in Fortran,
which means that simple parsing methods might not work. I would
be surprised to see it in quality code, though. If you find
statements like
INTEGERI,J,K

Haven't seen that in a long time, thank goodness. Doesn't mean
that we don't have it out there though.

Thanks,
Lynn

glen herrmannsfeldt

2005-09-07 19:50:05 UTC

Post by Lynn McGuire

Post by glen herrmannsfeldt
Doubles allow twice as many characters in one variable.

I didn't use A8 much either. I once used double, or maybe
even COMPLEX*16 for a run time format. While the run
time format had to be in an array, I could copy it easier
in a single variable.

Post by Lynn McGuire

Post by glen herrmannsfeldt
Are any variables ever used both for holding characters and
numerical data in the same subroutine? That complicates any
automated method for separating them.

Of course ! This is real fortran code. A-H and O-Z are reals,
I - N are integers. Anything else is fair game. Maybe 10% of the
variables are actually declared in our code.

As far as I understand it, you have many variables that
are really constants, initialized in DATA statements.

There might also be variables read in with A4 format, to
compare to the constant variables. Maybe other temporary
variables used to fill an array before writing it out.

I suppose those temporaries could also be used as numeric
temporaries. I suppose, then, that I would first try to convert
so that no variables are ever used both ways by replacing
existing variables with two different variables.

(snip of multiple passes through the program)

Post by Lynn McGuire
Tricky. My single to double precision conversion program did this.

It would seem to be a similar problem.

Post by Lynn McGuire

Yup, this is the trickiest part. Making the arguments correspond on a
50 argument subroutine can be a real pain in the posterior.

But programs are good at counting, so it doesn't bother them
much if there are many arguments.

Post by Lynn McGuire

Haven't seen that in a long time, thank goodness. Doesn't mean
that we don't have it out there though.

I have seen it in the output of programs that generate
Fortran code. Not so much in people written code.

I was remembering in another discussion the MORTRAN2 system
which converted the MORTRAN language, an improved Fortran,
to standard F66 code. As far as I know, though, MORTRAN2
is completely lost.

-- glen

Lynn McGuire

2005-09-07 21:38:43 UTC

Post by glen herrmannsfeldt
There might also be variables read in with A4 format, to
compare to the constant variables. Maybe other temporary
variables used to fill an array before writing it out.

This program has a keyword input file that is read in by line in 80A1.

Keywords are compared a character at a time. This will
require much hand coding but will actually go fairly easily since
"integer line (80)" translates to "character*80 line".
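
A minimal sketch of what that translation buys, assuming a made-up keyword `TITLE` and an internal READ standing in for the real input file. None of the names or the record text come from the actual program.

```fortran
program keyword_demo
  implicit none
  character(len=80) :: line    ! replaces INTEGER LINE(80) read with 80A1
  character(len=80) :: input   ! stands in for one record of the input file

  input = 'TITLE  distillation column 3'

  ! A single (A) edit descriptor reads the whole record.
  read (input, '(A)') line

  ! The keyword test becomes a substring comparison instead of a
  ! character-at-a-time loop over integer array elements.
  if (line(1:5) == 'TITLE') then
    print '(A)', 'keyword TITLE, rest: ' // trim(adjustl(line(6:)))
  else
    print '(A)', 'unknown keyword'
  end if
end program keyword_demo
```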

Post by glen herrmannsfeldt
I suppose those temporaries could also be used as numeric
temporaries. I suppose, then, that I would first try to convert
so that no variables are ever used both ways by replacing
existing variables with two different variables.

I guess that I don't understand your thought here.

Post by glen herrmannsfeldt

Post by Lynn McGuire
Tricky. My single to double precision conversion program did this.

It would seem to be a similar problem.

I might be able to modify this processing program (it was written in C,
since I don't think in AWK). I just need to come up with a set of rules.

We also have a dynamic data storage system that I am worried about.
It consists of data storage allocated in a large common block using
a union of double, two integers or two logical*4. Was quite innovative
back in the middle 1970s. I am wondering how a character*X would
fit into this union (to replace the A4's that are currently stored in
integers). Here is the data structure:

      structure / type64 /
        union
          map
            double precision d
          end map
          map
            integer i
            integer i_low
          end map
          map
            logical l
            integer l_low
          end map
        end union
      end structure

Thanks,
Lynn

e p chandler

2005-09-07 22:01:39 UTC

Post by Lynn McGuire

Post by glen herrmannsfeldt
There might also be variables read in with A4 format, to
compare to the constant variables. Maybe other temporary
variables used to fill an array before writing it out.

This program has a keyword input file that is read in by line in 80A1.
Keywords are compared a character at a time. This will
require much hand coding but will actually go fairly easily since
"integer line (80)" translates to "character*80 line".

I guess that I don't understand your thought here.

Suppose you have a variable named FOO that is used to hold Hollerith
data. Create another variable named FOO_X that is a character variable.
Global search and replace all FOO used in a character context with
FOO_X. Any references to numeric parts of FOO will be left undisturbed.
Of course if you are storing integer or real data into a "string" then
you have other problems.
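
A minimal sketch of that split, assuming a variable FOO that was both initialized with `4HSTOP` and used as a counter in the same routine. The names and values are illustrative only.

```fortran
program split_dual_use
  implicit none
  integer          :: foo     ! keeps its numeric role
  character(len=4) :: foo_x   ! new variable for the former Hollerith role

  ! Before the split: DATA FOO /4HSTOP/ and later FOO = FOO + 1.
  foo_x = 'STOP'              ! every character use of FOO moves to FOO_X
  foo   = 41
  foo   = foo + 1             ! numeric uses of FOO are left undisturbed

  if (foo_x == 'STOP') print *, 'numeric foo = ', foo
end program split_dual_use
```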

I second Glen's idea of using a string processing language or string
utility to at least do some exploratory work. [AWK is very nice IMO for
this sort of stuff.]

Herman D. Knoble

2005-09-08 11:44:16 UTC

Lynn: I think that Tidy: http://www.unb.ca/fredericton/science/chem/ajit/f_tidy.htm
does this.

Skip Knoble

On Wed, 7 Sep 2005 12:01:16 -0500, "Lynn McGuire" <***@nospam.com> wrote:

-|I have decided to convert all the Hollerith in our f77 code to character
-|strings. The Hollerith is stored in integers mostly but sometimes in
-|doubles. We have about 536,000 lines of code including comments
-|(probably 15 to 20%). We also have about 23,000 lines of c code.
-|
-|Does anyone know of a tool to automate this ? It looks like a swamp
-|to me !
-|
-|Thanks,
-|Lynn McGuire
-|

e p chandler

2005-09-08 14:36:46 UTC

Post by Herman D. Knoble
-|I have decided to convert all the Hollerith in our f77 code to character
-|strings. The Hollerith is stored in integers mostly but sometimes in
-|doubles. We have about 536,000 lines of code including comments
-|(probably 15 to 20%). We also have about 23,000 lines of c code.
Lynn: I think that Tidy: http://www.unb.ca/fredericton/science/chem/ajit/f_tidy.htm
does this.

Very nice!

Tidy *does* convert the Hollerith to character strings, but it leaves
the *types* of the variables unchanged. It does generate a log file
which can be used to determine which variables need to have their types
changed.

glen herrmannsfeldt

2005-09-08 15:36:01 UTC

Post by e p chandler

Very nice!
Tidy *does* convert the Hollerith to character strings, but it leaves
the *types* of the variables unchanged. It does generate a log file
which can be used to determine which variables need to have their types
changed.

So then you just need a program to parse the log file and make
the changes in the source. That sounds easier than what we discussed
yesterday.

-- glen

Lynn

2005-09-08 17:08:16 UTC

Post by Herman D. Knoble
Lynn: I think that Tidy: http://www.unb.ca/fredericton/science/chem/ajit/f_tidy.htm
does this.

Nice ! I will try it out. I knew there must be others with this problem also.

Thanks,
Lynn

Walter Spector

2005-09-08 13:57:25 UTC

Post by Lynn McGuire
I have decided to convert all the Hollerith in our f77 code to character
strings....
Does anyone know of a tool to automate this ? It looks like a swamp
to me !

It can be a swamp.

One occasionally sees codes which have a lot of CHARACTER*4 (and
sometimes CHARACTER*6 and CHARACTER*10) arrays. These are obvious relics
from a 'quick and dirty' Hollerith->character conversion. It may have been
quick, and is probably more reliable than Holleriths (which are usually
accompanied by a lot of non-Standard masking/shifting that can be
eliminated.) But some basic problems remain.

A systematic approach is needed. One question though: what compiler
are you targeting? My very first step would be to get the code into a form
that allows for more 'fearless' changing by getting it into a more
Fortran-90-like form. That is:

1.) Make sure all your callers and callees type/kind/rank match. This is
usually easy to do by creating a Fortran-90 MODULE which 'includes' all of
the subroutines and functions:

      module all
      contains
        include 'sub_aa.f'
        include 'sub_ab.f'
           :
        include 'sub_zz.f'
      end module

The above takes about 30 seconds to create (using the ls -1 command and any modern
text editor.) More tedious is that one has to change all the END statements
to either END SUBROUTINE or END FUNCTION. But that can be largely automated
with a couple of simple shell scripts.

Another consideration in your case, I wouldn't be surprised to find a compiler
that would fail to grok 0.5m LOC in a single module. I assume that your code
is broken into many functional subgroups, and this would tend to indicate that
at least that many modules would be needed - corresponding to those subgroups.

You will be amazed at how many errors you never knew you had! But once repaired,
one can make fairly significant changes to the code and get immediate feedback
as to what broke.
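
A minimal sketch of the feedback this gives, assuming a hypothetical routine `scale_by` written inline where the real module would pull the body in with INCLUDE. The two mismatched calls are left commented out so the example compiles.

```fortran
module all_subs
  implicit none
contains
  ! In real use this body would arrive via INCLUDE.
  subroutine scale_by(x, n, factor)
    integer, intent(in) :: n
    real, intent(inout) :: x(n)
    real, intent(in)    :: factor
    x = x * factor
  end subroutine scale_by   ! F90/F95 need END SUBROUTINE here, not a bare END
end module all_subs

program check_calls
  use all_subs
  implicit none
  real :: v(3)

  v = (/ 1.0, 2.0, 3.0 /)
  call scale_by(v, 3, 2.0)      ! matches the interface
  ! call scale_by(v, 3, 2)      ! integer passed for a real dummy: compile error
  ! call scale_by(v, 3)         ! missing argument: compile error
  print *, v
end program check_calls
```

Because the routine lives in a module, the compiler sees its interface at every call site. The same mismatches in separately compiled F77 files would normally go unreported.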

2.) Make sure all your COMMON blocks match by making sure their definitions,
including declarations for the variables, at least reside in INCLUDE files
and used consistently throughout the code. (Better yet is to place the global
data in modules, but I don't view it as strictly necessary at this point.)

Again this may flush out a number of bugs you never knew you had. I recommend
the above two steps for ANY code, and in my mind is the 'minimal f90 conversion'.

As an alternative to the above, some compiler environments provide static
analysers to check all of the above without making a lot of source changes.
(An example is ftnlint - which runs under IRIX.) There are 3rd party tools
to do the same. But I find the f90 approach to be the most trouble free,
portable, and integrated way to do it.

Once 1.) and 2.) have been completed, you will then be poised to start
looking at the Holleriths.

3.) Focus on one functional area at a time. Make a change to a data item, then
recompile to find all the places it breaks. It is easiest to start with data
items which are completely local to a routine, as opposed to ones which are
used in some global sense (e.g., through procedure calls or in COMMON.)

Some EQUIVALENCEing may be temporarily needed to equate the storage of a
new CHARACTER variable with the old INTEGER variables. Again a Fortran-90
compiler is generally better than a Fortran-77 compiler because the rules
for EQUIVALENCEing CHARACTER and non-CHARACTER entities were relaxed a bit
at F90.

Run your regression test base and fix problems. Then repeat step #3 as
needed.

Some things to watch out for (in no particular order):

1.) By the Fortran-66 Standard, Hollerith data is placed into a word
'left-justified, blank-filled'. This is important to keep in mind when
trying to understand the code.

As an extension, many compilers had/have 'left-justified, zero-fill'
and 'right-justified, zero-filled' variants of Hollerith constants
available. These were often used especially in masking/shifting code
to simplify the code.

2.) Sometimes you will find places where pages of Hollerith code can
literally be thrown out and replaced with just a few lines or even a
single intrinsic function call. This was especially true in places
where a lot of masking/shifting was going on to access packed characters
within words. In modern code one can use substring notation with hardly
a thought.

3.) ENCODE/DECODE roughly translate to WRITE/READ on an internal file
(a CHARACTER variable): ENCODE becomes WRITE, DECODE becomes READ.

4.) FORMAT statements:

4a.) When performing formatted I/O, you will have to change formats like
(20A4) to (A80) - or more desirably simply (A).

4b.) Don't waste time worrying about Hollerith strings in output formats.
Though considered obsolete, they are generally harmless. Their main
disadvantage over quoted strings is that one has to count the characters.

5.) Dummy args in subroutine/function containing CHARACTER strings.
One mistake a lot of folks make is to set the string length 'as large
as I'll ever expect it to be', when they SHOULD use an asterisk:

      subroutine charsub (a, b)
        implicit none
        character(80) :: a   ! Usually BAD
        character(*)  :: b   ! Usually CORRECT

Mess the above up and you will get strange memory bashing problems.
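
For items 3 and 5 above, a minimal sketch of the direction of the ENCODE/DECODE mapping and of an assumed-length dummy argument. The ENCODE/DECODE lines in the comments only approximate the old vendor extension, and the buffer size and names are made up for illustration.

```fortran
program internal_io
  implicit none
  character(len=8) :: buf
  integer          :: n, m

  n = 1234

  ! Old extension, roughly: ENCODE (8, 100, BUF) N
  write (buf, '(I8)') n      ! ENCODE -> WRITE to an internal file

  ! Old extension, roughly: DECODE (8, 100, BUF) M
  read (buf, '(I8)') m       ! DECODE -> READ from an internal file

  call show(buf)
  print *, 'decoded value: ', m
contains
  subroutine show(s)
    character(len=*), intent(in) :: s   ! length is taken from the caller
    print *, 'len = ', len(s), ' text = [', s, ']'
  end subroutine show
end program internal_io
```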

Hopefully the above gives you enough to get started. I am sure others
can add to the list.

Walt
-...-
Walt Spector
(w6ws att earthlinkk dott nett)

Lynn

2005-09-08 18:28:08 UTC

Post by Walter Spector
One occasionally sees codes which have a lot of CHARACTER*4 (and
sometimes CHARACTER*6 and CHARACTER*10) arrays. These are obvious relics
from a 'quick and dirty' Hollerith->character conversion. It may have been
quick, and is probably more reliable than Holleriths (which are usually
accompanied by a lot of non-Standard masking/shifting that can be
eliminated.) But some basic problems remain.

That is what I was considering doing. Sounds like character*x might be the
wrong way to do this.

Post by Walter Spector
A systematic approach is needed. One question though: what compiler

We currently use Open Watcom F77. It does have many F90 extensions
already but does not have interfaces or modules.

I am trying to port to Intel Visual Fortran 9.0 but have run into many, many
bugs which I am trying to sort thru.

Post by Walter Spector
are you targeting? My very first step would be to get the code into a form
that allows for more 'fearless' changing by getting it into a more
1.) Make sure all your callers and callees type/kind/rank match. This is
usually easy to do by creating a Fortran-90 MODULE which 'includes' all of
module all
contains
include 'sub_aa.f'
include 'sub_ab.f'
include 'sub_zz.f'
end module

Looks good ! I will try this.

Post by Walter Spector
The above takes about 30 seconds to create (using the ls -1 command and any modern
text editor.) More tedious is that one has to change all the END statements
to either END SUBROUTINE or END FUNCTION. But that can be largely automated
with a couple of simple shell scripts.

Ouch ! We are down to about 2500 files. A lot of files to change.

Post by Walter Spector
Another consideration in your case, I wouldn't be surprised to find a compiler
that would fail to grok 0.5m LOC in a single module. I assume that your code
is broken into many functional subgroups, and this would tend to indicate that
at least that many modules would be needed - corresponding to those subgroups.

I would not be surprised either.

Post by Walter Spector
You will be amazed at how many errors you never knew you had! But once repaired,
one can make fairly significant changes to the code and get immediate feedback
as to what broke.
2.) Make sure all your COMMON blocks match by making sure their definitions,
including declarations for the variables, at least reside in INCLUDE files
and used consistently throughout the code. (Better yet is to place the global
data in modules, but I don't view it as strictly necessary at this point.)

Am doing that also. We currently have 179 include files and are removing
all common blocks declarations from the code. Is tough since the member
names are not the same between each subroutine.

Post by Walter Spector
Again this may flush out a number of bugs you never knew you had. I recommend
the above two steps for ANY code, and in my mind is the 'minimal f90 conversion'.

Sounds good.

Post by Walter Spector
5.) Dummy args in subroutine/function containing CHARACTER strings.
One mistake a lot of folks make is to set the string length 'as large
subroutine charsub (a, b)
implicit none
character(80) :: a ! Usually BAD
character(*) :: b ! Usually CORRECT
Mess the above up and you will get strange memory bashing problems.

yup, i've seen those.

Thanks,
Lynn

Walter Spector

2005-09-09 05:05:27 UTC

Post by Lynn

That is what I was considering doing. Sounds like character*x might be the
wrong way to do this.

Well, it is a question of converting something like:

      SUBROUTINE CGREATER (A, B, N, T)
      INTEGER A(1), B(1)
      INTEGER N
      LOGICAL T

      INTEGER C1, C2

      INTEGER NBYPW, NBIPW, NBIPBY, IBYMSK
      DATA NBYPW, NBIPW, NBIPBY/4, 32, 8/
      DATA IBYMSK/O'177'/

      DO 10, I=1, N
        IF (A(I) .NE. B(I)) GO TO 20
   10 CONTINUE
      T = .FALSE.
      RETURN

   20 CONTINUE
      DO, J=0, NBYPW-1
        C1 = SHIFTR (A(I),NBIPW-(J+1)*NBIPBY) .AND. IBYMSK
        C2 = SHIFTR (B(I),NBIPW-(J+1)*NBIPBY) .AND. IBYMSK
        IF (C1 .NE. C2) GO TO 25
      END DO
   25 CONTINUE
      T = C1 .GT. C2
      RETURN
      END

(above done on the fly and illustrative. I am sure there are bugs...) To:

      SUBROUTINE CGREATER (A, B, N, T)
      character(4) :: A(1), B(1)   ! Quick and dirty
      INTEGER N
      LOGICAL T

      character(1) :: C1, C2

      DO 10, I=1, N
        IF (A(I) .NE. B(I)) GO TO 20
   10 CONTINUE
      T = .FALSE.
      RETURN

   20 CONTINUE
      DO, J=1, 4
        C1 = A(I)(J:J)   ! But at least I get rid of all the
        C2 = B(I)(J:J)   ! shifting and masking...
        IF (C1 .NE. C2) GO TO 25
      END DO
   25 CONTINUE
      T = C1 .GT. C2
      RETURN
      END

Or:
      subroutine cgreater (a, b, t)
        character(*), intent(in) :: a, b
        logical, intent(out) :: t

        t = a > b

      end subroutine

I know which one I'd rather maintain...

Post by Lynn
.... We currently have 179 include files and are removing
all common blocks declarations from the code. Is tough since the member
names are not the same between each subroutine.

And this is exactly why it needs to be repaired prior to making large code changes.
Otherwise one can spend a lifetime chasing down buglets due to the inconsistencies.

Walt
(w6ws att earthlinkk dott nett)

Walter Spector

2005-09-10 17:12:54 UTC

Post by Lynn
.... We currently have 179 include files and are removing
all common blocks declarations from the code. Is tough since the member
names are not the same between each subroutine...

One more suggestion: If you are having to change a lot of names in
the code, it may be better to use Fortran-90 MODULEs to contain your
COMMON data, then use renaming to minimize code changes.

For example suppose you had a COMMON block like the following:

REAL A, B, C
COMMON /MYDATA/ A, B, C

And in some routines the names are different:

REAL D, E, F
COMMON /MYDATA/ D, E, F

You decide which of the two makes the most sense. Then create
a module:

      MODULE MYDATA_module
        IMPLICIT NONE

        REAL A, B, C
        COMMON /MYDATA/ A, B, C   ! Leave this until you don't need it anymore

      END MODULE

Obviously in most cases you can just USE the module. But in places
where the 2nd definition is used, you can do something like:

USE MYDATA_module, D=>A, E=>B, F=>C

If D, E, and F were used large numbers of times in the body of the
routine, then the renaming would save a lot of time.

Walt
(w6ws att earthlinkk dott nett)

Lynn McGuire

2005-09-12 16:41:12 UTC

Post by Walter Spector
One more suggestion: If you are having to change a lot of names in
the code, it may be better to use Fortran-90 MODULEs to contain your
COMMON data, then use renaming to minimize code changes.

The problem is that in the subroutines with name changes in the
common blocks, the reason was that there were already local
variables using those names. Rather than change the names of
the local variables, the idiot integrating the common block chose
to modify the names of the variables in the common block.

This software has had several other software packages integrated
into it over the years. Yes, it has stitch marks and neck bolts.

Thanks,
Lynn

Walter Spector

2005-09-13 04:59:50 UTC

Post by Lynn McGuire

Right. He had 'namespace collisions'.

Again, F90 modules give you a tool to deal with such collisions.
Reread my previous posting. In most routines one would simply
write:

USE MYDATA_module

But in routines where namespace problems exist, you can 'rename'
the items (on the fly, locally):

USE MYDATA_module, D=>A, E=>B, F=>C

Note that nothing has changed inside the module. It was separately
compiled beforehand.

The F90 USE statement also allows you the option of only making
specific names visible:

USE MYDATA_module, ONLY: A

This might be useful if you really only needed one or a few of
the values in the module (and perhaps others that you really didn't
need anyway are causing naming collisions...)
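
A minimal sketch of the collision case, assuming a hypothetical routine `legacy_user` that already owns a local variable A. Combining ONLY with a rename brings in just the one module item under a different local name, so the local A stays untouched.

```fortran
module mydata_module
  implicit none
  real :: a, b, c
  common /mydata/ a, b, c    ! kept while unconverted routines still use it
end module mydata_module

subroutine legacy_user(total)
  ! Only the module's A is made visible, and under the local name D.
  use mydata_module, only: d => a
  implicit none
  real, intent(out) :: total
  real :: a                  ! pre-existing local variable, unchanged
  a = 10.0
  total = a + d              ! local A plus the COMMON block's A
end subroutine legacy_user

program rename_demo
  use mydata_module
  implicit none
  real :: r
  a = 1.5                    ! sets the shared A in /MYDATA/
  call legacy_user(r)
  print *, r
end program rename_demo
```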

Walt
-...-
Walt Spector
(w6ws att earthlinkk dott nett)