-encoding and -source in module-info.java


-encoding and -source in module-info.java

Jesse Glick
The encoding and source level of a module are fundamental attributes of its sources, without which you cannot reliably even parse a syntax tree, so I think they should be
declared in module-info.java. Otherwise it is left up to someone calling javac by hand, or a build script, to specify these options; that is potentially error-prone, and
means that tools which inspect sources (including but not limited to IDEs) need to have some separate mechanism for configuration of these attributes: you cannot just
hand them the sourcepath and let them run.

I am assuming that all files in the sourcepath use the same encoding and source level, which seems a reasonable restriction.


As to the source level, obviously given that JDK 8 will introduce module-info.java, "8" (or "1.8") seems like the right default value; but a syntax ought to be defined
for specifying a newer level, e.g.

   source 1.9; // or 9?

Furthermore I think that JDK 9+ versions of javac should keep the same default source level - you should need to explicitly mark what version of the Java language your
module expects. Otherwise a module might compile differently according to which version of javac was used, which is undesirable, and tools cannot guess what version you
meant. A little more verbosity here seems to be justified.

Whether the bytecode target (-target) should be specified in module-info.java is another question. I have seen projects built using -target 5 for JDK 5 compatibility but
also in a separate artifact using -target 6 for speed on JDK 6+ (split verifier). Probably the target level should default to the source level, and in the rare case that
you need to override this, you can do so using a javac command option - this has no impact on tools which just need to parse and analyze source files.


As to the encoding, something like

   encoding ISO-8859-2;

would suffice. The obvious problems for encoding are

1. What should the default value be? javac currently uses the platform default encoding, which IMHO is a horrible choice because it means that two people running javac
with the same parameters on the same files may be producing different classes and/or warning messages. I would suggest making UTF-8 be the default when compiling in
module mode (leaving the old behavior intact for legacy mode). For developers who want to keep sources in a different character set, adding one line per module-info.java
does not seem like much of a burden.

2. What is module-info.java itself encoded in? If not UTF-8, then you need to be able to reliably find the encoding declaration and then rescan the file in that encoding.
That is easy for most encodings (just do an initial scan in ISO-8859-1), including everything commonly used by developers AFAIK; a little trickier for UTF-16/32-type
encodings but possible by ignoring 0x00/0xFE/0xFF; and only fails on some mainframe charsets, old JIS variants, and dingbats (*). Even those rare cases are probably
guessable. [1]


(*) Demo program:

import java.nio.charset.Charset;
import java.util.Arrays;
public class CharsetTest {
    public static void main(String[] args) {
        // ISO-8859-1 maps every byte value to a character, so decoding with it
        // never fails; use it as the "raw" byte-level view of the encoded text.
        Charset raw = Charset.forName("ISO-8859-1");
        for (Charset c : Charset.availableCharsets().values()) {
            // A module-info.java-style file declaring its own encoding.
            String text = "/* leading comment */\nmodule test {\n  encoding " + c.name() + ";\n}\n";
            byte[] encoded;
            try {
                encoded = text.getBytes(c);
            } catch (UnsupportedOperationException x) {
                // Some charsets are decode-only and cannot encode at all.
                System.out.println("cannot encode using " + c.name());
                continue;
            }
            if (Arrays.equals(encoded, text.getBytes(raw))) {
                // ASCII-superset charset: the declaration is directly visible as bytes.
                System.out.println("OK in " + c.name());
            } else if (new String(encoded, raw).contains("  encoding " + c.name() + ";")) {
                // Declaration survives as a byte substring despite other differences.
                System.out.println("substring match in " + c.name());
                dump(encoded);
            } else if (new String(encoded, raw).replace("\u0000", "").contains("  encoding " + c.name() + ";")) {
                // UTF-16/32-style charsets: declaration visible once NUL padding is ignored.
                System.out.println("NUL-stripped match in " + c.name());
                dump(encoded);
            } else {
                // Charset in which a simple scan cannot find the declaration.
                System.out.println("garbled in " + c.name());
                dump(encoded);
            }
        }
    }
    private static void dump(byte[] encoded) {
        // Print printable ASCII literally, NUL as '@', anything else as \XX hex.
        for (byte b : encoded) {
            if (b >= 32 && b <= 126 || b == '\n' || b == '\r') {
                System.out.write(b);
            } else if (b == 0) {
                System.out.print('@');
            } else {
                System.out.printf("\\%02X", b);
            }
        }
        System.out.println();
    }
    private CharsetTest() {}
}


[1] http://jchardet.sourceforge.net/

Re: -encoding and -source in module-info.java

Neal Gafter
Overall, I agree it would be nice to place these options together in a file
for the compiler to consume.  I just don't think it should be a "Java
programming language source" file.

The idea of using module-info.java to specify compilation options seems
most tempting, but as long as it is a source file in the language, it must
be subject to those options as well.  Which means that specifying them
inside the file itself is pretty pointless.  As to the source level, what
can the language specification say other than that "8" is the only allowed
value?  And what can the next version of the language specification say
other than that "9" is the only allowed value?


Re: -encoding and -source in module-info.java

Eric Johnson
Of course, to me, this circles back to the point I've chimed in on with
others before. Module info shouldn't be in a file ending in .java. It
*isn't* part of the language, it is part of the metadata for using the
language.

-Eric.

On 1/24/12 10:35 PM, Neal Gafter wrote:
> The idea of using module-info.java to specify compilation options seems
> most tempting, but as long as it is a source file in the language, it must
> be subject to those options as well.  Which means that specifying them
> inside the file itself is pretty pointless.  As to the source level, what
> can the language specification say other than that "8" is the only allowed
> value?  And what can the next version of the language specification say
> other than that "9" is the only allowed value?

Re: -encoding and -source in module-info.java

Jesse Glick
In reply to this post by Neal Gafter
On 01/24/2012 04:35 PM, Neal Gafter wrote:
> I agree it would be nice to place these options together in a file for the compiler to consume.  I just don't think it should be a "Java programming language
> source" file.

Well in the current Jigsaw prototype there is only one source of metadata, and that is module-info.java. Whether such a file extension/name/style is appropriate for
module metadata is a long-running argument on the list which I did not want to get into.

> as long as it is a source file in the language, it must be subject to those
> options as well.  Which means that specifying them inside the file itself is pretty pointless.

I do not think so. It is admittedly a bit tricky with respect to encoding, but the XML spec has managed this.

>  As to the source level, what can the language specification say other than
> that "8" is the only allowed value?

It could say that it must follow a decimal-type format (which may be more restrictive than module versions), and that 1.8 is the smallest legal value; but it could
probably just say that 1.8 is the only allowed value. (I am punting for now on whether the abbreviation "8" is legal in this context.)

>  what can the next version of the language specification say other than that "9" is the only allowed value?

I am not sure precisely how the language spec would be worded but it would say that for this revision of the spec, 1.9 is the only legal value. As to the actual compiler
in JDK 9, it would accept either 1.8 or 1.9, and control its behavior accordingly, just as if you had passed -source on the CLI.

It is only necessary to ensure that the module-info.java token structure is not changed so radically between releases that a single file could be interpreted in multiple
ways, which does not seem like a likely problem given that the much richer "regular" Java language token structure has not changed much since 1.0. A defense against this
kind of change is to force the encoding and/or source level to be specified in a special simplistic header (like a pragma) at the very top of the file outside the regular
token stream, though this would look inconsistent with the rest of the file.

Bear in mind that

<?xml version="1.0" encoding="UTF-8"?>
<čau/>

works fine; an XML parser supporting the 1.1 spec of course also reads version="1.0" files, adjusting its permitted syntax slightly according to the declared version, and
reads the rest of the file according to the declared encoding. Similarly, HTML files can specify both their encoding and specification level in band.

Re: -encoding and -source in module-info.java

Remi Forax
On 01/24/2012 11:07 PM, Jesse Glick wrote:

> I do not think so. It is admittedly a bit tricky with respect to
> encoding, but the XML spec has managed this.

No, it doesn't work. As an example, try to specify UTF-16 as the encoding of
an XML file.
It only works for US-ASCII, UTF-8, and other encodings that contain the first
128 ASCII values and encode each of them as a single byte.
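The restriction can be made concrete by comparing raw bytes: in UTF-16 every ASCII character carries a 0x00 byte, so a parser scanning for the literal ASCII bytes of a declaration will not find them unless it special-cases NULs and BOMs. A small illustration (editorial sketch, not from the original post):

```java
import java.nio.charset.StandardCharsets;

public class Utf16Bytes {
    public static void main(String[] args) {
        String prolog = "<?xml";
        byte[] ascii = prolog.getBytes(StandardCharsets.US_ASCII);  // 3C 3F 78 6D 6C
        byte[] utf16 = prolog.getBytes(StandardCharsets.UTF_16BE);  // 00 3C 00 3F 00 78 00 6D 00 6C
        // An ASCII-superset encoding leaves the declaration directly visible as
        // bytes; UTF-16 interleaves NUL bytes, so a naive byte-level scan misses it.
        System.out.println(ascii.length + " bytes in US-ASCII, "
                + utf16.length + " bytes in UTF-16BE");
    }
}
```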

Rémi


Re: -encoding and -source in module-info.java

Remi Forax
In reply to this post by Eric Johnson
On 01/24/2012 10:47 PM, Eric Johnson wrote:
> Of course, to me, this circles back to the point I've chimed in on
> with others before. Module info shouldn't be in a file ending in
> .java. It *isn't* part of the language, it is part of the metadata for
> using the language.
>
> -Eric.

There is no need to put compilation options in the module description file;
defining a module and compiling sources are two different tasks.
The module metadata describes the requires/provides of a module, and that is it.
Why do you want to mix these two things together?

I believe that Maven has the two things mixed only because there is
no module description in Java. When Java 8 is out, this will be clearer.

Rémi



Re: -encoding and -source in module-info.java

Jesse Glick
In reply to this post by Remi Forax
On 01/24/2012 05:14 PM, Rémi Forax wrote:
> it doesn't work. As an example try to specify utf-16 as encoding of an XML file.

Works for me in Chrome and Xerces, even if there is no BOM.

http://www.w3.org/TR/2008/REC-xml-20081126/#sec-guessing

Re: -encoding and -source in module-info.java

Remi Forax
On 01/25/2012 12:14 AM, Jesse Glick wrote:
> On 01/24/2012 05:14 PM, Rémi Forax wrote:
>> it doesn't work. As an example try to specify utf-16 as encoding of
>> an XML file.
>
> Works for me in Chrome and Xerces, even if there is no BOM.
>
> http://www.w3.org/TR/2008/REC-xml-20081126/#sec-guessing

Nice to know that it works in Chrome and Xerces for UTF-16,
but it doesn't work with the parser used by my colleagues,
provided by a lowly company from Redmond.

Rémi



Re: -encoding and -source in module-info.java

Jesse Glick
On 01/24/2012 06:41 PM, Rémi Forax wrote:
> Nice to know that it works in Chrome and Xerces for UTF16,
> but it doesn't work with the parser used by my colleagues and
> provided by a lowly company of Redmond.

I am merely using the XML prolog example to point out that it is possible to sniff an encoding declaration out of a file using that same encoding, even for the obscure
case of non-ASCII-superset encodings; XML parsers are not required to do so, since sometimes an encoding is specified out of band.

The cases are different technically because (1) XML defines no default encoding, whereas Jigsaw could and I think should, so a legal module-info.java not using UTF-8 must
specify its actual encoding which makes detection easier; (2) the XML prolog is required to be the first bytes in the file, which makes detection easier, whereas this
might be considered too ugly for module-info.java; (3) there would be one widely used reference implementation of the parser, namely javac, and that and other
implementations must pass the same TCK which I would expect to test the corner cases.

Re: -encoding and -source in module-info.java

Alex Buckley
In reply to this post by Jesse Glick
Compilation units in the Java programming language consist of UTF-16
code units. No desire to change that. A compiler may support other
encodings but that's an implementation detail and so does not belong in
the Java SE platform's idea of a module declaration.

Documenting and/or enforcing the language level is a much trickier topic
that I doubt will be addressed by the module system. The API level
available to a program can theoretically be "configured" by depending on
a given version of java.base, but it's still up to a compiler to a)
switch its language level to match the desired API and b) switch its own
use of platform classes to match the desired API
(http://blogs.oracle.com/darcy/entry/bootclasspath_older_source).

Alex


Re: -encoding and -source in module-info.java

Neal Gafter
In reply to this post by Jesse Glick
On Tue, Jan 24, 2012 at 2:07 PM, Jesse Glick <[hidden email]> wrote:
>
> I am not sure precisely how the language spec would be worded but it would
> say that for this revision of the spec, 1.9 is the only legal value


The compiler is required to obey the platform specifications, which
includes the language spec.  If "source 1.8" is illegal, then it must be
rejected.  And therefore a source file written for the previous version of
the language "source 1.8" is not legal in the latest version of the
language.  This is exactly the opposite of what Oracle should be trying to
achieve with version-to-version source compatibility.

Re: -encoding and -source in module-info.java

Jesse Glick
On 01/24/2012 11:31 PM, Neal Gafter wrote:
> The compiler is required to obey the platform specifications, which includes the language spec.

...and _previous_ language specs.

>  If "source 1.8" is illegal, then it must be rejected.  And therefore a
> source file written for the previous version of the language "source 1.8" is not legal in the latest version of the language.

Yes, but it is legal in the previous version of the language, so javac from JDK 9 would compile the module without complaints as if you had passed -source 1.8 on the
command line. You just would not be able to use reified generics (or whatever source 1.9 brings you) in that module.

Re: -encoding and -source in module-info.java

Jesse Glick
In reply to this post by Alex Buckley
On 01/24/2012 08:08 PM, Alex Buckley wrote:
> Compilation units in the Java programming language consist of UTF-16 code units.

Of course. That is a character set, not an encoding.

> so does not belong in the Java SE platform's idea of a module declaration.

Perhaps not; theoretically a compiler could be loading sources out of a database that provides Unicode strings natively. But javac invoked from the command line always
works on files, and these have a definite layout in the filesystem - including $root/module-info.java - and must have a specific encoding. This means that a directory
tree representing sources for a Java module has an unambiguous meaning in the Java language only if something in that tree declares the encoding. That would suggest
something like $root/encoding containing the (ASCII-encoded) name of the encoding.
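A reader for that hypothetical $root/encoding file might look like this; the file name, its ASCII content, and the UTF-8 fallback are all assumptions from this proposal, not part of any spec:

```java
import java.io.IOException;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;

public class SourceRootEncoding {
    // Hypothetical: $root/encoding holds the ASCII-encoded charset name for the
    // whole source tree; if absent, fall back to UTF-8 as the assumed
    // module-mode default.
    static Charset of(Path root) throws IOException {
        Path decl = root.resolve("encoding");
        if (Files.isRegularFile(decl)) {
            String name = new String(Files.readAllBytes(decl), StandardCharsets.US_ASCII).trim();
            return Charset.forName(name);
        }
        return StandardCharsets.UTF_8;
    }
}
```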

Alternately, require modular Java sources to use a standard encoding, presumably UTF-8, as modern languages are tending to do.

> The API level available to a program can
> theoretically be "configured" by depending on a given version of java.base

(Or java.boot?) This is an interesting option newly available with Jigsaw - basically to tie the source level (and I suppose the target level) to that supported by the
platform itself. This is a sensible enough default; the question is whether some modules may still need to

1. Use a newer source level than the platform supports, while using the supported target level. Historically this has worked in certain cases, where third-party libraries
can fill in missing platform classes. Of course it cannot work in all or even most cases.

2. Use an older source level than the platform supports, while using the recommended target level. Useful mainly in case new platform APIs are wanted, but a language
change was incompatible for this module.

3. Use an older source level and an older target level. Useful in case the module must be able to run on the older platform, but can use a handful of new APIs where
available (checked via reflection perhaps).

You are still left with the problem of a module which declares no explicit platform dependency: some default value must be used for the source level, and changing this
default according to which version of javac happens to be used to compile it is a very bad idea.

> it's still up to a compiler to a) switch its language level to match the desired API

The main issue here would be whether it is possible to determine the version of java.base/boot/whatever being requested before knowing for sure what version of the
language module-info.java is written in. If a new language spec radically changed how module versions are requested, it might be tricky to interpret the requires clauses
unambiguously.

> b) switch its own use of platform classes to match the desired API

This issue at least would finally be addressed by modular javac: your module would compile against whatever platform API it requests, I hope downloading older platform
modules on demand. In current Java projects it is generally impractical to pass -bootclasspath to a project with a CI-friendly build script - you would have to include a
copy of the target JDK's rt.jar in versioned project sources - so generally this is not done, yet -source must be passed for predictability, so everyone using JDK 7 javac
to compile gets the annoying warning mentioned in that blog. (Using -Xlint:-options is undesirable since then unrelated problems with javac options will be suppressed.)

Re: -encoding and -source in module-info.java

Alex Buckley
In reply to this post by Jesse Glick
On 1/25/2012 7:14 AM, Jesse Glick wrote:
> On 01/24/2012 11:31 PM, Neal Gafter wrote:
>> The compiler is required to obey the platform specifications, which
>> includes the language spec.
>
> ...and _previous_ language specs.

No, a compiler that complies with Java SE n only needs to respect the
JLS in Java SE n. (Getting off topic now.)

Alex

Re: -encoding and -source in module-info.java

Alex Buckley
In reply to this post by Jesse Glick
On 1/25/2012 7:47 AM, Jesse Glick wrote:
> On 01/24/2012 08:08 PM, Alex Buckley wrote:
>> Compilation units in the Java programming language consist of UTF-16
>> code units.
>
> Of course. That is a character set, not an encoding.

Certainly UTF-16 is an encoding. Recall the difference between code
points and code units.

> Perhaps not; theoretically a compiler could be loading sources out of a
> database that provides Unicode strings natively. But javac invoked from
> the command line always works on files, and these have a definite layout
> in the filesystem - including $root/module-info.java - and must have a
> specific encoding. This means that a directory tree representing sources
> for a Java module has an unambiguous meaning in the Java language only
> if something in that tree declares the encoding. That would suggest
> something like $root/encoding containing the (ASCII-encoded) name of the
> encoding.

All those things are implementation details and so don't belong in a
module declaration as defined by the JLS. (This applies regardless of
the syntax used for a module declaration.)

> Alternately, require modular Java sources to use a standard encoding,
> presumably UTF-8, as modern languages are tending to do.

Again, there is no desire to change the JLS to support encodings of
Unicode other than UTF-16.

> (Or java.boot?) This is an interesting option newly available with
> Jigsaw - basically to tie the source level (and I suppose the target
> level) to that supported by the platform itself. This is a sensible
> enough default; the question is whether some modules may still need to

We are well aware of these interactions.

> You are still left with the problem of a module which declares no
> explicit platform dependency: some default value must be used for the
> source level, and changing this default according to which version of
> javac happens to be used to compile it is a very bad idea.

If there's no explicit java.base dependence, Jigsaw specifies an
implicit dependence on the "current" platform's API (java.base). javac
will continue to assume the "current" platform's language level, so
there's a match.

Alex

Re: -encoding and -source in module-info.java

Jesse Glick
On 01/25/2012 02:16 PM, Alex Buckley wrote:
> All those things are implementation details and so don't belong in a module declaration as defined by the JLS.

Then the encoding needs to be specified somehow in the source root outside of a module declaration (such as my proposed $root/encoding file). I do not really care whether
the JLS ever mentions it or not, so long as javac interprets disk files with a predictable encoding - so that all Java language tools which operate on files (*) will be
able to parse a given module source tree the same way. Currently they need out-of-band information to do so, and that is a significant problem.

>> Alternately, require modular Java sources to use a standard encoding,
>> presumably UTF-8, as modern languages are tending to do.
>
> there is no desire to change the JLS to support encodings of Unicode other than UTF-16.

The representation of surrogate code points using 16-bit code units is not what I am discussing. The "alternate" proposal was for disk files to be unconditionally
interpreted by command-line javac as being in UTF-8 format, as if '-encoding UTF-8' were passed to an earlier version of the tool. Obviously this is meaningless for the
hypothetical case discussed in the JLS of a program stored in a database, or the more practical case of a program using JSR 199 and providing a FileObject that supports
only a Reader and not an InputStream.
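To sketch what "unconditionally interpreted as UTF-8" could mean for a file-based tool (illustrative only: the class name is invented, and whether malformed input should be a hard error rather than a replacement character is itself a design choice):

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.CharsetDecoder;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

public class StrictUtf8 {
    // Decode source bytes strictly as UTF-8: malformed sequences are
    // reported as errors rather than silently replaced with U+FFFD.
    static String decodeStrict(byte[] bytes) throws CharacterCodingException {
        CharsetDecoder dec = StandardCharsets.UTF_8.newDecoder()
                .onMalformedInput(CodingErrorAction.REPORT)
                .onUnmappableCharacter(CodingErrorAction.REPORT);
        return dec.decode(ByteBuffer.wrap(bytes)).toString();
    }

    public static void main(String[] args) throws Exception {
        System.out.println(decodeStrict("module test {}".getBytes(StandardCharsets.UTF_8)));
        try {
            decodeStrict(new byte[] { (byte) 0xE9 }); // ISO-8859-1 'é', invalid UTF-8
        } catch (CharacterCodingException e) {
            System.out.println("malformed UTF-8 rejected");
        }
    }
}
```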

> If there's no explicit java.base dependence, Jigsaw specifies an implicit dependence on the "current" platform's API (java.base).

This is bad since it makes the interpretation of the source module dependent on the context in which it is interpreted. Among other problems, if the language level is
inferred from the version of the platform, using a different contextual platform than the original author intended could mean that the exact same source module not only
_behaves_ differently when run (which is to be expected) but _parses_ differently as well.


(*) Or sources otherwise grouped in the conventional way and potentially accessible via java.nio.file.FileSystem, such as inside a ZIP.

Re: -encoding and -source in module-info.java

Peter Jensen
On 01/25/12 12:47, Jesse Glick wrote:

> On 01/25/2012 02:16 PM, Alex Buckley wrote:
>> All those things are implementation details and so don't belong in a
>> module declaration as defined by the JLS.
>
> Then the encoding needs to be specified somehow in the source root
> outside of a module declaration (such as my proposed $root/encoding
> file). I do not really care whether the JLS ever mentions it or not,
> so long as javac interprets disk files with a predictable encoding -
> so that all Java language tools which operate on files (*) will be
> able to parse a given module source tree the same way. Currently they
> need out of band information to do so, and that is a significant problem.
A Java compiler may support different encodings, it may support
compilation of source code written against different versions of the JLS,
and it may support generation of Java bytecode relying on APIs as defined
by different versions of the platform, ...

Specifying such options, in whatever format, means specifying common
capabilities where it has previously been open to the implementation.

While there might be sense in doing so, I fail to recognize why modules
make this any more, or less, of an issue. Did I miss an argument why
this is especially critical for modules? Or is it more a matter of
filling an existing void?


Re: -encoding and -source in module-info.java

Jesse Glick
On 01/25/2012 06:36 PM, Peter Jensen wrote:
> A Java compiler may support different encodings

The Java platform specifies a minimum set of encodings which are guaranteed to be supported, so I would expect that these would be guaranteed for source file encoding
too. Whether it is prudent to permit a source tree to use an "optional" encoding, i.e. whether a compiler implemented in the JVM should support any extra encoding that
JVM actually supports, is another question.
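For the record, the guaranteed minimum is the six charsets enumerated by java.nio.charset.StandardCharsets; a quick check (illustrative only - this says nothing about what -encoding would be required to accept):

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;
import java.util.List;

public class GuaranteedCharsets {
    public static void main(String[] args) {
        // Every conforming Java implementation must support these six.
        List<Charset> guaranteed = Arrays.asList(
                StandardCharsets.US_ASCII, StandardCharsets.ISO_8859_1,
                StandardCharsets.UTF_8, StandardCharsets.UTF_16BE,
                StandardCharsets.UTF_16LE, StandardCharsets.UTF_16);
        for (Charset c : guaranteed) {
            System.out.println(c.name() + ": " + Charset.isSupported(c.name()));
        }
    }
}
```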

> it may support compilation of source code written against different version of the JLS

That is why I recommended that the source level be physically declared in the source root. (Whether in module-info.java itself or in some other standardized place does
not make much practical difference, so long as either the JLS or the javac tool docs specify it.)

> Specifying such options, in whatever format, means specifying common capabilities where it has previously been open to the implementation.

-encoding and -source at least are considered "standard" options (not -X prefixed). Anyway, a (file-based) compiler could not work at all unless it knew the values of
these two fundamental parameters: they are intrinsic properties of the sources themselves.

> [is this] especially critical for modules? Or is it more a matter of filling an existing void?

There is an existing void. The reason I bring this up in the context of Jigsaw is that in JDK 7 and earlier, a naked source tree could not be analyzed in any very useful
way at all because you would lack any information about its expected classpath: even if you optimistically guessed at encoding (ISO-8859-1?) and source level (1.7?), you
would get no further than a syntax parse. Even a wildcard import would be impossible to interpret. So in the existing language, any tool which wants to work with a source
tree at more than a superficial level - an IDE, a static analyzer, HTMLized code browser, structural grep, etc. - really has to be supplied a bunch of metadata (*) to
make any sense of it, especially a classpath. Given that, you might as well throw in an encoding and source level at the same time.

By contrast, under the current Jigsaw design a tool could get complete "classpath" information (**) right from $basedir/module-info.java (***), making it possible for the
first time to be run without any configuration beyond the location of the source root (or a file within that root) - but only if the encoding and source level can also be
obtained. It would seem a pity to have almost, but not quite, self-describing module sources in JDK 8.

My recent post about migrating -processorpath to module-info.java is on the same topic. While there are some processors which are elective for a given module - a
programmer or CI build might run them on some occasions or might not - many processors are integral to the function of a library used by the module. The processor might
perform mandatory semantic error checks, without which an IDE or refactoring tool would be unaware that it was about to make an invalid change. More critically, the
processor might generate Java sources which are required for regular project sources to compile - so without knowledge of which processors are expected to be run by
javac, a source-based tool would be left with unexplained missing symbols.

Those javac options which could be freely changed for a given source tree without really affecting the meaning of the input, such as -g or -Xmaxerrs or -Xprefer:source,
are irrelevant to source-based tools and do not belong in portable source metadata. How -target should be treated is debatable. -Xlint suboptions are a grey area in
between intrinsic and incidental attributes - a given module may be tuned to be "lint-free" for certain classes of warning only - so it would be nice if @SuppressWarnings
could be specified in module-info.java and considered to apply to all compilation units in the module (meaning it is OK to use -Xlint:all).
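To make that suggestion concrete, a module declaration might hypothetically look like this (invented syntax - nothing of the sort exists in the current Jigsaw design, and the module names are made up):

```java
// Hypothetical only: module-wide warning suppression in module-info.java,
// applying to every compilation unit in the module.
@SuppressWarnings("deprecation")
module com.example.app {
    requires org.example.lib;
}
```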


(*) Or be embedded in or otherwise integrated with Maven, which is the only broadly accepted metadata for Java sources today. While reading maven-compiler-plugin
parameters is cumbersome, it at least can be done pretty reliably. Script-oriented build tools (Ant, Gradle, ...) are more or less opaque to a tool which wants to know
how javac would have been invoked, and parsing proprietary project metadata from all popular IDEs would be challenging.

(**) At least a list of required module names and baseline versions, which can be translated to the equivalent of a classpath if those dependencies can be located
somehow. Assuming they must be present in the local cache repository (or in a preregistered remote repository) in order for javac to work, then any tool running on the
same machine could find them too.

(***) The same argument applies if module metadata were to be in some other format - XML, manifest, etc. - so long as javac mandates that this be physically located in a
specific place in the source tree, rather than referred to with a CLI option.

Re: -encoding and -source in module-info.java

Reinier Zwitserloot
In reply to this post by Neal Gafter
+1 on this excellent idea.

In regards to neal's hammering on how 'source' is not particularly
conducive to inclusion in a language spec:

There's an easy answer to this problem: the same answer used to skirt around
the classpath issue. The JLS would simply say that the format of this
directive is something along the lines of:

source 1.8;

where the exact format of the 'parameter' to source is:

Sequence(ZeroOrMore(Sequence(DIGITS, '.')), DIGITS)

and exactly 0 or 1 source directives may exist in module-info.java,
probably with some further restrictions on where this directive is legally
allowed to appear. I'll leave the exact details to be hashed out later; the
point is: The JLS does not have to convey legal values for this directive,
it merely needs to define the format for it. This parameter will be
interpreted by the compiler, and the compiler is then entirely free to
figure out what this means all by itself. At best, the spec can (and
probably should) declare that 1.8 (or 8) _MUST_ be a legal value, with
nothing said about any other value.
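That grammar - Sequence(ZeroOrMore(Sequence(DIGITS, '.')), DIGITS) - amounts to the regular expression (\d+\.)*\d+. A throwaway validator, just to pin the format down (class and method names are invented):

```java
import java.util.regex.Pattern;

public class SourceDirective {
    // (\d+\.)*\d+ : digit groups separated by dots, ending in digits
    private static final Pattern LEVEL = Pattern.compile("(\\d+\\.)*\\d+");

    static boolean isValidLevel(String s) {
        return LEVEL.matcher(s).matches();
    }

    public static void main(String[] args) {
        System.out.println(isValidLevel("1.8"));  // true
        System.out.println(isValidLevel("9"));    // true
        System.out.println(isValidLevel("1.8.")); // false: trailing dot
    }
}
```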

This is 100% analogous to how classpath is handled: The language spec is
quite clear on what "import java.util.Arrays;" means, and the language spec
is quite clear on how one should resolve the statement
"Arrays.asList(someInteger, someDouble, someInteger);" _once the compiler
knows the signatures present in java.util.Arrays_, but it gives absolutely
no hint whatsoever as to how the compiler is supposed to figure out what
those signatures are, when provided with only an import statement. Without
the spec defining how to get these signatures, the correct answer to 'how
should I translate Arrays.asList(a, b, c) to bytecode' is dependent on
external factors too, i.e. the spec alone cannot give a definitive answer.

The JLS does not even mention the classpath, nor the JVMS which is needed
to parse the classfiles one would find there in order to obtain the
signatures of Arrays, which are necessary to correctly resolve method
calls - let alone specify any of these things.

Thus, we arrive at a perhaps somewhat uncomfortable fact: given just the
JLS, it is impossible to write a compiler that can compile anything except
the simplest source files (namely, ones with 0 dependencies in them - not
even a dependency on java.lang.String, as the JLS does not specify the
signatures present in the String class). It can't even give you an LST,
because it cannot resolve method calls.

Given that, I don't see any problem with being similarly unspecified for
parameters to the 'source' directive that aren't "1.8". Specifying 'source'
that way (1.8 is legal, everything else - who knows?) thus doesn't change
anything about how the JLS itself is not actually enough, you need a
'meta-spec' a level above that to end up with a practically usable compiler.

The encoding issue can be similarly solved (do not actually define in the
JLS spec what to do with this directive, just define what it should look
like and add a note if needed to explain its general intent, leaving it
clear that it's up to the compiler to do sensible things with this
meta-information), but the case here is not quite as strong as for 'source'.

It might be a good idea to write a sister specification which lists minimum
legal compiler switches and defines what a compiler is supposed to do with
various meta-information. This specification should go into classpath,
sourcepath, warning levels for -Xlint (and push -Xlint out of -X territory.
In fact, this spec should list every non-X switch and say nothing about
-X switches except that -X itself is implementation-specific), how the
lightweight encoding directive is supposed to be found, links to all the
different JLS versions, and specifications of how to handle -source and
-target (and the source directive), and a rule that any -encoding
parameter (or directive) in a given list MUST be parsed correctly in
order to warrant the title 'Java compiler'. This spec can use 'may' and
'must' as appropriate. For example, a Java 1.8-compatible compiler MUST
understand 'source 1.8;' and MAY treat 'source 1.7;' as a directive to
compile according to JLS 1.7. If it does not do so, then it MUST emit a
'not compatible with that source version' error.



 --Reinier Zwitserloot



On Tue, Jan 24, 2012 at 22:35, Neal Gafter <[hidden email]> wrote:

> Overall, I agree it would be nice to place these options together in a file
> for the compiler to consume.  I just don't think it should be a "Java
> programming language source" file.
>
> The idea of using module-info.java to specify compilation options seems
> most tempting, but as long as it is a source file in the language, it must
> be subject to those options as well.  Which means that specifying them
> inside the file itself is pretty pointless.  As to the source level, what
> can the language specification say other than that "8" is the only allowed
> value?  And what can the next version of the language specification say
> other than that "9" is the only allowed value?
>
> On Tue, Jan 24, 2012 at 4:07 AM, Jesse Glick <[hidden email]>
> wrote:
>
> > The encoding and source level of a module are fundamental attributes of
> > its sources, without which you cannot reliably even parse a syntax tree,
> so
> > I think they should be declared in module-info.java. Otherwise it is left
> > up to someone calling javac by hand, or a build script, to specify these
> > options; that is potentially error-prone, and means that tools which
> > inspect sources (including but not limited to IDEs) need to have some
> > separate mechanism for configuration of these attributes: you cannot just
> > hand them the sourcepath and let them run.
> >
> > I am assuming that all files in the sourcepath use the same encoding and
> > source level, which seems a reasonable restriction.
> >
> >
> > As to the source level, obviously given that JDK 8 will introduce
> > module-info.java, "8" (or "1.8") seems like the right default value; but
> a
> > syntax ought to be defined for specifying a newer level, e.g.
> >
> >  source 1.9; // or 9?
> >
> > Furthermore I think that JDK 9+ versions of javac should keep the same
> > default source level - you should need to explicitly mark what version of
> > the Java language your module expects. Otherwise a module might compile
> > differently according to which version of javac was used, which is
> > undesirable, and tools cannot guess what version you meant. A little more
> > verbosity here seems to be justified.
> >
> > Whether the bytecode target (-target) should be specified in
> > module-info.java is another question. I have seen projects built using
> > -target 5 for JDK 5 compatibility but also in a separate artifact using
> > -target 6 for speed on JDK 6+ (split verifier). Probably the target level
> > should default to the source level, and in the rare case that you need to
> > override this, you can do so using a javac command option - this has no
> > impact on tools which just need to parse and analyze source files.
> >
> >
> > As to the encoding, something like
> >
> >  encoding ISO-8859-2;
> >
> > would suffice. The obvious problems for encoding are
> >
> > 1. What should the default value be? javac currently uses the platform
> > default encoding, which IMHO is a horrible choice because it means that
> two
> > people running javac with the same parameters on the same files may be
> > producing different classes and/or warning messages. I would suggest
> making
> > UTF-8 be the default when compiling in module mode (leaving the old
> > behavior intact for legacy mode). For developers who want to keep sources
> > in a different character set, adding one line per module-info.java does
> not
> > seem like much of a burden.
> >
> > 2. What is module-info.java itself encoded in? If not UTF-8, then you
> need
> > to be able to reliably find the encoding declaration and then rescan the
> > file in that encoding. That is easy for most encodings (just do an
> initial
> > scan in ISO-8859-1), including everything commonly used by developers
> > AFAIK; a little trickier for UTF-16/32-type encodings but possible by
> > ignoring 0x00/0xFE/0xFF; and only fails on some mainframe charsets, old
> JIS
> > variants, and dingbats (*). Even those rare cases are probably guessable.
> > [1]
> >
> >
> > (*) Demo program:
> >
> > import java.io.UnsupportedEncodingException;
> > import java.nio.charset.Charset;
> > import java.util.Arrays;
> > public class CharsetTest {
> >    public static void main(String[] args) throws UnsupportedEncodingException {
> >        Charset raw = Charset.forName("ISO-8859-1");
> >        for (Charset c : Charset.availableCharsets().values()) {
> >            String text = "/* leading comment */\nmodule test {\n"
> >                + "  encoding " + c.name() + ";\n}\n";
> >            byte[] encoded;
> >            try {
> >                encoded = text.getBytes(c);
> >            } catch (UnsupportedOperationException x) {
> >                System.out.println("cannot encode using " + c.name());
> >                continue;
> >            }
> >            if (Arrays.equals(encoded, text.getBytes(raw))) {
> >                System.out.println("OK in " + c.name());
> >            } else if (new String(encoded, raw).contains("  encoding " +
> > c.name() + ";")) {
> >                System.out.println("substring match in " + c.name());
> >                dump(encoded);
> >            } else if (new String(encoded, raw).replace("\u0000",
> > "").contains("  encoding " + c.name() + ";")) {
> >                System.out.println("NUL-stripped match in " + c.name());
> >                dump(encoded);
> >            } else {
> >                System.out.println("garbled in " + c.name());
> >                dump(encoded);
> >            }
> >        }
> >    }
> >    private static void dump(byte[] encoded) {
> >        for (byte b : encoded) {
> >            if (b >= 32 && b <= 126 || b == '\n' || b == '\r') {
> >                System.out.write(b);
> >            } else if (b == 0) {
> >                System.out.print('@');
> >            } else {
> >                System.out.printf("\\%02X", b);
> >            }
> >        }
> >        System.out.println();
> >    }
> >    private CharsetTest() {}
> > }
> >
> >
> > [1] http://jchardet.sourceforge.net/
> >
> >
>

Re: -encoding and -source in module-info.java

Alex Buckley
The JLS is not going to define syntax for a 'source' declaration and
then leave it up to compilers to interpret. It is bad enough that the
JLS defers to "host system" for resolving dependences, and that will be
fixed with a standard module system configured from the language. We're
not going to take two steps forward, then one step back.

The idea of explicit source levels _in the language_ comes up every few
years. I assume people want to use it as a backdoor way of removing
language features. But that is off-topic for this list.

On 1/30/2012 8:53 AM, Reinier Zwitserloot wrote:
> It might be a good idea to write a sister specification which lists minimum
> legal compiler switches and defines what a compiler is supposed to do with
> various meta-information. This specification should go into classpath,
> sourcepath, warning levels for -Xlint (and push -Xlint out of -X territory.

Sounds good. Please discuss further on compiler-dev, not jigsaw-dev.

Alex