HDFS file listing

Using only the Hadoop libraries, I can list all the files in a subdirectory like this:

 // list all sites we have data for
 FileSystem fs = FileSystem.get(new Configuration());
 FileStatus[] status = fs.listStatus(new Path("hdfs:///dir/subdir/"));
 for (FileStatus s : status) {
     try {
         FileStatus[] metricFile = fs.listStatus(new Path(s.getPath().toString() + "/file.json"));
         logger.info("File: " + metricFile[0].getPath().toString());
     } catch (IOException e) {
         // there is no metric file in this directory
     }
 }

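The scan-and-check pattern above is easy to try without a cluster. Here is a hypothetical, self-contained sketch of the same logic using `java.nio` against a throwaway temp directory standing in for `hdfs:///dir/subdir/` (the class and method names are mine, not from the post):

```java
import java.io.IOException;
import java.nio.file.*;
import java.util.ArrayList;
import java.util.List;

public class MetricFiles {
    // scan the immediate subdirectories of root and return the
    // file.json paths that actually exist (mirrors the Hadoop loop,
    // with an existence check instead of a try/catch)
    static List<Path> listMetricFiles(Path root) throws IOException {
        List<Path> found = new ArrayList<>();
        try (DirectoryStream<Path> sites = Files.newDirectoryStream(root)) {
            for (Path site : sites) {
                Path metricFile = site.resolve("file.json");
                if (Files.exists(metricFile)) {
                    found.add(metricFile);
                }
            }
        }
        return found;
    }

    public static void main(String[] args) throws IOException {
        // throwaway tree: siteA has a metric file, siteB does not
        Path root = Files.createTempDirectory("subdir");
        Files.createFile(Files.createDirectories(root.resolve("siteA")).resolve("file.json"));
        Files.createDirectories(root.resolve("siteB"));
        for (Path p : listMetricFiles(root)) {
            System.out.println("File: " + p);
        }
    }
}
```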
Since I use Spark for most of my applications, I prefer this way of dealing with it:

SparkConf conf = new SparkConf().setAppName("Learning");
JavaSparkContext jsc = new JavaSparkContext(conf);
JavaPairRDD<String, String> allMetricFiles = jsc.wholeTextFiles("hdfs:///dir/subdir/*/file.json");
for (Tuple2<String, String> each : allMetricFiles.collect()) {
    logger.info("Only metric file: " + each._1);
}
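One caveat: `wholeTextFiles` reads each file's contents into the RDD, which is heavier than necessary if you only want the paths. Plain Hadoop understands the same wildcard pattern via `FileSystem.globStatus`, which avoids both the Spark context and the per-directory try/catch. A sketch I have not run against a live cluster:

```java
// globStatus expands the wildcard itself, so no manual loop over
// subdirectories and no catch for missing files is needed
FileSystem fs = FileSystem.get(new Configuration());
FileStatus[] metricFiles = fs.globStatus(new Path("hdfs:///dir/subdir/*/file.json"));
for (FileStatus s : metricFiles) {
    logger.info("Only metric file: " + s.getPath());
}
```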

Published by


Java developer who loves photography and an excellent espresso
