HDFS file listing

Using only the Hadoop libraries, I can list all the files in a subdirectory like this:

import java.io.FileNotFoundException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

// list all sites we have data for
FileSystem fs = FileSystem.get(new Configuration());
FileStatus[] status = fs.listStatus(new Path("hdfs:///dir/subdir/"));
for (FileStatus s : status) {
    try {
        // look for the metric file inside each site directory
        FileStatus[] metricFile = fs.listStatus(new Path(s.getPath(), "file.json"));
        logger.info("File: " + metricFile[0].getPath().toString());
    } catch (FileNotFoundException e) {
        // this site has no metric file, skip it
    }
}
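
If all you need are the matching paths, plain Hadoop can also expand a glob in a single call with FileSystem.globStatus, which avoids the per-directory try/catch. A minimal sketch, reusing the fs handle from above:

// expand the wildcard on the server side and return only the paths that exist;
// a glob pattern that matches nothing yields an empty array
FileStatus[] metricFiles = fs.globStatus(new Path("hdfs:///dir/subdir/*/file.json"));
for (FileStatus f : metricFiles) {
    logger.info("File: " + f.getPath().toString());
}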

Since I use Spark for most of my applications, I prefer this way of doing it:

import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaPairRDD;
import org.apache.spark.api.java.JavaSparkContext;
import scala.Tuple2;

SparkConf conf = new SparkConf().setAppName("Learning");
JavaSparkContext jsc = new JavaSparkContext(conf);
// the glob picks up every file.json one level below /dir/subdir/
JavaPairRDD<String, String> allMetricFiles = jsc.wholeTextFiles("hdfs:///dir/subdir/*/file.json");
for (Tuple2<String, String> each : allMetricFiles.collect()) {  // collect() replaces the deprecated toArray()
    logger.info("Only metric file: " + each._1);
}
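
A nice side effect of wholeTextFiles is that the value of each pair already carries the file's contents, so listing and processing can happen in one pass. A minimal sketch; the count here is just an illustration, not part of the original example:

// keys() drops the contents and keeps just the hdfs paths
long sitesWithMetrics = allMetricFiles.keys().count();
logger.info("Sites with a metric file: " + sitesWithMetrics);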