
Google Dataflow Python ValueError: Unable to get the Filesystem for path gs://myprojetc/digport/ports.csv.gz

I am using Google Cloud to load data from Cloud Storage into BigQuery with the Apache Beam Python library. The ETL ran without any issue in “DirectRunner” mode, but when I moved everything to Dataflow it failed with an error.

I uploaded the file with the command below, and I can see that it is present at that location:-

gsutil cp datapip.csv.gz  gs://myproject/data/datapip.csv.gz

Sadly, whenever I run the command below to execute my pipeline in cloud mode, I get an error:-

python dfmypy.py -p myproject -b mybucket -d mydataset
Correcting timestamps and writing to BigQuery dataset flights
Traceback (most recent call last):
  File "df06.py", line 171, in <module>
    run(project=args['project'], bucket=args['bucket'], dataset=args['dataset'])
  File "df06.py", line 134, in run
    | 'airports:tz' >> beam.Map(lambda fields: (fields[0], addtimezone(fields[21], fields[26])))
  File "/usr/local/lib/python2.7/dist-packages/apache_beam/io/textio.py", line 524, in __init__
    skip_header_lines=skip_header_lines)
  File "/usr/local/lib/python2.7/dist-packages/apache_beam/io/textio.py", line 119, in __init__
    validate=validate)
  File "/usr/local/lib/python2.7/dist-packages/apache_beam/io/filebasedsource.py", line 121, in __init__
    self._validate()
  File "/usr/local/lib/python2.7/dist-packages/apache_beam/options/value_provider.py", line 137, in _f
    return fnc(self, *args, **kwargs)
  File "/usr/local/lib/python2.7/dist-packages/apache_beam/io/filebasedsource.py", line 178, in _validate
    match_result = FileSystems.match([pattern], limits=[1])[0]
  File "/usr/local/lib/python2.7/dist-packages/apache_beam/io/filesystems.py", line 186, in match
    filesystem = FileSystems.get_filesystem(patterns[0])
  File "/usr/local/lib/python2.7/dist-packages/apache_beam/io/filesystems.py", line 92, in get_filesystem
    raise ValueError('Unable to get the Filesystem for path %s' % path)
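
Reading the traceback from the bottom up, the failure happens while Beam looks for a filesystem that can handle the gs:// prefix of the path. The sketch below is only a toy illustration of that kind of scheme lookup, not Beam's actual code; the names get_scheme, get_filesystem and both registries are made up for the example.

```python
import re

# Toy stand-in for a filesystem registry. This is NOT Beam's implementation;
# it only illustrates why an unregistered scheme ends in a ValueError.
SCHEME_PATTERN = re.compile(r"^([a-zA-Z][-a-zA-Z0-9+.]*)://")


def get_scheme(path):
    """Return the scheme of a path ('gs' for 'gs://...'), or None for local paths."""
    match = SCHEME_PATTERN.match(path)
    return match.group(1) if match else None


def get_filesystem(path, registry):
    """Look up the handler registered for the path's scheme."""
    scheme = get_scheme(path)
    if scheme not in registry:
        raise ValueError("Unable to get the Filesystem for path %s" % path)
    return registry[scheme]


# Hypothetical registry when only the base package is installed.
base_registry = {None: "local"}
# Hypothetical registry once a handler for the "gs" scheme is available.
gcp_registry = {None: "local", "gs": "gcs"}

if __name__ == "__main__":
    path = "gs://myproject/data/datapip.csv.gz"
    print(get_filesystem(path, gcp_registry))
    try:
        get_filesystem(path, base_registry)
    except ValueError as err:
        print(err)
```

So the file being present in the bucket does not matter here: the lookup fails before any request is made to Cloud Storage.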

Resolution:-

It looks like my apache-beam library installation was incomplete. Running the command below as a regular user still gave the same error:-

pip install apache-beam[gcp]

After running the command below, I could execute my code successfully:-

sudo pip install apache-beam[gcp]

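The base apache-beam package doesn’t include everything needed to read from and write to GCP. The gcp extra adds those pieces, including support for gs:// paths and the runner for deploying your pipeline to Cloud Dataflow (the DataflowRunner). In my case it had to be installed via sudo pip, presumably because the script was picking up the system-wide Beam under /usr/local/lib/python2.7/dist-packages, as the paths in the traceback show. It seems weird that the error message is not clear about any of this, so I thought a small write-up might help someone.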
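
A quick way to check whether the GCP pieces are visible to the interpreter that runs the pipeline is to try importing Beam's GCS filesystem module. This is a minimal sketch: the function name gcs_filesystem_available is my own, and it assumes that apache_beam.io.gcp.gcsfilesystem fails to import when the gcp extras are missing.

```python
import importlib
import sys


def gcs_filesystem_available():
    """Return True if Beam's GCS filesystem module can be imported."""
    try:
        importlib.import_module("apache_beam.io.gcp.gcsfilesystem")
    except ImportError:
        # Either apache-beam itself or its GCP dependencies are missing
        # for this particular interpreter.
        return False
    return True


if __name__ == "__main__":
    # Print the interpreter too, since pip and python can point at
    # different environments.
    print("Python executable: %s" % sys.executable)
    print("GCS filesystem importable: %s" % gcs_filesystem_available())
```

If this prints False under the same python that launches the pipeline, the gcp extras were most likely installed into a different environment than the one the script uses.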

Happy Machine Learning!
